My question relates to the correct syntax for using mi with lagged independent variables and panel data. The independent variable I want to lag has been imputed. Stata will estimate a model without producing an error, but it doesn't appear to be treating the lags properly. Yulia Marchenko addressed this issue here and here, but I'm still not getting it.
Below is a minimal reproducible example illustrating the problem. I regressed logged wages on age and lagged age, where age has been imputed. This is panel data in long format with person-year observations. I've used mi set flong.
I thought everything worked until I received a not sorted error message after using mi predict. (I should have noticed that the number of observations is off too.) I couldn't fix the error by sorting the data, so I tried to troubleshoot the problem by generating lagged variables using mi passive.
The mi passive command runs without error but will not generate lags for imputed values. (Yulia and others mention this issue in the posts referenced above.) It seems like there are issues with using mi passive on imputed variables, but I thought it might work with the flong format based on the earlier posts.
Finally, I created the lagged variable without using the mi prefix. In this case, I was able to use mi estimate and mi predict without error, but I'm still not sure if the estimates are based on the correct lags.
The second block of code repeats the initial code that gave me the not sorted error, followed by my alternative attempts using mi passive and generating the lag without the mi prefix. Finally, I have included all of the output from the second block of code.
My question is whether there is an easy fix for my original syntax using mi xtset and the lag operator. If not, will creating the lagged variables without the mi prefix give me the correct results if the data are in flong format?
Run on Stata 15.1
Code:
clear all
webuse nlswork, clear
set seed 1234
keep idcode year ln_w age
gen missing_age = runiform()
replace age = . if missing_age > 0.90
drop missing_age
mi set flong
mi register imputed age
mi impute mvn age = ln_w, add(2)
mi xtset idcode year
tempfile mi_estimates3
mi estimate, saving(`mi_estimates3', replace): reg ln_w age L.age
mi predict yhat3 using `mi_estimates3', xb
Multiple-imputation estimates Imputations = 2
Linear regression Number of obs = 10,891
Average RVI = 0.0556
Largest FMI = 0.1758
Complete DF = 10888
DF adjustment: Small sample DF: min = 48.78
avg = 1,575.60
max = 4,580.68
Model F test: Equal FMI F( 2, 249.5) = 440.67
Within VCE type: OLS Prob > F = 0.0000
------------------------------------------------------------------------------
ln_wage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age |
--. | .01286 .0011315 11.37 0.000 .0106144 .0151055
L1. | .0071693 .0011705 6.12 0.000 .0048167 .0095219
|
_cons | 1.146098 .0187722 61.05 0.000 1.109295 1.1829
------------------------------------------------------------------------------
. mi predict yhat3 using `mi_estimates3', xb
not sorted
r(5);
end of do-file
r(5);
Code:
capture log close
log using example, replace
clear all
webuse nlswork, clear
set seed 1234
* Retain minimal variables
keep idcode year ln_w age
* Replace 10% of age variables with missing data
gen missing_age = runiform()
replace age = . if missing_age > 0.90
drop missing_age
* mi set data using flong
mi set flong
mi register imputed age
* Impute age
mi impute mvn age = ln_w, add(2)
* mi xtset data using id and year
mi xtset idcode year
* Generate lagged age without using mi prefix
bysort _mi_m idcode (year): gen Lage1 = age[_n-1]
* Generate lagged age using mi passive
mi passive: by idcode (year): gen Lage2 = age[_n-1]
* mi passive does not generate lagged values of imputed observations
sum Lage1 Lage2 if _mi_m == 1
sort _mi_m idcode year
br _mi_m idcode year age Lage1 Lage2
** Lagged age generated without using mi
* This may be producing the correct estimation
tempfile mi_estimates1
mi estimate, saving(`mi_estimates1', replace): reg ln_w age Lage1
mi predict yhat1 using `mi_estimates1', xb
** Lagged age generated with mi passive prefix
* This definitely is not producing the correct estimation because Lage2
* doesn't include lags for imputed values
tempfile mi_estimates2
mi estimate, saving(`mi_estimates2', replace): reg ln_w age Lage2
mi predict yhat2 using `mi_estimates2', xb
** Lagged age using lagged operator
* This also is definitely not right, but I have no idea what Stata is doing
tempfile mi_estimates3
mi estimate, saving(`mi_estimates3', replace): reg ln_w age L.age
mi predict yhat3 using `mi_estimates3', xb
Code:
. clear all
. webuse nlswork, clear
(National Longitudinal Survey. Young Women 14-26 years of age in 1968)
. set seed 1234
.
. * Retain minimal variables
. keep idcode year ln_w age
.
. * Replace 10% of age variables with missing data
. gen missing_age = runiform()
. replace age = . if missing_age > 0.90
(2,840 real changes made, 2,840 to missing)
. drop missing_age
.
. * mi set data using flong
. mi set flong
. mi register imputed age
(2864 m=0 obs. now marked as incomplete)
.
. * Impute age
. mi impute mvn age = ln_w, add(2)
Performing EM optimization:
note: 2864 observations omitted from EM estimation because of all imputation variables missing
observed log likelihood = -60600.223 at iteration 1
Performing MCMC data augmentation ...
Multivariate imputation Imputations = 2
Multivariate normal regression added = 2
Imputed: m=1 through m=2 updated = 0
Prior: uniform Iterations = 200
burn-in = 100
between = 100
------------------------------------------------------------------
| Observations per m
|----------------------------------------------
Variable | Complete Incomplete Imputed | Total
-------------------+-----------------------------------+----------
age | 25670 2864 2864 | 28534
------------------------------------------------------------------
(complete + incomplete = total; imputed is the minimum across m
of the number of filled-in observations.)
.
. * mi xtset data using id and year
. mi xtset idcode year
panel variable: idcode (unbalanced)
time variable: year, 68 to 88, but with gaps
delta: 1 unit
.
. * Generate lagged age without using mi prefix
. bysort _mi_m idcode (year): gen Lage1 = age[_n-1]
(16,511 missing values generated)
.
. * Generate lagged age using mi passive
. mi passive: by idcode (year): gen Lage2 = age[_n-1]
m=0:
(7,089 missing values generated)
m=1:
(4,711 missing values generated)
m=2:
(4,711 missing values generated)
(4238 values of passive variable Lage2 in m>0 updated to match values in m=0)
.
. * mi passive does not generate lagged values of imputed observations
. sum Lage1 Lage2 if _mi_m == 1
Variable | Obs Mean Std. Dev. Min Max
-------------+---------------------------------------------------------
Lage1 | 23,823 28.1496 6.209421 3.0167 52.91406
Lage2 | 21,704 28.05275 6.15921 8.967278 48.56565
. sort _mi_m idcode year
. br _mi_m idcode year age Lage1 Lage2
.
. ** Lagged age generated without using mi
. * This may be producing the correct estimation
. tempfile mi_estimates1
. mi estimate, saving(`mi_estimates1', replace): reg ln_w age Lage1
Multiple-imputation estimates Imputations = 2
Linear regression Number of obs = 23,823
Average RVI = 0.0797
Largest FMI = 0.2320
Complete DF = 23820
DF adjustment: Small sample DF: min = 29.99
avg = 7,944.08
max = 23,632.96
Model F test: Equal FMI F( 2, 133.7) = 747.88
Within VCE type: OLS Prob > F = 0.0000
------------------------------------------------------------------------------
ln_wage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .0066467 .0007628 8.71 0.000 .0051409 .0081525
Lage1 | .0136565 .0007545 18.10 0.000 .0121776 .0151353
_cons | 1.124866 .0163344 68.86 0.000 1.091507 1.158226
------------------------------------------------------------------------------
. mi predict yhat1 using `mi_estimates1', xb
(4711 missing values generated)
.
. ** Lagged age generated with mi passive prefix
. * This definitely is not producing the correct estimation because Lage2
. * doesn't include lags for imputed values
. tempfile mi_estimates2
. mi estimate, saving(`mi_estimates2', replace): reg ln_w age Lage2
Multiple-imputation estimates Imputations = 2
Linear regression Number of obs = 21,704
Average RVI = 0.1746
Largest FMI = 0.4573
Complete DF = 21701
DF adjustment: Small sample DF: min = 8.49
avg = 89.18
max = 244.76
Model F test: Equal FMI F( 2, 34.9) = 597.86
Within VCE type: OLS Prob > F = 0.0000
------------------------------------------------------------------------------
ln_wage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .003083 .0011379 2.71 0.025 .0004854 .0056807
Lage2 | .0167302 .0011171 14.98 0.000 .0143388 .0191217
_cons | 1.145859 .0157225 72.88 0.000 1.11489 1.176827
------------------------------------------------------------------------------
. mi predict yhat2 using `mi_estimates2', xb
(6830 missing values generated)
.
. ** Lagged age using lagged operator
. * This also is definitely not right, but I have no idea what Stata is doing
. tempfile mi_estimates3
. mi estimate, saving(`mi_estimates3', replace): reg ln_w age L.age
Multiple-imputation estimates Imputations = 2
Linear regression Number of obs = 10,891
Average RVI = 0.0556
Largest FMI = 0.1758
Complete DF = 10888
DF adjustment: Small sample DF: min = 48.78
avg = 1,575.60
max = 4,580.68
Model F test: Equal FMI F( 2, 249.5) = 440.67
Within VCE type: OLS Prob > F = 0.0000
------------------------------------------------------------------------------
ln_wage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age |
--. | .01286 .0011315 11.37 0.000 .0106144 .0151055
L1. | .0071693 .0011705 6.12 0.000 .0048167 .0095219
|
_cons | 1.146098 .0187722 61.05 0.000 1.109295 1.1829
------------------------------------------------------------------------------
. mi predict yhat3 using `mi_estimates3', xb
not sorted
r(5);
end of do-file
r(5);
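One detail that may matter when comparing the hand-built lag with the lag operator: the mi xtset output reports that year has gaps, so age[_n-1] picks up the previous observation for a person, which is not necessarily the previous year, whereas L.age refers to the value at year - 1. The lines below are only a sketch of a gap-aware version of the manual lag, assuming the data have been mi set flong and imputed as in the code above. The variable name Lage_gap is hypothetical and is not part of the original code, and this does not by itself establish what mi estimate does with L.age.

```stata
* Sketch only: assumes the data are mi set flong and imputed as in the code above.
* Lage_gap is a hypothetical variable name that does not appear in the original code.
* Lag age within each imputation and person, but only when the previous
* observation is exactly one year earlier; otherwise leave the lag missing.
bysort _mi_m idcode (year): gen Lage_gap = age[_n-1] if year - year[_n-1] == 1

* Count, within the first imputation, the observations that have a
* nonmissing outcome, age, and gap-aware lag.
count if _mi_m == 1 & !missing(ln_wage, age, Lage_gap)
```

Comparing this count with the number of observations reported for the Lage1 model shows how many lags in Lage1 come from a year other than the immediately preceding one.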
