My question relates to the correct syntax for using mi with lagged independent variables and panel data. The independent variable I want to lag has been imputed. Stata will estimate a model without producing an error, but it doesn't appear to be treating the lags properly. Yulia Marchenko addressed this issue here and here, but I'm still not getting it.
Below is a minimal reproducible example illustrating the problem. I regressed logged wages on age and lagged age, where age has been imputed. This is panel data in long format with person-year observations. I've used mi set flong.
I thought everything worked until I received a not sorted error message after using mi predict. (I should have noticed that the number of observations is off too.) I couldn't fix the error by sorting the data, so I tried to troubleshoot the problem by generating lagged variables using mi passive.
The mi passive command runs without error but will not generate lags for imputed values. (Yulia and others mention this issue in the posts referenced above.) It seems like there are issues with using mi passive on imputed variables, but I thought it might work with the flong format based on the earlier posts.
Finally, I created the lagged variable without using the mi prefix. In this case, I was able to use mi estimate and mi predict without error, but I'm still not sure if the estimates are based on the correct lags.
The second block of code repeats the initial code that gave me the not sorted error, followed by my alternative attempts using mi passive and generating the lag without the mi prefix. Finally, I have included all of the output from the second block of code.
My question is whether there is an easy fix for my original syntax using mi xtset and the lag operator. If not, will creating the lagged variables without the mi prefix give me the correct results if the data are in flong format?
Run on Stata 15.1
Code:
clear all
webuse nlswork, clear
set seed 1234
keep idcode year ln_w age
gen missing_age = runiform()
replace age = . if missing_age > 0.90
drop missing_age
mi set flong
mi register imputed age
mi impute mvn age = ln_w, add(2)
mi xtset idcode year
tempfile mi_estimates3
mi estimate, saving(`mi_estimates3', replace): reg ln_w age L.age
mi predict yhat3 using `mi_estimates3', xb

Multiple-imputation estimates                   Imputations       =          2
Linear regression                               Number of obs     =     10,891
                                                Average RVI       =     0.0556
                                                Largest FMI       =     0.1758
                                                Complete DF       =      10888
DF adjustment:   Small sample                   DF:     min       =      48.78
                                                        avg       =   1,575.60
                                                        max       =   4,580.68
Model F test:       Equal FMI                   F(   2,  249.5)   =     440.67
Within VCE type:          OLS                   Prob > F          =     0.0000

------------------------------------------------------------------------------
     ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |
         --. |     .01286   .0011315    11.37   0.000     .0106144    .0151055
         L1. |   .0071693   .0011705     6.12   0.000     .0048167    .0095219
             |
       _cons |   1.146098   .0187722    61.05   0.000     1.109295      1.1829
------------------------------------------------------------------------------

. mi predict yhat3 using `mi_estimates3', xb
not sorted
r(5);

end of do-file

r(5);
Code:
capture log close
log using example, replace

clear all
webuse nlswork, clear
set seed 1234

* Retain minimal variables
keep idcode year ln_w age

* Replace 10% of age variables with missing data
gen missing_age = runiform()
replace age = . if missing_age > 0.90
drop missing_age

* mi set data using flong
mi set flong
mi register imputed age

* Impute age
mi impute mvn age = ln_w, add(2)

* mi xtset data using id and year
mi xtset idcode year

* Generate lagged age without using mi prefix
bysort _mi_m idcode (year): gen Lage1 = age[_n-1]

* Generate lagged age using mi passive
mi passive: by idcode (year): gen Lage2 = age[_n-1]

* mi passive does not generate lagged values of imputed observations
sum Lage1 Lage2 if _mi_m == 1
sort _mi_m idcode year
br _mi_m idcode year age Lage1 Lage2

** Lagged age generated without using mi
* This may be producing the correct estimation
tempfile mi_estimates1
mi estimate, saving(`mi_estimates1', replace): reg ln_w age Lage1
mi predict yhat1 using `mi_estimates1', xb

** Lagged age generated with mi passive prefix
* This definitely is not producing the correct estimation because Lage2
* doesn't include lags for imputed values
tempfile mi_estimates2
mi estimate, saving(`mi_estimates2', replace): reg ln_w age Lage2
mi predict yhat2 using `mi_estimates2', xb

** Lagged age using lagged operator
* This also is definitely not right, but I have no idea what Stata is doing
tempfile mi_estimates3
mi estimate, saving(`mi_estimates3', replace): reg ln_w age L.age
mi predict yhat3 using `mi_estimates3', xb
Code:
. clear all

. webuse nlswork, clear
(National Longitudinal Survey.  Young Women 14-26 years of age in 1968)

. set seed 1234

.
. * Retain minimal variables
. keep idcode year ln_w age

.
. * Replace 10% of age variables with missing data
. gen missing_age = runiform()

. replace age = . if missing_age > 0.90
(2,840 real changes made, 2,840 to missing)

. drop missing_age

.
. * mi set data using flong
. mi set flong

. mi register imputed age
(2864 m=0 obs. now marked as incomplete)

.
. * Impute age
. mi impute mvn age = ln_w, add(2)

Performing EM optimization:
note: 2864 observations omitted from EM estimation because of all imputation variables missing
  observed log likelihood = -60600.223 at iteration 1

Performing MCMC data augmentation ...

Multivariate imputation                     Imputations =        2
Multivariate normal regression                    added =        2
Imputed: m=1 through m=2                        updated =        0

Prior: uniform                               Iterations =      200
                                                burn-in =      100
                                                between =      100

------------------------------------------------------------------
                   |               Observations per m
                   |----------------------------------------------
          Variable |   Complete   Incomplete   Imputed |     Total
-------------------+-----------------------------------+----------
               age |      25670         2864      2864 |     28534
------------------------------------------------------------------
(complete + incomplete = total; imputed is the minimum across m
 of the number of filled-in observations.)

.
. * mi xtset data using id and year
. mi xtset idcode year
       panel variable:  idcode (unbalanced)
        time variable:  year, 68 to 88, but with gaps
                delta:  1 unit

.
. * Generate lagged age without using mi prefix
. bysort _mi_m idcode (year): gen Lage1 = age[_n-1]
(16,511 missing values generated)

.
. * Generate lagged age using mi passive
. mi passive: by idcode (year): gen Lage2 = age[_n-1]
m=0:
(7,089 missing values generated)
m=1:
(4,711 missing values generated)
m=2:
(4,711 missing values generated)
(4238 values of passive variable Lage2 in m>0 updated to match values in m=0)

.
. * mi passive does not generate lagged values of imputed observations
. sum Lage1 Lage2 if _mi_m == 1

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       Lage1 |     23,823     28.1496    6.209421     3.0167   52.91406
       Lage2 |     21,704    28.05275     6.15921   8.967278   48.56565

. sort _mi_m idcode year

. br _mi_m idcode year age Lage1 Lage2

.
. ** Lagged age generated without using mi
. * This may be producing the correct estimation
. tempfile mi_estimates1

. mi estimate, saving(`mi_estimates1', replace): reg ln_w age Lage1

Multiple-imputation estimates                   Imputations       =          2
Linear regression                               Number of obs     =     23,823
                                                Average RVI       =     0.0797
                                                Largest FMI       =     0.2320
                                                Complete DF       =      23820
DF adjustment:   Small sample                   DF:     min       =      29.99
                                                        avg       =   7,944.08
                                                        max       =  23,632.96
Model F test:       Equal FMI                   F(   2,  133.7)   =     747.88
Within VCE type:          OLS                   Prob > F          =     0.0000

------------------------------------------------------------------------------
     ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0066467   .0007628     8.71   0.000     .0051409    .0081525
       Lage1 |   .0136565   .0007545    18.10   0.000     .0121776    .0151353
       _cons |   1.124866   .0163344    68.86   0.000     1.091507    1.158226
------------------------------------------------------------------------------

. mi predict yhat1 using `mi_estimates1', xb
(4711 missing values generated)

.
. ** Lagged age generated with mi passive prefix
. * This definitely is not producing the correct estimation because Lage2
. * doesn't include lags for imputed values
. tempfile mi_estimates2

. mi estimate, saving(`mi_estimates2', replace): reg ln_w age Lage2

Multiple-imputation estimates                   Imputations       =          2
Linear regression                               Number of obs     =     21,704
                                                Average RVI       =     0.1746
                                                Largest FMI       =     0.4573
                                                Complete DF       =      21701
DF adjustment:   Small sample                   DF:     min       =       8.49
                                                        avg       =      89.18
                                                        max       =     244.76
Model F test:       Equal FMI                   F(   2,   34.9)   =     597.86
Within VCE type:          OLS                   Prob > F          =     0.0000

------------------------------------------------------------------------------
     ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |    .003083   .0011379     2.71   0.025     .0004854    .0056807
       Lage2 |   .0167302   .0011171    14.98   0.000     .0143388    .0191217
       _cons |   1.145859   .0157225    72.88   0.000      1.11489    1.176827
------------------------------------------------------------------------------

. mi predict yhat2 using `mi_estimates2', xb
(6830 missing values generated)

.
. ** Lagged age using lagged operator
. * This also is definitely not right, but I have no idea what Stata is doing
. tempfile mi_estimates3

. mi estimate, saving(`mi_estimates3', replace): reg ln_w age L.age

Multiple-imputation estimates                   Imputations       =          2
Linear regression                               Number of obs     =     10,891
                                                Average RVI       =     0.0556
                                                Largest FMI       =     0.1758
                                                Complete DF       =      10888
DF adjustment:   Small sample                   DF:     min       =      48.78
                                                        avg       =   1,575.60
                                                        max       =   4,580.68
Model F test:       Equal FMI                   F(   2,  249.5)   =     440.67
Within VCE type:          OLS                   Prob > F          =     0.0000

------------------------------------------------------------------------------
     ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |
         --. |     .01286   .0011315    11.37   0.000     .0106144    .0151055
         L1. |   .0071693   .0011705     6.12   0.000     .0048167    .0095219
             |
       _cons |   1.146098   .0187722    61.05   0.000     1.109295      1.1829
------------------------------------------------------------------------------

. mi predict yhat3 using `mi_estimates3', xb
not sorted
r(5);

end of do-file

r(5);
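One caveat when comparing the three models: age[_n-1] and L.age are not the same lag when the panel has gaps, and the mi xtset output reports that year runs "68 to 88, but with gaps". The subscript picks up the previous observation for that person regardless of how many years earlier it was, while the L. operator returns the value from exactly one time unit (delta) earlier and is missing otherwise. This may account for part of the difference in the number of observations between the Lage1 model and the L.age model, though I have not confirmed that it explains all of it. The sketch below illustrates the distinction on the original, un-imputed data; the variable names lag_operator, lag_prev_obs, and lag_prev_year are made up for this example.

```stata
* Sketch only: compare "previous observation" and "previous year" lags
* on the original data, before any mi commands are run.
clear all
webuse nlswork, clear
keep idcode year ln_w age
xtset idcode year

* Lag from the time-series operator: age at exactly year - 1
gen lag_operator = L.age

* Previous observation within person, ignoring gaps in year
bysort idcode (year): gen lag_prev_obs = age[_n-1]

* Previous observation only if it is exactly one year earlier
bysort idcode (year): gen lag_prev_year = age[_n-1] if year - year[_n-1] == 1

* The gap-respecting manual lag should reproduce the operator
assert lag_prev_year == lag_operator

* Rows where the two manual definitions disagree (gap years)
count if lag_prev_obs != lag_prev_year
```

If the gap-respecting definition is the one intended, the same condition could presumably be added to the flong version by grouping on _mi_m as well, i.e. bysort _mi_m idcode (year): gen ... = age[_n-1] if year - year[_n-1] == 1, so that each lag is taken within a single imputation.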