Estimating Models with Lagged Independent Variables using Multiply Imputed Panel Data

Michael Evangelist

Join Date: Mar 2017
Posts: 10

23 Aug 2018, 13:11

My question relates to the correct syntax for using mi with lagged independent variables and panel data. The independent variable I want to lag has been imputed. Stata will estimate a model without producing an error, but it doesn't appear to be treating the lags properly. Yulia Marchenko addressed this issue here and here, but I'm still not getting it.

Below is a minimal reproducible example illustrating the problem. I regressed logged wages on age and lagged age, where age has been imputed. These are panel data in long format with person-year observations. I've used mi set flong.

I thought everything worked until I received a "not sorted" error, r(5), after using mi predict. (I should have noticed that the number of observations is off too.) I couldn't fix the error by sorting the data, so I tried to troubleshoot the problem by generating lagged variables using mi passive.

The mi passive command runs without error but will not generate lags for imputed values. (Yulia and others mention this issue in the posts referenced above.) It seems like there are issues with using mi passive on imputed variables, but I thought it might work with the flong format based on the earlier posts.
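One way to see what mi passive is doing is to compare the two lag variables row by row. The log in this post reports 23,823 non-missing values of Lage1 and 21,704 of Lage2 in m=1, a difference of 2,119 per imputation, and 2 x 2,119 = 4,238 is exactly the number of values that mi passive says it "updated to match values in m=0". The sketch below checks whether the lost lags sit in rows that are complete in m=0 (age observed in the current year), which would fit the idea that mi treats passive variables as varying only in incomplete observations. This is a diagnostic sketch under that assumption, not a fix; complete0 and lost_lag are hypothetical names introduced here.

```stata
* Diagnostic sketch; complete0 and lost_lag are hypothetical names
clear all
webuse nlswork, clear
set seed 1234
keep idcode year ln_w age
replace age = . if runiform() > 0.90

mi set flong
mi register imputed age
mi impute mvn age = ln_w, add(2)
mi xtset idcode year

* The two lag definitions used in the post, in the same order
bysort _mi_m idcode (year): gen Lage1 = age[_n-1]
mi passive: by idcode (year): gen Lage2 = age[_n-1]

* Flag rows whose m=0 copy is complete: within _mi_id the first row
* after sorting on _mi_m is the m=0 row, where _mi_miss is 0 or 1
bysort _mi_id (_mi_m): gen byte complete0 = (_mi_miss[1] == 0)

* Rows in m>0 where the plain lag exists but the passive lag is missing
gen byte lost_lag = _mi_m > 0 & !missing(Lage1) & missing(Lage2)

* If the assumption holds, lost lags should fall under complete0 == 1
tabulate complete0 lost_lag if _mi_m > 0
```

If the lost lags do concentrate in complete rows, the current-year age was observed but the previous-year age was imputed, so the lag is missing in m=0 and that missing value is copied into m>0.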

Finally, I created the lagged variable without using the mi prefix. In this case, I was able to use mi estimate and mi predict without error, but I'm still not sure if the estimates are based on the correct lags.
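One thing worth checking with the hand-made lag: age[_n-1] picks up the person's previous observation, which is the previous year only when the panel has no gaps, whereas L.age after mi xtset refers to the value exactly one year earlier. The sketch below builds a gap-aware lag for comparison. It is a sketch under assumptions, not a confirmed fix: Lage_prev and Lage_gap are hypothetical names, and whether the gaps account for the different observation counts in this post is a guess to be checked against the counts.

```stata
* Sketch only; Lage_prev and Lage_gap are hypothetical names
clear all
webuse nlswork, clear
set seed 1234
keep idcode year ln_w age
replace age = . if runiform() > 0.90

mi set flong
mi register imputed age
mi impute mvn age = ln_w, add(2)

* Previous observation within imputation and person, whatever the gap
bysort _mi_m idcode (year): gen Lage_prev = age[_n-1]

* Gap-aware lag: keep the value only when the previous row is exactly
* one year earlier; year[_n-1] is missing in the first row, so the
* condition is false there and the lag stays missing
bysort _mi_m idcode (year): gen Lage_gap = age[_n-1] if year[_n-1] == year - 1

* Compare how many lags each definition yields in one imputation
count if !missing(Lage_prev) & _mi_m == 1
count if !missing(Lage_gap)  & _mi_m == 1
```

If the second count is much smaller than the first, the two regressions are being estimated on different samples, and the previous-observation lag is not the same regressor as L.age.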

The second block of code repeats the initial code that gave me the "not sorted" error, followed by my alternative attempts using mi passive and generating the lag without the mi prefix. Finally, I have included all of the output from the second block of code.

My question is whether there is an easy fix for my original syntax using mi xtset and the lag operator. If not, will creating the lagged variables without the mi prefix give me the correct results if the data are in flong format?

Run on Stata 15.1

Code:

clear all
webuse nlswork, clear
set seed 1234

keep idcode year ln_w age 

gen missing_age = runiform()
replace age   = . if missing_age > 0.90
drop missing_age 

mi set flong
mi register imputed age 

mi impute mvn age = ln_w, add(2)

mi xtset idcode year

tempfile mi_estimates3
mi estimate, saving(`mi_estimates3', replace): reg ln_w age L.age 
mi predict yhat3 using `mi_estimates3', xb




Multiple-imputation estimates                   Imputations       =          2
Linear regression                               Number of obs     =     10,891
                                                Average RVI       =     0.0556
                                                Largest FMI       =     0.1758
                                                Complete DF       =      10888
DF adjustment:   Small sample                   DF:     min       =      48.78
                                                        avg       =   1,575.60
                                                        max       =   4,580.68
Model F test:       Equal FMI                   F(   2,  249.5)   =     440.67
Within VCE type:          OLS                   Prob > F          =     0.0000

------------------------------------------------------------------------------
     ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |
         --. |     .01286   .0011315    11.37   0.000     .0106144    .0151055
         L1. |   .0071693   .0011705     6.12   0.000     .0048167    .0095219
             |
       _cons |   1.146098   .0187722    61.05   0.000     1.109295      1.1829
------------------------------------------------------------------------------

. mi predict yhat3 using `mi_estimates3', xb
not sorted
r(5);

end of do-file

r(5);

Code:

capture log close

log using example, replace

clear all
webuse nlswork, clear
set seed 1234

* Retain minimal variables
keep idcode year ln_w age 

* Replace 10% of age variables with missing data
gen missing_age = runiform()
replace age   = . if missing_age > 0.90
drop missing_age 

* mi set data using flong
mi set flong
mi register imputed age 

* Impute age
mi impute mvn age = ln_w, add(2)

* mi xtset data using id and year
mi xtset idcode year

* Generate lagged age without using mi prefix
bysort _mi_m idcode (year): gen Lage1 = age[_n-1]

* Generate lagged age using mi passive
mi passive: by idcode (year): gen Lage2 = age[_n-1]

* mi passive does not generate lagged values of imputed observations
sum Lage1 Lage2 if _mi_m == 1
sort _mi_m idcode year
br _mi_m idcode year age Lage1 Lage2

** Lagged age generated without using mi
* This may be producing the correct estimation
tempfile mi_estimates1
mi estimate, saving(`mi_estimates1', replace): reg ln_w age Lage1
mi predict yhat1 using `mi_estimates1', xb

** Lagged age generated with mi passive prefix
* This definitely is not producing the correct estimation because Lage2 
* doesn't include lags for imputed values
tempfile mi_estimates2
mi estimate, saving(`mi_estimates2', replace): reg ln_w age Lage2
mi predict yhat2 using `mi_estimates2', xb

** Lagged age using lagged operator 
* This also is definitely not right, but I have no idea what Stata is doing
tempfile mi_estimates3
mi estimate, saving(`mi_estimates3', replace): reg ln_w age L.age 
mi predict yhat3 using `mi_estimates3', xb

Code:

. clear all

. webuse nlswork, clear
(National Longitudinal Survey.  Young Women 14-26 years of age in 1968)

. set seed 1234

. 
. * Retain minimal variables
. keep idcode year ln_w age 

. 
. * Replace 10% of age variables with missing data
. gen missing_age = runiform()

. replace age   = . if missing_age > 0.90
(2,840 real changes made, 2,840 to missing)

. drop missing_age 

. 
. * mi set data using flong
. mi set flong

. mi register imputed age 
(2864 m=0 obs. now marked as incomplete)

. 
. * Impute age
. mi impute mvn age = ln_w, add(2)

Performing EM optimization:
note: 2864 observations omitted from EM estimation because of all imputation variables missing
  observed log likelihood = -60600.223 at iteration 1

Performing MCMC data augmentation ... 

Multivariate imputation                     Imputations =        2
Multivariate normal regression                    added =        2
Imputed: m=1 through m=2                        updated =        0

Prior: uniform                               Iterations =      200
                                                burn-in =      100
                                                between =      100

------------------------------------------------------------------
                   |               Observations per m             
                   |----------------------------------------------
          Variable |   Complete   Incomplete   Imputed |     Total
-------------------+-----------------------------------+----------
               age |      25670         2864      2864 |     28534
------------------------------------------------------------------
(complete + incomplete = total; imputed is the minimum across m
 of the number of filled-in observations.)

. 
. * mi xtset data using id and year
. mi xtset idcode year
       panel variable:  idcode (unbalanced)
        time variable:  year, 68 to 88, but with gaps
                delta:  1 unit

. 
. * Generate lagged age without using mi prefix
. bysort _mi_m idcode (year): gen Lage1 = age[_n-1]
(16,511 missing values generated)

. 
. * Generate lagged age using mi passive
. mi passive: by idcode (year): gen Lage2 = age[_n-1]
m=0:
(7,089 missing values generated)
m=1:
(4,711 missing values generated)
m=2:
(4,711 missing values generated)
(4238 values of passive variable Lage2 in m>0 updated to match values in m=0)

. 
. * mi passive does not generate lagged values of imputed observations
. sum Lage1 Lage2 if _mi_m == 1

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       Lage1 |     23,823     28.1496    6.209421     3.0167   52.91406
       Lage2 |     21,704    28.05275     6.15921   8.967278   48.56565

. sort _mi_m idcode year

. br _mi_m idcode year age Lage1 Lage2

. 
. ** Lagged age generated without using mi
. * This may be producing the correct estimation
. tempfile mi_estimates1

. mi estimate, saving(`mi_estimates1', replace): reg ln_w age Lage1

Multiple-imputation estimates                   Imputations       =          2
Linear regression                               Number of obs     =     23,823
                                                Average RVI       =     0.0797
                                                Largest FMI       =     0.2320
                                                Complete DF       =      23820
DF adjustment:   Small sample                   DF:     min       =      29.99
                                                        avg       =   7,944.08
                                                        max       =  23,632.96
Model F test:       Equal FMI                   F(   2,  133.7)   =     747.88
Within VCE type:          OLS                   Prob > F          =     0.0000

------------------------------------------------------------------------------
     ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0066467   .0007628     8.71   0.000     .0051409    .0081525
       Lage1 |   .0136565   .0007545    18.10   0.000     .0121776    .0151353
       _cons |   1.124866   .0163344    68.86   0.000     1.091507    1.158226
------------------------------------------------------------------------------

. mi predict yhat1 using `mi_estimates1', xb
(4711 missing values generated)

. 
. ** Lagged age generated with mi passive prefix
. * This definitely is not producing the correct estimation because Lage2 
. * doesn't include lags for imputed values
. tempfile mi_estimates2

. mi estimate, saving(`mi_estimates2', replace): reg ln_w age Lage2

Multiple-imputation estimates                   Imputations       =          2
Linear regression                               Number of obs     =     21,704
                                                Average RVI       =     0.1746
                                                Largest FMI       =     0.4573
                                                Complete DF       =      21701
DF adjustment:   Small sample                   DF:     min       =       8.49
                                                        avg       =      89.18
                                                        max       =     244.76
Model F test:       Equal FMI                   F(   2,   34.9)   =     597.86
Within VCE type:          OLS                   Prob > F          =     0.0000

------------------------------------------------------------------------------
     ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |    .003083   .0011379     2.71   0.025     .0004854    .0056807
       Lage2 |   .0167302   .0011171    14.98   0.000     .0143388    .0191217
       _cons |   1.145859   .0157225    72.88   0.000      1.11489    1.176827
------------------------------------------------------------------------------

. mi predict yhat2 using `mi_estimates2', xb
(6830 missing values generated)

. 
. ** Lagged age using lagged operator 
. * This also is definitely not right, but I have no idea what Stata is doing
. tempfile mi_estimates3

. mi estimate, saving(`mi_estimates3', replace): reg ln_w age L.age 

Multiple-imputation estimates                   Imputations       =          2
Linear regression                               Number of obs     =     10,891
                                                Average RVI       =     0.0556
                                                Largest FMI       =     0.1758
                                                Complete DF       =      10888
DF adjustment:   Small sample                   DF:     min       =      48.78
                                                        avg       =   1,575.60
                                                        max       =   4,580.68
Model F test:       Equal FMI                   F(   2,  249.5)   =     440.67
Within VCE type:          OLS                   Prob > F          =     0.0000

------------------------------------------------------------------------------
     ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |
         --. |     .01286   .0011315    11.37   0.000     .0106144    .0151055
         L1. |   .0071693   .0011705     6.12   0.000     .0048167    .0095219
             |
       _cons |   1.146098   .0187722    61.05   0.000     1.109295      1.1829
------------------------------------------------------------------------------

. mi predict yhat3 using `mi_estimates3', xb
not sorted
r(5);

end of do-file

r(5);

Tags: lags, multiple imputation, panel data

Michael Evangelist


23 Aug 2018, 19:30

I realized that my example was not a good one because the panel wasn't balanced, and a balanced panel was necessary for the comparisons I was trying to make. I now think that the following estimation works correctly and uses the correct number of observations, but I'm still not sure why mi predict produces a sorting error. mi predict does seem to work if I take the lagged variable out of the equation.

In the original example, I'm also not clear why mi passive won't generate lags for imputed values in the flong format. Sorry about the error on my part. Hopefully this abbreviated example is clearer. Any advice would be greatly appreciated.

Code:

clear all
webuse nlswork, clear
set seed 1234

* Retain small sample
keep if idcode < 100

* Fill in missing years so we have balanced panel for this example
xtset idcode year

tsfill, full

* Retain minimal variables
keep idcode year ttl_exp 

* Generate outcome
gen lnwage = rnormal()

* mi set data using flong
mi set flong
mi register imputed ttl_exp 

* Impute ttl_exp
mi impute mvn ttl_exp = lnwage, add(2)

* mi xtset data using id and year
mi xtset idcode year

** Lagged ttl_exp using lagged operator 
tempfile mi_estimates
mi estimate, saving(`mi_estimates', replace): reg lnwage ttl_exp L.ttl_exp 
mi predict yhat using `mi_estimates', xb

Code:

. clear all

. webuse nlswork, clear
(National Longitudinal Survey.  Young Women 14-26 years of age in 1968)

. set seed 1234

. 
. * Retain small sample
. keep if idcode < 100
(27959 observations deleted)

. 
. * Fill in missing years so we have balanced panel for this example
. xtset idcode year
       panel variable:  idcode (unbalanced)
        time variable:  year, 68 to 88, but with gaps
                delta:  1 unit

. 
. tsfill, full

. 
. * Retain minimal variables
. keep idcode year ttl_exp 

. 
. * Generate outcome
. gen lnwage = rnormal()

. 
. * mi set data using flong
. mi set flong

. mi register imputed ttl_exp 
(1294 m=0 obs. now marked as incomplete)

. 
. * Impute ttl_exp
. mi impute mvn ttl_exp = lnwage, add(2)

Performing EM optimization:
note: 1294 observations omitted from EM estimation because of all imputation variables missing
  observed log likelihood = -1132.2806 at iteration 1

Performing MCMC data augmentation ... 

Multivariate imputation                     Imputations =        2
Multivariate normal regression                    added =        2
Imputed: m=1 through m=2                        updated =        0

Prior: uniform                               Iterations =      200
                                                burn-in =      100
                                                between =      100

------------------------------------------------------------------
                   |               Observations per m             
                   |----------------------------------------------
          Variable |   Complete   Incomplete   Imputed |     Total
-------------------+-----------------------------------+----------
           ttl_exp |        575         1294      1294 |      1869
------------------------------------------------------------------
(complete + incomplete = total; imputed is the minimum across m
 of the number of filled-in observations.)

. 
. * mi xtset data using id and year
. mi xtset idcode year
       panel variable:  idcode (strongly balanced)
        time variable:  year, 68 to 88
                delta:  1 unit

. 
. ** Lagged ttl_exp using lagged operator 
. tempfile mi_estimates

. mi estimate, saving(`mi_estimates', replace): reg lnwage ttl_exp L.ttl_exp 

Multiple-imputation estimates                     Imputations     =          2
Linear regression                                 Number of obs   =       1780
                                                  Average RVI     =     0.2534
                                                  Largest FMI     =     0.5418
                                                  Complete DF     =       1777
DF adjustment:   Small sample                     DF:     min     =       5.92
                                                          avg     =     312.79
                                                          max     =     919.58
Model F test:       Equal FMI                     F(   2,   19.7) =       1.19
Within VCE type:          OLS                     Prob > F        =     0.3257

------------------------------------------------------------------------------
      lnwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     ttl_exp |
         --. |  -.0056603   .0070315    -0.80   0.452    -.0229244    .0116038
         L1. |  -.0074503   .0055942    -1.33   0.183    -.0184292    .0035286
             |
       _cons |   .0659318   .0564231     1.17   0.264    -.0560827    .1879462
------------------------------------------------------------------------------

. mi predict yhat using `mi_estimates', xb
not sorted
r(5);

end of do-file

r(5);
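A possible workaround, going by the first post where mi predict ran without error once the lag was an ordinary variable rather than L.age: build the lag explicitly within each imputation and person, then estimate and predict with that variable. Because the panel is balanced after tsfill, full, the previous row is always the previous year here. This is a sketch under that assumption; L_ttl_exp, yhat_explicit and mi_est_explicit are hypothetical names, and it does not explain why mi predict fails with the lag operator.

```stata
* Sketch only; L_ttl_exp, yhat_explicit, mi_est_explicit are hypothetical
clear all
webuse nlswork, clear
set seed 1234
keep if idcode < 100

* Balance the panel so the previous row is always the previous year
xtset idcode year
tsfill, full
keep idcode year ttl_exp
gen lnwage = rnormal()

mi set flong
mi register imputed ttl_exp
mi impute mvn ttl_exp = lnwage, add(2)

* Explicit lag within imputation and person instead of L.ttl_exp
bysort _mi_m idcode (year): gen L_ttl_exp = ttl_exp[_n-1]

* Estimate and predict using the ordinary lag variable
tempfile mi_est_explicit
mi estimate, saving(`mi_est_explicit', replace): reg lnwage ttl_exp L_ttl_exp
mi predict yhat_explicit using `mi_est_explicit', xb
```

The number of observations reported by mi estimate should be compared with the 1780 from the L.ttl_exp model above to confirm that both use the same sample.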
