  • Estimating Models with Lagged Independent Variables using Multiply Imputed Panel Data

    My question relates to the correct syntax for using mi with lagged independent variables and panel data. The independent variable I want to lag has been imputed. Stata will estimate a model without producing an error, but it doesn't appear to be treating the lags properly. Yulia Marchenko addressed this issue here and here, but I'm still not getting it.

    Below is a minimal reproducible example illustrating the problem. I regressed logged wages on age and lagged age, where age has been imputed. The data are a panel in long format with person-year observations. I've used mi set flong.

    I thought everything worked until I received a "not sorted" error message after using mi predict. (I should also have noticed that the number of observations is off.) I couldn't fix the error by sorting the data, so I tried to troubleshoot the problem by generating lagged variables using mi passive.

    The mi passive command runs without error but will not generate lags for imputed values. (Yulia and others mention this issue in the posts referenced above.) It seems like there are issues with using mi passive on imputed variables, but I thought it might work with the flong format based on the earlier posts.

    Finally, I created the lagged variable without using the mi prefix. In this case, I was able to use mi estimate and mi predict without error, but I'm still not sure if the estimates are based on the correct lags.
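
    One caveat with this approach: nlswork has gaps in year, so age[_n-1] picks up the previous observation for that person rather than the previous year, whereas L.age is missing whenever the prior year is absent. That difference likely accounts for much of the gap in the number of observations between the two models. A minimal sketch of a gap-aware manual lag, assuming flong style (so _mi_m identifies the imputation) and using a hypothetical variable name Lage_gap:

    Code:
    * Lag age within each imputation and person, but only across consecutive years
    bysort _mi_m idcode (year): gen Lage_gap = age[_n-1] if year - year[_n-1] == 1
    
    * Number of usable lags in each imputation (m=0 is the original data)
    tabstat Lage_gap, by(_mi_m) statistics(n)

    In the first observation for each person year[_n-1] is missing, so the condition fails and Lage_gap is left missing there, which mirrors what the lag operator does.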

    The second block of code repeats the initial code that gave me the "not sorted" error and adds my alternative attempts: using mi passive, and generating the lag without the mi prefix. Finally, I have included all of the output from the second block of code.

    My question is whether there is an easy fix for my original syntax using mi xtset and the lag operator. If not, will creating the lagged variables without the mi prefix give me the correct results if the data are in flong format?


    Run on Stata 15.1

    Code:
    clear all
    webuse nlswork, clear
    set seed 1234
    
    keep idcode year ln_w age 
    
    gen missing_age = runiform()
    replace age   = . if missing_age > 0.90
    drop missing_age 
    
    mi set flong
    mi register imputed age 
    
    mi impute mvn age = ln_w, add(2)
    
    mi xtset idcode year
    
    tempfile mi_estimates3
    mi estimate, saving(`mi_estimates3', replace): reg ln_w age L.age 
    mi predict yhat3 using `mi_estimates3', xb
    
    
    
    
    Multiple-imputation estimates                   Imputations       =          2
    Linear regression                               Number of obs     =     10,891
                                                    Average RVI       =     0.0556
                                                    Largest FMI       =     0.1758
                                                    Complete DF       =      10888
    DF adjustment:   Small sample                   DF:     min       =      48.78
                                                            avg       =   1,575.60
                                                            max       =   4,580.68
    Model F test:       Equal FMI                   F(   2,  249.5)   =     440.67
    Within VCE type:          OLS                   Prob > F          =     0.0000
    
    ------------------------------------------------------------------------------
         ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             age |
             --. |     .01286   .0011315    11.37   0.000     .0106144    .0151055
             L1. |   .0071693   .0011705     6.12   0.000     .0048167    .0095219
                 |
           _cons |   1.146098   .0187722    61.05   0.000     1.109295      1.1829
    ------------------------------------------------------------------------------
    
    . mi predict yhat3 using `mi_estimates3', xb
    not sorted
    r(5);
    
    end of do-file
    
    r(5);

    Code:
    capture log close
    
    log using example, replace
    
    clear all
    webuse nlswork, clear
    set seed 1234
    
    * Retain minimal variables
    keep idcode year ln_w age 
    
    * Replace 10% of age variables with missing data
    gen missing_age = runiform()
    replace age   = . if missing_age > 0.90
    drop missing_age 
    
    * mi set data using flong
    mi set flong
    mi register imputed age 
    
    * Impute age
    mi impute mvn age = ln_w, add(2)
    
    * mi xtset data using id and year
    mi xtset idcode year
    
    * Generate lagged age without using mi prefix
    bysort _mi_m idcode (year): gen Lage1 = age[_n-1]
    
    * Generate lagged age using mi passive
    mi passive: by idcode (year): gen Lage2 = age[_n-1]
    
    * mi passive does not generate lagged values of imputed observations
    sum Lage1 Lage2 if _mi_m == 1
    sort _mi_m idcode year
    br _mi_m idcode year age Lage1 Lage2
    
    ** Lagged age generated without using mi
    * This may be producing the correct estimation
    tempfile mi_estimates1
    mi estimate, saving(`mi_estimates1', replace): reg ln_w age Lage1
    mi predict yhat1 using `mi_estimates1', xb
    
    ** Lagged age generated with mi passive prefix
    * This definitely is not producing the correct estimation because Lage2 
    * doesn't include lags for imputed values
    tempfile mi_estimates2
    mi estimate, saving(`mi_estimates2', replace): reg ln_w age Lage2
    mi predict yhat2 using `mi_estimates2', xb
    
    ** Lagged age using lagged operator 
    * This also is definitely not right, but I have no idea what Stata is doing
    tempfile mi_estimates3
    mi estimate, saving(`mi_estimates3', replace): reg ln_w age L.age 
    mi predict yhat3 using `mi_estimates3', xb

    Code:
    . clear all
    
    . webuse nlswork, clear
    (National Longitudinal Survey.  Young Women 14-26 years of age in 1968)
    
    . set seed 1234
    
    . 
    . * Retain minimal variables
    . keep idcode year ln_w age 
    
    . 
    . * Replace 10% of age variables with missing data
    . gen missing_age = runiform()
    
    . replace age   = . if missing_age > 0.90
    (2,840 real changes made, 2,840 to missing)
    
    . drop missing_age 
    
    . 
    . * mi set data using flong
    . mi set flong
    
    . mi register imputed age 
    (2864 m=0 obs. now marked as incomplete)
    
    . 
    . * Impute age
    . mi impute mvn age = ln_w, add(2)
    
    Performing EM optimization:
    note: 2864 observations omitted from EM estimation because of all imputation variables missing
      observed log likelihood = -60600.223 at iteration 1
    
    Performing MCMC data augmentation ... 
    
    Multivariate imputation                     Imputations =        2
    Multivariate normal regression                    added =        2
    Imputed: m=1 through m=2                        updated =        0
    
    Prior: uniform                               Iterations =      200
                                                    burn-in =      100
                                                    between =      100
    
    ------------------------------------------------------------------
                       |               Observations per m             
                       |----------------------------------------------
              Variable |   Complete   Incomplete   Imputed |     Total
    -------------------+-----------------------------------+----------
                   age |      25670         2864      2864 |     28534
    ------------------------------------------------------------------
    (complete + incomplete = total; imputed is the minimum across m
     of the number of filled-in observations.)
    
    . 
    . * mi xtset data using id and year
    . mi xtset idcode year
           panel variable:  idcode (unbalanced)
            time variable:  year, 68 to 88, but with gaps
                    delta:  1 unit
    
    . 
    . * Generate lagged age without using mi prefix
    . bysort _mi_m idcode (year): gen Lage1 = age[_n-1]
    (16,511 missing values generated)
    
    . 
    . * Generate lagged age using mi passive
    . mi passive: by idcode (year): gen Lage2 = age[_n-1]
    m=0:
    (7,089 missing values generated)
    m=1:
    (4,711 missing values generated)
    m=2:
    (4,711 missing values generated)
    (4238 values of passive variable Lage2 in m>0 updated to match values in m=0)
    
    . 
    . * mi passive does not generate lagged values of imputed observations
    . sum Lage1 Lage2 if _mi_m == 1
    
        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
           Lage1 |     23,823     28.1496    6.209421     3.0167   52.91406
           Lage2 |     21,704    28.05275     6.15921   8.967278   48.56565
    
    . sort _mi_m idcode year
    
    . br _mi_m idcode year age Lage1 Lage2
    
    . 
    . ** Lagged age generated without using mi
    . * This may be producing the correct estimation
    . tempfile mi_estimates1
    
    . mi estimate, saving(`mi_estimates1', replace): reg ln_w age Lage1
    
    Multiple-imputation estimates                   Imputations       =          2
    Linear regression                               Number of obs     =     23,823
                                                    Average RVI       =     0.0797
                                                    Largest FMI       =     0.2320
                                                    Complete DF       =      23820
    DF adjustment:   Small sample                   DF:     min       =      29.99
                                                            avg       =   7,944.08
                                                            max       =  23,632.96
    Model F test:       Equal FMI                   F(   2,  133.7)   =     747.88
    Within VCE type:          OLS                   Prob > F          =     0.0000
    
    ------------------------------------------------------------------------------
         ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             age |   .0066467   .0007628     8.71   0.000     .0051409    .0081525
           Lage1 |   .0136565   .0007545    18.10   0.000     .0121776    .0151353
           _cons |   1.124866   .0163344    68.86   0.000     1.091507    1.158226
    ------------------------------------------------------------------------------
    
    . mi predict yhat1 using `mi_estimates1', xb
    (4711 missing values generated)
    
    . 
    . ** Lagged age generated with mi passive prefix
    . * This definitely is not producing the correct estimation because Lage2 
    . * doesn't include lags for imputed values
    . tempfile mi_estimates2
    
    . mi estimate, saving(`mi_estimates2', replace): reg ln_w age Lage2
    
    Multiple-imputation estimates                   Imputations       =          2
    Linear regression                               Number of obs     =     21,704
                                                    Average RVI       =     0.1746
                                                    Largest FMI       =     0.4573
                                                    Complete DF       =      21701
    DF adjustment:   Small sample                   DF:     min       =       8.49
                                                            avg       =      89.18
                                                            max       =     244.76
    Model F test:       Equal FMI                   F(   2,   34.9)   =     597.86
    Within VCE type:          OLS                   Prob > F          =     0.0000
    
    ------------------------------------------------------------------------------
         ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             age |    .003083   .0011379     2.71   0.025     .0004854    .0056807
           Lage2 |   .0167302   .0011171    14.98   0.000     .0143388    .0191217
           _cons |   1.145859   .0157225    72.88   0.000      1.11489    1.176827
    ------------------------------------------------------------------------------
    
    . mi predict yhat2 using `mi_estimates2', xb
    (6830 missing values generated)
    
    . 
    . ** Lagged age using lagged operator 
    . * This also is definitely not right, but I have no idea what Stata is doing
    . tempfile mi_estimates3
    
    . mi estimate, saving(`mi_estimates3', replace): reg ln_w age L.age 
    
    Multiple-imputation estimates                   Imputations       =          2
    Linear regression                               Number of obs     =     10,891
                                                    Average RVI       =     0.0556
                                                    Largest FMI       =     0.1758
                                                    Complete DF       =      10888
    DF adjustment:   Small sample                   DF:     min       =      48.78
                                                            avg       =   1,575.60
                                                            max       =   4,580.68
    Model F test:       Equal FMI                   F(   2,  249.5)   =     440.67
    Within VCE type:          OLS                   Prob > F          =     0.0000
    
    ------------------------------------------------------------------------------
         ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             age |
             --. |     .01286   .0011315    11.37   0.000     .0106144    .0151055
             L1. |   .0071693   .0011705     6.12   0.000     .0048167    .0095219
                 |
           _cons |   1.146098   .0187722    61.05   0.000     1.109295      1.1829
    ------------------------------------------------------------------------------
    
    . mi predict yhat3 using `mi_estimates3', xb
    not sorted
    r(5);
    
    end of do-file
    
    r(5);
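
    To see which lags mi passive leaves missing, the two manual lags can be compared directly within an imputation. A minimal sketch, assuming the code above has already run so that Lage1 and Lage2 exist (the flag name lag_lost is hypothetical):

    Code:
    * Flag rows in m>0 where the manual lag exists but the mi passive lag does not
    gen byte lag_lost = missing(Lage2) & !missing(Lage1) if _mi_m > 0
    tabulate _mi_m lag_lost
    
    * Inspect the affected person-years for a few people in the first imputation
    sort _mi_m idcode year
    list _mi_m idcode year age Lage1 Lage2 if _mi_m == 1 & lag_lost == 1 & idcode <= 5, sepby(idcode)

    If the flagged rows turn out to be those whose previous observation had age imputed, that would be consistent with mi passive resetting passive values to their m=0 values in observations that are complete in m=0, as the "updated to match values in m=0" note suggests. This is an interpretation, not something the output above confirms.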

  • #2
    I realized that my example was not a good one: the panel wasn't balanced, and a balanced panel is necessary for the comparisons I was trying to make. I now think that the following estimation is working correctly and uses the correct number of observations, but I'm still not sure why mi predict produces a "not sorted" error. mi predict does seem to work if I take the lagged variable out of the equation.

    In the original example, it is also not clear to me why mi passive won't generate lags for imputed values in the flong format. Sorry about the error on my part. Hopefully this abbreviated example is clearer. Any advice would be greatly appreciated.

    Code:
    clear all
    webuse nlswork, clear
    set seed 1234
    
    * Retain small sample
    keep if idcode < 100
    
    * Fill in missing years so we have balanced panel for this example
    xtset idcode year
    
    tsfill, full
    
    * Retain minimal variables
    keep idcode year ttl_exp 
    
    * Generate outcome
    gen lnwage = rnormal()
    
    * mi set data using flong
    mi set flong
    mi register imputed ttl_exp 
    
    * Impute ttl_exp
    mi impute mvn ttl_exp = lnwage, add(2)
    
    * mi xtset data using id and year
    mi xtset idcode year
    
    ** Lagged ttl_exp using lagged operator 
    tempfile mi_estimates
    mi estimate, saving(`mi_estimates', replace): reg lnwage ttl_exp L.ttl_exp 
    mi predict yhat using `mi_estimates', xb

    Code:
    . clear all
    
    . webuse nlswork, clear
    (National Longitudinal Survey.  Young Women 14-26 years of age in 1968)
    
    . set seed 1234
    
    . 
    . * Retain small sample
    . keep if idcode < 100
    (27959 observations deleted)
    
    . 
    . * Fill in missing years so we have balanced panel for this example
    . xtset idcode year
           panel variable:  idcode (unbalanced)
            time variable:  year, 68 to 88, but with gaps
                    delta:  1 unit
    
    . 
    . tsfill, full
    
    . 
    . * Retain minimal variables
    . keep idcode year ttl_exp 
    
    . 
    . * Generate outcome
    . gen lnwage = rnormal()
    
    . 
    . * mi set data using flong
    . mi set flong
    
    . mi register imputed ttl_exp 
    (1294 m=0 obs. now marked as incomplete)
    
    . 
    . * Impute ttl_exp
    . mi impute mvn ttl_exp = lnwage, add(2)
    
    Performing EM optimization:
    note: 1294 observations omitted from EM estimation because of all imputation variables missing
      observed log likelihood = -1132.2806 at iteration 1
    
    Performing MCMC data augmentation ... 
    
    Multivariate imputation                     Imputations =        2
    Multivariate normal regression                    added =        2
    Imputed: m=1 through m=2                        updated =        0
    
    Prior: uniform                               Iterations =      200
                                                    burn-in =      100
                                                    between =      100
    
    ------------------------------------------------------------------
                       |               Observations per m             
                       |----------------------------------------------
              Variable |   Complete   Incomplete   Imputed |     Total
    -------------------+-----------------------------------+----------
               ttl_exp |        575         1294      1294 |      1869
    ------------------------------------------------------------------
    (complete + incomplete = total; imputed is the minimum across m
     of the number of filled-in observations.)
    
    . 
    . * mi xtset data using id and year
    . mi xtset idcode year
           panel variable:  idcode (strongly balanced)
            time variable:  year, 68 to 88
                    delta:  1 unit
    
    . 
    . ** Lagged ttl_exp using lagged operator 
    . tempfile mi_estimates
    
    . mi estimate, saving(`mi_estimates', replace): reg lnwage ttl_exp L.ttl_exp 
    
    Multiple-imputation estimates                     Imputations     =          2
    Linear regression                                 Number of obs   =       1780
                                                      Average RVI     =     0.2534
                                                      Largest FMI     =     0.5418
                                                      Complete DF     =       1777
    DF adjustment:   Small sample                     DF:     min     =       5.92
                                                              avg     =     312.79
                                                              max     =     919.58
    Model F test:       Equal FMI                     F(   2,   19.7) =       1.19
    Within VCE type:          OLS                     Prob > F        =     0.3257
    
    ------------------------------------------------------------------------------
          lnwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         ttl_exp |
             --. |  -.0056603   .0070315    -0.80   0.452    -.0229244    .0116038
             L1. |  -.0074503   .0055942    -1.33   0.183    -.0184292    .0035286
                 |
           _cons |   .0659318   .0564231     1.17   0.264    -.0560827    .1879462
    ------------------------------------------------------------------------------
    
    . mi predict yhat using `mi_estimates', xb
    not sorted
    r(5);
    
    end of do-file
    
    r(5);
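
    Since mi estimate and mi predict ran without error in the first example when the lag was an ordinary variable (Lage1), the same workaround may apply here. A minimal sketch, not run on this example, using the hypothetical names L_ttl_exp, mi_estimates_alt, and yhat_alt:

    Code:
    * Build the one-year lag by hand within each imputation and person
    bysort _mi_m idcode (year): gen L_ttl_exp = ttl_exp[_n-1] if year - year[_n-1] == 1
    
    * Estimate and predict with the plain variable instead of L.ttl_exp
    tempfile mi_estimates_alt
    mi estimate, saving(`mi_estimates_alt', replace): reg lnwage ttl_exp L_ttl_exp
    mi predict yhat_alt using `mi_estimates_alt', xb

    On this balanced panel the year condition only excludes the first year for each person, so the estimation sample should match the 1,780 observations reported for the L.ttl_exp model.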
