Generating lagged variables in mi data

paulvonhippel

Join Date: Apr 2014

Posts: 502
#1

13 Apr 2016, 09:05

I'm having trouble generating lagged variables in mi data. Here is an example where students are tested in reading and math at 3 different times, and I try to generate a variable that lags the scores by one period. I'm trying to follow the advice posted by Stata's Yulia Marchenko in http://www.stata.com/statalist/archi.../msg00213.html . If you look at the output, though, you'll see that the lagged variables are calculated from the observed data, not from the imputed data as I intended.

Code:

/* First I impute the data in wide format to account for the correlations among tests. Then I reshape the imputed data into long format again.
This works fine. I'm just including it so that we have some data to work with. */
use "http://www.ats.ucla.edu/stat/stata/faq/mi_longi.dta", clear
reshape wide read math, i(id) j(time)
order *, sequential
mi set wide
mi register imputed math1-math3 read1-read3
mi impute mvn math1-math3 read1-read3, add(2)
mi reshape long math read, i(id) j(time)

/* Now I try to calculate the lagged variables. This is what isn't working for me. */
mi tsset id time
mi xeq: sort id time; gen math_lag = L1.math; gen read_lag = L1.read
list in 1/3

Clyde Schechter

Join Date: Apr 2014
Posts: 30100

13 Apr 2016, 09:26

Are you sure this is exactly what you did? I just copied your code into the do-file editor and ran it on my Stata and it gives results with correct lags:

Code:

. use "http://www.ats.ucla.edu/stat/stata/faq/mi_longi.dta", clear
(a heavily modified version of highschool and beyond (200 cases))

. reshape wide read math, i(id) j(time)
(note: j = 1 2 3)

Data                               long   ->   wide
-----------------------------------------------------------------------------
Number of obs.                      600   ->     200
Number of variables                   7   ->      10
j variable (3 values)              time   ->   (dropped)
xij variables:
                                   read   ->   read1 read2 read3
                                   math   ->   math1 math2 math3
-----------------------------------------------------------------------------

. order *, sequential

. mi set wide

. mi register imputed math1-math3 read1-read3

. mi impute mvn math1-math3 read1-read3, add(2)

Performing EM optimization:
  observed log likelihood = -2517.9387 at iteration 11

Performing MCMC data augmentation ... 

Multivariate imputation                     Imputations =        2
Multivariate normal regression                    added =        2
Imputed: m=1 through m=2                        updated =        0

Prior: uniform                               Iterations =      200
                                                burn-in =      100
                                                between =      100

------------------------------------------------------------------
                   |               Observations per m             
                   |----------------------------------------------
          Variable |   Complete   Incomplete   Imputed |     Total
-------------------+-----------------------------------+----------
             math1 |        195            5         5 |       200
             math2 |        180           20        20 |       200
             math3 |        179           21        21 |       200
             read1 |        194            6         6 |       200
             read2 |        168           32        32 |       200
             read3 |        176           24        24 |       200
------------------------------------------------------------------
(complete + incomplete = total; imputed is the minimum across m
 of the number of filled-in observations.)

. mi reshape long math read, i(id) j(time)

reshaping m=0 data ...
(note: j = 1 2 3)

Data                               wide   ->   long
-----------------------------------------------------------------------------
Number of obs.                      200   ->     600
Number of variables                  10   ->       7
j variable (3 values)                     ->   time
xij variables:
                      math1 math2 math3   ->   math
                      read1 read2 read3   ->   read
-----------------------------------------------------------------------------

reshaping m=1 data ...

reshaping m=2 data ...

assembling results ...

. 
. /* Now I try to calculate the lagged variables. This is what isn't working for me. */
. mi tsset id time
       panel variable:  id (strongly balanced)
        time variable:  time, 1 to 3
                delta:  1 unit

. mi xeq: sort id time; gen math_lag = L1.math; gen read_lag = L1.read

m=0 data:
-> sort id time
-> gen math_lag = L1.math
(225 missing values generated)
-> gen read_lag = L1.read
(238 missing values generated)

m=1 data:
-> sort id time
-> gen math_lag = L1.math
(200 missing values generated)
-> gen read_lag = L1.read
(200 missing values generated)

m=2 data:
-> sort id time
-> gen math_lag = L1.math
(200 missing values generated)
-> gen read_lag = L1.read
(200 missing values generated)

. list in 1/3

     +-----------------------------------------------------------------------------------------------------------------------------------+
     | id   time   female       math   private   read   ses   _mi_miss   math_lag   read_lag    _1_math    _1_read    _2_math    _2_read |
     |-----------------------------------------------------------------------------------------------------------------------------------|
  1. |  1      1   female         40         0     34   low          0          .          .         40         34         40         34 |
  2. |  1      2   female         39         0      .   low          1         40         34         39   42.24202         39   33.26007 |
  3. |  1      3   female   40.05584         0     41   low          0         39          .   40.05584         41   40.05584         41 |
     +-----------------------------------------------------------------------------------------------------------------------------------+

Make sure that this is the exact code you ran. If it is, check that your Stata is up to date, and try again after any updating.


paulvonhippel

Join Date: Apr 2014

Posts: 502
#3

13 Apr 2016, 09:41

Thanks. You're running the same code that I am, and you're getting the same results. But the results aren't correct, or at least aren't what I'm trying to get.

Look at read_lag. It's based on the observed variable read, which has missing values. What I want are lagged variables that are based on the imputed variables _1_read and _2_read.

Clyde Schechter

Join Date: Apr 2014
Posts: 30100

13 Apr 2016, 09:58

I think I misunderstood your original post, and I now see that there is a problem: you are not getting lagged variables generated for the imputed data, only for the original data. If you look at -help mi xeq- you will see that commands that change the data do not work well with -mi xeq- unless you are working in the -flongsep- style. Also, to see the results in each imputation, you need to prefix your -list- command with -mi xeq:-. I think the following does what you want:

Code:

/* First I impute the data in wide format to account for the correlations among tests. Then I reshape the imputed data into long format again.
This works fine. I'm just including it so that we have some data to work with. */
use "http://www.ats.ucla.edu/stat/stata/faq/mi_longi.dta", clear
reshape wide read math, i(id) j(time)
order *, sequential
capture erase mitest.dta
mi set flongsep mitest
mi register imputed math1-math3 read1-read3
mi impute mvn math1-math3 read1-read3, add(2)
mi reshape long math read, i(id) j(time)

mi xeq: list in 1/3

/* Now calculate the lagged variables. In the flongsep style, each imputation gets its own lags. */
mi tsset id time
mi xeq: sort id time; gen math_lag = L1.math; gen read_lag = L1.read
mi xeq: list in 1/3
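
Since how -mi xeq- treats data-changing commands depends on the mi style, it may help to confirm the style before generating anything. The lines below are only a small sketch and assume that mi data are already in memory:

```stata
* Sketch: report whether the data are mi set, in which style,
* and how many imputations they hold.
mi query

* -mi query- leaves its results in r(); show the style and the number of imputations.
display "style: `r(style)'   M = `r(M)'"
```

After the code above, this would be expected to report the flongsep style with M = 2.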


paulvonhippel

Join Date: Apr 2014

Posts: 502
#5

13 Apr 2016, 10:13

Thank you, that does work!
Yulia Marchenko (StataCorp)

StataCorp Employee

Join Date: Mar 2014

Posts: 35
#6

14 Apr 2016, 10:20

The post that Paul cited described how to create lags in mi data for complete or, in mi's terminology, regular variables. Creating lags of imputed variables is trickier.

When Paul executes

Code:

. mi xeq: sort id time; gen math_lag = L1.math; gen read_lag = L1.read

Stata generates two _regular_ variables: math_lag and read_lag. In the wide mi style, this means that these variables will only have values in the observed data.

The math_lag and read_lag variables are technically so-called passive variables, variables which are functions of the imputed variables. mi provides the mi passive command to generate and replace passive variables. This is still, however, not sufficient for generating lagged variables of imputed variables because, by mi's definition, passive variables are allowed to vary (or to contain imputed values) only in the incomplete observations of the imputed variables. Lags shift the values of imputed variables and will thus lead to imputed values being stored in the complete observations of imputed variables. As a result, proper lags of imputed variables can exist only in the flong and flongsep mi styles, in which all observations are replicated in each imputation.

The easiest solution for Paul is to switch to the flong style (or the flongsep style as suggested by Clyde) before creating the lagged variables and continue his analysis in this style:

Code:

. mi convert flong, clear

It is also important to leave the lagged variables unregistered in these styles.
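
As a rough illustration of the flong route, here is one way the lags might be built once the data are converted. This is a sketch, not necessarily the approach intended above: it uses plain by-group subscripting on the stacked data instead of the L1. operator, and it assumes that the imputed long-format data from this thread are in memory and that every id has times 1, 2, 3 with no gaps (as the -mi tsset- output reports).

```stata
* Sketch under the assumptions stated above.
mi convert flong, clear

* flong stacks m=0, m=1, m=2 in one dataset, indexed by _mi_m,
* so the previous period can be taken within each imputation and each student.
* With no gaps in time, the previous row within id is the one-period lag.
bysort _mi_m id (time): generate math_lag = math[_n-1]
bysort _mi_m id (time): generate read_lag = read[_n-1]

* math_lag and read_lag are deliberately left unregistered.
* Inspect the first student in imputation 1:
list _mi_m id time math math_lag read read_lag if _mi_m == 1 & id == 1
```

In each imputation the lag at time 1 is missing, and the lags at times 2 and 3 should repeat that imputation's values from the previous period, including imputed ones.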

Depending on the subsequent use of the lagged variables, another solution that can work in all mi styles is to view lagged variables as imputed variables themselves rather than as being derived from the imputed variables. In this case, we first need to create and register the variables, and then replace their values with lags.

Code:

. mi xeq: generate math_lag = .; generate read_lag = .
. mi register imputed math_lag read_lag
. mi xeq: sort id time; replace math_lag = L1.math; replace read_lag = L1.read
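
One way to check the result is to list a single student within one imputation and compare each lag with the previous row. This is a sketch only, and it assumes the commands above have just been run on the data from this thread:

```stata
* Sketch: assumes math_lag and read_lag were just filled in as above.
* Show student 1 in the original data (m=0) and in imputation 1.
mi xeq 0 1: sort id time; list id time math math_lag read read_lag if id == 1

* Summarize the mi settings, including which variables are registered as imputed.
mi describe
```

In m=1, read_lag at time 3 should match the imputed value of read at time 2, whereas in m=0 it stays missing wherever read was not observed in the previous period.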

If we view lagged variables as imputed variables, it is important to distinguish between the existing missing values and the missing values produced by the lag operator. (Depending on the task, this may also be important even if we do not treat lagged variables as imputed.) For example, the lagged variables created with the L1. operator will have missing values in the first observation. To distinguish them from the existing missing values, we should replace them with an extended missing value instead of a dot (.):

Code:

. mi xeq: sort id time; replace math_lag = .a in 1; replace read_lag = .a in 1

Other lag operators will require similar changes. For example, for the L2. operator, we would replace the first two observations with .a.
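
In panel data such as Paul's, L1. produces a missing value at the first time point of every id, not only in the first row of the dataset (the log earlier in the thread shows 200 missing values generated in each imputation, one per student). A possible panel-aware variant is sketched below. It assumes that math_lag and read_lag already exist and are registered as imputed, and that every id is observed at times 1, 2, 3, as the -mi tsset- output indicates.

```stata
* Sketch under the assumptions stated above.
* Flag the values that are missing only because of the lag operator,
* that is, the first period of every student.
mi xeq: replace math_lag = .a if time == 1; replace read_lag = .a if time == 1

* For a two-period lag built with L2., the first two periods would be flagged instead.
* (math_lag2 is a hypothetical variable name, not one created in this thread.)
* mi xeq: replace math_lag2 = .a if time <= 2
```

Conditioning on time rather than on the row number means the flag does not depend on the sort order or on how many students are in the data.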

The above can be applied more generally to other functions of imputed variables that are not in one-to-one correspondence with the original imputed variables with respect to the complete and incomplete observations.