Multiple imputation and panel data

Pavel Korzik

Join Date: Jan 2015

Posts: 4
#1

31 Jan 2015, 07:27

Hello!

I've been trying to run a multiple imputation procedure on a panel data set.

Following common advice, I first reshaped my data from long to wide format and then launched the imputation procedure.
The process crashed with the following error message:

Code:

imputing m=1 through m=27
mi impute: VCE is not positive definite
    The posterior distribution from which mi impute drew the imputations for Tax_1996 is not proper when the VCE estimated from the observed data is not positive definite. This may happen, for example, when the number of parameters exceeds the number of observations. Choose an alternate imputation model.

I have 16 variables for 120 countries observed from 2005 to 2012. Clearly, the problem with the MI procedure arose because reshaping my data from long to wide produced 128 new variables ((2012-2005+1)*16), while I have only 120 countries to observe.
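A quick way to check this after reshaping is to compare the number of variables with the number of observations. This is only a minimal sketch and assumes the wide-form data set is already in memory:

```stata
* Compare the number of units with the number of wide-form variables
quietly describe
display "observations (countries): " r(N)
display "variables in wide form:   " r(k)

* The arithmetic from the post: 16 variables observed yearly over 2005-2012
display 16 * (2012 - 2005 + 1)
```

Note that r(k) also counts the country identifier, so it will be slightly larger than the number of reshaped variables.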

I did some search on the web and found the following information about MI for the panel data:
http://www.stata.com/statalist/archi.../msg00198.html :

"Neither -ice- nor -mi impute- has an imputation method specifically designed for panel data. (The -mi xtset- command does declare panel data but does not change which imputation methods are available.) We do, however, have a FAQ that has a few suggestions for applying -mi impute- to panel data."

http://www.stata.com/support/faqs/st...and-mi-impute/ - nothing helpful for my case

http://www.ats.ucla.edu/stat/stata/f...ngitudinal.htm :

"Once we are familiar with our data, the first step in the imputation process is to reshape the data from long to wide. Having the data in wide form takes care of both the nesting issue (there is now only one row of data per student) and allows us to easily use variables from the other time periods as predictors of missing values, since in wide form, they are just other variables in the dataset (rather than being part of another row in the dataset). We do this using the reshape command, and then check the output from reshape to make sure everything went the way it should, and it has. Note that the variable time is dropped, and that there are now three read variables and three math variables."

http://www.ssc.wisc.edu/sscc/pubs/stata_mi_models.htm :

"Panel/Longitudinal Data

If you have data where units are observed over time, the best predictors of a missing value in one period are likely the values of that variable in the previous and subsequent periods. However, the imputation model can only take advantage of this information if the data set is in wide form (one observation per unit, not one observation per unit per time period). You can convert back to long form after imputing if needed. To convert the data to wide form before imputing, use reshape. To convert back to long form after imputing, use mi reshape. This has the same syntax as reshape, but makes sure the imputations are handled properly. If you're not familiar with reshape, see the Hierarchical Data section of Stata for Researchers."
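The workflow these sources describe can be sketched roughly as follows. This is a toy illustration, not my actual model: the simulated data, the variable names tax and gdp, the numeric identifier country, and the choices add(5) and rseed(12345) are all assumptions made for the example.

```stata
* Toy long-format panel: 30 countries, 3 years (hypothetical data)
clear
set seed 12345
set obs 30
generate country = _n
expand 3
bysort country: generate year = 2004 + _n
generate gdp = rnormal()
generate tax = 0.5*gdp + rnormal()
replace tax = . if runiform() < 0.2   // introduce some missing values

* Step 1: reshape from long to wide (one row per country)
reshape wide tax gdp, i(country) j(year)

* Step 2: declare the mi structure and register the variables
mi set mlong
mi register imputed tax2005 tax2006 tax2007
mi register regular gdp2005 gdp2006 gdp2007

* Step 3: impute with chained equations, using gdp as complete predictors
mi impute chained (regress) tax2005 tax2006 tax2007 ///
    = gdp2005 gdp2006 gdp2007, add(5) rseed(12345)

* Step 4: return to long form with mi reshape so imputations are kept
mi reshape long tax gdp, i(country) j(year)
```

With 30 units and only six wide-form variables the regressions have enough observations; the error above arises when the number of wide-form variables approaches or exceeds the number of units.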

The only solution I could think of was to impute the missing data over shorter time intervals. Empirically, I found that a 3-year period (and hence only (2012-2005+1)*3 = 24 new variables for 120 countries) was OK for the MI procedure.

As a result, I got imputed data for three periods: 2005-2007, 2008-2010, 2011-2012.

My question is:
Can I merge the MI procedure results from the sub-periods (2005-2007, 2008-2010, 2011-2012) into the single period (2005-2012) and go on with my analysis, or must I perform the imputation and panel data analysis on the same intervals (and, hence, perform panel data analysis three times)?

I wasn't able to find a definite answer to the question above. All the authors agree that the imputation model and the analytical model should be parsimonious, but clear guidelines on the extent of this parsimony are missing.

Thank you!
Roman Mostazir

Join Date: Apr 2014

Posts: 874
#2

31 Jan 2015, 14:27

You can merge imputed data sets with the -mi merge- command http://www.google.co.uk/url?sa=t&rct...85076809,d.d2s and then carry out -mi estimate-. But I would be cautious about splitting the time span if time is an important predictor of the variables to be imputed. By doing this you may be violating the underlying MAR assumption (assuming that is the assumption you are making), under which the missingness mechanism depends on observed variables.
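A rough sketch of what that could look like, under several assumptions: the file names are hypothetical, each file holds one imputed sub-period in wide form and is already -mi set-, every file uses the same number of imputations, country is a numeric identifier, and tax and gdp are placeholder variable names.

```stata
* Hypothetical files, one per imputed sub-period (wide form, already mi set)
use imputed_2005_2007, clear
mi merge 1:1 country using imputed_2008_2010
mi merge 1:1 country using imputed_2011_2012

* Back to long form, declare the panel, and estimate across imputations
mi reshape long tax gdp, i(country) j(year)
mi xtset country year
mi estimate: xtreg tax gdp, fe
```

This only shows the mechanics. Imputation m from one sub-period is paired with imputation m from another without any model linking them, which is exactly where the caution above applies.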

Roman
Pavel Korzik

Join Date: Jan 2015

Posts: 4
#3

01 Feb 2015, 09:30

Originally posted by Roman Mostazir

... But what I would be cautious about is splitting the time span if time is an important indicator for your variables to be imputed. By doing this you may be violating the underlying MAR (assuming the assumption) assumption where the missing mechanism is associated to observed variables.

Thank you for your answer!

It's obvious that time is important in my model - otherwise I wouldn't use panel data.
Pavel Korzik

Join Date: Jan 2015

Posts: 4
#4

05 Feb 2015, 04:58

Originally posted by Roman Mostazir

... But what I would be cautious about is splitting the time span

Are there any other ideas about my problem, beyond being cautious about dividing the observation period into smaller intervals?

I am cautious about splitting the time span; that is exactly why I asked the question in the first place.

Thank you in advance!
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17718
#5

05 Feb 2015, 05:13

Pavel:
I would focus on Roman's not-so-obvious remark concerning the mechanism underlying the missingness of your data (are they MCAR, MAR, or NMAR?).
As far as I can tell from your previous message, neither you nor the sources you quoted gave relevant details in this respect.

Kind regards,
Carlo
(Stata 19.0)

Pavel Korzik

Join Date: Jan 2015

Posts: 4
#6

05 Feb 2015, 12:37

Originally posted by Carlo Lazzaro

Pavel:
i would focus on the not-that-obvious Roman's remark concerning the mechanism underlying the missingness of your data (are they MCAR, MAR, NMAR?)
As far as I can get your previous message, neither you, nor the sources you quoted gave relevant details on this respect.

I assume my missing data to be MAR. If it were MCAR - I'd use complete case analysis. If it were MNAR - MICE would be of little use anyway.
There are several methods to deal with MAR missingness - maximum likelihood imputation, MVN imputation and MICE.
I chose the MICE over the alternatives.

Could you please give me a hint as to how the type of missingness can help me solve my problem?

The situation is as follows:

Let's assume that I have n countries, 4 time periods and 3 variables in my analytical model:
D - valid data point, M - missing data point

	Var1_T1	Var1_T2	Var1_T3	Var1_T4	Var2_T1	Var2_T2	Var2_T3	Var2_T4	Var3_T1	Var3_T2	Var3_T3	Var3_T4
Country 1
...
Country n

Now let's assume that the number of countries n is insufficient for the multiple regressions inside the MICE algorithm to converge.

My solution? Split the time frame in half to reduce the number of variables in regressions:

	Var1_T1	Var1_T2	Var2_T1	Var2_T2	Var3_T1	Var3_T2
Country 1	D	D	D	M	D	M
...	...	...	...	...	...	...
Country n	M	D	M	D	D	D

============================================

	Var1_T3	Var1_T4	Var2_T3	Var2_T4	Var3_T3	Var3_T4
Country 1	D	M	D	D	D	M
...	...	...	...	...	...	...
Country n	D	D	D	M	D	M

Now I have two imputation models where regressions inside the MICE algorithm converge.
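For concreteness, one of the two sub-models could be set up along these lines. This is a sketch under assumptions: wide_panel and imputed_T1_T2 are hypothetical file names, country is the unit identifier, and add(5) and rseed(2015) are arbitrary choices.

```stata
* First half of the time frame only (T1 and T2)
use wide_panel, clear
keep country Var1_T1 Var1_T2 Var2_T1 Var2_T2 Var3_T1 Var3_T2

mi set mlong
mi register imputed Var1_T1 Var1_T2 Var2_T1 Var2_T2 Var3_T1 Var3_T2
mi impute chained (regress) Var1_T1 Var1_T2 Var2_T1 Var2_T2 ///
    Var3_T1 Var3_T2, add(5) rseed(2015)

save imputed_T1_T2, replace

* The second sub-model repeats the same steps with the T3 and T4 variables
```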

I understand that the imputed values for each missing data point from the single time frame model (assuming for now that its regressions inside the MICE algorithm converge) differ from those of the split time frame models. That is to be expected, because the data points used for estimation differ in each case.

What I want to do next (and what I am unsure of) is to merge the results of the split time frame models and use them for estimation as if they had been produced by the single time frame model.
Is this approach appropriate?


Carlo Lazzaro

Join Date: Apr 2014

Posts: 17718
#7

06 Feb 2015, 00:31

Pavel:
Assuming that the MAR hypothesis is sound, your approach may turn out to be reasonable.
I agree with you that, if data are MCAR, you can use listwise deletion (or complete case analysis) without any relevant harm, aside from the effects on hypothesis testing.
It may well be an artifact of your example, but I'm under the impression that Var3_T4 should be carefully scrutinized in terms of the MAR assumption if it has repeated missing values in your database.

Finally, you may want to take a look at the following thread http://www.statalist.org/forums/foru...missing-values and search for others on the same (or a related) topic that pop up on the list from time to time.

Last edited by Carlo Lazzaro; 06 Feb 2015, 00:36.

Kind regards,
Carlo
(Stata 19.0)
