Dear all,
I have searched for a solution but am still stuck on the following multiple imputation problem. My questions are about statistics rather than about Stata itself, but I think other Stata users may run into the same problem.
The objective of my project is to predict Y from a set of variables X. I only care about the predictions, not the coefficient estimates.
Characteristics of my data:
- My data is a panel of 217 individuals followed for 58 years.
- There are about 1500 X variables.
- All the X variables have missing values. I don't have a single complete case.
- Y also has missing values.
- My Y variable lags the X variables by two years. That is, some X variables go up to 2017 whereas my Y variable only goes up to 2015. I am interested in predicting (nowcasting) Y for 2016 and 2017, so I don't care about accurately predicting the missing values of Y in earlier years.
- My X variables are on different scales. Some are continuous, some are shares or proportions of something else, some are densities, and some are changes over time. None of them is categorical.
- I don't know of any qualitative relationship among the X variables, beyond the fact that I can group them into topics such as "exercise habits", "sleep habits", "nutritional habits", among many others.
My questions:
- It could be argued that all the X variables relate to each other, but that would imply imputing all 1500 variables at the same time. Is it reasonable to do so? Or is it better to impute by topic?
- Given that I don't know of any theoretical relationship between the X variables, is there any statistical analysis that would shed some light on which variables I should impute together?
- Given that I have so many variables, does it make sense to do the multiple imputation in parts rather than all at once? If so, which variables should I impute first? The ones with fewer missing values?
Thank you so much,
Pablo.