  • Multiple imputation with no complete cases and too many variables

    Dear all,

    I have tried to find a solution on my own, but I am still stuck on the following multiple imputation problem. My questions below concern statistics rather than the use of Stata itself. However, I think this is a problem that other Stata users may face as well.

    The objective of my project is to predict Y using a set of variables X. I care only about the predictions, not about the coefficient estimates.

    Characteristics of my data:
    1. My data is a panel of 217 individuals followed over 58 years.
    2. There are about 1,500 X variables.
    3. All the X variables have missing values. I do not have a single complete case.
    4. Y also has missing values.
    5. My Y variable has a two-year lag with respect to the X variables. That is, some X variables go up to 2017 whereas my Y variable goes up to 2015. I am interested in predicting (nowcasting) Y for 2016 and 2017. In other words, I do not care about accurately predicting missing values of Y for years before 2015.
    6. My X variables are on different scales. Some are continuous, some are shares or proportions of something else, some are densities, and some are changes over time. None of them is categorical.
    7. I don't know of any qualitative relationship among the X variables beyond the fact that I can group them by topic, such as "exercise habits", "sleep habits", "nutritional habits", among many others.
    Given these characteristics, I am trying to answer the following questions. I have not found anything substantial yet, and I was wondering whether you could shed some light on them.
    1. It could be argued that all the X variables relate to each other, but that would imply doing multiple imputation of 1,500 variables at the same time. Is it reasonable to do so? Or is it better to impute by topic?
    2. Given that I don't know of any theoretical relationship between the X variables, is there any statistical analysis that sheds some light on which variables I should impute together?
    3. Given that I have too many variables, does it make sense to do multiple imputation in parts rather than all at once? If so, which variables should I impute first? The ones that have fewer missing values?
    I know that it is my job to do proper research and select the correct imputation model for my project. So far, I have not found anything that deals with the two main characteristics of my dataset: [1] no complete cases and [2] too many variables. If you could point me to any paper or relevant document, I would highly appreciate it.


    Thank you so much,

    Pablo.



    Best,
    Pablo Bonilla

  • #2
    I fear you have too many variables for (comparatively) too few individuals. It is hard to imagine a model that will fit well; it may overfit. With regard to multiple imputation, you didn't say much about the missingness mechanism (MAR?) or the amount of missing data. If the data are MNAR or the percentage missing is huge, no strategy can be considered perfect. With so much missing data, given that all variables have missing values, this may turn out to be an immense challenge, statistically speaking.
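    Since the amount of missing data matters here, one rough first step is to tabulate the share of missing values per variable and count the complete cases. The following is only a minimal sketch in Python, not a recommended imputation workflow: the function names, the variable names (y, x1, x2), and the toy rows are all hypothetical, and missing values are assumed to be coded as None.

```python
from typing import Dict, List, Optional

# One observation (person-year); None marks a missing value (an assumption).
Row = Dict[str, Optional[float]]


def missing_share(rows: List[Row], variables: List[str]) -> Dict[str, float]:
    """Return, for each variable, the fraction of rows where it is missing."""
    n = len(rows)
    if n == 0:
        return {v: 0.0 for v in variables}
    return {v: sum(r.get(v) is None for r in rows) / n for v in variables}


def complete_cases(rows: List[Row], variables: List[str]) -> int:
    """Count the rows in which every listed variable is observed."""
    return sum(all(r.get(v) is not None for v in variables) for r in rows)


# Hypothetical toy data: every variable has gaps and no row is complete.
rows: List[Row] = [
    {"y": 1.0, "x1": None, "x2": 0.3},
    {"y": None, "x1": 2.0, "x2": None},
    {"y": 3.0, "x1": 4.0, "x2": None},
    {"y": 2.5, "x1": None, "x2": 0.7},
]
variables = ["y", "x1", "x2"]

shares = missing_share(rows, variables)
n_complete = complete_cases(rows, variables)

# List variables from least to most missing.
for name in sorted(variables, key=lambda v: shares[v]):
    print(f"{name}: {shares[name]:.0%} missing")
print(f"complete cases: {n_complete} of {len(rows)}")
```

    In Stata, `misstable summarize` reports comparable per-variable counts. A tabulation like this says nothing about whether the data are MAR or MNAR; it only shows how much is missing and where.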
    Best regards,

    Marcos
