
  • Missing data at level 2 predictor variable - is it possible to do multiple imputation in Stata?

    Hi,

    I have missing data at level 2 for my predictor variables, and I can't find a way to do multiple imputation (bearing in mind I am nowhere near an expert in either Stata or multilevel models).

    My study is looking at children's engagement in classrooms, so I have 4 levels, with children being level 2, i.e. the structure of the dataset is school > class > child > observation. I have about 20-30 observations per child, and 5 children per classroom. I have 1669 observations in total. I'm mainly interested in the impact of activity on engagement (i.e. of a level 1 predictor variable).
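
    For concreteness, the kind of model I am fitting is along these lines (a rough sketch only; all variable names are placeholders for my actual ones):

    * four-level random-intercept model: observations nested in children,
    * children in classes, classes in schools; activity is a categorical
    * level 1 predictor
    mixed engagement i.activity || school_id: || class_id: || child_id: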

    I also have teacher questionnaires about the children's personality traits (i.e. level 2 predictors). However, I am missing the questionnaires for 1 class = 5 children. There are several issues with the missing data: the data are missing for all of the same participants, there are 8 of these level 2 variables, and I don't have much that can be used to predict their values. My only other data are the children's gender (age is also missing for those children) and the engagement scores, but the latter is my outcome variable.
    I read that there are methods for doing this, but the author didn't elaborate, and I think they would probably be beyond me unless there is a neat Stata command to do it.

    I tried to do multiple imputation, but it gives me level 1 imputations (i.e. different personality scores for each observation of the same child), which is obviously not appropriate.
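
    From what I have read, the workaround would be something along these lines (impute the traits on a file with one row per child, then merge the imputed values back onto the observations), but I am not sure I have it right; all variable and file names below are placeholders:

    * build a one-row-per-child (level 2) file
    use obs_level, clear
    bysort child_id: egen mean_engage = mean(engagement)   // child-level mean of the outcome as an auxiliary predictor
    bysort child_id: keep if _n == 1
    keep child_id gender mean_engage trait1-trait8

    * impute the 8 traits at child level
    mi set wide
    mi register imputed trait1-trait8
    mi register regular gender mean_engage
    mi impute chained (regress) trait1-trait8 = gender mean_engage, add(20) rseed(12345)

    * spread each child's imputed traits over all of that child's observations
    mi merge 1:m child_id using obs_level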

    At the moment, the only solution I have found is to do listwise deletion (though I'm aware of the issues), compare the impact on my level 1 variables between the models with and without that classroom (i.e. 1669 and 1570 observations respectively), and treat the results about the level 2 variables with caution (I'm presenting them separately and as exploratory).
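
    Roughly, the comparison looks like this (again a sketch with placeholder names, not my exact code):

    * full sample (1669 observations)
    mixed engagement i.activity || school_id: || class_id: || child_id:
    estimates store full

    * complete cases only (dropping the classroom with missing questionnaires)
    mixed engagement i.activity if !missing(trait1) || school_id: || class_id: || child_id:
    estimates store complete

    * side-by-side coefficients, standard errors, N, and AIC
    estimates table full complete, se stats(N aic)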


    Any thoughts welcome.

    Thank you in advance (and Merry Christmas!)
    Last edited by Soizic Le Courtois; 21 Dec 2020, 10:21.

  • #2
    This appears, at least to me, to be a pretty complex query.

    Originally posted by Soizic Le Courtois View Post
    [...]I have 4 levels, with children being level 2, i.e the structure of the dataset is school> class> child> observation. I have about 20-30 observations per child, and 5 children per classroom. I have 1669 observations in total.
    Not being an expert in multilevel analyses myself, I dare say that 4 levels seem overkill, especially if you are not really interested in estimating the variances due to each level.

    I'm mainly interested in the impact of activity on engagement (i.e. of a level 1 predictor variable).
    [...]
    My only other data are children's gender, as age [...]
    If this is really all you have (observed), I would probably think about how to best use the data structure to isolate the association you are interested in. Multilevel analyses, in the sense of random-effects models, will not help much here. Given that you have virtually no controls, i.e., observed confounders, I would probably try to wipe out school and class differences by treating those as fixed effects in the sense of a within-child estimator.
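
    To illustrate what I mean, a minimal sketch with made-up variable names: a within-child (fixed-effects) estimator wipes out everything that is constant within a child, and thereby also all class and school differences.

    * sketch only; engagement, activity, child_id, class_id are placeholders
    xtset child_id
    xtreg engagement i.activity, fe vce(cluster class_id)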

    However, I am missing questionnaires for 1 class = 5 children.
    Would including that one class affect your results both in terms of estimates and in terms of generalizability?

    At the moment, the only solution I have found is to do listwise deletion (though I'm aware of the issues), [...]
    The main issue, especially in linear models, is often just the loss of power.



    • #3
      Hi Daniel, thanks for your input. I'm pretty sure multilevel is what I need, if only to account for the nested structure of the data. The question of how much variance (or how little) is explained at the different levels is also pretty important to me, so that wasn't really my worry here.

      Would including that one class affect your results both in terms of estimates and in terms of generalizability?
      It doesn't make much difference to the estimates, and none to the overall conclusion, but the differences in AIC and R-squared are pretty big; that might just be because of the size of the dataset?

      The main issue, especially in linear models, is often just the loss of power.
      I thought bias in estimates was also a big issue?

      In any case, thanks again for your thoughts, I really appreciate it.






      • #4
        Originally posted by Soizic Le Courtois View Post
        I'm pretty sure multilevel is what I need, if only to account for the nested structure of the data.
        There are various ways to do that, some as simple as estimating cluster-robust standard errors. The question might not be so much what you need as what you want; only you can decide that.
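
        For example (a sketch, with made-up variable names):

        * ordinary regression with standard errors clustered at the class
        * (or school) level, instead of a full random-effects model
        regress engagement i.activity, vce(cluster class_id)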


        Originally posted by Soizic Le Courtois View Post
        but the differences in AIC and r2 are pretty big, [...]
        Why do you care about AIC (at best relevant for model comparisons) and R-squared (hardly relevant ever)? How do these statistics relate to your research question(s)?


        Originally posted by Soizic Le Courtois View Post
        I thought bias in estimates was also a big issue?
        Bias might or might not be an issue. MI can reduce bias if the missing-data mechanism is ignorable (missing at random) and modeled correctly; so can simply conditioning on the relevant covariates in linear regression models. From what you have described here, I would be (much) more concerned about omitted-variable bias (which is why I would go for a fixed-effects/within estimator) than about bias due to missing values.

