
  • Multiple imputation for missing categorical variables

    Dear Statalist experts,
    I am currently working with a questionnaire-derived dataset consisting mostly of categorical (nominal and ordinal) variables, some of which have missing data (MAR) where people did not complete the questionnaire. Because the final model is intended for predictive diagnostics, it is important that the dataset be as complete as possible, so I am hoping to fill in the missing data points using multiple imputation in Stata. I tried mi impute chained, but Stata keeps telling me that there are missing values within my imputation variables. I thought this problem would be alleviated by using chained equations, i.e. the iterations should run in a chain/loop simultaneously. The syntax I used was the following:

    mi impute chained (mlogit, include(Q2 Q69e Q77) noimputed augment) Q10, add(3) rseed(23549)

    but I keep getting these error messages:

    either
    r(498) missing imputed values produced
    This may occur when imputation variables are used as independent variables or when independent variables contain missing values.

    or this:

    convergence not achieved
    convergence not achieved
    mlogit failed to converge on observed data

    As a result, the regression model used to predict the missing value cannot be created. I really welcome any input at all in the matter. Any insights that could possibly resolve the matter would be greatly appreciated. Many thanks.

  • #2
    Why are you using noimputed? The help says the option is rarely used. I would suggest starting nice and simple and then add complexity if you think you need it. augment is a little esoteric too; if you need it it is because you have perfect predictions, and if so that may be adding to your woes.

    Also, how much missing data do you have? There may be limits to the miracles MI can do if there are huge amounts of MD in several variables.
    -------------------------------------------
    Richard Williams, Notre Dame Dept of Sociology
    Stata Version: 17.0 MP (2 processor)

    EMAIL: [email protected]
    WWW: https://www3.nd.edu/~rwilliam
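
    The "how much missing data" question can be answered before any imputation is attempted. A minimal sketch, assuming the four variables named in the original command (Q10, Q2, Q69e, Q77) are the ones of interest and are numeric:

    ```stata
    * Counts of missing values for each variable in the imputation model
    misstable summarize Q10 Q2 Q69e Q77

    * Which combinations of these variables tend to be missing together
    misstable patterns Q10 Q2 Q69e Q77, frequency
    ```

    If the summary shows missing values in Q2, Q69e, or Q77, that alone could account for the r(498) message, since those variables enter the model only as predictors.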



    • #3
      And, to be totally cynical, if not nihilistic, I would point out that missing data arising from people not completing the questionnaire is really not likely to be MAR.



      • #4
        It may just be because I do not have enough experience with it, but I tend to be leery of MI in general. It seems like the benefits are often trivial, or that the justification for using it may be in doubt. In this case I might want to do some checks to see how similar the people who didn't complete are to the people who did complete on the parts that both completed.
        -------------------------------------------
        Richard Williams, Notre Dame Dept of Sociology
        Stata Version: 17.0 MP (2 processor)

        EMAIL: [email protected]
        WWW: https://www3.nd.edu/~rwilliam
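
        One way the completer versus non-completer check might look, as a rough sketch only. The indicator name miss_Q10 is made up for illustration, and Q2 and Q77 stand in for whichever items both groups mostly answered:

        ```stata
        * Flag respondents who skipped Q10 (1 = missing, 0 = observed)
        generate byte miss_Q10 = missing(Q10)

        * Compare the two groups on items that both answered
        tabulate Q2  miss_Q10, column chi2
        tabulate Q77 miss_Q10, column chi2
        ```

        Clear differences between the groups would suggest that missingness depends on observed answers. That can be consistent with MAR, but it cannot rule out MNAR.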



        • #5
          Dear Mr Williams and Mr Schechter,

          Many thanks for replying to my conundrum.

          The reason I am assuming the missingness is MAR is that we carried out interviews with a random sample to find out the reasons for the missing data. The reasons given for omitted questions varied; for many it was a matter of accidental omission, or there was no specific reason per se. While I do see your argument, I don't think my data are necessarily MNAR either.

          I know I should be wary of MI but at present, I've been tasked to proceed with it. Unfortunately, the participant-completed questionnaire was a large one consisting of 100+ variables, and a few missing data points occurred for most of the participants. If I started the regression process now, I would lose most of my data through listwise deletion. Hence, I would like to impute and retain as much data as possible. The proportion of missing data varied from 0.9% to 10% across the variables. Regarding the rigor of MI as a method, following successful MI, I have proposed a few checks to assess the validity of the imputed dataset in order to ensure that it is logical.

          I am really open to other options, but I need to ensure I've exhausted all avenues of MI first, as I have been assigned to do. As advised, I have since attempted the imputation model without the additional options, and missingness in the imputation variables is still a problem. After a long discussion with the team, I think that for the time frame given we might need to forgo MI and proceed with the regression model as planned. Any suggestions that could help solve the MI problem, or any other statistical classification model that could handle missingness in categorical data with dichotomous dependent variables in healthcare research, would still be greatly appreciated. Meanwhile, I'll keep searching the web for a general idea of the literature. Thank you again.



          • #6
            Based on your description I wouldn't expect you to be having so much trouble, so, without having the data, it is hard to advise you. To further simplify things, maybe you could try dichotomizing your mlogit variable and see if it will work then. Or, if there are some categories with very sparse counts (e.g. only 4 people gave a response of 7) then see if there are logical ways to combine and reduce the number of categories. These are things you might want to do regardless of whether you are using mi or not.
            -------------------------------------------
            Richard Williams, Notre Dame Dept of Sociology
            Stata Version: 17.0 MP (2 processor)

            EMAIL: [email protected]
            WWW: https://www3.nd.edu/~rwilliam
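
            The simplifications suggested above could be tried along these lines. This is a sketch under assumptions: the category codes (1 to 7), the choice of which categories to merge, and the new variable names Q10_c and Q10_bin are all hypothetical, and the recodes are meant to be run on the original data before it is mi set.

            ```stata
            * Look for sparse categories, including the count of missing values
            tabulate Q10, missing

            * Hypothetical: fold the sparse category 7 into category 6
            recode Q10 (6 7 = 6), generate(Q10_c)

            * Hypothetical: dichotomize instead (1-3 versus 4-7)
            recode Q10 (1/3 = 0) (4/7 = 1), generate(Q10_bin)

            * Check whether mlogit converges on the observed data alone
            mlogit Q10_c i.Q2 i.Q69e i.Q77
            ```

            If the plain mlogit still fails to converge on complete cases, the problem probably lies in the model or the cell counts rather than in mi itself.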



            • #7
              Hi Joey,
              The error "r(498) missing imputed values produced
              This may occur when imputation variables are used as independent variables or when independent variables contain missing values." suggests that one of the independent variables you are using also has missing values itself. You can use the option 'force' to go ahead with the imputation and for the independent variable with missing data only complete cases will be used. I hope this helps.
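
              For reference, the force option would be added roughly as follows. This sketch reuses the original poster's command and seed with noimputed and augment dropped, and assumes the data are already mi set with Q10 registered as imputed:

              ```stata
              * Proceed even if some imputed values come out missing
              mi impute chained (mlogit, include(Q2 Q69e Q77)) Q10, add(3) rseed(23549) force

              * Check how many values of Q10 are still missing in imputation 1
              mi xeq 1: misstable summarize Q10
              ```

              With force, Q10 may remain missing in rows where Q2, Q69e, or Q77 is itself missing, so the check afterwards is worth running before any analysis.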



              • #8
                Welcome to Statalist, Ellie. What is puzzling about that message is that mi impute chained is supposed to fill in missing values for multiple variables. But, for some reason, it is having trouble doing so. Hopefully Joey has made some progress on this.
                -------------------------------------------
                Richard Williams, Notre Dame Dept of Sociology
                Stata Version: 17.0 MP (2 processor)

                EMAIL: [email protected]
                WWW: https://www3.nd.edu/~rwilliam
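
                One possible explanation, offered as a guess since the data are not available: the original command lists only Q10 as an imputation variable, while Q2, Q69e, and Q77 enter through include(), so their missing values are never filled in. A sketch of a specification that imputes all four together follows. It assumes all four are nominal (ologit could replace mlogit for the ordinal ones) and that the data have not yet been mi set:

                ```stata
                * Skip these two lines if the data are already mi set and registered
                mi set wide
                mi register imputed Q10 Q2 Q69e Q77

                * Preview the conditional models without imputing anything
                mi impute chained (mlogit) Q10 Q2 Q69e Q77, dryrun

                * Impute all four variables jointly, each conditional on the others
                mi impute chained (mlogit) Q10 Q2 Q69e Q77, add(3) rseed(23549)
                ```

                Here each variable is predicted from the current imputed values of the others, which is the chained behavior the original post expected.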

