  • Multiple Imputation taking forever!!

    Hi everyone. I am new to imputation using Stata, but I have found multiple posts that have been very helpful in guiding me with answers to questions I previously had. For my current issue, however, I have really found no answers thus far.

    The issue:
    I am using Stata 12 on Windows 8.1 and I am performing multiple imputation using mi impute chained. My data has around 500,000 observations. The data is survey data with one stratum sampling. Data is set in wide format. I am trying to impute 10 variables using the following command:

    mi impute chained (ologit) Mth Qrt S_hosp income (mlogit) payor race C_hosp (logit) Sex T_hosp L_hosp = completedata_varlist, noisily augment force rseed(1234) add(25)

    It has now been 7 days that my computer has been running continuously trying to finish this command, and it is still not done! It is currently performing iteration 15,845!

    I am not sure what to do at this point and I am not sure I can keep my computer on for more days!!!

    Q1: Is there a way to find out when this imputation is going to finish? Any timeline?

    Q2: Does this problem have to do with the command I am running? I ask because I ran a similar command on 40,000 observations from the same data, and it took Stata only 8 hours to finish the imputation.

    I am happy to provide further information if it would be helpful to anyone trying to answer the above questions.

    Many Thanks,

    Sam




  • #2
    Hard to say much. How much missing data do you have?

    In my experience, non-linear models, especially mlogit (but ologit as well), can cause a lot of trouble because they do not always converge well. I would run the process with the noisily option to get a better feeling for the problems, if there are any. The force option should probably not be specified routinely, as it seems to point to underlying conceptual problems.

    Given the huge number of observations, are you sure you really need to impute at all? Remember that, as far as unbiased estimates are concerned, you usually gain little from multiple imputation compared to listwise deletion (cf. Paul Allison's take on this often overlooked fact). So the main point of MI is basically statistical power, which, I guess, might not be a big issue here.

    Best
    Daniel



    • #3
      Daniel, thank you for the prompt response.
      To answer your questions:
      1- The percentage of missing data ranges between 2-9% for the variables included, except for race, which is missing ~22% of the time.
      2- I have specified the noisily option, and that's the only way I am able to track any progress, but I still can't see the light at the end of the tunnel or when this should end!
      3- Not imputing means I'll lose ~22% of my data to listwise deletion when performing regress or logit commands that include race as an independent variable... Should I just go old school and use a dummy variable instead? My hope is to perform a sensitivity analysis, once the imputation is done, to see whether results change drastically between the listwise-deletion analysis and the imputed-data analysis!
      Thank you for the Paul Allison link... it helps me think more about the process in general...

      Appreciative as always,

      Sam



      • #4
        The augment option makes me nervous too.

        If I was starting from scratch, I might try 2 imputations, or use a 10% subset of the data, just to see how well it is working.
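        A quick trial along those lines might look like the sketch below. It simply reuses the variable list from the original post; the 10% subset and add(2) are arbitrary trial values, and if the data are already mi set, the subset would need to be drawn before mi set:

        ```stata
        * Sketch: trial run on a random 10% subset with only 2 imputations
        preserve
        sample 10                                   // keep a random 10% of observations
        mi impute chained (ologit) Mth Qrt S_hosp income ///
            (mlogit) payor race C_hosp ///
            (logit) Sex T_hosp L_hosp = completedata_varlist, ///
            noisily rseed(1234) add(2)              // add(2): two imputations for speed
        restore
        ```

        If even this small trial crawls or fails to converge, the full run is unlikely to finish in any reasonable time.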

        You may eventually get this to run. But I bet the gains from using multiple imputation over listwise deletion will be minimal. I guess you'll see.

        Why is race missing for so many cases? And why is it important for your model? If you think race affects the decisions of others but the race really is unknown and hence can't affect decisions, that might create a justification for creating race = unknown as a category. But otherwise I don't think it is good to add unknown as a category.
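        If one did go the unknown-category route, a minimal sketch (assuming race is a numerically coded variable, with 99 as an arbitrary unused code) might be:

        ```stata
        * Sketch: add "Unknown" as an explicit category for missing race
        gen race2 = race
        replace race2 = 99 if missing(race2)
        * extend the value labels as appropriate, e.g.:
        * label define racelbl 99 "Unknown", add
        * then use i.race2 instead of i.race in regress/logit
        ```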
        -------------------------------------------
        Richard Williams, Notre Dame Dept of Sociology
        StataNow Version: 19.5 MP (2 processor)

        EMAIL: [email protected]
        WWW: https://www3.nd.edu/~rwilliam



        • #5
          As said, 22 percent is a fairly large fraction, but with 400,000 observations still left, I would probably not worry too much about power. Details, of course, depend on your specific research questions, but if an expected effect/difference does not prove to be statistically significantly different from zero (i.e., cannot be accurately estimated) given such a large sample, how likely is that effect to be relevant from an economic/political/practical perspective?

          If you are worried about bias, maybe because you expect missingness of race to be correlated with the outcome, then MI or (FI)ML might be worth a shot.

          Edit:
          Also, think carefully about the question Richard poses about the reasons (and consequences) of missing race from your research question.

          Best
          Daniel
          Last edited by daniel klein; 11 Mar 2016, 11:07.



          • #6
            Richard gives excellent advice about the practicalities of multiple imputation here. I have some thoughts about his comments on the race variable.

            Sam doesn't tell us what the substance of his project is, but his profile says he works in epidemiology. My experience as an epidemiologist who often works with clinical data (which may or may not be what Sam is doing here) is that race is a treacherous variable to use.

            1. As Sam is experiencing, it is often missing.

            2. Even when not missing, in longitudinal data it often takes contradictory values at different times.

            3. 1 and 2 may in part be because, although race is supposed to be self-reported information, it often is actually gathered as an assessment by a provider or administrative personnel and is therefore, on its face, invalid. But usually there is no indication in the data as to whether it is a bona fide self report or not.

            4. If the US Census approach to classifying race and ethnicity is followed (which is what most US health care facilities do), even self reports often go awry because many people have difficulty understanding ethnicity as a category orthogonal to race and therefore give clearly erroneous responses. Also problematic with the US Census classification is that people who identify as multiracial, a rapidly growing population here in the US, are essentially forced into randomizing their responses.

            5. Notwithstanding all of the above, there are relatively few outcome variables in health that do not show large differences by "race" however scrambled its coding may be, so you usually can't just omit it from modeling.

            6. In unpublished work I have done on this topic comparing various types of health and health services outcomes according to groups defined by various schemes for recoding race to deal with missing or contradictory values, the results are usually quite sensitive to the methodology used.

            7. In other unpublished work I have done on this topic, comparing outcomes of those who do not report race with those reporting each of the self-reported categories of race, the non-report group is often substantially different from any of the other groups, and sometimes its outcomes are at the highest or lowest end, so that this is not just a matter of unreported being a mixture of the valid categories.

            In short, race is the bane of the epidemiologist's existence.



            • #7
              Thank you all for your invaluable input.

              Richard: Thank you for your suggestion... After reading the posts, I decided to abort the process and start another per your suggestions... The race variable is self-reported in the database.

              Daniel: My data as Clyde guessed is healthcare/clinical data. I am trying to look for resource utilization variation among different patient populations and there is an abundance of literature out there that confirms differences in utilization as well as delivery by providers in people with different ethnicities/races. So I agree with Clyde, "race is the bane of my existence."

              Clyde: Appreciate your experience and insight in this matter. I am still learning about all of this. My data is cross-sectional data and race is self-reported by individuals. I know the matter is more complicated than it sounds but I am trying my best to dis-entangle it using the best statistics I can employ. Sensitivity of such data to the methodology used is exactly why I wanted to do both analyses with and without imputed data.

              Best Regards,

              Sam



              • #8
                BTW -- what flavor of Stata are you using, and what's the clock speed of your computer? I only have Stata IC at home. I've noticed that MI runs faster on my ancient desktop with a higher clock speed than on my relatively new laptop that has four cores and more RAM but a lower clock speed; since Stata IC can only use a single core, clock speed matters a lot.



                • #9
                  Ben,
                  Excellent question. I have Stata 12 SE on a computer with a clock speed of 3.9 GHz (Intel Core i7)... I'm in the market for an MP version, as my computer is quad-core and I would like to benefit from the four cores with Stata MP if possible. And yes, clock speed does matter, and so does RAM if you're dealing with large amounts of data, as in my case here.

                  Best,
                  Sam



                  • #10
                    Egads! A 3.9 GHz i7 and it's still taking that long? I'm guessing you also maxed out the RAM. I dunno how much MP improves MI, but it seems like a task that should scale well.



                    • #11
                      There is another reason, besides the ones discussed, to start with a list-wise approach. Imputation is just the start of an analysis. With such a large data set one would want to explore on one subset and validate on another. By "explore", I mean trying transformations of predictors, interactions, and shrinkage. Such exploration is difficult with MI because one must impute each predictor variable. There will also be natural stratifying factors, with gender a major one. Separate analyses for males and females would be illuminating. Thus the sample size can be reduced by simultaneously selecting subsets for exploration/validation, stratified by gender. If one later wants to do a full MI analysis, it can then concentrate on the models that have done well in the prior work.
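                      Such a stratified split might be sketched as follows (Sex is borrowed from the original command; the seed and half/half split are arbitrary choices):

                      ```stata
                      * Sketch: split each gender stratum into exploration and validation halves
                      set seed 1234
                      gen double u = runiform()
                      sort Sex u
                      by Sex: gen byte explore = (_n <= _N/2)   // first half of each stratum
                      * develop and refine models on observations with explore == 1 ...
                      * ... then check the chosen model on the held-out half (explore == 0)
                      ```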

                      Sam, your original command in italics is very difficult to read. In future posts, please, as requested in FAQ 12, put commands (and results and data listings) between CODE delimiters.
                      Steve Samuels
                      Statistical Consulting
                      [email protected]

                      Stata 14.2



                      • #12
                        I have a similar problem. The basic imputation (m=50) does not take much time, but any diagnostics take unending time with "mi xeq". My sample size is about 100,000 observations, and I have 80% missing observations. (This huge missingness is simply because certain survey questions were mistakenly asked of only a tiny subset of the entire population!)
                        Is there any way to make "mi xeq" faster? It works for m=5, but for 50 imputations it just shows no results for days! Is there any other way I can check the reliability of my imputation model in a reasonable time?
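                        One thing that may help while waiting for better answers: mi xeq accepts a numlist restricting which imputed data sets a command runs on, so diagnostics can be spot-checked on a few imputations rather than all 50. A sketch (summarize stands in for whatever diagnostic command is in use):

                        ```stata
                        * Sketch: run diagnostics on only a few of the completed data sets
                        mi xeq 1/5: summarize        // imputations 1 through 5 only
                        mi xeq 1 25 50: summarize    // or spot-check specific imputations
                        ```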

