  • Difference between using "mvn" and "chained" multiple imputation (or any other MI type, for that matter)?

    I've used multiple imputation in a survival analysis where I had a substantial amount of missing data on two covariates: the subjects' type of contract and their diagnosis. (There are no missing observations on the dependent variable.) I used multivariate normal regression, but I was wondering whether specific conditions have to hold before a given type of MI (multivariate normal regression, chained equations, etc.) can be used. I don't want to mess things up.
    I'm ignorant on the topic so it might take a very long explanation. In that case, feel free to point me towards sources where I can read and learn about it if you don't feel like typing a lot (but if you do, by all means, do it).
    Thanks in advance.

  • #2
    Eduard:
    I would start with the -mi- entries in the Stata .pdf manual.
    Kind regards,
    Carlo
    (Stata 19.0)


    • #3
      Hello Eduard,

      Carlo has already paved the way to what really matters, first and foremost: getting to grips with the Stata manual, where great information is found galore.

      That said, if I haven't misunderstood the main concepts (cf. Acock, A Gentle Introduction to Stata, Stata Press), multiple imputation by chained equations tends to be indicated when the variables are highly skewed, or when there are many count or categorical variables in the model. In short, when the assumption of normality of the regressors becomes far-fetched. Finally, let's keep in mind that its counterpart, mvn, stands for "multivariate normal".

      Also, and according to the manual,

      mi impute mvn subsamples a single chain, whereas mi impute chained runs multiple independent chains.
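
      As a rough illustration of that difference, here is a minimal sketch. It assumes the manual's example dataset mheart5s0 (already mi set, with missing values in age and bmi) can be loaded with webuse. The add() and rseed() values are arbitrary, and the burn-in values are written out only to make the mechanics visible; they should match the documented defaults.

      Code:
      ```stata
      * mvn: one long chain; the first burnin() iterations are discarded,
      * then an imputation is saved every burnbetween() iterations
      webuse mheart5s0, clear
      mi impute mvn age bmi = attack smokes hsgrad female, ///
          add(5) burnin(100) burnbetween(100) rseed(1234)

      * chained: a separate chain is run for each imputation,
      * each with its own burnin() iterations
      webuse mheart5s0, clear
      mi impute chained (regress) age bmi = attack smokes hsgrad female, ///
          add(5) burnin(10) rseed(1234)
      ```

      The /// line continuations only work in a do-file; in the Command window each command has to go on one line.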
      Hopefully that helps,

      Marcos
      Last edited by Marcos Almeida; 08 Feb 2016, 11:06.
      Best regards,

      Marcos


      • #4
        You can find a thorough discussion of chained equations in Ian R. White, Patrick Royston and Angela M. Wood (2011), "Multiple imputation using chained equations: Issues and guidance for practice", Statist. Med. 30: 377–399. There is a website at the University of Wisconsin (https://www.ssc.wisc.edu/sscc/pubs/stata_mi_intro.htm) which provides detailed examples.
        Richard T. Campbell
        Emeritus Professor of Biostatistics and Sociology
        University of Illinois at Chicago


        • #5
          Thanks to everyone for the help, you guys are life-savers. So, let me ask:
          1) If the covariates I'm trying to impute for the final model (the Cox model) are categorical, can I use multivariate normal regression (mvn)? The two covariates are the "type of contract" that people are employed under and the diagnosis of the condition. Both are categorical: the type of contract is either A or B, and the diagnosis is one of a list of, say, 10.
          I'm asking because I got the impression that multivariate normal regression is only adequate for continuous variables. Am I wrong?
          I used it because I was actually following "A Gentle Introduction to Stata", and that's the imputation method they use in the example given, but then I started reading about the different methods and it confused me a bit.

          2) I also tried using chained multiple imputation but it took forever to process it (way longer than with multivariate normal regression, I'm talking 3 hours or more of me waiting, so I assumed the program had got stuck). Afterwards, the log read like this:


          Code:
          mi impute chained (mlogit) typeofcontract diagnosis, add(5) rseed(2121)
          
          Conditional models:
                         diagnosis: mlogit diagnosis i.typeofcontract
                    typeofcontract: mlogit typeofcontract i.diagnosis
          
          Performing chained iterations ...
          error occurred during imputation of typeofcontract diagnosis on m = 2
          --Break--
          r(1);

          3) And a follow-up question, how do you estimate standard errors using Rubin's Formula to take into account variations within and between the imputed datasets?

          Again, really appreciate your help, thanks in advance!
          Last edited by Eduard López; 09 Feb 2016, 16:16.


          • #6
            First, although there is a fair amount of evidence that MVN is robust, in the sense that imputations based on an MVN assumption are often adequate even if the assumption is violated, I wouldn't use MVN in this particular case. Second, your imputation attempt probably failed because you are using just two variables, each with a high proportion of missing data, to impute each other. You need to find other appropriate variables to add to your imputation model. The added variables can have missing values, but you need more information to do decent imputations than your current setup allows.
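
            A minimal sketch of what a richer imputation model might look like. The auxiliary variables (year, sex, agegroup, incomegroup, industry, region) are placeholders for whatever covariates are available, and treating them as fully observed is an assumption. The dryrun option only displays the conditional models, so the specification can be checked before committing to a long run.

            Code:
            ```stata
            * Declare the mi style and the role of each variable
            mi set wide
            mi register imputed typeofcontract diagnosis
            mi register regular year sex agegroup incomegroup industry region

            * Show the conditional models without imputing anything
            mi impute chained (mlogit) typeofcontract diagnosis ///
                = year i.sex i.agegroup i.incomegroup i.industry i.region, ///
                add(5) dryrun
            ```

            Dropping dryrun (and adding rseed() for reproducibility) would then run the actual imputation.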
            Richard T. Campbell
            Emeritus Professor of Biostatistics and Sociology
            University of Illinois at Chicago


            • #7
              1) Would you then use chained equations, Dick?

              2) I'm also having a bit of trouble with the interpretation of Stata's Cox Model output. I want to know where to find the number of observations for the model (let's imagine that in this case we are using the complete case analysis). Is it the number before "total observations", before "failures in single-record/single-failure data", or after " Number of obs = "?
              And in that case, is that number, added to the number of subjects with a missing observation in any variable of the model, supposed to equal the total number of subjects in the database?

              Thanks a lot.
              Last edited by Eduard López; 11 Feb 2016, 09:34.


              • #8
                Sorry for the double post, but I wanted to ask something related to this (if this warrants a separate thread, please let me know and I will create it).

                1) If my variable of interest (the duration of a condition) has NO missing data, is it a valid strategy to codify the missing data from each covariate as a "residual category", in order for the Cox models (I'm actually running two, one for each sex) not to drop those episodes when running in complete-case analysis?

                My rationale for this is that I would not need to impute anything, as what I'm interested in is the variable of interest, and that has no missing values. So, simply because there is no missing data in the variable of interest for the Cox models, I would not really need to impute anything (each episode is contributing with its accurate value of the variable of interest to the final models!).

                2) Were I to do that, run the Cox models and obtain the "new" HRs without dropping the many episodes with missing covariate data, would the total number of observations in the Cox models equal the total number of episodes in my database for each sex (since no Cox model would be "dropping" any episode, because I coded the missing data for each covariate as a "residual category")?


                Thanks a lot in advance, and sorry if the questions sound a bit silly.


                • #9
                  There are too many questions, and some of them relate to the (unavoidable) background theory, so answering them here is a bit hit-or-miss, so to speak.

                  That said, I think you should perform some sort of sensitivity analysis. You could, for example, compare your "complete cases" analysis with the analysis under imputed data.
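
                  A minimal sketch of such a comparison, assuming the data have already been stset, mi set and imputed; the covariate list is only illustrative.

                  Code:
                  ```stata
                  * Complete-case analysis: fit the model on the original data (m = 0),
                  * where observations with missing covariates are dropped
                  mi xeq 0: stcox i.typeofcontract i.diagnosis i.sex i.agegroup

                  * Multiple-imputation analysis: fit the model on each imputed
                  * dataset and pool the results
                  mi estimate, hr: stcox i.typeofcontract i.diagnosis i.sex i.agegroup
                  ```

                  Large differences between the two sets of hazard ratios would suggest that the results are sensitive to how the missing data are handled.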

                  Also, I don't know what you meant by "codifying the missing data for each covariable in a residual category", but in Stata you can easily check the patterns of your missing data, summarize them, and even create variables that identify the observations with missing values.

                  Please type:

                  Code:
                  . help misstable
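
                  Beyond the help file, a minimal sketch of those checks, using the two covariates from this thread; the miss_ prefix is an arbitrary choice.

                  Code:
                  ```stata
                  * Count the missing values in each covariate
                  misstable summarize typeofcontract diagnosis

                  * Show which combinations of missingness occur, as frequencies
                  misstable patterns typeofcontract diagnosis, frequency

                  * Create indicator variables (prefixed miss_) that flag
                  * the observations with missing values
                  misstable summarize typeofcontract diagnosis, generate(miss_)
                  ```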
                  Hopefully that helps.

                  Best,

                  Marcos
                  Best regards,

                  Marcos


                  • #10
                    Thanks a lot Marcos.

                    What I mean is this: imagine the variable "diagnosis" has 12 categories and a lot of missing observations, which causes my final Cox model to drop those episodes (as it's a complete-case analysis). I'm not that interested in the missing values of the diagnosis covariate per se, since what I'm measuring is the variable of interest, "duration", which records the duration of the episode.

                    So what I did was
                    Code:
                    . replace diagnosis = 13 if diagnosis == .
                    . label define diagnosislab 1 "Mental disorders" [...] 13 "Residual category"
                    When doing that, there are no longer missing data for that covariable so those episodes are not dropped with complete case analysis. I don't know their diagnosis group but I do know the duration of their episodes.
                    Does that make sense?


                    • #11
                      The notion that one can create a sort of "missing data" category as part of a dummy variable classification, in this case a category for "no dx available", is tempting but unfortunately not useful, as it leads to biased estimates. See Jones, Michael P. 1996. "Indicator and Stratification Methods for Missing Explanatory Variables in Multiple Linear Regression." Journal of the American Statistical Association 91: 222-230. This issue is discussed in Paul Allison's Sage monograph on missing data, where he shows, via simulation, what happens when you try to do that.

                      Regarding other questions raised in this thread:

                      (a) the total number of cases in a survival analysis is the number of cases with "complete" data, meaning the number of cases with fully observed data plus the number where all relevant variables have been imputed; sample size is treated as it is in any other Cox model.

                      (b) once you do the imputations, Stata's mi estimate command will give you approximately correct standard errors;

                      (c) Rubin's rules for combining estimates, computing SEs, etc. are based on an MVN assumption, which means, as I understand it, that results from chained equations are not supported by underlying MVN theory. Thus you have a choice between strong theory but perhaps inadmissible estimates, or better estimates with a weaker tie to theory.

                      (d) Note that, as explained by White et al (see reference in previous post) when doing imputation for a survival analysis you should estimate the Nelson-Aalen cumulative hazard measure and use it in the imputation process. Although you have no missing outcome data, you should still do this.
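
                      A minimal sketch of point (d), not a definitive recipe. It assumes the episode length is stored in duration and that a hypothetical indicator event marks episodes that ended; the predictors after the equals sign are illustrative and assumed to be fully observed. Including the event indicator _d alongside the cumulative hazard is my reading of the usual recommendation, and it only carries information if some episodes are censored.

                      Code:
                      ```stata
                      * Declare the survival data (event is a hypothetical variable name)
                      stset duration, failure(event)

                      * Nelson-Aalen cumulative hazard, computed before the data are mi set
                      sts generate H = na

                      mi set wide
                      mi register imputed typeofcontract diagnosis

                      * Impute both covariates with the outcome information
                      * (H and _d) among the predictors
                      mi impute chained (mlogit) typeofcontract diagnosis ///
                          = H _d year i.sex i.agegroup, add(5) rseed(2121)
                      ```

                      Running the Cox model under mi estimate afterwards pools the results with Rubin's rules: the pooled coefficient is the average of the M per-imputation estimates, and its variance is T = W + (1 + 1/M)B, where W is the average within-imputation variance and B is the between-imputation variance of the estimates.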
                      Richard T. Campbell
                      Emeritus Professor of Biostatistics and Sociology
                      University of Illinois at Chicago


                      • #12
                        Originally posted by Dick Campbell
                        Second, your imputation attempt probably failed because you are using just two variables, each with a high proportion of missing data, to impute each other. You need to find other appropriate variables to add to your imputation model. The added variables can have missing values, but you need more information to do decent imputations than your current setup allows.

                        Thanks a lot for your help Dick.

                        So the fact that I am only using two variables in my chained imputation model is the likely cause of it failing?

                        Should I then register more imputed variables or register more variables as regular ones?

                        I'm asking because I only wanted to impute two variables as they are the only ones that have more than 5% of missing data. There are others that have some missing data, but the percentage is very small and the database is very big.

                        For example, take this code:

                        Code:
                        mi set wide
                        mi register imputed typeofcontract diagnosis
                        mi register regular year sex agegroup incomegroup industry region
                        mi impute chained (mlogit) typeofcontract diagnosis = year, add(5) rseed(88) dots
                        Does that sound right? Or do I need to register as imputed all those variables that have some percentage of missing data (even if I didn't want a priori to impute them)? Does it even make sense to register some variables as imputed but then not add them after "mi impute chained"?
                        I ran this latter example (registering most variables as imputed, except for "year" and "sex") and it's taking forever to finish. It is working though, as I added the noisily option and I can see it running but still...

                        Thanks a lot.
