  • Limiting Range of values in Multiple imputation

    Hello everyone,
    My name is Habtesh, and I would like to ask for some advice regarding multiple imputation. I have a cross-sectional data set of 104 countries and 15 variables (1 outcome variable, 11 policy variables, and 3 control variables). The policy variables are continuous, ranging from 0 to 2, while the other variables are continuous without any particular range. My objective is to fill in the missing values of 7 of the policy variables, and I want to use multiple imputation. Here is what I did in Stata:
    Code:
    mi set mlong
    mi register imputed Involvement Advancerulings Appealprocedures Feesandcharges Cooperationinternal Cooperationexternal Governance
    mi register regular logTimetoexportBC Information Formalitiesauto Formalitiesdoc Formalitiesproc Regulatory Corruption lnPCGDP
    mi impute mvn Involvement Advancerulings Appealprocedures Cooperationinternal Feesandcharges Cooperationexternal Governance = Regulatory Corruption lnPCGDP, add(80)
    mi estimate: regress logTimetoexportBC Information Involvement Advancerulings Appealprocedures Feesandcharges Formalitiesauto Formalitiesdoc Formalitiesproc Cooperationinternal Governance Cooperationexternal Regulatory Corruption lnPCGDP i.Income
    My questions are as follows:
    1. When I check the summary statistics, I find that some of the imputed values for the missing observations are not between 0 and 2; some are below zero and some are above two. Is there any option in mi to limit the range of the imputed values to between 0 and 2? I also used mi xeq to view the summary statistics by imputation, and the minimum and maximum of the policy variables with imputed missing values are below 0 and above 2.
    2. I chose M = 80. Is there any mechanism to determine an appropriate number of imputations?
    Thanks in advance for your cooperation; any advice on the above questions is welcome.




  • #2
    Habtesh,

    thanks for providing the code you used. You can further improve your future posts by using code delimiters, i.e., putting your code between [CODE] and [/CODE] tags.

    The first thing I wish to point out is that 14 predictors might be considered too many by some reviewers, given only about 100 observations.

    The second point concerns your imputation model. Note that it is important to put (at least) all variables that are used in the analysis into the imputation model, especially the response/outcome/dependent variable. This is crucial. You are currently including only a subset of the analysis variables in the imputation model. As a result, the correlations between the imputed values and the variables that you omit from the imputation model will be underestimated. Hence, the relationship between the imputed variables and the response/outcome/dependent variable will be underestimated (or overestimated, depending on the correlations among the predictor variables).

    Concerning the choice of imputation model, you are using mvn (multivariate normal) imputation. Stata will use repeated draws from a multivariate normal distribution, which will result in imputed values that lie outside the range of the original predictors. That is, in general, OK. One could ask, though, what kind of variable falls into the range [0; 2] and is still considered continuous. Perhaps there is a good answer to that question, but I would argue that the (multivariate) normal distribution does not fit well here. If you want the imputed values to fall inside the observed range, use chained imputation with pmm (predictive mean matching) instead. The code is something like

    Code:
    mi impute chained (pmm, knn(#)) varlist ...
    Whether the underlying regression model fits better than the multivariate normal model is another question that you need to answer.
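    To make that concrete, here is a sketch only, adapted to the variable names in your first post (knn(5) is an arbitrary illustration, not a recommendation; note that the outcome and the remaining covariates now appear on the right-hand side, as argued above):

    Code:
    * illustrative sketch only, adapting the pmm suggestion to the variables in #1
    mi impute chained ///
        (pmm, knn(5)) Involvement Advancerulings Appealprocedures Feesandcharges ///
                      Cooperationinternal Cooperationexternal Governance ///
        = logTimetoexportBC Information Formalitiesauto Formalitiesdoc Formalitiesproc ///
          Regulatory Corruption lnPCGDP, add(80)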

    Concerning the choice of number of imputations, there is a rule-of-thumb (I cannot recall the original source now) that you would want 100*FMI imputations, where FMI is the largest fraction of missing information. You get the latter from the output header of mi estimate.
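    For example, if the largest FMI reported there were 0.35, the rule would suggest roughly 100 * 0.35 = 35 imputations.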

    Best
    Daniel



    • #3
      Concerning the choice of number of imputations, there is a rule-of-thumb (I cannot recall the original source now) that you would want 100*FMI imputations, where FMI is the largest fraction of missing information. You get the latter from the output header of mi estimate.
      daniel klein I, too, have seen this rule of thumb. As a practical matter, this can be problematic because you have to first generate some imputations, then run -mi estimate- to find out FMI and then perhaps run more imputations, and then re-check -mi estimate-'s FMI value etc. I have occasionally found this leads to a cycle requiring rounds of increasing numbers of imputations. If -mi impute chained- were not so computationally intensive, I suppose I wouldn't mind that, but when working with large data sets (tens of thousands of observations and dozens to a hundred variables), some of which are imputed with models based on maximum likelihood estimation, this can turn into days of computation. Does anybody know of a way to short-circuit this? I long for a return to the "good old days" early in MI's history when it was said that 5 imputations would work for just about anything! Too bad that's not true.
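      For readers following along, the cycle described above looks roughly like this (a sketch only; the models, variables, and imputation counts are placeholders):

      Code:
      * placeholder sketch of the impute / estimate / re-check cycle
      mi impute chained (pmm, knn(5)) x1 x2 = z1 z2, add(10)   // pilot imputations
      mi estimate: regress y x1 x2 z1 z2                       // note the largest FMI in the header
      mi impute chained (pmm, knn(5)) x1 x2 = z1 z2, add(30)   // add more if 100*FMI exceeds the current M
      mi estimate: regress y x1 x2 z1 z2                       // re-check the FMI and repeat if needed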



      • #4
        I am afraid there is no shortcut. The computation of the FMI requires the within- and between-imputation variances. You would need a good guess at these values before imputing the data, and I am not aware of any approach in this direction. Hopefully, someone on the list knows more than I do on this subject.

        Best
        Daniel



        • #5
          Because of the issues raised by Clyde, I use the following "rule": examine the amount of missing data for each relevant variable; if the largest such rate is, say, 30%, obtain a minimum of 50 imputed data sets; if it is larger, or if there is something in the data that makes me worry that 50 won't be enough, obtain 100 imputed data sets; for certain types of "sensitivity" analysis, I may increase this to 200 or even 250 imputed sets (particularly where I will use weights in an attempt to "compensate" for data missing not at random, I will want additional imputations).



          • #6
            Thanks, Rich.



            • #7
              Hello, all,
              I want to thank all of you for the valuable comments and suggestions about MI.
              In particular, I thank Clyde Schechter, daniel klein, and Rich Goldstein for the valuable comments and discussion.

              I want to give some explanation based on Daniel Klein's comments and suggestions.
              1.) "..... 14 predictors might be considered too many with only 100 observations by some reviewers."
              My analysis mainly uses the World Bank Doing Business and OECD databases, and I focus on middle- and low-income countries, so I found only 104 countries with data in both databases (when I check the data available in each database separately, I find around 130 countries, but only 104 have both OECD and WB data). Therefore, I am thinking of two possible solutions: first, to go with 104 countries and 14 predictors and run a cross-sectional regression after filling in the missing values with MI; second, to estimate on a two-year panel for 2015 and 2017, which would increase the number of observations to around 208.
              2.) Thank you for your second comment; I will put all variables that are used in the analysis into the imputation model.
              3.) "Concerning the choice of imputation model." I tried chained imputation, and I have the following result:
              Code:
              mi set mlong
              mi register imputed GDP PCGDP Involvement AdvanceRulings AppealProcedures FeesandCharges CooperationInternal CooperationExternal GovernanceandImpartiality
              mi register regular GATT lnsqkm landlocked Information formalitiesdocumentsnew FormalityAutomation FormalityProcedures logTEDC
              mi register passive lnGDP lnPCGDP
              mi impute chained (truncreg, ll(0) ul(2)) Involvement AdvanceRulings AppealProcedures FeesandCharges CooperationInternal CooperationExternal GovernanceandImpartiality (regress) GDP (regress) PCGDP = lnsqkm GATT landlocked Information formalitiesdocumentsnew FormalityAutomation FormalityProcedures logTEDC, add(80)
              I get this message from Stata:
              "variable shcode56 not found
              Your mi data are xtset and some of the variables previously declared by xtset are not in the
              dataset. mi verifies that none of the xtset variables are also registered as imputed or
              passive. Type mi xtset, clear to clear old no-longer-valid settings."

              I clear xtset and run again:
              Code:
              mi xtset, clear
              mi impute chained (truncreg, ll(0) ul(2)) Involvement AdvanceRulings AppealProcedures FeesandCharges CooperationInternal CooperationExternal GovernanceandImpartiality (regress) GDP (regress) PCGDP = lnsqkm GATT landlocked Information formalitiesdocumentsnew FormalityAutomation FormalityProcedures logTEDC, add(80)
              I then get this message a second time:
              "variable _t0 not found
              Your mi data are stset and some of the variables previously declared by stset are not in the
              dataset. mi verifies that none of the stset variables are also registered as imputed or
              passive. Type mi stset, clear to clear old no-longer-valid settings."
              r(111);

              I clear stset and run again:
              Code:
              mi stset, clear
              mi impute chained (truncreg, ll(0) ul(2)) Involvement AdvanceRulings AppealProcedures FeesandCharges CooperationInternal CooperationExternal GovernanceandImpartiality (regress) GDP (regress) PCGDP = lnsqkm GATT landlocked Information formalitiesdocumentsnew FormalityAutomation FormalityProcedures logTEDC, add(80)
              Now everything is fine.
              Result (some output above, which lists the conditional models, is omitted):
              Code:
              Performing chained iterations ...

              Multivariate imputation                     Imputations =       20
              Chained equations                                 added =       20
              Imputed: m=1 through m=20                       updated =        0

              Initialization: monotone                     Iterations =      200
                                                              burn-in =       10

                         Involvement: truncated regression
                      AdvanceRulings: truncated regression
                      AppealProced~s: truncated regression
                      FeesandCharges: truncated regression
                      CooperationI~l: truncated regression
                      CooperationE~l: truncated regression
                      Governancean~y: truncated regression
                                 GDP: linear regression
                               PCGDP: linear regression

              ------------------------------------------------------------------
                                 |               Observations per m
                                 |----------------------------------------------
                        Variable |   Complete   Incomplete   Imputed |     Total
              -------------------+-----------------------------------+----------
                     Involvement |         98            6         6 |       104
                  AdvanceRulings |        102            2         2 |       104
                  AppealProced~s |         95            9         9 |       104
                  FeesandCharges |         95            9         9 |       104
                  CooperationI~l |         99            5         5 |       104
                  CooperationE~l |         76           28        28 |       104
                  Governancean~y |         97            7         7 |       104
                             GDP |        102            2         2 |       104
                           PCGDP |        102            2         2 |       104
              ------------------------------------------------------------------
              (complete + incomplete = total; imputed is the minimum across m
               of the number of filled-in observations.)
              Finally, I estimate the linear regression as follows:
              Code:
              mi estimate: regress logTEDC GATT landlocked Information Involvement AdvanceRulings AppealProcedures FeesandCharges formalitiesdocumentsnew FormalityAutomation FormalityProcedures CooperationInternal CooperationExternal GovernanceandImpartiality

              I have the following questions for further clarification:
              1. My outcome variable is in logarithmic form, so I included it in log form during the imputation (logTEDC). Is this appropriate, or do I need to include it in levels (TEDC), impute the missing values, and then take its logarithm for the final estimation?
              2. I registered lnGDP and lnPCGDP as passive variables because they have missing values in their level form. Is this correct, or can I register them as regular variables even though they have missing values, since I am not worried about only two missing observations?

              Thanks in advance for all your cooperation!





              • #8
                To figure out how many imputations you need, use my command how_many_imputations:
                Code:
                ssc install how_many_imputations
                For details on what it does, see help how_many_imputations. This is better than the old rule to use 3-10 imputations or the more recent rule to use 100 times the fraction of missing information. There's more on my blog and in von Hippel (2018).
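                The workflow is two-stage; a rough sketch with placeholder variables is below (see the help file for the exact syntax and options):

                Code:
                * rough sketch of the two-stage workflow; placeholder models and variables
                mi impute chained ..., add(20)     // stage 1: a small pilot set of imputations
                mi estimate: regress y x1 x2       // run the analysis on the pilot imputations
                how_many_imputations               // stage 2: uses the FMI from mi estimate to report the total needed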

                If you're using logged variables in your analysis, you want to log them before imputation, not after (von Hippel 2009).
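                Applied to the setup in #7, that means something like the following sketch (assuming TEDC is the raw, unlogged variable behind logTEDC):

                Code:
                * sketch: create the logged outcome before -mi set-, then use the logged version throughout
                generate logTEDC = ln(TEDC)     // assumes TEDC is the raw (unlogged) variable
                mi set mlong
                mi register regular logTEDC     // registered as in #7; the log is taken before, not after, imputing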

                The practice of limiting the range of imputed values to keep them in bounds is widespread but usually does more harm than good (von Hippel 2013). It makes the imputed values look better, but it biases regression estimates from the imputed data. I have found this problem in particular with the truncreg imputation model that you are using. It would be better to use a linear or mvn imputation model, even if some imputed values are out of bounds. The imputed values matter less than the compatibility of the imputation model with your analysis model, which I see is a linear regression.

                Likewise the practice of passive imputation introduces bias and is best avoided (von Hippel 2009).
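                For the lnGDP / lnPCGDP question in #7, a minimal sketch of the alternative (illustrative only, and using linear rather than truncated regression, in line with the previous paragraphs):

                Code:
                * sketch: drop the passive registration and impute the logged variables directly
                mi register imputed lnGDP lnPCGDP Involvement AdvanceRulings AppealProcedures ///
                    FeesandCharges CooperationInternal CooperationExternal GovernanceandImpartiality
                mi impute chained (regress) lnGDP lnPCGDP Involvement AdvanceRulings AppealProcedures ///
                    FeesandCharges CooperationInternal CooperationExternal GovernanceandImpartiality ///
                    = lnsqkm GATT landlocked Information formalitiesdocumentsnew FormalityAutomation ///
                      FormalityProcedures logTEDC, add(80)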

                mi impute is often slow in large datasets, especially when you use imputation models for categorical variables (logit, ologit, mlogit). I'm not sure why this is. It's not inevitable. Similar models run more quickly in other software (e.g., SAS PROC MI).

                Hope that helps!

                References
                von Hippel, P. T. (2018). "How many imputations do you need? A two-stage calculation using a quadratic rule." Sociological Methods and Research, in press. A version of the article is available as an arXiv e-print.
                von Hippel, P. T. (2013). "Should a normal imputation model be modified to impute skewed variables?" Sociological Methods and Research, 42(1), 105-138.
                von Hippel, P. T. (2009). "How to impute interactions, squares, and other transformed variables." Sociological Methodology, 39, 265-291.

