
  • Is there any reason why I shouldn't use ordinary clustered Huber variances with mi imputed data?

    Hello Everybody

    I have a query regarding multiply-imputed datasets generated using the mi impute command in official Stata.

    Suppose I start with a dataset with a cluster variable and/or sampling-probability weights, I intend to fit a regression model (such as a logit) predicting an outcome from a list of covariates, and some of my observations have missing values in some of the predictors. I start my mi work with the command

    mi set flong

    and then use the mi impute command to create multiple versions of the data with the predictors filled in by multiple imputation (using chained equations to impute predictors from other predictors and NOT from the outcome). The resulting dataset will then have observations identified uniquely by 2 variables, _mi_m and _mi_id, where _mi_m is zero in the observations retained from the original dataset (with some missing values in some predictors) and takes positive-integer values in the imputation subsets (with no missing values in any predictors).
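    To make this concrete, the set-up I have in mind is something like the following sketch (x1 and x2 are placeholder names for a continuous and a binary incomplete predictor, and y is the outcome):

    mi register imputed x1 x2
    * chained equations: each predictor imputed from the other
    * predictors, deliberately omitting the outcome y
    mi impute chained (regress) x1 (logit) x2, add(20) rseed(1234)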

    Is there then any reason why I should not keep only the observations with _mi_m>0, and then fit a regression model using clustered Huber variances with the original cluster variable and the original sampling-probability weight variable? The dataset used in the regression model would then be an expanded version of the original dataset, with each original cluster replaced by a cluster larger by a factor of _dta[_mi_M], containing that number of versions of the original cluster, with the same sampling-probability weights as before, but sometimes with different imputed values for some of the predictors. The multiple versions of each original observation, identified by the positive-integer variable _mi_m, would then have the same values of the cluster variable, the sampling-probability-weight variable, and the outcome variable, but possibly different values of the predictor variables. However, if I use clustered and weighted Huber variances, then (as I understand it) I would still be estimating the variances allowing for the fact that we have sampled clusters from a population of clusters, rather than observations from a population of observations. And, because I am making inferences about the conditional distribution of the outcome variable given the predictor variables, I do not see why my estimates of the model parameters, or of the sampling variances of those estimates, should be systematically wrong.
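    In Stata terms (with psu and pwt as placeholder names for the cluster and weight variables), the strategy I am asking about is roughly:

    keep if _mi_m > 0
    logit y x1 x2 [pweight=pwt], vce(cluster psu)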

    We are in fact planning to do something a bit more complicated than this, using the methods of Steyerberg, Harrell et al. (2001) to compare the predictive power of alternative logistic models using Harrell's c-index. This will involve using bsample to sample subsamples of clusters. The mi estimate command has a problem with these subsamples, because the variable _mi_id will then no longer uniquely identify the observations within each imputation subset. However, if there is any reason why I should not use ordinary clustered Huber variances on such multiply-imputed datasets, then I would like to know it.
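    For reference, the subsampling step would be along the lines of

    bsample, cluster(psu) idcluster(bs_psu)

    where idcluster() gives the resampled clusters distinct identifiers, although _mi_id itself would still be duplicated within each imputation subset.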

    Best wishes

    Roger

  • #2
    PS I forgot the Steyerberg, Harrell et al. reference, which is:

    Ewout W. Steyerberg, Frank E. Harrell Jr, Gerard J. J. M. Borsboom, M. J. C. (René) Eijkemans, Yvonne Vergouwe, J. Dik F. Habbema. Internal validation of predictive models: Efficiency of some procedures for logistic regression analysis. Journal of Clinical Epidemiology 54 (2001) 774–781.

    Best wishes

    Roger



    • #3
      I am not sure whether I follow all of this, but here are some thoughts to get a hopefully interesting discussion started anyway.

      May I ask why you want to omit the response from the imputation model? By doing this, the imputed values, which vary between the multiple versions of the original observations (those identified by _mi_m), will have no relation to the (constant within observations) outcome, because the missing values are predicted from the correlations among the predictors only. Therefore you would likely underestimate the "true" relationship between the predictors and the response in the final model.

      Concerning the question about clustered standard errors, I guess one needs to think about how these estimates relate to the concept of combining the between- and within-imputation variances according to Rubin's rules. I do not think that you will be able to capture the between-imputation variance, which is based on the estimated coefficients from within each imputed dataset, with a Huber estimator applied to a pooled point estimate. I might be wrong.

      In sum, I doubt that the outlined strategy could recover valid point estimates or standard errors in the usual sense of a multiple-imputation framework. But then, I am not entirely sure that this is what you want. If you do want MI, then I would rather work around the technical problems with the bsample command in Stata's mi suite that you describe, and combine the desired point estimates and standard errors myself. The formulas are given in the MI manual and should not be hard to implement.
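      For completeness, if Q_m is the estimate from imputed dataset m and U_m its estimated variance (m = 1, ..., M), the combining rules are

        Qbar = (1/M) * sum(Q_m)
        W    = (1/M) * sum(U_m)
        B    = (1/(M-1)) * sum((Q_m - Qbar)^2)
        T    = W + (1 + 1/M) * B

      where Qbar is the pooled point estimate and T the total variance attached to it.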

      That said, appropriate imputation of clustered, weighted data already sounds a bit scary.

      Best
      Daniel



      • #4


        Roger, I pledged never to respond to another question about multiple imputation, but I'll dip again into these murky waters.


        What about doing a standard MI analysis by hand?

        1) create the M imputation data sets;
        2) subsample clusters from each;
        3) analyze each set;
        4) estimate the average of each parameter;
        5) calculate the between-imputation variance contribution;
        6) combine according to Rubin's rules.

        A rough sketch of these steps is below.
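        Assuming the imputations have already been created under mi set flong, and using placeholder names (y, x1, x2 for the model variables; psu and pwt for the design variables) with M = 20:

        forvalues m = 1/20 {
            preserve
            keep if _mi_m == `m'
            * 2) subsample clusters within this imputation
            bsample, cluster(psu) idcluster(bs_psu)
            * 3) analyze this subsample
            logit y x1 x2 [pweight=pwt], vce(cluster bs_psu)
            matrix b`m' = e(b)
            matrix V`m' = e(V)
            restore
        }
        * 4)-6) average the b`m', compute the between-imputation variance
        * of the coefficients, and combine with the average V`m' by
        * Rubin's rules (the formulas quoted above)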

        Note that it is good practice either to weight the imputation models with the survey weights or, perhaps better, to include the design variables, including the weights, as predictors. See post 20 in this thread for references.
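        With mi impute chained, the latter might look something like this (strata and pwt are hypothetical design variables, entered as complete predictors after the equals sign):

        mi impute chained (regress) x1 (logit) x2 = i.strata pwt, add(20)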
        Steve Samuels
        Statistical Consulting
        [email protected]

        Stata 14.2



        • #5
          Many thanks to Daniel and Steve for these points. I am now thinking of using the whole multiple-imputation dataset as a training set, and then using stratified bsample datasets as a test set in which to estimate Harrell's c statistics.

          The reason that I was not proposing to include the outcome in the imputation loop is that I intend to measure the power of the model to predict the disease outcome (using Harrell's c-index). If the outcome itself were used in imputing the predictors, then I would expect my audience to be skeptical of any claim that the c-index measures genuine predictive power.

          Best wishes

          Roger



          • #6
            PS I think I got my mouth/finger in motion before getting my brain in gear in the previous post. I am now thinking of using the whole multiple-imputation dataset as the test set (for computing Harrell's c-index), and using samples created by bsample as the multiple training sets. There are issues with this method, such as the occasional absence of rare factor levels in one subsample. However, I should be able to find a workaround, using indicator variables generated in the whole imputed dataset, together with the asis option in logistic regression.
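            Something along these lines, say (region being a hypothetical factor with five levels):

            * generate the indicators once, in the whole imputed dataset
            tabulate region, generate(reg_)
            * in each bsample training set, retain the full indicator list
            * even if a level is empty or predicts the outcome perfectly
            logit y reg_2-reg_5 x1 x2 [pweight=pwt], vce(cluster psu) asis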

            Best wishes

            Roger

