Hello Everybody
I have a query regarding multiply-imputed datasets generated using the mi impute command in official Stata.
Suppose I start with a dataset with a cluster variable and/or sampling-probability weights, and I intend to fit a regression model (such as a logit) to predict an outcome from a list of covariates. Suppose also that some of my observations have missing values in some of the predictors, so I start my mi work with the command
mi set flong
and then use the mi impute command to create multiple versions of the data with the predictors filled in using multiple imputation (using chained equations to impute predictors from other predictors and NOT from the outcome). The resulting dataset will then have observations identified uniquely by 2 variables, _mi_m and _mi_id, of which _mi_m is zero in the observations retained from the original dataset (with some missing values in some predictors) and has positive-integer values in the imputation subsets (with no missing values in any predictors).
Is there any reason why I should not then input the observations with _mi_m>0, and fit a regression model using clustered Huber variances with the original cluster variable and the original sampling-probability weight variable? The dataset used in the regression model will then be an expanded version of the original dataset, with each original cluster replaced by a cluster larger by a factor of _dta[_mi_M], containing that number of versions of the original cluster, with the same sampling-probability weights as before, but sometimes with different imputed values for some of the predictors. The multiple versions of each original observation, identified by the positive-integer variable _mi_m, will then have the same values of the cluster variable, the sampling-probability-weight variable, and the outcome variable, but may have different values of the predictor variables. However, if I use clustered and weighted Huber variances, then (as I understand it) I would still be estimating them allowing for the fact that we have been sampling clusters from a population of clusters, instead of sampling observations from a population of observations. And, because I am making inferences about the conditional distribution of the outcome variable given the predictor variables, I do not see why my estimates of the model parameters, and of the sampling variances of the estimates, should be systematically wrong.
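To make the proposal concrete, here is a minimal sketch of the workflow I have in mind. All variable names are hypothetical (outcome y, incomplete predictors x1 and x2, complete predictor x3, cluster variable clusid, sampling-probability weight pwt), and the imputation methods and number of imputations are only placeholders.

```stata
* Hypothetical variables: y (outcome), x1 x2 (incomplete predictors),
* x3 (complete predictor), clusid (cluster), pwt (sampling-probability weight)
mi set flong
mi register imputed x1 x2
mi register regular y x3 clusid pwt

* Chained equations, imputing predictors from other predictors only
* (the outcome y is deliberately left out of the imputation models)
mi impute chained (regress) x1 (logit) x2 = x3, add(5) rseed(12345)

* Fit the model to the stacked imputations only (_mi_m>0), with the
* original weights and clustered Huber variances on the original clusters
logit y x1 x2 x3 [pweight=pwt] if _mi_m>0, vce(cluster clusid)
```

Note that this bypasses mi estimate entirely, so Rubin's combination rules are not applied; whether that is acceptable is exactly what I am asking.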
We are in fact planning to do something a bit more complicated than this, using the methods of Steyerberg, Harrell et al. (2001) to compare the predictive power of alternative logistic models using Harrell's c-index. This will involve using bsample to sample subsamples of clusters. The mi estimate command has a problem with these subsamples, because the variable _mi_id will then not identify uniquely the observations in each imputation subset. However, if there are any reasons why I should not use ordinary clustered Huber variances on such multiply-imputed datasets, then I would like to know them.
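The resampling step might look roughly like the sketch below, again with hypothetical variable names (y, x1, x2, x3, clusid, pwt) and a hypothetical new variable bsclus. It shows a single bootstrap replicate only, and leaves out the c-index calculation itself.

```stata
* One bootstrap replicate on the stacked imputations (hypothetical names)
preserve
keep if _mi_m>0

* Resample whole clusters with replacement; because all imputed versions
* of a cluster share the same clusid, they are drawn together.
* idcluster() creates a new identifier so that repeated draws of the
* same original cluster count as distinct clusters.
bsample, cluster(clusid) idcluster(bsclus)

* _mi_id is no longer unique within _mi_m here, so mi estimate is not
* used; the model is fitted with ordinary clustered Huber variances
logit y x1 x2 x3 [pweight=pwt], vce(cluster bsclus)
restore
```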
Best wishes
Roger