Dear Stata forum members,
I am developing an imputation model in Stata 13.1 to impute missing values for several variables in my dataset. I am looking at the factors associated with recurrence of tuberculosis within our local patient cohort and plan to analyse these in a case-control study. I have collected continuous variables including haemoglobin concentration, age and weight. Between 10% and 20% of the values are missing for each of these variables, and I am planning to impute them. I then plan to use them as predictors of recurrence in a conditional logistic regression model. Several previous studies on the same subject have categorized age (e.g. <40 years, 40-59 years and ≥60 years), weight and haemoglobin and used the categorized variables in their conditional logistic regression models. I have tried two approaches: (1) running the imputation model with age, weight and haemoglobin already converted to categorical variables, and (2) running the imputation model with these variables as continuous and categorizing them after imputation. The two approaches lead to different odds ratios in my conditional logistic regression model. I presume that imputing the variables as continuous and categorizing them afterwards makes for a better imputation model, but I would be grateful for any advice about which approach is better.
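In case it helps to make the second approach concrete, here is a minimal sketch of the kind of code I mean. The variable names (recur, age, weight, hb, setid) are placeholders rather than my actual variables, the number of imputations and the seed are arbitrary, and only the age bands are taken from the previous studies, so please read it as an illustration of the workflow rather than my exact model.

```stata
* Sketch only: recur (case/control status), age, weight, hb and setid
* (matched-set identifier) are placeholder names, not the real variables.
mi set wide
mi register imputed age weight hb

* Impute on the continuous scale, with the outcome in the imputation model.
* add(20) and rseed(12345) are arbitrary choices for illustration.
mi impute chained (regress) age weight hb = recur, add(20) rseed(12345)

* Categorize after imputation: 1 = <40, 2 = 40-59, 3 = >=60 years.
* The if qualifier keeps missing ages from being coded as the top band.
mi passive: generate byte agecat = 1 + (age >= 40) + (age >= 60) if !missing(age)

* Conditional logistic regression within matched sets, pooled across imputations.
mi estimate, or: clogit recur i.agecat weight hb, group(setid)
```

For the first approach, the categorical versions would instead be created before imputation and imputed with a categorical method such as ologit or mlogit in place of regress.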
Many thanks for your time