
  • Categorization of continuous variables pre or post imputation

    Dear Stata forum members,

    I am developing an imputation model, using Stata 13.1, to impute missing data in several variables in my dataset. I am looking at the factors associated with the recurrence of tuberculosis within our local patient cohort and plan to analyse these in a case-control study. I have collected continuous variables including haemoglobin concentration, age and weight. Between 10% and 20% of the values of these variables are missing, and I am planning to impute them. I then plan to use them as predictors of recurrence in a conditional logistic regression. Several previous studies on the same subject have categorized age (e.g. <40 years, 40-59 years and >=60 years), weight and haemoglobin and used these categorical variables in their conditional logistic regression models. I have run the imputation model with age, weight and haemoglobin converted to categorical variables, and I have also run it with these variables kept continuous and then categorized them after imputation. The two approaches lead to different odds ratios in my conditional logistic regression model. I presume that imputing the variables as continuous and categorizing them afterwards makes for a better imputation model, but I would be grateful for any advice about which approach is better.

    Many thanks for your time

  • #2
    This isn't really an answer to your question, but I would say that better still would be to run the imputation model with the continuous variables and then not categorize them at all. Do your modeling with continuous variables.

    When you categorize age in the way described, you are saying that a 39-year-old is radically different from a 40-year-old, but that a 59-year-old and a 40-year-old are, for your purposes, the same. That's clearly nonsensical. Categorizing continuous data only makes sense statistically when the outcome variable you are associating it with exhibits discontinuous behavior at the cutpoint. There is no reason to think that recurrence of tuberculosis works like that.

    Sometimes people will categorize a continuous variable as a simple way to capture non-linearity of the relationship with the outcome. But there are better ways to do that, such as splines, or fractional polynomial models, etc. And even if you are averse to using these other methods because they are difficult to explain to readers of non-technical journals, and you think they will only find a categorization understandable, then you should use many categories and make them narrow. So, perhaps 2-year age groups, 5 years at most!

    The only justification I can think of for you to use the <40, 40-59, >=60 age categorization is for the purpose of comparing your results to the previous papers. But just because previous authors did it wrong, that is no reason for you to perpetuate their mistakes. You can do it both ways and discuss the comparison, but use results based on a good model with continuous age as your primary analysis and feature it in your results section.
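
    As a rough illustration of that workflow, here is a minimal Stata sketch. It is not the original poster's code: the data are simulated purely so that the commands run, and the variable names (setid, case, hb), the 1:2 matched-set layout, the number of imputations, the seeds, and the spline knots at 40 and 60 are all assumptions made for the example. The imputation model is also deliberately simple (it conditions on case status only), so treat this as a starting point rather than a recommended specification.

    ```stata
    * Simulated stand-in data: 82 matched sets, each with 1 case and 2 controls
    clear
    set seed 20150101
    set obs 246
    generate setid = ceil(_n/3)
    bysort setid: generate byte case = (_n == 1)
    generate age    = rnormal(45, 15)
    generate weight = rnormal(55, 10)
    generate hb     = rnormal(12, 2)

    * Set roughly 15% of each predictor to missing, completely at random
    foreach v of varlist age weight hb {
        replace `v' = . if runiform() < 0.15
    }

    * Impute on the continuous scale, with the outcome in the imputation model
    mi set wide
    mi register imputed age weight hb
    mi register regular case setid
    mi impute chained (regress) age weight hb = i.case, add(20) rseed(12345)

    * Primary analysis: predictors kept continuous
    mi estimate, or: clogit case age weight hb, group(setid)

    * Allowing non-linearity in age: hand-built linear spline, knots at 40 and 60
    * (cond() keeps the terms missing wherever age is missing in the original data)
    mi passive: generate age_k40 = cond(missing(age), ., max(age - 40, 0))
    mi passive: generate age_k60 = cond(missing(age), ., max(age - 60, 0))
    mi estimate: clogit case age age_k40 age_k60 weight hb, group(setid)

    * Comparison with earlier papers: categories derived after imputation
    * (0 = under 40, 1 = 40 to 59, 2 = 60 and over)
    mi passive: generate byte agecat = cond(missing(age), ., (age >= 40) + (age >= 60))
    mi estimate, or: clogit case i.agecat weight hb, group(setid)
    ```

    Using mi passive to derive the spline terms and the categories means they are recomputed within each imputed dataset, so the continuous and categorized analyses rest on the same imputations and can be compared directly. The built-in mkspline command can construct spline terms as well; they are written out by hand here only to keep each step visible.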

    Good luck.



    • #3
      Many thanks for the advice, you did answer my question! My data set is fairly small, as I am using patients notified in one tertiary referral hospital rather than national-level data. I have only 82 cases and 164 controls, so unfortunately splitting age into 5-year groups, although ideal, isn't feasible. My original rationale for categorising age was that this has been done in previous case-control studies and I wished to compare my findings with the previously published work. Categories are also easier for a clinical audience to make sense of, although, as you say, separating a patient of 39 and a patient of 40 into different categories doesn't make logical sense. I'll run the imputation model with the continuous variables and analyse the data using both approaches. Once again, many thanks for the advice.

