  • Change string to integer?

    If an "age" variable contains the value "99 years or more", how would you recode it in a way that is still informative, and how would you program that?
    The rest of the values in the age variable are integers, but I want to include the "99 years or more" observations in the model.

  • #2
    There is no ideal solution to this problem. Fortunately, if the data are "topcoded" (censored) way up at 99 years, few observations will be affected, and when age is an explanatory variable or covariate you usually won't go far wrong by just treating these cases as if they were age 99. (Although, if this is a study of nursing home residents, it could be a more serious limitation.) Or you could do something more sophisticated, like obtaining full age-distribution data for the population your data come from and imputing a specific value >= 99 by randomly sampling from the upper tail of that distribution. Better still, do it separately for men and women, since in most populations their distributions are rather different, especially in the upper tail. A compromise would be to find, from the full age distribution, the mean or median age among those aged 99 and over and impute that in each case. This is less work, but has less desirable statistical properties.
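
    A minimal sketch of the simplest option above (treating the topcoded cases as age 99), assuming the raw variable is a string. The names age_str, age, and topcoded are hypothetical, and the four toy observations only stand in for real data:

    ```stata
    * Toy data (assumption): a string variable mixing integer ages and the top code.
    clear
    set obs 4
    generate str20 age_str = cond(_n == 3, "99 years or more", string(40 + 10*_n))

    * Flag the topcoded observations so the recode stays documented.
    generate byte topcoded = (age_str == "99 years or more")

    * Convert to numeric: exact ages via real(), topcoded cases set to 99.
    generate int age = real(age_str) if !topcoded
    replace age = 99 if topcoded

    list age_str age topcoded
    ```

    Keeping the topcoded indicator alongside the numeric age makes it easy to check later how sensitive a model is to these observations, for example by refitting without them.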

    Alternatively, if you are using age as an outcome variable in a statistical model (say, for example it is age at death and you are analyzing lifespans) then you can create 2 variables. For the people whose exact age is given, the two variables (call them lower_bound and upper_bound) are both set to the exact age. For the people who are "99 years or more," what we know is that their age is somewhere between 99 and infinity (or, perhaps more realistically, between 99 and 120), and you code lower_bound = 99 and upper_bound = . or 120, depending on how you want to treat that. This kind of pair of variables would be suitable to use as an outcome variable in models such as -intreg-, -stintreg-, -stintcox-. (Actually for survival analysis, you could just use plain old -streg- or -stcox- since the censoring is all right censoring. If you are not already familiar with it, read -help stset- to learn how to organize right-censored data for these commands.)
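
    The lower_bound/upper_bound coding described above could look roughly like the following. This is only a sketch on simulated data: the covariate x, the data-generating lines, and the indicator names topcoded and exact are all assumptions for illustration.

    ```stata
    * Simulated data (assumption): age depends on a covariate x and is topcoded at 99.
    clear
    set seed 12345
    set obs 200
    generate x = rnormal()
    generate age = round(80 + 8*x + rnormal(0, 10))
    generate byte topcoded = (age >= 99)
    replace age = 99 if topcoded

    * Exact ages: both bounds equal the age.
    * Topcoded ages: lower bound 99, upper bound missing (right-censored in -intreg-).
    generate lower_bound = age
    generate upper_bound = age
    replace upper_bound = . if topcoded

    intreg lower_bound upper_bound x

    * Survival-analysis version: the topcoded cases are right-censored at 99.
    generate byte exact = !topcoded
    stset age, failure(exact)
    stcox x
    ```

    Setting upper_bound to 120 instead of missing would tell -intreg- the age is interval-censored between 99 and 120 rather than right-censored at 99.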


    • #3
      Hi Ape, if your plan is to use the -regress- command, or another model that does not adjust for censored or truncated variables, I would pick a rule by which to recode this variable. It could be anything: say, assigning an age of 105 to each such observation, or using Stata's built-in random-number functions to generate a synthetic age for these observations, perhaps -runiform(99,120)-. Ultimately this is very unlikely to bias your results, given that there were recently an estimated 97,914 persons in this age category, in the U.S. at least. As long as you document how these observations are handled in your estimation procedure, whoever is reviewing will hopefully be all right with your methods.
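
      As a rough illustration of the two recoding rules just mentioned, assuming a numeric age variable with 99 as the top code and a hypothetical indicator topcoded (the toy data and the variable names age_fixed and age_random are made up for the example):

      ```stata
      * Toy data (assumption): five ages, the last one topcoded at 99.
      clear
      set obs 5
      generate age = cond(_n == 5, 99, 60 + 5*_n)
      generate byte topcoded = (_n == 5)

      * Rule 1: assign one fixed value (105 here) to every topcoded observation.
      generate age_fixed = cond(topcoded, 105, age)

      * Rule 2: draw a synthetic integer age between 99 and 120.
      * Set the seed first so the imputation is reproducible.
      set seed 2022
      generate age_random = age
      replace age_random = runiformint(99, 120) if topcoded

      list age topcoded age_fixed age_random
      ```

      runiformint() returns integers and, like the two-argument runiform(a,b), needs Stata 14 or newer. Using runiform(99, 120) instead would give continuous ages. A uniform draw is a convenience, not a realistic age distribution above 99.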

      Or are you asking how to program this change? Is "99 years or more" stored as a separate indicator variable, or is it included alongside the other ages in a single variable, which is then likely stored as a string?
      Last edited by Eric Makela; 04 Sep 2022, 16:49.
