excluding variables in summary stats

man singh

Join Date: May 2023

Posts: 11
#1


16 Dec 2023, 08:10

Hi, I have a dataset in which one of the variables looks like this:
>>AGE
>> 25
>> 30
>> above 45
>> 27
>> 32
>> 36
>> 28

How can I generate summary stats for this considering that one value says "above 45"?
What is a way to exclude the above 45 option from the analysis?
Nick Cox

Join Date: Mar 2014

Posts: 35696
#2

16 Dec 2023, 09:37

Please use dataex to indicate exactly how such data are being stored.
Clyde Schechter

Join Date: Apr 2014

Posts: 30097
#3

16 Dec 2023, 09:57

Problems arising from bad data rarely have good solutions. There are a few things you can do, all of them bad, and your job is to pick the least bad among them.

First, you can't do any kind of statistics with this data until you convert it from string to numeric.

Code:

gen age = real(AGE)

will do this much, and it will leave a missing value for an entry like "above 45" that is not a number. If you then -summ age- you will get summary statistics that exclude these non-numeric responses. However, it is likely that these excluded values are systematically different from the valid responses, so you are probably getting a biased, possibly a very strongly biased, summary in this way.
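Before relying on that summary, it may help to see how many entries were dropped and what they look like. A minimal sketch, assuming age was created with the -gen- command above and that empty strings in AGE are genuine missing responses:

```stata
* Assumes age was created with: gen age = real(AGE)
* Count and list the entries that were non-empty but not convertible
count if missing(age) & !missing(AGE)
list AGE if missing(age) & !missing(AGE)

* Summary statistics over the numeric responses only
summarize age, detail
```

If the count is large relative to the sample, the bias concern becomes correspondingly more serious.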

Another possibility is to decide to treat "above 45" as 45. Or you might choose some higher value that you think might reasonably represent the typical age of a person who is characterized in your data as "above 45." To treat it as 45, you would, before converting AGE to numeric, -replace AGE = subinstr(AGE, "above", "", .)-; to use a higher value, you would replace the whole entry with that value instead. The validity of this approach depends on the unknowable validity of your guess about what age "above 45" might represent.
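As a rough sketch of this substitution approach: the variable name age_sub is hypothetical, and the value 50 in the commented-out line is purely illustrative, not something the data support.

```stata
* Sketch only: age_sub is a hypothetical variable name
* Strip the word "above" and any surrounding blanks, then convert,
* so that "above 45" becomes 45
gen age_sub = real(strtrim(subinstr(AGE, "above", "", .)))

* Optional: assign an assumed typical age instead of 45.
* The value 50 is an illustrative guess, not taken from the data.
* replace age_sub = 50 if strpos(AGE, "above") > 0

summarize age_sub
```

Writing the result to a new variable rather than overwriting AGE keeps the original responses available for checking.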

Another possibility is to recognize this as right-censored data. For this you could do something like:

Code:

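* numeric ages; "above 45" becomes missing for now
gen age = real(AGE)
* censoring point, defined only for the "above" entries
gen censor_limit = .
replace censor_limit = real(strtrim(subinstr(AGE, "above", "", .))) if strpos(AGE, "above")
* record the censored entries at their censoring point
replace age = censor_limit if strpos(AGE, "above")
* right-censoring is specified with ul(), the upper censoring limit
tobit age, ul(censor_limit)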

-tobit- is a regression model for censored data. With no predictors, the constant term in the output is an estimate of the mean age that takes the censoring into account. Understand, of course, that this rests on the assumption that the underlying ages, including the censored ones, are drawn from a normal distribution. Such a strong parametric assumption may or may not be appropriate.
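The same censoring idea can also be expressed with -intreg-, which takes a lower and an upper bound for each observation. This is offered as an alternative sketch, not as a better method: it makes the same normality assumption, and the variable names age_lo and age_hi are hypothetical.

```stata
* Sketch only: age_lo and age_hi are hypothetical variable names
* Lower bound: the reported number, with "above" stripped where present
gen age_lo = real(strtrim(subinstr(AGE, "above", "", .)))

* Upper bound: equal to the lower bound for exactly reported ages
gen age_hi = age_lo

* Right-censored entries: lower bound known, upper bound unknown (missing)
replace age_hi = . if strpos(AGE, "above") > 0

* Constant-only interval regression; _cons estimates the mean age
intreg age_lo age_hi
```

Here an observation with equal bounds is treated as exactly observed, and one with a missing upper bound as right-censored at its lower bound.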

None of these approaches is truly satisfactory. There are likely other ways of handling this as well, but the basic limitations of this data cannot be fully overcome regardless.