I am working with the 2015-2018 National Health Interview Survey (NHIS). For a sub-analysis, I am looking at 3 groups of individuals: 1 year or less since diagnosis, 5 years or less since diagnosis, and more than 5 years since diagnosis. The dataset includes both age at time of survey and age at diagnosis. However, ages 85 and above are top-coded as 85, so I am limiting my sample to ages 18-84 (both at time of survey and at time of diagnosis). Respondents can name up to 3 cancer diagnoses. They are first asked whether they have ever been diagnosed with cancer; if they say yes, they are asked about specific types of cancer (yes/no) and, for each type endorsed, the age at diagnosis.
This is the coding structure I was using to generate years since diagnosis. Here ncolonage is a new variable that keeps only those diagnosed between ages 18 and 84, and colonagediff is a new variable that captures the difference between age at survey and age at diagnosis. nage is age at survey, which earlier code has already limited to ages 18-84. However, there are discrepancies when I run tabulations. For example, there are fewer non-missing values of colonagediff than there are individuals who endorsed colon cancer at time of survey (nage). I believe this may be because respondents can endorse up to 3 different cancers. How can I accurately capture years since diagnosis to limit my sample and account for the fact that some participants may have endorsed up to 3 different cancers? What coding structure would work? (There are 30 different types of cancer that can be endorsed, each in a different column of the dataset.)
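* Age at colon cancer diagnosis, kept only when it falls in 18-84
gen ncolonage = colonage if inrange(colonage, 18, 84)
* Years between diagnosis and survey; missing whenever ncolonage is missing
gen colonagediff = nage - ncolonage
tab nage ncolonca
tab colonagediff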
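One possible coding structure for all cancer types, as a minimal sketch rather than a definitive solution. It assumes every type has an age-at-diagnosis variable named like colonage; the stems breast and lung are hypothetical placeholders for the other columns, and the names yrs_recent, yrs_first, n_dx and dxgroup are made up for illustration. It also assumes the three groups should be mutually exclusive and based on the most recent diagnosis, which is one choice among several. Note that a difference such as colonagediff is missing whenever the diagnosis age is missing or outside 18-84 (including any refused or don't-know codes, if those are stored as large numbers), which may be why it has fewer values than there are endorsements.

```stata
* Hypothetical list of cancer stems; only colon appears in the post above.
local cancers colon breast lung
local diffs

foreach c of local cancers {
    * Keep age at diagnosis only when it falls in 18-84; all else becomes missing.
    gen n`c'age = `c'age if inrange(`c'age, 18, 84)
    * Years since this diagnosis; missing when either age is missing.
    gen `c'agediff = nage - n`c'age
    * Treat impossible negative gaps (diagnosis age above survey age) as missing.
    replace `c'agediff = . if `c'agediff < 0
    * Collect the new variable names for the row-wise summaries below.
    local diffs `diffs' `c'agediff
}

* Years since the most recent and the earliest diagnosis across all types.
* rowmin() and rowmax() ignore missing values, so one to three cancers are handled alike.
egen yrs_recent = rowmin(`diffs')
egen yrs_first  = rowmax(`diffs')

* Number of usable diagnosis ages per respondent, to compare with endorsement counts.
egen n_dx = rownonmiss(`diffs')

* Mutually exclusive groups based on the most recent diagnosis (an assumption).
gen byte dxgroup = .
replace dxgroup = 1 if yrs_recent <= 1
replace dxgroup = 2 if yrs_recent > 1 & yrs_recent <= 5
replace dxgroup = 3 if yrs_recent > 5 & !missing(yrs_recent)

tab dxgroup n_dx, missing
```

Cross-tabulating dxgroup against n_dx with the missing option shows how many respondents have no usable diagnosis age at all, which is where the counts would diverge from the yes/no endorsements. Swapping yrs_recent for yrs_first in the group definitions would classify respondents by their earliest diagnosis instead.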
