
  • NHIS dataset, years since cancer dx

    I am working with the 2015-2018 National Health Interview Survey (NHIS). For a sub-analysis, I am looking at 3 groups of individuals: 1 year or less since diagnosis, 5 years or less since diagnosis, and more than 5 years since diagnosis. The dataset includes both age at the time of the survey and age at diagnosis. However, ages 85 and above are top-coded as 85, so I am limiting my sample to those aged 18-84 (both at the time of the survey and at the time of diagnosis). Respondents can name up to 3 cancer diagnoses. They are first asked whether they have ever been diagnosed with cancer; if they say yes, they are asked about specific types of cancer (yes/no) and, for each type endorsed, their age at diagnosis.

    This is the code I was using to generate years since diagnosis. For example, ncolonage is a new variable that keeps age at colon cancer diagnosis only for those diagnosed between ages 18 and 84, and colonagediff is a new variable that captures the difference between age at survey and age at diagnosis. nage is age at survey, which earlier code has already limited to ages 18-84. However, there are discrepancies when I tabulate. For example, there are fewer colonagediff values than individuals who endorsed colon cancer at the time of the survey. I believe this may be because respondents can endorse up to 3 different cancers. How can I accurately capture years since diagnosis to limit my sample and account for the fact that some participants may have endorsed up to 3 different cancers? What coding structure should I use? (There are 30 different types of cancer that can be endorsed, each with its own column in the dataset.)

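    * Age at colon cancer diagnosis, kept only if diagnosed at ages 18-84
    gen ncolonage = .
    replace ncolonage = colonage if inrange(colonage, 18, 84)
    * Years since diagnosis (nage = age at survey, already limited to 18-84)
    gen colonagediff = nage - ncolonage
    tab nage ncolonca
    tab colonagediff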


  • #2
    You have made a valiant attempt to describe your data. But even the most heroic attempt cannot substitute for actually showing some example data. That is why the Forum FAQ, which all Forum members are asked to read before posting, advises using the -dataex- command to show examples when posting here. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

    Now, it sounds like the number of variables in your full data set is way larger than -dataex- can accommodate. So before using it, I suggest you pare it down to the variables that are directly relevant to the immediate problem. What would be particularly helpful is if you can select from your data a set of observations that produces the problem you are encountering with the code you show in #1. That will make it easier to troubleshoot. It probably would also help if you illustrate what you expect your results to look like.
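
    As a rough sketch of what that paring-down might look like: the variable names below are taken from #1, but the assumption that ncolonca == 1 marks respondents who endorsed colon cancer is mine, so check it against your own coding. The idea is to keep only the relevant variables and then show a handful of observations where the age difference is missing despite an endorsed diagnosis, alongside a handful where it is not.

```stata
* Sketch only: variable names come from #1. The coding ncolonca == 1
* (endorsed colon cancer) is an assumption; adjust it to your data.
preserve
keep nage colonage ncolonage ncolonca colonagediff

* Respondents who endorsed colon cancer but have no computed age difference
dataex if ncolonca == 1 & missing(colonagediff), count(20)

* A few respondents where the calculation worked, for comparison
dataex if ncolonca == 1 & !missing(colonagediff), count(20)
restore
```

    The -preserve- / -restore- pair keeps the full data set intact, and -count(20)- simply keeps the example short. Paste the -dataex- output into your reply between code delimiters.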
