  • Identifying and Numbering Unique Observations -- Going from Long to Wide

    Hi STATALIST,

    I am having a bit of an issue with trying to turn Cancer Registry data from long to wide.

    I have the following variables:
    ID (id)
    Sex (sex)
    Age at Enrollment (age_E)
    Age at Diagnosis (age_Dx)
    Time from Enrollment to Diagnosis (time)
    Cancer Site (site)

    I want to convert the data set to wide using the variable ID, but none of the other variables can be used to uniquely add a suffix to the reshaped variables -- this is because some individuals in the data set were diagnosed with multiple primary cancers simultaneously (thus they have the same data across the board).

    -duplicates tag- doesn't help, as it only tags the duplicates but does not provide a unique value.

    Here is how the code looks:

    reshape wide sex age_E age_Dx time site , i(id) j()

    I need to create a variable for j(). Does someone have an idea how to create a variable so that every time a duplicate ID is encountered, it adds 1 (+1) to some variable? Ideally the data would look like the table below...
    ID    Sex  age_E  age_dx  time  site  cancer #
    1234  1    50     65      5475  lung  1
    1234  1    50     65      5475  lung  2

    Thanks in advance for your help.

  • #2
    Code:
    * number each patient's records 1, 2, ... (variable names as listed in #1)
    bysort id: generate j_var = _n
    reshape wide sex age_E age_Dx time site, i(id) j(j_var)
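
    As a minimal sketch of what these two commands do, the example below types in the two rows from the table in #1, using the variable names listed there (id, sex, age_E, age_Dx, time, site). The third patient (id 5678) and all of that patient's values are made up, added only to show what happens to someone with a single record.
    Code:
    clear
    * two rows from the table in #1, plus one hypothetical single-record patient
    input id sex age_E age_Dx time str10 site
    1234 1 50 65 5475 "lung"
    1234 1 50 65 5475 "lung"
    5678 2 61 70 3285 "colon"
    end
    * sequence number within each id: 1, 2 for id 1234 and 1 for id 5678
    bysort id: generate j_var = _n
    * one row per id; each listed variable gets the j_var value as a suffix
    reshape wide sex age_E age_Dx time site, i(id) j(j_var)
    list
    After the reshape there is one row per id, with variables such as sex1, site1, sex2 and site2. The patient with only one record has missing values in every variable carrying the suffix 2.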

    • #3
      Thank you so much! This works. Would you mind briefly explaining what that code is actually doing, step by step? You sort by ID, then generate a variable called j_var, but what exactly does the expression _n mean to Stata?

      • #4
        Generally, _n gives the observation number within the dataset. In conjunction with the by prefix, _n gives the observation number within each by-group: that is, it starts over from 1 each time ID changes. I would perhaps have chosen to write
        Code:
        bysort id (time): generate j_var = _n
        so that if a patient has diagnoses at multiple times, the reshaped variables will be ordered by time.
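
        To make the difference between plain _n and _n under by concrete, here is a small sketch on a toy dataset. The ids and times are invented for illustration and are not from the registry data.
        Code:
        clear
        * hypothetical data: id 1 has two records, entered out of time order
        input id time
        1 300
        1 100
        2 200
        end
        * without by: position in the whole dataset, so obs is 1, 2, 3
        generate obs = _n
        * with by: sorts on id and then time, and restarts the count within each id
        bysort id (time): generate j_var = _n
        list, sepby(id)
        After the bysort line, id 1 has j_var equal to 1 for time 100 and 2 for time 300, and id 2 has j_var equal to 1. The obs variable still shows the order in which the rows were typed in. If two records share the same id and time, as with simultaneous diagnoses, their order within the tie is arbitrary.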

        With that said, experienced users here generally agree that, with few exceptions, Stata makes it much more straightforward to accomplish complex analyses using a long layout of your data rather than a wide layout of the same data.
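
        As one illustration of working in the long layout, the sketch below counts diagnoses per patient and tabulates cancer sites without reshaping at all. The variable names n_dx and first_dx are my own, and the patient with id 5678 is hypothetical.
        Code:
        clear
        * two records from #1 (subset of variables), plus one made-up patient
        input id time str10 site
        1234 5475 "lung"
        1234 5475 "lung"
        5678 3285 "colon"
        end
        * _N under by is the number of records in the group: diagnoses per patient
        bysort id (time): generate n_dx = _N
        * flag exactly one record per patient, handy for patient-level counts
        bysort id (time): generate byte first_dx = _n == 1
        * frequency of each cancer site across all diagnoses
        tabulate site
        list, sepby(id)
        In wide layout the same questions would require looping over site1, site2, and so on, which is part of why the long layout tends to be easier to work with.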
