Identifying and Numbering Unique Observations -- Going from Long to Wide

Matt Warkentin

Join Date: May 2016

Posts: 104
#1

09 Aug 2016, 15:22

Hi STATALIST,

I am having a bit of an issue with trying to turn Cancer Registry data from long to wide.

I have the following variables:
ID (id)
Sex (sex)
Age at Enrollment (age_E)
Age at Diagnosis (age_Dx)
Time from Enrollment to Diagnosis (time)
Cancer Site (site)

I want to convert the data set to wide using the variable ID, but none of the other variables can be used to uniquely add a suffix to the reshaped variables. This is because some individuals in the data set were diagnosed with multiple primary cancers simultaneously (thus they have the same data across the board).

-duplicates tag- doesn't help, as it only tags the duplicates but does not provide a unique value.

Here is how the code looks:

reshape wide sex age_E age_Dx time site , i(id) j()

I need to create a variable for j(). Does someone have an idea how to create a variable so that every time a duplicate ID is encountered it adds 1 (+1) to some variable? Ideally the data would look like the table below...

ID    Sex  age_E  age_dx  time  site  cancer #
1234  1    50     65      5475  lung  1
1234  1    50     65      5475  lung  2

Thanks in advance for your help.

Oded Mcdossi

09 Aug 2016, 15:35

Code:

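* Number the records within each id (variable names are case-sensitive in Stata)
bysort id: generate j_var = _n
* Use j_var as the suffix for the reshaped variables
reshape wide sex age_E age_Dx time site, i(id) j(j_var)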


Matt Warkentin

Join Date: May 2016

Posts: 104
#3

09 Aug 2016, 15:40

Thank you so much! This works. Would you mind providing a brief explanation of what that code is actually doing logically? You sort by ID, then generate a variable called j_var, but what exactly does the expression _n mean to Stata?
William Lisowski

Join Date: Dec 2014

Posts: 10150
#4

09 Aug 2016, 20:36

Generally, _n gives the observation number within the dataset. In conjunction with the by prefix, _n gives the observation number within each by-group: that is, it starts over from 1 each time ID changes. I would perhaps have chosen to write

Code:

bysort id (time): generate j_var = _n

so that if a patient has diagnoses at multiple times, the reshaped variables will be ordered by time.
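To make the behaviour of _n under by concrete, here is a minimal sketch on a toy dataset. The variable names follow the original post, but the values are invented purely for illustration, and the site names are assumptions.

```stata
* Toy data: names follow the original post; values are made up for illustration
clear
input id sex age_E age_Dx time str10 site
1234 1 50 65 5475 "lung"
1234 1 50 65 5475 "colon"
5678 2 61 63  730 "breast"
end

* _n restarts at 1 within each id; (time) only controls the order within id
bysort id (time): generate j_var = _n
list, sepby(id)

* sex and age_E are constant within id, so they can stay out of the reshape list
reshape wide age_Dx time site, i(id) j(j_var)
list
```

Note that when two records for the same id share the same time, their relative order is not determined by (time) alone. Adding another variable inside the parentheses, for example (time site), would make the numbering reproducible.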

With that said, experienced users here generally agree that, with few exceptions, Stata makes it much more straightforward to accomplish complex analyses using a long layout of your data rather than a wide layout of the same data.
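As one illustration of working in the long layout, the sketch below counts diagnoses per person without reshaping at all. The data values and the new variable names n_primaries and first_dx are hypothetical, chosen only for this example.

```stata
* Toy long-format data; values are invented for illustration
clear
input id time str10 site
1234 5475 "lung"
1234 5475 "colon"
5678  730 "breast"
end

* _N is the number of observations in the current by-group
bysort id (time): generate n_primaries = _N
* Flag one record per person so that each individual is counted once
bysort id (time): generate first_dx = (_n == 1)
tabulate n_primaries if first_dx
```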