Create new rows per id if observations are found in multiple variables

Hanna Larsson

Join Date: Jul 2023

Posts: 11
#1


04 Oct 2023, 06:32

Hi!
I would like to create a distinct row for each observation in the diagnosis-variables (= one diagnosis) per id and date.

My data looks like this:

id  date        diagnosis1  diagnosis2  diagnosis3
1   2023-01-01  A           .           .
1   2023-01-02  B           .           .
2   2023-02-03  F           G           .
2   2023-03-03  A           F           C
2   2023-03-04  A           .           .

(The real dataset contains 21 diagnosis variables and more than 1,000 unique ids.)

I want it to turn out like this:

id  date        newvar
1   2023-01-01  A
1   2023-01-02  B
2   2023-02-03  F
2   2023-02-03  G
2   2023-03-03  A
2   2023-03-03  F
2   2023-03-03  C
2   2023-03-04  A

I would really appreciate your help.

Thank you
Girish Venkataraman

Join Date: Dec 2021

Posts: 281
#2

04 Oct 2023, 07:16

It took me a few minutes to massage your data into a readable form before I could run anything. Please see -dataex- and the forum rules before copying and pasting raw data. On a different note, I am not sure what diseases these are, but depending on the context, I wonder whether you need the earliest instance of each diagnosis by patient.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input long id str14 date str2(diagnosis1 diagnosis2 diagnosis3)
1 "2023-01-01" "A" "." "."
1 "2023-01-02" "B" "." "."
2 "2023-02-03" "F" "G" "."
2 "2023-03-03" "A" "F" "C"
2 "2023-03-04" "A" "." "."
end

This gets you what you want:

Code:

reshape long diagnosis@, i(id date) string
drop _j
drop if diagnosis == "."
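
On the side question about the earliest instance of each diagnosis by patient: a minimal sketch, assuming the data are already in long form with one diagnosis per row and that date is a yyyy-mm-dd string. The variable name ddate and the small example dataset are made up for illustration.

```stata
clear
input long id str10 date str2 diagnosis
2 "2023-02-03" "F"
2 "2023-03-03" "A"
2 "2023-03-03" "F"
2 "2023-03-04" "A"
end

* Convert the yyyy-mm-dd string to a Stata daily date (ddate is a hypothetical name)
generate ddate = daily(date, "YMD")
format ddate %td

* Within each patient and diagnosis, sort by date and keep the first record
bysort id diagnosis (ddate): keep if _n == 1
list id date diagnosis, sepby(id)
```

Sorting on a numeric daily date rather than the string is a design choice that also works when the date strings are not in a sortable format.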
Nick Cox

Join Date: Mar 2014

Posts: 35775
#3

04 Oct 2023, 08:05

It does not affect Girish Venkataraman's strategy, which is bang on target, but as a detail note that Stata does not require or give any special interpretation to the string "." in string values. To Stata, empty strings "" and missing string values are one and the same.
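
To illustrate the point about "." in string variables, here is a small sketch, assuming the diagnosis variables are strings in which a literal "." was typed as a placeholder. The two-row dataset is made up for illustration.

```stata
clear
input long id str10 date str2(diagnosis1 diagnosis2 diagnosis3)
1 "2023-01-01" "A" "." "."
2 "2023-02-03" "F" "G" "."
end

* Turn the literal "." placeholders into empty (missing) strings
foreach v of varlist diagnosis* {
    replace `v' = "" if `v' == "."
}

* missing() is true for empty strings, so this counts the empty cells
count if missing(diagnosis2)
```

Once the placeholders are empty strings, a later drop if missing(diagnosis) does the same job as testing for ".".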

While I am adding nuance, I want to underline that using dataex is a request, not a rule. I am as energetic as anyone else in reminding people of the requests we make when paying attention would turn a difficult or impossible question into an easier question, but our wording is different.
Hanna Larsson

Join Date: Jul 2023

Posts: 11
#4

06 Oct 2023, 02:53

Girish Venkataraman Thank you so much for your help, and sorry for the inconvenience with my example data. As you have probably gathered, I am new to Stata; I promise to try to use the correct format next time.

Unfortunately the code doesn't work. I get this error message

"variable id does not uniquely identify the observations
Your data are currently wide. You are performing a reshape long. You specified I(id date) and j(_j). In the current wide
form, variable id date should uniquely identify the observations."

This is true, as some individuals received multiple different diagnoses on the same date. To identify an observation uniquely, id and date are not enough; diagnosis needs to be included as well. Is that possible?

Thank you!
Hanna Larsson

Join Date: Jul 2023

Posts: 11
#5

06 Oct 2023, 02:55

Thank you, Nick Cox, for the clarification! And thank you for all your answers in this forum; they have helped me a lot as a newcomer to Stata.
Girish Venkataraman

Join Date: Dec 2021

Posts: 281
#6

06 Oct 2023, 05:12

Hmm...I am guessing that you have two rows with the same date in the same patient further down in your original data. Is this the case? The scope of my above code is limited to one patient having dates that are unique within that patient. It is hard to go further without a sample of the original data via -dataex- or your end goal with reshape.
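
One way to check for repeated id and date combinations without sharing the data is sketched below. The example dataset and the variable name dup are made up for illustration; the second and third rows deliberately share an id and date.

```stata
clear
input long id str10 date str2(diagnosis1 diagnosis2 diagnosis3)
1 "2023-01-01" "A" ""  ""
2 "2023-02-03" "F" "G" ""
2 "2023-02-03" "H" ""  ""
end

* Summarize how many observations share the same id and date
duplicates report id date

* Flag them: dup is 0 for unique id-date pairs and positive otherwise
duplicates tag id date, generate(dup)
list id date diagnosis* if dup > 0, sepby(id date)
```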
Hanna Larsson

Join Date: Jul 2023

Posts: 11
#7

06 Oct 2023, 06:10

Girish Venkataraman Yes, that is the case! Some patients have received multiple different diagnoses on the same date, but they are registered on different rows (probably due to different physicians at the same healthcare facility, but that information is unfortunately not included in my data). Maybe I can create a variable that flags when that is the case and include that variable in the reshape of the data?

I'm unfortunately not allowed to share a sample of the original data (even if I remodel it) due to very strict rules at my workplace.
Hanna Larsson

Join Date: Jul 2023

Posts: 11
#8

06 Oct 2023, 06:42

Girish Venkataraman I tried what I suggested above and it worked. Once again, thank you. I wish you a happy weekend.
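
The thread does not show the code that was actually used. One possible reading of the idea, as a sketch only: number the rows within each id and date, then include that counter in i() so the reshape has a unique identifier. The names visit and slot and the example data are hypothetical.

```stata
clear
input long id str10 date str2(diagnosis1 diagnosis2 diagnosis3)
1 "2023-01-01" "A" ""  ""
2 "2023-02-03" "F" "G" ""
2 "2023-02-03" "H" ""  ""
end

* Number the rows within each id-date pair so that (id date visit) is unique
bysort id date: generate visit = _n

* Stack diagnosis1-diagnosis3 into one variable; slot records the source column
reshape long diagnosis, i(id date visit) j(slot)

* Remove the empty cells and the helper variables
drop if missing(diagnosis)
drop visit slot
list, sepby(id)
```

The order of rows within an id and date is arbitrary here, which should not matter as long as visit is only used to make the identifier unique.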