How to collapse observations and overwrite variables

Jason Qian

Join Date: Oct 2018

Posts: 4
#1

How to collapse observations and overwrite variables

22 Jan 2019, 14:59

Hi everyone,
I have a data set where there are 4 diseases, but the presence of each is entered as a separate observation for every patient ID. I would like to consolidate each patient ID into a single observation with all diseases present. For example:

Starting data set:
ID A B C D

12345 1 0 0 0

12345 0 1 0 0

98765 0 0 0 1

98765 0 1 0 0

98765 0 0 1 0

Desired result:
ID A B C D

12345 1 1 0 0

98765 0 1 1 1

Is there a way to achieve this result? Thanks in advance for your help!

Jason
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30174
#2

22 Jan 2019, 15:17

Code:

collapse (max) A B C D, by(ID)
1 like
Comment

Joro Kolev

Join Date: Aug 2018
Posts: 3050

22 Jan 2019, 15:19

Code:

. collapse (sum) a b c d, by(id)

. list, clean

          id   a   b   c   d  
  1.   12345   1   1   0   0  
  2.   98765   0   1   1   1

Comment

Jason Qian

Join Date: Oct 2018

Posts: 4
#4

22 Jan 2019, 15:31

Thank you both!
Comment
Olivia Helen

Join Date: Jan 2019

Posts: 7
#5

22 Jan 2019, 16:37

Hi Joro, Clyde and Jason,

I have a very similar question about wanting to collapse data (please let me know if the etiquette is to start a new thread or it is appropriate to post as a reply to a recent similar topic).

My data differs from Jason's as seen below. It is survey data where some participants have had more than one attempt at completing the survey. Each participant is uniquely identified by 'ID', and each attempt is uniquely identified by 'id'.

I want to retain their most complete attempt (some participants completed but some did not). Participants who completed have a string date stored in variable 'submitdate'.

Sample data:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input float(ID id) str19 var3 float(V1a V1b V2a V2b V3a V3b) 16 46 "" . . . . . . 19 6 "2018-06-27 13:55:37" 4 4 4 2 6 3 20 2 "" 4 2 5 . . . 20 12 "2018-06-12 17:28:57" 4 3 5 3 5 2 30 3 "" 3 3 5 2 . . 30 18 "" 3 2 4 2 . . 30 23 "" 4 3 6 3 5 3 30 31 "" 3 2 5 . . . 35 8 "" 3 3 6 3 5 2 45 39 "2018-06-25 14:12:32" 3 2 5 2 5 3 45 32 "" 4 3 4 3 . . end

In this example data, I want to keep the most complete response for each ID, even if the response differs between attempts.

E.g. for ID 20 it is easy, I want to keep the attempt with a 'submitdate' as that is the one the participant completed. However ID=30 never completed the questionnaire, but i want to keep their most complete response (in this example associated with id=23).

In my code I had written

Code:

collapse (firstnm) submitdate-V3b), by(ID)

However i noticed in my real dataset (larger) this collapsed differently each time I ran the do file, giving me diferent values in 'id' and 'V1a-V3b'

Any advice would be greatly appreciated!

thanks,
Olivia
Comment

Joro Kolev

Join Date: Aug 2018
Posts: 3050

22 Jan 2019, 16:53

Olivia, I dont think in your case -collapse- is the best strategy.

I do not know how you define formally " their most complete response ", I see in the example of ID==20 and id==23 what you mean because I am a human, but this is not how machines work. A machine needs a formal rule.

I would go about the problem like this:

Lets assume that " their most complete response " is formally defined as the maximum given answers.

Code:

. egen numans =  rownonmiss( V1a V1b V2a V2b V3a V3b )

. sort ID numans

. by ID: keep if _n==_N
(5 observations deleted)

. list, clean

       ID   id                  var3   V1a   V1b   V2a   V2b   V3a   V3b   numans  
  1.   16   46                           .     .     .     .     .     .        0  
  2.   19    6   2018-06-27 13:55:37     4     4     4     2     6     3        6  
  3.   20   12   2018-06-12 17:28:57     4     3     5     3     5     2        6  
  4.   30   23                           4     3     6     3     5     3        6  
  5.   35    8                           3     3     6     3     5     2        6  
  6.   45   39   2018-06-25 14:12:32     3     2     5     2     5     3        6

Comment

Olivia Helen

Join Date: Jan 2019

Posts: 7
#7

22 Jan 2019, 17:04

Thank you Joro, that worked perfectly in isolating the response from each participant with the maximum given answers.

Thanks for your time,
Olivia
Comment
Joro Kolev

Join Date: Aug 2018

Posts: 3050
#8

22 Jan 2019, 17:04

And by the way the solution above is a bit Byzantian, because if on one try I answered correctly questions 1 and 2, and then on another round I answered questions 3 and 4, there is ambiguity in the sorting, which simply reflects the fact that in this case " their most complete response " is not uniquely defined according to the rule "maximum number of answers".
Comment

Announcement

How to collapse observations and overwrite variables

Comment

Comment

Comment

Comment

Comment

Comment

Comment