Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to collapse observations and overwrite variables

    Hi everyone,
    I have a data set where there are 4 diseases, but the presence of each is entered as a separate observation for every patient ID. I would like to consolidate each patient ID into a single observation with all diseases present. For example:

    Starting data set:
    ID A B C D
    12345 1 0 0 0
    12345 0 1 0 0
    98765 0 0 0 1
    98765 0 1 0 0
    98765 0 0 1 0

    Desired result:
    ID A B C D
    12345 1 1 0 0
    98765 0 1 1 1

    Is there a way to achieve this result? Thanks in advance for your help!

    Jason

  • #2
    Code:
    collapse (max) A B C D, by(ID)

    Comment


    • #3
      Code:
      . collapse (sum) a b c d, by(id)
      
      . list, clean
      
                id   a   b   c   d  
        1.   12345   1   1   0   0  
        2.   98765   0   1   1   1

      Comment


      • #4
        Thank you both!

        Comment


        • #5
          Hi Joro, Clyde and Jason,

          I have a very similar question about wanting to collapse data (please let me know if the etiquette is to start a new thread or it is appropriate to post as a reply to a recent similar topic).

          My data differs from Jason's as seen below. It is survey data where some participants have had more than one attempt at completing the survey. Each participant is uniquely identified by 'ID', and each attempt is uniquely identified by 'id'.

          I want to retain their most complete attempt (some participants completed but some did not). Participants who completed have a string date stored in variable 'submitdate'.

          Sample data:
          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input float(ID id) str19 var3 float(V1a V1b V2a V2b V3a V3b)
          16 46 ""                    . . . . . .
          19  6 "2018-06-27 13:55:37" 4 4 4 2 6 3
          20  2 ""                    4 2 5 . . .
          20 12 "2018-06-12 17:28:57" 4 3 5 3 5 2
          30  3 ""                    3 3 5 2 . .
          30 18 ""                    3 2 4 2 . .
          30 23 ""                    4 3 6 3 5 3
          30 31 ""                    3 2 5 . . .
          35  8 ""                    3 3 6 3 5 2
          45 39 "2018-06-25 14:12:32" 3 2 5 2 5 3
          45 32 ""                    4 3 4 3 . .
          end

          In this example data, I want to keep the most complete response for each ID, even if the response differs between attempts.

          E.g. for ID 20 it is easy, I want to keep the attempt with a 'submitdate' as that is the one the participant completed. However ID=30 never completed the questionnaire, but i want to keep their most complete response (in this example associated with id=23).

          In my code I had written
          Code:
          collapse (firstnm) submitdate-V3b), by(ID)

          However i noticed in my real dataset (larger) this collapsed differently each time I ran the do file, giving me diferent values in 'id' and 'V1a-V3b'

          Any advice would be greatly appreciated!

          thanks,
          Olivia

          Comment


          • #6
            Olivia, I dont think in your case -collapse- is the best strategy.

            I do not know how you define formally " their most complete response ", I see in the example of ID==20 and id==23 what you mean because I am a human, but this is not how machines work. A machine needs a formal rule.

            I would go about the problem like this:

            Lets assume that " their most complete response " is formally defined as the maximum given answers.

            Code:
            . egen numans =  rownonmiss( V1a V1b V2a V2b V3a V3b )
            
            . sort ID numans
            
            . by ID: keep if _n==_N
            (5 observations deleted)
            
            . list, clean
            
                   ID   id                  var3   V1a   V1b   V2a   V2b   V3a   V3b   numans  
              1.   16   46                           .     .     .     .     .     .        0  
              2.   19    6   2018-06-27 13:55:37     4     4     4     2     6     3        6  
              3.   20   12   2018-06-12 17:28:57     4     3     5     3     5     2        6  
              4.   30   23                           4     3     6     3     5     3        6  
              5.   35    8                           3     3     6     3     5     2        6  
              6.   45   39   2018-06-25 14:12:32     3     2     5     2     5     3        6

            Comment


            • #7
              Thank you Joro, that worked perfectly in isolating the response from each participant with the maximum given answers.

              Thanks for your time,
              Olivia

              Comment


              • #8
                And by the way the solution above is a bit Byzantian, because if on one try I answered correctly questions 1 and 2, and then on another round I answered questions 3 and 4, there is ambiguity in the sorting, which simply reflects the fact that in this case " their most complete response " is not uniquely defined according to the rule "maximum number of answers".

                Comment

                Working...
                X