Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Problems with dropping duplicates of ID based on criteria

    Dear all,

    I had a problem previously about dropping variables based on criteria.

    This time, I will need to keep variables with the worst outcome (3 = higest severity/stage of disease). Below is a list of patients and their visit at hospital:
    patient stage of disease
    Cindy 1
    Cindy 2
    Bob 1
    Bob 2
    Bob 3
    Louis 1

    I would like to count number of unique patients with the highest stage of disease, regardless visitnumber, visit date etc. So I end up with a dataset like this:
    patient stage of disease
    Cindy 2
    Bob 3
    Louis 1
    Where serverest stage for Cindy was 2, for Bob 3, and Louis only 1

    I made the duplicate-command, but I cant figure out how to sort and what criteria to set up to keep exactly the observatiosn with highest severity

    Hope you can save my life, once again, thanks in advance!

    Best wishes,
    Malinna

  • #2
    EDIT:

    I sorted the patient ID, and then I sorted the stage of disease in desceding order (by adding minus infront of the variable), so that worse case is the first observation in each ID-group:

    the duplicate-command:
    sort patient -diagnose
    quietly by patient: gen dup_patient = cond(_N==1,0,_n)
    then I wanted to drop patient if dup_patient>1

    However, when I quickly look at the data, I see that in some ID-groups, the sorting went wrong. If there was 4 of the same patient, diagnose could be listed like this:
    Patient diagnose
    Lucas 3
    Lucas 2
    Lucas 3
    Lucas 1
    What went wrong? :S
    Hope you can save my life, once again, thanks in advance!

    Best wishes,
    Malinna

    Comment


    • #3
      Code:
      sort patient -diagnose
      isn't legal. Did you mean gsort?

      We need a reproducible example with data and with exact code that you used.

      Please do read the FAQ and act on

      http://www.statalist.org/forums/help#stata (using dataex (SSC) and CODE delimiters)

      http://www.statalist.org/forums/help#realnames (fixing your name to the required form)

      Comment


      • #4
        Hi Nick,

        I didnt use gsort, but now I did, and it works! thanks alot, once again!

        Comment


        • #5
          OK. I see what is going on.

          Code:
          sort code -diagnose
          is legal (and I was wrong), because it will be read as

          Code:
          sort code-diagnose
          namely all the variables from code to diagnose in your dataset. That can include other variables.

          To see that can jumble your data (as you think), note this example.

          Code:
          . sysuse auto
          (1978 Automobile Data)
          
          . order rep78 turn foreign
          
          . sort rep78 -foreign
          
          . l rep78 turn foreign if rep78 == 3
          
               +-------------------------+
               | rep78   turn    foreign |
               |-------------------------|
           11. |     3     31   Domestic |
           12. |     3     34   Domestic |
           13. |     3     34    Foreign |
           14. |     3     36   Domestic |
           15. |     3     36    Foreign |
               |-------------------------|
           16. |     3     36    Foreign |
           17. |     3     37   Domestic |
           18. |     3     40   Domestic |
           19. |     3     40   Domestic |
           20. |     3     40   Domestic |
               |-------------------------|
           21. |     3     40   Domestic |
           22. |     3     41   Domestic |
           23. |     3     41   Domestic |
           24. |     3     42   Domestic |
           25. |     3     42   Domestic |
               |-------------------------|
           26. |     3     42   Domestic |
           27. |     3     42   Domestic |
           28. |     3     43   Domestic |
           29. |     3     43   Domestic |
           30. |     3     43   Domestic |
               |-------------------------|
           31. |     3     43   Domestic |
           32. |     3     43   Domestic |
           33. |     3     43   Domestic |
           34. |     3     43   Domestic |
           35. |     3     44   Domestic |
               |-------------------------|
           36. |     3     44   Domestic |
           37. |     3     45   Domestic |
           38. |     3     45   Domestic |
           39. |     3     48   Domestic |
           40. |     3     51   Domestic |
              +-------------------------+
          The data are sorted on rep78 turn foreign and thus look jumbled with respect to rep78 foreign.
          Last edited by Nick Cox; 08 Oct 2015, 04:41.

          Comment


          • #6
            OOOOH, i see! I misunderstood the "minus" in front of variable. I thought minus meant sort in descending order. Now I completely understand. I have used the gsort-code now.
            Now the sorting and ranking is correct. Thanks, Nick!

            Comment


            • #7
              Hi Nick,

              I had a similar problem except the observations I want to keep are based on more than 1 criterion. I'm trying to keep the latest submitted information for each patient, where more than one row of information may have been submitted for each patient:

              Using STATA/IC 13.1:

              example data:

              I generated the "patient_duplicate" column below using:

              sort patient_id
              quietly by patient_id: gen patient_duplicate= cond(_N==1,0,_n)

              All other columns were pre-existing:

              patient_id: patient identification number (can be alphanumeric)
              form_number : old or current form number (alphanumeric)
              form_correcting : new form number (alphanumeric) if applicable

              ...

              patient_id patient_duplicate form_number form_correcting date_form_completed
              1 0 7369 01/03/2009
              2 1 3997 01/03/2009
              2 2 3r308 3997 02/03/2009
              2 3 2275 3r308 03/03/2009
              3 1 1082 04/03/2009
              3 2 4495 925-4 05/03/2009
              3 3 925-4 1082 05/03/2009
              First I reformat the form submission date :

              gen date2= date(date_form_completed, "DMY")
              format %td date2

              Then, When I run the following code:

              by patient_id(date2), sort: keep if _n==_N

              It keeps the wrong information for Patient #3 since the last 2 versions of this patient's information were submitted on the same day (it keeps the very last row shown, instead of the second to last row shown). What I really want to accomplish is to keep the last created form version for each patient. Please let me know if you can help.


              ** however, now that I've recreated my real data using my example data (above), there is an error when I run that last bit of code: "factor variables and time-series operators not allowed r(101);"

              Any idea why?

              This error doesn't appear when I run the same exact code on my actual data (with over 1,000 instances of duplicates which have to be sorted out).

              Thanks,
              Cat
              Last edited by Catherine Castillo; 01 Dec 2015, 13:57.

              Comment


              • #8
                ** however, now that I've recreated my real data using my example data (above), there is an error when I run that last bit of code: "factor variables and time-series operators not allowed r(101);"
                A good example of why it is so important to show exactly what commands were run and exactly what output Stata gives--and that is best done by copying directly from the Stata Results window or your log file into a code block. That said, in this case, I can tell you what the problem is.

                The error results because you need to have a blank space between patient_id and (date2). Without that, the parser gets confused and thinks you are trying to do something different from what you intend. The reason it doesn't appear when you run with your actual data is that, I guarantee you, you are not running the "same exact code."

                As for your first problem, your command, when corrected by inserting a blank space before (date2), only sorts the data on patient_id and date2. When, as in the case of patient 3, there is more than one observation with the same values of patient_id and date2, Stata will sort them in some arbitrary order. Worse still, the order is not reproducible. So if you re-ran this several times, some of them would give you the answer you were hoping for, and others would not. In order to solve this problem, you need to provide the -by- prefix with another variable that correctly sorts those observations within patient_id and date2. It isn't obvious to me what that variable would be. You state that you wanted the second-to-last observation shown rather than the last. But how would we know that? It's not obvious to me. What is the rule by which you break ties? It seems like it might involve some other variables not shown in your post?

                Comment


                • #9
                  Hi Clyde,

                  Thanks for catching this - yes, once I enter the missing space, it runs the code the way that I explained (it will keep the very last row of data for patient #3).

                  However, I want to keep the second to last row because it is a correction of the very last row (the form number of the last row is 925-4 , but this form was corrected and replaced by form number 4495. Unfortunately, both of these forms were submitted on the same day).

                  Is there a way to amend the code in a way that takes this into account?

                  My apologies for the confusion.

                  Thank you,
                  Cat

                  Comment


                  • #10
                    Oh, OK. I think this works:

                    Code:
                    * Example generated by -dataex-. To install: ssc install dataex
                    clear
                    input float(patient_id patient_duplicate) str10(form_number form_correcting) float date_form_completed
                    1 0 "7369"  ""      17957
                    2 1 "3997"  ""      17957
                    2 2 "3r308" "3997"  17958
                    2 3 "2275"  "3r308" 17959
                    3 1 "1082"  ""      17960
                    3 2 "4495"  "925-4" 17961
                    3 3 "925-4" "1082"  17961
                    end
                    format %tdDD/NN/CCYY date_form_completed
                    
                    list, sepby(patient_id)
                    
                    gen form2 = form_number
                    replace form2 = form_correcting+"C" if !missing(form_correcting)
                    
                    by patient_id (date_form_completed form2), sort: keep if _n == _N
                    
                    list
                    However, it will only work for one step. If yet a third form, say 813, correcting form 4495 were also submitted on the same day, its value of form2 would be 4495C, which would sort before 925-4C. So this solution will not take you all the way down a chain. Are there any situations like that in your data?
                    Last edited by Clyde Schechter; 01 Dec 2015, 15:58.

                    Comment


                    • #11
                      Hi Clyde,

                      Thank you for looking further into this - very much appreciated! Yes, in order to answer your question I've now generated an additional duplicate-identifying variable to count the number of duplicate patient_id and date_form_completed combinations:

                      quietly by patient_id date_form_completed: gen form_date_dup = cond(_N==1,0,_n)

                      When I run a tabulation of form_date_dup , the values on the left column of the tabulation go from 0 to 6 meaning that some patients have up to 6 forms submitted on the same day, right?

                      So, I guess the answer is yes. I'd have to find a way to select the last submitted form for each patient who submitted anywhere from 2 to 6 forms on the same date.








                      Comment


                      • #12
                        Hi again Clyde,

                        Now that I think of it, there are only 29 of those duplicates with 3+ rows with the same date. Would the easiest solution be to sort these longer chains out manually? Since over 100 will be solved using your solution, is there a way to run your suggested code on just the duplicates with 2 rows with the same date and not the ones with a longer chain?

                        Comment


                        • #13
                          I think the code to automatically deal with chains of 3 or more forms from the same date will be complicated and manually fixing them may make sense if there are only 29 such.

                          Code:
                           //  IDENTIFY PATIENT-IDS FINAL DATES THAT HAVE 3 OR MORE OBSERVATIONS
                          by patient_id (date_form_completed), sort: gen byte last_date = (date_form_completed == date_form_completed[_N])
                          by patient_id date_form_completed: gen byte needs_attention = (_N >= 3 & last_date)
                          will identify those observations that need manual attention. Then you can modify the earlier code:

                          Code:
                           gen form2 = form_number replace form2 = form_correcting+"C" if !missing(form_correcting)  
                          by patient_id (date_form_completed form2), sort: drop if _n < _N & !needs_attention
                          This will weed out all but the final form for those situations not requiring special attention. Then you can -edit if needs_attention- and fix up the complicated cases.
                          Last edited by Clyde Schechter; 02 Dec 2015, 08:09.

                          Comment


                          • #14
                            Perfectly happy with this - thanks for the help!

                            Comment


                            • #15
                              Hi
                              I have a similar issue, i am trying to keep variables with a consistent level of non-missing variables. I have quarterly data financial data, from 1991 to 2019. I would like to keep values with consistent data (non-missing values) for at least 10 years (10X4). I have used the foreach foreach v of varlist TotalAssets000 {
                              by BANKID & date; count if !missing(`v')
                              by BANKID & date; keep `v' if !missing(`v') & r(N) < 44
                              }

                              But am not getting the result that i need.
                              Bankid date net income interest expense Total assets
                              102223 1991Q1 54034 235890 45000
                              102223 1991Q2 45000 7067 45000
                              102223 1991Q3 54034 80004 45000
                              102223 1991Q4 23700 345000 45000
                              102223 1992Q1 . 7800 .
                              102223 1992Q2 120000 6000 .
                              102223 1992Q3 85000 . .
                              102223 1992Q4 85000 . .
                              102223 1993Q1 85000 . .
                              102223 1993Q2 85000 . .
                              102223 1993Q3 85000 . .
                              102223 1993Q4 85000 . .
                              102224 1994Q1 85000 . 75000
                              102224 1994Q2 85000 . 75000
                              102224 1994Q3 . . 75000
                              102224 1994Q4 . . 75000
                              102224 1995Q1 . . 68000
                              102224 1995Q2 . . 68000
                              102224 1995Q3 . . 68000
                              102224 1995Q4 . . 68000

                              Comment

                              Working...
                              X