Problems with dropping duplicates of ID based on criteria

Janni Lam

Join Date: Nov 2014

Posts: 15
#1

Problems with dropping duplicates of ID based on criteria

08 Oct 2015, 00:24

Dear all,

I had a problem previously about dropping variables based on criteria.

This time, I will need to keep variables with the worst outcome (3 = higest severity/stage of disease). Below is a list of patients and their visit at hospital:

patient stage of disease

Cindy 1

Cindy 2

Bob 1

Bob 2

Bob 3

Louis 1

I would like to count number of unique patients with the highest stage of disease, regardless visitnumber, visit date etc. So I end up with a dataset like this:

patient stage of disease

Cindy 2

Bob 3

Louis 1

Where serverest stage for Cindy was 2, for Bob 3, and Louis only 1

I made the duplicate-command, but I cant figure out how to sort and what criteria to set up to keep exactly the observatiosn with highest severity

Hope you can save my life, once again, thanks in advance!

Best wishes,
Malinna
Tags: None
Janni Lam

Join Date: Nov 2014

Posts: 15
#2

08 Oct 2015, 01:25

EDIT:

I sorted the patient ID, and then I sorted the stage of disease in desceding order (by adding minus infront of the variable), so that worse case is the first observation in each ID-group:

the duplicate-command:
sort patient -diagnose
quietly by patient: gen dup_patient = cond(_N==1,0,_n)
then I wanted to drop patient if dup_patient>1

However, when I quickly look at the data, I see that in some ID-groups, the sorting went wrong. If there was 4 of the same patient, diagnose could be listed like this:
Patient diagnose

Lucas 3

Lucas 2

Lucas 3

Lucas 1

What went wrong? :S
Hope you can save my life, once again, thanks in advance!

Best wishes,
Malinna
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35585
#3

08 Oct 2015, 04:11

Code:

sort patient -diagnose

isn't legal. Did you mean gsort?

We need a reproducible example with data and with exact code that you used.

Please do read the FAQ and act on

http://www.statalist.org/forums/help#stata (using dataex (SSC) and CODE delimiters)

http://www.statalist.org/forums/help#realnames (fixing your name to the required form)
Comment
Janni Lam

Join Date: Nov 2014

Posts: 15
#4

08 Oct 2015, 04:31

Hi Nick,

I didnt use gsort, but now I did, and it works! thanks alot, once again!
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35585

08 Oct 2015, 04:39

OK. I see what is going on.

Code:

sort code -diagnose

is legal (and I was wrong), because it will be read as

Code:

sort code-diagnose

namely all the variables from code to diagnose in your dataset. That can include other variables.

To see that can jumble your data (as you think), note this example.

Code:

. sysuse auto
(1978 Automobile Data)

. order rep78 turn foreign

. sort rep78 -foreign

. l rep78 turn foreign if rep78 == 3

     +-------------------------+
     | rep78   turn    foreign |
     |-------------------------|
 11. |     3     31   Domestic |
 12. |     3     34   Domestic |
 13. |     3     34    Foreign |
 14. |     3     36   Domestic |
 15. |     3     36    Foreign |
     |-------------------------|
 16. |     3     36    Foreign |
 17. |     3     37   Domestic |
 18. |     3     40   Domestic |
 19. |     3     40   Domestic |
 20. |     3     40   Domestic |
     |-------------------------|
 21. |     3     40   Domestic |
 22. |     3     41   Domestic |
 23. |     3     41   Domestic |
 24. |     3     42   Domestic |
 25. |     3     42   Domestic |
     |-------------------------|
 26. |     3     42   Domestic |
 27. |     3     42   Domestic |
 28. |     3     43   Domestic |
 29. |     3     43   Domestic |
 30. |     3     43   Domestic |
     |-------------------------|
 31. |     3     43   Domestic |
 32. |     3     43   Domestic |
 33. |     3     43   Domestic |
 34. |     3     43   Domestic |
 35. |     3     44   Domestic |
     |-------------------------|
 36. |     3     44   Domestic |
 37. |     3     45   Domestic |
 38. |     3     45   Domestic |
 39. |     3     48   Domestic |
 40. |     3     51   Domestic |
    +-------------------------+

The data are sorted on rep78 turn foreign and thus look jumbled with respect to rep78 foreign.

Last edited by Nick Cox; 08 Oct 2015, 04:41.

Comment

Janni Lam

Join Date: Nov 2014

Posts: 15
#6

08 Oct 2015, 05:01

OOOOH, i see! I misunderstood the "minus" in front of variable. I thought minus meant sort in descending order. Now I completely understand. I have used the gsort-code now.
Now the sorting and ranking is correct. Thanks, Nick!
Comment

Catherine Castillo

Join Date: Dec 2015
Posts: 8

01 Dec 2015, 13:54

Hi Nick,

I had a similar problem except the observations I want to keep are based on more than 1 criterion. I'm trying to keep the latest submitted information for each patient, where more than one row of information may have been submitted for each patient:

Using STATA/IC 13.1:

example data:

I generated the "patient_duplicate" column below using:

sort patient_id
quietly by patient_id: gen patient_duplicate= cond(_N==1,0,_n)

All other columns were pre-existing:

patient_id: patient identification number (can be alphanumeric)
form_number : old or current form number (alphanumeric)
form_correcting : new form number (alphanumeric) if applicable

...

patient_id	patient_duplicate	form_number	form_correcting	date_form_completed
1	0	7369		01/03/2009
2	1	3997		01/03/2009
2	2	3r308	3997	02/03/2009
2	3	2275	3r308	03/03/2009
3	1	1082		04/03/2009
3	2	4495	925-4	05/03/2009
3	3	925-4	1082	05/03/2009

First I reformat the form submission date :

gen date2= date(date_form_completed, "DMY")
format %td date2

Then, When I run the following code:

by patient_id(date2), sort: keep if _n==_N

It keeps the wrong information for Patient #3 since the last 2 versions of this patient's information were submitted on the same day (it keeps the very last row shown, instead of the second to last row shown). What I really want to accomplish is to keep the last created form version for each patient. Please let me know if you can help.

** however, now that I've recreated my real data using my example data (above), there is an error when I run that last bit of code: "factor variables and time-series operators not allowed r(101);"

Any idea why?

This error doesn't appear when I run the same exact code on my actual data (with over 1,000 instances of duplicates which have to be sorted out).

Thanks,
Cat

Last edited by Catherine Castillo; 01 Dec 2015, 13:57.

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30050
#8

01 Dec 2015, 14:25

** however, now that I've recreated my real data using my example data (above), there is an error when I run that last bit of code: "factor variables and time-series operators not allowed r(101);"

A good example of why it is so important to show exactly what commands were run and exactly what output Stata gives--and that is best done by copying directly from the Stata Results window or your log file into a code block. That said, in this case, I can tell you what the problem is.

The error results because you need to have a blank space between patient_id and (date2). Without that, the parser gets confused and thinks you are trying to do something different from what you intend. The reason it doesn't appear when you run with your actual data is that, I guarantee you, you are not running the "same exact code."

As for your first problem, your command, when corrected by inserting a blank space before (date2), only sorts the data on patient_id and date2. When, as in the case of patient 3, there is more than one observation with the same values of patient_id and date2, Stata will sort them in some arbitrary order. Worse still, the order is not reproducible. So if you re-ran this several times, some of them would give you the answer you were hoping for, and others would not. In order to solve this problem, you need to provide the -by- prefix with another variable that correctly sorts those observations within patient_id and date2. It isn't obvious to me what that variable would be. You state that you wanted the second-to-last observation shown rather than the last. But how would we know that? It's not obvious to me. What is the rule by which you break ties? It seems like it might involve some other variables not shown in your post?
Comment
Catherine Castillo

Join Date: Dec 2015

Posts: 8
#9

01 Dec 2015, 14:54

Hi Clyde,

Thanks for catching this - yes, once I enter the missing space, it runs the code the way that I explained (it will keep the very last row of data for patient #3).

However, I want to keep the second to last row because it is a correction of the very last row (the form number of the last row is 925-4 , but this form was corrected and replaced by form number 4495. Unfortunately, both of these forms were submitted on the same day).

Is there a way to amend the code in a way that takes this into account?

My apologies for the confusion.

Thank you,
Cat
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30050

#10

01 Dec 2015, 15:52

Oh, OK. I think this works:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(patient_id patient_duplicate) str10(form_number form_correcting) float date_form_completed
1 0 "7369"  ""      17957
2 1 "3997"  ""      17957
2 2 "3r308" "3997"  17958
2 3 "2275"  "3r308" 17959
3 1 "1082"  ""      17960
3 2 "4495"  "925-4" 17961
3 3 "925-4" "1082"  17961
end
format %tdDD/NN/CCYY date_form_completed

list, sepby(patient_id)

gen form2 = form_number
replace form2 = form_correcting+"C" if !missing(form_correcting)

by patient_id (date_form_completed form2), sort: keep if _n == _N

list

However, it will only work for one step. If yet a third form, say 813, correcting form 4495 were also submitted on the same day, its value of form2 would be 4495C, which would sort before 925-4C. So this solution will not take you all the way down a chain. Are there any situations like that in your data?

Last edited by Clyde Schechter; 01 Dec 2015, 15:58.

Comment

Catherine Castillo

Join Date: Dec 2015

Posts: 8
#11

02 Dec 2015, 06:14

Hi Clyde,

Thank you for looking further into this - very much appreciated! Yes, in order to answer your question I've now generated an additional duplicate-identifying variable to count the number of duplicate patient_id and date_form_completed combinations:

quietly by patient_id date_form_completed: gen form_date_dup = cond(_N==1,0,_n)

When I run a tabulation of form_date_dup , the values on the left column of the tabulation go from 0 to 6 meaning that some patients have up to 6 forms submitted on the same day, right?

So, I guess the answer is yes. I'd have to find a way to select the last submitted form for each patient who submitted anywhere from 2 to 6 forms on the same date.
Comment
Catherine Castillo

Join Date: Dec 2015

Posts: 8
#12

02 Dec 2015, 06:50

Hi again Clyde,

Now that I think of it, there are only 29 of those duplicates with 3+ rows with the same date. Would the easiest solution be to sort these longer chains out manually? Since over 100 will be solved using your solution, is there a way to run your suggested code on just the duplicates with 2 rows with the same date and not the ones with a longer chain?
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30050
#13

02 Dec 2015, 08:07

I think the code to automatically deal with chains of 3 or more forms from the same date will be complicated and manually fixing them may make sense if there are only 29 such.

Code:

// IDENTIFY PATIENT-IDS FINAL DATES THAT HAVE 3 OR MORE OBSERVATIONS by patient_id (date_form_completed), sort: gen byte last_date = (date_form_completed == date_form_completed[_N]) by patient_id date_form_completed: gen byte needs_attention = (_N >= 3 & last_date)

will identify those observations that need manual attention. Then you can modify the earlier code:

Code:

gen form2 = form_number replace form2 = form_correcting+"C" if !missing(form_correcting) by patient_id (date_form_completed form2), sort: drop if _n < _N & !needs_attention

This will weed out all but the final form for those situations not requiring special attention. Then you can -edit if needs_attention- and fix up the complicated cases.

Last edited by Clyde Schechter; 02 Dec 2015, 08:09.
Comment
Catherine Castillo

Join Date: Dec 2015

Posts: 8
#14

02 Dec 2015, 11:08

Perfectly happy with this - thanks for the help!
Comment

John Aibinu

Join Date: May 2019
Posts: 9

#15

06 May 2020, 04:26

Hi
I have a similar issue, i am trying to keep variables with a consistent level of non-missing variables. I have quarterly data financial data, from 1991 to 2019. I would like to keep values with consistent data (non-missing values) for at least 10 years (10X4). I have used the foreach foreach v of varlist TotalAssets000 {
by BANKID & date; count if !missing(`v')
by BANKID & date; keep `v' if !missing(`v') & r(N) < 44
}

But am not getting the result that i need.

Bankid	date	net income	interest expense	Total assets
102223	1991Q1	54034	235890	45000
102223	1991Q2	45000	7067	45000
102223	1991Q3	54034	80004	45000
102223	1991Q4	23700	345000	45000
102223	1992Q1	.	7800	.
102223	1992Q2	120000	6000	.
102223	1992Q3	85000	.	.
102223	1992Q4	85000	.	.
102223	1993Q1	85000	.	.
102223	1993Q2	85000	.	.
102223	1993Q3	85000	.	.
102223	1993Q4	85000	.	.
102224	1994Q1	85000	.	75000
102224	1994Q2	85000	.	75000
102224	1994Q3	.	.	75000
102224	1994Q4	.	.	75000
102224	1995Q1	.	.	68000
102224	1995Q2	.	.	68000
102224	1995Q3	.	.	68000
102224	1995Q4	.	.	68000

patient	stage of disease
Cindy	1
Cindy	2
Bob	1
Bob	2
Bob	3
Louis	1

patient	stage of disease
Cindy	2
Bob	3
Louis	1

Patient	diagnose
Lucas	3
Lucas	2
Lucas	3
Lucas	1

Announcement