choose which duplicate observation to keep

caroline phn

Join Date: Apr 2023

Posts: 8
#1

24 May 2023, 08:23

Hi,
I have to calculate the one-year incidence of a disease, and for that I have two databases (a laboratory database and a hospital database). For example, some people had one test recorded in the laboratory database and later another test in the hospital database, and the results of those two tests may differ (one positive and the other negative). I merged the two databases and then used these commands to identify the duplicates:

sort NAME YEAR
quietly by NAME YEAR: gen dup = cond(_N==1,0,_n)
tab dup

If, for a pair of duplicates, one RESULT is negative and the other is positive, I would like to delete the observation for which RESULT is coded negative and keep the one where it is coded positive. For duplicates where the RESULT values are the same (negative and negative, or positive and positive), I just want to delete one of the duplicates, without any condition.
Does anyone know how I can code this?
Nick Cox

Join Date: Mar 2014

Posts: 35782
#2

24 May 2023, 08:43

This isn't well described as a problem in duplicates. The main feature of a duplicates problem is that you have sets of identical observations in respect of your variables, but you only need one of each such set; so it is immaterial which others you drop because they are, as said, identical.
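To illustrate the distinction, here is a minimal sketch on made-up data (the values are hypothetical, and RESULT is assumed to be numeric with -1 for negative and 1 for positive). The built-in duplicates command only removes observations that are identical on every variable, so a pair with conflicting results is left alone.

```stata
* Hypothetical toy data: A is a true duplicate, C has conflicting results
clear
input str1 NAME int YEAR byte RESULT
"A" 2022  1
"A" 2022  1
"B" 2022 -1
"C" 2022  1
"C" 2022 -1
end

* Count sets of observations that are identical on all variables
duplicates report

* Drop the surplus copy of each fully identical set (only one A row goes)
duplicates drop

* C still has two observations for the same NAME and YEAR
duplicates report NAME YEAR
```

After this, four observations remain, and the two C rows still need a rule for choosing between them.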

Make sure you have a safe copy of your dataset in case you (I) mess this up or change your mind.

This sounds like

Code:

* Flag the negative result (assumed coded < 0) when the other result in the pair is positive (> 0)
bysort NAME YEAR (RESULT) : gen todrop = RESULT < 0 & RESULT[_n+1] > 0 & _N == 2
drop if todrop

where I am taking rather literally the implication of your wording that you have precisely two values of RESULT for each NAME and YEAR. The code will ignore combinations that consist of 1 observation or of 3 or more observations.
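The code above deals only with pairs that have one negative and one positive result. As a minimal sketch of one way to cover the other case in the question as well (pairs with identical results), one could sort on RESULT within each group and keep only the last observation. This is an alternative, not the method given above. It assumes RESULT is numeric with positive coded higher than negative (here -1 and 1) and has no missing values, and the data are invented for illustration. Unlike the code above, it also reduces groups of three or more to a single observation.

```stata
* Hypothetical toy data; RESULT assumed coded -1 = negative, 1 = positive
clear
input str1 NAME int YEAR byte RESULT
"A" 2022 -1
"A" 2022  1
"B" 2022 -1
"B" 2022 -1
"C" 2022  1
"C" 2022  1
"D" 2022  1
"E" 2022 -1
"E" 2022  1
"E" 2022 -1
end

* Missing values sort last and would be kept, so stop if any are present
assert !missing(RESULT)

* Within each NAME-YEAR group the highest RESULT sorts last; keep only that one
bysort NAME YEAR (RESULT) : keep if _n == _N

* Confirm that NAME and YEAR now identify observations uniquely
isid NAME YEAR
list, noobs
```

A positive result is kept whenever the group contains one, and groups whose results all agree are reduced to a single observation.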