  • choose which duplicate observation to keep

    Hi,
    I have to calculate the incidence of a disease over one year, and for that I have two databases (a laboratory database and a hospital database). For example, some people had one test recorded in the laboratory database and later another test in the hospital database, and the results of those two tests may differ (one positive and the other negative). I merged those two databases and then used these commands to identify the duplicates:

    "sort NAME YEAR
    quietly by NAME YEAR: gen dup = cond(_N==1,0,_n)
    tab dup"

    If, for a duplicated pair, one RESULT is negative and the other is positive, I would like to delete the observation for which RESULT is coded negative and keep the one coded positive. For duplicates where the RESULT values are the same (negative and negative, or positive and positive), I just want to delete one of the duplicates, without any condition.
    Does anyone know how I can code this?

  • #2
    This isn't well described as a problem in duplicates. The main feature of a duplicates problem is that you have sets of identical observations in respect of your variables, but you only need one of each such set; so it is immaterial which others you drop because they are, as said, identical.

    Make sure you have a safe copy of your dataset in case you (I) mess this up or change your mind.

    This sounds like

    Code:
    * flag the negative result when the other observation in the pair is positive
    bysort NAME YEAR (RESULT) : gen todrop = RESULT < 0 & RESULT[_N] > 0 & _N == 2
    drop if todrop
    where I am taking rather literally the implication of your wording that you have precisely two values of RESULT for each NAME and YEAR. The code will ignore combinations that consist of 1 observation or of 3 or more observations.
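
    If the pairs with identical results also need to be reduced to one observation, one possible sketch is to sort on RESULT within each NAME and YEAR and keep only the last observation. This is an illustration under assumptions not stated in the thread: RESULT is numeric, is never missing, and is coded so that a positive result sorts after a negative one (for example 0 = negative, 1 = positive). Under those assumptions it also handles groups of any size, not just pairs.

    ```stata
    * Assumption: RESULT is numeric and positive sorts above negative.
    * Missing values sort last in Stata, so stop if any are present.
    assert !missing(RESULT)

    * Within each NAME and YEAR, sort by RESULT and keep the highest value:
    * the positive result if there is one, otherwise one of the identical copies.
    bysort NAME YEAR (RESULT) : keep if _n == _N

    * Check that NAME and YEAR now uniquely identify observations.
    isid NAME YEAR
    ```

    As with the code above, it is safer to run this on a copy of the dataset first.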
