  • removing duplicates

    Hello,

    I use this command for removing duplicates:

    sort postcal code, gender, birth year, species, type, date
    quietly by postcal code, gender, birth year, species, type: gen dup2 = cond(_N==1,0,_n)
    tabulate dup2
    drop if dup2 ==2
    drop if dup2==3

    I want to remove the duplicate with the last date (therefore I sorted also on date).

    If one of these variables is empty for both subjects of a duplicate pair (and all other variables above have the same values), will these subjects also be treated as duplicates, or only when both cells are non-empty and contain the same value?

    If empty cells also count as matching values, is there a way to prevent this?


    Kind regards,
    Karuna Vendrik

  • #2
    The code you show will not run because it is riddled with commas in places where commas are not allowed. Moreover, given that you did not put a comma between postal and code, I'm guessing that you do not have two separate variables, postal and code; rather you have one variable, perhaps called postal_code.

    When that is corrected, it will treat observations with missing values as duplicates of other observations with missing values in the same variables if the non-missing variables also agree.
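
    A minimal sketch of that behavior, using a made-up toy data set (the variable names postcal_code, gender, and birth_year and all values are assumptions for illustration only). Under by:, a missing value is a value like any other, so the pair with an empty postal code and the pair with a missing birth year are each numbered 1 and 2, just like the fully observed pair:
    Code:
    * toy data, invented for illustration
    clear
    input str6 postcal_code byte gender int birth_year
    "1234AB" 1 1980
    "1234AB" 1 1980
    ""       1 1980
    ""       1 1980
    "5678CD" 2 .
    "5678CD" 2 .
    end
    * same numbering rule as in the question: 0 if unique, otherwise 1, 2, ...
    bysort postcal_code gender birth_year: gen dup = cond(_N == 1, 0, _n)
    list, sepby(postcal_code gender birth_year)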

    To override that behavior you can do this:
    Code:
    egen mcount = rowmiss(postcal_code gender birth_year species type)
    by postcal_code gender birth_year species type (date), sort: drop if _n > 1 & mcount == 0
    Added: Removing duplicates with -drop if dup2 == 2- and -drop if dup2 == 3- may work in this particular data set if no group ever contains more than three duplicate observations. But it is usually better practice not to rely on such specific attributes of the data set: you may update the data set at some point and that assumption may then fail, breaking your code. The code shown here works equally well regardless of how many duplicates may be present.
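
    If you would rather inspect what will be removed before deleting anything, a cautious variant is to flag the observations first and drop them in a separate step. This is only a sketch: nmiss and to_drop are hypothetical helper names, and the variable names are guesses at what the data set actually contains.
    Code:
    * count missing values among the key variables in each observation
    egen nmiss = rowmiss(postcal_code gender birth_year species type)
    * within each group, sorted by date, flag everything after the earliest
    * observation, but only when none of the key variables is missing
    bysort postcal_code gender birth_year species type (date): gen byte to_drop = _n > 1 & nmiss == 0
    * check how many observations are flagged before removing them
    tabulate to_drop
    drop if to_drop
    drop nmiss to_drop
    Because the flag is based on _n > 1 rather than on specific values such as 2 or 3, it does not depend on how many duplicates a group contains.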
