how to check if two observations are duplicates

johnkim

Join Date: Apr 2014

Posts: 47
#1

24 Apr 2015, 19:15

Dear all,
I have a .dta file with a group variable, id. In addition to id, there are many other variables, say A-Z. There are duplicates, i.e., if you run

by id, sort: gen n = _N

n is occasionally 2. I want to check whether such observations are duplicates in the sense that, for each such pair, each variable takes on at most one nonmissing value, i.e., it is NOT the case that

a) A[1] and A[2] are both nonmissing and A[1]!=A[2], OR
b) B[1] and B[2] are both nonmissing and B[1]!=B[2], OR
...
z) Z[1] and Z[2] are both nonmissing and Z[1]!=Z[2].

If so, I want to collapse each such pair into one observation, recording the unique nonmissing value of each variable, or a missing value if both observations are missing.

Is there a way of doing it efficiently? I'd appreciate your thoughts. Thank you!

Best,
John

Clyde Schechter

Join Date: Apr 2014
Posts: 28624

24 Apr 2015, 20:19

Code:

// VERIFY NO CONFLICT OF NON-MISSING VALUES FOR VARIABLES
foreach v of varlist A-Z {
    display as text "Checking `v' for conflicts"
    by id (`v'), sort: assert `v' == `v'[1] if !missing(`v')
}

// NOW COLLAPSE TO A SINGLE VALUE PER id
collapse (firstnm) A-Z, by(id)

Last edited by Clyde Schechter; 24 Apr 2015, 20:21.
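
To see what the check and the collapse above do, here is a minimal sketch on made-up data. The values, and the reduction to just two variables A and B, are assumptions for illustration and do not come from the original dataset. If any id had two different nonmissing values of a variable, the assert would stop with error r(9) before reaching collapse.

```stata
* Hypothetical toy data: ids 1 and 2 appear twice, with gaps that do not conflict
clear
input id A B
1 5 .
1 . 7
2 3 4
2 3 .
3 9 9
end

* Same check as above, restricted to the two toy variables.
* Numeric missings sort last, so `v'[1] is nonmissing whenever the group has a value.
foreach v of varlist A B {
    by id (`v'), sort: assert `v' == `v'[1] if !missing(`v')
}

* One row per id, keeping the first nonmissing value of each variable
collapse (firstnm) A B, by(id)
list, clean noobs
```

With these toy values the asserts pass, and the collapsed data have three rows: id 1 with A = 5 and B = 7, id 2 with A = 3 and B = 4, and id 3 with A = 9 and B = 9.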

johnkim

Join Date: Apr 2014

Posts: 47
#3

24 Apr 2015, 20:59

Dear Clyde,
Thank you so much for your help. May I ask one more question about your code? I'm guessing it assumes variables A-Z are numeric. Is there a way to modify the code so that it runs what you have for numeric variables, but uses [_N] instead of [1] for string variables? I'd appreciate your answer! Thank you again for the lead!

Best,
John

Clyde Schechter

Join Date: Apr 2014
Posts: 28624

24 Apr 2015, 22:28

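Ah, yes, for string variables the sort order is wrong: missing string values ("") sort first, whereas numeric missing values sort last, so strings must be compared against the last observation in each group: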

Code:

foreach v of varlist A-Z {
    capture confirm numeric var `v'
    if c(rc) == 0 {
        by id (`v'), sort: assert `v' == `v'[1] if !missing(`v')
    }
    else {  // STRING VARIABLE
        by id (`v'), sort: assert `v' == `v'[_N] if !missing(`v')
    }
}

collapse (firstnm) A-Z, by(id) // WORKS FOR BOTH STRING & NUMERIC

Last edited by Clyde Schechter; 24 Apr 2015, 22:32.
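
If stopping at the first failed assert is inconvenient, one possible variant is to flag the conflicting ids instead, so they can be inspected before collapsing. This is only a sketch under assumptions: the toy data and the variable names conflict and anyconflict are hypothetical, and the comparison logic is the same [1] versus [_N] rule used above.

```stata
* Hypothetical toy data: one numeric variable (A) and one string variable (S).
* id 2 has a genuine conflict in A (3 vs 4); id 1 only has gaps.
clear
input id A str3 S
1 5 ""
1 . "abc"
2 3 "x"
2 4 "x"
end

gen byte conflict = 0
foreach v of varlist A S {
    capture confirm numeric variable `v'
    if _rc == 0 {
        * numeric: missings sort last, so compare with the first observation
        by id (`v'), sort: replace conflict = 1 if !missing(`v') & `v' != `v'[1]
    }
    else {
        * string: "" sorts first, so compare with the last observation
        by id (`v'), sort: replace conflict = 1 if !missing(`v') & `v' != `v'[_N]
    }
}

* Mark every observation of an id that has at least one conflicting value
by id, sort: egen byte anyconflict = max(conflict)
list, sepby(id)
```

Here anyconflict is 1 for both observations of id 2 and 0 for id 1. The collapse could then be restricted to ids with anyconflict == 0, leaving the flagged ids for manual review.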

johnkim

Join Date: Apr 2014

Posts: 47
#5

26 Apr 2015, 13:12

Thank you so much! I learned new commands on the way :P

Rich Goldstein

Join Date: Mar 2014

Posts: 4264
#6

26 Apr 2015, 19:10

You might also want to check -compobs-; use -search compobs- to find and install it.