Counting the number of non-distinct identifiers in two variables

Ng Luke

Join Date: Aug 2015

Posts: 21
#1

Counting the number of non-distinct identifiers in two variables

11 Apr 2017, 09:40

Hi,

I have two variables, Partner1 and Partner2, which contain unique passport numbers in string format. An example:

Partner 1 Partner 2

E101 E100

E102 E103

E100 E101

E105 E106

E107 E108

Over here, we have 6 observations. However, we see that E100 and E101 are "repeated" between the two variables. Is there a way to count the number of repeated string variables in two variables? Thank you.
Tags: None

Clyde Schechter

Join Date: Apr 2014
Posts: 30114

11 Apr 2017, 10:00

Try this:

Code:

levelsof partner1, local(one)
levelsof partner2, local(two)
local repeated: list one & two
local repeated: list uniq repeated
display `"Number of repeated values: `:word count `repeated''"'

If your data set is huge, this approach may overflow the size limits on local macros. In that case, you could do this instead:

Code:

preserve
keep partner1 partner2
gen long obs_no = _n
reshape long partner, i(obs_no) j(which)
by obs_no, sort: keep if _N == 2
count
display `"Number of repeated values: `=`r(N)'/2'"'
restore

Added: A third approach that might be faster than the above in a very large data set:

Code:

preserve
keep partner2
rename partner2 partner
tempfile two
save `two'
restore, preserve
keep partner1
rename partner1 partner
merge 1:1 partner using `two', keep(match)
count
display "Number of repeated values: `r(N)'"
restore

Last edited by Clyde Schechter; 11 Apr 2017, 10:02.

Comment

Ng Luke

Join Date: Aug 2015

Posts: 21
#3

30 Jun 2017, 02:06

Hi Clyde,

Thanks so much for the help. I now have an issue whereby I need to drop the second "repeated" observation to prevent any duplicates.However, the data set is really huge so using levelsof may not be appropriate. Is there any other way to go around this?

Thank you.
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

30 Jun 2017, 09:17

It's easy to miss that strings can be ordered and therefore compared. So unless I'm missing something here, if the two passport numbers are ordered alphabetically, then simple Stata grouping will identify duplicates.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str4(partner1 partner2)
"E101" "E100"
"E102" "E103"
"E100" "E101"
"E105" "E106"
"E107" "E108"
end

gen lowpn = cond(partner1 <= partner2, partner1, partner2)
gen highpn = cond(partner1 > partner2, partner1, partner2)

bysort lowpn highpn: keep if _n == 1

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35709
#5

30 Jun 2017, 09:31

Robert's approach is written up at http://www.stata-journal.com/sjpdf.h...iclenum=dm0043
Comment
Ng Luke

Join Date: Aug 2015

Posts: 21
#6

02 Jul 2017, 21:12

Hi Robert and Nick,

Thank you both so much for the help.

I have been thinking for days!
Comment

Partner 1	Partner 2
E101	E100
E102	E103
E100	E101
E105	E106
E107	E108

Announcement