Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Counting the number of non-distinct identifiers in two variables

    Hi,

    I have two variables, Partner1 and Partner2, which contain unique passport numbers in string format. An example:
    Partner 1 Partner 2
    E101 E100
    E102 E103
    E100 E101
    E105 E106
    E107 E108
    Over here, we have 6 observations. However, we see that E100 and E101 are "repeated" between the two variables. Is there a way to count the number of repeated string variables in two variables? Thank you.

  • #2
    Try this:

    Code:
    levelsof partner1, local(one)
    levelsof partner2, local(two)
    local repeated: list one & two
    local repeated: list uniq repeated
    display `"Number of repeated values: `:word count `repeated''"'
    If your data set is huge, this approach may overflow the size limits on local macros. In that case, you could do this instead:

    Code:
    preserve
    keep partner1 partner2
    gen long obs_no = _n
    reshape long partner, i(obs_no) j(which)
    by obs_no, sort: keep if _N == 2
    count
    display `"Number of repeated values: `=`r(N)'/2'"'
    restore
    Added: A third approach that might be faster than the above in a very large data set:

    Code:
    preserve
    keep partner2
    rename partner2 partner
    tempfile two
    save `two'
    restore, preserve
    keep partner1
    rename partner1 partner
    merge 1:1 partner using `two', keep(match)
    count
    display "Number of repeated values: `r(N)'"
    restore
    Last edited by Clyde Schechter; 11 Apr 2017, 10:02.

    Comment


    • #3
      Hi Clyde,

      Thanks so much for the help. I now have an issue whereby I need to drop the second "repeated" observation to prevent any duplicates.However, the data set is really huge so using levelsof may not be appropriate. Is there any other way to go around this?

      Thank you.

      Comment


      • #4
        It's easy to miss that strings can be ordered and therefore compared. So unless I'm missing something here, if the two passport numbers are ordered alphabetically, then simple Stata grouping will identify duplicates.

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input str4(partner1 partner2)
        "E101" "E100"
        "E102" "E103"
        "E100" "E101"
        "E105" "E106"
        "E107" "E108"
        end
        
        gen lowpn = cond(partner1 <= partner2, partner1, partner2)
        gen highpn = cond(partner1 > partner2, partner1, partner2)
        
        bysort lowpn highpn: keep if _n == 1

        Comment


        • #5
          Robert's approach is written up at http://www.stata-journal.com/sjpdf.h...iclenum=dm0043

          Comment


          • #6
            Hi Robert and Nick,

            Thank you both so much for the help.

            I have been thinking for days!

            Comment

            Working...
            X