  • Why is my matchit score so low?

    Hello!

    I need help with the matchit package.
    I am trying to match two datasets by bank name. I don't have an ID to match on, so I generated row sequence numbers, seq1 and seq2.

    Here are my two datasets. Both were cleaned and have no duplicates.

    set1 variables: seller seq1

    set2 variables: name1 seq2
    I use the following code:

    use set1, clear
    matchit seq1 seller using set2.dta, idu(seq2) txtu(name1) di sim(bigram) w(log) t(0) override
    gsort - similscore

    But the matching result is very bad. I checked both datasets and am sure there are better matches. Thanks!
    [Attached screenshot of the match results: 2023-10-05 094019.png]

  • #2
    Well, most of these pairings are clearly terrible mismatches and you should want the match score to be very low.

    But the last one you show, Bank of America, N.A. with BANK OF AMER NA should be a decent match--or so it seems to the human eye. But to Stata, these have almost nothing in common because name1 is in upper case and seller is in mixed case. -matchit- is case-sensitive. So I would first deal with typographical sources of non-matching and then re-run -matchit-. So something like this:

    Code:
    // DO THIS IN THE MASTER DATA AND USING DATA SETS
    // (in set2, build the cleaned variable from name1 instead of seller)
    gen name0 = trim(itrim(upper(seller)))
    local punctuation . , -
    foreach x of local punctuation {
        replace name0 = subinstr(name0, "`x'", "", .)
    }
    
    // THEN RERUN -matchit- USING name0 AS THE txtmaster VARIABLE:
    matchit seq1 name0 using set2.dta, idu(seq2) txtu(name1) di sim(bigram) w(log) t(0) override
    You have a much better chance of identifying matches with issues of case, punctuation, and stray blank spaces out of the way.
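
    A minimal sketch of applying the same cleaning on both sides before matching, in case it helps. The file name set2_clean and the variable name1_clean are made up for illustration, and the -matchit- options are simply copied from the original post, so adjust as needed:

    ```stata
    * Clean the using data first (set2 holds name1 and seq2)
    use set2, clear
    gen name1_clean = trim(itrim(upper(name1)))
    foreach x in "." "," "-" {
        replace name1_clean = subinstr(name1_clean, "`x'", "", .)
    }
    save set2_clean, replace   // hypothetical file name

    * Clean the master data the same way (set1 holds seller and seq1)
    use set1, clear
    gen name0 = trim(itrim(upper(seller)))
    foreach x in "." "," "-" {
        replace name0 = subinstr(name0, "`x'", "", .)
    }

    * Match on the cleaned text variables from both data sets
    matchit seq1 name0 using set2_clean.dta, idu(seq2) txtu(name1_clean) sim(bigram) w(log) t(0) override
    gsort -similscore
    ```

    With this cleaning, "Bank of America, N.A." becomes "BANK OF AMERICA NA", which shares most of its bigrams with "BANK OF AMER NA".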



    • #3
      It works great! Thanks!
