Why my Matchit score is so low

Song Yang

#1

05 Oct 2023, 07:43

Hello!

I need help with the matchit package.
I am trying to match two datasets by bank name. I don't have an ID variable to match on, so I generated row sequence numbers, seq1 and seq2.

Here are my two datasets. Both were cleaned and contain no duplicates.

set1 variables: seller seq1

set2 variables: name1 seq2
I use the following code:

use set1, clear
matchit seq1 seller using set2.dta, idu(seq2) txtu(name1) di sim(bigram) w(log) t(0) override
gsort - similscore

But the matching results are poor. I checked both datasets and I am sure that better matches exist. Thanks!
Clyde Schechter

#2

05 Oct 2023, 12:03

Well, most of these pairings are clearly terrible mismatches and you should want the match score to be very low.

But the last one you show, Bank of America, N.A. with BANK OF AMER NA should be a decent match--or so it seems to the human eye. But to Stata, these have almost nothing in common because name1 is in upper case and seller is in mixed case. -matchit- is case-sensitive. So I would first deal with typographical sources of non-matching and then re-run -matchit-. So something like this:

Code:

// DO THIS IN THE MASTER DATA AND USING DATA SETS
gen name0 = trim(itrim(upper(seller)))
local punctuation . , -
foreach x of local punctuation {
    replace name0 = subinstr(name0, "`x'", "", .)
}

// THEN RERUN -matchit- USING name0 AS THE txtmaster VARIABLE:
matchit seq1 name0 using set2.dta, idu(seq2) txtu(name1) di sim(bigram) w(log) t(0) override

You have a much better chance of identifying matches with issues of case, punctuation, and stray blank spaces out of the way.
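Since the cleanup is meant for both the master and the using data, here is a minimal sketch of the same steps applied to name1. It is only an illustration under assumptions: the three rows are toy data standing in for set2 (the third is made up to show stray blanks), and name1_clean is a hypothetical variable name. Trimming is done last here so that any extra blanks left behind by removing a character are also collapsed.

```stata
* Toy stand-in for set2; the real file and its contents are not shown in the thread
clear
input str40 name1
"BANK OF AMER NA"
"Bank of America, N.A."
" Bank  of America - N.A. "
end

* Upper-case first so the comparison no longer depends on case
gen name1_clean = upper(name1)

* Remove the same punctuation characters as in the master data
foreach x in "." "," "-" {
    replace name1_clean = subinstr(name1_clean, "`x'", "", .)
}

* Collapse repeated internal blanks, then drop leading and trailing blanks
replace name1_clean = trim(itrim(name1_clean))

list name1 name1_clean
```

In a real run, the cleaned variable would be saved in the using file and passed to txtu() in place of name1, so that both sides of the comparison have gone through the same normalization.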
Song Yang

#3

05 Oct 2023, 17:10

It works great! Thanks!