Examining -Matchit- options for improving matches based on types of string variables

Michael Costello

22 Aug 2020, 06:17

Has anyone seen a comparison of the various matching methods available in the -matchit- command? I don't really understand the difference between bigram, ngram, ngram_circ, token, soundex and token_soundex. Also, where do the different scoring options (jaccard, simple, and minsimple) excel?

Specifically, I have a dataset in which I'm trying to match on business name and address. Does one of the options work better at ignoring small (in my mind, anyway) differences such as "LLC" vs. "Inc" vs. no modifier, which often arise when business names are recorded? Does another option give greater weight to differences in numbers, such as those I'm seeing in addresses? For example, "123 Main Street." should not be matched with "321 Main Street.", but should be matched with both "123 Main" and "123 Main St". Can someone tell me, either from experience or by reference, where I might get the best results? Has Julio Raffo written about this before? If so, I can't seem to find it.

Thanks for any help you can provide.

Julio Raffo

01 Sep 2020, 09:47

Hi, answers to both questions (scoring and similarity) can be found in matchit's help file:

Code:

help matchit

Just check that you have the latest version. For addresses, my experience is that minsimple works better when the addresses are not very standardized. The auto-generated stopwords can also help.
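One way to check this on the addresses from the question is to score a few toy pairs under different similarity and scoring combinations. The following is only a sketch: the variable names (addr1, addr2, s_*) are made up for illustration, and it assumes matchit's two-variable syntax with the similmethod(), score() and generate() options as described in the help file, so please verify the option names against your installed version.

```stata
* Sketch only: toy data built from the addresses in the question.
* Variable names (addr1, addr2, s_*) are hypothetical.
clear
input str20 addr1 str20 addr2
"123 Main Street." "321 Main Street."
"123 Main Street." "123 Main"
"123 Main Street." "123 Main St"
end

* Two-variable syntax; option names assumed to be as in -help matchit-
matchit addr1 addr2, similmethod(token)  score(minsimple) generate(s_tok_minsimple)
matchit addr1 addr2, similmethod(token)  score(jaccard)   generate(s_tok_jaccard)
matchit addr1 addr2, similmethod(bigram) score(minsimple) generate(s_big_minsimple)
matchit addr1 addr2, similmethod(bigram) score(jaccard)   generate(s_big_jaccard)

* Compare the four scores side by side for each pair
list, noobs
```

Comparing the scores row by row should show which combination keeps "321 Main Street." apart from "123 Main Street." while still scoring the shortened forms highly.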

In any case, you can also try the code below to understand what each similarity function is actually doing.

Code:

which matchit
do c:\ado\plus\m\matchit.ado // replace this with the path reported by -which- above
mata:
mata d simf*() // this lists all the existing similarity options
A=simf_bigram("Nick Cox")
asarray_keys(A)
A=simf_ngram("Nick Cox",3)
asarray_keys(A)
A=simf_token("Nick Cox")
asarray_keys(A)
A=simf_soundex("Nick Cox")
asarray_keys(A)
A=simf_token_soundex("Nick Cox")
asarray_keys(A)
// Repeat for as many of the simf_*() functions listed by the mata describe command as you want
end

The above should output something like this:

Code:

. which matchit
c:\ado\plus\m\matchit.ado
*! 1.5.2 J.D. Raffo May 2020

. do c:\ado\plus\m\matchit.ado

. *! 1.5.2 J.D. Raffo May 2020
. program matchit
  1.  version 12
  2.  syntax varlist(min=2 max=2) ///

[~~truncated~~]

:
: end
--------------------------------------------------------------------------------------------------------------------------------------------------

.
end of do-file

. mata:
------------------------------------------------- mata (type end to exit) ------------------------------------------------------------------------
: mata d simf*() // this lists all the existing similarity options

      # bytes   type                        name and extent
-------------------------------------------------------------------------------
          744   transmorphic matrix         simf_bigram()
          776   transmorphic matrix         simf_cotoken()
          572   transmorphic matrix         simf_firstgram()
          764   transmorphic matrix         simf_ngram()
          944   transmorphic matrix         simf_ngram_circ()
        4,572   transmorphic matrix         simf_nysiis_fk()
        1,208   transmorphic matrix         simf_scotoken()
          244   transmorphic matrix         simf_soundex()
        1,148   transmorphic matrix         simf_soundex_ext()
        1,148   transmorphic matrix         simf_soundex_fk()
          244   transmorphic matrix         simf_soundex_nara()
        1,220   transmorphic matrix         simf_tkngram()
          740   transmorphic matrix         simf_token()
          692   transmorphic matrix         simf_token_soundex()
        1,348   transmorphic matrix         simf_tokenwrap()
-------------------------------------------------------------------------------

: A=simf_bigram("Nick Cox")

: asarray_keys(A)
        1
    +------+
  1 |  ic  |
  2 |  ck  |
  3 |  k   |
  4 |   C  |
  5 |  ox  |
  6 |  Ni  |
  7 |  Co  |
    +------+

: A=simf_ngram("Nick Cox",3)

: asarray_keys(A)
         1
    +-------+
  1 |  k C  |
  2 |  Cox  |
  3 |  Nic  |
  4 |  ick  |
  5 |  ck   |
  6 |   Co  |
    +-------+

: A=simf_token("Nick Cox")

: asarray_keys(A)
          1
    +--------+
  1 |   Cox  |
  2 |  Nick  |
    +--------+

: A=simf_soundex("Nick Cox")

: asarray_keys(A)
  N220

: A=simf_token_soundex("Nick Cox")

: asarray_keys(A)
          1
    +--------+
  1 |  N200  |
  2 |  C200  |
    +--------+

: end
--------------------------------------------------------------------------------------------------------------------------------------------------
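The same inspection can be repeated on the address strings from the question. This is only a sketch: it assumes matchit.ado has already been run with -do- as above, so that the simf_*() functions are defined in Mata, and it uses only the functions and call patterns shown in the listing. No output is reproduced here; run it and compare the key lists.

```stata
* Sketch: assumes matchit.ado was run with -do- as above, so simf_*() exist in Mata
mata:
// token: each address is split into whole words
A=simf_token("123 Main Street.")
asarray_keys(A)
A=simf_token("321 Main Street.")
asarray_keys(A)
A=simf_token("123 Main St")
asarray_keys(A)
// bigram: each address is split into overlapping two-character pieces
A=simf_bigram("123 Main Street.")
asarray_keys(A)
A=simf_bigram("321 Main Street.")
asarray_keys(A)
end
```

The keys that two strings have in common are the pieces that can contribute to their similarity, so comparing the lists for "123 Main Street." and "321 Main Street." should indicate how much of the street number each method treats as shared.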
