
  • matchit for very large dataset

    Hey all,

    I have read several posts on matchit and a couple of other matching algorithms (see the end of my post).

    I have two datasets I want to match by firm name: the master data has 600,000 names (across 20-plus years), and the second dataset has 13,000 distinct names. I have not yet run matchit on these two datasets, due to my concern over the matching time.

    For matchit, some suggest dropping common words; in the case of firm names, "inc", "co", and "limited" might be common. That helps speed up matching, especially given the large size of my datasets.

    My questions:
    1. How do you extract common words such as "inc", "co", and "limited", and then drop them?
    e.g. for "ONTARIO LIMITED", how do I drop "LIMITED"?
    2. In this post (http://www.statalist.org/forums/foru...-two-databases) Clyde and Julio suggest pretreating names:
    gen win_name2 = upper(trim(itrim(win_name)))

    If there are other such transformations I should apply to firm names, could you please share them here?

    Thanks,
    Rochelle

    http://www.statalist.org/forums/foru...-two-databases
    http://www.statalist.org/forums/foru...reclink-syntax
    http://www.statalist.org/forums/foru...-e-fuzzy-match

  • #2
    A few suggestions: after the pre-treatment involving upper(trim(itrim())) that you already noted, you can do

    Code:
    replace win_name2 = subinstr(win_name2, "LIMITED", "", .)
    Removing INC and CO is harder, because they can occur as part of the substance of the name. For example, COORS BREWING contains CO. And a hypothetical firm INCOME MANAGEMENT SERVICES, INC contains two instances of INC, one of which would be nice to drop, but the other would not. I suppose there is some risk of this with LIMITED (or its abbreviation LTD), but that seems like less of a problem. So I would probably eliminate occurrences of LIMITED or LTD, but probably would not do so for INC, and definitely not for CO, as the latter simply appears too often in real names.
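    One way around the whole-word problem (a sketch, not from the original posts, using the same space-padding idea Robert describes in #3) is to pad each name with spaces and remove the abbreviation only when it is delimited by spaces, so COORS BREWING is left alone:

    ```stata
    * Hedged sketch: remove CO only as a standalone word, not as a substring
    gen win_name2 = " " + upper(trim(itrim(win_name))) + " "
    replace win_name2 = subinstr(win_name2, " CO ", " ", .)
    replace win_name2 = trim(itrim(win_name2))
    ```

    The trailing pad matters: a firm ending in "CO" becomes "... CO " and so still matches the delimited pattern.
    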

    Other transformations that might be helpful here include the removal of punctuation marks like commas, periods, etc.
    Code:
    replace win_name2 = subinstr(win_name2, "," , "", .)
    replace win_name2 = subinstr(win_name2, ".", "", .)
    All of that said, is this really necessary? Is there no better identifier you can use than corporate names? What about things like gvkey and permno that seem to be widely used in finance? Or, there must be taxpayer registration numbers that might be more tractable than names, no? Is nothing like that available to you?
    Last edited by Clyde Schechter; 10 May 2017, 23:03.



    • #3
      Note that when looking for common abbreviations, it helps to pad the string with a space on both ends. Here are some excerpts from a company-name clean-up do-file I have worked on.

      Code:
      * Reduce to one instance per name
          
          drop raw_id
          by name: keep if _n == 1
          
          clonevar shortname = name
          
          
      * Initialize the listdiff variable
      
          set seed 12345
          listdiff shortname
      
      * Replace remaining non-printable characters, ascii < 32 or > 126
      
          tempfile test
          outsheet shortname using "`test'", noquote nonames
          hexdump "`test'", tab results
          
          forvalues i = 0/31 {
              if r(c`i') local blankit `blankit' `i'
          }
          forvalues i = 127/255 {
              if r(c`i') local blankit `blankit' `i'
          }
          
          foreach c in `blankit' {
              dis as txt "replacing char " as res `c' as txt " to blank"
              replace shortname = subinstr(shortname,char(`c'),"",.)
              listdiff shortname, noobs sep(0)
          }
      
      * Remove all punctuation characters or other characters
      * that don't make the name more distinctive
          
          replace shortname = subinstr(shortname,"."," ",.)
          listdiff shortname, noobs sep(0)
          replace shortname = subinstr(shortname,"'","",.)
          listdiff shortname, noobs sep(0)
          replace shortname = subinstr(shortname,"`","",.)
          listdiff shortname, noobs sep(0)
          replace shortname = subinstr(shortname,","," ",.)
          listdiff shortname, noobs sep(0)
          replace shortname = subinstr(shortname,"-"," ",.)
          listdiff shortname, noobs sep(0)
          replace shortname = subinstr(shortname,"_"," ",.)
      
      * Make sure that we have a space at the beginning and end so that we can
      * search words delimited by spaces.
      
          replace shortname = " " + shortname + " "
          listdiff shortname, reset
          
          replace shortname = subinstr(shortname," INCORPORATED ", " ",.)
          listdiff shortname, noobs sep(0)
          replace shortname = subinstr(shortname," INC ", " ",.)
          listdiff shortname, noobs sep(0)
          replace shortname = subinstr(shortname," CORPORATION ", " ",.)
          listdiff shortname, noobs sep(0)
          replace shortname = subinstr(shortname," CORP ", " ",.)
          listdiff shortname, noobs sep(0)
          replace shortname = subinstr(shortname," COMPANY ", " ",.)
          listdiff shortname, noobs sep(0)
          replace shortname = subinstr(shortname," CO ", " ",.)
          listdiff shortname, noobs sep(0)
          replace shortname = subinstr(shortname," LIMITED ", " ",.)
          listdiff shortname, noobs sep(0)
          replace shortname = subinstr(shortname," LTD ", " ",.)
          listdiff shortname, noobs sep(0)
          replace shortname = subinstr(shortname," LLC ", " ",.)
          listdiff shortname, noobs sep(0)
          replace shortname = subinstr(shortname," L L C ", " ",.)
          listdiff shortname, noobs sep(0)
          replace shortname = subinstr(shortname," LP ", " ",.)
      With hundreds of thousands of names, some replace statements will affect a large number of observations, and you cannot possibly eyeball each of these changes to assure yourself that there is no unforeseen collateral damage. This is where the listdiff program comes in handy: it lists a random sample (20 by default) of names that were changed by the replace. Here's the source; just save the text in a file called "listdiff.ado" somewhere Stata can find it (help adopath). The program uses randomtag, which you can install from SSC.

      Code:
      *! version 1.1.0, 03aug2014, Robert Picard, [email protected]      
      program listdiff
      
          version 13
          
          syntax varname , ///
          [ ///
          reset ///
          Prefix(string) ///
          Suffix(string) ///
          MAXimum(integer 20) ///
          noUPdate ///
          * ]
          
          if _N == 0 error 2000
      
          if `maximum' <= 0 error 411
      
          if "`prefix'" != "" & "`suffix'" != "" {
              dis as err "prefix and suffix options cannot be combined"
              exit 198
          }
          
          if "`prefix'`suffix'" == "" local prefix _
                  
          cap confirm var `prefix'`varlist'`suffix'
          if "`reset'" != "" | _rc {
              cap drop `prefix'`varlist'`suffix'
              clonevar `prefix'`varlist'`suffix' = `varlist'
              exit
          }
          
          qui count if `prefix'`varlist'`suffix' != `varlist'
          dis as txt "Number of differences found = " as res r(N)
          
          if r(N) {
          
              tempname tag
              cap randomtag if `prefix'`varlist'`suffix' != `varlist', count(`maximum') gen(`tag')
                      
              list `prefix'`varlist'`suffix' `varlist' if `tag', `options'
              
              if "`update'" == "" qui replace `prefix'`varlist'`suffix' = `varlist'
              
          }
      end
      You should always use a log file when you run any do-file. With listdiff, the random sample of changes made becomes part of the record.

      As you build up the do-file, you may find listsome (also from SSC) handy. For example, suppose you think it's safe to blank out a period in the name. If you use listsome before going ahead with the change, you will get a good idea of what will be affected. Again, when working with very large datasets, there's no way you can eyeball all cases over the full span of the dataset.

      Code:
      . sysuse auto, clear
      (1978 Automobile Data)
      
      . listsome make if strpos(make,".")
      
           +-------------------+
           | make              |
           |-------------------|
       11. | Cad. Deville      |
       12. | Cad. Eldorado     |
       13. | Cad. Seville      |
       14. | Chev. Chevette    |
       15. | Chev. Impala      |
           |-------------------|
       16. | Chev. Malibu      |
       17. | Chev. Monte Carlo |
       18. | Chev. Monza       |
       19. | Chev. Nova        |
       23. | Dodge St. Regis   |
           |-------------------|
       26. | Linc. Continental |
       27. | Linc. Mark V      |
       28. | Linc. Versailles  |
       29. | Merc. Bobcat      |
       30. | Merc. Cougar      |
           |-------------------|
       31. | Merc. Marquis     |
       32. | Merc. Monarch     |
       33. | Merc. XR-7        |
       34. | Merc. Zephyr      |
       42. | Plym. Arrow       |
           +-------------------+
      
      .
      Last edited by Robert Picard; 11 May 2017, 07:53. Reason: added a comment about log files and a listsome example



      • #4
        @Clyde, thank you for your suggestions! Unfortunately the names come from two different databases, and they do not have any other identifier that can be used for matching, such as gvkey, cusip, ticker, or permno.


        Robert, thank you for making your code available and for suggesting listsome! This forum has so many great members like you, Clyde, Nick, etc. who help us a great deal.

        I have a couple of questions about the code in your #3. It runs well for my master name data, with 600,000 names from 30-plus years (at quarterly intervals). I have already applied the keep-one-instance-per-name step.

        *** Q1: ***


        Code:
        * Replace remaining non-printable characters, ascii < 32 or > 126

            tempfile test
            outsheet shortname using "`test'", noquote nonames
            hexdump "`test'", tab results
        What does this do?


        *** Q2: ***
        Do I need to change the ranges 0/31 and 127/255 based on my dataset?
        Code:
            forvalues i = 0/31 {
                if r(c`i') local blankit `blankit' `i'
            }
            forvalues i = 127/255 {
                if r(c`i') local blankit `blankit' `i'
            }

            foreach c in `blankit' {
                dis as txt "replacing char " as res `c' as txt " to blank"
                replace shortname = subinstr(shortname,char(`c'),"",.)
                listdiff shortname, noobs sep(0)
            }

        Thank you.





        • #5
          That whole section is there because the data I was using had keying errors that inserted non-ASCII characters. The code shows a technique to replace each invalid character with a space. Just disregard this section (do not use it) if this does not apply to you.



          • #6
            Just one comment on a relatively new feature of -matchit- that wasn't there at the time of the post you mention in #1. -matchit- can now automatically generate a list of words (or, more precisely, grams) that are too frequent in your datasets (the diagnose option). It also lets you choose to ignore these in the matching (and scoring) process, which will speed up the process (the stopwordsauto option). This feature combined with weights usually gives a good boost in speed and accuracy without the need to engage in major clean-ups. Of course, by all means, the more you prepare your data with suggestions like Robert's, the better. Indeed, using the diagnose option with the sim(token) similarity algorithm will give you a hint of where to start.
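            For concreteness, a hypothetical call combining those options might look like the sketch below. The file and variable names simply mirror the command shown earlier in this thread; check help matchit for the exact option names in your installed version.

            ```stata
            * Sketch: report overly frequent grams (diagnose), then ignore them
            * while matching and scoring (stopwordsauto), with log weights
            matchit id1 name1 using Data\roaster.dta, idu(id2) txtu(name2) ///
                sim(token) w(log) diagnose stopwordsauto
            ```
            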

            You can find an example of these features in #16 of this post



            • #7
              Thanks Julio!

              If any of you have experience with matchit, could you please read on.


              As I stated earlier, my master data has 600,000 names and my using dataset has 13,000 names. I match with the line below.

              I ended up with over 900,000 matches; after dropping matches with a similarity score below 0.7, I have 30,000+ records. I am showing some matches below.

              Code:
              matchit id1 name1 using Data\roaster, idu(id2) txtu(name2) w(log) sim(ngram,3)
              As you can see, even with a high score, the names do not seem to indicate the same firm.

              My question: should I adjust options used in my matchit code to decrease the amount of false matches ?

              My sample output (after running matchit):

              Code:
              clear
              input long id1 str22 name1 int id2 str25 name2 double similscore
              88751 "CAPITAL ADVISORS" 2134 "CATALYST CAPITAL ADVISORS" .80200012
              88751 "CAPITAL ADVISORS" 2134 "CATALYST CAPITAL ADVISORS" .80200012
              40470 "ASSET MANAGEMENT" 3241 "DELTA ASSET MANAGEMENT" .8020105
              261464 "ING CAPITAL MANAGEMENT" 5858 "IMS CAPITAL MANAGEMENT" .80191802
              end

              Rows 3 and 4 do not seem to be good matches despite their high similscore.



              • #8
                The best solution is to use any secondary data you may have, like addresses or zip codes. Alternatively (or complementarily), you could reapply -matchit- using the columns syntax and selecting other algorithms. For instance, you could use the option sim(token) with weights. This should make terms like "management", "capital", or "asset" less meaningful for the score (if and only if these terms are found more frequently in the data).
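                As an illustration of the columns syntax, here is a hedged sketch: the variable names follow the output shown in #7, and gen() is assumed to name the new score variable (see help matchit for the options your version supports).

                ```stata
                * Sketch: re-score already-paired names with a token-based,
                * log-weighted similarity, stored alongside the original score
                matchit name1 name2, sim(token) w(log) gen(simil_token)
                ```

                Comparing simil_token against the original similscore should help flag pairs that scored high only because of frequent filler words.
                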

