Problems Using the Matchit Command

Carlos Cardoso

Join Date: Mar 2020
Posts: 5

29 Mar 2020, 13:59

Hi,

I am trying to match two different databases using a string variable that contains company names.

The first database contains 500,000 observations and the second contains 1,200.00 observations.

To carry out the match, I use the -matchit- command with the following specification:

Code:

matchit id_Dbase1 n_firmDbase1 using "C:\WorkArea\Dbase2.dta", idusing(id_Dbase2) txtusing(n_firmDbase2) override sim(token_soundex)

The result of this specification is as follows:

Code:

Matching current dataset with C:\WorkArea\Dbase2.dta
Similarity function: token_soundex
Loading USING file: C:\WorkArea\Dbase2.dta
Indexing USING file.
0%
20%
40%
60%
80%
Done!
Computing results
        Percent completed ...   (search space saved by index so far)
                     J():  3900  unable to allocate real <tmp>[329143,1]
      asarray_create_u():     -  function returned error
       asarray_rebuild():     -  function returned error
               asarray():     -  function returned error
asarray_index_intersect():     -  function returned error
        core_computing():     -  function returned error
                 <istmt>:     -  function returned error
r(3900);

end of do-file

r(3900);

Could someone give me a suggestion on how I could match the two databases?


Clyde Schechter

Join Date: Apr 2014

Posts: 30089
#2

29 Mar 2020, 14:36

What the message is telling you is that this problem requires more memory than your computer and operating system are able to give it. It is likely that you are pairing up large numbers of pretty loose (and unlikely to be correct) matches.

I have a couple of suggestions, which may or may not be applicable to your situation, and may or may not work even if they are.

1. Is it possible to break these data sets into smaller pieces, match the pieces, and put the results together? For example, if the data sets have variables defining the home countries of the firms, or their industries, it might be reasonable to do a separate match for each country or industry (or combination of both) and then put the results back together by -append-ing the final results.

2. Clean up the variables you are trying to match by making them all upper case and applying the -trim()- and -itrim()- functions. Also strip out punctuation characters. This will convert a bunch of fuzzy matches into exact matches that you can identify with a simple -merge-. Then use -matchit- only to find fuzzy matches for the ones that have no exact match, and append the results together.

3. Set the -threshold()- option. The default value, which is what you are getting now, is a match score of 0.5. If you raise that, you will lose some loose potential matches, but reduce the amount of memory required. In my experience, matches with similarity scores that low are seldom right. Try a threshold of, say, 0.7 or even 0.8: you will get a smaller set of potential matches and probably only lose a handful of correct matches, if any.

4. Try using a different similarity score. Soundex (and token soundex) don't extract a whole lot of information from the strings--they work very well on human names (which is what they were developed for in the first place) because there is a great deal of redundancy in the spelling of human names. But firm names are wilder, and a more informative similarity score might reduce the number of low-probability matches that get a high score on soundex. When I use -matchit- for other than human names, I usually use bigram.
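Suggestions 2 to 4 above can be combined into one workflow. What follows is only a rough sketch under several assumptions: the master file (Dbase1) is already in memory, both ID variables are numeric, the cleaned names turn out to be unique in Dbase2 (otherwise -merge m:1- will stop with an error), and the names contain only ASCII characters. The names name_clean, name1_clean, Dbase2_clean.dta and exact_matches.dta are made up for illustration; the other names are taken from the command in #1.

```stata
* --- Clean the master names (Dbase1 is assumed to be in memory) ---
generate name_clean = upper(n_firmDbase1)
replace  name_clean = ustrregexra(name_clean, "[^A-Z0-9 ]", " ")  // strip punctuation (ASCII only)
replace  name_clean = itrim(trim(name_clean))                     // collapse and trim blanks
tempfile master_clean
save "`master_clean'"

* --- Clean the using names the same way and save a cleaned copy ---
use "C:\WorkArea\Dbase2.dta", clear
generate name_clean = upper(n_firmDbase2)
replace  name_clean = ustrregexra(name_clean, "[^A-Z0-9 ]", " ")
replace  name_clean = itrim(trim(name_clean))
save "Dbase2_clean.dta", replace          // hypothetical file name

* --- Exact matches via -merge- (assumes name_clean is unique in Dbase2) ---
use "`master_clean'", clear
merge m:1 name_clean using "Dbase2_clean.dta", keepusing(id_Dbase2)
preserve
keep if _merge == 3
keep id_Dbase1 id_Dbase2 name_clean
save "exact_matches.dta", replace         // hypothetical file name
restore

* --- Fuzzy matching only for master records with no exact match ---
keep if _merge == 1
drop _merge id_Dbase2
rename name_clean name1_clean             // avoid a name clash with txtusing()
matchit id_Dbase1 name1_clean using "Dbase2_clean.dta", ///
    idusing(id_Dbase2) txtusing(name_clean)             ///
    sim(bigram) threshold(0.8) override

* --- Put exact and fuzzy results together ---
append using "exact_matches.dta"
```

The 0.8 threshold and the bigram method are just the values mentioned above, not a recommendation for every data set. The exact matches come back with a missing similarity score after the -append-, which makes them easy to tell apart from the fuzzy ones.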

If all else fails, you can try to find a computer that has a lot more RAM to run this.

I hope others who have experience using -matchit-, and its author, Julio Raffo, will read this thread and contribute their ideas as well.
Julio Raffo

Join Date: May 2014

Posts: 132
#3

30 Mar 2020, 03:26

Clyde, as usual, is right. It seems to be a memory problem. -Matchit- tries its best to be memory efficient, but clearly it's not perfect. In particular, -matchit- needs to hold in memory the ids and names for both files, the index created from one of these (the using file), and the results matrix/array. So:

- If your master and using files are too big for your current RAM, you need to split these (Clyde's option 1) or get more RAM (Clyde's last comment). You can check if this is the case by looking at how much of your memory Stata is "eating" right after the text "Indexing USING file." appears.

- If you have enough memory left, then you can check if the index is too big. This one is trickier. You need to run -matchit- with the keepmata option. After crashing, go to Mata and check how big the index (INDEXU) is. It should look something like the listing below, but with larger byte counts for INDEXU (and for IDM, IDU, TXTM and TXTU, which show how much of your memory the master and using files are taking). As you can see by the proportions in my example, the index is rarely the problem, as it takes a tiny fraction of the memory in comparison to the files. But in case you want to try, Clyde's option 4 addresses this issue.

Code:

. mata:
------------------------------------------------- mata (type end to exit) ------
: mata d

      # bytes   type                        name and extent
-------------------------------------------------------------------------------
            8   real scalar                 FLAG
        8,000   real colvector              IDM[1000]
        8,000   real colvector              IDU[1000]
            8   struct scalar               INDEXU
            8   struct scalar               STOPWARRAY
            8   real scalar                 THRESHOLD
            8   real scalar                 TIME
       27,391   string colvector            TXTM[1000]
       27,391   string colvector            TXTU[1000]
            8   struct scalar               WGTARRAY
            8   struct scalar               WGTU
           40   real rowvector              newvars[5]
            8   pointer scalar              scorefunc_p
            8   pointer scalar              similfunc_p
-------------------------------------------------------------------------------

: end
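To get to that listing from a do-file, something along these lines might work. It is a sketch that reuses the command from #1 unchanged apart from adding keepmata; wrapping it in -capture noisily- is my assumption about how to keep the do-file running after the r(3900) error so that the Mata workspace can be inspected.

```stata
* Re-run the original command, keeping matchit's Mata objects after it stops.
* -capture noisily- shows the output but does not abort the do-file on error.
capture noisily matchit id_Dbase1 n_firmDbase1 using "C:\WorkArea\Dbase2.dta", ///
    idusing(id_Dbase2) txtusing(n_firmDbase2) sim(token_soundex) override keepmata

* Report the return code (3900 means Mata could not allocate memory)
display "matchit return code: " _rc

* List the Mata objects left behind and their sizes in bytes
mata: mata describe
```

Compare the byte counts for IDM, IDU, TXTM and TXTU against INDEXU and against the RAM you have available.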

- If all of these leave enough space, then it is the final results that are the problem. In this case, raising the threshold (Clyde's option 3) might solve the issue. I agree that results below a .7 or .8 threshold are only rarely of much use. But this depends, of course, on the nature of your data.
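For the first case in the list above (files too big for RAM), the splitting idea could be sketched roughly as follows. This assumes both files contain a string variable called country, which is purely hypothetical, that the master file (Dbase1) is in memory, and that both ID variables are numeric. The file names using_slice.dta and all_matches.dta are also made up; bigram and the 0.8 threshold are only the values discussed in this thread.

```stata
* Keep a copy of the master so slices can be reloaded inside the loop
tempfile master
save "`master'"

* One pass per value of the (hypothetical) blocking variable
levelsof country, local(countries)
local first = 1
foreach c of local countries {

    * Slice of the using file for this country
    use if country == `"`c'"' using "C:\WorkArea\Dbase2.dta", clear
    if _N == 0 {
        continue                          // no candidates in this block
    }
    save "using_slice.dta", replace

    * Slice of the master file for this country
    use if country == `"`c'"' using "`master'", clear

    * Match within the block only
    matchit id_Dbase1 n_firmDbase1 using "using_slice.dta", ///
        idusing(id_Dbase2) txtusing(n_firmDbase2)           ///
        sim(bigram) threshold(0.8) override

    * Tag the results and accumulate them
    generate country = `"`c'"'
    if `first' == 0 {
        append using "all_matches.dta"
    }
    save "all_matches.dta", replace
    local first = 0
}
```

Each pass only needs the two slices, their index and their results in memory, at the cost of never comparing firms across blocks, so the blocking variable has to be one you trust in both files.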

I hope this helps.

Best,

J.
Carlos Cardoso

Join Date: Mar 2020

Posts: 5
#4

30 Mar 2020, 12:41

Hi, Clyde and Julio. Thank you very much for your valuable comments and suggestions.
CC