
  • Problems Using the Matchit Command

    Hi,

    I am trying to match two different databases on a string variable that holds company names.

    The first database contains 500,000 observations and the second contains 1,200.00 observations.

    To carry out the match, I use the -matchit- command with the following specification:

    Code:
    matchit id_Dbase1 n_firmDbase1 using "C:\WorkArea\Dbase2.dta", idusing(id_Dbase2) txtusing(n_firmDbase2) override sim(token_soundex)
    This produces the following output:

    Code:
    Matching current dataset with C:\WorkArea\Dbase2.dta
    Similarity function: token_soundex
    Loading USING file: C:\WorkArea\Dbase2.dta
    Indexing USING file.
    0%
    20%
    40%
    60%
    80%
    Done!
    Computing results
            Percent completed ...   (search space saved by index so far)
                         J():  3900  unable to allocate real <tmp>[329143,1]
          asarray_create_u():     -  function returned error
           asarray_rebuild():     -  function returned error
                   asarray():     -  function returned error
    asarray_index_intersect():     -  function returned error
            core_computing():     -  function returned error
                     <istmt>:     -  function returned error
    r(3900);
    
    end of do-file
    
    r(3900);
    Could someone give me a suggestion on how I could match the two databases?

  • #2
    What the message is telling you is that this problem requires more memory than your computer and operating system are able to give it. It is likely that you are pairing up large numbers of pretty loose (and unlikely to be correct) matches.

    I have a couple of suggestions, which may or may not be applicable to your situation, and may or may not work even if they are.

    1. Is it possible to break these data sets into smaller pieces, match the pieces, and put the results together? For example, if the data sets have variables defining the home countries of the firms, or their industries, it might be reasonable to do a separate match for each country or industry (or combination of both) and then combine the final results by -append-ing them.

    2. Clean up the variables you are trying to match by making them all upper case, and apply the -trim()- and -itrim()- functions. Also strip out punctuation characters. This will convert a bunch of fuzzy matches into exact matches that you can identify with simple -merge-. Then use -matchit- only to find fuzzy matches for the ones that have no exact match, and append the results together.

    3. Set the -threshold()- option. The default value, which is what you are getting now, is a match score of 0.5. If you raise that, you will lose some loose potential matches, but reduce the amount of memory required. In my experience, matches with similarity scores that low are seldom right. Try a threshold of, say, 0.7 or even 0.8: you will get a smaller set of potential matches and probably only lose a handful of correct matches, if any.

    4. Try using a different similarity method. Soundex (and token soundex) don't extract a whole lot of information from the strings. They work very well on human names (which is what they were developed for in the first place) because there is a great deal of redundancy in the spelling of human names. But firm names are wilder, and a more informative similarity method might reduce the number of low-probability matches that get a high score on soundex. When I use -matchit- for anything other than human names, I usually use bigram.
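
    Suggestions 2 to 4 could be combined roughly as follows. This is an untested sketch, not a definitive recipe. The file and variable names come from the original post where they were given; "Dbase1.dta", "Dbase2_clean.dta", name_clean1 and name_clean2 are made-up names, similscore is assumed to be the name -matchit- gives its score variable by default, and ustrregexra() assumes Stata 14 or later.

    ```stata
    * Sketch only. "Dbase1.dta", "Dbase2_clean.dta", name_clean1 and
    * name_clean2 are hypothetical names used for illustration.

    * --- Suggestion 2: standardize names in the using file ---
    use "C:\WorkArea\Dbase2.dta", clear
    generate name_clean2 = upper(n_firmDbase2)
    replace  name_clean2 = ustrregexra(name_clean2, "\p{P}", " ")  // punctuation to blank
    replace  name_clean2 = itrim(trim(name_clean2))                // collapse and trim blanks
    keep id_Dbase2 name_clean2
    duplicates drop name_clean2, force   // simplification: keeps one id per cleaned name
    save "C:\WorkArea\Dbase2_clean.dta", replace

    * --- Same cleaning for the master file ---
    use "C:\WorkArea\Dbase1.dta", clear
    generate name_clean1 = upper(n_firmDbase1)
    replace  name_clean1 = ustrregexra(name_clean1, "\p{P}", " ")
    replace  name_clean1 = itrim(trim(name_clean1))

    * --- Exact matches with a plain merge ---
    generate name_clean2 = name_clean1   // merge key must have the same name in both files
    merge m:1 name_clean2 using "C:\WorkArea\Dbase2_clean.dta", keep(master match)

    preserve
    keep if _merge == 3
    keep id_Dbase1 name_clean1 id_Dbase2 name_clean2
    generate similscore = 1              // exact matches get the top score
    tempfile exact
    save `exact'
    restore

    * --- Suggestions 3 and 4: fuzzy match only the leftovers ---
    keep if _merge == 1
    keep id_Dbase1 name_clean1
    matchit id_Dbase1 name_clean1 using "C:\WorkArea\Dbase2_clean.dta", ///
        idusing(id_Dbase2) txtusing(name_clean2) ///
        sim(bigram) threshold(0.7) override

    * --- Put exact and fuzzy results back together ---
    append using `exact'
    ```

    The -duplicates drop- step is only there so that -merge m:1- has a unique key; if several firms in the second file share a cleaned name, you would want to handle them explicitly rather than drop them.
    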

    If all else fails, you can try to find a computer that has a lot more RAM to run this.
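
    Before looking for a bigger machine, suggestion 1 might look roughly like the loop below. It is an untested sketch that assumes both files contain a numeric blocking variable, here called country (hypothetical), that the first file is saved as "C:\WorkArea\Dbase1.dta" (an assumed path), and that "Dbase2_block.dta" is a scratch file you are free to overwrite.

    ```stata
    * Sketch only: country, "Dbase1.dta" and "Dbase2_block.dta" are assumptions.
    use "C:\WorkArea\Dbase1.dta", clear
    levelsof country, local(countries)       // list of blocks to match one at a time

    tempfile all_matches
    local first = 1

    foreach c of local countries {
        * slice of the using file for this block
        use if country == `c' using "C:\WorkArea\Dbase2.dta", clear
        if _N == 0 {
            continue                         // nothing to match against in this block
        }
        save "C:\WorkArea\Dbase2_block.dta", replace

        * slice of the master file for the same block
        use if country == `c' using "C:\WorkArea\Dbase1.dta", clear
        matchit id_Dbase1 n_firmDbase1 using "C:\WorkArea\Dbase2_block.dta", ///
            idusing(id_Dbase2) txtusing(n_firmDbase2) override sim(token_soundex)
        generate country = `c'               // remember which block the pairs came from

        * accumulate results across blocks
        if `first' == 0 {
            append using `all_matches'
        }
        save `all_matches', replace
        local first = 0
    }

    use `all_matches', clear
    ```

    Each pass holds only one block of each file in memory, at the cost of rereading the files once per block. Firms recorded under different countries in the two files will never be paired, so the blocking variable has to be reliable.
    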

    I hope others who have experience using -matchit-, and its author, Julio Raffo, will read this thread and contribute their ideas as well.



    • #3
      Clyde, as usual, is right. It seems to be a memory problem. -matchit- tries its best to be memory efficient, but clearly it's not perfect. In particular, -matchit- needs to hold in memory the ids and names for both files, the index created from one of them (the using file), and the results matrix/array. So:

      - If your master and using files are too big for your current RAM, you need to split them (Clyde's option 1) or get more RAM (Clyde's last comment). You can check whether this is the case by looking at how much of your memory Stata is using right after the text "Indexing USING file" appears.

      - If you have enough memory left, you can check whether the index is too big. This one is trickier. You need to run -matchit- with the keepmata option. After it crashes, go into Mata and check how big the index (INDEXU) is. It should look something like the listing below, but with a larger # bytes for INDEXU (and for IDM, IDU, TXTM and TXTU, which show how much of your memory the master and using files are taking). As you can see from the proportions in my example, the index is rarely the problem, as it takes a tiny fraction of the memory compared with the files. But in case you want to try, Clyde's option 4 addresses this issue.

      Code:
      . mata:
      ------------------------------------------------- mata (type end to exit) ------------------------------------------------------------------------
      : mata d
      
            # bytes   type                        name and extent
      -------------------------------------------------------------------------------
                  8   real scalar                 FLAG
              8,000   real colvector              IDM[1000]
              8,000   real colvector              IDU[1000]
                  8   struct scalar               INDEXU
                  8   struct scalar               STOPWARRAY
                  8   real scalar                 THRESHOLD
                  8   real scalar                 TIME
             27,391   string colvector            TXTM[1000]
             27,391   string colvector            TXTU[1000]
                  8   struct scalar               WGTARRAY
                  8   struct scalar               WGTU
                 40   real rowvector              newvars[5]
                  8   pointer scalar              scorefunc_p
                  8   pointer scalar              similfunc_p
      -------------------------------------------------------------------------------
      end
      - If all of these leave enough space, then it is the final results that are the problem. In this case changing the threshold (Clyde's option 3) might solve the issue. I agree that results below a threshold of .7 or .8 are only rarely of much use. But this depends, of course, on the nature of your data.
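
      The index check in the second point could be run along these lines. This is a rough sketch that reuses the call from the original post; -capture noisily- is only there so the do-file keeps going after the r(3900) error.

      ```stata
      * Sketch only: rerun the original call with keepmata and let it fail.
      capture noisily matchit id_Dbase1 n_firmDbase1 using "C:\WorkArea\Dbase2.dta", ///
          idusing(id_Dbase2) txtusing(n_firmDbase2) override sim(token_soundex) keepmata

      mata: mata describe    // list the Mata objects left behind and their sizes
      memory                 // Stata's own report of memory in use
      mata: mata clear       // release the retained objects before the next attempt
      ```

      Clearing Mata at the end matters, because the objects kept by keepmata otherwise stay in memory and make the next run even tighter.
      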

      I hope this helps.

      Best,

      J.




      • #4

        Hi, Clyde and Julio. Thank you very much for your valuable comments and suggestions.
        CC
