  • Fuzzy matching - Large data set, using reclink command

    Hi,
    I'm trying to fuzzy match a census file with a migrant data set, using three key variables: full name, age and county of residence. I want to match observations that have exactly the same age and county, while allowing the full name to differ somewhat because of spelling errors. So I require a matching score of 0.9 or above.

    I am using the reclink command:
    reclink full_name age1 county using "/Users/ciaral/Dropbox/New Merging/using.dta", idm(idm) idu(idu) required(age1 county) _merge(mergedata1) uprefix(ellis) gen(score) minscore(.9)

    The problem is that my master data set is too large: around 2.5 million observations. The using data set has around 0.3 million. As a result, reclink runs too slowly and just shows the perfect matches.

    I have tried splitting the master data set into smaller files, but unless each file has 50,000 observations or fewer, it runs too slowly.
    Do you know of any way to handle large data sets with reclink, or of another method of fuzzy matching in Stata?

    P.S. I tried dropping the unnecessary variables from both data sets, but it is still very slow.

    Thanks a lot for your help!
    Ciara


  • #2
    Welcome to Statalist, Ciara.

    Perhaps someone will be able to advise you on using the user-written reclink command. If not, others here have had success with the user-written matchit command from Julio Raffo, as discussed in the following two threads.

    http://www.statalist.org/forums/foru...s-observations

    http://www.statalist.org/forums/foru...-e-fuzzy-match
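
    As a rough illustration of how matchit might be combined with exact matching on age and county, here is a minimal sketch. It is not taken from either thread. It assumes matchit is installed from SSC, that the using file contains idu, full_name, age1 and county, and that the census data sit in a file called master.dta (a hypothetical name). The names full_name_u and using_small are also made up for the example. Note that matchit's default score (similscore) is not on the same scale as reclink's score, so the 0.9 cutoff is only a placeholder.

    ```stata
    * Sketch only: pair observations that agree exactly on age and county,
    * then score name similarity within those pairs.
    * Requires the user-written matchit command: ssc install matchit

    use "/Users/ciaral/Dropbox/New Merging/using.dta", clear
    keep idu full_name age1 county
    rename full_name full_name_u          // hypothetical name; avoids a clash in joinby
    tempfile using_small
    save `using_small'

    use "master.dta", clear               // hypothetical file name for the census data
    keep idm full_name age1 county

    * Form all master/using pairs sharing the same age1 and county.
    * Unmatched observations are dropped by default.
    joinby age1 county using `using_small'

    * Two-column syntax: compares the two string variables row by row
    * and stores the result in similscore (the default variable name).
    matchit full_name full_name_u

    keep if similscore >= 0.9             // assumed cutoff; tune for your data
    ```

    The joinby step can produce a very large number of pairs when many people share an age and county, so it may still be necessary to run this one county (or group of counties) at a time.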
