Stata extension for self-learning algorithm to determine a formula for "goodness of match"

Milan Quentel

Join Date: Nov 2016

Posts: 52
#1

16 Jan 2017, 01:37

Dear Statalists,

I do not know whether there is any tool in Stata (probably not), an already existing extension (maybe) or a way for me to write this myself, but here we go:

I am matching individuals in two data sets based on their first and last names. Both data sets contain additional information such as occupation or area of residence. Because these characteristics are not standardized (an occupation might be "fireworker" in one data set and "firefighter" in the other, and similar cases), I cannot use them to match individuals directly in Stata. The problem is that matching on names alone sometimes falsely matches two different people. I know this because, as a human, I can read all the information in a line and decide relatively quickly that a "teacher" in Los Angeles named Robert Brown is not the same person as the other Robert Brown, who is a mechanic in Kansas City. On the other hand, a Robert Brown who is a "teacher" in "Los Angeles" is relatively certain to be the same person as the Robert Brown who is a "professor" (or even "teacher in mathematics") in "LA". However, the data sets are too large for me to go through all the observations manually.

Now here is how I would like to solve the problem by automation. First, I would like to generate some soft variables such as "... is employed in education", equal to one if the occupation contains any of the substrings "teacher", "professor", "education" (etc.). I will construct several such variables (based on occupation, employer, region of origin etc.). None of these variables will be necessary for a good match on its own, but together they should give a clear picture of the individual's identity. Because some of these variables carry stronger information than others ("... is employed by the police department San Diego" vs. "lives in California"), I would like to weight them by some criterion. Now here comes the self-learning algorithm. I would like to go through some of the observations manually (say, 100) and determine whether they are a certain match (p=1) or no match (p=0), based solely on my (possibly not so intelligent and biased) human intelligence, and then let Stata determine some formula that would have "predicted" my judgement, so that I can use the formula for all other (50,000) observations.
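To make the idea of soft variables concrete, here is a minimal sketch. The toy data and all variable names (occ1, occ2, city1, city2, both_edu, same_city) are made up for illustration, and the keyword list and the "LA" recode are only assumptions about what such rules might look like.

```stata
* Hypothetical toy pairs of candidate matches (names are assumptions)
clear
input str30 occ1 str30 occ2 str20 city1 str20 city2
"teacher"     "professor"  "Los Angeles" "LA"
"teacher"     "mechanic"   "Los Angeles" "Kansas City"
"firefighter" "fireworker" "San Diego"   "San Diego"
end

* Flag education-related occupations on each side (case-insensitive)
gen byte edu1 = regexm(lower(occ1), "teacher|professor|education")
gen byte edu2 = regexm(lower(occ2), "teacher|professor|education")

* Soft criterion: both records point to the same occupational field
gen byte both_edu = edu1 & edu2

* Soft criterion: same city after a small, hand-written recode of "LA"
gen str20 c1 = cond(city1 == "LA", "Los Angeles", city1)
gen str20 c2 = cond(city2 == "LA", "Los Angeles", city2)
gen byte same_city = c1 == c2

list occ1 occ2 city1 city2 both_edu same_city
```

Each indicator is deliberately crude; the point is only that several weak 0/1 criteria can be built from free-text fields and weighted later.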

Is there a way to do this in Stata?

[One solution I have already thought about: a logit (or probit) regression of my judgement (match = 0, 1) on all criteria, and then using the coefficients. Would that work? Would it be scientific at all? Are 100 manual judgements enough? How many should there be?]
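A minimal sketch of the logit idea, using simulated data so that it runs on its own. The variable names (is_match, both_edu, same_city, simscore), the simulated coefficients and the 0.5 cutoff are all assumptions for illustration, not recommendations.

```stata
* Simulated candidate pairs; in practice these would be the real criteria
clear
set seed 12345
set obs 1000
gen byte both_edu   = runiform() < 0.4
gen byte same_city  = runiform() < 0.5
gen double simscore = runiform()      // stand-in for a name-similarity score

* Simulated "true" status, only so that the sketch has something to fit
gen byte truth = runiform() < invlogit(-3 + 2*both_edu + 2*same_city + 2*simscore)

* Pretend only the first 200 pairs were judged by hand; the rest are missing
gen byte is_match = truth if _n <= 200

* Fit on the hand-judged pairs (observations with missing is_match are dropped)
logit is_match both_edu same_city simscore

* predict fills in a probability for every pair, judged or not
predict double phat
gen byte auto_match = phat >= 0.5 if !missing(phat)
tabulate auto_match
```

The fitted coefficients play the role of the weights, and phat is the "goodness of match" formula applied to the unreviewed pairs.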

Many thanks,
Milan
Tags: algorithm, matching, self-learning
William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

16 Jan 2017, 12:44

Problems similar to yours are often addressed using "fuzzy matching" techniques.

Others here have had success with the user-written matchit command from Julio Raffo, as discussed in the following two threads.

http://www.statalist.org/forums/foru...s-observations

http://www.statalist.org/forums/foru...-e-fuzzy-match
Julio Raffo

Join Date: May 2014

Posts: 132
#3

17 Jan 2017, 00:25

Originally posted by Milan Quentel

...I would like to go through some of the observations manually (say, 100) and determine whether they are a certain match (p=1) or no match (p=0), based solely on my (possibly not so intelligent and biased) human intelligence, and then let Stata determine some formula that would have "predicted" my judgement, so that I can use the formula for all other (50,000) observations...

In addition to what William Lisowski mentioned, you can do the clerical analysis also within Stata by applying clrevmatch after matchit. For instance:

Code:

matchit id2 name2 using file1.dta, idu(id1) txtu(name1)
joinby id1 using file1.dta
joinby id2 using file2.dta
save matched12.dta, replace
clrevmatch using matched12.dta, ///
    idmaster(id1) idusing(id2) varM(name1 addr1 profession1) varU(name2 addr2 profession2) ///
    clrev_result(myscore) clrev_note(mynotes) newfilename(matched_clrev.dta)
Milan Quentel

Join Date: Nov 2016

Posts: 52
#4

17 Jan 2017, 02:34

Dear Julio Raffo and William Lisowski,

Thank you both very much. The matchit command is a gem! It is the perfect complement to the matching criteria I have written, since those are based on content rather than spelling (teacher=professor=lecturer, but not professor=profess.). I will also look into clrevmatch when I find the time. Thanks a lot.

One more question:
In your experience, how many observations should I classify manually? At the moment I have n=200. When looking at the logit model, what are good criteria for judging whether the number of observations is enough to fit the model? The joint significance of the coefficients? The pseudo R-squared?

But then, I guess these questions probably cannot be answered in general. In any case, you have helped me a lot. Thank you.
Julio Raffo

Join Date: May 2014

Posts: 132
#5

30 Jan 2017, 07:52

Apologies for the late reply. I'm glad to hear you find -matchit- useful.

About your sample size question, I have seen (and sometimes done myself) almost everything. I guess the most formal approach would be to calculate a sample size according to the explanatory power that you want to have. And, if you want to take it to the next level, you would probably need to consider stratifying on the other variables you want to use in the logit step.

Concerning the Logit, I don't think the joint-significance and pseudo R-squared will help you much with sample size.
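One check that is sometimes used instead, sketched here only as an illustration and not as something proposed in this thread, is to hold back part of the hand-judged sample and see how well the fitted logit classifies it. The data are simulated and the names (is_match, both_edu, same_city, train), the 70/30 split and the 0.5 cutoff are assumptions.

```stata
* Simulated hand-judged sample of 200 pairs (illustrative only)
clear
set seed 2017
set obs 200
gen byte both_edu  = runiform() < 0.4
gen byte same_city = runiform() < 0.5
gen byte is_match  = runiform() < invlogit(-2 + 2*both_edu + 2*same_city)

* Randomly keep roughly 70% for fitting and hold back the rest
gen byte train = runiform() < 0.7

* Fit on the training part only, then predict for everyone
logit is_match both_edu same_city if train
predict double phat
gen byte pred = phat >= 0.5 if !missing(phat)

* Share of held-back pairs whose manual judgement is reproduced
gen byte correct = pred == is_match
summarize correct if !train
```

If the hold-out share changes a lot when the split is redrawn with a different seed, that is a hint that more manual judgements would help.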