Dear Statalists,
I do not know whether there is any tool in Stata (probably not), an already existing extension (maybe) or a way for me to write this myself, but here we go:
I am matching individuals in two data sets based on their first and last names. Both data sets contain additional information such as job or area of residence. Because these characteristics are not standardized (the occupation might be "fireworker" in one data set and "firefighter" in the other, and similar cases), I cannot use them to match individuals directly in Stata. The problem is that matching on names alone sometimes falsely matches two different people. I know this because, as a human, I can read all the information in a line and decide relatively quickly that a "teacher" in Los Angeles named Robert Brown is not the same person as the other Robert Brown who is a mechanic in Kansas City. On the other hand, a Robert Brown who is a "teacher" in "Los Angeles" is fairly certain to be the same person as a Robert Brown who is a "professor" (or even "teacher in mathematics") in "LA". However, the data sets are too large for me to go through all the observations/individuals manually.
Now here is how I would like to solve the problem by automation: First, I would like to generate some soft variables such as "... is employed in education", equal to one if the occupation string contains "teacher", "professor", "education" (etc.). I will construct several such variables (based on occupation, employer, region of origin etc.). None of these variables will be necessary for a good match on its own, but together they should give a clear picture of the individual's identity. Because some of these variables carry stronger information ("... is employed by the police department San Diego" vs. "lives in California"), I would like to weight them by some criterion. Now here comes the self-learning algorithm. I would like to go through some of the observations manually (say, 100) and determine whether they are a certain match (p=1) or no match (p=0), based solely on my (possibly not so intelligent and biased) human intelligence, and then let Stata determine some formula that would have "predicted" my judgement, so that I can then use the formula for all the other (50,000) observations.
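To make the first step concrete, here is a minimal sketch of such soft variables. It is written in Python only to illustrate the logic (in Stata, string functions such as strpos() or regexm() would presumably do the same job). The keyword list, the alias table and all function and field names are assumptions of mine, not anything taken from the real data.

```python
# Hypothetical keyword list and alias table; real ones would come from
# inspecting the actual occupation and region strings.
EDUCATION_WORDS = ("teacher", "professor", "education")
REGION_ALIASES = {"la": "los angeles", "l.a.": "los angeles"}


def works_in_education(occupation):
    """Return 1 if the occupation string contains any education keyword."""
    text = occupation.lower()
    return int(any(word in text for word in EDUCATION_WORDS))


def normalise_region(region):
    """Lower-case the region and map known abbreviations to one spelling."""
    key = region.strip().lower()
    return REGION_ALIASES.get(key, key)


def pair_features(rec_a, rec_b):
    """Build soft agreement indicators for one candidate pair of records."""
    both_education = (works_in_education(rec_a["occupation"])
                      and works_in_education(rec_b["occupation"]))
    same_region = (normalise_region(rec_a["region"])
                   == normalise_region(rec_b["region"]))
    return {"both_education": int(both_education),
            "same_region": int(same_region)}


# The two Robert Brown cases from the text above.
teacher_la = {"occupation": "teacher", "region": "Los Angeles"}
professor_la = {"occupation": "teacher in mathematics", "region": "LA"}
mechanic_kc = {"occupation": "mechanic", "region": "Kansas City"}

likely_same = pair_features(teacher_la, professor_la)
likely_different = pair_features(teacher_la, mechanic_kc)
```

Each candidate pair (two records sharing a name) then gets one row of 0/1 indicators, which is what the weighting step would work on.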
Is there a way to do this in Stata?
[One solution I have already thought about: a logit (or probit) regression of my judgement (match = 0, 1) on all criteria, then using the coefficients as weights. Would that work? Would it be scientific at all? Are 100 manual judgements enough? How many should there be?]
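A rough sketch of what I mean by the logit idea, again in Python purely as an illustration: fit a logistic model on the hand-labelled pairs, then apply the fitted weights to an unlabelled pair. The toy data are made up, all names are hypothetical, and the small ridge penalty (l2) is my own addition rather than part of a plain logit. I put it in because with few judgements the 0/1 indicators can separate matches from non-matches perfectly, in which case the unpenalised coefficients would not be finite.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logit(X, y, steps=5000, lr=0.1, l2=0.01):
    """Fit a logistic regression by batch gradient descent.

    Returns [intercept, w_1, ..., w_k]. The intercept is not penalised.
    """
    n = len(X)
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            p = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
            err = p - yi
            grad[0] += err
            for j, xj in enumerate(xi, start=1):
                grad[j] += err * xj
        for j in range(len(w)):
            penalty = l2 * w[j] if j > 0 else 0.0
            w[j] -= lr * (grad[j] / n + penalty)
    return w


def predict_match(w, x):
    """Predicted probability that a candidate pair is a true match."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))


# Made-up hand-labelled pairs: columns are [both_education, same_region].
X_labelled = [[1, 1], [1, 1], [1, 0], [0, 1], [0, 0], [0, 0], [1, 1], [0, 1]]
y_labelled = [1, 1, 0, 0, 0, 0, 1, 0]  # my manual judgement: match = 1

weights = fit_logit(X_labelled, y_labelled)
p_new_pair = predict_match(weights, [1, 1])  # score for an unlabelled pair
```

The predicted probabilities for the remaining pairs could then be compared against a cut-off, with pairs near the cut-off set aside for manual review.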
Many thanks,
Milan