Fuzzy matching of firms across years and datasets when firmID cannot be used

Paul Pelzl

Join Date: May 2016

Posts: 24
#1

Fuzzy matching of firms across years and datasets when firmID cannot be used

19 Sep 2016, 09:02

Dear all,

I have two firm-level panel datasets; the first includes data from 2008-2010 and the second from 2011-2012. I want to create a panel dataset from 2008-2012. Both datasets do contain a firm ID variable that is consistent within the respective dataset, thus enabling panel analysis over the respective period. However, the firms do not have a consistent ID across datasets; in the first, the ID variable starts from 1, in the second from roughly 1,000,000. At the same time, I know that at least a fraction of firms appears in both datasets, but given the circumstances just described, I cannot use the firmID variable(s) to match the firms across datasets (i.e., I cannot simply append one dataset to the other, sort by firmID year and be done.)

There are four variables (call them A, B, C, D) that appear in both datasets and are time-invariant (founding year of the firm, post code, sector and legal status) (Note: I dropped those that change any of the above over time). Further, there are many variables that are time-variant and appear in both datasets (E, F, G, ..., W). At last, there is also a string variable X that lists the clients of the firm (in numbers=IDs, i.e. e.g. "1234, 2280, 34873"). What I aim to do is to determine an algorithm that does the following:

Match (=assign the same unique ID) if:

A_2010=A_2011, B_2010=B_2011, C_2010=C_2011, D_2010=D_2011,

AND

for variables E,F (I could also use, for example, E,F,G,H) it holds that

E_2010 ϵ {E_2011*0.9 , E_2011*1.1} AND F_2010 ϵ {F_2011*0.9 , F_2011*1.1}

AND

{Clients_2010} must contain at least one element of {Clients_2011}.

AND

If two ore more firms from 2010 meet all above requirements for a particular firm in 2011, then match the one that has a larger overlap in terms of clients.

I see that this matching practice is questionable from a conceptual perspective. However, I am asked to match firms along this way, and therefore I would be very happy if someone knows a way to do that. Perhaps there is even a user-written command/package to carry out such tasks? I have found the command nearmrg (available from ssc: ssc install nearmrg), but it only allows me to specify one variable along which the fuzzy match shall occur, whereas I have three fuzzy-match variables, of which one is not numeric, and also four exact-match variables...

I look forward to your replies!

Last edited by Paul Pelzl; 19 Sep 2016, 09:27.
Tags: None
William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

19 Sep 2016, 12:23

My initial reaction is the following:
Create two datasets that contain one observation for each firm with the variables needed for your matching, one dataset for the 2008-2010 and one for the 2011-2012 data. Make sure the names of A, B, C, and D are the same in the two files, but make the names of the Firm ID, E, F, G, H, and Clients different. At this point, eliminate the variables not being used for matching.

Use joinby A B C D to combine the two files into one file containing one record for each pair that matches on A, B, C, and D, and the values of all the other variables.

Now work with that file to eliminate records that don't fulfill your other conditions.

When you think you're done, use the isid command make sure that no 2010 firm appears more than once and again to make sure no 2011 firm appears more than once.

When you're done, you'll have a file with records that each the "crosswalk" between 2010 IDs and 2011 IDs. Drop the rest of the variables and merge your original datasets to to this. Do this with a fair amount of care to understand where you have records in your original data that didn't match.

That's pretty lightly sketched out, lots of details missing, but perhaps it will get you started in a useful direction.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30353
#3

19 Sep 2016, 13:02

William has nicely set out the big picture for Paul. I just note that one of the specific criteria, "{Clients_2010} must contain at least one element of {Clients_2011}," is more complicated to apply than the others (which are basically one or two lines of code). I'm not aware of any user written program that carries out this function either. WIth local macros, it is easy to find their common content, but for string variables it takes a little work. Here is some code that illustrates how it can be done:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input long id str11(clients10 clients11) 1 "abc def ghi" "jkl def" 2 "abc ghi mno" "mno ghi qrs" 3 "qrs tuv wxy" "abc def mno" end gen union = clients10 + " " + clients11 gen expander = wordcount(union) expand expander by id, sort: gen token = word(union, _n) by id token, sort: drop if _N == 1 duplicates drop by id, sort: gen overlap = _N by id (token), sort: gen in_common = token if _n == 1 by id (token): replace in_common = in_common[_n-1] + " " + token if _n > 1 by id (token): keep if _n == _N drop token expander

Note: This code will fail if the individual names in each variable contain embedded blanks, so if he has those he may have to regenerate those variables replacing internal blanks with underscores or some other connector.

Paul also notes that if the overall algorithm provides ties for best match, he wants to select the one with the greatest overlap of clients. The above coded produces two variables that are useful in this connection. overlap is the count of overlapping "clients" between the two. in_common actually lists the overlapping "clients". I included the latter because it dawns on me that if there is a tie overall, there may still be matches that tie on the maximum number of overlapping clients. If that is the case, knowing who the actual clients are may provide some basis for breaking the tie.

Hope this is helpful.
1 like
Comment
Paul Pelzl

Join Date: May 2016

Posts: 24
#4

20 Sep 2016, 06:09

Thanks a lot to both of you! Your comments are proving to be very helpful, you've made my day a very productive one so far.
Comment

Announcement

Fuzzy matching of firms across years and datasets when firmID cannot be used

Comment

Comment

Comment