
  • Issue with merging two databases

    Hello,

    I have an issue merging two databases.

    Database 1 contains information on public procurement contracts (e.g. €-value of contract, activity of contract, winning company, etc.).
    Database 2 contains company information (company name, number of employees, turnover, balance sheet, etc.).
    Obviously, I want to merge both databases based on the name of the company.
    The problem is that the company name is not always consistent between the two databases.

    Examples:
    Database1                   Database2
    A&M Motors                  A & M Motors
    Architectuurbureau Filips   Architectuurbureau Filips BVBA
    Aforest Belgium             Aforest
    Air Liquide Medical NV      Air Liquide Medical SA
    Do you guys have any suggestions on how to deal with this?

    Kind regards,

    Willem

  • #2
    There are no simple generic approaches to this problem. And there are no approaches that are guaranteed to be 100% correct.

    Partly it depends on how severe the discrepancies in the way the names are recorded are.

    The first step is to get rid of the simple discrepancies, such as variations in spacing and upper/lower case:

    Code:
    * normalize case, internal spacing, and leading/trailing blanks
    replace name = trim(itrim(upper(name)))
    The next thing I would do is attempt the match to get a sense of how many problem cases there are. If the number of problem cases is small, a series of -replace- statements in one or both data files can resolve it, as sketched below.
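    For example, a couple of hand-written fixes (using names from #1 purely as illustrations) might look like this:
    Code:
    * manual fixes for known discrepancies (illustrative names)
    replace name = "A&M MOTORS" if name == "A & M MOTORS"
    replace name = "ARCHITECTUURBUREAU FILIPS" if name == "ARCHITECTUURBUREAU FILIPS BVBA"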

    If the number of problem cases is too large for that to be practical, but still not huge, you can create a crosswalk file consisting of the name in Database1 and the corresponding name in Database2, then merge Database1 to the crosswalk and merge Database2 to the result.
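    A sketch of the crosswalk approach, with illustrative file names database1.dta, database2.dta, and crosswalk.dta (the latter holding paired variables name1 and name2), and assuming names uniquely identify companies in each file:
    Code:
    * attach Database2's spelling of each name via the crosswalk
    use database1, clear
    merge m:1 name1 using crosswalk, keep(match master) nogenerate
    * then pull in Database2's variables using its spelling of the name
    merge m:1 name2 using database2, keep(match master)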

    If the number of problem cases is too large for that, then you are in the realm of trying to automate the process of approximate matching. There is a user-written program, -reclink-, available from SSC, that is sometimes helpful. Another approach is to encode the names in some relevant way and consider names in the two databases to match if they have the same encoded value. A number of different ways of doing this exist. Built into Stata is the -soundex()- function, though it is really intended for use on person names and may not do as well with corporate names. There is also -strdist-, available from SSC, for finding closest string-distance matches. And in the spirit of these, but offering many more encodings to try, there is a whole new package developed by Bill Buchanan--see more about this at http://www.statalist.org/forums/foru...larity-metrics.
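    A minimal sketch of the match-on-encoding idea, using the built-in soundex() function and illustrative file and variable names:
    Code:
    * encode names in each file, then pair up records sharing a code
    use database2, clear
    gen name_code = soundex(name2)
    save database2_coded, replace
    use database1, clear
    gen name_code = soundex(name1)
    joinby name_code using database2_coded   // all candidate pairs per code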

    If you use one of these automated approaches, it is likely that at the end of the process you will still be left with some unmatched names, and you may also have created a few false matches. There are no perfect algorithms. Reviewing the remaining unmatched cases, you may be able to figure out how to match them yourself and finish the job with a few -replace- commands.
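    For instance, after merging on the cleaned names, something like this shows what is left to fix (a sketch, assuming name uniquely identifies firms in each file):
    Code:
    merge 1:1 name using database2
    tab _merge                   // how many matched, how many didn't
    list name if _merge == 1     // names that still found no partner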

    This is a frequent, vexatious problem in data analysis. Good luck.



    • #3
      I'd second @Clyde Schechter's advice. I'm pretty sure I added some links to the relevant Javadocs for the phonetic encoding algorithms, which might be a good place to start figuring out which is best suited to your use case; the documentation for some of them is fairly explicit about the specific applications for which they are well suited. The same package also contains several different string distance metrics, so you may be able to get relatively close using some combination of phonetic encoding and string distance metrics. But, as Clyde mentioned, there isn't really any foolproof way of doing this.



      • #4
        You might also find useful the -matchit- command, developed by Julio Raffo, which is discussed in the following Statalist threads.

        http://www.statalist.org/forums/foru...s-observations

        http://www.statalist.org/forums/foru...-e-fuzzy-match



        • #5

          As William Lisowski mentions, -matchit- is conceived for exactly this kind of problem. You can start by simply trying (I'm assuming you have the variables id & name for each database):
          Code:
          use Database1
          matchit id1 name1 using Database2, idu(id2) txtu(name2)
          However, a quick look at your sample data suggests that you can benefit from weights to reduce the impact of less informative strings like “NV”, “SA” or “Belgium”:
          Code:
          matchit id1 name1 using Database2, idu(id2) txtu(name2) w(simple)
          Also, it seems there are no major misspellings, so instead of the default ngram you could use token as the similarity function:
          Code:
          matchit id1 name1 using Database2, idu(id2) txtu(name2) w(simple) sim(token)
          In any case, Clyde Schechter is right about how hard it is to find a silver-bullet solution. For instance, -matchit- is case sensitive, so solutions like Clyde's
          Code:
          replace name = trim(itrim(upper(name)))
          often pay off in improving the score. Similarly, your case of “A&M” vs. “A & M” suggests that removing all blanks
          Code:
          replace name = subinstr(name," ","",.)
          can also help you. Please note that, in this latter case, you shouldn't use sim(token), since with all blanks removed each name becomes a single token.
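          Putting these together, a cleaning sketch to run on the name variable in both files before -matchit- (sticking with the default ngram similarity afterwards):
          Code:
          replace name = upper(trim(itrim(name)))    // case and spacing
          replace name = subinstr(name, " ", "", .)  // remove all remaining blanks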



          • #6
            Willem Vanlaer: Julio Raffo brings up a good point that I hadn't mentioned. In each of the programs I've put together for string distance and phonetic encoding, all string values are automatically converted to lower case. As Julio Raffo mentions, uniform casing can help, particularly with string distance metrics like the Levenshtein edit distance, which would treat "AM", "Am", "aM", and "am" as distinct: compared to the first, the second and third have an edit distance of 1, and the last a distance of 2, purely due to the casing of the characters.



            • #7
              @Everybody, thank you for your responses!
              I think I'm going to play around with matchit a bit.

              Julio Raffo: thanks for providing me with some suggestions. I have some additional questions, however, as I am quite a Stata novice:
              1. Stata says matchit is an "unrecognized command". How do I solve this? I know there is an .ado file (which I have downloaded), but I do not know how to proceed from there.
              2. With regard to the syntax of the code, could you perhaps explain the function of the words in bold:
              Code:
              use Database1
              matchit id1 name1 using Database2, idu(id2) txtu(name2)
              I assume name1 and name2 are the names of the variables which, in my case, identify the company.
              But I am not quite clear on id1 and idu(id2).
              Thank you very much for your patience.



              • #8
                -matchit- can be installed with the command -ssc install matchit-. More generally, you can search for user-contributed programs with -search matchit, all- or -findit matchit-.
                Stata/MP 14.1 (64-bit x86-64)
                Revision 19 May 2016
                Win 8.1



                • #9
                  Regarding question 1 in #7: how did you download matchit.ado? If you just downloaded it via your web browser or something like that, it may have landed in a directory where Stata doesn't look for it. -matchit- is available from SSC, so a foolproof way to install it is to run the command -ssc install matchit-. Stata will then find it on the SSC repository and download it to a directory where Stata knows to look when you want to use it.

                  So try the -ssc install matchit- command. If -matchit- is already properly installed, Stata will tell you so (and in that case it is a mystery why you got an "unrecognized command" error). But I suspect it is not properly installed at this point, and Stata will fix that up for you.



                  • #10
                    About the syntax, you're right about name1 & name2. The id variables should uniquely identify the observations in each database, respectively. If you don't have such a variable in either, you can simply create one:

                    Code:
                    egen id1 = group(name1)
                    egen id2 = group(name2)



                    • #11
                      Originally posted by Julio Raffo
                      About the syntax, you're right about name1 & name2. The id variables should uniquely identify the observations in each database, respectively. If you don't have such a variable in either, you can simply create one:

                      egen id1 = group(name1)
                      egen id2 = group(name2)

                      Julio Raffo So I have tried matchit. It ran all night and came up with some nice results. Two questions remain:
                      1. After running matchit, only 4 variables remain (the 2 unique identifiers + the company names). However, I would like to keep all variables from both databases. How do I do this?
                      2. A more general question: how do you decide which matches to keep? Is it only possible to do this manually? For example, say you only keep matches with similscore > 0.8. For some companies, this yields a unique match. However, for others (especially those where similscore = 1), numerous matches remain. How should I deal with this?



                      • #12
                        Willem Vanlaer there isn't really any good way to create a generalized rule for something like this. How many cases have multiple matches? If the number isn't too high, then a manual process would probably be reasonable. If you have a lot of observations like that, then it might be good to use some type of rule to make the process more transparent and replicable. One way you could proceed from that point is to use phonetic encodings of the company names and then estimate the string distances between the phonetic encodings, which gives you a way of estimating how similar the company names sound to each other. The link that Clyde Schechter referenced above (http://www.statalist.org/forums/foru...larity-metrics) to a package I started putting together would give you a bit more flexibility to do something like this (e.g., create multiple instances of the variable with different phonetic encodings and then estimate multiple string distance metrics to generate an average distance/similarity score).
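                        As a rough sketch of that idea, using only Stata's built-in soundex() function and -matchit-'s columns syntax (the package linked above offers many more encodings to swap in; variable names follow Julio's example):
                        Code:
                        * compare phonetic encodings of already-paired candidate names
                        gen enc1 = soundex(name1)
                        gen enc2 = soundex(name2)
                        matchit enc1 enc2, g(enc_score)   // g() names the generated similarity variable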



                        • #13
                          About 1, you can simply use the -merge- or -joinby- commands to put your original datasets back together. Check the help for each; they are relatively straightforward.
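                          For example (a sketch with hypothetical file names, assuming id1 and id2 uniquely identify observations in their respective files and appear in the -matchit- output):
                          Code:
                          * reattach the original variables to the matched pairs
                          use matchit_results, clear
                          merge m:1 id1 using database1, keep(match) nogenerate
                          merge m:1 id2 using database2, keep(match) nogenerate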

                          About 2, I agree with wbuchanan that there is no clear-cut rule. I usually sort the results from best to worst (e.g. gsort -similscore) and browse them manually to see where the results become basically all false matches. I then drop everything below that threshold and work with the remaining potential matches. It always pays off to use secondary information like ZIP codes or addresses. If you don't have any, you can always apply new similarity algorithms to the original pairs of strings. You could use wbuchanan's package for phonetic algorithms, or -matchit- in its columns syntax; for the latter, simply type "matchit name1 name2" plus the desired method (e.g. sim(soundex), sim(bigram), etc.).
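                          In practice that manual pass can be as simple as (the cutoff is purely illustrative):
                          Code:
                          gsort -similscore              // best candidates first
                          browse name1 name2 similscore
                          drop if similscore < 0.7       // threshold chosen by eyeballing the results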



                          • #14
                            Originally posted by Julio Raffo
                            About 1, you can simply use the -merge- or -joinby- commands to put your original datasets back together. Check the help for each; they are relatively straightforward.

                            About 2, I agree with wbuchanan that there is no clear-cut rule. I usually sort the results from best to worst (e.g. gsort -similscore) and browse them manually to see where the results become basically all false matches. I then drop everything below that threshold and work with the remaining potential matches. It always pays off to use secondary information like ZIP codes or addresses. If you don't have any, you can always apply new similarity algorithms to the original pairs of strings. You could use wbuchanan's package for phonetic algorithms, or -matchit- in its columns syntax; for the latter, simply type "matchit name1 name2" plus the desired method (e.g. sim(soundex), sim(bigram), etc.).

                            Thanks. The manual check will be work for the intern.

                            Another question. To increase the number of matches, I have replaced the name of the company in one database with that name in upper case.
                            More specifically, my initial code was:

                            Code:
                            clear all
                            use "C:\Users\wvanlaer\Documents\Ongoing\Proposal FOD ECO\tedall.dta", clear
                            matchit id_award win_name using "C:\Users\wvanlaer\Documents\Ongoing\Proposal FOD ECO\sme_all2.dta", idusing (companynumber) txtusing (name)
                            Then, I altered the name:

                            Code:
                            clear all
                            use "C:\Users\wvanlaer\Documents\Ongoing\Proposal FOD ECO\tedall.dta", clear
                            gen win_name2 = upper(win_name)
                            sort win_name2
                            save "C:\Users\wvanlaer\Documents\Ongoing\Proposal FOD ECO\tedall.dta", replace
                            Subsequently, I reran matchit:

                            Code:
                            clear all
                            use "C:\Users\wvanlaer\Documents\Ongoing\Proposal FOD ECO\tedall.dta", clear
                            matchit id_award win_name2 using "C:\Users\wvanlaer\Documents\Ongoing\Proposal FOD ECO\sme_all2.dta", idusing (companynumber) txtusing (name)
                            Whereas the first matching run took quite some time (about 10 hours), the second seems to take forever: it has been running for more than 8 hours now and is still not even at 20%. How can this be?



                            • #15
                              -matchit- is case sensitive, so converting both string variables to the same case is usually a good idea. Another good idea is to remove double spacing and any leading or trailing blanks (e.g. gen win_name2 = upper(trim(itrim(win_name))) ). Removing punctuation marks is also (sometimes) a good strategy.

                              Concerning the time increase, I suspect it is related to the reduction of the theoretical search space. To avoid comparing every pair of observations, -matchit- indexes them and only compares those pairs sharing at least one element (gram). So the more common text observations share across datasets (even between clearly different names), the more unnecessary comparisons -matchit- performs. This is why you could (should) remove terms that are excessively common in your data (like INC, LTD, & CO, BROS, etc.). All companies sharing the term "INC" will be compared unnecessarily, wasting computation time.
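                              A sketch of stripping such terms before matching (the term list is illustrative; build yours from the most frequent words in your own data):
                              Code:
                              foreach t in "NV" "SA" "BVBA" "INC" "LTD" {
                                  // pad with blanks so only whole words are removed
                                  replace win_name2 = trim(subinstr(" " + win_name2 + " ", " `t' ", " ", .))
                              }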

                              Why does it increase if you only changed the case? Well, now INC=Inc=inc, so all observations in each of these groups are now compared against each other. If you had one observation of each in each dataset, you were previously comparing 1x1+1x1+1x1=3 pairs; now you are comparing (1+1+1)x(1+1+1)=9. Things can be even worse in practice. By default -matchit- uses the bigram algorithm, which decomposes all strings into moving windows of two characters, i.e. it transforms each string into a vector of every two-character combination found in your data. Your original variables probably ranged over something like [aa...zA...ZZ], while the new ones range only over [AA...ZZ] (blanks and punctuation marks should also be included, but let's ignore them for the sake of the argument). The reduction in the range increases the matching scores (which is good), but it also enlarges the space being searched (which is bad for computing time).

                              An alternative is to use a higher ngram (e.g. option sim(ngram, 3)), which increases the time to compute the index but decreases the space searched. See here for a similar case: http://www.statalist.org/forums/foru...reclink-syntax
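                              Applied to the command from #14, that would be:
                              Code:
                              matchit id_award win_name2 using "C:\Users\wvanlaer\Documents\Ongoing\Proposal FOD ECO\sme_all2.dta", idusing(companynumber) txtusing(name) sim(ngram, 3)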

                              J.



