Fuzzy match - names only

Francois Durant

Join Date: Dec 2014

Posts: 761
#1

13 Apr 2018, 06:12

Hi,

I am trying to fuzzy match 2 datasets by name only. I do not have a numeric ID with which to match the 2 datasets. I have been trying to use "matchit". The results I'm currently getting are not convincing. Is there any way to use this SSC command without "ID1", which is the numeric ID?

Here is the code I have been running:
I have created 1 unique number per name in each dataset. These numbers have nothing in common across the 2 datasets, which is why I do not want to use them for matching.

Code:

matchit mgrno mgrname using HFnames_MorningStar1.dta, idu(obs1) txtu(Name) sim(token) t(0) override

Here is what the 2 datasets look like:

I am just trying to fuzzy match the 2 datasets by mgrname and Name. Can anyone help? Thanks!
William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

13 Apr 2018, 15:53

The output of help matchit warns us that matchit is case-sensitive. In your post, your first dataset is all upper-case; your second is not. If you have not already done so, you should work with a copy of your second dataset in which you have used strupper() to convert the name to all upper-case.
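
As a minimal sketch of that suggestion, assuming the using dataset is HFnames_MorningStar1.dta with the string variable Name (as in post #1), the conversion could look like the following. The output file name HFnames_MorningStar1_upper.dta is hypothetical.

```stata
* Work on a copy of the using dataset so the original file stays untouched
use HFnames_MorningStar1.dta, clear

* strupper() converts lowercase ASCII letters to upper-case
replace Name = strupper(Name)

* HFnames_MorningStar1_upper.dta is a hypothetical file name for the copy
save HFnames_MorningStar1_upper.dta, replace
```

The matchit command would then point at the upper-case copy in its using clause.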
Francois Durant

Join Date: Dec 2014

Posts: 761
#3

16 Apr 2018, 02:25

Thanks for this advice. Could you also help me with the fact that I only want to fuzzy match on names and not on an additional variable (ID)? Thanks.
William Lisowski

Join Date: Dec 2014

Posts: 10150
#4

16 Apr 2018, 06:05

The output of help matchit tells us that the string variables (mgrname and Name in your command) are used for matching, and the numeric variables (mgrno and obs1) are used for identification. They are not used for matching.
Francois Durant

Join Date: Dec 2014

Posts: 761
#5

16 Apr 2018, 18:30

That's a part of the help file that I don't really understand. What does identification mean here?

I just want to fuzzy match "Name" with "mgrname", I do not have any other relevant (identification) variables. How do I reflect that in the code? Thanks

I found a PDF that talks about it but does not explain how to use identification variables either:
https://www.stata.com/meeting/switze...tzerland16.pdf
William Lisowski

Join Date: Dec 2014

Posts: 10150
#6

16 Apr 2018, 19:04

The code you show in post #1 reflects what you want. The ID is not used for matching. The ID assists users who want to match IDs using text descriptions, but since your data do not have IDs, you have to invent IDs because they are not optional.

In post #1 you said the results were not convincing. Your unconvincing results are not related to the presence of the ID. Unconvincing results would certainly be related to the master dataset being all upper-case and the using dataset being mixed case.
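
To illustrate what "identification" means in practice, here is a rough sketch that assumes the master data from post #1 are already in memory and that obs1 uniquely identifies observations in HFnames_MorningStar1.dta. The IDs are simply carried into the output next to each candidate pair, so they can be used afterwards to link a pair back to its source rows.

```stata
* Assumes the master data (with mgrno and mgrname) are already in memory
matchit mgrno mgrname using HFnames_MorningStar1.dta, idu(obs1) txtu(Name) override

* matchit leaves the candidate pairs in memory: both IDs, both text
* variables, and the similarity score (named similscore by default)
gsort -similscore
list mgrno mgrname obs1 Name similscore if _n <= 20

* The ID is only a key: here it pulls any other using-file variables
* onto each candidate pair (assumes obs1 is unique in the using file)
merge m:1 obs1 using HFnames_MorningStar1.dta, keep(master match) nogenerate
```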
Francois Durant

Join Date: Dec 2014

Posts: 761
#7

17 Apr 2018, 19:43

Can you confirm that I should create an ID in each dataset? How can I indicate in the code that I do not want to use this ID for matching? If it is there, I am assuming it will interfere somewhere in the process. Can you explain in greater detail how to handle this aspect?

I have noted your comment about upper and lower case. I have corrected that in the new version of my code. I'm still not getting matches. Please help. Thanks.
Julio Raffo

Join Date: May 2014

Posts: 132
#8

25 Apr 2018, 08:12

Hi François,

Thanks for your question. Yes, matchit in the two-dataset syntax requires two variables (one numeric and one string) in the master dataset and two variables (one numeric and one string) in the using dataset. The original idea was that you would have a unique identifier for your names (like social security number, gvkey, etc.) in order to link these with other variables later on. There are also some performance gains from having those numeric IDs.

In any case, if you do not have such an ID in your data, you can easily create one using the group() function of the egen command. An example follows:

Code:

use HFnames_MorningStar1.dta
egen myid=group(Name)
save HFnames_MorningStar1_ids.dta
use YourOtherDataset.dta, clear
matchit mgrno mgrname using HFnames_MorningStar1_ids.dta, idu(myid) txtu(Name) sim(token) t(0) override

Additionally, looking at your examples, I suggest using the following options: sim(bigram) w(logs). This is because sim(token) will not catch misspellings, and not using weights allows less informative text (such as "LLP", "INC", etc.) to have too much impact on the score.

Best,

J.
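
The bigram suggestion above could be tried along these lines. This is only a sketch that reuses the file and variable names from the example in this post; the 0.7 cutoff is an arbitrary illustrative value, not a recommendation from the thread.

```stata
use YourOtherDataset.dta, clear

* sim(bigram) compares overlapping two-character pieces, so small
* misspellings still share most of their pieces with the correct name.
* A weights option, as suggested in the post, can be added as well;
* see help matchit for the accepted weight names.
matchit mgrno mgrname using HFnames_MorningStar1_ids.dta, idu(myid) txtu(Name) sim(bigram) override

* Inspect the strongest candidates first, then keep those above a cutoff
gsort -similscore
keep if similscore >= 0.7
```

Reviewing the pairs just above and below the cutoff by eye is a reasonable way to decide where to set it for a given pair of datasets.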
Mohamed Mahmoud

Join Date: Apr 2022

Posts: 36
#9

17 Jun 2022, 09:21

Dear Stata helpers,

I need to match two datasets on company name using the "matchit" command; however, the matching does not come out correctly.

This is the command that I used:

matchit gvkey CONAME using "lobbying_data_edited_ID", idu(myid) txtu(registrant_raw) sim(token) t(0) override

I attached photos of each dataset as follows:

1. Master file photo
2. Using file photo
3. After-matching photo

Could anyone familiar with the matchit command tell me what I did wrong?

Many thanks

Julio Raffo

Join Date: May 2014

Posts: 132
#10

18 Jun 2022, 02:41

Hi MM,

You are using the token option, which basically compares strings split at blanks, and one of your string variables is "dirty" with punctuation marks, etc. You have two main options to improve the similarity scores:

1. Clean the registrant_raw variable (e.g. by removing punctuation marks and other less informative symbols).
2. Use the bigram option instead of token.

Note that you can do both to get a better similarity score. Moreover, I suggest using weights, so that less informative strings (e.g. "INC") do not artificially increase the similarity.

Best,

J.
Mohamed Mahmoud

Join Date: Apr 2022

Posts: 36
#11

19 Jun 2022, 05:57

Dear Julio Raffo,

Could you please let me know how to remove the punctuation marks and other less informative symbols from the registrant_raw variable? What is the command for this?

many thanks,

Best,
MM
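
One possible way to do this cleaning, offered as a sketch rather than an answer from the thread, is a regular-expression replace. It assumes Stata 14 or later (for ustrregexra()), that the using file is lobbying_data_edited_ID.dta as in post #9, and that CONAME in the master file is already upper-case. The names registrant_clean and lobbying_data_clean.dta are hypothetical.

```stata
use lobbying_data_edited_ID.dta, clear

* Upper-case first, so the pattern below only needs to allow A-Z
gen registrant_clean = strupper(registrant_raw)

* Replace every character that is not a letter, digit, or blank with a blank
* (note: this also drops accented letters, which may or may not be wanted)
replace registrant_clean = ustrregexra(registrant_clean, "[^A-Z0-9 ]", " ")

* Trim leading/trailing blanks and collapse repeated internal blanks
replace registrant_clean = stritrim(strtrim(registrant_clean))

* lobbying_data_clean.dta is a hypothetical file name
save lobbying_data_clean.dta, replace
```

The matchit call would then use lobbying_data_clean.dta with txtu(registrant_clean) in place of the raw variable.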