question on matching

EUNBI KIM

Join Date: Jul 2019

Posts: 17
#1

31 Jul 2019, 08:50

Hello,

I have a "CEO" variable for firms in panel data, and the CEO names are not entered consistently because of spelling errors. For example, a name can appear as "Jaeyoung Song" in one year and "Jaeyong Song" in another. I would like to make the information consistent despite the typos. In addition, a name is sometimes recorded as "Song Jaeyoung" instead of "Jaeyoung Song", because the order of first and last names differs in countries other than the US. Is there any way to match these variants and assign them the same value?

Ultimately, I want to assign the same value to the same CEO within each firm ID. For example:

Firm 1 is observed from 1992 to 2000, and its CEO changed as follows:
1992 Jaeyoung Song
1993 Jaeyoung Song
1994 Jaeyong Song
1995 Song Jaeyoung
1996 Taeho Kim
1997 Taeho Kim
1998 Sunhwa Han
1999 Sunha Han
2000 Sunhwa Han

Here the CEO changes twice over the years, and I would like each CEO to have a single ID within the firm. Could you help me match the slightly different spellings and also assign different values to different people within the same firm ID?

Thank you for your help in advance!
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#2

31 Jul 2019, 14:02

The simple approach: First separate the first and last names. Then match on soundex codes.

Code:

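* split the name into words: name1, name2, ... (one variable per word)
split CEO_name, gen(name)
gen code1 = soundex(name1)
gen code2 = soundex(name2)
* name3 exists only if some name has three words; otherwise omit this line and code3 below
gen code3 = soundex(name3)
* flag the first observation of each code combination, then cumulate the flags into an ID
by code1 code2 code3, sort: gen CEO_id = 1 if _n == 1
replace CEO_id = sum(CEO_id)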

Note: code not tested.

Now this may or may not work well for you. Soundex codes were designed for English-language surnames and are best known from their use in indexing US census records, and I don't know how well they work with Asian names. But the simplicity alone makes it worth a trial. This approach will not deal with instances of inverted order of first and last names, but you can inspect the results, identify those cases, and fix them.
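One possible variation on the soundex idea, as an untested sketch: sort the two soundex codes alphabetically before grouping, so that "Song Jaeyoung" and "Jaeyoung Song" produce the same key, and number the CEOs within each firm in order of first appearance. The variable names firm_id, year, key, and first_year are assumptions for illustration, and the sketch assumes every name has exactly two words.

```stata
* Toy data from the question (firm_id and year are assumed variable names)
clear
input int firm_id int year str20 CEO_name
1 1992 "Jaeyoung Song"
1 1993 "Jaeyoung Song"
1 1994 "Jaeyong Song"
1 1995 "Song Jaeyoung"
1 1996 "Taeho Kim"
1 1997 "Taeho Kim"
1 1998 "Sunhwa Han"
1 1999 "Sunha Han"
1 2000 "Sunhwa Han"
end

* Split on blanks; with two-word names this creates name1 and name2
split CEO_name, gen(name)
gen code1 = soundex(name1)
gen code2 = soundex(name2)

* Put the two codes in alphabetical order so word order no longer matters
gen key = cond(code1 <= code2, code1 + " " + code2, code2 + " " + code1)

* First year in which each key appears within the firm
bysort firm_id key (year): gen first_year = year[1]

* Within firm, count a new CEO each time the key changes
bysort firm_id (first_year key year): gen CEO_id = sum(key != key[_n-1])

sort firm_id year
list firm_id year CEO_name key CEO_id, sepby(CEO_id) noobs
```

If this behaves as intended, the 1992-1995 rows share one CEO_id even though the 1995 name is inverted. The listing is still worth inspecting by hand, since two genuinely different people can share soundex codes, and a CEO who returns after a gap keeps the original ID under this scheme.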

If the results from the above are not satisfactory, try using -matchit-, by Julio Raffo, available from SSC. I doubt it will help much for cases where the order of first and last names is inverted, but it should deal very well with variant spellings. It has a number of different metrics for identifying similar strings. Do read the help file before you use it.
Julio Raffo

Join Date: May 2014

Posts: 132
#3

20 Aug 2019, 03:41

Yes, I confirm that -matchit- can provide some sort of solution for both misspellings and name inversion. Many of its similarity methods can do the trick, such as the different ngram methods (e.g. the default bigram) or the hybrid options (e.g. token_soundex, tkngram, or token_wrap(), where the latter allows for the options nysiis_fk, soundex_fk, soundex_ext, soundex_nara, or soundex).
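A rough, untested sketch of how -matchit- might be used here is to match the list of distinct names against a copy of itself and then inspect the candidate pairs. The names name_id, name_id2, CEO_name2, and the file ceo_names_copy.dta are hypothetical, the threshold of 0.6 is arbitrary, and the option spellings should be checked against the help file before use.

```stata
* One-time install from SSC:
* ssc install matchit

* Toy list of names from the question
clear
input str20 CEO_name
"Jaeyoung Song"
"Jaeyong Song"
"Song Jaeyoung"
"Taeho Kim"
"Sunhwa Han"
"Sunha Han"
end

* Keep one row per distinct spelling and give it a numeric id
duplicates drop
gen long name_id = _n

* Save a copy under different variable names to serve as the using file
* (this writes ceo_names_copy.dta to the working directory)
rename (name_id CEO_name) (name_id2 CEO_name2)
save ceo_names_copy, replace
rename (name_id2 CEO_name2) (name_id CEO_name)

* Compare every name with every name in the copy; token_soundex is one of
* the hybrid methods mentioned above, and 0.6 is an arbitrary cutoff
matchit name_id CEO_name using ceo_names_copy.dta, ///
    idusing(name_id2) txtusing(CEO_name2) ///
    similmethod(token_soundex) threshold(0.6)

* Drop self-matches and mirrored pairs, then inspect the candidates
drop if name_id >= name_id2
gsort -similscore
list CEO_name CEO_name2 similscore, noobs
```

The `///` line continuation works only in a do-file. The resulting pairs are candidates, not decisions: they still need to be reviewed, restricted to names within the same firm, and mapped back to a single ID per CEO.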