
  • question on matching

    Hello,

    I have a "CEO" variable for firms in panel data, and the CEO names are not entered consistently because of spelling errors. For example, a name can be "Jaeyoung Song" in one year and "Jaeyong Song" in another. I want to make the information consistent despite the typos. At other times, the name is written as "Song Jaeyoung" instead of "Jaeyoung Song" because the order of first and last names differs in countries other than the US. Is there any way to match these variants and give them the same value?

    Ultimately, I want to assign the same value to the same CEO within a firm ID. For example:

    Firm 1 has years from 1992-2000 and its CEO has changed as follows:
    1992 Jaeyoung Song
    1993 Jaeyoung Song
    1994 Jaeyong Song
    1995 Song Jaeyoung
    1996 Taeho Kim
    1997 Taeho Kim
    1998 Sunhwa Han
    1999 Sunha Han
    2000 Sunhwa Han

    Here the CEO changes twice over the years, and I want each CEO to have a single ID within the firm. Could you show me how to match slightly different spellings and also assign different values to different people within the same firm ID?

    Thank you for your help in advance!

  • #2
    The simple approach: First separate the first and last names. Then match on soundex codes.

    Code:
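    split CEO_name, gen(name)       // one new variable per word: name1, name2, ...
    gen code1 = soundex(name1)
    gen code2 = soundex(name2)
    gen code3 = soundex(name3)      // name3 exists only if some name has three words
    by code1 code2 code3, sort: gen CEO_id = 1 if _n == 1   // flag first row of each code combination
    replace CEO_id = sum(CEO_id)    // running sum turns the flags into consecutive IDs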
    Note: code not tested.


    Now this may or may not work well for you. Soundex codes were designed for English-language surnames and are best known from their use in indexing US census records, and I don't know how well they work with Asian names. But the simplicity alone makes it worth a trial. This approach will not deal with instances of inverted order of first and last names, but you can inspect the results, identify those cases, and fix them.
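
    One possible way to make the soundex approach insensitive to name order is to sort the two codes within each row before grouping. The following is only a sketch on toy data taken from the question: it assumes every name has exactly two words, and the names firm_id, part, s1, s2, and key are hypothetical choices, not anything from the original data.

    ```stata
    * Sketch only: toy data from the question; firm_id and key are hypothetical names
    clear
    input firm_id year str20 CEO_name
    1 1992 "Jaeyoung Song"
    1 1994 "Jaeyong Song"
    1 1995 "Song Jaeyoung"
    1 1996 "Taeho Kim"
    1 1998 "Sunhwa Han"
    1 1999 "Sunha Han"
    end

    * Assumes two-word names, so split creates part1 and part2
    split CEO_name, gen(part)
    gen s1 = soundex(part1)
    gen s2 = soundex(part2)

    * Put the two codes in alphabetical order so "Song Jaeyoung" and
    * "Jaeyoung Song" produce the same key
    gen key = cond(s1 <= s2, s1 + " " + s2, s2 + " " + s1)

    * One ID per distinct key within each firm
    egen CEO_id = group(firm_id key)
    list firm_id year CEO_name key CEO_id, sepby(CEO_id)
    ```

    Here the first three rows share one key, so they get the same CEO_id. Names that differ by more than soundex can absorb, or that collide by accident, would still need manual inspection.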


    If the results from the above are not satisfactory, try using -matchit-, by Julio Raffo, available from SSC. I doubt it will help much for cases where the order of first and last names is inverted, but it should deal very well with variant spellings. It has a number of different metrics for identifying similar strings. Do read the help file before you use it.



    • #3
      Yes, I confirm that -matchit- can provide some sort of solution for both misspellings and name inversion. Many of its similarity methods can do the trick, such as the different ngram methods (e.g. the default bigram) or the hybrid options (e.g. token_soundex, tkngram, or token_wrap(), where the latter allows for the options nysiis_fk, soundex_fk, soundex_ext, soundex_nara, or soundex).
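
      As a rough illustration, one way to use -matchit- here is to compare the list of names against a renamed copy of itself and then inspect the candidate pairs. This is only a sketch on toy data from the question: obs_id, obs_id2, CEO_name2, and ceo_copy.dta are hypothetical names, and the call relies on the default method rather than a tuned one, so please check the help file for the options that suit your data.

      ```stata
      * Sketch only: requires -matchit- from SSC (ssc install matchit)
      * obs_id, obs_id2, CEO_name2, and ceo_copy.dta are hypothetical names
      clear
      input obs_id str20 CEO_name
      1 "Jaeyoung Song"
      2 "Jaeyong Song"
      3 "Song Jaeyoung"
      4 "Taeho Kim"
      5 "Sunhwa Han"
      6 "Sunha Han"
      end

      * Save a renamed copy so the list can be compared against itself
      preserve
      rename (obs_id CEO_name) (obs_id2 CEO_name2)
      save ceo_copy, replace
      restore

      * Default similarity method is bigram; similmethod() selects another
      matchit obs_id CEO_name using ceo_copy.dta, idusing(obs_id2) txtusing(CEO_name2)

      * Drop self-matches and mirrored pairs, then list best candidates first
      drop if obs_id >= obs_id2
      gsort -similscore
      list
      ```

      The resulting pairs are only candidates. In panel data you would probably restrict the comparison to names within the same firm and choose a score cutoff after looking at the results.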