Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Correcting and consolidating multiple name spellings

    I am currently working on a dataset of government contracts and have run into an issue while cleaning it. I have found that many of the contractor names are spelled multiple ways like the example shown below.
    Code:
    "2 fast entertaiment"    
    "2 fast entertainment"  
    "2 fast entertainment copr"
    "2 fast entertainment corp"
    "2 fast entertainment corp."
    I have about 269,000 observations and about 97,000 unique values of contractor name. Assuming the avg number of different spellings is anywhere between 1.5-2 (i guesstimated this by eye going over a collapsed list of names) I should have anywhere between 48,000 and 64,000. Using -strtrim- -stritrim- -strlower- -usubinstr- and -ustrnormalize- to address the most obvious and common problems I have been able to reduce the number of unique names to about 90,000, but this is far from ideal, to illustrate this is the reduced list of the example name.
    Code:
    "2 fast entertaiment"    
    "2 fast entertainment"  
    "2 fast entertainment corp"
    The problems I am facing are that 1) I don't necessarily want to remove the corp/llc/llp/inc components of company names because I fear that I could have something like ACME llc and ACME corp be two different entities and I group them into one, but in a case like the example above I feel that it is clearly the same entity and I can remove or add a corp. And 2) I don't know how to fix the misspelling of words like entertaiment to entertainment without going mistake by mistake and using -replace-. If this dataset weren't so large I wouldn't mind doing this by hand but it seems like a poor use of time to spend at the very least a whole day scouring through names to identify and correct these mistakes. Please let me know if you have any ideas or suggestions on how to tackle this problem.

    Your help and guidance are greatly appreciated. Have a great [insert appropriate time of day here] and I hope you are staying safe and healthy!

  • #2
    Hi Enrique,

    I've been wondering about this problem for a while and have some ideas on how to tackle the problem.
    • Using an overlap measure between names to determine if two strings should be grouped together. Take all elements (not just the unique ones) from one string and compare it with the other string, and use the Jaccard index or other overlap measures, and use a threshold to determine if they belong together. Obviously this does not incorporate the sequence of the elements, and will have perfect overlap for "aaa bbb ccc" "ccc bbb aaa". Probably would need to pre-sort the strings then compare the strings sequentially.
    • Incorporate information of elements and their relative positions to get a better score. Obviously "erika" and "erica" are closer together than "erika" and "kiera." Perhaps an overlap measure plus a penalty for distance. "e-1 r-2 i-3 k-4 a-5" and "e-1 r-2 r-3 c-4 a-5" has a distance of zero for the common characters whereas "k-1 i-2 e-3 r-4 a-5" will have a greater distance (of course, the distance measure needs to be properly defined).
    • One could create a distance a pairwise distance/score matrix instead of doing it sequentially.
    • As for determining which is the most representative string for a group, you could use the mode (the string that comes up most frequently) or the median.
    Now I have written this out, I feel like I should write a program...

    Comment


    • #3
      I think matchit (from SSC) is one possible way to approach this. See https://www.stata.com/meeting/switze...tzerland16.pdf

      Comment


      • #4
        Originally posted by Daniel Shin View Post
        Hi Enrique,

        I've been wondering about this problem for a while and have some ideas on how to tackle the problem.
        • Using an overlap measure between names to determine if two strings should be grouped together. Take all elements (not just the unique ones) from one string and compare it with the other string, and use the Jaccard index or other overlap measures, and use a threshold to determine if they belong together. Obviously this does not incorporate the sequence of the elements, and will have perfect overlap for "aaa bbb ccc" "ccc bbb aaa". Probably would need to pre-sort the strings then compare the strings sequentially.
        • Incorporate information of elements and their relative positions to get a better score. Obviously "erika" and "erica" are closer together than "erika" and "kiera." Perhaps an overlap measure plus a penalty for distance. "e-1 r-2 i-3 k-4 a-5" and "e-1 r-2 r-3 c-4 a-5" has a distance of zero for the common characters whereas "k-1 i-2 e-3 r-4 a-5" will have a greater distance (of course, the distance measure needs to be properly defined).
        • One could create a distance a pairwise distance/score matrix instead of doing it sequentially.
        • As for determining which is the most representative string for a group, you could use the mode (the string that comes up most frequently) or the median.
        Now I have written this out, I feel like I should write a program...
        I had a similar, though less developed, idea. I have the original set of contractor names (97797 unique values) and the set with the general corrections I was able to accomplish (90026 unique values). I already tried to score/compare them using -matchit- and -strdist- but I am struggling to find a good threshold to use. Just from looking it over it seems like something around 86% using the -matchit- similscore seems to differentiate ok but I am not necessarily confident in that number. I am going to make some more attempts and I'll post any developments I make. Please let me know if you end up making a program for this, I am sure that given the extremely high and consistent standards for data collection that the PR government uses (sarcasm) I will end up facing a similarly daunting problem in the future.

        Comment

        Working...
        X