I have an extremely messy dataset that I've been asked to work with. I'm attempting to find duplicate entries based of a several variables, some numeric, but mostly string variables. I have tried using the -matchit- function to inspect the string variables one-at-a-time to look for duplicates, and the -duplicates- function to work with the numeric variables. In an ideal world I'd like to . . .
1) generate some kind of metric for how closely an observation matches another
2) create some kind of group variable to help tell the difference between unrelated observations.
I can sort of do the first by using the -matchit- function to match the dataset to itself. The original dataset is ~50,000 observations, so this is very time consuming. However, there is a variable in the dataset that allows me to divide up the full dataset into ~200 smaller regional datasets and where it is extremely unlikely that a record (company) could be in two different regions. So, by working with the mini-datasets the -matchit- function becomes much more manageable.
Here is a mock datasets. Note that I have a unique ID variable that I created, as well as a variable which is supposed to be unique to a company/business, but obviously is not. There is also a name and address field, neither of which are always consistently formatted.
Most of the names are close enough that the -matchit- function finds them pretty quickly. However, with "Artusi Tap Gurus" and "Rtusi" it's the address match that would likely link them. Similarly, the "Tango Tanning" observations could be matched by a combination of the name and address.
Again, in an ideal world, I'd create some kind of matching metric to say just how closely the observation is to another, and also some way to group those two (or more) observations together. For example, my ideal output looks something like this (using the -groups- function):
This is where I'm at at the moment. Apologies for having not the most beautiful code. (This comes immediately after defining the dataset from above.)
So this code doesn't work on the addresses at all, it's unable to match the "Artusi" and the "Rtusi", and it correctly identifies the "Tango's" as related, but doesn't figure out how to mark one as good and the other as duplicate. Same with the "Abseilon"
Any help would be greatly appreciated. Hopefully my question is clear enough and my terminology accurate enough to not incur the wrath of he-who-must-not-be-named-on-the-Statalist.
1) generate some kind of metric for how closely an observation matches another
2) create some kind of group variable to help tell the difference between unrelated observations.
I can sort of do the first by using the -matchit- function to match the dataset to itself. The original dataset is ~50,000 observations, so this is very time consuming. However, there is a variable in the dataset that allows me to divide up the full dataset into ~200 smaller regional datasets and where it is extremely unlikely that a record (company) could be in two different regions. So, by working with the mini-datasets the -matchit- function becomes much more manageable.
Here is a mock datasets. Note that I have a unique ID variable that I created, as well as a variable which is supposed to be unique to a company/business, but obviously is not. There is also a name and address field, neither of which are always consistently formatted.
Code:
clear input str50 name long unique_id long other_id str50 street "ARTUSI, TAP GURUS" 2179 26244 "1000 E. Maple Dr." "The Oasis Coffee and Tea" 44725 176450 "421 E Thomas Rd" "Dynasty Athletic and Dance, LLC" 29244 135168 "15985 W. Bartlett AVe" "Abseilon CA, LLC" 22471 119740 "PARK VIEW LN" "Abseilon CA, LLC" 22202 118718 "352 W. Park View Lane" "Abseilon CA, LLC" 5557 46044 "352 W. Park View Lane" "Abseilon CAN, LLC" 22201 118717 "352 W. Park View Lane" "Accurate Care Chiropractic & Acupuncture PC" 13839 102454 "35 CAVE RD STE 30" "Aruba Healthy Clinic, LLC" 44732 176476 "4344 W. Colonial Rd Ste 8" "Aruba Healthy Clinic, LLC" 6663 46297 " 4344 W. Colonial Rd, Ste 8" "Pep 4 Free, LLC" 9275 72133 "28068 Outlook Blvd #1005" "Poledancers Aerial, LLC" 43042 162420 "8020 S. Rainbow Blvd" "Poledancers Aerial Fitness, LLC" 6332 46141 "7920 S. Rainbow Blvd #105" "Daniels Funeral Home, Inc." 4159 30042 "3500 S. Western Ave" "Price Backmon PLLC" 38692 77368 "803 W Mohegan Ave Ste 120" "Rtusi" 40065 156356 "1000 E. Maple Dr." "Tango Tanning" 40064 819323 "GRASSHILL RD" "Tango Tanning & Salon, LLC" 6732 04594 "12345 S. Grasshill Road" "Taqueria Momento" 42158 131880 "3099 WEST RIDGE HIGHWAY" "Marisa Bassett, Attorney" 9871 04954 "12345 Legal Way" "Maria Clemens, Attorney" 1101 138810 "3099 SOUTH RAILWAY" end list
Again, in an ideal world, I'd create some kind of matching metric to say just how closely the observation is to another, and also some way to group those two (or more) observations together. For example, my ideal output looks something like this (using the -groups- function):
Code:
+------------------------------------------------------------------------------------------------------------------------------+ | group_code name other_id street unique~d Freq. MatchPcnt | |------------------------------------------------------------------------------------------------------------------------------| | 1 ARTUSI, TAP GURUS 26244 1000 E. Maple Dr. 2179 1 0.90 | | 1 Rtusi Tap Gurus 156356 1000 E. Maple Dr. 40065 1 0.90 | |------------------------------------------------------------------------------------------------------------------------------| | 2 Abseilon CA, LLC 46044 352 W. Park View Lane 5557 1 1.00 | | 2 Abseilon CA, LLC 118718 352 W. Park View Lane 22202 1 1.00 | | 2 Abseilon CA, LLC 119740 PARK VIEW LN 22471 1 0.95 | | 2 Abseilon CAN, LLC 118717 352 W. Park View Lane 22201 1 0.95 | |------------------------------------------------------------------------------------------------------------------------------| | 3 Accurate Care Chiropractic & Acupuncture PC 102454 35 CAVE RD STE 30 13839 1 0.00 | |------------------------------------------------------------------------------------------------------------------------------| | 4 Aruba Healthy Clinic, LLC 46297 4344 W. Colonial Rd, Ste 8 6663 1 1.00 | | 4 Aruba Healthy Clinic, LLC 176476 4344 W. Colonial Rd Ste 8 44732 1 1.00 | |------------------------------------------------------------------------------------------------------------------------------| | 5 Daniels Funeral Home, Inc. 30042 3500 S. Western Ave 4159 1 0.00 | |------------------------------------------------------------------------------------------------------------------------------| | 6 Dynasty Athletic and Dance, LLC 135168 15985 W. Bartlett AVe 29244 1 0.00 | |------------------------------------------------------------------------------------------------------------------------------| | 7 Pep 4 Free, LLC 72133 28068 Outlook Blvd #1005 9275 1 0.00 | |------------------------------------------------------------------------------------------------------------------------------| | 8 Poledancers Aerial Fitness, LLC 46141 7920 S. Rainbow Blvd #105 6332 1 0.90 | | 8 Poledancers Aerial, LLC 162420 8020 S. Rainbow Blvd 43042 1 0.90 | |------------------------------------------------------------------------------------------------------------------------------| | 9 Price Backmon PLLC 77368 803 W Mohegan Ave Ste 120 38692 1 0.00 | |------------------------------------------------------------------------------------------------------------------------------| | 10 Tango Tanning 819323 GRASSHILL RD 40064 1 0.85 | | 10 Tango Tanning & Salon, LLC 4594 12345 S. Grasshill Road 6732 1 0.85 | |------------------------------------------------------------------------------------------------------------------------------| | 11 Taqueria Momento 131880 3099 WEST RIDGE HIGHWAY 42158 1 0.00 | |------------------------------------------------------------------------------------------------------------------------------| | 11 The Oasis Coffee and Tea 176450 421 E Thomas Rd 44725 1 0.00 | |------------------------------------------------------------------------------------------------------------------------------| | 12 Marisa Bassett, Attorney 04954 12345 Legal Way 9871 1 0.50 | | 12 Maria Clemens, Attorney 138810 3099 SOUTH RAILWAY 1101 1 0.50 | +------------------------------------------------------------------------------------------------------------------------------+
Code:
//Setup values and inspect the data save "temp.dta", replace sort name groups name unique_id other_id, sepby(name) missing //Run matchit function, drop low propensity matches. matchit unique_id name using "temp.dta", override idusing(unique_id) txtusing(name) s(minsimple) gsort -similscore list if unique_id != unique_id1 keep if similscore >= .95 drop if unique_id == unique_id1 //Inspect again, drop duplicates sort name gen morethan2 = (unique_id != unique_id1 & name == name1) duplicates drop name if morethan2 != 0, force duplicates drop name, force //Keep the list of singular businesses and merge them back into the original dataset keep unique_id merge 1:1 unique_id using "temp.dta", gen(flag) label define flag 2 keep 3 drop label values flag flag groups name unique_id other_id flag, sepby(name) missing show(freq) save "final.dta", replace
Any help would be greatly appreciated. Hopefully my question is clear enough and my terminology accurate enough to not incur the wrath of he-who-must-not-be-named-on-the-Statalist.
Comment