Grouping observations by several string variables needing fuzzy matching in order to group

Michael Costello

Join Date: Dec 2015
Posts: 30

Grouping observations by several string variables needing fuzzy matching in order to group

05 Aug 2020, 14:46

I have an extremely messy dataset that I've been asked to work with. I'm attempting to find duplicate entries based of a several variables, some numeric, but mostly string variables. I have tried using the -matchit- function to inspect the string variables one-at-a-time to look for duplicates, and the -duplicates- function to work with the numeric variables. In an ideal world I'd like to . . .

1) generate some kind of metric for how closely an observation matches another
2) create some kind of group variable to help tell the difference between unrelated observations.

I can sort of do the first by using the -matchit- function to match the dataset to itself. The original dataset is ~50,000 observations, so this is very time consuming. However, there is a variable in the dataset that allows me to divide up the full dataset into ~200 smaller regional datasets and where it is extremely unlikely that a record (company) could be in two different regions. So, by working with the mini-datasets the -matchit- function becomes much more manageable.

Here is a mock datasets. Note that I have a unique ID variable that I created, as well as a variable which is supposed to be unique to a company/business, but obviously is not. There is also a name and address field, neither of which are always consistently formatted.

Code:

clear
input str50 name long unique_id long other_id str50 street
"ARTUSI, TAP GURUS" 2179 26244 "1000 E. Maple Dr."
"The Oasis Coffee and Tea" 44725 176450 "421 E Thomas Rd"
"Dynasty Athletic and Dance, LLC" 29244 135168 "15985 W. Bartlett AVe"
"Abseilon CA, LLC" 22471 119740 "PARK VIEW LN"
"Abseilon CA, LLC" 22202 118718 "352 W. Park View Lane"
"Abseilon CA, LLC" 5557 46044 "352 W. Park View Lane"
"Abseilon CAN, LLC" 22201 118717 "352 W. Park View Lane"
"Accurate Care Chiropractic & Acupuncture PC" 13839 102454 "35 CAVE RD STE 30"
"Aruba Healthy Clinic, LLC" 44732 176476 "4344 W. Colonial Rd Ste 8"
"Aruba Healthy Clinic, LLC" 6663 46297 " 4344 W. Colonial Rd, Ste 8"
"Pep 4 Free, LLC" 9275 72133 "28068 Outlook Blvd #1005"
"Poledancers Aerial, LLC" 43042 162420 "8020 S. Rainbow Blvd"
"Poledancers Aerial Fitness, LLC" 6332 46141 "7920 S. Rainbow Blvd #105"
"Daniels Funeral Home, Inc." 4159 30042 "3500 S. Western Ave"
"Price Backmon PLLC" 38692 77368 "803 W Mohegan Ave Ste 120"
"Rtusi" 40065 156356 "1000 E. Maple Dr."
"Tango Tanning" 40064 819323 "GRASSHILL RD"
"Tango Tanning & Salon, LLC" 6732 04594 "12345 S. Grasshill Road"
"Taqueria Momento" 42158 131880 "3099 WEST RIDGE HIGHWAY"
"Marisa Bassett, Attorney" 9871 04954 "12345 Legal Way"
"Maria Clemens, Attorney" 1101 138810 "3099 SOUTH RAILWAY"
end
list

Most of the names are close enough that the -matchit- function finds them pretty quickly. However, with "Artusi Tap Gurus" and "Rtusi" it's the address match that would likely link them. Similarly, the "Tango Tanning" observations could be matched by a combination of the name and address.

Again, in an ideal world, I'd create some kind of matching metric to say just how closely the observation is to another, and also some way to group those two (or more) observations together. For example, my ideal output looks something like this (using the -groups- function):

Code:

  +------------------------------------------------------------------------------------------------------------------------------+
  | group_code                                        name   other_id                        street   unique~d   Freq. MatchPcnt |
  |------------------------------------------------------------------------------------------------------------------------------|
  |        1                             ARTUSI, TAP GURUS      26244             1000 E. Maple Dr.       2179       1      0.90 |
  |        1                               Rtusi Tap Gurus     156356             1000 E. Maple Dr.      40065       1      0.90 |
  |------------------------------------------------------------------------------------------------------------------------------|
  |        2                              Abseilon CA, LLC      46044         352 W. Park View Lane       5557       1      1.00 |
  |        2                              Abseilon CA, LLC     118718         352 W. Park View Lane      22202       1      1.00 |
  |        2                              Abseilon CA, LLC     119740                  PARK VIEW LN      22471       1      0.95 |
  |        2                             Abseilon CAN, LLC     118717         352 W. Park View Lane      22201       1      0.95 |
  |------------------------------------------------------------------------------------------------------------------------------|
  |        3   Accurate Care Chiropractic & Acupuncture PC     102454             35 CAVE RD STE 30      13839       1      0.00 |
  |------------------------------------------------------------------------------------------------------------------------------|
  |        4                     Aruba Healthy Clinic, LLC      46297    4344 W. Colonial Rd, Ste 8       6663       1      1.00 |
  |        4                     Aruba Healthy Clinic, LLC     176476     4344 W. Colonial Rd Ste 8      44732       1      1.00 |
  |------------------------------------------------------------------------------------------------------------------------------|
  |        5                    Daniels Funeral Home, Inc.      30042           3500 S. Western Ave       4159       1      0.00 |
  |------------------------------------------------------------------------------------------------------------------------------|
  |        6               Dynasty Athletic and Dance, LLC     135168         15985 W. Bartlett AVe      29244       1      0.00 |
  |------------------------------------------------------------------------------------------------------------------------------|
  |        7                               Pep 4 Free, LLC      72133      28068 Outlook Blvd #1005       9275       1      0.00 |
  |------------------------------------------------------------------------------------------------------------------------------|
  |        8               Poledancers Aerial Fitness, LLC      46141     7920 S. Rainbow Blvd #105       6332       1      0.90 |
  |        8                       Poledancers Aerial, LLC     162420          8020 S. Rainbow Blvd      43042       1      0.90 |
  |------------------------------------------------------------------------------------------------------------------------------|
  |        9                            Price Backmon PLLC      77368     803 W Mohegan Ave Ste 120      38692       1      0.00 |
  |------------------------------------------------------------------------------------------------------------------------------|
  |       10                                 Tango Tanning     819323                  GRASSHILL RD      40064       1      0.85 |
  |       10                    Tango Tanning & Salon, LLC       4594       12345 S. Grasshill Road       6732       1      0.85 |
  |------------------------------------------------------------------------------------------------------------------------------|
  |       11                              Taqueria Momento     131880       3099 WEST RIDGE HIGHWAY      42158       1      0.00 |
  |------------------------------------------------------------------------------------------------------------------------------|
  |       11                      The Oasis Coffee and Tea     176450               421 E Thomas Rd      44725       1      0.00 |
  |------------------------------------------------------------------------------------------------------------------------------|
  |       12                      Marisa Bassett, Attorney      04954               12345 Legal Way       9871       1      0.50 |
  |       12                       Maria Clemens, Attorney     138810            3099 SOUTH RAILWAY       1101       1      0.50 |
  +------------------------------------------------------------------------------------------------------------------------------+

This is where I'm at at the moment. Apologies for having not the most beautiful code. (This comes immediately after defining the dataset from above.)

Code:

    //Setup values and inspect the data
    save "temp.dta", replace
    sort name
    groups name unique_id other_id, sepby(name) missing
    
    //Run matchit function, drop low propensity matches.
    matchit unique_id name using "temp.dta", override idusing(unique_id) txtusing(name) s(minsimple)
    gsort -similscore
    list if unique_id != unique_id1
    keep if similscore >= .95
    drop if unique_id == unique_id1
    
    //Inspect again, drop duplicates
    sort name
    gen morethan2 = (unique_id != unique_id1 & name == name1)
    duplicates drop name if morethan2 != 0, force
    duplicates drop name, force
    
    //Keep the list of singular businesses and merge them back into the original dataset
    keep unique_id
    merge 1:1 unique_id using "temp.dta", gen(flag)
       label define flag 2 keep 3 drop
       label values flag flag
    groups name unique_id other_id flag, sepby(name) missing show(freq)
    save "final.dta", replace

So this code doesn't work on the addresses at all, it's unable to match the "Artusi" and the "Rtusi", and it correctly identifies the "Tango's" as related, but doesn't figure out how to mark one as good and the other as duplicate. Same with the "Abseilon"

Any help would be greatly appreciated. Hopefully my question is clear enough and my terminology accurate enough to not incur the wrath of he-who-must-not-be-named-on-the-Statalist.

Tags: fuzzy, matchit, multiple variables, string

Andrew Musau

Join Date: Oct 2014
Posts: 10214

06 Aug 2020, 03:28

I will say that I am no fan of fuzzy matching. Names are one thing, but addresses are a completely different beast. Agglomeration is common in a number of industries, e.g., manufacturing, and as a result, you find that many businesses share the same physical address. Therefore, you need to exercise caution when using this variable as a matching criterion. Creating two scores, one based on name and the other on address and taking the maximum of the scores is one way to achieve your grouping. Defining what score constitutes a match is more an art than a science, so I leave that to you. 0.5 works well for the data example that you provide. matchit is from SSC (FAQ Advice #12).

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str50 name long(unique_id other_id) str50 street
"ARTUSI, TAP GURUS"                            2179  26244 "1000 E. Maple Dr."          
"The Oasis Coffee and Tea"                    44725 176450 "421 E Thomas Rd"            
"Dynasty Athletic and Dance, LLC"             29244 135168 "15985 W. Bartlett AVe"      
"Abseilon CA, LLC"                            22471 119740 "PARK VIEW LN"              
"Abseilon CA, LLC"                            22202 118718 "352 W. Park View Lane"      
"Abseilon CA, LLC"                             5557  46044 "352 W. Park View Lane"      
"Abseilon CAN, LLC"                           22201 118717 "352 W. Park View Lane"      
"Accurate Care Chiropractic & Acupuncture PC" 13839 102454 "35 CAVE RD STE 30"          
"Aruba Healthy Clinic, LLC"                   44732 176476 "4344 W. Colonial Rd Ste 8"  
"Aruba Healthy Clinic, LLC"                    6663  46297 " 4344 W. Colonial Rd, Ste 8"
"Pep 4 Free, LLC"                              9275  72133 "28068 Outlook Blvd #1005"  
"Poledancers Aerial, LLC"                     43042 162420 "8020 S. Rainbow Blvd"      
"Poledancers Aerial Fitness, LLC"              6332  46141 "7920 S. Rainbow Blvd #105"  
"Daniels Funeral Home, Inc."                   4159  30042 "3500 S. Western Ave"        
"Price Backmon PLLC"                          38692  77368 "803 W Mohegan Ave Ste 120"  
"Rtusi"                                       40065 156356 "1000 E. Maple Dr."          
"Tango Tanning"                               40064 819323 "GRASSHILL RD"              
"Tango Tanning & Salon, LLC"                   6732   4594 "12345 S. Grasshill Road"    
"Taqueria Momento"                            42158 131880 "3099 WEST RIDGE HIGHWAY"    
"Marisa Bassett, Attorney"                     9871   4954 "12345 Legal Way"            
"Maria Clemens, Attorney"                      1101 138810 "3099 SOUTH RAILWAY"        
end

tempfile list
frame put *, into(list)
frame put *, into(list2)
frame list{
    rename (*) (*2)
    gen name=name2
    save `list'
}
frame list2{
    cross using `list'
    drop if name<=name2
    *ssc install matchit
    *ssc install freqindex
    matchit name name2
    rename similscore score1
    matchit street street2
    egen score=rowmax(score1 similscore)
    keep if score>0.5
    keep name name2 score
    contract name name2
    replace _freq=_n
    rename name name1
    reshape long name, i(_freq) j(which)
    bys _freq (which): gen name2=name[1]
    keep name name2
    save `list',replace
}
merge m:1 name using `list',keep (master match) nogen
replace name2=name if missing(name2)
sort name2
l, sepby(name2)

Res.:

Code:

. l, sepby(name2)

     +-----------------------------------------------------------------------------------------------------------------------------------------------+
     |                                        name   unique~d   other_id                        street                                         name2 |
     |-----------------------------------------------------------------------------------------------------------------------------------------------|
  1. |                           Abseilon CAN, LLC      22201     118717         352 W. Park View Lane                             Abseilon CAN, LLC |
  2. |                            Abseilon CA, LLC      22202     118718         352 W. Park View Lane                             Abseilon CAN, LLC |
  3. |                            Abseilon CA, LLC       5557      46044         352 W. Park View Lane                             Abseilon CAN, LLC |
  4. |                            Abseilon CA, LLC      22471     119740                  PARK VIEW LN                             Abseilon CAN, LLC |
     |-----------------------------------------------------------------------------------------------------------------------------------------------|
  5. | Accurate Care Chiropractic & Acupuncture PC      13839     102454             35 CAVE RD STE 30   Accurate Care Chiropractic & Acupuncture PC |
     |-----------------------------------------------------------------------------------------------------------------------------------------------|
  6. |                   Aruba Healthy Clinic, LLC      44732     176476     4344 W. Colonial Rd Ste 8                     Aruba Healthy Clinic, LLC |
  7. |                   Aruba Healthy Clinic, LLC       6663      46297    4344 W. Colonial Rd, Ste 8                     Aruba Healthy Clinic, LLC |
     |-----------------------------------------------------------------------------------------------------------------------------------------------|
  8. |                  Daniels Funeral Home, Inc.       4159      30042           3500 S. Western Ave                    Daniels Funeral Home, Inc. |
     |-----------------------------------------------------------------------------------------------------------------------------------------------|
  9. |             Dynasty Athletic and Dance, LLC      29244     135168         15985 W. Bartlett AVe               Dynasty Athletic and Dance, LLC |
     |-----------------------------------------------------------------------------------------------------------------------------------------------|
 10. |                    Marisa Bassett, Attorney       9871       4954               12345 Legal Way                      Marisa Bassett, Attorney |
 11. |                     Maria Clemens, Attorney       1101     138810            3099 SOUTH RAILWAY                      Marisa Bassett, Attorney |
     |-----------------------------------------------------------------------------------------------------------------------------------------------|
 12. |                             Pep 4 Free, LLC       9275      72133      28068 Outlook Blvd #1005                               Pep 4 Free, LLC |
     |-----------------------------------------------------------------------------------------------------------------------------------------------|
 13. |                     Poledancers Aerial, LLC      43042     162420          8020 S. Rainbow Blvd                       Poledancers Aerial, LLC |
 14. |             Poledancers Aerial Fitness, LLC       6332      46141     7920 S. Rainbow Blvd #105                       Poledancers Aerial, LLC |
     |-----------------------------------------------------------------------------------------------------------------------------------------------|
 15. |                          Price Backmon PLLC      38692      77368     803 W Mohegan Ave Ste 120                            Price Backmon PLLC |
     |-----------------------------------------------------------------------------------------------------------------------------------------------|
 16. |                                       Rtusi      40065     156356             1000 E. Maple Dr.                                         Rtusi |
 17. |                           ARTUSI, TAP GURUS       2179      26244             1000 E. Maple Dr.                                         Rtusi |
     |-----------------------------------------------------------------------------------------------------------------------------------------------|
 18. |                               Tango Tanning      40064     819323                  GRASSHILL RD                    Tango Tanning & Salon, LLC |
 19. |                  Tango Tanning & Salon, LLC       6732       4594       12345 S. Grasshill Road                    Tango Tanning & Salon, LLC |
     |-----------------------------------------------------------------------------------------------------------------------------------------------|
 20. |                            Taqueria Momento      42158     131880       3099 WEST RIDGE HIGHWAY                              Taqueria Momento |
     |-----------------------------------------------------------------------------------------------------------------------------------------------|
 21. |                    The Oasis Coffee and Tea      44725     176450               421 E Thomas Rd                      The Oasis Coffee and Tea |
     +-----------------------------------------------------------------------------------------------------------------------------------------------+

Comment

Michael Costello

Join Date: Dec 2015

Posts: 30
#3

10 Aug 2020, 09:06

Andrew, this is basically perfect. Thank you so much, I really appreciate it!
Comment

Announcement

Grouping observations by several string variables needing fuzzy matching in order to group

Comment

Comment