checking string similarity within the same variable

Moniek Bresser

Join Date: Aug 2018

Posts: 29
#1


17 May 2019, 01:15

Dear all,

This message was initially posted in the discussion thread https://www.statalist.org/forums/forum/general-stata-discussion/general/1307980-matchit-command-to-match-two-datasets-based-on-similar-text-pattern, but I was advised to post it as a new thread with a title better matching my question, so here we go!

In most of the string similarity discussions on Statalist, users are trying to find similarities between two variables. I, however, would like to get a similarity score for observations within the same string variable. My data set contains more than 10000 person records, and most likely hundreds of people occur in it multiple times, but with slightly differently spelled names.

Do you have any experience with checking for string similarity within the same variable, and may I ask which package you decided to use in the end?

Thank you for sharing your experience!

Best wishes,

Moniek

Andrew Musau

Join Date: Oct 2014
Posts: 10254

17 May 2019, 15:30

Code:

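sysuse auto
keep make
* save a renamed copy of the variable to pair against
preserve
rename make make2
tempfile make2
save `make2'
restore
* form all pairwise combinations of make and make2
cross using `make2'
* keep each unordered pair once and drop self-pairs
drop if make2>=make
* compute the similarity score (matchit is available from SSC)
matchit make make2
list if similscore>0.5, clean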

Result:

Code:

. list if similscore>0.5, clean

        make                make2               similsc~e  
 106.   Buick Regal         Buick Opel          .52704628  
 116.   Buick Riviera       Buick Regal         .54772256  
 165.   Cad. Seville        Cad. Deville        .81818182  
 197.   Chev. Impala        Chev. Chevette      .55337157  
 213.   Chev. Malibu        Chev. Chevette      .55337157  
 214.   Chev. Malibu        Chev. Impala        .54545455  
 230.   Chev. Monte Carlo   Chev. Chevette      .57353933  
 248.   Chev. Monza         Chev. Chevette        .580381  
 250.   Chev. Monza         Chev. Malibu        .57207755  
 251.   Chev. Monza         Chev. Monte Carlo   .63245553  
 267.   Chev. Nova          Chev. Chevette      .61177529  
 268.   Chev. Nova          Chev. Impala        .50251891  
 269.   Chev. Nova          Chev. Malibu        .50251891  
 271.   Chev. Nova          Chev. Monza         .52704628  
 344.   Dodge Magnum        Dodge Colt          .50251891  
 433.   Ford Mustang        Ford Fiesta         .57207755  
 652.   Merc. Marquis       Merc. Cougar        .52223297  
 692.   Merc. Monarch       Merc. Cougar        .56407607  
 693.   Merc. Monarch       Merc. Marquis        .6172134  
 732.   Merc. XR-7          Merc. Bobcat        .50251891  
 733.   Merc. XR-7          Merc. Cougar        .50251891  
 735.   Merc. XR-7          Merc. Monarch       .53452248  
 778.   Merc. Zephyr        Merc. XR-7          .50251891  
 913.   Olds Cutlass        Olds Cutl Supr      .66899361  
1005.   Olds Omega          Olds 98             .54433105  
1318.   Plym. Sapporo       Plym. Arrow         .54772256  
1413.   Pont. Catalina      Linc. Continental   .55815631  
1670.   Pont. Phoenix       Pont. Grand Prix    .52174919  
1730.   Pont. Sunbird       Pont. Firebird      .67082039  
1797.   Datsun 210          Datsun 200          .77777778  
1819.   Datsun 510          Datsun 200          .66666667  
1820.   Datsun 510          Datsun 210          .77777778  
1842.   Datsun 810          Datsun 200          .66666667  
1843.   Datsun 810          Datsun 210          .77777778  
1844.   Datsun 810          Datsun 510          .77777778  
2284.   Toyota Corolla      Toyota Celica       .56044854  
2350.   Toyota Corona       Toyota Celica       .58333333  
2351.   Toyota Corona       Toyota Corolla      .80064077
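
The row counts involved can be worked out in advance. cross builds every ordered pair, so N observations give N^2 rows, and drop if make2>=make leaves the N*(N-1)/2 unordered pairs, assuming all values are distinct (as they are for make in auto.dta). A quick check of the arithmetic, run from a do-file:

```stata
* rows after cross for auto.dta (N = 74)
display 74^2
* unordered pairs left after the drop
display comb(74, 2)
* the same two quantities for N = 10,000
display %15.0fc 10000^2
display %15.0fc comb(10000, 2)
```

For the 74 makes this is 5,476 rows after cross and 2,701 pairs to score. For 10,000 records it is 100,000,000 rows after cross and 49,995,000 pairs.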


Moniek Bresser

Join Date: Aug 2018

Posts: 29
#3

17 Jun 2019, 03:17

Thank you very much, Andrew, for suggesting cross to me.

However, as my dataset contains over 10'000 patients, Stata could not create all unique pairs within a reasonable time frame (after 2 hours, still nothing).

In the end, I checked for duplicates per town. This way I create multiple smaller data subsets, which greatly reduces the number of unique pairs that Stata needs to create. With this workaround, Stata was able to run cross on all my data.

I am wondering, though, about other people with large datasets who have no additional variable such as town to filter on and so cannot split their data into smaller subsets: is there a way to still use cross without Stata taking multiple hours or days to run?
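
A minimal sketch of this kind of within-group pairing, not the exact code used here. It assumes that joinby is an acceptable substitute for running cross separately on each subset, and it uses auto.dta as stand-in data, with foreign playing the role of town and make the role of the name variable. matchit is the SSC command from Andrew's code.

```stata
* stand-in data: make = name, foreign = town (both are assumptions)
sysuse auto, clear
keep make foreign

* save a renamed copy of the name variable to pair against
preserve
rename make make2
tempfile copy
save `copy'
restore

* joinby forms all pairwise combinations within each level of foreign only
joinby foreign using `copy'

* keep each unordered pair once and drop self-pairs
drop if make2 >= make

* score the remaining pairs and list the likely matches
matchit make make2
list if similscore > 0.5, clean
```

Because pairs are only formed inside a group, the number of rows is the sum of n*(n-1)/2 over groups rather than N*(N-1)/2 for the whole file. The price is that a person recorded under two different towns would never be compared.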

Thank you for sharing your ideas!

Best wishes,

Moniek

Andrew Musau

Join Date: Oct 2014
Posts: 10254

19 Jun 2019, 11:01

I guess that this has to do with available memory and processor speed. You can always break the task into several crosses. Replicating your task with 10000 observations, my computer refuses to provide the memory if I do it all at once, but if I break it up into tenths, it takes slightly under 4 minutes.

Code:

*CREATE DATA SET
clear
set obs 10000
gen make = string(_n)
tempfile data
save `data'
*TURN TIMER ON
timer on 1
*BREAK IT UP INTO TENTHS (OBS 1-1000, 1001-2000, ...) AND SAVE EACH AS make2
forval i = 1/10 {
    keep if inrange(_n, `i'000-999, `i'000)
    rename make make2
    tempfile ds`i'
    save `ds`i''
    use `data', clear
}

*CROSS THE FULL DATA SET WITH EACH TENTH
forval i = 1/10 {
    cross using `ds`i''
    tempfile cds`i'
    save `cds`i''
    use `data', clear
}

*APPEND DATA SETS
use `cds1', clear
forval i = 2/10 {
    append using `cds`i''
}
*TURN OFF TIMER
timer off 1

Result:

Code:

. timer list 1
   1:    234.20 /        1 =     234.2040
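
The timing example stores all crossed rows. A possible variant, sketched here under assumptions and not timed, scores and filters each chunk before saving it, so that only candidate pairs accumulate. The id variable, the chunk size of 500, the 0.5 cutoff and the reduced 2000 observations are all choices made for this illustration, and matchit is assumed to be installed from SSC.

```stata
* stand-in data, as in the timing example but smaller
clear
set obs 2000
gen long id = _n
gen make = string(_n)
tempfile data
save `data'

local N = _N
* chunk size: an arbitrary choice
local step 500
tempfile pairs
local first 1

forvalues lo = 1(`step')`N' {
    local hi = min(`lo' + `step' - 1, `N')

    * one chunk of the data, renamed so it can be crossed with the full set
    use `data', clear
    keep if inrange(id, `lo', `hi')
    rename (id make) (id2 make2)
    cross using `data'

    * keep each unordered pair once, score it, keep only likely matches
    drop if id2 >= id
    matchit make make2
    keep if similscore > 0.5

    * accumulate the filtered chunks
    if `first' {
        save `pairs', emptyok
        local first 0
    }
    else {
        append using `pairs'
        save `pairs', replace emptyok
    }
}

use `pairs', clear
```

Comparing the numeric id rather than the strings keeps exactly one row per unordered pair even when two records carry the same name, which is the case of interest when looking for duplicates.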
