Merging datasets in Stata on long strings and less precise matching criterion

Attaullah Shah started a topic Merging datasets in Stata on long strings and less precise matching criterion

14 Dec 2018, 22:35

In my new blog post, I discuss a problem I often encounter when merging two datasets on a key variable that is not exactly the same in the two datasets. I first describe the problem and then present two solutions: one without an ado program and one with an ado program. The program is provisionally called smerge; it is still in beta and will need a major overhaul and testing before it is formally made available on SSC. Any comments or suggestions are welcome. The blog post can be accessed here: http://fintechprofessor.com/2018/12/...ing-criterion/
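
To make the problem concrete, here is a minimal sketch (not taken from the blog post) that uses a few of the company names appearing in this thread, already lowercased. The tempfile name symbols is only an illustrative choice. An exact merge on name links only the observations whose names are identical in both files.

```stata
* Illustrative data: truncated names with their symbols
clear
input str30 name str5 symbol
"the pakistan general insurance" "PKGI"
"national bank of pakistan"      "NBP"
"the bank of khyber"             "BOK"
"awwal modaraba"                 "AWWAL"
end
tempfile symbols
save `symbols'

* Illustrative data: the same companies with slightly different names
clear
input str39 name
"the pakistan general insurance co. ltd."
"national bank of pakistan"
"the bank of khyber"
"awwal modaraba comay"
end

* An exact merge treats any difference in the key as a non-match
merge 1:1 name using `symbols'

* Only the two names that are identical in both files get _merge == 3
tabulate _merge
```

The two remaining companies appear twice in the result, once from each file, which is the situation a less precise matching criterion is meant to resolve.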

Attaullah Shah replied

15 Dec 2018, 21:11
Thanks, William Lisowski, for your reply. Unfortunately, or fortunately, I had not used matchit before writing the blog post. As I can see from your post, matchit takes a different route than smerge when doing the merge and produces different output. I think both programs have their own merits and may be useful in different circumstances. smerge still has a long way to go before it is more useful than it is in its present form.

William Lisowski replied

15 Dec 2018, 09:08

Note also that others here have had success with the user-written matchit command from Julio Raffo, as discussed in the following two topics.

http://www.statalist.org/forums/foru...s-observations

http://www.statalist.org/forums/foru...-e-fuzzy-match

Added in edit: here's an example of applying it to the data in the blog post.

Code:

* matchit is a user-written command available from SSC (ssc install matchit)

* First dataset: truncated names with their symbols
clear
input str30 name str5 symbol
"The Pakistan General Insurance" "PKGI"
"Security Leasing Corporation L" "SLCL"
"NATIONAL BANK OF PAKISTAN"      "NBP"
"THE BANK OF KHYBER"             "BOK"
"THE BANK OF PUNJAB"             "BOP"
"SILKBANK LIMITED"               "SILK"
"Awwal Modaraba"                 "AWWAL"
"B.F. Modaraba"                  "BFMOD"
end
* Lowercase the names so that case differences do not affect the match
replace name = lower(name)
generate idmem = _n
tempfile mem
save `"`mem'"'

* Second dataset: full names to be matched against the first
clear
input str39 name
"The Pakistan General Insurance Co. Ltd."
"Security Leasing Corporation Ltd."
"NATIONAL BANK OF PAKISTAN"
"THE BANK OF KHYBER"
"THE BANK OF PUNJAB Ltd"
"SILKBANK LIMITED ."
"Awwal Modaraba Comay"
"B.F. Modaraba."
end
replace name = lower(name)
generate iddat = _n
tempfile dat
save `dat'

* Fuzzy-match the names in memory against the names in the first dataset
matchit iddat name using `"`mem'"', idusing(idmem) txtusing(name)

Code:

. list iddat name idmem name1 similscore, clean noobs abbrev(12)

    iddat                                      name   idmem                            name1   similscore  
        1   the pakistan general insurance co. ltd.       1   the pakistan general insurance     .8922686  
        2         security leasing corporation ltd.       2   security leasing corporation l    .95742711  
        3                 national bank of pakistan       3        national bank of pakistan            1  
        4                        the bank of khyber       5               the bank of punjab    .64705882  
        4                        the bank of khyber       4               the bank of khyber            1  
        5                    the bank of punjab ltd       5               the bank of punjab    .89973541  
        5                    the bank of punjab ltd       4               the bank of khyber    .58218174  
        6                        silkbank limited .       6                 silkbank limited    .93933644  
        7                      awwal modaraba comay       8                    b.f. modaraba    .52981294  
        7                      awwal modaraba comay       7                   awwal modaraba    .82717019  
        8                            b.f. modaraba.       8                    b.f. modaraba    .96076892  
        8                            b.f. modaraba.       7                   awwal modaraba    .61538462

These results can then be used to generate a correspondence file for merging the two datasets, adding the appropriate symbol from the first dataset to the observations in the second dataset.
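
One possible way to build such a correspondence file is sketched here. It is an assumption about how the step could be done, not the method used in the post or the blog. It keeps the highest-scoring candidate for each iddat and then applies a cutoff of 0.8, which is an arbitrary illustrative value. So that it runs without matchit, the candidate pairs from the listing above are typed in by hand with rounded scores, and the two datasets are recreated as stand-ins. The tempfile names syms, names and xwalk are hypothetical.

```stata
* Stand-in for the first dataset: symbols keyed by idmem, as in the post
clear
input idmem str5 symbol
1 "PKGI"
2 "SLCL"
3 "NBP"
4 "BOK"
5 "BOP"
6 "SILK"
7 "AWWAL"
8 "BFMOD"
end
tempfile syms
save `syms'

* Stand-in for the second dataset: names keyed by iddat, as in the post
clear
input iddat str39 name
1 "the pakistan general insurance co. ltd."
2 "security leasing corporation ltd."
3 "national bank of pakistan"
4 "the bank of khyber"
5 "the bank of punjab ltd"
6 "silkbank limited ."
7 "awwal modaraba comay"
8 "b.f. modaraba."
end
tempfile names
save `names'

* Candidate pairs copied from the matchit listing (scores rounded)
clear
input iddat idmem similscore
1 1 .8923
2 2 .9574
3 3 1
4 5 .6471
4 4 1
5 5 .8997
5 4 .5822
6 6 .9393
7 8 .5298
7 7 .8272
8 8 .9608
8 7 .6154
end

* Keep the highest-scoring candidate for each iddat
bysort iddat (similscore): keep if _n == _N

* Assumed cutoff: discard weak best matches (0.8 is arbitrary)
keep if similscore >= 0.8

* The correspondence file links iddat to idmem
keep iddat idmem
tempfile xwalk
save `xwalk'

* Attach idmem to the second dataset, then bring in the symbol
use `names', clear
merge 1:1 iddat using `xwalk', keep(master match) nogenerate
merge m:1 idmem using `syms', keepusing(symbol) keep(master match) nogenerate
list iddat name symbol, clean noobs
```

Keeping only the top-scoring candidate is a simple rule; with real data it would be worth inspecting the pairs near the cutoff by hand before trusting the merge.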

Last edited by William Lisowski; 15 Dec 2018, 09:15.
