Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merging datasets in Stata on long strings and less precise matching criterion

    In my new blog post, I have discussed a problem that I usually encounter in merging two datasets on a key variable that is not precisely the same in the two datasets. I have first elaborated the problem and then presented two solutions: One without using an ado program and the other using an ado program. The program is initially called smerge, which of course is in beta stage and shall need a major overhaul and testing before making it formally available on SSC. Any comment or suggestions are welcomed. The blog post can be accessed from here http://fintechprofessor.com/2018/12/...ing-criterion/

  • Attaullah Shah
    replied
    Thanks, William Lisowski for your reply. Unfortunately, or fortunately, I had not used matchitbefore writing the blog. As I can see from your post, matchit takes a different route than smerge when doing the merge and produces a different output than what smerge produces. I think both the programs have their own merits and might be specifically useful in different circumstances. smerge has to go a long way to be more useful than what it is in its present form.

    Leave a comment:


  • William Lisowski
    replied
    Note also that others here have had success with the user-written matchit command from Julio Raffo, as discussed in the following two topics.

    http://www.statalist.org/forums/foru...s-observations

    http://www.statalist.org/forums/foru...-e-fuzzy-match

    Added in edit: here's an example of applying it to the data in the blog post.
    Code:
    clear
    input str30 name str5 symbol
    "The Pakistan General Insurance" "PKGI"
    "Security Leasing Corporation L" "SLCL"
    "NATIONAL BANK OF PAKISTAN"      "NBP"  
    "THE BANK OF KHYBER"             "BOK"  
    "THE BANK OF PUNJAB"             "BOP"  
    "SILKBANK LIMITED"               "SILK"
    "Awwal Modaraba"                 "AWWAL"
    "B.F. Modaraba"                  "BFMOD"
    end
    replace name = lower(name)
    generate idmem = _n
    tempfile mem
    save `"`mem'"'
    
    clear
    input str39 name
    "The Pakistan General Insurance Co. Ltd."
    "Security Leasing Corporation Ltd."      
    "NATIONAL BANK OF PAKISTAN"              
    "THE BANK OF KHYBER"                    
    "THE BANK OF PUNJAB Ltd"                
    "SILKBANK LIMITED ."                    
    "Awwal Modaraba Comay"                  
    "B.F. Modaraba."                        
    end
    replace name = lower(name)
    generate iddat = _n
    tempfile dat
    save `dat'
    
    matchit iddat name using `"`mem'"', idusing(idmem) txtusing(name)
    Code:
    . list iddat name idmem name1 similscore, clean noobs abbrev(12)
    
        iddat                                      name   idmem                            name1   similscore  
            1   the pakistan general insurance co. ltd.       1   the pakistan general insurance     .8922686  
            2         security leasing corporation ltd.       2   security leasing corporation l    .95742711  
            3                 national bank of pakistan       3        national bank of pakistan            1  
            4                        the bank of khyber       5               the bank of punjab    .64705882  
            4                        the bank of khyber       4               the bank of khyber            1  
            5                    the bank of punjab ltd       5               the bank of punjab    .89973541  
            5                    the bank of punjab ltd       4               the bank of khyber    .58218174  
            6                        silkbank limited .       6                 silkbank limited    .93933644  
            7                      awwal modaraba comay       8                    b.f. modaraba    .52981294  
            7                      awwal modaraba comay       7                   awwal modaraba    .82717019  
            8                            b.f. modaraba.       8                    b.f. modaraba    .96076892  
            8                            b.f. modaraba.       7                   awwal modaraba    .61538462
    These results can then be used to generate a correspondence file that can be used to merge the two datasets and add the appropriate symbol from the first dataset to the observations in the second dataset.
    Last edited by William Lisowski; 15 Dec 2018, 09:15.

    Leave a comment:

Working...
X