Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Fuzzy String Matches - optimize procedure; reclink(2) or matchit? Automated loops possible?

    Dear all,

    I trying for a new project to matching fuzzy strings together using -reclink-, -reclink2- and -matchit-. I am using STATA 15 (64-bit) and Windows 10.
    Since all of the aforementioned user-written commands were discussed in previous posts, I omit to post the code for them.

    Thus, my data example is the following: (I am sorry that it ended up being so long but this should give you a much better picture of what the issues are


    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str34 region str75 mun float i_A
    "Bodenseekreis" "Ailingen"            1
    "Bodenseekreis" "Berg"                2
    "Bodenseekreis" "Ettenkirch"          3
    "Bodenseekreis" "Flunau"              4
    "Bodenseekreis" "Friedrichshafen"     5
    "Bodenseekreis" "Hemighofen"          6
    "Bodenseekreis" "Hirschlatt"          7
    "Bodenseekreis" "Langnau"             8
    "Bodenseekreis" "Liebenau"            9
    "Bodenseekreis" "Nonnenbach"         10
    "Bodenseekreis" "Obereisenbach"      11
    "Bodenseekreis" "Schnetzenhausen"    12
    "Bodenseekreis" "Tettnang"           13
    "Böblingen"    "Affstätt"          14
    "Böblingen"    "Böblingen"         15
    "Böblingen"    "Dachtel"            16
    "Böblingen"    "Dagersheim"         17
    "Böblingen"    "Darmsheim"          18
    "Böblingen"    "Deufringen"         19
    "Böblingen"    "Dätzingen"         20
    "Böblingen"    "Döffingen"         21
    "Böblingen"    "Eltingen"           22
    "Böblingen"    "Flacht"             23
    "Böblingen"    "Gebersheim"         24
    "Böblingen"    "Haslach"            25
    "Böblingen"    "Hausen"             26
    "Böblingen"    "Höfingen"          27
    "Böblingen"    "Kayh"               28
    "Böblingen"    "Kuppingen"          29
    "Böblingen"    "Leonberg"           30
    "Böblingen"    "Maichingen"         31
    "Böblingen"    "Malmsheim"          32
    "Böblingen"    "Merklingen"         33
    "Böblingen"    "Mönchberg"         34
    "Böblingen"    "Münklingen"        35
    "Böblingen"    "Neuweiler"          36
    "Böblingen"    "Oberjesingen"       37
    "Böblingen"    "Oberjettingen"      38
    "Böblingen"    "Perouse"            39
    "Böblingen"    "Rohrau"             40
    "Böblingen"    "Schaffhausen"       41
    "Böblingen"    "Schönaich"         42
    "Böblingen"    "Tailfingen"         43
    "Böblingen"    "Warmbronn"          44
    "Böblingen"    "Weil im Schönbuch" 45
    "Böblingen"    "Öschelbronn"       46
    end
    *this is part A of the file
    save A.dta,replace
     
    clear
    input str34 region str75 mun float i_B
    "Bodenseekreis" "Ahausen"                                                             1
    "Bodenseekreis" "Ailingen Stadt Friedrichshafen"                                      2
    "Bodenseekreis" "Altheim Gemeinde Frickingen"                                         3
    "Bodenseekreis" "Andelshofen Stadt Überlingen"                                       4
    "Bodenseekreis" "Baitenhausen"                                                        5
    "Bodenseekreis" "Berg Stadt Friedrichshafen"                                          6
    "Bodenseekreis" "Bergheim Stadt Markdorf"                                             7
    "Bodenseekreis" "Bermatingen"                                                         8
    "Bodenseekreis" "Betenbrunn Gemeinde Heiligenberg"                                    9
    "Bodenseekreis" "Beuren Gemeinde Salem"                                              10
    "Bodenseekreis" "Billafingen Gemeinde Owingen"                                       11
    "Bodenseekreis" "Billafingen Kirche Gemeinde Owingen"                                12
    "Bodenseekreis" "Bonndorf Gemeinde Überlingen"                                      13
    "Bodenseekreis" "Brochenzell Gemeinde Meckenbeuren"                                  14
    "Bodenseekreis" "Daisendorf"                                                         15
    "Bodenseekreis" "Daisendorf Kapelle"                                                 16
    "Bodenseekreis" "Deggenhausen Gemeinde Deggenhausertal"                              17
    "Bodenseekreis" "Ettenkirch Stadt Friedrichshafen"                                   18
    "Bodenseekreis" "Fischbach Stadt Friedrichshafen"                                    19
    "Bodenseekreis" "Frickingen"                                                         20
    "Bodenseekreis" "Friedrichshafen Uferanlage"                                         21
    "Bodenseekreis" "Gattnau Gemeinde Kressbronn"                                        22
    "Bodenseekreis" "Goppertsweiler Gemeinde Neukirch"                                   23
    "Bodenseekreis" "Grasbeuren"                                                         24
    "Bodenseekreis" "Hagnau"                                                             25
    "Bodenseekreis" "Hagnau am Bodensee"                                                 26
    "Bodenseekreis" "Heiligenberg"                                                       27
    "Bodenseekreis" "Hepbach Stadt Markdorf"                                             28
    "Bodenseekreis" "Hiltensweiler Gemeinde Tettnang"                                    29
    "Bodenseekreis" "Hödingen"                                                          30
    "Bodenseekreis" "Immenstaad am Bodensee"                                             31
    "Bodenseekreis" "Ittendorf"                                                          32
    "Bodenseekreis" "Jettenhausen Stadt Friedrichshafen"                                 33
    "Bodenseekreis" "Kippenhausen Gemeinde Immenstaad"                                   34
    "Bodenseekreis" "Kluftern Stadt Friedrichshafen"                                     35
    "Bodenseekreis" "Kressbronn am Bodensee"                                             36
    "Bodenseekreis" "Kressbronn am Bodensee-Tunau"                                       37
    "Bodenseekreis" "Krumbach Stadt Tettnang"                                            38
    "Bodenseekreis" "Lellwangen Gemeinde Deggenhausertal"                                39
    "Bodenseekreis" "Lellwangen Kirche Gemeinde Deggenhausertal"                         40
    "Bodenseekreis" "Leutkirch Gemeinde Salem"                                           41
    "Bodenseekreis" "Limpach Gemeinde Deggenhausertal"                                   42
    "Bodenseekreis" "Lippertsreute Stadt Überlingen"                                    43
    "Bodenseekreis" "Mariabrunn Friedhof"                                                44
    "Bodenseekreis" "Mariabrunn<"                                                        45
    "Bodenseekreis" "Markdorf Friedhof"                                                  46
    "Bodenseekreis" "Markdorf Rathausplatz"                                              47
    "Bodenseekreis" "Meersburg"                                                          48
    "Bodenseekreis" "Meersburg Chronik"                                                  49
    "Bodenseekreis" "Meersburg-Schiggendorf"                                             50
    "Bodenseekreis" "Mimmenhausen Gemeinde Salem"                                        51
    "Bodenseekreis" "Mühlhofen Einzelgräber Gemeinde Uhldingen-Mühlhofen"             52
    "Bodenseekreis" "Nesselwangen"                                                       53
    "Bodenseekreis" "Nußdorf Stadt Überlingen"                                         54
    "Bodenseekreis" "Obereisenbach Stadt Tettnang"                                       55
    "Bodenseekreis" "Oberhomberg Gemeinde Deggenhausertal"                               56
    "Bodenseekreis" "Owingen"                                                            57
    "Bodenseekreis" "Roggenbeuren Gemeinde Deggenhausertal Bodenseekreis"                58
    "Bodenseekreis" "Roggenbeuren Kirche Gemeinde Deggenhausertal"                       59
    "Bodenseekreis" "Röhrenbach Gemeinde Heiligenberg"                                  60
    "Bodenseekreis" "Schnetzenhausen Stadt Friedrichshafen"                              61
    "Bodenseekreis" "Seefelden Ortsteil von Oberuhldingen Gemeinde Uhldingen-Mühlhofen" 62
    "Bodenseekreis" "Sipplingen"                                                         63
    "Bodenseekreis" "Stefansfeld Gemeinde Salem"                                         64
    "Bodenseekreis" "Stetten"                                                            65
    "Bodenseekreis" "Taisersdorf Gemeinde Owingen"                                       66
    "Bodenseekreis" "Tettnang Bodenseekreis"                                             67
    "Bodenseekreis" "Tettnang-Tannau"                                                    68
    "Bodenseekreis" "Untersiggingen Kirche Gemeinde Deggenhausertal"                     69
    "Bodenseekreis" "Urnau Gemeinde Deggenhausertal"                                     70
    "Bodenseekreis" "Weildorf Gemeinde Salem"                                            71
    "Bodenseekreis" "Wildpoltsweiler Gemeinde Neukirch bei Tettnang"                     72
    "Bodenseekreis" "Überlingen Münster"                                               73
    "Böblingen"    "Affstätt Stadt Herrenberg"                                         74
    "Böblingen"    "Böblingen-Dagersheim"                                              75
    "Böblingen"    "Dachtel Gemeinde Aidlingen"                                         76
    "Böblingen"    "Darmsheim Stadt Sindelfingen"                                       77
    "Böblingen"    "Dätzingen Gemeinde Grafenau"                                       78
    "Böblingen"    "Döffingen Gemeinde Grafenau"                                       79
    "Böblingen"    "Flacht Gemeinde Weissach"                                           80
    "Böblingen"    "Gebersheim Stadt Leonberg"                                          81
    "Böblingen"    "Haslach Stadt Herrenberg"                                           82
    "Böblingen"    "Kayh Stadt Herreneberg"                                             83
    "Böblingen"    "Kuppingen Stadt Herrenberg"                                         84
    "Böblingen"    "Maichingen Stadt Sindelfingen"                                      85
    "Böblingen"    "Merklingen Stadt Weil der Stadt"                                    86
    "Böblingen"    "Oberjesingen Stadt Herrenberg"                                      87
    "Böblingen"    "Oberjettingen Gemeinde Jettingen"                                   88
    "Böblingen"    "Perouse Stadt Rutesheim"                                            89
    "Böblingen"    "Renningen-Malmsheim Germanuskirche"                                 90
    "Böblingen"    "Schafhausen Weil der Stadt"                                         91
    "Böblingen"    "Tailfingen Gemeinde Gäufelden"                                     92
    "Böblingen"    "Warmbronn Stadt Leonberg"                                           93
    "Böblingen"    "Weil der Stadt-Hausen"                                              94
    "Böblingen"    "Weil der Stadt-Münklingen"                                         95
    "Böblingen"    "Öschelbronn Gemeinde Gäufelden"                                   96
    end
     
    *this is part B
    save B.dta,replace
     
    *I start doing just simple merges using the two identifyers: mun region
    clear
    use A.dta,clear
    merge 1:1 region mun using B.dta
    preserve
    keep if _merge==3
    drop _merge
    save merge_1.dta,replace
    restore
    preserve
    keep if _merge==2
    drop _merge i_A
    save unmatched_B_1.dta,replace
    restore
    keep if _merge==1
    drop _merge i_B
    save unmatched_A_1.dta,replace
     
    *First try of reclink – I know with exclude the code could be shortened
     
    reclink mun region using unmatched_B_1.dta, gen(reclink_1) idm(i_A) idu(i_B) minscore(0.5) exactstr(region)
    *just to give you an example of how the output of reclink will be
     
    
    end

    The basic structure is the following: there are a finite number of regions but a large number of mun. Many of the mun observations contain unrelated strings and misspelled words.

    I stopped here after the first reclink since it would continue the same way: checking to set the minscore, checking the matches again for wrong matches (delete them by hand), adding the matches to the already matched and generating two new files: unmatched_A and unmatched_B.

    Unfortunately, the whole data set will be >10'000 observations looking like this. Moreover, there will be multiple instances when a observation in mun matches correctly despite having a different region (classification changed). To make it even harder, the observations in mun have often the structure 'municipality name 1' City/Municipality 'municipality name 2'. Two cases exist: if the municipality name 1 exists in the file A, then match it to municipality name 1; if not, then match it to municipality name 2. Multiple observations from B must be matched to just one observation in A as well.

    My general question (later follow more specific ones) is: what is the right matching procedure that saves time for this sample data set? (Please bear in mind that in full data sometimes regions might be different despite being a correct match). Should I combine -reclink-, -reclink2- and -matchit- in a certain way? Or are there other commands more suitable for my specific task?

    I hope that someone who faced similar challenges could describe how they would tackle such a task.


    It is quite time consuming to rewrite the same steps for each -reclink- (I usually run it three times: 1) exactstr(region) 2)wmatch( X Y) 3) only variable mun.
    Question 1: Is there a command that does the separation of files and appending for me in a step? Or could this done with a loop command of some sorts?

    I've also experimented with -reclink2- with the option npairs().
    Question 2: Do you think this would really speed up the matching? By using the option npairs() I end up spending much more time on deleting false matches. And I cannot set the minscore so easily.

    I tried to use -matchit- but I struggle to get it running with 2 variables as identifiers (mun and region).
    Question 3: Is there a way to perform this task as -reclink- does? Would -matchit- be preferable in my example given above? And what specification for the matching would be suitable?

    I have also used the user-written command -clrevmatch- but stopped using it since it took me much longer to compare matches. This was the case since I could not look at all the observations from file A.
    Question 4: Is there a better way to use this command? Are there alternatives? Does it make sense to use it with reclink2 (npairs(X) for X>3?)?



    Thanks you very much in advance!!

    Sincerely,
    Steven

    Last edited by Steven Gregory Hill; 31 Oct 2019, 17:05.
Working...
X