  • #16
    Yet another batch of updates. Some of them are cosmetic (like changes in what is reported in the output window), and some simply add new similarity functions (like nysiis and other hybrid phonetic algorithms).

    But I think the most significant one is the introduction of the stopwordsauto option. This option generates a list of stopwords automatically based on overall frequencies (i.e. grams per observation). In a nutshell, -matchit- will ignore these grams throughout the whole process (indexation, weights, and computation of final results), which will likely improve the efficiency of indexation, at the likely cost of ignoring some potential matches.

    As you can see below, this option is applied to the same example from the previous post. Everything is set exactly the same except for the option stopw (short for stopwordsauto). Note that the output of the diagnose option has changed slightly in order to refer more clearly to the stopwordsauto threshold (which can be set with the option swthreshold()). What was previously reported as percent is now reported as grams_per_obs. By default this threshold is set to .2, which means that grams found on average more than once every five observations are ignored. In this case, these are only ", ", "an", and "er", as reported in the third table of the diagnose output.

    Comparing the two posts, what took slightly less than 7 minutes now takes 2 minutes. However, it is also worth mentioning that results may differ, as the similarity score is not computed in exactly the same way.

    Code:
    . use medium, clear
    . matchit person_id person_name using mediumlarge.dta, idu(person_id) txtu(person_name) ti di f(1) stopw
    Matching current dataset with mediumlarge.dta
    Similarity function: bigram
     4 May 2016 10:35:58
     
    Performing preliminary diagnosis
    --------------------------------
     
    Analyzing Master file
    List of most frequent grams in Master file:
    
           grams   freq   grams_per_obs  
      1.      ,    1139          1.1390  
      2.      er    217          0.2170  
      3.      an    205          0.2050  
      4.       J    183          0.1830  
      5.       C    176          0.1760  
      6.      on    171          0.1710  
      7.      ar    167          0.1670  
      8.      or    162          0.1620  
      9.       I    149          0.1490  
     10.      en    141          0.1410  
     11.       S    124          0.1240  
     12.       M    121          0.1210  
     13.       R    114          0.1140  
     14.      ch    113          0.1130  
     15.      ra    111          0.1110  
     16.       A    110          0.1100  
     17.      in    110          0.1100  
     18.       D    109          0.1090  
     19.       L    106          0.1060  
     20.      n,    104          0.1040  
     
    Analyzing Using file
    List of most frequent grams in Using file:
    
           grams    freq   grams_per_obs  
      1.      ,    11079          1.1079  
      2.      an    2144          0.2144  
      3.      er    2115          0.2115  
      4.       J    1795          0.1795  
      5.      ar    1794          0.1794  
      6.      on    1632          0.1632  
      7.       C    1539          0.1539  
      8.       I    1448          0.1448  
      9.       M    1349          0.1349  
     10.      en    1307          0.1307  
     11.      or    1302          0.1302  
     12.       R    1260          0.1260  
     13.       A    1252          0.1252  
     14.      ic    1191          0.1191  
     15.       S    1132          0.1132  
     16.      n,    1125          0.1125  
     17.       D    1124          0.1124  
     18.      in    1085          0.1085  
     19.      ha    1025          0.1025  
     20.      ra    1024          0.1024  
    (638 real changes made)
    (1 real change made)
     
    Overall diagnosis
    Pairs being compared: Master(1000) x Using(10000) = 10000000
    Estimated maximum reduction by indexation (%):0
    (note: this is an indication, final results may differ)
     
    List of grams with greater negative impact to indexation:
    (note: values are estimated, final results may differ)
    
           grams   crosspairs   max_common_space   grams_per_obs  
      1.      ,      12618981             100.00          1.1107  
      2.      er       458955               4.59          0.2120  
      3.      an       439520               4.40          0.2135  
      4.       J       328485               3.28          0.1798  
      5.      ar       299598               3.00          0.1783  
      6.      on       279072               2.79          0.1639  
      7.       C       270864               2.71          0.1559  
      8.       I       215752               2.16          0.1452  
      9.      or       210924               2.11          0.1331  
     10.      en       184287               1.84          0.1316  
     11.       M       163229               1.63          0.1336  
     12.       R       143640               1.44          0.1249  
     13.       S       140368               1.40          0.1142  
     14.       A       137720               1.38          0.1238  
     15.       D       122516               1.23          0.1121  
     16.      in       119350               1.19          0.1086  
     17.      n,       117000               1.17          0.1117  
     18.      ra       113664               1.14          0.1032  
     19.      ch       112322               1.12          0.1006  
     20.      ic       104808               1.05          0.1163  
     
    Loading USING file: mediumlarge.dta
    Generating stopwords automatically, threshold set at:.2
    Done!
    Indexing USING file.
     4 May 2016 10:36:04-> 0%
     4 May 2016 10:36:04-> 1%
     4 May 2016 10:36:04-> 2%
     4 May 2016 10:36:04-> 3%
    ...
     4 May 2016 10:36:07-> 97%
     4 May 2016 10:36:07-> 98%
     4 May 2016 10:36:07-> 99%
     4 May 2016 10:36:07-> Done!
    Computing results
     4 May 2016 10:36:07->  Percent completed ...   (search space saved by index so far)
     4 May 2016 10:36:09->  1%                ...   (48%)
     4 May 2016 10:36:10->  2%                ...   (52%)
     4 May 2016 10:36:11->  3%                ...   (53%)
     4 May 2016 10:36:12->  4%                ...   (54%)
    ...
     4 May 2016 10:37:54->  97%               ...   (57%)
     4 May 2016 10:37:55->  98%               ...   (57%)
     4 May 2016 10:37:55->  99%               ...   (57%)
     4 May 2016 10:37:57->  Done!
    Total search space saved by index: 57%
     4 May 2016 10:37:57
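
    For reference, a minimal sketch of the same command with the threshold tightened to an illustrative value of .1 (swt() abbreviates swthreshold()):

    Code:
    . matchit person_id person_name using mediumlarge.dta, idu(person_id) txtu(person_name) ti di f(1) stopw swt(.1)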



    • #17
      Hello Julio,

      Thank you for creating this new command! I am trying to fuzzy match two datasets with company names and websites. An error message popped up saying "'weburl' found where numeric variable expected". Do you have any suggestions on how to deal with this error? Also, do you have an example for the similarity score option? Thank you!



      • #18
        Hi Fiona,

        I probably need more information on what you are trying to do and what your variables are. It seems to me that you are using a string variable as the "id" (in either the master or using file) when you need a numeric one. If "weburl" is the identifier you want to use, just do something like the code below and use the new variable as the id:

        Code:
        egen mynewid=group(weburl)


        The similarity scores are explained in the help section "Notes on the different scoring options". My practical suggestion is to use minsimple if you care less about what does not match than about what you actually match. For instance, if you do not care about the difference between "My Big Corporation" and "The Small Company, part of My Big Corporation", or between "My Great University" and "My Great University, Lab of Smaller topics", then use minsimple. If you do care, use the default.
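
        To illustrate the difference, here is a minimal sketch (the data are made up, and it assumes -matchit-'s two-column syntax as described in its help file; s() abbreviates score()):

        Code:
        clear
        input str20 name1 str50 name2
        "My Big Corporation" "The Small Company, part of My Big Corporation"
        end
        matchit name1 name2, gen(score_default)
        matchit name1 name2, gen(score_min) s(minsimple)
        list // minsimple should score this pair much higher than the default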

        Best,

        J.



        • #19
          Thank you for your response! I actually need the weburl to be a string because I am fuzzy matching URLs from two different datasets. Would it be possible to work around this error?

          Thank you!



          • #20
            You will need to create a numeric id for each weburl string variable (i.e. in each dataset). Assuming your files are named file1 and file2, your code will look something like the following:

            Code:
            use file1.dta
            egen id1=group(weburl)
            save newfile1.dta
            use file2.dta
            egen id2=group(weburl)
            save newfile2.dta
            matchit id2 weburl using newfile1.dta, idu(id1) txtu(weburl)



            • #21
              Oh! I see what you are saying. Thanks a lot!



              • #22
                Hi Julio, I was wondering if matchit is also able to determine similarities within one string variable. I have a variable with around 600,000 self-reported occupations. I would like to somehow cluster them first and then assign numbers to them.

                Example:

                Job                               Similarity of Jobs
                Starbucks                         1
                Sterbuksch                        1
                work at Starbucks                 1
                brewing coffee at Starbucks       1
                waiter Starbucks since a while    1

                University Arkansas               2
                University Durham                 2
                Eberhard University               2
                LMU Universität Deutschland       2


                Or do you know any other ado file or code to identify similar jobs? Thereafter, I would like to cluster these and create numerical values for them.

                Many thanks

                Philip






                • #23
                  Hi Philip,

                  Simply using -matchit- to match a file against itself could do the trick. Of course, fine-tuning the precise algorithm might take some thought. But here is a working example:



                  Code:
                  // just the example file
                  tempfile myfile
                  clear all
                  input str244 Job
                  "Starbucks"
                  "Sterbuksch"
                  "work at Starbucks"
                  "brewing coffee at Starbucks"
                  "waiter Starbucks since a while"
                  "University Arkansas"
                  "University Durham"
                  "Eberhard University"
                  "LMU Universität Deutschland"
                  end
                  list
                  gen id=_n
                  save `myfile', replace
                  // This is the command
                  use `myfile', clear
                  matchit id Job using `myfile', idu(id) txtu(Job) s(minsimple)
                  gsort -similscore
                  list
                  keep if similscore>=.5 // You will need to think about a threshold here
                  
                  // What follows rebuilds your data with the new group id
                  keep id*
                  ren id id2
                  gen long new_id = _n
                  reshape long id, i(new_id) j(n)
                  drop n
                  duplicates drop
                  * ssc install group_id // only if not already installed (by Robert Picard)
                  group_id new_id , matchby(id)
                  duplicates drop
                  merge 1:1 id using `myfile'
                  list



                  • #24
                    I'm adding here my slides from the 2016 Swiss Stata Users Group meeting, which contain some useful examples. My previous post is based on slide #10.

                    The slides can also be found here: http://www.stata.com/meeting/switzerland16/#proceedings



                    • #25
                      Great! Thank you very much!



                      • #26
                        I tried it with the little dataset and it worked perfectly. Now, with the 600,000 occupations, the matchit command seems to take way too long (6 hours without any result). Is that possibly too much data? Or do I need to make an adjustment?



                        • #27
                          This is where it stops:

                          Indexing USING file.
                          0%
                          20%
                          40%
                          60%
                          80%
                          Done!
                          Computing results
                          Percent completed ... (search space saved by index so far)



                          • #28
                            Hi Philip, if I understand correctly you are trying to compare 36*10^10 pairs (i.e. 600,000 × 600,000), which is a lot of computation. There are some tips that can help reduce the actual space you are searching (see the sketch after this list):

                            - First, you should be sure you are removing any duplicate jobs in the original file before using matchit.
                            - Second, you could use a different algorithm aimed at reducing comparisons. By default -matchit- uses sim(bigram) (which is the same as sim(ngram,2)), but you could use sim(ngram,3) or sim(ngram,4) instead. The longer the grams you select, the fewer pairs of observations should be compared (at the expense of taking longer to produce the index and maybe missing some potential matches).
                            - Third, you could use the stopw option, which avoids comparing pairs based on grams that are too common. The default threshold is .2 grams per observation, but you can change it by setting the option swt(). The more you reduce it, the fewer pairs are compared, at the expense of potentially missing good matches (by skipping the most frequent grams) and of inflating the similarity score (by ignoring those grams in the comparison).
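
                            Putting the three tips together, a minimal sketch (file and variable names are illustrative, and both the gram length and the threshold are starting points to tune):

                            Code:
                            * drop exact duplicates before matching, then self-match using
                            * longer grams and automatic stopwords at a stricter threshold
                            use occupations.dta, clear
                            duplicates drop Job, force
                            gen long id = _n
                            save occupations_nodup.dta, replace
                            matchit id Job using occupations_nodup.dta, idu(id) txtu(Job) ///
                                sim(ngram,3) stopw swt(.1)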

                            Best,

                            J.




                            • #29
                              Hi Julio,

                              First of all props for writing this very useful package, and also for replying to questions and comments for three years now.

                              I come to you because something very odd happened today. I had been using the package throughout the day without running into any issues. Stata crashed 3 or 4 times, but the same has happened to me on this computer many times before. I only mention it because at some point after a crash I figured the computer could use some rest, so I turned it off normally and let it rest for 15 minutes. After turning it back on and running the same code I had been using earlier today, I now get the following error:
                              Code:
                              tokenwrap not found as a similarity function. Check spelling.
                              Mata run-time error
                              r(3499);
                              This happened while trying sim(tokenwrap, "soundex_fk") and sim(nysiis_fk). I then tried it with the default and it did run. I prefer soundex_fk because it runs MUCH faster than the default, and it also makes more sense given the string variables that I am using to match these datasets.

                              I tried reinstalling the package and used adoupdate, update to make sure all my packages are updated; my Stata is also up to date (version 14.2).

                              Have you run into similar issues, or do you have an idea of what could be going wrong and what I could do to fix it?

                              Thanks in advance,



                              • #30
                                Just in case you might want to inspect my code, here it is (after renaming the variables):
                                Code:
                                tempfile master using
                                
                                use tempmastersource , clear
                                keep idmaster txtmaster 
                                duplicates drop
                                drop if txtmaster==""
                                duplicates report idmaster
                                compress
                                save `master'
                                              
                                use tempusingsource, clear
                                keep idusing txtusing
                                duplicates drop
                                drop if txtusing==""
                                duplicates report idusing
                                compress
                                save `using'
                                
                                
                                use `master', clear
                                matchit idmaster txtmaster using `using', idusing(idusing) txtusing(txtusing) sim(tokenwrap, "soundex_fk") di time stopw gen(namematch11)
                                Everything runs well, but at the last step this happens:
                                Code:
                                . matchit idmaster txtmaster using `using', idusing(idusing) txtusing(txtusing) di time sim(tokenwrap, "soundex_fk") stopw gen(namematch11)
                                Matching current dataset with C:\Users\zambrana\AppData\Local\Temp\ST_04000002.tmp
                                Similarity function: tokenwrap
                                22 Mar 2017 17:36:11
                                 
                                Performing preliminary diagnosis
                                --------------------------------
                                 
                                Analyzing Master file
                                tokenwrap not found as a similarity function. Check spelling.
                                Mata run-time error
                                r(3499);
                                
                                end of do-file
                                
                                r(3499);
                                The text variables I am using to match the tables contain names of educational institutions, and neither has missing values. There are no duplicates by either ID variable. Also, the first file has 2457 observations, and the second one 6496.
