
  • How to overcome problems in fuzzy match via matchit and reclink?

    Hi Statalisters,

    I am trying to use the fuzzy-match commands matchit and reclink to merge two datasets.

    Here is an example of the master file. I am focusing on using the third column, cnms (company name), to match the data.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float fyear str58 conm str50 cnms
    2004 "180 CONNECT INC"    "DIRECTV GROUP INC"                                 
    2005 "180 CONNECT INC"    "DIRECTV GROUP INC"                                 
    2006 "180 CONNECT INC"    "DIRECTV GROUP INC"                                 
    2007 "180 CONNECT INC"    "DIRECTV GROUP INC"                                 
    2000 "1MAGE SOFTWARE INC" "Reynolds & Reynolds  -CL A"                        
    2001 "1MAGE SOFTWARE INC" "Reynolds & Reynolds  -CL A"                        
    2002 "1MAGE SOFTWARE INC" "Reynolds & Reynolds  -CL A"                        
    2003 "1MAGE SOFTWARE INC" "Reynolds & Reynolds  -CL A"                        
    2012 "2U INC"             "Georgetown University School of Nursing and Health"
    2012 "2U INC"             "University of Southern California"                 
    end
    Here is an example of the using file. I will use cnms as the variable to match on.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str50 cnms str6 gvkey_cus str9 cusip_cus
    "20TH CENTRY"                  "012886" "90130A101"
    "20TH CENTURY FOX"             "012886" "90130A101"
    "20TH CENTY"                   "012886" "90130A101"
    "TWENTY-FIRST CENTURY FOX INC" "012886" "90130A101"
    "2122UNITED NATURAL FOODS INC" "#N/A"   ""         
    "21ST CENTY TELECOM GROUP INC" "#N/A"   ""         
    "238 TELECOM LIMITED"          "#N/A"   ""         
    "24 HOUR FITNESS"              "#N/A"   ""         
    "24 HOUR FITNESS USA, INC."    "#N/A"   ""         
    "24 HOUR FITNESS WORLD, INC."  "#N/A"   ""         
    "24/7"                         "#N/A"   ""         
    end
    Here are my reclink and matchit commands.

    Code:
    reclink cnms using final1000, idmaster(idmaster) idusing(idusing) gen(matchscore) _merge(_merge) minscore(.9)
    Code:
    matchit idmaster cnms using final1000, idusing(idusing) txtusing(cnms)
    The problem is that after matching, both commands run into a similar issue (see the following example): they seem to be confused by words that are common across firm names, such as CORP, INC, and LTD. For example, the commands give the pair "ARROW INTERNATIONAL" and "ADS INTERNATIONAL" a high match score because they are misled by the shared word "INTERNATIONAL", even though these are two distinct firms. Does anyone know how to overcome such problems in fuzzy matching? Can we assign different weights to different words within an observation?

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float fyear str58 conm str6 gvkey str10 cusip str4 sic str6 naics str50(cnms Ucnms) str8 ctype double salecs float(idmaster matchscore idusing) str6 gvkey_cus str9 cusip_cus byte _merge
    2001 "ACURA PHARMACEUTICALS INC"   "011929" "00509L802" "2834" "325412" "WATSON PHARMACEUTICALS INC" "AGIOS PHARMACEUTICALS INC" "COMPANY"  14.559 3359 .9310636 827 "#N/A"   ""          3
    2002 "ACURA PHARMACEUTICALS INC"   "011929" "00509L802" "2834" "325412" "WATSON PHARMACEUTICALS INC" "AGIOS PHARMACEUTICALS INC" "COMPANY"   6.974 3361 .9310636 827 "#N/A"   ""          3
    2003 "ACURA PHARMACEUTICALS INC"   "011929" "00509L802" "2834" "325412" "WATSON PHARMACEUTICALS INC" "AGIOS PHARMACEUTICALS INC" "COMPANY"   3.335 3362 .9310636 827 "#N/A"   ""          3
    2009 "ADTRAN INC"                  "030576" "00738A106" "3661" "334210" "AT&T INC"                   "AT&T INC"                  "COMPANY" 106.521 4057        1 116 "009899" "00206R102" 3
    2010 "ADTRAN INC"                  "030576" "00738A106" "3661" "334210" "AT&T INC"                   "AT&T INC"                  "COMPANY" 109.021 4064        1 116 "009899" "00206R102" 3
    2001 "ADV NEUROMODULATION SYS INC" "008872" "00757T101" "3845" "334510" "ARROW INTERNATIONAL"        "ADS INTERNATIONAL"         "COMPANY"     1.8 4100 .9397588 530 "#N/A"   ""          3
    2002 "ADV NEUROMODULATION SYS INC" "008872" "00757T101" "3845" "334510" "ARROW INTERNATIONAL"        "ADS INTERNATIONAL"         "COMPANY"    2.78 4102 .9397588 530 "#N/A"   ""          3
    2003 "ADV NEUROMODULATION SYS INC" "008872" "00757T101" "3845" "334510" "ARROW INTERNATIONAL"        "ADS INTERNATIONAL"         "COMPANY"    1.44 4106 .9397588 530 "#N/A"   ""          3
    end
    Thanks in advance.

  • #2
    From the output of help matchit we see
    Code:
        weights(wgtfcn) specifies an specific weighting transformation for Grams.  Default is
            no weights (i.e. each one weights 1).  Built-in options are simple, log and root.
            Using weights is particularly recommended for large datasets where some Grams like
            "Inc", "Jr", "Av" are frequently found, because if not they increase the false
            positive matches.
    which suggests an approach to try.
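
    For instance (a sketch only, reusing the matchit call and variable names from post #1; weights(log) is an arbitrary choice among the built-in options):
    Code:
    * Sketch: rerun the matchit call with a built-in weighting option so that
    * very frequent grams such as INC, CORP, and LTD count for less in the
    * similarity score.
    matchit idmaster cnms using final1000, idusing(idusing) txtusing(cnms) weights(log)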



    • #3
      Originally posted by William Lisowski
      From the output of help matchit we see
      Code:
      weights(wgtfcn) specifies an specific weighting transformation for Grams. Default is
      no weights (i.e. each one weights 1). Built-in options are simple, log and root.
      Using weights is particularly recommended for large datasets where some Grams like
      "Inc", "Jr", "Av" are frequently found, because if not they increase the false
      positive matches.
      which suggests an approach to try.
      Hi William,

      Thanks for your advice. What about -reclink-? Can that command do the same thing? I also checked -help reclink- but did not find a similar option for that command.



      • #4
        Originally posted by William Lisowski
        From the output of help matchit we see
        Code:
        weights(wgtfcn) specifies an specific weighting transformation for Grams. Default is
        no weights (i.e. each one weights 1). Built-in options are simple, log and root.
        Using weights is particularly recommended for large datasets where some Grams like
        "Inc", "Jr", "Av" are frequently found, because if not they increase the false
        positive matches.
        which suggests an approach to try.
        Can you show how to build a weighting function in detail? I am a little confused about the coding. Thanks in advance.



        • #5
          As I understand matchit, the user does not build a weighting function - the user chooses one of the available options, which are discussed in the section of help matchit output headed "Notes on the different weighting options".



          • #6
            OK, this question touches on some of the hidden traits (& treats!) of matchit. So I will probably overdo it for the sake of a wider audience (and as a big apology for not having written this up in the documentation, which I hope to do one day).

            First, the basic use:

            You can turn weights on or off by selecting one of the given weighting functions. By default they are off, which is equivalent to writing the option weights(noweights). If you want to turn them on, you simply need to pick one of the basic "canned" options: weights(simple), weights(log), or weights(root). In my experience (i.e. inductive, not deductive), adding any of these weights options makes a large and mostly positive difference in the results. At the same time, I have found little impact from picking one of them over the other two. You can guess why from what follows.

            Second, medium-to-advanced use (requires MATA coding):

            Matchit can be very dumb, but sometimes it can be a little bit clever. If you add a MATA function whose name starts with the stub weight_ (e.g. weight_mysuperwgt), then you can use it in matchit by writing the corresponding option, for instance weights(mysuperwgt). How can you do it? Well, this is how matchit codes the "canned" ones in MATA:

            Code:
            // GRAM weighting functions
            function weight_simple(real scalar gramfreq) {
              return (1/gramfreq)
             }
            function weight_root(real scalar gramfreq) {
              return (1/sqrt(gramfreq))
             }
            function weight_log(real scalar gramfreq)  {
              return (1/(log(gramfreq)+1))
             }
            Basically, you can create your own transformation of the "gramfreq" based on whatever you need. Typically, you would aim for a decreasing function, as you want grams that are too frequent in your data to be less meaningful in the final similarity score. But you may have reasons to do otherwise. Just remember that matchit will pass only one positive scalar to your function (i.e. integers >=1) and will expect you to return only one scalar. If you don't comply with these two rules, it will fail (that's why it is only a little bit clever). Also, coding a weight function that returns zeroes or negative values (even if only for some cases) may lead to unpredictable behavior (most likely a crash), so I suggest avoiding it.

            For example, you can code in MATA:
            Code:
            function weight_mysuperwgt(real scalar x)  {
               return (1)
             }
            function weight_mysuperwgt2(real scalar x)  {
              return (x+5)
             }
            function weight_mysuperwgt3(real scalar x)  {
             if (x>=1 & x<=100) return (1)
             else if  (x>100 & x<=1000) return (.1)
             else return (.0001)
             }
            In these cases, weights(mysuperwgt) will simply return the same results as specifying noweights; weights(mysuperwgt2) will offset the frequency of each gram by 5 (note that this is an increasing function); and weights(mysuperwgt3) applies a decreasing step function with three segments.

            Third, the advanced use:

            Matchit also allows you to pass your own weights, calculated however you think is best for your case. This is the option wgtfile(filename). How can you make your own weights file? The weights file is just a STATA data file (i.e. .dta) with two variables (grams and freq) listing all the grams and their respective frequencies. For instance, if you are using words as grams (i.e. sim(token)), you may want a list of prepositions and articles to have less impact on the similarity score. So you can create a STATA data file where grams like "the", "in", "under", "over", etc. get freq values of 5000 (or any large number you prefer), and you assign a low value (e.g. 1) to anything else. But how do you know what exactly is "anything else"? Well, matchit will treat any gram missing from the weights file as having a freq value of 1 (again, just a little bit clever).

            Do you want a more exhaustive list than that? No problem, just use the matchit companion: freqindex. Freqindex was created precisely to help you (and poor old matchit) build weights files on the fly (and it also helps matchit with the diagnose and stopwordsauto options). Freqindex will generate a list of gram frequencies for your file using the same gram transformations (i.e. the similmethod() option) as matchit does. You can then edit this file as much as you want and later pass it to matchit using the option wgtfile(filename).

            The following code generates a list of frequencies for your_file.dta using the same bigram transformation and then uses it for the matching with your_file2.dta:

            Code:
            use your_file.dta
            freqindex mytextvar , sim(bigram)
            gsort  -freq // not essential but you can use it to browse what are the most frequent grams in your data.
            /*
            Do here whatever you want with your data, but keep the names of the variables.
            */
            save mywgtfile.dta
            matchit id1 text1 using your_file2.dta,  wgtfile(mywgtfile.dta) w(simple) sim(bigram) idu(id2) txtu(text2)
            Note that you can combine the second and third approaches.

            I hope this helps.

            Best,

            J.



            • #7
              After reading post #6, I can confidently say that my understanding of matchit weighting functions expressed in post #5 was incorrect.

              For the task at hand, using the wgtfile() option to feed in high frequencies (and hence low weights) for words like INTERNATIONAL seems to be a useful way to proceed.

              Anybody working with matchit would do well to use the Statalist advanced search dialog box to search for posts written by Julio Raffo that contain the word matchit. There are other tutorial-style posts that are equally helpful, especially in the following threads.

              http://www.statalist.org/forums/foru...s-observations

              http://www.statalist.org/forums/foru...-e-fuzzy-match

              https://www.statalist.org/forums/for...ng-using-lists
              Last edited by William Lisowski; 18 May 2020, 13:45.



              • #8
                Hi Julio. Thanks for sharing this.

                Another question. I tried to use the diagnose option to report a preliminary analysis of the common words appearing in the two datasets; however, the result is blank. Neither dataset is that large (10,000 obs and 1,000 obs), and I am using Stata 15.1 on a Mac.

                This problem seems quite weird. The following is my code and result.

                An example of the master file.

                Code:
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input float fyear str58 conm str6 gvkey str10 cusip str50 cnms float idmaster
                2004 "180 CONNECT INC"    "160475" "682343108" "DIRECTV GROUP INC"                                   1
                2005 "180 CONNECT INC"    "160475" "682343108" "DIRECTV GROUP INC"                                   2
                2006 "180 CONNECT INC"    "160475" "682343108" "DIRECTV GROUP INC"                                   3
                2007 "180 CONNECT INC"    "160475" "682343108" "DIRECTV GROUP INC"                                   4
                2000 "1MAGE SOFTWARE INC" "005962" "45244M102" "REYNOLDS & REYNOLDS -CL A"                           5
                2001 "1MAGE SOFTWARE INC" "005962" "45244M102" "REYNOLDS & REYNOLDS -CL A"                           6
                2002 "1MAGE SOFTWARE INC" "005962" "45244M102" "REYNOLDS & REYNOLDS -CL A"                           7
                2003 "1MAGE SOFTWARE INC" "005962" "45244M102" "REYNOLDS & REYNOLDS -CL A"                           8
                2012 "2U INC"             "019881" "90214J101" "GEORGETOWN UNIVERSITY SCHOOL OF NURSING AND HEALTH"  9
                2012 "2U INC"             "019881" "90214J101" "UNIVERSITY OF SOUTHERN CALIFORNIA"                  10
                end
                An example of the using file.

                Code:
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input str50 cnms str6 gvkey_cus str9 cusip_cus float idusing
                "1DRI"                "#N/A"   ""           1
                "1INSURER LTD"        "#N/A"   ""           2
                "1MC-AGRICO"          "#N/A"   ""           3
                "1ST FL NA"           "#N/A"   ""           4
                "1ST SOLUTIONS"       "#N/A"   ""           5
                "1ST UNION FL"        "#N/A"   ""           6
                "2 WIRELESS CARRIERS" "#N/A"   ""           7
                "20/20 SPORT"         "#N/A"   ""           8
                "20TH CENTRY"         "012886" "90130A101"  9
                "20TH CENTURY FOX"    "012886" "90130A101" 10
                end

                -matchit- diagnose code and result.

                Code:
                matchit idmaster cnms using final1000, idusing(idusing) txtusing(cnms) di sim(token)
                Matching current dataset with final1000.dta
                Similarity function: token
                 
                Performing preliminary diagnosis
                --------------------------------
                 
                Analyzing Master file
                List of most frequent grams in Master file:
                 
                Analyzing Using file
                List of most frequent grams in Using file:
                (3,574 real changes made)
                (0 real changes made)
                 
                Overall diagnosis
                Pairs being compared: Master(10000) x Using(1000) = 10000000
                Estimated maximum reduction by indexation (%):97.5
                (note: this is an indication, final results may differ)
                 
                List of grams with greater negative impact to indexation:
                (note: values are estimated, final results may differ)
                 
                Loading USING file: final1000.dta
                Indexing USING file.
                0%
                20%
                40%
                60%
                80%
                Done!
                Computing results
                        Percent completed ...   (search space saved by index so far)
                        20%               ...   (96%)
                        40%               ...   (96%)
                        60%               ...   (96%)
                        80%               ...   (96%)
                        Done!
                Total search space saved by index: 96%

                Luckily, I can use the following code to list the top common words appearing in the two datasets. Still, I just do not understand why diagnose does not work with my code.

                Code:
                local myN=_N
                preserve
                freqindex cnms, sim(ngram, 3)
                gen share=freq/`myN'
                gsort -freq
                list in 1/20
                restore



                • #9
                  Hi Chris,

                  This concerns your last comment about the diagnose option. I finally found the reason, which is a minor bug in the code of the latest version (1.5.1). It does not affect versions 1.5 or earlier. The good news is that (as you detected) the bug was a cosmetic one. Matchit was working appropriately in the full matching procedure; it was just failing to output the results of the diagnose option (a capture without the noisily option, my bad).
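
                  To illustrate the kind of slip involved (an illustrative sketch only, not the actual matchit source): the capture prefix suppresses all output of the command it wraps, while capture noisily keeps the output visible and still traps the return code.
                  Code:
                  * Illustrative only, not the matchit code itself.
                  capture display "not shown: capture alone suppresses the output"
                  capture noisily display "shown: noisily restores the output"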

                  Please get the 1.5.2 version from any of these sources:

                  - the attached file
                  - Github (https://github.com/julioraffo/matchit)
                  - SSC (once updated, it may take some days)

                  Let me know if this solves the issue.

                  Best,

                  J.
                  Attached Files
                  Last edited by Julio Raffo; 19 May 2020, 07:22. Reason: spelling typo corrected



                  • #10
                    Originally posted by Julio Raffo
                    Hi Chris,

                    This concerns your last comment about the diagnose option. I finally found the reason, which is a minor bug in the code of the latest version (1.5.1). It does not affect versions 1.5 or earlier. The good news is that (as you detected) the bug was a cosmetic one. Matchit was working appropriately in the full matching procedure; it was just failing to output the results of the diagnose option (a capture without the noisily option, my bad).

                    Please get the 1.5.2 version from any of these sources:

                    - the attached file
                    - Github (https://github.com/julioraffo/matchit)
                    - SSC (once updated, it may take some days)

                    Let me know if this solves the issue.

                    Best,

                    J.

                    Thanks for the update.

                    I am also curious about the use of freqindex. When I use the following code to detect the top common grams in the dataset, there is a risk that freqindex cannot tell which are redundant common words and which are indispensable parts of firms' names. For example, "LTD", "CORP", and "INC" are more likely to be redundant common words, whereas "CO", "ION", and "ING" are more likely to be parts of firms' identifiable names. Is there any more advanced method to resolve this confusion?

                    Code:
                    local myN=_N
                    preserve
                    freqindex cnms, sim(ngram, 3)
                    gen share=freq/`myN'
                    gsort -freq
                    list in 1/20
                    restore



                    • #11
                      Freqindex can also be quite dumb, in the sense that it will do exactly what you ask it to do, but it does no "magic". You have to keep in mind that weighting (or using stopwords) is as much about losing relevant information as it is about keeping irrelevant information. The trade-off between the two is something that only the researcher working with the data can judge.
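
                      One practical way to inform that judgment (a sketch only; mymaster.dta and myusing.dta are placeholder file names, and it relies on freqindex leaving the grams and freq variables in memory, as described in #6) is to build the gram frequency index for each file and compare them side by side before deciding on weights or stopwords:
                      Code:
                      * Sketch: compare gram frequencies across the two files so you
                      * can judge which frequent grams are uninformative in both.
                      tempfile mfreq
                      use mymaster.dta, clear
                      freqindex cnms, sim(token)
                      rename freq freq_master
                      save `mfreq'
                      
                      use myusing.dta, clear
                      freqindex cnms, sim(token)
                      rename freq freq_using
                      merge 1:1 grams using `mfreq', nogenerate
                      gsort -freq_using -freq_master
                      list grams freq_master freq_using in 1/20, clean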



                      • #12
                        Originally posted by Julio Raffo

                        Please get the 1.5.2 version from any of these sources:

                        - the attached file
                        - Github (https://github.com/julioraffo/matchit)
                        - SSC (once updated, it may take some days)

                        Let me know if this solves the issue.

                        Best,

                        J.
                        Hi Julio. I ran the code in the attached file via Stata and restarted it. However, Stata still cannot produce the diagnose output, and now I get an even worse result, namely an error message.

                        Code:
                        matchit idmaster cnms using final1000, idusing(idusing) txtusing(cnms) di sim(token)
                        invalid syntax
                        r(197);
                        [P] error . . . . . . . . . . . . . . . . . . . . . . . . Return code 197
                        invalid syntax
                        This error is produced by syntax and other parsing commands when
                        there is a syntax error in the use of the command itself rather
                        than in what is being parsed.



                        • #13
                          Hi Chris,

                          This is weird but I cannot replicate your error. When I run the following code based on your examples (note that I added four entries to have something matched):
                          Code:
                          tempfile mymaster
                          tempfile myusing
                          clear
                          input str50 cnms str6 gvkey_cus str9 cusip_cus float idusing
                          "1DRI"                "#N/A"   ""           1
                          "1INSURER LTD"        "#N/A"   ""           2
                          "1MC-AGRICO"          "#N/A"   ""           3
                          "1ST FL NA"           "#N/A"   ""           4
                          "1ST SOLUTIONS"       "#N/A"   ""           5
                          "1ST UNION FL"        "#N/A"   ""           6
                          "2 WIRELESS CARRIERS" "#N/A"   ""           7
                          "20/20 SPORT"         "#N/A"   ""           8
                          "20TH CENTRY"         "012886" "90130A101"  9
                          "20TH CENTURY FOX"    "012886" "90130A101" 10
                          end
                          save `myusing'
                          clear
                          input float fyear str58 conm str6 gvkey str10 cusip str50 cnms float idmaster
                          2004 "180 CONNECT INC"    "160475" "682343108" "DIRECTV GROUP INC"                                   1
                          2005 "180 CONNECT INC"    "160475" "682343108" "DIRECTV GROUP INC"                                   2
                          2006 "180 CONNECT INC"    "160475" "682343108" "DIRECTV GROUP INC"                                   3
                          2007 "180 CONNECT INC"    "160475" "682343108" "DIRECTV GROUP INC"                                   4
                          2000 "1MAGE SOFTWARE INC" "005962" "45244M102" "REYNOLDS & REYNOLDS -CL A"                           5
                          2001 "1MAGE SOFTWARE INC" "005962" "45244M102" "REYNOLDS & REYNOLDS -CL A"                           6
                          2002 "1MAGE SOFTWARE INC" "005962" "45244M102" "REYNOLDS & REYNOLDS -CL A"                           7
                          2003 "1MAGE SOFTWARE INC" "005962" "45244M102" "REYNOLDS & REYNOLDS -CL A"                           8
                          2012 "2U INC"             "019881" "90214J101" "GEORGETOWN UNIVERSITY SCHOOL OF NURSING AND HEALTH"  9
                          2012 "2U INC"             "019881" "90214J101" "UNIVERSITY OF SOUTHERN CALIFORNIA"                  10
                          2012 "2U INC"             "019881" "90214J101" "20TH CENTRY"  9
                          2012 "2U INC"             "019881" "90214J101" "20TH CENTRY FX"                  10
                          2012 "2U INC"             "019881" "90214J101" "20TH CENTURY FOX"  9
                          2012 "2U INC"             "019881" "90214J101" "20 CENTURY FOX"                  10
                          end
                          save `mymaster'
                          matchit idmaster cnms using `myusing', idusing(idusing) txtusing(cnms) di sim(token)
                          which matchit
                          gsort - similscore
                          list
                          It works for me:
                          Code:
                          . tempfile mymaster
                          . tempfile myusing
                          . clear
                          . input str50 cnms str6 gvkey_cus str9 cusip_cus float idusing
                          /* OUTPUT OMITTED */
                          . save `myusing'
                          file C:\Users\XXXXXX\ST_00000002.tmp saved
                          . clear
                          . input float fyear str58 conm str6 gvkey str10 cusip str50 cnms float idmaster
                          /* OUTPUT OMITTED */
                          . save `mymaster'
                          file C:\Users\XXXXXXXX\ST_00000001.tmp saved
                          
                          . matchit idmaster cnms using `myusing', idusing(idusing) txtusing(cnms) di sim(token)
                          
                          Matching current dataset with C:\Users\JULIOR~1\AppData\Local\Temp\ST_00000002.tmp
                          Similarity function: token
                           
                          Performing preliminary diagnosis
                          --------------------------------
                           
                          Analyzing Master file
                          List of most frequent grams in Master file:
                          
                                      grams   freq   grams_per_obs  
                            1.     REYNOLDS      8          0.5714  
                            2.          INC      4          0.2857  
                            3.      DIRECTV      4          0.2857  
                            4.            &      4          0.2857  
                            5.          -CL      4          0.2857  
                            6.        GROUP      4          0.2857  
                            7.            A      4          0.2857  
                            8.         20TH      3          0.2143  
                            9.       CENTRY      2          0.1429  
                           10.      CENTURY      2          0.1429  
                           11.   UNIVERSITY      2          0.1429  
                           12.          FOX      2          0.1429  
                           13.           OF      2          0.1429  
                           14.       HEALTH      1          0.0714  
                           15.           20      1          0.0714  
                           16.      NURSING      1          0.0714  
                           17.   CALIFORNIA      1          0.0714  
                           18.           FX      1          0.0714  
                           19.          AND      1          0.0714  
                           20.       SCHOOL      1          0.0714  
                           
                          Analyzing Using file
                          List of most frequent grams in Using file:
                          
                                      grams   freq   grams_per_obs  
                            1.          1ST      3          0.3000  
                            2.         20TH      2          0.2000  
                            3.           FL      2          0.2000  
                            4.            2      1          0.1000  
                            5.     CARRIERS      1          0.1000  
                            6.    SOLUTIONS      1          0.1000  
                            7.      CENTURY      1          0.1000  
                            8.          FOX      1          0.1000  
                            9.     1INSURER      1          0.1000  
                           10.        20/20      1          0.1000  
                           11.         1DRI      1          0.1000  
                           12.        SPORT      1          0.1000  
                           13.     WIRELESS      1          0.1000  
                           14.       CENTRY      1          0.1000  
                           15.        UNION      1          0.1000  
                           16.          LTD      1          0.1000  
                           17.           NA      1          0.1000  
                           18.   1MC-AGRICO      1          0.1000  
                           
                          Overall diagnosis
                          Pairs being compared: Master(14) x Using(10) = 140
                          Estimated maximum reduction by indexation (%):95.71
                          (note: this is an indication, final results may differ)
                           
                          List of grams with greater negative impact to indexation:
                          (note: values are estimated, final results may differ)
                          
                                      grams   crosspairs   max_common_space   grams_per_obs  
                            1.         20TH            6               4.29          0.2083  
                            2.       CENTRY            2               1.43          0.1250  
                            3.      CENTURY            2               1.43          0.1250  
                            4.          FOX            2               1.43          0.1250  
                            5.        UNION            .               0.00               .  
                            6.      DIRECTV            .               0.00               .  
                            7.        20/20            .               0.00               .  
                            8.          1ST            .               0.00               .  
                            9.       HEALTH            .               0.00               .  
                           10.     REYNOLDS            .               0.00               .  
                           11.           NA            .               0.00               .  
                           12.     CARRIERS            .               0.00               .  
                           13.            A            .               0.00               .  
                           14.   GEORGETOWN            .               0.00               .  
                           15.         1DRI            .               0.00               .  
                           16.   UNIVERSITY            .               0.00               .  
                           17.           FL            .               0.00               .  
                           18.    SOLUTIONS            .               0.00               .  
                           19.          AND            .               0.00               .  
                           20.          -CL            .               0.00               .  
                           
                          Loading USING file: C:\Users\JULIOR~1\AppData\Local\Temp\ST_00000002.tmp
                          Indexing USING file.
                          0%
                          40%
                          60%
                          80%
                          100%
                          Done!
                          Computing results
                                  Percent completed ...   (search space saved by index so far)
                                  20%               ...   (100%)
                                  40%               ...   (100%)
                                  60%               ...   (100%)
                                  80%               ...   (97%)
                                  Done!
                          Total search space saved by index: 95%
                          
                          . which matchit
                          c:\XXXXXX\matchit.ado
                          *! 1.5.2 J.D. Raffo May 2020
                          
                          . gsort - similscore
                          
                          . list
                          
                               +----------------------------------------------------------------------+
                               | idmaster               cnms   idusing              cnms1   similsc~e |
                               |----------------------------------------------------------------------|
                            1. |        9   20TH CENTURY FOX        10   20TH CENTURY FOX           1 |
                            2. |        9        20TH CENTRY         9        20TH CENTRY           1 |
                            3. |       10     20TH CENTRY FX         9        20TH CENTRY   .81649658 |
                            4. |       10     20 CENTURY FOX        10   20TH CENTURY FOX   .66666667 |
                               +----------------------------------------------------------------------+
                          
                          end of do-file



                          • #14
                            Chris Jiao -

                            Do you have freqindex installed? I duplicated your problem by running the code from post #13 before installing it. Once I installed it, the code ran successfully.

                            Julio Raffo -

                            Here is a trace. It is very strange: the trace shows two functioning display commands, but like Chris I did not see their output in my Results window, only the syntax error message. The error message is some sort of generic r(111) message, because you captured the output of the which command, whose specific error message is shown toward the end of the help which output. I think that instead of display you may need display as error, but that is just a guess.

                            Code:
                            . set trace on
                            
                            . matchit idmaster cnms using `myusing', idusing(idusing) txtusing(cnms) di sim(token)
                              --------------------------------------------------------------------------- begin matchit ---
                              - version 12
                              - syntax varlist(min=2 max=2) [using/] [, IDUsing(name) TXTUsing(name)] [SIMilmethod(string a
                            > sis)] [Weights(string)] [WGTFile(string)] [Score(string)] [Threshold(real .5)] [Flag(real 20)
                            > ] [DIagnose] [STOPWordsauto] [SWThreshold(real .2)] [OVERride] [Generate(string)] [KEEPMata] 
                            > [TIme]
                              - cap which freqindex
                              - if (_rc!=0){
                              - di "freqindex not found."
                            freqindex not found.
                              - di "matchit requires freqindex to be installed. You can get it in SSC."
                            matchit requires freqindex to be installed. You can get it in SSC.
                              - error _rc
                            invalid syntax
                                }
                              ----------------------------------------------------------------------------- end matchit ---
                            r(111);
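
                            A minimal sketch of the change guessed at above (illustrative only; the actual matchit source may differ, and whether this resolves the missing messages is untested): route the dependency messages through display as error, keeping the return code from which.
                            Code:
                            * Illustrative sketch, not the actual matchit code.
                            capture which freqindex
                            if (_rc != 0) {
                                display as error "freqindex not found."
                                display as error "matchit requires freqindex to be installed. You can get it in SSC."
                                error _rc
                            }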



                            • #15
                              Thanks William Lisowski. By removing freqindex I replicate the same error as you. I do get to see the messages displayed, and I get the same error code as you (i.e. 111):

                              Code:
                              . matchit idmaster cnms using `myusing', idusing(idusing) txtusing(cnms) di sim(token)
                              freqindex not found.
                              matchit requires freqindex to be installed. You can get it in SSC.
                              invalid syntax
                              r(111);
                              However, Chris Jiao's output has a different error code (197) and displays no messages. Chris, can you please confirm or rule out the freqindex hypothesis?

