
  • How to overcome problems in fuzzy match via matchit and reclink?

    Hi Statalisters,

    I am trying to use the fuzzy-match commands matchit and reclink to merge two datasets.

    Here is an example of the master file. I am focusing on using the third column, cnms (company name), to match the data.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float fyear str58 conm str50 cnms
    2004 "180 CONNECT INC"    "DIRECTV GROUP INC"                                 
    2005 "180 CONNECT INC"    "DIRECTV GROUP INC"                                 
    2006 "180 CONNECT INC"    "DIRECTV GROUP INC"                                 
    2007 "180 CONNECT INC"    "DIRECTV GROUP INC"                                 
    2000 "1MAGE SOFTWARE INC" "Reynolds & Reynolds  -CL A"                        
    2001 "1MAGE SOFTWARE INC" "Reynolds & Reynolds  -CL A"                        
    2002 "1MAGE SOFTWARE INC" "Reynolds & Reynolds  -CL A"                        
    2003 "1MAGE SOFTWARE INC" "Reynolds & Reynolds  -CL A"                        
    2012 "2U INC"             "Georgetown University School of Nursing and Health"
    2012 "2U INC"             "University of Southern California"                 
    end
    Here is an example of the using file. I will use cnms as the variable to match on.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str50 cnms str6 gvkey_cus str9 cusip_cus
    "20TH CENTRY"                  "012886" "90130A101"
    "20TH CENTURY FOX"             "012886" "90130A101"
    "20TH CENTY"                   "012886" "90130A101"
    "TWENTY-FIRST CENTURY FOX INC" "012886" "90130A101"
    "2122UNITED NATURAL FOODS INC" "#N/A"   ""         
    "21ST CENTY TELECOM GROUP INC" "#N/A"   ""         
    "238 TELECOM LIMITED"          "#N/A"   ""         
    "24 HOUR FITNESS"              "#N/A"   ""         
    "24 HOUR FITNESS USA, INC."    "#N/A"   ""         
    "24 HOUR FITNESS WORLD, INC."  "#N/A"   ""         
    "24/7"                         "#N/A"   ""         
    end
    Here are my reclink and matchit commands.

    Code:
    reclink cnms using final1000, idmaster(idmaster) idusing(idusing) gen(matchscore) _merge(_merge) minscore(.9)
    Code:
    matchit idmaster cnms using final1000, idusing(idusing) txtusing(cnms)
    The problem is that after matching, both commands run into a similar issue (see the following example): they seem to be confused by words that are common across firm names, such as CORP, INC, and LTD. For example, the commands give the pair "ARROW INTERNATIONAL" and "ADS INTERNATIONAL" a high match score because they are misled by the shared word "INTERNATIONAL", even though these are two distinct firms. Does anyone know how to overcome such problems in fuzzy matching? Can we assign different weights to different words within an observation?

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float fyear str58 conm str6 gvkey str10 cusip str4 sic str6 naics str50(cnms Ucnms) str8 ctype double salecs float(idmaster matchscore idusing) str6 gvkey_cus str9 cusip_cus byte _merge
    2001 "ACURA PHARMACEUTICALS INC"   "011929" "00509L802" "2834" "325412" "WATSON PHARMACEUTICALS INC" "AGIOS PHARMACEUTICALS INC" "COMPANY"  14.559 3359 .9310636 827 "#N/A"   ""          3
    2002 "ACURA PHARMACEUTICALS INC"   "011929" "00509L802" "2834" "325412" "WATSON PHARMACEUTICALS INC" "AGIOS PHARMACEUTICALS INC" "COMPANY"   6.974 3361 .9310636 827 "#N/A"   ""          3
    2003 "ACURA PHARMACEUTICALS INC"   "011929" "00509L802" "2834" "325412" "WATSON PHARMACEUTICALS INC" "AGIOS PHARMACEUTICALS INC" "COMPANY"   3.335 3362 .9310636 827 "#N/A"   ""          3
    2009 "ADTRAN INC"                  "030576" "00738A106" "3661" "334210" "AT&T INC"                   "AT&T INC"                  "COMPANY" 106.521 4057        1 116 "009899" "00206R102" 3
    2010 "ADTRAN INC"                  "030576" "00738A106" "3661" "334210" "AT&T INC"                   "AT&T INC"                  "COMPANY" 109.021 4064        1 116 "009899" "00206R102" 3
    2001 "ADV NEUROMODULATION SYS INC" "008872" "00757T101" "3845" "334510" "ARROW INTERNATIONAL"        "ADS INTERNATIONAL"         "COMPANY"     1.8 4100 .9397588 530 "#N/A"   ""          3
    2002 "ADV NEUROMODULATION SYS INC" "008872" "00757T101" "3845" "334510" "ARROW INTERNATIONAL"        "ADS INTERNATIONAL"         "COMPANY"    2.78 4102 .9397588 530 "#N/A"   ""          3
    2003 "ADV NEUROMODULATION SYS INC" "008872" "00757T101" "3845" "334510" "ARROW INTERNATIONAL"        "ADS INTERNATIONAL"         "COMPANY"    1.44 4106 .9397588 530 "#N/A"   ""          3
    end
    Thanks in advance.

  • #2
    From the output of help matchit we see
    Code:
        weights(wgtfcn) specifies an specific weighting transformation for Grams.  Default is
            no weights (i.e. each one weights 1).  Built-in options are simple, log and root.
            Using weights is particularly recommended for large datasets where some Grams like
            "Inc", "Jr", "Av" are frequently found, because if not they increase the false
            positive matches.
    which suggests an approach to try.
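
    For instance (a sketch only, reusing the matchit call and variable names from post #1; weights(log) is an arbitrary choice among the built-in options):
    Code:
    * Sketch: rerun the matchit call with a built-in weighting option so that
    * very frequent grams such as INC, CORP, and LTD count for less in the
    * similarity score.
    matchit idmaster cnms using final1000, idusing(idusing) txtusing(cnms) weights(log)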



    • #3
      Originally posted by William Lisowski
      From the output of help matchit we see
      Code:
      weights(wgtfcn) specifies an specific weighting transformation for Grams. Default is
      no weights (i.e. each one weights 1). Built-in options are simple, log and root.
      Using weights is particularly recommended for large datasets where some Grams like
      "Inc", "Jr", "Av" are frequently found, because if not they increase the false
      positive matches.
      which suggests an approach to try.
      Hi William,

      Thanks for your advice. What about -reclink-? Can that command do the same thing? I also checked -help reclink- but did not find a similar option for that command.



      • #4
        Originally posted by William Lisowski
        From the output of help matchit we see
        Code:
        weights(wgtfcn) specifies an specific weighting transformation for Grams. Default is
        no weights (i.e. each one weights 1). Built-in options are simple, log and root.
        Using weights is particularly recommended for large datasets where some Grams like
        "Inc", "Jr", "Av" are frequently found, because if not they increase the false
        positive matches.
        which suggests an approach to try.
        Can you show how to build a weighting function in detail? I am a little confused about the coding. Thanks in advance.



        • #5
          As I understand matchit, the user does not build a weighting function - the user chooses one of the available options, which are discussed in the section of help matchit output headed "Notes on the different weighting options".



          • #6
            OK, this question touches on some of the hidden traits (& treats!) of matchit. So I will probably overdo it for the sake of a wider audience (and as a big apology for not having written this up in the documentation, which I hope to do one day).

            First, the basic use:

            You can turn weights on or off by selecting one of the given weighting functions. By default they are off, which is equivalent to writing the option weights(noweights). If you want to turn them on, you simply need to pick one of the basic "canned" options: weights(simple), weights(log), or weights(root). In my experience (i.e. inductive, not deductive), adding any of these weights options makes a large and mostly positive difference in the results. At the same time, I have found little impact from picking one of them over the other two. You can guess why from what follows.

            Second, medium-to-advanced use (requires MATA coding):

            Matchit can be very dumb, but sometimes it can be a little bit clever. If you add a MATA function whose name starts with the stub weight_ (e.g. weight_mysuperwgt), then you can use it in matchit by writing the corresponding option, for instance weights(mysuperwgt). How can you do it? Well, this is how matchit codes the "canned" ones in MATA:

            Code:
            // GRAM weighting functions
            function weight_simple(real scalar gramfreq) {
              return (1/gramfreq)
             }
            function weight_root(real scalar gramfreq) {
              return (1/sqrt(gramfreq))
             }
            function weight_log(real scalar gramfreq)  {
              return (1/(log(gramfreq)+1))
             }
            Basically, you can create your own transformation of the "gramfreq" based on whatever you need. Typically, you would aim for a decreasing function, as you want grams that are too frequent in your data to be less meaningful in the final similarity score. But you may have reasons to do otherwise. Just remember that matchit will pass only one positive scalar to your function (i.e. integers >=1) and will expect you to return only one scalar. If you don't comply with these two rules, it will fail (that's why it is only a little bit clever). Also, coding a weight function that returns zeroes or negative values (even if only for some cases) may lead to unpredictable behavior (most likely a crash), so I suggest avoiding it.

            For example, you can code in MATA:
            Code:
            function weight_mysuperwgt(real scalar x)  {
               return (1)
             }
            function weight_mysuperwgt2(real scalar x)  {
              return (x+5)
             }
            function weight_mysuperwgt3(real scalar x)  {
             if (x>=1 & x<=100) return (1)
             else if  (x>100 & x<=1000) return (.1)
             else return (.0001)
             }
            In these cases, weights(mysuperwgt) will simply return the same results as specifying noweights; weights(mysuperwgt2) will offset the frequency of each gram by 5 (note that this is an increasing function); and weights(mysuperwgt3) applies a decreasing step function with three segments.

            Third, the advanced use:

            Matchit also allows you to pass your own weights, calculated however you think is best for your case. This is the option wgtfile(filename). How can you make your own weights file? The weights file is just a STATA data file (i.e. .dta) with two variables (grams and freq) listing all the grams and their respective frequencies. For instance, if you are using words as grams (i.e. sim(token)), you may want a list of prepositions and articles to have less impact on the similarity score. So you can create a STATA data file where grams like "the", "in", "under", "over", etc. get freq values of 5000 (or any large number you prefer), and you assign a low value (e.g. 1) to anything else. But how do you know what exactly is "anything else"? Well, matchit will treat any gram missing from the weights file as having a freq value of 1 (again, just a little bit clever).

            Do you want a more exhaustive list than that? No problem, just use the matchit companion: freqindex. Freqindex was created precisely to help you (and poor old matchit) build weights files on the fly (and it also helps matchit with the diagnose and stopwordsauto options). Freqindex will generate a list of gram frequencies for your file using the same gram transformations (i.e. the similmethod() option) as matchit does. You can then edit this file as much as you want and later pass it to matchit using the option wgtfile(filename).

            The following code generates a list of frequencies for your_file.dta using the same bigram transformation and then uses it for the matching with your_file2.dta:

            Code:
            use your_file.dta
            freqindex mytextvar , sim(bigram)
            gsort  -freq // not essential but you can use it to browse what are the most frequent grams in your data.
            /*
            Do here whatever you want with your data, but keep the names of the variables.
            */
            save mywgtfile.dta
            matchit id1 text1 using your_file2.dta,  wgtfile(mywgtfile.dta) w(simple) sim(bigram) idu(id2) txtu(text2)
            Note that you can combine the second and third approaches.

            I hope this helps.

            Best,

            J.



            • #7
              After reading post #6, I can confidently say that my understanding of matchit weighting functions expressed in post #5 was incorrect.

              For the task at hand, using the wgtfile() option to feed in high frequencies (and hence low weights) for words like INTERNATIONAL seems to be a useful way to proceed.

              Anybody working with matchit would do well to use the Statalist advanced search dialog box to search for posts written by Julio Raffo that contain the word matchit. There are other tutorial-style posts that are equally helpful, especially in the following threads.

              http://www.statalist.org/forums/foru...s-observations

              http://www.statalist.org/forums/foru...-e-fuzzy-match

              https://www.statalist.org/forums/for...ng-using-lists
              Last edited by William Lisowski; 18 May 2020, 13:45.



              • #8
                Hi Julio. Thanks for sharing this.

                Another question. I tried to use the diagnose option to report a preliminary analysis of the common words appearing in the two datasets; however, the result is blank. Neither dataset is that large (10,000 obs and 1,000 obs), and I am using Stata 15.1 on a Mac.

                This problem seems quite weird. The following is my code and result.

                An example of the master file.

                Code:
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input float fyear str58 conm str6 gvkey str10 cusip str50 cnms float idmaster
                2004 "180 CONNECT INC"    "160475" "682343108" "DIRECTV GROUP INC"                                   1
                2005 "180 CONNECT INC"    "160475" "682343108" "DIRECTV GROUP INC"                                   2
                2006 "180 CONNECT INC"    "160475" "682343108" "DIRECTV GROUP INC"                                   3
                2007 "180 CONNECT INC"    "160475" "682343108" "DIRECTV GROUP INC"                                   4
                2000 "1MAGE SOFTWARE INC" "005962" "45244M102" "REYNOLDS & REYNOLDS -CL A"                           5
                2001 "1MAGE SOFTWARE INC" "005962" "45244M102" "REYNOLDS & REYNOLDS -CL A"                           6
                2002 "1MAGE SOFTWARE INC" "005962" "45244M102" "REYNOLDS & REYNOLDS -CL A"                           7
                2003 "1MAGE SOFTWARE INC" "005962" "45244M102" "REYNOLDS & REYNOLDS -CL A"                           8
                2012 "2U INC"             "019881" "90214J101" "GEORGETOWN UNIVERSITY SCHOOL OF NURSING AND HEALTH"  9
                2012 "2U INC"             "019881" "90214J101" "UNIVERSITY OF SOUTHERN CALIFORNIA"                  10
                end
                An example of the using file.

                Code:
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input str50 cnms str6 gvkey_cus str9 cusip_cus float idusing
                "1DRI"                "#N/A"   ""           1
                "1INSURER LTD"        "#N/A"   ""           2
                "1MC-AGRICO"          "#N/A"   ""           3
                "1ST FL NA"           "#N/A"   ""           4
                "1ST SOLUTIONS"       "#N/A"   ""           5
                "1ST UNION FL"        "#N/A"   ""           6
                "2 WIRELESS CARRIERS" "#N/A"   ""           7
                "20/20 SPORT"         "#N/A"   ""           8
                "20TH CENTRY"         "012886" "90130A101"  9
                "20TH CENTURY FOX"    "012886" "90130A101" 10
                end

                -matchit- diagnose code and result.

                Code:
                matchit idmaster cnms using final1000, idusing(idusing) txtusing(cnms) di sim(token)
                Matching current dataset with final1000.dta
                Similarity function: token
                 
                Performing preliminary diagnosis
                --------------------------------
                 
                Analyzing Master file
                List of most frequent grams in Master file:
                 
                Analyzing Using file
                List of most frequent grams in Using file:
                (3,574 real changes made)
                (0 real changes made)
                 
                Overall diagnosis
                Pairs being compared: Master(10000) x Using(1000) = 10000000
                Estimated maximum reduction by indexation (%):97.5
                (note: this is an indication, final results may differ)
                 
                List of grams with greater negative impact to indexation:
                (note: values are estimated, final results may differ)
                 
                Loading USING file: final1000.dta
                Indexing USING file.
                0%
                20%
                40%
                60%
                80%
                Done!
                Computing results
                        Percent completed ...   (search space saved by index so far)
                        20%               ...   (96%)
                        40%               ...   (96%)
                        60%               ...   (96%)
                        80%               ...   (96%)
                        Done!
                Total search space saved by index: 96%

                Luckily, I can use the following code to list the top common words appearing in the two datasets. Still, I just do not understand why diagnose does not work with my code.

                Code:
                local myN=_N
                preserve
                freqindex cnms, sim(ngram, 3)
                gen share=freq/`myN'
                gsort -freq
                list in 1/20
                restore



                • #9
                  Hi Chris,

                  This concerns your last comment about the diagnose option. I finally found the reason, which is a minor bug in the code of the latest version (1.5.1). It does not affect versions 1.5 or earlier. The good news is that (as you detected) the bug was a cosmetic one. Matchit was working appropriately in the full matching procedure; it was just failing to output the results of the diagnose option (a capture without the noisily option, my bad).
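
                  To illustrate the kind of slip involved (an illustrative sketch only, not the actual matchit source): the capture prefix suppresses all output of the command it wraps, while capture noisily keeps the output visible and still traps the return code.
                  Code:
                  * Illustrative only, not the matchit code itself.
                  capture display "not shown: capture alone suppresses the output"
                  capture noisily display "shown: noisily restores the output"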

                  Please get the 1.5.2 version from any of these sources:

                  - the attached file
                  - Github (https://github.com/julioraffo/matchit)
                  - SSC (once updated, it may take some days)

                  Let me know if this solves the issue.

                  Best,

                  J.
                  Attached Files
                  Last edited by Julio Raffo; 19 May 2020, 07:22. Reason: spelling typo corrected



                  • #10
                    Originally posted by Julio Raffo
                    Hi Chris,

                    This concerns your last comment about the diagnose option. I finally found the reason, which is a minor bug in the code of the latest version (1.5.1). It does not affect versions 1.5 or earlier. The good news is that (as you detected) the bug was a cosmetic one. Matchit was working appropriately in the full matching procedure; it was just failing to output the results of the diagnose option (a capture without the noisily option, my bad).

                    Please get the 1.5.2 version from any of these sources:

                    - the attached file
                    - Github (https://github.com/julioraffo/matchit)
                    - SSC (once updated, it may take some days)

                    Let me know if this solves the issue.

                    Best,

                    J.

                    Thanks for the update.

                    I am also curious about the use of freqindex. When I use the following code to detect the top common grams in the dataset, there is a risk that freqindex cannot tell which are redundant common words and which are indispensable parts of firms' names. For example, "LTD", "CORP", and "INC" are more likely to be redundant common words, whereas "CO", "ION", and "ING" are more likely to be parts of firms' identifiable names. Is there any more advanced method to resolve this confusion?

                    Code:
                    local myN=_N
                    preserve
                    freqindex cnms, sim(ngram, 3)
                    gen share=freq/`myN'
                    gsort -freq
                    list in 1/20
                    restore



                    • #11
                      Freqindex can also be quite dumb, in the sense that it will do exactly what you ask it to do, but it does no "magic". You have to keep in mind that weighting (or using stopwords) is as much about losing relevant information as it is about keeping irrelevant information. The trade-off between the two is something that only the researcher working with the data can judge.
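
                      One practical way to inform that judgment (a sketch only; mymaster.dta and myusing.dta are placeholder file names, and it relies on freqindex leaving the grams and freq variables in memory, as described in #6) is to build the gram frequency index for each file and compare them side by side before deciding on weights or stopwords:
                      Code:
                      * Sketch: compare gram frequencies across the two files so you
                      * can judge which frequent grams are uninformative in both.
                      tempfile mfreq
                      use mymaster.dta, clear
                      freqindex cnms, sim(token)
                      rename freq freq_master
                      save `mfreq'
                      
                      use myusing.dta, clear
                      freqindex cnms, sim(token)
                      rename freq freq_using
                      merge 1:1 grams using `mfreq', nogenerate
                      gsort -freq_using -freq_master
                      list grams freq_master freq_using in 1/20, clean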



                      • #12
                        Originally posted by Julio Raffo

                        Please get the 1.5.2 version from any of these sources:

                        - the attached file
                        - Github (https://github.com/julioraffo/matchit)
                        - SSC (once updated, it may take some days)

                        Let me know if this solves the issue.

                        Best,

                        J.
                        Hi Julio. I ran the code in the attached file via Stata and restarted it. However, Stata still cannot produce the diagnose output, and now I get an even worse result, namely an error message.

                        Code:
                        matchit idmaster cnms using final1000, idusing(idusing) txtusing(cnms) di sim(token)
                        invalid syntax
                        r(197);
                        [P] error . . . . . . . . . . . . . . . . . . . . . . . . Return code 197
                        invalid syntax
                        This error is produced by syntax and other parsing commands when
                        there is a syntax error in the use of the command itself rather
                        than in what is being parsed.



                        • #13
                          Hi Chris,

                          This is weird but I cannot replicate your error. When I run the following code based on your examples (note that I added four entries to have something matched):
                          Code:
                          tempfile mymaster
                          tempfile myusing
                          clear
                          input str50 cnms str6 gvkey_cus str9 cusip_cus float idusing
                          "1DRI"                "#N/A"   ""           1
                          "1INSURER LTD"        "#N/A"   ""           2
                          "1MC-AGRICO"          "#N/A"   ""           3
                          "1ST FL NA"           "#N/A"   ""           4
                          "1ST SOLUTIONS"       "#N/A"   ""           5
                          "1ST UNION FL"        "#N/A"   ""           6
                          "2 WIRELESS CARRIERS" "#N/A"   ""           7
                          "20/20 SPORT"         "#N/A"   ""           8
                          "20TH CENTRY"         "012886" "90130A101"  9
                          "20TH CENTURY FOX"    "012886" "90130A101" 10
                          end
                          save `myusing'
                          clear
                          input float fyear str58 conm str6 gvkey str10 cusip str50 cnms float idmaster
                          2004 "180 CONNECT INC"    "160475" "682343108" "DIRECTV GROUP INC"                                   1
                          2005 "180 CONNECT INC"    "160475" "682343108" "DIRECTV GROUP INC"                                   2
                          2006 "180 CONNECT INC"    "160475" "682343108" "DIRECTV GROUP INC"                                   3
                          2007 "180 CONNECT INC"    "160475" "682343108" "DIRECTV GROUP INC"                                   4
                          2000 "1MAGE SOFTWARE INC" "005962" "45244M102" "REYNOLDS & REYNOLDS -CL A"                           5
                          2001 "1MAGE SOFTWARE INC" "005962" "45244M102" "REYNOLDS & REYNOLDS -CL A"                           6
                          2002 "1MAGE SOFTWARE INC" "005962" "45244M102" "REYNOLDS & REYNOLDS -CL A"                           7
                          2003 "1MAGE SOFTWARE INC" "005962" "45244M102" "REYNOLDS & REYNOLDS -CL A"                           8
                          2012 "2U INC"             "019881" "90214J101" "GEORGETOWN UNIVERSITY SCHOOL OF NURSING AND HEALTH"  9
                          2012 "2U INC"             "019881" "90214J101" "UNIVERSITY OF SOUTHERN CALIFORNIA"                  10
                          2012 "2U INC"             "019881" "90214J101" "20TH CENTRY"  9
                          2012 "2U INC"             "019881" "90214J101" "20TH CENTRY FX"                  10
                          2012 "2U INC"             "019881" "90214J101" "20TH CENTURY FOX"  9
                          2012 "2U INC"             "019881" "90214J101" "20 CENTURY FOX"                  10
                          end
                          save `mymaster'
                          matchit idmaster cnms using `myusing', idusing(idusing) txtusing(cnms) di sim(token)
                          which matchit
                          gsort - similscore
                          list
                          It works for me:
                          Code:
                          . tempfile mymaster
                          . tempfile myusing
                          . clear
                          . input str50 cnms str6 gvkey_cus str9 cusip_cus float idusing
                          /* OUTPUT OMITTED */
                          . save `myusing'
                          file C:\Users\XXXXXX\ST_00000002.tmp saved
                          . clear
                          . input float fyear str58 conm str6 gvkey str10 cusip str50 cnms float idmaster
                          /* OUTPUT OMITTED */
                          . save `mymaster'
                          file C:\Users\XXXXXXXX\ST_00000001.tmp saved
                          
                          . matchit idmaster cnms using `myusing', idusing(idusing) txtusing(cnms) di sim(token)
                          
                          Matching current dataset with C:\Users\JULIOR~1\AppData\Local\Temp\ST_00000002.tmp
                          Similarity function: token
                           
                          Performing preliminary diagnosis
                          --------------------------------
                           
                          Analyzing Master file
                          List of most frequent grams in Master file:
                          
                                      grams   freq   grams_per_obs  
                            1.     REYNOLDS      8          0.5714  
                            2.          INC      4          0.2857  
                            3.      DIRECTV      4          0.2857  
                            4.            &      4          0.2857  
                            5.          -CL      4          0.2857  
                            6.        GROUP      4          0.2857  
                            7.            A      4          0.2857  
                            8.         20TH      3          0.2143  
                            9.       CENTRY      2          0.1429  
                           10.      CENTURY      2          0.1429  
                           11.   UNIVERSITY      2          0.1429  
                           12.          FOX      2          0.1429  
                           13.           OF      2          0.1429  
                           14.       HEALTH      1          0.0714  
                           15.           20      1          0.0714  
                           16.      NURSING      1          0.0714  
                           17.   CALIFORNIA      1          0.0714  
                           18.           FX      1          0.0714  
                           19.          AND      1          0.0714  
                           20.       SCHOOL      1          0.0714  
                           
                          Analyzing Using file
                          List of most frequent grams in Using file:
                          
                                      grams   freq   grams_per_obs  
                            1.          1ST      3          0.3000  
                            2.         20TH      2          0.2000  
                            3.           FL      2          0.2000  
                            4.            2      1          0.1000  
                            5.     CARRIERS      1          0.1000  
                            6.    SOLUTIONS      1          0.1000  
                            7.      CENTURY      1          0.1000  
                            8.          FOX      1          0.1000  
                            9.     1INSURER      1          0.1000  
                           10.        20/20      1          0.1000  
                           11.         1DRI      1          0.1000  
                           12.        SPORT      1          0.1000  
                           13.     WIRELESS      1          0.1000  
                           14.       CENTRY      1          0.1000  
                           15.        UNION      1          0.1000  
                           16.          LTD      1          0.1000  
                           17.           NA      1          0.1000  
                           18.   1MC-AGRICO      1          0.1000  
                           
                          Overall diagnosis
                          Pairs being compared: Master(14) x Using(10) = 140
                          Estimated maximum reduction by indexation (%):95.71
                          (note: this is an indication, final results may differ)
                           
                          List of grams with greater negative impact to indexation:
                          (note: values are estimated, final results may differ)
                          
                                      grams   crosspairs   max_common_space   grams_per_obs  
                            1.         20TH            6               4.29          0.2083  
                            2.       CENTRY            2               1.43          0.1250  
                            3.      CENTURY            2               1.43          0.1250  
                            4.          FOX            2               1.43          0.1250  
                            5.        UNION            .               0.00               .  
                            6.      DIRECTV            .               0.00               .  
                            7.        20/20            .               0.00               .  
                            8.          1ST            .               0.00               .  
                            9.       HEALTH            .               0.00               .  
                           10.     REYNOLDS            .               0.00               .  
                           11.           NA            .               0.00               .  
                           12.     CARRIERS            .               0.00               .  
                           13.            A            .               0.00               .  
                           14.   GEORGETOWN            .               0.00               .  
                           15.         1DRI            .               0.00               .  
                           16.   UNIVERSITY            .               0.00               .  
                           17.           FL            .               0.00               .  
                           18.    SOLUTIONS            .               0.00               .  
                           19.          AND            .               0.00               .  
                           20.          -CL            .               0.00               .  
                           
                          Loading USING file: C:\Users\JULIOR~1\AppData\Local\Temp\ST_00000002.tmp
                          Indexing USING file.
                          0%
                          40%
                          60%
                          80%
                          100%
                          Done!
                          Computing results
                                  Percent completed ...   (search space saved by index so far)
                                  20%               ...   (100%)
                                  40%               ...   (100%)
                                  60%               ...   (100%)
                                  80%               ...   (97%)
                                  Done!
                          Total search space saved by index: 95%
                          
                          . which matchit
                          c:\XXXXXX\matchit.ado
                          *! 1.5.2 J.D. Raffo May 2020
                          
                          . gsort - similscore
                          
                          . list
                          
                               +----------------------------------------------------------------------+
                               | idmaster               cnms   idusing              cnms1   similsc~e |
                               |----------------------------------------------------------------------|
                            1. |        9   20TH CENTURY FOX        10   20TH CENTURY FOX           1 |
                            2. |        9        20TH CENTRY         9        20TH CENTRY           1 |
                            3. |       10     20TH CENTRY FX         9        20TH CENTRY   .81649658 |
                            4. |       10     20 CENTURY FOX        10   20TH CENTURY FOX   .66666667 |
                               +----------------------------------------------------------------------+
                          
                          end of do-file



                          • #14
                            Chris Jiao -

                            Do you have freqindex installed? I duplicated your problem by running the code from post #13 before installing it. Once I installed it, the code ran successfully.

                            Julio Raffo -

                            Here is a trace. It is very strange: the trace shows two functioning display commands, but like Chris I did not see their output in my Results window, only the syntax error message. The error message is some sort of generic r(111) message, because you captured the output of the which command, whose specific error message is shown toward the end of the help which output. I think that instead of display you may need display as error, but that is just a guess.

                            Code:
                            . set trace on
                            
                            . matchit idmaster cnms using `myusing', idusing(idusing) txtusing(cnms) di sim(token)
                              --------------------------------------------------------------------------- begin matchit ---
                              - version 12
                              - syntax varlist(min=2 max=2) [using/] [, IDUsing(name) TXTUsing(name)] [SIMilmethod(string a
                            > sis)] [Weights(string)] [WGTFile(string)] [Score(string)] [Threshold(real .5)] [Flag(real 20)
                            > ] [DIagnose] [STOPWordsauto] [SWThreshold(real .2)] [OVERride] [Generate(string)] [KEEPMata] 
                            > [TIme]
                              - cap which freqindex
                              - if (_rc!=0){
                              - di "freqindex not found."
                            freqindex not found.
                              - di "matchit requires freqindex to be installed. You can get it in SSC."
                            matchit requires freqindex to be installed. You can get it in SSC.
                              - error _rc
                            invalid syntax
                                }
                              ----------------------------------------------------------------------------- end matchit ---
                            r(111);
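
                            A minimal sketch of the change guessed at above (illustrative only; the actual matchit source may differ, and whether this resolves the missing messages is untested): route the dependency messages through display as error, keeping the return code from which.
                            Code:
                            * Illustrative sketch, not the actual matchit code.
                            capture which freqindex
                            if (_rc != 0) {
                                display as error "freqindex not found."
                                display as error "matchit requires freqindex to be installed. You can get it in SSC."
                                error _rc
                            }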



                            • #15
                              Thanks William Lisowski. By removing freqindex I replicate the same error as you. I do get to see the messages displayed, and I get the same error code as you (i.e. 111):

                              Code:
                              . matchit idmaster cnms using `myusing', idusing(idusing) txtusing(cnms) di sim(token)
                              freqindex not found.
                              matchit requires freqindex to be installed. You can get it in SSC.
                              invalid syntax
                              r(111);
                              However, Chris Jiao's output has a different error code (197) and displays no messages. Chris, can you please confirm or rule out the freqindex hypothesis?

