Yet another batch of updates. Some are cosmetic (such as changes to what is reported in the output window), and others simply add new similarity functions (such as nysiis and other hybrid phonetic algorithms).
But I think the most significant one is the introduction of the stopwordsauto option. This option automatically generates a list of stopwords based on overall gram frequencies (i.e. grams per observation). In a nutshell, -matchit- ignores the grams on that list throughout the whole process (indexation, weights and computation of final results), which will likely make indexation more efficient, at the equally likely risk of missing some potential matches.
As you can see below, this option is applied to the same example from the previous post. Everything is set exactly the same except for the option stopw (short for stopwordsauto). Note that the output of the diagnose option has changed slightly to refer more clearly to the stopwordsauto threshold (which can be set with the option swthreshold()): what was previously reported as percent is now reported as grams_per_obs. By default this threshold is set to .2, which means that grams found on average more than once every five observations are ignored. In this case, these are only ", ", "an", and "er", as reported in the third table of the diagnose output.
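For anyone who wants to see the arithmetic behind the threshold, here is a rough Python sketch. It is not the actual -matchit- code. It assumes grams are counted with repetition and divided by the number of observations, which is consistent with the diagnose output (1139 occurrences over 1000 master observations gives 1.1390), and that the threshold is compared against the figure pooled over both files, as in the third table ((217 + 2115) / 11000 = 0.2120 for "er"). The function names, the toy data and the strict "greater than" comparison are assumptions of mine.

```python
from collections import Counter

def bigrams(text):
    """Return every 2-character gram in text, repeats included."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

def grams_per_obs(*files):
    """Pooled gram frequency divided by the pooled number of observations."""
    counts = Counter()
    n_obs = 0
    for names in files:
        n_obs += len(names)
        for name in names:
            counts.update(bigrams(name))
    return {gram: freq / n_obs for gram, freq in counts.items()}

def auto_stopwords(rates, threshold=0.2):
    """Grams whose rate is above the threshold (strict '>' is an assumption)."""
    return {gram for gram, rate in rates.items() if rate > threshold}

# Toy data, made up for illustration only
master = ["Smith, John", "Doe, Ann"]
using = ["Smith, Jon", "Roe, Sam", "Poe, Jan"]
rates = grams_per_obs(master, using)
print(sorted(auto_stopwords(rates, threshold=0.5)))
```

With only five toy names almost every gram repeats, so the example uses a threshold of 0.5 instead of the default .2 to keep the resulting list short.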
Comparing the two posts, what took slightly less than 7min now takes 2min. However, it is also worth mentioning that results may differ, because the ignored grams no longer enter the similarity score, so it is not computed in exactly the same way.
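To illustrate why ignoring grams both shrinks the search space and changes the scores, here is a toy Python sketch. It is not how -matchit- is implemented: the inverted index, the plain cosine score over bigram counts, and all names and data are assumptions made for illustration only.

```python
from collections import Counter
from math import sqrt

def bigrams(text, stopwords=frozenset()):
    """2-character grams of text, skipping any gram listed in stopwords."""
    grams = (text[i:i + 2] for i in range(len(text) - 1))
    return [g for g in grams if g not in stopwords]

def cosine(a, b, stopwords=frozenset()):
    """Cosine similarity over bigram counts (a stand-in score, not matchit's)."""
    ca, cb = Counter(bigrams(a, stopwords)), Counter(bigrams(b, stopwords))
    dot = sum(ca[g] * cb[g] for g in ca.keys() & cb.keys())
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def candidate_pairs(master, using, stopwords=frozenset()):
    """Pairs (i, j) that share at least one indexed gram."""
    index = {}
    for j, name in enumerate(using):
        for g in set(bigrams(name, stopwords)):
            index.setdefault(g, set()).add(j)
    pairs = set()
    for i, name in enumerate(master):
        for g in set(bigrams(name, stopwords)):
            pairs.update((i, j) for j in index.get(g, ()))
    return pairs

# Toy data, made up for illustration only
master = ["Smith, John"]
using = ["Smith, Jon", "Doe, Ann", "Roe, Sam"]
stop = {", "}

print(len(candidate_pairs(master, using)), len(candidate_pairs(master, using, stop)))
print(cosine(master[0], using[0]), cosine(master[0], using[0], stop))
```

In this toy case every pair shares the ", " gram, so the index cannot discard anything until that gram is ignored. The score for the remaining pair also moves, because the ignored gram drops out of both the numerator and the norms.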
Code:
. use medium, clear
. matchit person_id person_name using mediumlarge.dta, idu(person_id) txtu(person_name) ti di f(1) stopw
Matching current dataset with mediumlarge.dta
Similarity function: bigram
4 May 2016 10:35:58
Performing preliminary diagnosis
--------------------------------
Analyzing Master file
List of most frequent grams in Master file:
    grams    freq   grams_per_obs
1.  ,        1139          1.1390
2.  er        217          0.2170
3.  an        205          0.2050
4.  J         183          0.1830
5.  C         176          0.1760
6.  on        171          0.1710
7.  ar        167          0.1670
8.  or        162          0.1620
9.  I         149          0.1490
10. en        141          0.1410
11. S         124          0.1240
12. M         121          0.1210
13. R         114          0.1140
14. ch        113          0.1130
15. ra        111          0.1110
16. A         110          0.1100
17. in        110          0.1100
18. D         109          0.1090
19. L         106          0.1060
20. n,        104          0.1040
Analyzing Using file
List of most frequent grams in Using file:
    grams    freq   grams_per_obs
1.  ,       11079          1.1079
2.  an       2144          0.2144
3.  er       2115          0.2115
4.  J        1795          0.1795
5.  ar       1794          0.1794
6.  on       1632          0.1632
7.  C        1539          0.1539
8.  I        1448          0.1448
9.  M        1349          0.1349
10. en       1307          0.1307
11. or       1302          0.1302
12. R        1260          0.1260
13. A        1252          0.1252
14. ic       1191          0.1191
15. S        1132          0.1132
16. n,       1125          0.1125
17. D        1124          0.1124
18. in       1085          0.1085
19. ha       1025          0.1025
20. ra       1024          0.1024
(638 real changes made)
(1 real change made)
Overall diagnosis
Pairs being compared: Master(1000) x Using(10000) = 10000000
Estimated maximum reduction by indexation (%):0
(note: this is an indication, final results may differ)
List of grams with greater negative impact to indexation:
(note: values are estimated, final results may differ)
    grams  crosspairs  max_common_space  grams_per_obs
1.  ,        12618981            100.00         1.1107
2.  er         458955              4.59         0.2120
3.  an         439520              4.40         0.2135
4.  J          328485              3.28         0.1798
5.  ar         299598              3.00         0.1783
6.  on         279072              2.79         0.1639
7.  C          270864              2.71         0.1559
8.  I          215752              2.16         0.1452
9.  or         210924              2.11         0.1331
10. en         184287              1.84         0.1316
11. M          163229              1.63         0.1336
12. R          143640              1.44         0.1249
13. S          140368              1.40         0.1142
14. A          137720              1.38         0.1238
15. D          122516              1.23         0.1121
16. in         119350              1.19         0.1086
17. n,         117000              1.17         0.1117
18. ra         113664              1.14         0.1032
19. ch         112322              1.12         0.1006
20. ic         104808              1.05         0.1163
Loading USING file: mediumlarge.dta
Generating stopwords automatically, threshold set at:.2
Done!
Indexing USING file.
4 May 2016 10:36:04-> 0%
4 May 2016 10:36:04-> 1%
4 May 2016 10:36:04-> 2%
4 May 2016 10:36:04-> 3%
...
4 May 2016 10:36:07-> 97%
4 May 2016 10:36:07-> 98%
4 May 2016 10:36:07-> 99%
4 May 2016 10:36:07-> Done!
Computing results
4 May 2016 10:36:07-> Percent completed ... (search space saved by index so far)
4 May 2016 10:36:09-> 1% ... (48%)
4 May 2016 10:36:10-> 2% ... (52%)
4 May 2016 10:36:11-> 3% ... (53%)
4 May 2016 10:36:12-> 4% ... (54%)
...
4 May 2016 10:37:54-> 97% ... (57%)
4 May 2016 10:37:55-> 98% ... (57%)
4 May 2016 10:37:55-> 99% ... (57%)
4 May 2016 10:37:57-> Done!
Total search space saved by index: 57%
4 May 2016 10:37:57
