The Stata command strgroup has been updated to version 1.0.5, which brings a 4.5x performance improvement. Install with:
Code:
net install strgroup, from("https://raw.githubusercontent.com/reifjulian/strgroup/master") replace
strgroup performs fuzzy string matching using the Levenshtein edit distance. It groups similar strings according to a user-specified similarity threshold, which is useful for identifying potential matches between datasets that don't merge cleanly because of typos, abbreviations, or other inconsistencies. Detailed documentation is available at https://github.com/reifjulian/strgroup. Syntax and usage instructions are also available within Stata by typing help strgroup at the command prompt.
Example: Identify potential matches between two datasets that didn't merge.
Code:
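* Build a "using" dataset in which one make contains a typo
sysuse auto, clear
tempfile t
keep make price
replace make = make + "a" in 5
save "`t'"
* Merge it back onto the original makes; the altered observation will not match
sysuse auto, clear
keep make
merge 1:1 make using "`t'"
* Group the unmatched strings that fall within the similarity threshold
strgroup make if _merge!=3, gen(group) threshold(0.25)
list make group if _merge!=3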
     +-------------------------------+
     | make                    group |
     |-------------------------------|
  5. | Buick Electra 225a          1 |
 79. | Buick Electra 225           1 |
     +-------------------------------+
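To see why the two strings in the listing end up in the same group at threshold(0.25), here is a rough sketch of the underlying arithmetic, written in Python rather than Stata. It is not strgroup's actual implementation. It assumes the edit distance is divided by the length of the shorter string before being compared with the threshold; check help strgroup for the normalization the command really applies. The function names are made up for this illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or keep)
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """ASSUMPTION: normalize by the length of the shorter string."""
    shorter = min(len(a), len(b))
    if shorter == 0:
        # Avoid dividing by zero when one string is empty
        return 0.0 if a == b else float("inf")
    return levenshtein(a, b) / shorter


d = normalized_distance("Buick Electra 225a", "Buick Electra 225")
print(d, d <= 0.25)
```

Under that assumption the two makes differ by a single inserted character against a shorter string of 17 characters, so the normalized distance is 1/17, comfortably below 0.25.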
