fuzzy match

Tia Landry

Join Date: Jan 2016
Posts: 49

08 Apr 2021, 18:56

I am new to the matchit command and am finding it challenging to understand what the different options mean and which one would best suit my needs.

Dataset 1

SOME_KIND_OF_NAME

THESQUIRL WAS YELLOW AND SMOOTH

THESQUIRL WAS YELLOW SUNSHINE AND SMOOTH

THESQUIRLWASPURPLE

BLUE MUFFINS ARE-AWESOME

BLUE-RAY MUFFINS ARE

Dataset 2 (look-up table)

COLORS

GREEN

PURPLE

YELLOW SUNSHINE

BLUE-RAY

For example, this is the code I am using:

use "DIRECTORY-dataset1 ", clear
matchit SAMPLE_ID SOME_KIND_OF_NAME using "directory-dataset2 ", idu(ID) txtu(colors) sim(token) t(0)

MATCH

THESQUIRL WAS YELLOW AND SMOOTH	YELLOW SUNSHINE > wrong (I only want a match if the long string contains exactly YELLOW SUNSHINE, with the two words together)
THESQUIRL WAS YELLOW SUNSHINE AND SMOOTH	YELLOW SUNSHINE
THESQUIRLWASPURPLE	PURPLE
BLUE MUFFINS ARE-AWESOME	BLUE-RAY > wrong (I only want a match if the long string contains exactly BLUE-RAY, with the two words together)
BLUE-RAY MUFFINS ARE	BLUE-RAY

I am not sure whether it would help for this example if I created dummy data with Stata's dataex. If so, let me know and I can try to ask my question in a different way with actual data.

Thank you!

Last edited by Tia Landry; 08 Apr 2021, 18:59.


Andrew Musau

Join Date: Oct 2014
Posts: 10187

09 Apr 2021, 02:15

Fuzzy matching is mainly for non-exact matches, so I would not recommend it here. You can use a number of Stata string functions. Here is a way using regular expressions.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str15 text
"GREEN"          
"PURPLE"        
"YELLOW SUNSHINE"
"BLUE-RAY"      
end
tempfile dataset2
save `dataset2'

* Example generated by -dataex-. To install: ssc install dataex
clear
input str40 fulltext
"SOME_KIND_OF_NAME"                      
"THESQUIRL WAS YELLOW AND SMOOTH"        
"THESQUIRL WAS YELLOW SUNSHINE AND SMOOTH"
"THESQUIRLWASPURPLE"                      
"BLUE MUFFINS ARE-AWESOME"                
"BLUE-RAY MUFFINS ARE"                    
end
cross using `dataset2'
gen match = regexm(" " + lower(fulltext) + " ", "['!?,\. ]("+lower(text)+")['!?,\. ]")
keep if match

Res.:

Code:

. l

     +--------------------------------------------------------------------+
     |                                 fulltext              text   match |
     |--------------------------------------------------------------------|
  1. | THESQUIRL WAS YELLOW SUNSHINE AND SMOOTH   YELLOW SUNSHINE       1 |
  2. |                     BLUE-RAY MUFFINS ARE          BLUE-RAY       1 |
     +--------------------------------------------------------------------+

Note that "PURPLE" was not found because "WASPURPLE" is not the same word as "PURPLE". It is possible to match on substrings instead, but if you choose this route you will run into all sorts of problems: for example, "red" would be matched to entries containing the words reduced, redirect, etc.
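
To illustrate the substring route and its pitfall, here is a minimal sketch using strpos(), which returns the position of the first occurrence of one string in another, or 0 if there is none. The look-up entry "RED", the string "COSTS WERE REDUCED", and the names lookup and submatch are made up for this illustration and are not part of the original data.

Code:

* Sketch only: substring matching with strpos(); "RED" and "COSTS WERE REDUCED" are invented examples
clear
input str15 text
"PURPLE"
"YELLOW SUNSHINE"
"RED"
end
tempfile lookup
save `lookup'

clear
input str40 fulltext
"THESQUIRLWASPURPLE"
"THESQUIRL WAS YELLOW AND SMOOTH"
"COSTS WERE REDUCED"
end
* form all pairwise combinations of long strings and look-up entries
cross using `lookup'
* flag a pair when the look-up entry appears anywhere inside the long string
gen byte submatch = strpos(upper(fulltext), upper(text)) > 0
list fulltext text if submatch, noobs

This should flag "THESQUIRLWASPURPLE" with "PURPLE", which is the match wanted in the original post, but also "COSTS WERE REDUCED" with "RED", which is the kind of false positive described above.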
