  • fuzzy match

    I am new to the matchit command and am finding it hard to understand what the different options mean and which one would best suit my needs.

    Dataset 1
    SOME_KIND_OF_NAME
    THESQUIRL WAS YELLOW AND SMOOTH
    THESQUIRL WAS YELLOW SUNSHINE AND SMOOTH
    THESQUIRLWASPURPLE
    BLUE MUFFINS ARE-AWESOME
    BLUE-RAY MUFFINS ARE

    Dataset 2 (lookup table)
    COLORS
    GREEN
    PURPLE
    YELLOW SUNSHINE
    BLUE-RAY

    For example, this is the code I am using:

    use "DIRECTORY-dataset1 ", clear
    matchit SAMPLE_ID SOME_KIND_OF_NAME using "directory-dataset2 ", idu(ID) txtu(colors) sim(token) t(0)

    MATCH (name | matched color)
    THESQUIRL WAS YELLOW AND SMOOTH | YELLOW SUNSHINE > wrong (I want a match only if the long string contains exactly YELLOW SUNSHINE, with the words together)
    THESQUIRL WAS YELLOW SUNSHINE AND SMOOTH | YELLOW SUNSHINE
    THESQUIRLWASPURPLE | PURPLE
    BLUE MUFFINS ARE-AWESOME | BLUE-RAY > wrong (I want a match only if the long string contains exactly BLUE-RAY, with the words together)
    BLUE-RAY MUFFINS ARE | BLUE-RAY

    I am not sure whether it would help if I provided example data for this question using Stata's dataex. If so, let me know and I can ask my question again with actual data.

    Thank you!
    Last edited by Tia Landry; 08 Apr 2021, 18:59.

  • #2
    Fuzzy matching is mainly for non-exact matches, so I would not recommend it here. You can use a number of Stata string functions. Here is a way using regular expressions.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str15 text
    "GREEN"          
    "PURPLE"        
    "YELLOW SUNSHINE"
    "BLUE-RAY"      
    end
    tempfile dataset2
    save `dataset2'
    
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str40 fulltext
    "SOME_KIND_OF_NAME"                      
    "THESQUIRL WAS YELLOW AND SMOOTH"        
    "THESQUIRL WAS YELLOW SUNSHINE AND SMOOTH"
    "THESQUIRLWASPURPLE"                      
    "BLUE MUFFINS ARE-AWESOME"                
    "BLUE-RAY MUFFINS ARE"                    
    end
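    * Form every pairwise combination of fulltext and the lookup text
    cross using `dataset2'
    * Flag pairs where text occurs as a whole word or phrase in fulltext,
    * i.e. delimited on both sides by a space or one of ' ! ? , .
    gen match = regexm(" " + lower(fulltext) + " ", "['!?,\. ]("+lower(text)+")['!?,\. ]")
    keep if match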
    Result:

    Code:
    . l
    
         +--------------------------------------------------------------------+
         |                                 fulltext              text   match |
         |--------------------------------------------------------------------|
      1. | THESQUIRL WAS YELLOW SUNSHINE AND SMOOTH   YELLOW SUNSHINE       1 |
      2. |                     BLUE-RAY MUFFINS ARE          BLUE-RAY       1 |
         +--------------------------------------------------------------------+

    Note that "PURPLE" was not matched, because "WASPURPLE" is not the same word as "PURPLE". It is possible to match substrings, but if you choose this route you will run into all sorts of problems: "red", for example, would be matched to entries containing the words reduced, redirect, etc.
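
    For completeness, here is a minimal sketch of the substring route, with the caveat just described. It assumes the same variable names as above (fulltext and text); the names lookup and submatch are only illustrative. strpos() returns the position of its second argument within the first, and 0 if it does not occur, so any nonzero value counts as a substring hit.

    ```stata
    * Sketch only: substring matching with strpos(), reusing the variable names above
    clear
    input str15 text
    "PURPLE"
    "BLUE-RAY"
    end
    tempfile lookup
    save `lookup'

    clear
    input str40 fulltext
    "THESQUIRLWASPURPLE"
    "BLUE MUFFINS ARE-AWESOME"
    "BLUE-RAY MUFFINS ARE"
    end
    * Form every pairwise combination, as in the regular expression example
    cross using `lookup'
    * submatch is 1 whenever text occurs anywhere inside fulltext, even mid-word
    gen byte submatch = strpos(lower(fulltext), lower(text)) > 0
    keep if submatch
    ```

    With this version "THESQUIRLWASPURPLE" should be paired with "PURPLE", but short lookup terms will also hit unrelated longer words, so it is only sensible when the lookup strings are distinctive.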
