On Fuzzy Match: How to Clean up ID Variable?

Michael Duarte Goncalves

Join Date: Oct 2022

Posts: 500
#1


18 Dec 2023, 03:25

Hi again everyone,

Related again to fuzzy matching, I would like to know whether there is a way in Stata to clean up my main ID variable before attempting a fuzzy match. In particular, consider the dataex example below:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str141 model
"ALFA 156 SPORTWAGONGON 2.0 JTWIN SPARK DISTINCTIVE SELESPEED"
"ALFA 156 SPORTWAGONGON 2.0 JTWIN SPARK SELESPEED"
"ALFA 156 SPORTWAGONGON 2.0 JTWIN SPARK PROGRESSION / DISTINCTIVE"
"ALFA 156 SPORTWAGONGON 2.0 TWIN SPARK SELESPEED"
"ALFA 156 SPORTWAGONGON 2.0 TWIN SPARK 16V LUJO / TWIN SPARK DISTINTIVE"
"ALFA 156 SPORTWAGONGON 2.4 JTD 20V DISTINCTIVE"
"ALFA 156 SPORTWAGONGON 2.4 JTD DISTINCTIVE"
"ALFA 156 SPORTWAGONGON 2.5 V6 24V Q-SYSTEM"
"ALFA 156 SPORTWAGONGON 2.5 V6 24V Q-SYSTEM (2000)"
"ALFA 156 SPORTWAGONGON 2.5 V6 DISTINCTIVE"
end

------------------

I observe that "SPORTWAGONGON" appears many times. However, the correct string should be "SPORTWAGON". I would like to know if there is a way in Stata to keep the version of a string that occurs most often. Let me explain:

Let's imagine that "SPORTWAGAGON" appears in 75% of the strings in my data, and that "SPORTWAGONGON" appears in 50% of them.
Would there be a way to keep the spelling that appears most often, and then have Stata make the necessary replacements in my "model" variable?
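
To make the idea concrete, here is a rough sketch of what I have in mind, not a tested solution. It assumes a small toy dataset (three hypothetical rows adapted from my example, with a "SPORTWAGON" spelling added for illustration) and helper variable names (obs, pos, w, n) that I made up. It first tabulates how often each word occurs, then replaces the less frequent spelling by hand with subinstr():

```stata
* Toy data (hypothetical rows, for illustration only)
clear
input str141 model
"ALFA 156 SPORTWAGONGON 2.0 JTWIN SPARK SELESPEED"
"ALFA 156 SPORTWAGON 2.0 TWIN SPARK SELESPEED"
"ALFA 156 SPORTWAGON 2.4 JTD DISTINCTIVE"
end

* Step 1: count how often each word occurs, to see which spelling dominates
preserve
gen long obs = _n
split model, parse(" ") gen(w)      // one variable per word: w1, w2, ...
reshape long w, i(obs) j(pos)       // one row per word
drop if w == ""
contract w, freq(n)                 // one row per distinct word, with its count
gsort -n
list w n, noobs
restore

* Step 2: replace the minority spelling with the dominant one
replace model = subinstr(model, "SPORTWAGONGON", "SPORTWAGON", .)
```

Step 2 still requires me to decide by eye which spelling to keep, so this only covers the "find the frequent version" half of the question; the automatic choice of replacement is the part I am unsure about.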

Thanks in advance for your advice. If you have any suggestions for solving this little headache, I'd love to hear from you!

Michael
