Challenging -merge-: How to merge 1:1 when variables that we want to merge are written in completely different ways (but are the same)

Michael Duarte Goncalves

Join Date: Oct 2022
Posts: 500

Challenging -merge-: How to merge 1:1 when variables that we want to merge are written in completely different ways (but are the same)

28 Nov 2023, 01:38

Hi everyone,

I'm writing to you because I have a rather complex question. I want to match two databases, one concerning the average prices of cars purchased according to model, the other on a number of characteristics of these same models, and the first registration of a given vehicle.

Problem: the two sets come from different sources, and the variable names and model names may differ slightly or greatly, even though they are identical models.

Here's an example of my Dataset 1 about average car prices:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str23 description_regist str80 model str3 fuel_prices str14 number1_regist long price
"ALFA ROMEO" "GIULIETTA 2.0JTD 150"         "D" "2"    20500
"ALFA ROMEO" "ALFA SPIDER"                  "G" ""     25900
"ALFA ROMEO" "1750 4R"                      ""  "1750"     .
"ALFA ROMEO" "ALFA ROMEO 159 SPORTWAGONGON" ""  "159"      .
"ALFA ROMEO" "ALFA GIULIETTA150 DIS"        "D" "150"  20800
"ALFA ROMEO" "ALFA ROMEO 147"               "D" "147"  18600
"ALFA ROMEO" "147"                          ""  "147"      .
"ALFA ROMEO" "147"                          "D" "147"  19000
"ALFA ROMEO" "ALFA ROME0 147"               ""  "0"        .
"ALFA ROMEO" "GT JUNIOR 1.6"                ""  "1"        .
"ALFA ROMEO" "ALFA BRERA"                   ""  ""         .
"ALFA ROMEO" "4C"                           "G" "4"    42100
"ALFA ROMEO" "159"                          "D" "159"  22000
"ALFA ROMEO" "A.R.SPIDER 916"               ""  "916"      .
"ALFA ROMEO" "ALFA BRERA"                   ""  ""         .
"ALFA ROMEO" "GIULIETTA PROGRESSION"        "D" ""     18600
"ALFA ROMEO" "AKFA GIULIETTA"               "D" ""     21600
"ALFA ROMEO" "ALFA ROMEO SZ"                ""  ""         .
"ALFA ROMEO" "FIAT 500"                     "G" "500"  12800
"ALFA ROMEO" "AL GIULIETTA"                 "D" ""     18100
"ALFA ROMEO" "ALFA ROMEO 156"               ""  "156"      .
end

And here is my Dataset 2 about cars registrations:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str23 description str25 model int year1st str1 new str8 FEC_MATRICULA int co2 byte seats_max str1 TIPO_ALIMENTACION_ITV
"ALFA ROMEO" "GIULIA"                    . "N" "04052018" 109 5 "M"
"ALFA ROMEO" "GIULIA"                    . "N" "31052018" 109 5 "M"
"ALFA ROMEO" "GIULIA"                    . "N" "26052017" 109 5 "M"
"ALFA ROMEO" "STELVIO"                   . "N" "06102017" 161 5 "M"
"ALFA ROMEO" "GIULIETTA"              2012 "U" "30092016" 149 5 "M"
"ALFA ROMEO" "STELVIO"                   . "N" "14072017" 161 5 "M"
"ALFA ROMEO" "ALFA 155T.SPARK 1.8 CA" 1995 "U" "16042019"   . 5 "M"
"ALFA ROMEO" "ALFA ROMEO 159"         2009 "U" "29052018" 179 5 "M"
"ALFA ROMEO" "SPIDER GRADUATE"           . "N" "09012017"   . 2 "M"
"ALFA ROMEO" "GIULIA"                    . "N" "28022019" 158 5 "M"
"ALFA ROMEO" "GIULIA QUADRIFOGLIO"       . "N" "29122017" 198 4 "M"
"ALFA ROMEO" "ALFA GIULIETTA"            . "N" "26072016" 104 5 "M"
"ALFA ROMEO" "GIULIA QUADRIFOGLIO"       . "N" "29062017" 189 4 "M"
"ALFA ROMEO" "ALFA ROMEO 147"         2006 "U" "11052017" 157 5 "M"
"ALFA ROMEO" "ALFA MITO"                 . "N" "26052016" 130 4 "M"
"ALFA ROMEO" "GIULIA"                    . "N" "21062018" 109 5 "M"
"ALFA ROMEO" "GIULIA QUADRIFOGLIO"       . "N" "25012017" 189 4 "M"
"ALFA ROMEO" "GIULIETTA"              2012 "U" "11052017" 124 5 "M"
"ALFA ROMEO" "ALFA GIULIETTA"            . "N" "25052018" 103 5 "M"
"ALFA ROMEO" "GIULIA QUADRIFOGLIO"       . "N" "16112017" 189 4 "M"
"ALFA ROMEO" "GIULIA"                    . "N" "21112017" 109 5 "M"
end

We can see that sometimes the model names diverge only slightly, sometimes a lot. For example, look at the description-model in purple or in dark orange above.
Your help is really appreciated!

In advance, I would likt to thank you for your time and help.
Have a lovely day/evening/night.

All the best,

Michael

Edit: I did not include all the variables for both -dataex-. But if it is needed to more clarity, do not hesitate to ask me, please. Thanks!

Tags: None

Clyde Schechter

Join Date: Apr 2014

Posts: 30152
#2

28 Nov 2023, 09:20

You need a "fuzzy match," which cannot be done within -merge-. I think the best tool for this in Stata is Julio Raffo's -matchit- command, available from SSC. It offers many different options to fine-tune the matching process. I suggest that you at least start using just the default options, and then explore others if the results using default options are not satisfactory.

Be aware that fuzzy matching is not guaranteed to pick the correct match for each one. What it can do is identify pairs and rank them according to the probability that they represent a correct match. In the end, if your data set is small enough, you will have to then go through the results and manually select the best matches from among all those proposed. If your data set is too large for that to be feasible, you will have to select a cutoff for the match score and accept those matches that meet or exceed that score (or the highest scoring one that exceeds that score). A manual search of the proposed matches for those that remain unmatched after doing that might be added to the process if that set is sufficiently small.
1 like
Comment
Michael Duarte Goncalves

Join Date: Oct 2022

Posts: 500
#3

28 Nov 2023, 09:39

Hello Clyde,

Thank you very much for your feedback!

Ok, great, I'm going to try this order and hope it works. Indeed, my dataset is way too big to do this manually.
I'll do my preliminary tests and come back to statalist if any doubt remains. Thanks again for your help.

Have a nice day.
Best regards,

Michael

Last edited by Michael Duarte Goncalves; 28 Nov 2023, 10:05.
Comment

Announcement

Challenging -merge-: How to merge 1:1 when variables that we want to merge are written in completely different ways (but are the same)

Comment

Comment