Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Challenging -merge-: How to merge 1:1 when variables that we want to merge are written in completely different ways (but are the same)

    Hi everyone,

    I'm writing to you because I have a rather complex question. I want to match two databases, one concerning the average prices of cars purchased according to model, the other on a number of characteristics of these same models, and the first registration of a given vehicle.

    Problem: the two sets come from different sources, and the variable names and model names may differ slightly or greatly, even though they are identical models.

    Here's an example of my Dataset 1 about average car prices:


    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str23 description_regist str80 model str3 fuel_prices str14 number1_regist long price
    "ALFA ROMEO" "GIULIETTA 2.0JTD 150"         "D" "2"    20500
    "ALFA ROMEO" "ALFA SPIDER"                  "G" ""     25900
    "ALFA ROMEO" "1750 4R"                      ""  "1750"     .
    "ALFA ROMEO" "ALFA ROMEO 159 SPORTWAGONGON" ""  "159"      .
    "ALFA ROMEO" "ALFA GIULIETTA150 DIS"        "D" "150"  20800
    "ALFA ROMEO" "ALFA ROMEO 147"               "D" "147"  18600
    "ALFA ROMEO" "147"                          ""  "147"      .
    "ALFA ROMEO" "147"                          "D" "147"  19000
    "ALFA ROMEO" "ALFA ROME0 147"               ""  "0"        .
    "ALFA ROMEO" "GT JUNIOR 1.6"                ""  "1"        .
    "ALFA ROMEO" "ALFA BRERA"                   ""  ""         .
    "ALFA ROMEO" "4C"                           "G" "4"    42100
    "ALFA ROMEO" "159"                          "D" "159"  22000
    "ALFA ROMEO" "A.R.SPIDER 916"               ""  "916"      .
    "ALFA ROMEO" "ALFA BRERA"                   ""  ""         .
    "ALFA ROMEO" "GIULIETTA PROGRESSION"        "D" ""     18600
    "ALFA ROMEO" "AKFA GIULIETTA"               "D" ""     21600
    "ALFA ROMEO" "ALFA ROMEO SZ"                ""  ""         .
    "ALFA ROMEO" "FIAT 500"                     "G" "500"  12800
    "ALFA ROMEO" "AL GIULIETTA"                 "D" ""     18100
    "ALFA ROMEO" "ALFA ROMEO 156"               ""  "156"      .
    end
    And here is my Dataset 2 about cars registrations:


    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str23 description str25 model int year1st str1 new str8 FEC_MATRICULA int co2 byte seats_max str1 TIPO_ALIMENTACION_ITV
    "ALFA ROMEO" "GIULIA"                    . "N" "04052018" 109 5 "M"
    "ALFA ROMEO" "GIULIA"                    . "N" "31052018" 109 5 "M"
    "ALFA ROMEO" "GIULIA"                    . "N" "26052017" 109 5 "M"
    "ALFA ROMEO" "STELVIO"                   . "N" "06102017" 161 5 "M"
    "ALFA ROMEO" "GIULIETTA"              2012 "U" "30092016" 149 5 "M"
    "ALFA ROMEO" "STELVIO"                   . "N" "14072017" 161 5 "M"
    "ALFA ROMEO" "ALFA 155T.SPARK 1.8 CA" 1995 "U" "16042019"   . 5 "M"
    "ALFA ROMEO" "ALFA ROMEO 159"         2009 "U" "29052018" 179 5 "M"
    "ALFA ROMEO" "SPIDER GRADUATE"           . "N" "09012017"   . 2 "M"
    "ALFA ROMEO" "GIULIA"                    . "N" "28022019" 158 5 "M"
    "ALFA ROMEO" "GIULIA QUADRIFOGLIO"       . "N" "29122017" 198 4 "M"
    "ALFA ROMEO" "ALFA GIULIETTA"            . "N" "26072016" 104 5 "M"
    "ALFA ROMEO" "GIULIA QUADRIFOGLIO"       . "N" "29062017" 189 4 "M"
    "ALFA ROMEO" "ALFA ROMEO 147"         2006 "U" "11052017" 157 5 "M"
    "ALFA ROMEO" "ALFA MITO"                 . "N" "26052016" 130 4 "M"
    "ALFA ROMEO" "GIULIA"                    . "N" "21062018" 109 5 "M"
    "ALFA ROMEO" "GIULIA QUADRIFOGLIO"       . "N" "25012017" 189 4 "M"
    "ALFA ROMEO" "GIULIETTA"              2012 "U" "11052017" 124 5 "M"
    "ALFA ROMEO" "ALFA GIULIETTA"            . "N" "25052018" 103 5 "M"
    "ALFA ROMEO" "GIULIA QUADRIFOGLIO"       . "N" "16112017" 189 4 "M"
    "ALFA ROMEO" "GIULIA"                    . "N" "21112017" 109 5 "M"
    end
    We can see that sometimes the model names diverge only slightly, sometimes a lot. For example, look at the description-model in purple or in dark orange above.
    Your help is really appreciated!

    In advance, I would likt to thank you for your time and help.
    Have a lovely day/evening/night.

    All the best,

    Michael

    Edit: I did not include all the variables for both -dataex-. But if it is needed to more clarity, do not hesitate to ask me, please. Thanks!

  • #2
    You need a "fuzzy match," which cannot be done within -merge-. I think the best tool for this in Stata is Julio Raffo's -matchit- command, available from SSC. It offers many different options to fine-tune the matching process. I suggest that you at least start using just the default options, and then explore others if the results using default options are not satisfactory.

    Be aware that fuzzy matching is not guaranteed to pick the correct match for each one. What it can do is identify pairs and rank them according to the probability that they represent a correct match. In the end, if your data set is small enough, you will have to then go through the results and manually select the best matches from among all those proposed. If your data set is too large for that to be feasible, you will have to select a cutoff for the match score and accept those matches that meet or exceed that score (or the highest scoring one that exceeds that score). A manual search of the proposed matches for those that remain unmatched after doing that might be added to the process if that set is sufficiently small.

    Comment


    • #3
      Hello Clyde,

      Thank you very much for your feedback!

      Ok, great, I'm going to try this order and hope it works. Indeed, my dataset is way too big to do this manually.
      I'll do my preliminary tests and come back to statalist if any doubt remains. Thanks again for your help.

      Have a nice day.
      Best regards,

      Michael
      Last edited by Michael Duarte Goncalves; 28 Nov 2023, 10:05.

      Comment

      Working...
      X