Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • -matchit-: Problem in matching correctly between two different datasets

    Hi everyone,

    I need to do a "fuzzy match" with two datasets. I have two sets of data.
    1. In the first, I have the number of observations and zip codes for a large number of zip codes in Spain.
    2. In the second, I have the number of observations per zip code for those that don't exist in the first, due for example to the inclusion/exclusion of some of the zip codes over the years.
    The idea would be to add the number of observations of unmatched zip codes (my using) to those that the algorithm finds most similar in the first. I want to do that only for zip codes that have more than 20 observations.
    I will provide a smal dataex for both datasets.

    Master dataset:
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str5 sp_zipcode_str long tag_zip_house_mun byte _merge
    "08189"   9 3
    "08191" 529 3
    "08192" 149 3
    "08193"  18 3
    "08194"  18 3
    "08195" 135 3
    "08196"   9 3
    "08197"  72 3
    "08198"  47 3
    "08199"   . 1
    "08201" 164 3
    "08202" 202 3
    "08203" 204 3
    "08204" 203 3
    "08205" 162 3
    "08206" 329 3
    "08207" 183 3
    "08208" 187 3
    "08210" 246 3
    "08211" 167 3
    "08212"  24 3
    end
    label values _merge _merge
    label def _merge 1 "Master only (1)", modify
    label def _merge 3 "Matched (3)", modify


    Using dataset:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str5 sp_zipcode_str long tag_zip_house_mun byte _merge
    "01872"   1 2
    "03180"  14 2
    "03787"   2 2
    "03800"   2 2
    "04602"   1 2
    "06083"   1 2
    "08092"   1 2
    "08190" 264 2
    "08200"   2 2
    "08220"   2 2
    "08240"  15 2
    "08400"  51 2
    "10683"   1 2
    "13197"   1 2
    "15192"   6 2
    "15407"   1 2
    "15708"   1 2
    "17092"   1 2
    "17254"   1 2
    "18849"   1 2
    "18857"   1 2
    "19019"   1 2
    "21102"   1 2
    "21458"   1 2
    "21646"   1 2
    end
    label values _merge _merge
    label def _merge 2 "Using only (2)", modify
    For example, for zip code "08190" in my using data set, I would like to add the number of observations for this zip code (264) to "08191" or "08192" in my master dataset, to have a final dataset with my master dataset, with the added observations. So, if I match "08190" with "08191", I end up with 529+264=793 observations for "08191".

    If needed, here is the chunk of code that I tried:

    Code:
    merge 1:1 sp_zipcode_str using `house_info_munic'
    
        Result                      Number of obs
        -----------------------------------------
        Not matched                         3,033
            from master                     2,956  (_merge==1)
            from using                         77  (_merge==2)
    
        Matched                             7,918  (_merge==3)
        -----------------------------------------
    
    save Spain_municip, replace
    file Spain_municip.dta saved
    
    use "Spain_municip.dta", clear
    keep if _merge == 2
    sort sp_zipcode_str
    egen i = group(sp_zipcode_str)
    tempfile not_matched_using
    save `not_matched_using', replace
    
    use "Spain_municip.dta", clear
    keep if _merge!= 2
    sort sp_zipcode_str
    egen j = group(sp_zipcode_str)
    save Spain_municip, replace
    
    
    matchit j sp_zipcode_str using `not_matched_using', idusing(i) txtusing(sp_zipcode_str)
    But I ended up without any observations!

    I hope my explanations are sufficiently clear.
    Thank you in advance.
    Michael

    Edit: This will be then used to plot that on a map, by using -geoplot- command.
    Last edited by Michael Duarte Goncalves; 11 Dec 2023, 07:16.

  • #2
    If I understand you correct then you like to compare numbers - and most likely the nearest matches are from the last digit. I think that the default -matchit- way is not the right way. If you go the -matchit- way I would opt for a (ngram,1) method. I have no clue abpout postcodes in Spain but my guess is that there should be common aggregates (e.g. city) in which you can put the postcodes for matching.

    Comment


    • #3
      Hi Marc Kaulisch,
      Thank you very much for your time and help! Yes, it is exactly what I want. So:
      1. Match the non-matched zip codes with the nearest matches.
      2. After the matching process, I'd like to add the observations of these postcodes nearest matched to the observations of the respective matches.
      • Could anyone please give me a better way to do that? Sorry, I was thinking about -matchit-, but I don't know either.
      I apologize about the -matchit- command's confusion.
      Lovely day.

      Michael

      Comment

      Working...
      X