-matchit-: Problem in matching correctly between two different datasets

Michael Duarte Goncalves

Join Date: Oct 2022
Posts: 500

-matchit-: Problem in matching correctly between two different datasets

11 Dec 2023, 07:12

Hi everyone,

I need to do a "fuzzy match" with two datasets. I have two sets of data.

In the first, I have the number of observations and zip codes for a large number of zip codes in Spain.
In the second, I have the number of observations per zip code for those that don't exist in the first, due for example to the inclusion/exclusion of some of the zip codes over the years.

The idea would be to add the number of observations of unmatched zip codes (my using) to those that the algorithm finds most similar in the first. I want to do that only for zip codes that have more than 20 observations.
I will provide a smal dataex for both datasets.

Master dataset:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str5 sp_zipcode_str long tag_zip_house_mun byte _merge
"08189"   9 3
"08191" 529 3
"08192" 149 3
"08193"  18 3
"08194"  18 3
"08195" 135 3
"08196"   9 3
"08197"  72 3
"08198"  47 3
"08199"   . 1
"08201" 164 3
"08202" 202 3
"08203" 204 3
"08204" 203 3
"08205" 162 3
"08206" 329 3
"08207" 183 3
"08208" 187 3
"08210" 246 3
"08211" 167 3
"08212"  24 3
end
label values _merge _merge
label def _merge 1 "Master only (1)", modify
label def _merge 3 "Matched (3)", modify

Using dataset:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str5 sp_zipcode_str long tag_zip_house_mun byte _merge
"01872"   1 2
"03180"  14 2
"03787"   2 2
"03800"   2 2
"04602"   1 2
"06083"   1 2
"08092"   1 2
"08190" 264 2
"08200"   2 2
"08220"   2 2
"08240"  15 2
"08400"  51 2
"10683"   1 2
"13197"   1 2
"15192"   6 2
"15407"   1 2
"15708"   1 2
"17092"   1 2
"17254"   1 2
"18849"   1 2
"18857"   1 2
"19019"   1 2
"21102"   1 2
"21458"   1 2
"21646"   1 2
end
label values _merge _merge
label def _merge 2 "Using only (2)", modify

For example, for zip code "08190" in my using data set, I would like to add the number of observations for this zip code (264) to "08191" or "08192" in my master dataset, to have a final dataset with my master dataset, with the added observations. So, if I match "08190" with "08191", I end up with 529+264=793 observations for "08191".

If needed, here is the chunk of code that I tried:

Code:

merge 1:1 sp_zipcode_str using `house_info_munic'

    Result                      Number of obs
    -----------------------------------------
    Not matched                         3,033
        from master                     2,956  (_merge==1)
        from using                         77  (_merge==2)

    Matched                             7,918  (_merge==3)
    -----------------------------------------

save Spain_municip, replace
file Spain_municip.dta saved

use "Spain_municip.dta", clear
keep if _merge == 2
sort sp_zipcode_str
egen i = group(sp_zipcode_str)
tempfile not_matched_using
save `not_matched_using', replace

use "Spain_municip.dta", clear
keep if _merge!= 2
sort sp_zipcode_str
egen j = group(sp_zipcode_str)
save Spain_municip, replace


matchit j sp_zipcode_str using `not_matched_using', idusing(i) txtusing(sp_zipcode_str)

But I ended up without any observations!

I hope my explanations are sufficiently clear.
Thank you in advance.
Michael

Edit: This will be then used to plot that on a map, by using -geoplot- command.

Last edited by Michael Duarte Goncalves; 11 Dec 2023, 07:16.

Tags: None

Marc Kaulisch

Join Date: Jan 2016

Posts: 184
#2

12 Dec 2023, 05:50

If I understand you correct then you like to compare numbers - and most likely the nearest matches are from the last digit. I think that the default -matchit- way is not the right way. If you go the -matchit- way I would opt for a (ngram,1) method. I have no clue abpout postcodes in Spain but my guess is that there should be common aggregates (e.g. city) in which you can put the postcodes for matching.
Comment
Michael Duarte Goncalves

Join Date: Oct 2022

Posts: 500
#3

13 Dec 2023, 01:21

Hi Marc Kaulisch,
Thank you very much for your time and help! Yes, it is exactly what I want. So:
Match the non-matched zip codes with the nearest matches.

After the matching process, I'd like to add the observations of these postcodes nearest matched to the observations of the respective matches.

Could anyone please give me a better way to do that? Sorry, I was thinking about -matchit-, but I don't know either.

I apologize about the -matchit- command's confusion.
Lovely day.

Michael
Comment

Announcement

-matchit-: Problem in matching correctly between two different datasets

Comment

Comment