Match according to two key variables: exact numeric variable & closest string variable

Paula de Souza Leao Spinola

Join Date: Jun 2015

Posts: 384
#1

Match according to two key variables: exact numeric variable & closest string variable

03 Oct 2022, 03:48

I have two datasets containing identifiers of some of the same individuals.

df1 is more complete and has three variables: (1) the name of the individual, (2) a regional registration id, (3) the region code where the id was registered.
df2 only informs (1) and (2) (i.e., name and regional registration id, but not the region code).

Individuals are uniquely identified by region code and regional registration id. Different regions may inform the same id number. Besides, the same individual always has the same regional registration id across the two datasets, but may not have the exact same spelling of their name across these datasets.

I would like to add the region code to df2 by matching each one of its rows to df1 using as key variables the exact regional registration id as well as the closest name between the datasets (the one that has the most similar spelling).

Consider the simple example below:

Code:

clear input str30 name regional_registration_id str1 region_code "flaviasillva" 1 "A" "rodrigomaia" 1 "B" "mariaconceicao" 1 , "C" "felipesantos" 2 "B" "mateussauro" 2 "D" "flaviasilva" 3 "F" end save "df1", replace clear input str30 name regional_registration_id str1 region_code "flaviasilva" 1 "A" "filipesantos" 2 "B" end save "df2", replace

My desired output is:

+------------------------------------+
| name region~d region~e |
|------------------------------------|
1. | flaviasilva 1 A |
2. | filipesantos 2 B |
+------------------------------------+
Tags: None

George Ford

Join Date: Aug 2014
Posts: 3337

03 Oct 2022, 07:55

Earlier post for similar issue:

HTML Code:

https://www.statalist.org/forums/forum/general-stata-discussion/general/1477795-comparing-multiple-string-variables-for-similarity

Announcement

Match according to two key variables: exact numeric variable & closest string variable

Comment