Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Match according to two key variables: exact numeric variable & closest string variable

    I have two datasets containing identifiers of some of the same individuals.

    df1 is more complete and has three variables: (1) the name of the individual, (2) a regional registration id, (3) the region code where the id was registered.
    df2 only informs (1) and (2) (i.e., name and regional registration id, but not the region code).

    Individuals are uniquely identified by region code and regional registration id. Different regions may inform the same id number. Besides, the same individual always has the same regional registration id across the two datasets, but may not have the exact same spelling of their name across these datasets.

    I would like to add the region code to df2 by matching each one of its rows to df1 using as key variables the exact regional registration id as well as the closest name between the datasets (the one that has the most similar spelling).

    Consider the simple example below:

    Code:
    clear
    input str30 name regional_registration_id str1 region_code
    "flaviasillva" 1 "A"
    "rodrigomaia" 1 "B"
    "mariaconceicao" 1 , "C"
    "felipesantos" 2 "B"
    "mateussauro" 2 "D"
    "flaviasilva" 3 "F"
    end
    save "df1", replace
    
    clear
    input str30 name regional_registration_id str1 region_code
    "flaviasilva" 1 "A"
    "filipesantos" 2 "B"
    end
    save "df2", replace
    My desired output is:

    +------------------------------------+
    | name region~d region~e |
    |------------------------------------|
    1. | flaviasilva 1 A |
    2. | filipesantos 2 B |
    +------------------------------------+



  • #2
    Earlier post for similar issue:
    HTML Code:
    https://www.statalist.org/forums/forum/general-stata-discussion/general/1477795-comparing-multiple-string-variables-for-similarity

    Comment

    Working...
    X