
  • Match it Output

    Hi all,

    I am trying to match administrative names from two datasets in which the names vary in length and spelling. The master dataset has 400 unique names and the using dataset has 379. I match them using the following:
    Code:
    matchit id municipality_name using `audit_nodup', idusing(id) txtusing(Municipio) override
    where id is the numeric code for the municipality in each data set.

    A portion of the results is shown below:
    Code:
    input int id str49 municipality_name int id1 str49 Municipio double similscore
    1 "Acambaro" 292 "Tacámbaro" .5345224838248488
    1 "Acambaro" 10 "Acámbaro" .7142857142857143
    1 "Acambaro" 2 "Acambay" .7715167498104595
    2 "Acaponeta" 3 "Acaponeta" 1
    3 "Acapulco De Juarez" 4 "Acapulco de Juárez" .7647058823529411
    4 "Acatlan" 264 "San Luis Acatlán" .5270462766947299
    4 "Acatlan" 180 "Matlapa" .5
    4 "Acatlan" 8 "Acayucan" .5443310539518174
    5 "Acayucan" 298 "Tantoyuca" .5892556509887896
    5 "Acayucan" 2 "Acambay" .5443310539518174
    5 "Acayucan" 8 "Acayucan" 1
    5 "Acayucan" 337 "Tzucacab" .5555555555555556
    6 "Actopan" 370 "Zapopan" .5
    7 "Acuna" 9 "Acuña" .5
    9 "Aguascalientes" 11 "Aguascalientes" 1
    10 "Ahome" 12 "Ahome" 1
    11 "Alamo Temapache" 377 "Álamo Temapache" .9285714285714286
    12 "Alamos" 378 "Álamos" .8
    13 "Allende" 15 "Allende" 1
    13 "Allende" 162 "La Independencia" .5345224838248488
    13 "Allende" 353 "Villa de Allende" .7492686492653552
    13 "Allende" 270 "San Miguel de Allende" .6092717958449424
    14 "Altamira" 16 "Altamira" 1
    14 "Altamira" 17 "Altamirano" .8819171036881969
    15 "Ameca" 21 "Ameca" 1
    16 "Amecameca" 21 "Ameca" .9354143466934853
    17 "Anahuac" 150 "Ixtlahuaca" .5443310539518174
    17 "Anahuac" 69 "Chihuahua" .5892556509887896
    18 "Apatzingan" 22 "Apatzingán" .7777777777777778
    20 "Apodaca" 23 "Apodaca" 1
    21 "Arandas" 27 "Arandas" 1
    22 "Arcelia" 190 "Morelia" .5
    23 "Arizpe" 234 "Ramos Arizpe" .674199862463242
    24 "Arriaga" 28 "Arteaga" .5
    25 "Arteaga" 28 "Arteaga" 1
    27 "Atlixco" 306 "Temixco" .5
    27 "Atlixco" 30 "Atlixco" 1
    27 "Atlixco" 31 "Atlixtac" .6172133998483676
    28 "Atotonilco El Alto" 18 "Altotonga" .659380473395787
    29 "Atoyac De Alvarez" 32 "Atoyac de Álvarez" .75
    33 "Banderilla" 52 "Candela" .5443310539518174
    33 "Banderilla" 353 "Villa de Allende" .5353033790313108
    33 "Banderilla" 53 "Candelaria" .5555555555555556
    34 "Benito Juarez" 323 "Tlacotepec de Benito Juárez" .5661385170722978
    34 "Benito Juarez" 41 "Benito Juárez" .8333333333333334
    36 "Boca Del Rio" 42 "Boca del Río" .6363636363636364
    38 "Cajeme" 46 "Cajeme" 1
    39 "Calkini" 48 "Calkiní" .8333333333333334
    40 "Calpulalpan" 14 "Ajalpan" .6546536707079772
    40 "Calpulalpan" 47 "Calakmul" .5050762722761054
    41 "Calvillo" 49 "Calvillo" 1
    41 "Calvillo" 245 "Saltillo" .5714285714285714
    43 "Campeche" 50 "Campeche" 1
    44 "Cananea" 115 "Galeana" .5773502691896258
    44 "Cananea" 259 "San Juan Cancuc" .5276448530110863
    44 "Cananea" 338 "Técpan de Galeana" .5
    44 "Cananea" 51 "Canatlán" .5345224838248488
    45 "Candela" 52 "Candela" 1
    45 "Candela" 53 "Candelaria" .816496580927726
    47 "Cardenas" 96 "Cárdenas" .7142857142857143
    47 "Cardenas" 54 "Carmen" .50709255283711
    48 "Cardonal" 330 "Tonalá" .50709255283711
    49 "Carmen" 54 "Carmen" 1
    50 "Castanos" 55 "Castaños" .7142857142857143
    51 "Celaya" 56 "Celaya" 1
    52 "Centro" 57 "Centla" .6
    52 "Centro" 58 "Centro" 1
    54 "Chalcatongo de Hidalgo" 132 "Hidalgo" .6531972647421809
    56 "Champoton" 61 "Champotón" .75
    58 "Chignahuapan" 68 "Chignahuapan" 1
    58 "Chignahuapan" 69 "Chihuahua" .6092717958449424
    59 "Chignautla" 68 "Chignahuapan" .502518907629606
    59 "Chignautla" 91 "Cuautla" .5443310539518174
    59 "Chignautla" 138 "Huautla" .5443310539518174
    60 "Chihuahua" 68 "Chignahuapan" .6092717958449424
    60 "Chihuahua" 73 "Chimalhuacán" .5222329678670935
    60 "Chihuahua" 364 "Yahualica" .5103103630798288
    60 "Chihuahua" 317 "Tihuatlán" .5103103630798288
    60 "Chihuahua" 69 "Chihuahua" 1
    62 "Chilpancingo De Los Bravo" 71 "Chilpancingo de los Bravo" .8333333333333334
    63 "Cihuatlan" 317 "Tihuatlán" .625
    63 "Cihuatlan" 69 "Chihuahua" .5103103630798288
    63 "Cihuatlan" 135 "Huamantla" .5
    64 "Cintalapa" 358 "Xalapa" .6324555320336759
    64 "Cintalapa" 212 "Papantla" .5345224838248488
    64 "Cintalapa" 74 "Cintalapa" 1
    65 "Ciudad Valles" 76 "Ciudad Valles" 1
    65 "Ciudad Valles" 75 "Ciudad Madero" .5400617248673216
    66 "Coalcoman De Vazquez Pallares" 79 "Comalcalco" .5063696835418333
    67 "Coatzacoalcos" 77 "Coatzacoalcos" 1
    70 "Comala" 79 "Comalcalco" .6201736729460423
    70 "Comala" 151 "Jala" .5163977794943222
    70 "Comala" 114 "Frontera Comalapa" .5590169943749475
    71 "Comalcalco" 79 "Comalcalco" 1
    71 "Comalcalco" 60 "Chalco" .6201736729460423
    72 "Comitan De Dominguez" 80 "Comitán de Domínguez" .7419408268023742
    73 "Comondu" 81 "Comondú" .8333333333333334
    73 "Comondu" 82 "Comonfort" .5773502691896258
    74 "Compostela" 83 "Compostela" 1
    75 "Cordoba" 97 "Córdoba" .6666666666666666
    end


    I want to first isolate those observations that have a perfect match, but
    Code:
    by id: keep if similscore==1
    also drops every id that has multiple observations, none of which has a similscore equal to 1. Second, is there a good way to proceed (other than manual inspection) for those ids that have multiple observations, none with similscore==1, where the observation with the highest score is not actually the best match? For example, for id 1 the second match is correct, but it has a lower score than the third observation.
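
    To make the goal concrete, here is a minimal sketch of one possible approach, not a tested solution. It assumes the matchit results shown above are in memory, and that Stata 14 or later is available for the Unicode functions. The variables has_perfect, name_m, name_u and same_plain are hypothetical names introduced only for illustration.

    ```stata
    * Sketch only: assumes the matchit output is in memory with the variables
    * id, municipality_name, id1, Municipio and similscore.

    * 1. Flag every id that has at least one perfect match
    bysort id: egen byte has_perfect = max(similscore == 1)

    * 2. For ids with a perfect match, drop the imperfect candidates;
    *    ids without a perfect match keep all of their candidate rows
    drop if has_perfect & similscore < 1

    * 3. Flag pairs that are identical once accents and case are ignored
    *    (decompose to NFD, skip non-ASCII characters, then lowercase)
    gen name_m = lower(ustrto(ustrnormalize(municipality_name, "nfd"), "ascii", 2))
    gen name_u = lower(ustrto(ustrnormalize(Municipio, "nfd"), "ascii", 2))
    gen byte same_plain = name_m == name_u
    ```

    With something like this, a pair such as "Acambaro" and "Acámbaro" would be flagged by same_plain even though its similscore is below that of "Acambay". Applying the same accent stripping to both name variables before running matchit might also raise the scores of the correct pairs, though that is an untested assumption.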