Dear all, I want each firm in the patent data to be matched with a firm in the Amadeus (sabi) data, and I do not want a given sabi firm to be matched to more than one firm in the patent data. Therefore, I first run matchit and, where the same sabi firm appears more than once, drop the observations with the lower score. Then, for each patent firm, I keep only the match with the best score. The result is saved in a file (matchit1). The idea is to merge this file with the original patent data and discard from the patent file the firms that were matched, and then to do the same for sabi. What is left in the patent and sabi data should therefore be the firms that were unmatched.
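To make the first-round selection concrete, here is a toy sketch of what I mean. The IDs and scores are made up for illustration and are not taken from my data, and `bysort` is used here in place of the `gsort` plus `duplicates drop` in my actual code.

```stata
* Toy candidate matches: hypothetical IDs and scores, for illustration only
clear
input long han_id long n_nif double similscore
1 10 0.99
2 10 0.80
2 11 0.70
3 12 0.96
end

* For each sabi firm, keep only the patent firm with the highest score
bysort n_nif (similscore): keep if _n == _N

* For each patent firm, keep only the sabi firm with the highest score
bysort han_id (similscore): keep if _n == _N

list
```

In this toy case, patent firm 2 loses sabi firm 10 to patent firm 1 and is left with sabi firm 11 at a much lower score, which is why a score cutoff is applied afterwards.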
Using a "while" loop, I rerun matchit on the reduced patent and sabi datasets (matchit2), and again drop duplicated sabi firms (a sabi firm matched to more than one patent firm) and keep only the best match, in terms of score, for each patent firm. I then append this new set of matched firms (matchit2) to the previous best matches (matchit1). The idea is to repeat the process until there are no more duplicated firms, that is, no single sabi firm serving as the match for several firms in the patent data. Of course, the number of duplicated firms will depend on the threshold in the matchit command.
However, I cannot understand why the "while" loop does not run. If I instead copy and paste the block of code from inside the loop right below it (several times), it keeps working (since dup is greater than 0). A very short data example and the code are below.
Any ideas would be more than welcome!
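As a small illustration of how the reduced datasets are built between rounds, here is a toy sketch with made-up IDs (the tempfile names `pool` and `matched` are hypothetical and are not the ones in my code). The matched firms are merged against the full pool, and only the firms found in the pool alone are kept.

```stata
* Hypothetical toy data: three patent firms in the pool
clear
input long han_id
1
2
3
end
tempfile pool
save `pool'

* One of them has already been matched to a sabi firm
clear
input long han_id long n_nif
2 11
end
tempfile matched
save `matched'

* Firms still unmatched: those found only in the pool (_merge == 2)
use `matched', clear
merge 1:1 han_id using `pool'
keep if _merge == 2
keep han_id
list
```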
Code:
clear
input long han_id str129 pat_cp_name
890 "3 D MITA SL 28500"
4280 "3P BIOPHARMACEUTICALS SL 31110"
6541 "A INGENIERIA DE AUTOMATISMOS SA 48213"
11608 "A3 ADVANCED AUTOMOTIVE ANTENNAS 08100"
14850 "ABB POWER TECH SA 28037"
15754 "ABENGOA SOLAR NEW TECH SA "
16081 "AB BIOTICS SA 08193"
16926 "ABENGOA SOLAR NEW TECNOLOGIES SA 41014"
17564 "ABN PIPE SYSTEMS SLU 15008"
18182 "ABUNDANCIA DE SANTIAGO RAMON 35017"
19340 "ABENGOA HIDROGENO SA 41014"
19410 "ABAD MUNOZ FERNANDO 14009"
19821 "ABUNDANCIA NAVARRO CRISTINA 35017"
19822 "ABUNDANCIA NAVARRO JUAN CARLOS 35017"
21281 "ACENER INVESTIGACION Y DESARROLLO SL 28030"
23363 "ACCIONA ENERGIA SA 31621"
23366 "ACCIONA TOWERS SA 28020"
23451 "ABRAHAM VENEGAS FRANCO 29730"
23728 "ABERTIS AUTOPISTAS ESPANA SA 08040"
23910 "ACOSTA APARICIO VICTOR 28006"
26111 "ACTUALITY SISTEMS SL 03113"
26529 "ACEITES DEL SUR COOSUR SA 23220"
26552 "ACERALIA TRANSFORMADOS SA 31014"
28806 "ACRONIMUS TECH SL "
30133 "ACERIA COMPACTA DE BIZKAIA SA 48910"
30280 "ACCIONA INFRAESTRUCTURAS SA 28108"
31274 "ACITURRI ENGINEERING SLU 47151"
31355 "IMMOSOLAR ACTIVE BUILDING TECH SL 07180"
31402 "ACTIVO MARK SL 46002"
31940 "ADMINISTRACION GENERAL DE LA COMUNIDAD AUTONOMA DE EUSKADI 01010"
32563 "ACONDICIONAMIENTO TARRASENSE 08225"
33140 "ADELTE AIRPORT TECH SL 08029"
35500 "ACORDE TECH SA 39002"
35722 "ADAICO SL 31006"
36403 "ACS SERVICIOS COMUNICACIONES Y ENERGIA SL 28016"
36537 "ADVANCED IN VITRO CELL TECH SL 08028"
37061 "ADVANCED MEDICAL PROJECTS 28760"
37779 "ADVANCED SCIENTIFIC TECH EUROPE SA 08700"
37791 "ADVANCED SIMULATION TECH SL 33203"
40071 "AERNNOVA ENGINEERING SOLUTIONS IBERICA 28050"
40197 "ADT ESPANA SL 50008"
41964 "ADN CONTEXT AWARE MOBILE SOLUTIONS SL 33203"
43015 "AEROSPACE CONSULTING CORP SPAIN SL 08034"
43561 "ADVANCED DIGITAL DESIGN SA 50018"
45346 "ADVANCELL ADVANCED IN VITRO CELL TECH SA 08028"
47630 "AGENCIA PUBLICA EMPRESARIAL SANITARIA HOSPITAL ALTO GUADALQUIVIR 23740"
47631 "AGENCIA PUBLICA EMPRESARIAL SANITARIA HOSPITAL DE PONIENTE 04700"
48483 "AF SISTEMAS SA 28041"
48564 "AFINITICA TECH SL 08193"
48740 "AGQ TECHNOLOGICAL CORPORATE SA 41220"
end
tempfile patent
save `patent'

clear
input long n_nif str126 sabi_cp_name
181 "ACERIA DE ALAVA SA 01470"
1221 "AERNNOVA MANUFACTURING ENGINEERING SA 01013"
9480 "LA AUXILIAR TARRASENSE SA 08221"
10625 "INMOBILIARIA TARRASENSE SA 08221"
52084 "ACCIONA SOLAR SA 31621"
61858 "ABENGOA SA 41014"
92568 "ADVANCELL ADVANCED IN VITRO CELL TECHNOLOGIES SA 08006"
93543 "AB BIOTICS SA 08172"
94636 "ABERTIS INTERNACIONAL SA 28046"
101792 "INVERSIONES AYERBE SA 28007"
102429 "ABB POWER TECHNOLOGY SA 28037"
106616 "ABB ENERGIA SA 28043"
106853 "ACCIONA MANTENIMIENTO DE INFRAESTRUCTURAS SA 28108"
108195 "ACEITES DEL SUR COOSUR SA 23220"
110005 "AF INMUEBLES SA 28014"
111016 "ADVANCED DIGITAL SA 28100"
111303 "ABERTIS AUTOPISTAS ESPAÑA SAU 28046"
114255 "ACS SERVICIOS COMUNICACIONES Y ENERGIA SA 28016"
120532 "AERNNOVA ENGINEERING SOLUTIONS IBERICA SA 28050"
124260 "ABENGOA SOLAR SA 41014"
151797 "ACENER SXXI SL 02002"
316001 "CONSTRUCCIONES PLAZA SL 14009"
320670 "DISTRIBUCIONES GOMEZ VILLA SL 14009"
347630 "ABN PIPE GESTION SL 15008"
348503 "ABN PIPE SYSTEMS SL 15008"
415960 "ADVANCED MEDICAL PROJECTS SL 28805"
428956 "ARCELORMITTAL ACERALIA BASQUE HOLDING SL 48910"
454807 "ADELTE AIRPORT TECHNOLOGIES SL 08029"
542232 "ABRAHAM RUIZ SL 29620"
581950 "ADAICO RECAMBIOS SL 31006"
582139 "3 P BIOPHARMACEUTICALS SL 31110"
582896 "ACCIONA ENERGIA SOLAR SL 31621"
606385 "ADVANCED SIMULATION TECHNOLOGIES SL 33203"
608869 "ADN CONTEXT AWARE MOBILE SOLUTIONS SL 33211"
715793 "INSTITUTO AWARE SL 46010"
720591 "INMOBILIARIA ACOSTA SL 41500"
833815 "ADT ESPAÑA SL 50008"
838620 "AGRO INDUSTRIAL AYERBE SL 50007"
862439 "ACOSTA APARICIO SL 03540"
924840 "ENERGIA AUTONOMA SL 43201"
942960 "ACTIVE BUILDING TECHNOLOGIES INTELLIGENT SYSTEMS SL 07009"
992759 "IMMOSOLAR SL 08830"
1013603 "ADT TELECOMUNICACIONES SL 08110"
1059704 "ADVANCED AUTOMOTIVE ANTENNAS SL 08028"
1064429 "BARCELONA MARK CENTER SL 08029"
1122507 "AF SOL GRUP SL 08036"
1137540 "ACRONIMUS TECHNOLOGY SL 08242"
1138273 "IN VITRO MEDIA SL 08008"
1142883 "ADELTE GROUP SL 08029"
1151815 "MITA COSTA SL 08013"
1154199 "AFINITICA TECHNOLOGIES SL 08193"
1177769 "AFINITICA PROCESS TECHNOLOGY SL 08193"
1182462 "HIDROGENO CONSULTING SL 08006"
1296991 "COMERCIAL AGAR SL 28027"
1319680 "AEROSPACE SL 28007"
1334880 "ACITURRI ENGINEERING SL 47151"
1363136 "MADRID SCIENTIFIC FILMS SL 28691"
1379457 "ACORDE CONSULTORES SL 28003"
1380552 "INVERSIONES ACORDE SL 28001"
1393146 "3 D MITA INGENIERIA SL 28500"
1400325 "ACS SERVICIOS COMUNICACIONES Y ENERGIA INTERNACIONAL SL 28016"
1410031 "TOWERS CONSULTING E INVERSIONES SL 28016"
1410223 "GRAN ABUNDANCIA SL 28006"
1433549 "INNOVA SCIENTIFIC SL 28290"
1437034 "ADVANCED MEDICAL SYSTEMS SL 28012"
1453788 "ACTUALITY EVENTOS Y COMUNICACION SL 28027"
1454045 "ACENER RENOVA SL 28006"
1475987 "MADRID AEROSPACE SERVICES SL 28850"
1476213 "ACTUALITY SALUD SL 28028"
1485810 "TOWERS INVERSIONES Y PROYECTOS SL 28028"
1500934 "ADVANCED MATERIAL SIMULATION SL 48008"
1521569 "ABRAHAM PRODUCCIONES SL 28014"
1536575 "DESARROLLO EMPRESARIAL ADVANCED SL 28014"
1540823 "CORPORACION ACCIONA INFRAESTRUCTURAS SL 28108"
1543325 "FOOD MARK GROUP SL 28007"
1548989 "CERES BIOTICS TECH SL 28830"
1575537 "ACITURRI GETAFE SL 28906"
1582869 "AGAR MANAGEMENT SL 28039"
1611422 "AGQ TECHNOLOGICAL CORPORATE SL 41013"
1616563 "LABS & TECHNOLOGICAL SERVICES AGQ SL 41220"
1662970 "LA JOYA DE LA ABUNDANCIA SL 29602"
1686003 "CARPINTERIA SEGUI SL 48213"
1688079 "ACERALIA CONSTRUCCIONES SL 48910"
1695829 "LUMINOSOS VERA SL 48213"
1704855 "HIDROGENO DEL NORTE SL 48960"
1707355 "ADAICO TRUCK & TRAILER SL 48340"
1824467 "COOP AUTONOMA VALENCIANA DE TRANSPORTES SC V 46920"
1830045 "AGENCIA SANITARIA ALTO GUADALQUIVIR 23740"
1830156 "AGENCIA PUBLICA EMPRESARIAL SANITARIA DEL BAJO GUADALQUIVIR 41710"
1830195 "AGENCIA PUBLICA EMPRESARIAL SANITARIA HOSPITAL DE PONIENTE DE ALMERIA 04700"
1830205 "AGENCIA PUBLICA EMPRESARIAL SANITARIA COSTA DEL SOL 29600"
1830463 "ABENGOA SA Y ABENGOA SERVICIOS URBANOS UTE 41018"
92568 "ADVANCELL ADVANCED IN VITRO CELL TECHNOLOGIES SA 08006"
92568 "ADVANCELL ADVANCED IN VITRO CELL TECHNOLOGIES SA 08006"
124260 "ABENGOA SOLAR SA 41014"
1138273 "IN VITRO MEDIA SL 08008"
1410223 "GRAN ABUNDANCIA SL 28006"
1410223 "GRAN ABUNDANCIA SL 28006"
1662970 "LA JOYA DE LA ABUNDANCIA SL 29602"
1662970 "LA JOYA DE LA ABUNDANCIA SL 29602"
end
tempfile sabi
save `sabi'

use `patent', clear
matchit han_id pat_cp_name using "`sabi'", idu(n_nif) txtu(sabi_cp_name) weights(simple) threshold(0.5) sim(token) override
gsort han_id -similscore

preserve
gsort sabi_cp_name -similscore
duplicates drop sabi_cp_name, force
duplicates drop pat_cp_name, force
keep if similscore>0.95
gen v1 = _n
sum v1
tempfile matchit1
save `matchit1', replace
restore

use `matchit1', clear
merge 1:1 han_id pat_cp_name using "`patent'"
keep if _merge==2
drop _merge
tempfile patent
save `patent', replace

use `matchit1', clear
merge 1:m n_nif sabi_cp_name using "`sabi'"
keep if _merge==2
drop _merge
tempfile sabi
save `sabi', replace

use `patent', clear
matchit han_id pat_cp_name using "`sabi'", idu(n_nif) txtu(sabi_cp_name) weights(simple) threshold(0.5) sim(token) override
gsort han_id -similscore
duplicates report sabi_cp_name
duplicates tag sabi_cp_name, gen(dup)
tab dup
Code:
while dup>0 {
    gsort sabi_cp_name -similscore
    duplicates drop sabi_cp_name, force
    gsort pat_cp_name -similscore
    duplicates drop pat_cp_name, force
    tempfile matchit2
    save `matchit2', replace

    use `matchit1', clear
    duplicates report han_id pat_cp_name, force
    append using `matchit2'
    tempfile matchit1
    save `matchit1', replace

    use `matchit1', clear
    merge 1:1 han_id pat_cp_name using "`patent'"
    keep if _merge==2
    drop _merge
    tempfile patent
    save `patent', replace

    use `matchit1', clear
    merge 1:m n_nif sabi_cp_name using "`sabi'"
    keep if _merge==2
    drop _merge
    tempfile sabi
    save `sabi', replace

    use `patent', clear
    matchit han_id pat_cp_name using "`sabi'", idu(n_nif) txtu(sabi_cp_name) weights(simple) threshold(0.1) sim(token) override
    gsort han_id -similscore
    duplicates report sabi_cp_name
    duplicates tag sabi_cp_name, gen(dup)
    tab dup
}
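
One possible explanation, offered as a sketch rather than a confirmed diagnosis: when a variable name appears in the condition of a programming command such as `while` or `if`, Stata evaluates it in the first observation only, so `while dup>0` behaves like `while dup[1]>0`. After `gsort han_id -similscore`, the first observation may have `dup` equal to 0 even though later observations are duplicated, in which case the loop body would never be entered. The toy data below are made up, and `ndup` is a local macro name chosen here for illustration; the idea is to count the duplicated observations and loop on that count instead.

```stata
* Toy data: dup is 0 in the first observation but positive later on
clear
input byte dup
0
1
1
end

* This looks only at dup[1], so it displays 0 (false)
display dup > 0

* Counting over all observations gives a dataset-wide condition
count if dup > 0
local ndup = r(N)
display `ndup'

* Loop skeleton: re-count at the end of each pass
while `ndup' > 0 {
    * Stand-in for the real work (de-duplicate, append, rebuild the
    * pools, rerun matchit), so that this toy loop terminates
    drop if dup > 0
    count if dup > 0
    local ndup = r(N)
}
```

In the loop above, the two lines `count if dup > 0` and `local ndup = r(N)` would go right after each `duplicates tag sabi_cp_name, gen(dup)`, once before the loop starts and once at the end of its body.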