Eliminate duplicates matches from matchit

Doris Rivera

Join Date: Feb 2020

Posts: 172
#1

Eliminate duplicates matches from matchit

19 Jul 2022, 06:44

Dear Statalist, I would like to ask you if possible, about the matchit command, however, as this questions is a little bit different from another question I did about matchit, I put it in a different post (my apologize in advance).

I want to find for each firm listed in the Patent dataset, that same firm in Amadeus (firm) dataset. I would like to know how to eliminate the matched firm from both datasets (patents and firms [Amadeus]) after doing the matchit in a recursive way. I mean, let’s say that first I do a merge for equal cases (same firm names) between the two datasets. Those firms matched should be eliminated from the matchit process I will do next in both datasets.

For those not merged, then I would use matchit. However, when using matchit, it is possible that some firms from Amadeus dataset duplicates in different matches in the patent dataset (two different firm's name from patent data uses the same matched firm name from Amadeus data). For those cases, I would need to keep as matched the line with the highest score (and eliminate those firm's name that were matched from both datasets (patents and firms). And redo the matchit for those with a lower score (poorer matching) and for those that were using the same duplicated firm's name but with a lower score. This should be done until all firms in the patent dataset are matched without these duplications.

Thereafter, I would need to joint all the matches in a single file. Is it possible to do this matchit process in a recursive way that store the matches and redo the matchit for the duplicated (with lower score) cases and poorer (say lower than 0.9) matches?

I do not know how to copy/paste the dataex from my previous post about matchit. But you may find it in #1: Matching patent and firm-level data using matchit - Statalist
Tags: None

Doris Rivera

Join Date: Feb 2020
Posts: 172

19 Jul 2022, 11:48

Just for if it help, I have been trying to do a very raw version of what I need. For that you will see next two dataex. The first is the patent data example. The second one is the firm data example. Below you will see the process I am looking for. That is, I do a first matchit and keep the best matches (eliminating the duplicate firms) which is "matchit1". Then, I merge this "matchit1" with the patent and also with sabi to eliminate the cases that were matched. Then, using the unmatched patent cases (patent_to_match), I do a second matchit with those unmatched Firm data cases (sabi_to_match).
Then I repeat the same steps, that is, from this second matchit, I elimimnate the duplicates coming from patent and firm datasets, and save it as (matchit2). Then I append this (matchit2) to the (matchit1) which are the cases that should not be pursue in the following matchit processes (I re-save this as matchit2). Then, I merge this new (matchit2) with the original patent and firm datasets to discard those that were previously matched from the next matchit processes.

My point is that it should be a more clean way to do this. But the main reason is that I do not know how many time should I have to repeat the same process. I do not know how to tell a loop that it should stop doing matchit when there is not more duplicates from both datasets.

Do you have any clue of how to do this in a better way?

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long han_id str129 pat_cp_name
  890 "3 D MITA SL 28500"                                                     
 4280 "3P BIOPHARMACEUTICALS SL 31110"                                        
 6541 "A INGENIERIA DE AUTOMATISMOS SA 48213"                                 
11608 "A3 ADVANCED AUTOMOTIVE ANTENNAS 08100"                                 
14850 "ABB POWER TECH SA 28037"                                               
15754 "ABENGOA SOLAR NEW TECH SA "                                            
16081 "AB BIOTICS SA 08193"                                                   
16926 "ABENGOA SOLAR NEW TECNOLOGIES SA 41014"                                
17564 "ABN PIPE SYSTEMS SLU 15008"                                            
18182 "ABUNDANCIA DE SANTIAGO RAMON 35017"                                    
19340 "ABENGOA HIDROGENO SA 41014"                                            
19410 "ABAD MUNOZ FERNANDO 14009"                                             
19821 "ABUNDANCIA NAVARRO CRISTINA 35017"                                     
19822 "ABUNDANCIA NAVARRO JUAN CARLOS 35017"                                  
21281 "ACENER INVESTIGACION Y DESARROLLO SL 28030"                            
23363 "ACCIONA ENERGIA SA 31621"                                              
23366 "ACCIONA TOWERS SA 28020"                                               
23451 "ABRAHAM VENEGAS FRANCO 29730"                                          
23728 "ABERTIS AUTOPISTAS ESPANA SA 08040"                                    
23910 "ACOSTA APARICIO VICTOR 28006"                                          
26111 "ACTUALITY SISTEMS SL 03113"                                            
26529 "ACEITES DEL SUR COOSUR SA 23220"                                       
26552 "ACERALIA TRANSFORMADOS SA 31014"                                       
28806 "ACRONIMUS TECH SL "                                                    
30133 "ACERIA COMPACTA DE BIZKAIA SA 48910"                                   
30280 "ACCIONA INFRAESTRUCTURAS SA 28108"                                     
31274 "ACITURRI ENGINEERING SLU 47151"                                        
31355 "IMMOSOLAR ACTIVE BUILDING TECH SL 07180"                               
31402 "ACTIVO MARK SL 46002"                                                  
31940 "ADMINISTRACION GENERAL DE LA COMUNIDAD AUTONOMA DE EUSKADI 01010"      
32563 "ACONDICIONAMIENTO TARRASENSE 08225"                                    
33140 "ADELTE AIRPORT TECH SL 08029"                                          
35500 "ACORDE TECH SA 39002"                                                  
35722 "ADAICO SL 31006"                                                       
36403 "ACS SERVICIOS COMUNICACIONES Y ENERGIA SL 28016"                       
36537 "ADVANCED IN VITRO CELL TECH SL 08028"                                  
37061 "ADVANCED MEDICAL PROJECTS 28760"                                       
37779 "ADVANCED SCIENTIFIC TECH EUROPE SA 08700"                              
37791 "ADVANCED SIMULATION TECH SL 33203"                                     
40071 "AERNNOVA ENGINEERING SOLUTIONS IBERICA 28050"                          
40197 "ADT ESPANA SL 50008"                                                   
41964 "ADN CONTEXT AWARE MOBILE SOLUTIONS SL 33203"                           
43015 "AEROSPACE CONSULTING CORP SPAIN SL 08034"                              
43561 "ADVANCED DIGITAL DESIGN SA 50018"                                      
45346 "ADVANCELL ADVANCED IN VITRO CELL TECH SA 08028"                        
47630 "AGENCIA PUBLICA EMPRESARIAL SANITARIA HOSPITAL ALTO GUADALQUIVIR 23740"
47631 "AGENCIA PUBLICA EMPRESARIAL SANITARIA HOSPITAL DE PONIENTE 04700"      
48483 "AF SISTEMAS SA 28041"                                                  
48564 "AFINITICA TECH SL 08193"                                               
48740 "AGQ TECHNOLOGICAL CORPORATE SA 41220"                                  
end

Code:

tempfile patent
save `patent'

Firm dataset

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long n_nif str126 sabi_cp_name
    181 "ACERIA DE ALAVA SA 01470"                                                   
   1221 "AERNNOVA MANUFACTURING ENGINEERING SA 01013"                                
   9480 "LA AUXILIAR TARRASENSE SA 08221"                                            
  10625 "INMOBILIARIA TARRASENSE SA 08221"                                           
  52084 "ACCIONA SOLAR SA 31621"                                                     
  61858 "ABENGOA SA 41014"                                                           
  92568 "ADVANCELL ADVANCED IN VITRO CELL TECHNOLOGIES SA 08006"                     
  93543 "AB BIOTICS SA 08172"                                                        
  94636 "ABERTIS INTERNACIONAL SA 28046"                                             
 101792 "INVERSIONES AYERBE SA 28007"                                                
 102429 "ABB POWER TECHNOLOGY SA 28037"                                              
 106616 "ABB ENERGIA SA 28043"                                                       
 106853 "ACCIONA MANTENIMIENTO DE INFRAESTRUCTURAS SA 28108"                         
 108195 "ACEITES DEL SUR COOSUR SA 23220"                                            
 110005 "AF INMUEBLES SA 28014"                                                      
 111016 "ADVANCED DIGITAL SA 28100"                                                  
 111303 "ABERTIS AUTOPISTAS ESPAÑA SAU 28046"                                       
 114255 "ACS SERVICIOS COMUNICACIONES Y ENERGIA SA 28016"                            
 120532 "AERNNOVA ENGINEERING SOLUTIONS IBERICA SA 28050"                            
 124260 "ABENGOA SOLAR SA 41014"                                                     
 151797 "ACENER SXXI SL 02002"                                                       
 316001 "CONSTRUCCIONES PLAZA SL 14009"                                              
 320670 "DISTRIBUCIONES GOMEZ VILLA SL 14009"                                        
 347630 "ABN PIPE GESTION SL 15008"                                                  
 348503 "ABN PIPE SYSTEMS SL 15008"                                                  
 415960 "ADVANCED MEDICAL PROJECTS SL 28805"                                         
 428956 "ARCELORMITTAL ACERALIA BASQUE HOLDING SL 48910"                             
 454807 "ADELTE AIRPORT TECHNOLOGIES SL 08029"                                       
 542232 "ABRAHAM RUIZ SL 29620"                                                      
 581950 "ADAICO RECAMBIOS SL 31006"                                                  
 582139 "3 P BIOPHARMACEUTICALS SL 31110"                                            
 582896 "ACCIONA ENERGIA SOLAR SL 31621"                                             
 606385 "ADVANCED SIMULATION TECHNOLOGIES SL 33203"                                  
 608869 "ADN CONTEXT AWARE MOBILE SOLUTIONS SL 33211"                                
 715793 "INSTITUTO AWARE SL 46010"                                                   
 720591 "INMOBILIARIA ACOSTA SL 41500"                                               
 833815 "ADT ESPAÑA SL 50008"                                                       
 838620 "AGRO INDUSTRIAL AYERBE SL 50007"                                            
 862439 "ACOSTA APARICIO SL 03540"                                                   
 924840 "ENERGIA AUTONOMA SL 43201"                                                  
 942960 "ACTIVE BUILDING TECHNOLOGIES INTELLIGENT SYSTEMS SL 07009"                  
 992759 "IMMOSOLAR SL 08830"                                                         
1013603 "ADT TELECOMUNICACIONES SL 08110"                                            
1059704 "ADVANCED AUTOMOTIVE ANTENNAS SL 08028"                                      
1064429 "BARCELONA MARK CENTER SL 08029"                                             
1122507 "AF SOL GRUP SL 08036"                                                       
1137540 "ACRONIMUS TECHNOLOGY SL 08242"                                              
1138273 "IN VITRO MEDIA SL 08008"                                                    
1142883 "ADELTE GROUP SL 08029"                                                      
1151815 "MITA COSTA SL 08013"                                                        
1154199 "AFINITICA TECHNOLOGIES SL 08193"                                            
1177769 "AFINITICA PROCESS TECHNOLOGY SL 08193"                                      
1182462 "HIDROGENO CONSULTING SL 08006"                                              
1296991 "COMERCIAL AGAR SL 28027"                                                    
1319680 "AEROSPACE SL 28007"                                                         
1334880 "ACITURRI ENGINEERING SL 47151"                                              
1363136 "MADRID SCIENTIFIC FILMS SL 28691"                                           
1379457 "ACORDE CONSULTORES SL 28003"                                                
1380552 "INVERSIONES ACORDE SL 28001"                                                
1393146 "3 D MITA INGENIERIA SL 28500"                                               
1400325 "ACS SERVICIOS COMUNICACIONES Y ENERGIA INTERNACIONAL SL 28016"              
1410031 "TOWERS CONSULTING E INVERSIONES SL 28016"                                   
1410223 "GRAN ABUNDANCIA SL 28006"                                                   
1433549 "INNOVA SCIENTIFIC SL 28290"                                                 
1437034 "ADVANCED MEDICAL SYSTEMS SL 28012"                                          
1453788 "ACTUALITY EVENTOS Y COMUNICACION SL 28027"                                  
1454045 "ACENER RENOVA SL 28006"                                                     
1475987 "MADRID AEROSPACE SERVICES SL 28850"                                         
1476213 "ACTUALITY SALUD SL 28028"                                                   
1485810 "TOWERS INVERSIONES Y PROYECTOS SL 28028"                                    
1500934 "ADVANCED MATERIAL SIMULATION SL 48008"                                      
1521569 "ABRAHAM PRODUCCIONES SL 28014"                                              
1536575 "DESARROLLO EMPRESARIAL ADVANCED SL 28014"                                   
1540823 "CORPORACION ACCIONA INFRAESTRUCTURAS SL 28108"                              
1543325 "FOOD MARK GROUP SL 28007"                                                   
1548989 "CERES BIOTICS TECH SL 28830"                                                
1575537 "ACITURRI GETAFE SL 28906"                                                   
1582869 "AGAR MANAGEMENT SL 28039"                                                   
1611422 "AGQ TECHNOLOGICAL CORPORATE SL 41013"                                       
1616563 "LABS & TECHNOLOGICAL SERVICES AGQ SL 41220"                                 
1662970 "LA JOYA DE LA ABUNDANCIA SL 29602"                                          
1686003 "CARPINTERIA SEGUI SL 48213"                                                 
1688079 "ACERALIA CONSTRUCCIONES SL 48910"                                           
1695829 "LUMINOSOS VERA SL 48213"                                                    
1704855 "HIDROGENO DEL NORTE SL 48960"                                               
1707355 "ADAICO TRUCK & TRAILER SL 48340"                                            
1824467 "COOP AUTONOMA VALENCIANA DE TRANSPORTES SC V 46920"                         
1830045 "AGENCIA SANITARIA ALTO GUADALQUIVIR 23740"                                  
1830156 "AGENCIA PUBLICA EMPRESARIAL SANITARIA DEL BAJO GUADALQUIVIR 41710"          
1830195 "AGENCIA PUBLICA EMPRESARIAL SANITARIA HOSPITAL DE PONIENTE DE ALMERIA 04700"
1830205 "AGENCIA PUBLICA EMPRESARIAL SANITARIA COSTA DEL SOL 29600"                  
1830463 "ABENGOA SA Y ABENGOA SERVICIOS URBANOS UTE 41018"                           
  92568 "ADVANCELL ADVANCED IN VITRO CELL TECHNOLOGIES SA 08006"                     
  92568 "ADVANCELL ADVANCED IN VITRO CELL TECHNOLOGIES SA 08006"                     
 124260 "ABENGOA SOLAR SA 41014"                                                     
1138273 "IN VITRO MEDIA SL 08008"                                                    
1410223 "GRAN ABUNDANCIA SL 28006"                                                   
1410223 "GRAN ABUNDANCIA SL 28006"                                                   
1662970 "LA JOYA DE LA ABUNDANCIA SL 29602"                                          
1662970 "LA JOYA DE LA ABUNDANCIA SL 29602"                                          
end

Code:

tempfile sabi
save `sabi'

Code:

use `patent', clear
dataex 
matchit han_id pat_cp_name using "`sabi'", idu(n_nif) txtu(sabi_cp_name) weights(simple) threshold(0.5) sim(token) override
gsort han_id -similscore

preserve
gsort sabi_cp_name -similscore
duplicates drop sabi_cp_name, force
duplicates drop pat_cp_name, force
keep if similscore>0.95
tempfile matchit1
save `matchit1', replace
restore

use `matchit1', clear
merge 1:1 han_id pat_cp_name using "`patent'"
keep if _merge!=3
drop _merge
tempfile patent_to_match
save `patent_to_match', replace

use `matchit1', clear
merge 1:1 n_nif sabi_cp_name using "`sabi'"
keep if _merge!=3
drop _merge
tempfile sabi_to_match
save `sabi_to_match', replace

use `patent_to_match', clear
matchit han_id pat_cp_name using "`sabi_to_match'", idu(n_nif) txtu(sabi_cp_name) weights(simple) threshold(0.5) sim(token) override
gsort han_id -similscore

preserve
gsort sabi_cp_name -similscore
duplicates drop sabi_cp_name, force
duplicates drop pat_cp_name, force
keep if similscore>0.9
tempfile matchit2
save `matchit2', replace
restore

use `matchit1', clear
append using `matchit2'
tempfile matchit2
save `matchit2', replace

use `matchit2', clear
merge 1:1 han_id pat_cp_name using "`patent'"
keep if _merge!=3
drop _merge
tempfile patent_to_match
save `patent_to_match', replace

use `matchit2', clear
merge 1:1 n_nif sabi_cp_name using "`sabi'"
keep if _merge!=3
drop _merge
tempfile sabi_to_match
save `sabi_to_match', replace

use `patent_to_match', clear
matchit han_id pat_cp_name using "`sabi_to_match'", idu(n_nif) txtu(sabi_cp_name) weights(simple) threshold(0.5) sim(token) override
gsort han_id -similscore
duplicates report pat_cp_name
duplicates report sabi_cp_name

Comment

Doris Rivera

Join Date: Feb 2020

Posts: 172
#3

21 Jul 2022, 11:01

Dear all, sorry to bother you again with this issue, but after trying several times, I could not yet solve it. I do not see where the error is, since I assume I am following logical steps. Again, this possibly should be done in a better way, but this is my rudimentary version.

Since I want for each firm in the patent data to be matched with a firm in the Amadues (sabi) data, I do not want a given firm from sabi to be repeated for different firms in the patent data. Therefore, I first create a matchit and eliminate those obs with the same sabi firm and lower score. Then, I keep only the firm with the best score from patent. This is kept in a file (matchit1). The idea is to merge this with the original patent data to discard from the patent file those firms that were matched. And I do the same for sabi. Therefore, I assume that what is left in the patent and sabi data are firms that were unmatched.

I redo the matchit using the new datasets for patent and sabi (matchit2), and again I eliminate duplicates sabi firm (serving as a matched firm for more than one patent firm) and keep only the best matches for patent firm (in terms of score). Thereafter, I append these new set of firms (matchit2) with the previous best matches (matchit1). The idea is to repeat the process until there is not more duplicate firms (a single firm from sabi serving as match for different firms in the patent data, and therefore, being duplicated).

However, I am having huge problems, since I cannot understand why it is not doing the loop. If instead, I copy/paste the block of codes within “loop1” right below (let’s say twice), it continues working (since dup is from 0 to 4). A very short data example is in #2.

Any idea will be more than welcome!

Code:

use `patent', clear matchit n_nif_pat pat_cp_name using "`sabi'", idu(n_nif) txtu(sabi_cp_name) weights(simple) threshold(0.5) sim(token) override gsort n_nif_pat -similscore preserve gsort sabi_cp_name -similscore duplicates drop sabi_cp_name, force duplicates drop pat_cp_name, force keep if similscore>0.95 gen v1 = _n sum v1 tempfile matchit1 save `matchit1', replace restore use `matchit1', clear merge 1:1 n_nif_pat pat_cp_name using "`patent'" keep if _merge==2 drop _merge tempfile patent save `patent', replace use `matchit1', clear merge 1:1 n_nif sabi_cp_name using "`sabi'" keep if _merge==2 drop _merge tempfile sabi save `sabi', replace use `patent', clear matchit n_nif_pat pat_cp_name using "`sabi'", idu(n_nif) txtu(sabi_cp_name) weights(simple) threshold(0.5) sim(token) override gsort n_nif_pat -similscore duplicates report sabi_cp_name duplicates tag sabi_cp_name, gen(dup) tab dup ****************************** loop1 ***************************** while dup>0 { gsort sabi_cp_name -similscore duplicates drop sabi_cp_name, force gsort pat_cp_name -similscore duplicates drop pat_cp_name, force tempfile matchit2 save `matchit2', replace use `matchit1', clear duplicates report han_id pat_cp_name, force append using `matchit2' tempfile matchit1 save `matchit1', replace use `matchit1', clear merge 1:1 han_id pat_cp_name using "`patent'" keep if _merge==2 drop _merge tempfile patent save `patent', replace use `matchit1', clear merge 1:1 n_nif sabi_cp_name using "`sabi'" keep if _merge==2 drop _merge tempfile sabi save `sabi', replace use `patent', clear matchit han_id pat_cp_name using "`sabi'", idu(n_nif) txtu(sabi_cp_name) weights(simple) threshold(0.1) sim(token) override gsort han_id -similscore duplicates report sabi_cp_name duplicates tag sabi_cp_name, gen(dup) tab dup }
Comment

Announcement