Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Eliminate duplicates matches from matchit

    Dear Statalist, I would like to ask you if possible, about the matchit command, however, as this questions is a little bit different from another question I did about matchit, I put it in a different post (my apologize in advance).

    I want to find for each firm listed in the Patent dataset, that same firm in Amadeus (firm) dataset. I would like to know how to eliminate the matched firm from both datasets (patents and firms [Amadeus]) after doing the matchit in a recursive way. I mean, let’s say that first I do a merge for equal cases (same firm names) between the two datasets. Those firms matched should be eliminated from the matchit process I will do next in both datasets.

    For those not merged, then I would use matchit. However, when using matchit, it is possible that some firms from Amadeus dataset duplicates in different matches in the patent dataset (two different firm's name from patent data uses the same matched firm name from Amadeus data). For those cases, I would need to keep as matched the line with the highest score (and eliminate those firm's name that were matched from both datasets (patents and firms). And redo the matchit for those with a lower score (poorer matching) and for those that were using the same duplicated firm's name but with a lower score. This should be done until all firms in the patent dataset are matched without these duplications.

    Thereafter, I would need to joint all the matches in a single file. Is it possible to do this matchit process in a recursive way that store the matches and redo the matchit for the duplicated (with lower score) cases and poorer (say lower than 0.9) matches?

    I do not know how to copy/paste the dataex from my previous post about matchit. But you may find it in #1: Matching patent and firm-level data using matchit - Statalist



  • #2
    Just for if it help, I have been trying to do a very raw version of what I need. For that you will see next two dataex. The first is the patent data example. The second one is the firm data example. Below you will see the process I am looking for. That is, I do a first matchit and keep the best matches (eliminating the duplicate firms) which is "matchit1". Then, I merge this "matchit1" with the patent and also with sabi to eliminate the cases that were matched. Then, using the unmatched patent cases (patent_to_match), I do a second matchit with those unmatched Firm data cases (sabi_to_match).
    Then I repeat the same steps, that is, from this second matchit, I elimimnate the duplicates coming from patent and firm datasets, and save it as (matchit2). Then I append this (matchit2) to the (matchit1) which are the cases that should not be pursue in the following matchit processes (I re-save this as matchit2). Then, I merge this new (matchit2) with the original patent and firm datasets to discard those that were previously matched from the next matchit processes.

    My point is that it should be a more clean way to do this. But the main reason is that I do not know how many time should I have to repeat the same process. I do not know how to tell a loop that it should stop doing matchit when there is not more duplicates from both datasets.

    Do you have any clue of how to do this in a better way?

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long han_id str129 pat_cp_name
      890 "3 D MITA SL 28500"                                                     
     4280 "3P BIOPHARMACEUTICALS SL 31110"                                        
     6541 "A INGENIERIA DE AUTOMATISMOS SA 48213"                                 
    11608 "A3 ADVANCED AUTOMOTIVE ANTENNAS 08100"                                 
    14850 "ABB POWER TECH SA 28037"                                               
    15754 "ABENGOA SOLAR NEW TECH SA "                                            
    16081 "AB BIOTICS SA 08193"                                                   
    16926 "ABENGOA SOLAR NEW TECNOLOGIES SA 41014"                                
    17564 "ABN PIPE SYSTEMS SLU 15008"                                            
    18182 "ABUNDANCIA DE SANTIAGO RAMON 35017"                                    
    19340 "ABENGOA HIDROGENO SA 41014"                                            
    19410 "ABAD MUNOZ FERNANDO 14009"                                             
    19821 "ABUNDANCIA NAVARRO CRISTINA 35017"                                     
    19822 "ABUNDANCIA NAVARRO JUAN CARLOS 35017"                                  
    21281 "ACENER INVESTIGACION Y DESARROLLO SL 28030"                            
    23363 "ACCIONA ENERGIA SA 31621"                                              
    23366 "ACCIONA TOWERS SA 28020"                                               
    23451 "ABRAHAM VENEGAS FRANCO 29730"                                          
    23728 "ABERTIS AUTOPISTAS ESPANA SA 08040"                                    
    23910 "ACOSTA APARICIO VICTOR 28006"                                          
    26111 "ACTUALITY SISTEMS SL 03113"                                            
    26529 "ACEITES DEL SUR COOSUR SA 23220"                                       
    26552 "ACERALIA TRANSFORMADOS SA 31014"                                       
    28806 "ACRONIMUS TECH SL "                                                    
    30133 "ACERIA COMPACTA DE BIZKAIA SA 48910"                                   
    30280 "ACCIONA INFRAESTRUCTURAS SA 28108"                                     
    31274 "ACITURRI ENGINEERING SLU 47151"                                        
    31355 "IMMOSOLAR ACTIVE BUILDING TECH SL 07180"                               
    31402 "ACTIVO MARK SL 46002"                                                  
    31940 "ADMINISTRACION GENERAL DE LA COMUNIDAD AUTONOMA DE EUSKADI 01010"      
    32563 "ACONDICIONAMIENTO TARRASENSE 08225"                                    
    33140 "ADELTE AIRPORT TECH SL 08029"                                          
    35500 "ACORDE TECH SA 39002"                                                  
    35722 "ADAICO SL 31006"                                                       
    36403 "ACS SERVICIOS COMUNICACIONES Y ENERGIA SL 28016"                       
    36537 "ADVANCED IN VITRO CELL TECH SL 08028"                                  
    37061 "ADVANCED MEDICAL PROJECTS 28760"                                       
    37779 "ADVANCED SCIENTIFIC TECH EUROPE SA 08700"                              
    37791 "ADVANCED SIMULATION TECH SL 33203"                                     
    40071 "AERNNOVA ENGINEERING SOLUTIONS IBERICA 28050"                          
    40197 "ADT ESPANA SL 50008"                                                   
    41964 "ADN CONTEXT AWARE MOBILE SOLUTIONS SL 33203"                           
    43015 "AEROSPACE CONSULTING CORP SPAIN SL 08034"                              
    43561 "ADVANCED DIGITAL DESIGN SA 50018"                                      
    45346 "ADVANCELL ADVANCED IN VITRO CELL TECH SA 08028"                        
    47630 "AGENCIA PUBLICA EMPRESARIAL SANITARIA HOSPITAL ALTO GUADALQUIVIR 23740"
    47631 "AGENCIA PUBLICA EMPRESARIAL SANITARIA HOSPITAL DE PONIENTE 04700"      
    48483 "AF SISTEMAS SA 28041"                                                  
    48564 "AFINITICA TECH SL 08193"                                               
    48740 "AGQ TECHNOLOGICAL CORPORATE SA 41220"                                  
    end
    Code:
    tempfile patent
    save `patent'
    Firm dataset
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long n_nif str126 sabi_cp_name
        181 "ACERIA DE ALAVA SA 01470"                                                   
       1221 "AERNNOVA MANUFACTURING ENGINEERING SA 01013"                                
       9480 "LA AUXILIAR TARRASENSE SA 08221"                                            
      10625 "INMOBILIARIA TARRASENSE SA 08221"                                           
      52084 "ACCIONA SOLAR SA 31621"                                                     
      61858 "ABENGOA SA 41014"                                                           
      92568 "ADVANCELL ADVANCED IN VITRO CELL TECHNOLOGIES SA 08006"                     
      93543 "AB BIOTICS SA 08172"                                                        
      94636 "ABERTIS INTERNACIONAL SA 28046"                                             
     101792 "INVERSIONES AYERBE SA 28007"                                                
     102429 "ABB POWER TECHNOLOGY SA 28037"                                              
     106616 "ABB ENERGIA SA 28043"                                                       
     106853 "ACCIONA MANTENIMIENTO DE INFRAESTRUCTURAS SA 28108"                         
     108195 "ACEITES DEL SUR COOSUR SA 23220"                                            
     110005 "AF INMUEBLES SA 28014"                                                      
     111016 "ADVANCED DIGITAL SA 28100"                                                  
     111303 "ABERTIS AUTOPISTAS ESPAÑA SAU 28046"                                       
     114255 "ACS SERVICIOS COMUNICACIONES Y ENERGIA SA 28016"                            
     120532 "AERNNOVA ENGINEERING SOLUTIONS IBERICA SA 28050"                            
     124260 "ABENGOA SOLAR SA 41014"                                                     
     151797 "ACENER SXXI SL 02002"                                                       
     316001 "CONSTRUCCIONES PLAZA SL 14009"                                              
     320670 "DISTRIBUCIONES GOMEZ VILLA SL 14009"                                        
     347630 "ABN PIPE GESTION SL 15008"                                                  
     348503 "ABN PIPE SYSTEMS SL 15008"                                                  
     415960 "ADVANCED MEDICAL PROJECTS SL 28805"                                         
     428956 "ARCELORMITTAL ACERALIA BASQUE HOLDING SL 48910"                             
     454807 "ADELTE AIRPORT TECHNOLOGIES SL 08029"                                       
     542232 "ABRAHAM RUIZ SL 29620"                                                      
     581950 "ADAICO RECAMBIOS SL 31006"                                                  
     582139 "3 P BIOPHARMACEUTICALS SL 31110"                                            
     582896 "ACCIONA ENERGIA SOLAR SL 31621"                                             
     606385 "ADVANCED SIMULATION TECHNOLOGIES SL 33203"                                  
     608869 "ADN CONTEXT AWARE MOBILE SOLUTIONS SL 33211"                                
     715793 "INSTITUTO AWARE SL 46010"                                                   
     720591 "INMOBILIARIA ACOSTA SL 41500"                                               
     833815 "ADT ESPAÑA SL 50008"                                                       
     838620 "AGRO INDUSTRIAL AYERBE SL 50007"                                            
     862439 "ACOSTA APARICIO SL 03540"                                                   
     924840 "ENERGIA AUTONOMA SL 43201"                                                  
     942960 "ACTIVE BUILDING TECHNOLOGIES INTELLIGENT SYSTEMS SL 07009"                  
     992759 "IMMOSOLAR SL 08830"                                                         
    1013603 "ADT TELECOMUNICACIONES SL 08110"                                            
    1059704 "ADVANCED AUTOMOTIVE ANTENNAS SL 08028"                                      
    1064429 "BARCELONA MARK CENTER SL 08029"                                             
    1122507 "AF SOL GRUP SL 08036"                                                       
    1137540 "ACRONIMUS TECHNOLOGY SL 08242"                                              
    1138273 "IN VITRO MEDIA SL 08008"                                                    
    1142883 "ADELTE GROUP SL 08029"                                                      
    1151815 "MITA COSTA SL 08013"                                                        
    1154199 "AFINITICA TECHNOLOGIES SL 08193"                                            
    1177769 "AFINITICA PROCESS TECHNOLOGY SL 08193"                                      
    1182462 "HIDROGENO CONSULTING SL 08006"                                              
    1296991 "COMERCIAL AGAR SL 28027"                                                    
    1319680 "AEROSPACE SL 28007"                                                         
    1334880 "ACITURRI ENGINEERING SL 47151"                                              
    1363136 "MADRID SCIENTIFIC FILMS SL 28691"                                           
    1379457 "ACORDE CONSULTORES SL 28003"                                                
    1380552 "INVERSIONES ACORDE SL 28001"                                                
    1393146 "3 D MITA INGENIERIA SL 28500"                                               
    1400325 "ACS SERVICIOS COMUNICACIONES Y ENERGIA INTERNACIONAL SL 28016"              
    1410031 "TOWERS CONSULTING E INVERSIONES SL 28016"                                   
    1410223 "GRAN ABUNDANCIA SL 28006"                                                   
    1433549 "INNOVA SCIENTIFIC SL 28290"                                                 
    1437034 "ADVANCED MEDICAL SYSTEMS SL 28012"                                          
    1453788 "ACTUALITY EVENTOS Y COMUNICACION SL 28027"                                  
    1454045 "ACENER RENOVA SL 28006"                                                     
    1475987 "MADRID AEROSPACE SERVICES SL 28850"                                         
    1476213 "ACTUALITY SALUD SL 28028"                                                   
    1485810 "TOWERS INVERSIONES Y PROYECTOS SL 28028"                                    
    1500934 "ADVANCED MATERIAL SIMULATION SL 48008"                                      
    1521569 "ABRAHAM PRODUCCIONES SL 28014"                                              
    1536575 "DESARROLLO EMPRESARIAL ADVANCED SL 28014"                                   
    1540823 "CORPORACION ACCIONA INFRAESTRUCTURAS SL 28108"                              
    1543325 "FOOD MARK GROUP SL 28007"                                                   
    1548989 "CERES BIOTICS TECH SL 28830"                                                
    1575537 "ACITURRI GETAFE SL 28906"                                                   
    1582869 "AGAR MANAGEMENT SL 28039"                                                   
    1611422 "AGQ TECHNOLOGICAL CORPORATE SL 41013"                                       
    1616563 "LABS & TECHNOLOGICAL SERVICES AGQ SL 41220"                                 
    1662970 "LA JOYA DE LA ABUNDANCIA SL 29602"                                          
    1686003 "CARPINTERIA SEGUI SL 48213"                                                 
    1688079 "ACERALIA CONSTRUCCIONES SL 48910"                                           
    1695829 "LUMINOSOS VERA SL 48213"                                                    
    1704855 "HIDROGENO DEL NORTE SL 48960"                                               
    1707355 "ADAICO TRUCK & TRAILER SL 48340"                                            
    1824467 "COOP AUTONOMA VALENCIANA DE TRANSPORTES SC V 46920"                         
    1830045 "AGENCIA SANITARIA ALTO GUADALQUIVIR 23740"                                  
    1830156 "AGENCIA PUBLICA EMPRESARIAL SANITARIA DEL BAJO GUADALQUIVIR 41710"          
    1830195 "AGENCIA PUBLICA EMPRESARIAL SANITARIA HOSPITAL DE PONIENTE DE ALMERIA 04700"
    1830205 "AGENCIA PUBLICA EMPRESARIAL SANITARIA COSTA DEL SOL 29600"                  
    1830463 "ABENGOA SA Y ABENGOA SERVICIOS URBANOS UTE 41018"                           
      92568 "ADVANCELL ADVANCED IN VITRO CELL TECHNOLOGIES SA 08006"                     
      92568 "ADVANCELL ADVANCED IN VITRO CELL TECHNOLOGIES SA 08006"                     
     124260 "ABENGOA SOLAR SA 41014"                                                     
    1138273 "IN VITRO MEDIA SL 08008"                                                    
    1410223 "GRAN ABUNDANCIA SL 28006"                                                   
    1410223 "GRAN ABUNDANCIA SL 28006"                                                   
    1662970 "LA JOYA DE LA ABUNDANCIA SL 29602"                                          
    1662970 "LA JOYA DE LA ABUNDANCIA SL 29602"                                          
    end
    Code:
    tempfile sabi
    save `sabi'
    Code:
    use `patent', clear
    dataex 
    matchit han_id pat_cp_name using "`sabi'", idu(n_nif) txtu(sabi_cp_name) weights(simple) threshold(0.5) sim(token) override
    gsort han_id -similscore
    
    preserve
    gsort sabi_cp_name -similscore
    duplicates drop sabi_cp_name, force
    duplicates drop pat_cp_name, force
    keep if similscore>0.95
    tempfile matchit1
    save `matchit1', replace
    restore
    
    use `matchit1', clear
    merge 1:1 han_id pat_cp_name using "`patent'"
    keep if _merge!=3
    drop _merge
    tempfile patent_to_match
    save `patent_to_match', replace
    
    use `matchit1', clear
    merge 1:1 n_nif sabi_cp_name using "`sabi'"
    keep if _merge!=3
    drop _merge
    tempfile sabi_to_match
    save `sabi_to_match', replace
    
    use `patent_to_match', clear
    matchit han_id pat_cp_name using "`sabi_to_match'", idu(n_nif) txtu(sabi_cp_name) weights(simple) threshold(0.5) sim(token) override
    gsort han_id -similscore
    
    preserve
    gsort sabi_cp_name -similscore
    duplicates drop sabi_cp_name, force
    duplicates drop pat_cp_name, force
    keep if similscore>0.9
    tempfile matchit2
    save `matchit2', replace
    restore
    
    use `matchit1', clear
    append using `matchit2'
    tempfile matchit2
    save `matchit2', replace
    
    use `matchit2', clear
    merge 1:1 han_id pat_cp_name using "`patent'"
    keep if _merge!=3
    drop _merge
    tempfile patent_to_match
    save `patent_to_match', replace
    
    use `matchit2', clear
    merge 1:1 n_nif sabi_cp_name using "`sabi'"
    keep if _merge!=3
    drop _merge
    tempfile sabi_to_match
    save `sabi_to_match', replace
    
    use `patent_to_match', clear
    matchit han_id pat_cp_name using "`sabi_to_match'", idu(n_nif) txtu(sabi_cp_name) weights(simple) threshold(0.5) sim(token) override
    gsort han_id -similscore
    duplicates report pat_cp_name
    duplicates report sabi_cp_name

    Comment


    • #3
      Dear all, sorry to bother you again with this issue, but after trying several times, I could not yet solve it. I do not see where the error is, since I assume I am following logical steps. Again, this possibly should be done in a better way, but this is my rudimentary version.

      Since I want for each firm in the patent data to be matched with a firm in the Amadues (sabi) data, I do not want a given firm from sabi to be repeated for different firms in the patent data. Therefore, I first create a matchit and eliminate those obs with the same sabi firm and lower score. Then, I keep only the firm with the best score from patent. This is kept in a file (matchit1). The idea is to merge this with the original patent data to discard from the patent file those firms that were matched. And I do the same for sabi. Therefore, I assume that what is left in the patent and sabi data are firms that were unmatched.

      I redo the matchit using the new datasets for patent and sabi (matchit2), and again I eliminate duplicates sabi firm (serving as a matched firm for more than one patent firm) and keep only the best matches for patent firm (in terms of score). Thereafter, I append these new set of firms (matchit2) with the previous best matches (matchit1). The idea is to repeat the process until there is not more duplicate firms (a single firm from sabi serving as match for different firms in the patent data, and therefore, being duplicated).

      However, I am having huge problems, since I cannot understand why it is not doing the loop. If instead, I copy/paste the block of codes within “loop1” right below (let’s say twice), it continues working (since dup is from 0 to 4). A very short data example is in #2.

      Any idea will be more than welcome!

      Code:
      use `patent', clear 
      matchit n_nif_pat pat_cp_name using "`sabi'", idu(n_nif) txtu(sabi_cp_name) weights(simple) threshold(0.5) sim(token) override
      gsort n_nif_pat -similscore
      
      preserve
      gsort sabi_cp_name -similscore
      duplicates drop sabi_cp_name, force
      duplicates drop pat_cp_name, force
      keep if similscore>0.95
      gen v1 = _n
      sum v1
      tempfile matchit1
      save `matchit1', replace
      restore
      
      use `matchit1', clear
      merge 1:1 n_nif_pat pat_cp_name using "`patent'"
      keep if _merge==2
      drop _merge
      tempfile patent
      save `patent', replace
      
      use `matchit1', clear
      merge 1:1 n_nif sabi_cp_name using "`sabi'"
      keep if _merge==2
      drop _merge
      tempfile sabi
      save `sabi', replace
      
      use `patent', clear
      matchit n_nif_pat pat_cp_name using "`sabi'", idu(n_nif) txtu(sabi_cp_name) weights(simple) threshold(0.5) sim(token) override
      gsort n_nif_pat -similscore
      duplicates report sabi_cp_name
      duplicates tag sabi_cp_name, gen(dup)
      tab dup
      
      
      ****************************** loop1 *****************************
      
      while dup>0 { 
      
      gsort sabi_cp_name -similscore
      duplicates drop sabi_cp_name, force
      gsort pat_cp_name -similscore
      duplicates drop pat_cp_name, force
      tempfile matchit2
      save `matchit2', replace
      use `matchit1', clear
      duplicates report han_id pat_cp_name, force
      append using `matchit2'
      tempfile matchit1
      save `matchit1', replace
      
      use `matchit1', clear
      merge 1:1 han_id pat_cp_name using "`patent'"
      keep if _merge==2
      drop _merge
      tempfile patent
      save `patent', replace
      
      use `matchit1', clear
      merge 1:1 n_nif sabi_cp_name using "`sabi'"
      keep if _merge==2
      drop _merge
      tempfile sabi
      save `sabi', replace
      
      use `patent', clear
      matchit han_id pat_cp_name using "`sabi'", idu(n_nif) txtu(sabi_cp_name) weights(simple) threshold(0.1) sim(token) override
      gsort han_id -similscore
      duplicates report sabi_cp_name
      duplicates tag sabi_cp_name, gen(dup)
      tab dup
      }

      Comment

      Working...
      X