Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Matching patent and firm-level data using matchit

    Dear Statalist, I need to match patent data (PATSTAT) with nearly 8000 firms’ names and financial firm data (200000 firms). I have tried with “merge” but the amount of coincidence is too low 800 from the 8000 firms. I have seen that “matchit” could help with this. I am wondering if you can give me some guidance with this since I am truly lost.

    The patent data (see the first dataex below) have the firms’ names (person_name_clean) and a unique id (han_id) for the firm (also it has other information like region). However, in this patent dataset the same firm (han_id) may have different names (person_name_clean). This is not a problem since I could fill the other permutations of a firm’s name if at least one of the names in a given han_id is matched. The point is that most of firms (han_id) are not matched.

    Firm-level financial data (see the second dataex example below) have a unique row per firm, with a unique firm’s name, and the national id number (string because it also has letters) for each firm (which is what I need the most to be linked with the patent dataset).
    Following the matchit command, I see that in the using dataset (firm financial data) it needs a numeric variable idu() which I do not know what to put there since the national id (NIF) is string one. I have tried to leave it empty, but the command failed.

    Can anyone please help me on how to merge these two datasets (with matchit or other)? Is there any sequential steps that needs to be done? Or any further cleaning process?
    Thanks a lot in advance for your help!

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long(han_id person_id) str191 person_name_clean str2 person_ctry_code long appln_id str7 reg_code
      25955 55727233 "ACCIONA AGUA SA"                                          "ES" 450444643 "ES300"
      25955  4673612 "ACCIONA AGUA SAU"                                         "ES" 322520912 "ES300"
      30595 46984000 "ACTEGA ARTISTICA SA"                                      "ES" 274221922 "ES114"
      30595 69826549 "ACTEGA ARTISTICA SAU"                                     "ES" 488163435 "ES114"
      31942 69814576 "ADMINISTRADOR DE INFRAESTRUCTURAS FERROVIARIAS"           "ES" 439669595 "ES300"
      31942 53344761 "ADMINISTRADOR DE INFRAESTRUCTURAS FERROVIARIAS ADIF"      "ES" 375853261 "ES300"
      35716 69867964 "FICOSA ADAS SLU"                                          "ES" 523951915 "ES511"
      35716  4894042 "ADASENS AUTOMOTIVE SLU"                                   "ES" 332446336 "ES511"
      40488   177439 "ADVANCED AUTOMOTIVE ANTENNAS SL"                          "ES"    146643 "ES511"
      40488 54483641 "ADVANCED AUTOMOTIVE ANTENNAS SLU"                         "ES" 512621885 "ES511"
     149468 57749205 "ARACLON BIOTECH SL"                                       "ES"  57583877 "ES243"
     149468 57754059 "ARACLON BIOTECH"                                          "ES"  16413469 "ES243"
     167816  4083453 "ASOC DE INVESTIGACION DE LA INDUSTRIA TEXTIL"             "ES"  54956570 "ES521"
     167816 79377500 "ASOCIACION DE INVESTIGACION DE LA INDUSTRIA TEXTIL AITEX" "ES" 525801472 "ES521"
     267639 56696716 "BELLOTA HERRAMIENTAS SL"                                  "ES" 438359524 "ES212"
     267639  4227267 "BELLOTA HERRAMIENTAS SA"                                  "ES" 421504350 "ES212"
     285293 49482288 "BELGICAST INTERNACIONAL SL"                               "ES"  15873292 "ES213"
     285293  1656000 "BELGICAST INTERNACIONAL SLU"                              "ES"  16149465 "ES213"
     363840  4191675 "BORGWARNER EMISSIONS SYSTEMS SPAIN SL"                    "ES"  56198444 "ES114"
     363840 46969603 "BORGWARNER EMISSIONS SYSTEMS SPAIN SLU"                   "ES" 441582235 "ES114"
     445688 77691953 "CETIR CENTRE MEDIC SL"                                    "ES" 487788862 "ES511"
     445688   336701 "CETIR CENTRE MEDIC SA"                                    "ES" 333273194 "ES511"
     857214 49166589 "ESMALGLASS SA"                                            "ES"  15991978 "ES522"
     857214 49493891 "ESMALGLASS SAU"                                           "ES" 411983837 "ES522"
     881550   105478 "FABRICACION ASIENTOS VEHICULOS INDUSTRIALES SA"           "ES" 475624433 "ES511"
     881550  3612976 "FABRICACION ASIENTOS VEHICULOS INDUSTRIALES SA FAINSA"    "ES"  17188769 "ES511"
     914089  1085153 "FEDERAL SIGNAL VAMA SA"                                   "ES"  15845924 "ES511"
     914089 69867962 "FEDERAL SIGNAL VAMA SAU"                                  "ES" 533306720 "ES511"
    1113791 56786974 "FAMA SOFAS SL"                                            "ES" 493116528 "ES620"
    1113791 72042076 "FAMA SOFAS SLU"                                           "ES" 511127730 "ES620"
    1262544 77678129 "GALENICUM HEALTH SLU"                                     "ES" 513301106 "ES511"
    1262544 45074134 "GALENICUM HEALTH SL"                                      "ES" 414765977 "ES511"
    1366987  2209701 "INDUSTRIAL TECNICA PECUARIA SA"                           "ES" 273607356 "ES511"
    1366987 44850616 "INDUSTRIA TECNICA PECUARIA SA"                            "ES"  16423946 "ES511"
    1373289 57752917 "INGENIATRICS TECNOLOGIAS SL"                              "ES" 470580193 "ES415"
    1373289 53351661 "INGENIATRICS TECNOLOGIAS"                                 "ES" 437553942 "ES618"
    1381931  4555171 "INSTITUT UNIV DE CIENCIA I TECNOLOGIA SA"                 "ES" 274221936 "ES511"
    1381931   180527 "INSTITUT UNIV DE CIENCIA I TECNOLOGIA"                    "ES"    150352 "ES511"
    1530294  3375998 "INDUSTRIAS TAYG SL"                                       "ES"  17065792 "ES523"
    1530294 77639604 "INDUSTRIAS TAYG SLU"                                      "ES" 518846697 "ES523"
        762 47944213 "1919 POLO PATENT SL"                                    "ES" 405684877 "ES300"
    3639482 55695205 "20 EMMA 20 SL"                                          "ES" 448680622 "ES212"
       2690  2015391 "2000 TRANSMISIONES EUROPA SA"                           "ES"  16322784 "ES212"
    3639482 56668388 "20EMMA20 SL"                                            "ES" 447230178 "ES212"
       2726  4941846 "23 CANVA SL"                                            "ES" 333779315 "ES521"
    4338570 74542337 "2EYES VISION SL"                                        "ES" 505178480 "ES300"
       2868 69822626 "2PI PROJECTES 2015 SL"                                  "ES" 486130647 "ES511"
       1979 49378546 "2T PACK SL"                                             "ES"  17367907 "ES511"
        890  4818817 "3 D MITA SL"                                            "ES" 329010052 "ES300"
    2027409 57736198 "3D SPORT SIGNS SL"                                      "ES" 475308823 "ES511"
    4391376 69806481 "3D ZINGS ADDITIVE TECH SL"                              "ES" 505138890 "ES617"
       5835 45011965 "3DITIZE SL"                                             "ES" 423258278 "ES212"
       3485 53449440 "3OTP AUTENTICACION SL"                                  "ES" 405264750 "ES300"
       4280 52344451 "3P BIOPHARMACEUTICALS SL"                               "ES" 424018082 "ES220"
       4311  4616989 "3T SCIENCE SL"                                          "ES" 275558515 "ES618"
    4506624 77642626 "4BASEBIO SL"                                            "ES" 504472686 "ES300"
    4338572 74443201 "4D PHARMA LEON SLU"                                     "ES" 497820661 "ES413"
       7381  3484950 "A G EDICIONES SL"                                       "ES"  17121669 "ES514"
       6541 53345686 "A INGENIERIA DE AUTOMATISMOS SA"                        "ES" 341899124 "ES213"
      11608 46975939 "A3 ADVANCED AUTOMOTIVE ANTENNAS"                        "ES"  16259086 "ES511"
    3740738 69761498 "AAAG SEROS TRAINING SL"                                 "ES" 458842211 "ES511"
      13806  4094454 "AB BIOTICS PRODUCCIONES INDUSTRIALES DE MICROBIOTAS SL" "ES" 267762040 "ES511"
      16081 49492170 "AB BIOTICS SA"                                          "ES" 324034604 "ES511"
    3639693 69819648 "ABAC THERAPEUTICS SL"                                   "ES" 483087943 "ES511"
      18319 53347844 "ABACCUS SOLUCIONES E INNOVACION SL"                     "ES" 341426648 "ES300"
      19410 53340184 "ABAD MUNOZ FERNANDO"                                    "ES" 441085128 "ES613"
      16437  4466412 "ABAIN COMPONENTS SL"                                    "ES" 273315584 "ES212"
      13284 54588366 "ABANZA TECNOMED SL"                                     "ES" 505240207 "ES220"
    3640729 57753642 "ABARCONES UCO SL"                                       "ES" 480589899 "ES130"
    3891646 69819580 "ABASCAL RUBIO MARIA SUSANA"                             "ES" 490082637 "ES220"
      14850   528001 "ABB POWER TECH SA"                                      "ES"    309311 "ES300"
      17309 54601978 "ABBCN SL"                                               "ES" 447230833 "ES511"
    4431887 72157937 "ABCR LABORATORIOS SL"                                   "ES" 492419961 "ES111"
      15751  4785854 "ABENGOA BIOENERGIA NUEVAS TECNOLOGIAS SA"               "ES" 326070848 "ES618"
      19340 49154164 "ABENGOA HIDROGENO SA"                                   "ES" 422325592 "ES618"
      19341 49161757 "ABENGOA RESEARCH SL"                                    "ES" 409439130 "ES618"
      25664 55653642 "ABENGOA SEAPOWER SA"                                    "ES" 445319899 "ES618"
      15754  4249504 "ABENGOA SOLAR NEW TECH SA"                              "ES" 353326683 "ES618"
      16926 72086832 "ABENGOA SOLAR NEW TECNOLOGIES SA"                       "ES" 500151725 "ES618"
       5372  3607619 "ABERDIN SL"                                             "ES"  17185440 "ES521"
      23728 52402214 "ABERTIS AUTOPISTAS ESPANA SA"                           "ES" 413211078 "ES511"
    4338573 74418656 "ABILITY PHARMACEUTICALS SL"                             "ES" 494388366 "ES511"
      26453 56707752 "ABIOPEP SL"                                             "ES" 472640430 "ES620"
      17564  4691580 "ABN PIPE SYSTEMS SLU"                                   "ES" 317591864 "ES111"
      23451  4466417 "ABRAHAM VENEGAS FRANCO"                                 "ES" 273315589 "ES617"
      26378 46983998 "ABRO BIOTEC SL"                                         "ES" 381082930 "ES418"
      18182  3624847 "ABUNDANCIA DE SANTIAGO RAMON"                           "ES"  17195355 "ES705"
      19821  3624846 "ABUNDANCIA NAVARRO CRISTINA"                            "ES"  17195355 "ES705"
      19822  3624845 "ABUNDANCIA NAVARRO JUAN CARLOS"                         "ES"  17195355 "ES705"
      24486  3834520 "ACCESORIOS DE TUBERIA DE COBRE SA"                      "ES"  17321444 "ES613"
    end

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str191 person_name_clean str9 CódigoNIF str35 Localidad str6 País
    "ACCIONA AGUA SA"                                          "A95113361" "ALCOBENDAS"             "ESPANA"
    "ACCIONA AGUA SAU"                                         ""          ""                       ""      
    "ACTEGA ARTISTICA SA"                                      "A36805794" "O PORRIÑO"             "ESPANA"
    "ACTEGA ARTISTICA SAU"                                     ""          ""                       ""      
    "ADMINISTRADOR DE INFRAESTRUCTURAS FERROVIARIAS"           "Q2801660H" "MADRID"                 "ESPANA"
    "ADMINISTRADOR DE INFRAESTRUCTURAS FERROVIARIAS ADIF"      ""          ""                       ""      
    "FICOSA ADAS SLU"                                          "B64963457" "BARCELONA"              "ESPANA"
    "ADASENS AUTOMOTIVE SLU"                                   ""          ""                       ""      
    "ADVANCED AUTOMOTIVE ANTENNAS SL"                          "B62879721" "BARCELONA"              "ESPANA"
    "ADVANCED AUTOMOTIVE ANTENNAS SLU"                         ""          ""                       ""      
    "ARACLON BIOTECH SL"                                       "B50998707" "ZARAGOZA"               "ESPANA"
    "ARACLON BIOTECH"                                          ""          ""                       ""      
    "ASOC DE INVESTIGACION DE LA INDUSTRIA TEXTIL"             "G03182870" "ALCOY/ALCOI"            "ESPANA"
    "ASOCIACION DE INVESTIGACION DE LA INDUSTRIA TEXTIL AITEX" ""          ""                       ""      
    "BELLOTA HERRAMIENTAS SL"                                  "B20426078" "LEGAZPI"                "ESPANA"
    "BELLOTA HERRAMIENTAS SA"                                  ""          ""                       ""      
    "BELGICAST INTERNACIONAL SL"                               "B48750806" "MUNGIA"                 "ESPANA"
    "BELGICAST INTERNACIONAL SLU"                              ""          ""                       ""      
    "BORGWARNER EMISSIONS SYSTEMS SPAIN SL"                    "B61507133" "VIGO"                   "ESPANA"
    "BORGWARNER EMISSIONS SYSTEMS SPAIN SLU"                   ""          ""                       ""      
    "CETIR CENTRE MEDIC SL"                                    "B58080219" "BARCELONA"              "ESPANA"
    "CETIR CENTRE MEDIC SA"                                    ""          ""                       ""      
    "ESMALGLASS SA"                                            "A12023719" "VILA-REAL"              "ESPANA"
    "ESMALGLASS SAU"                                           ""          ""                       ""      
    "FABRICACION ASIENTOS VEHICULOS INDUSTRIALES SA"           "A08361016" "MARTORELLES"            "ESPANA"
    "FABRICACION ASIENTOS VEHICULOS INDUSTRIALES SA FAINSA"    ""          ""                       ""      
    "FEDERAL SIGNAL VAMA SA"                                   "A07231954" "VILASSAR DE DALT"       "ESPANA"
    "FEDERAL SIGNAL VAMA SAU"                                  ""          ""                       ""      
    "FAMA SOFAS SL"                                            "B73855090" "YECLA"                  "ESPANA"
    "FAMA SOFAS SLU"                                           ""          ""                       ""      
    "GALENICUM HEALTH SLU"                                     "B63078075" "ESPLUGUES DE LLOBREGAT" "ESPANA"
    "GALENICUM HEALTH SL"                                      ""          ""                       ""      
    "INDUSTRIAL TECNICA PECUARIA SA"                           "A08219511" "BARCELONA"              "ESPANA"
    "INDUSTRIA TECNICA PECUARIA SA"                            ""          ""                       ""      
    "INGENIATRICS TECNOLOGIAS SL"                              "B91177469" "CAMAS"                  "ESPANA"
    "INGENIATRICS TECNOLOGIAS"                                 ""          ""                       ""      
    "INSTITUT UNIV DE CIENCIA I TECNOLOGIA SA"                 "A61516704" "MOLLET DEL VALLES"      "ESPANA"
    "INSTITUT UNIV DE CIENCIA I TECNOLOGIA"                    ""          ""                       ""      
    "INDUSTRIAS TAYG SL"                                       "B46054508" "BENIPARRELL"            "ESPANA"
    "INDUSTRIAS TAYG SLU"                                      ""          ""                       ""      
    "3S GEOTECNIA Y TECNOLOGIA SL"                                  "B20697207" "ASTIGARRAGA"               "ESPANA"
    "ACCESORIOS Y ELEVADORES VALENCIA SL"                           "B96225651" "PUÇOL"                    "ESPANA"
    "ACCESORIOS Y RESORTES SL"                                      "B03006681" "IBI"                       "ESPANA"
    "ACCIONA AGUA SA"                                               "A95113361" "ALCOBENDAS"                "ESPANA"
    "ACCIONA BIOCOMBUSTIBLES SA"                                    "A31710064" "VALLE DE EGÜES/EGUESIBAR" "ESPANA"
    "ACIDEKA SA"                                                    "A48093926" "BILBAO"                    "ESPANA"
    "ACIEROID SA"                                                   "A28182392" "L'HOSPITALET DE LLOBREGAT" "ESPANA"
    "ACTEGA ARTISTICA SA"                                           "A36805794" "O PORRIÑO"                "ESPANA"
    "ACTILUM RGB SL"                                                "B65372666" "SANT ADRIA DE BESOS"       "ESPANA"
    "ACTIU BERBEGAL Y FORMAS SA"                                    "A03137874" "CASTALLA"                  "ESPANA"
    "ACTIVIDADES DE CONSTRUCCION Y REHABILITACION GRUPO GOIALDE SL" "B31854458" "BERRIOPLANO/BERRIOBEITI"   "ESPANA"
    "ACUSTICA BEYMA SL"                                             "B46052064" "MONCADA"                   "ESPANA"
    "ADMINISTRADOR DE INFRAESTRUCTURAS FERROVIARIAS"                "Q2801660H" "MADRID"                    "ESPANA"
    "ADVANCED AUTOMOTIVE ANTENNAS SL"                               "B62879721" "BARCELONA"                 "ESPANA"
    "ADVANCED MARKER DISCOVERY SL"                                  "B47665294" "VALLADOLID"                "ESPANA"
    "AF STEELCASE SA"                                               "A78939576" "MADRID"                    "ESPANA"
    "AFHER EUROBELT SA"                                             "A47028436" "VALLADOLID"                "ESPANA"
    "AGNELLI SL"                                                    "B03501517" "ELCHE/ELX"                 "ESPANA"
    "AGROBIO SL"                                                    "B04221610" "LA MOJONERA"               "ESPANA"
    "AGROMELCA SL"                                                  "B44187888" "CALACEITE"                 "ESPANA"
    "AGROTECNOLOGIAS NATURALES SL"                                  "B60574290" "LA RIERA DE GAIA"          "ESPANA"
    "AGRUPACION CLUSTER DE ELECTRODOMESTICOS DE EUSKADI AIE"        "V01124007" "ARRASATE/MONDRAGON"        "ESPANA"
    "AINIA"                                                         "G46421988" "PATERNA"                   "ESPANA"
    "AIRBUS OPERATIONS SL"                                          "B82875055" "GETAFE"                    "ESPANA"
    "AISLAMIENTOS SUAVAL SA"                                        "A33049263" "CARREÑO"                  "ESPANA"
    "ALAMBRES PAMPLONA SL"                                          "B31074891" "PAMPLONA/IRUÑA"           "ESPANA"
    "ALBIRAL DISPLAY SOLUTIONS SL"                                  "B60452695" "SANT HIPOLIT DE VOLTREGA"  "ESPANA"
    "ALBOAIRES SA"                                                  "A03516671" "MUTXAMEL"                  "ESPANA"
    "ALCOHOLES DE TOMELLOSO SA"                                     "A13005384" "TOMELLOSO"                 "ESPANA"
    "ALEJANDRO ALTUNA SA"                                           "A20056420" "ARRASATE/MONDRAGON"        "ESPANA"
    "ALGRY QUIMICA SL"                                              "B41916198" "PALOS DE LA FRONTERA"      "ESPANA"
    "ALLGLASS CONFORT SYSTEMS SL"                                   "B92982693" "ALHAURIN DE LA TORRE"      "ESPANA"
    "ALMIRALL SA"                                                   "A58869389" "BARCELONA"                 "ESPANA"
    "ALUCOIL SA"                                                    "A81468993" "MIRANDA DE EBRO"           "ESPANA"
    "ALUMINIOS LA SERENA SA"                                        "A28692333" "VILLANUEVA DE LA SERENA"   "ESPANA"
    "AMADEO MARTI CARBONELL SA"                                     "A12011672" "NULES"                     "ESPANA"
    "AMETSIS INGENIERIA Y ASESORIA TECNICA SL"                      "B84133446" "MADRID"                    "ESPANA"
    "AMPER SA"                                                      "A28079226" "POZUELO DE ALARCON"        "ESPANA"
    "AMURRIO FERROCARRIL Y EQUIPOS SA"                              "A01061373" "AMURRIO"                   "ESPANA"
    "ANALCO AUXILIAR CALZADO SA"                                    "A53075321" "ELCHE/ELX"                 "ESPANA"
    "ANALISIS Y SIMULACION SL"                                      "B01251529" "VITORIA-GASTEIZ"           "ESPANA"
    "ANGODOS SL"                                                    "B81046609" "FUENLABRADA"               "ESPANA"
    "ANSABERE SURGICAL SL"                                          "B31802077" "GALAR"                     "ESPANA"
    "ANSITEC SOLUCIONES DE INGENIERIA E INFORMATICA SL"             "B50999143" "ZARAGOZA"                  "ESPANA"
    "ANTECUIR SL"                                                   "B03449261" "MURO DE ALCOY"             "ESPANA"
    "APLICACIONES ELECTROMECANICAS GERVALL SA"                      "A08724627" "PATERNA"                   "ESPANA"
    "APLICACIONES TECNICAS DE LA ENERGIA SL"                        "B62618442" "RIBA-ROJA DE TURIA"        "ESPANA"
    "APLICACIONES TECNOLOGICAS SA"                                  "A46309688" "PATERNA"                   "ESPANA"
    "ARACHEM SA"                                                    "A48068001" "BERANGO"                   "ESPANA"
    "ARACLON BIOTECH SL"                                            "B50998707" "ZARAGOZA"                  "ESPANA"
    "ARANGUREN COMERCIAL DEL EMBALAJE SL"                           "B96350186" "MASSAMAGRELL"              "ESPANA"
    "ARBE STOLANIC SL"                                              "B98687635" "VALENCIA"                  "ESPANA"
    "ARCILLA BLANCA SA"                                             "A12461802" "L'ALCORA"                  "ESPANA"
    "ARCUSIN SA"                                                    "A25013798" "VILA-SANA"                 "ESPANA"
    "ARFLU SA"                                                      "A78898277" "SOPELA"                    "ESPANA"
    "ARIADNA INSTRUMENTS SL"                                        "B95473088" "AMOREBIETA-ETXANO"         "ESPANA"
    "ARTIFICIAL SOLUTIONS IBERIA SL"                                "B62059068" "BARCELONA"                 "ESPANA"
    "ASEA BROWN BOVERI SA"                                          "A08002883" "MADRID"                    "ESPANA"
    "ASFALTOS Y CONSTRUCCIONES ELSAN SA"                            "A81940371" "MADRID"                    "ESPANA"
    "ASOC DE INVESTIGACION DE LA INDUSTRIA TEXTIL"                  "G03182870" "ALCOY/ALCOI"               "ESPANA"
    end



  • #2
    You can make a numeric variable that corresponds one-to-one with CodigoNIF by
    Code:
    egen long n_nif = group(CódigoNIF)
    and use n_nif as the idusing() variable in -matchit-.

    Comment


    • #3
      Dear Clyde Schechter, thanks a lot for your help! I was looking to other posts in which you also recommended to put variable values in capitals. Is there any other recommendation you think I might follow? Do you think including other command options may improve the quality of the matching?
      I am using
      Code:
      matchit han_id person_name_clean using "Firms.dta", idu(n_nif) txtu(person_name_clean) override
      Last edited by Doris Rivera; 21 Jun 2022, 10:59.

      Comment


      • #4
        The idea is to remove as much uninformative variation as possible from the values of the variables to be matched before applying -matchit-. So you want to use uniform case (I prefer upper, but using all lower is fine, too--what matters is having all one case). Also, excess spacing (leading or trailing blanks, sequences of more than one consecutive internal blank) should be removed. The -trim()- and -itrim()- functions can do that for you. Finally, in most situations I like to remove commas and periods. That can be done with the -substr()- function.

        Comment


        • #5
          Dear Clyde Schechter, I assume then that I should also eliminate the SA, SL, SLU, SAU, etc. from the firms' names in both datasets before matching? I am sure I have seen some of these not at the end of the company name, but also in the middle, so I will look how to do it. Also, thanks for the suggested commands for removing dots and double spaces.
          Some of the firm's name start with numbers (ex. "20 EMMA 20 SL", "2000 TRANSMISIONES EUROPA SA", "23 CANVA SL",...). Do you think it would be worth to also eliminate these starting numbers? In such a case, any suggestion on how to do it (since the number of caracters is not always the same)?
          Can I brefly ask you about what it does the threshold()- option? Does it improve the quality of the matching? Do you think it is worth to try it and which should be the best treshold?

          Comment


          • #6
            Clyde's recommendation is to eliminate trivial typographical variation that conveys no meaning - case differences, extra spaces, unimportant punctuation. The point to matchit is for its intelligence to save you the effort of trying to make up rules to handle the sorts of complexities caused by word - as opposed to typographical - differences.

            The output of help matchit tells us

            Please, note that matchit is case-sensitive. It also takes into account all other symbols (as far as Stata does). While data cleaning is not needed for using matchit, it often implies an improvement of the similarity scores and, in consequence, the overall quality of the matching exercise. However, too much data cleaning might remove relevant information, inducing a negative effect on quality due to false positives.
            With that said, the very helpful author of matchit, Julio Raffo , who often posts here, may perhaps see this and offer advice tailored to your specific problem.

            Comment


            • #7
              I assume then that I should also eliminate the SA, SL, SLU, SAU, etc. from the firms' names in both datasets before matching? I am sure I have seen some of these not at the end of the company name, but also in the middle, so I will look how to do it.
              I am familiar with SA, but the others you mention are new to me. These will be harder to remove, because, as you note, these character sequences can occur in the middle of names. Also, removing them might actually degrade the matching if there are two distinct companies with similar names but one is an SA and the other is not. The SA could serve to be informative here.

              Some of the firm's name start with numbers (ex. "20 EMMA 20 SL", "2000 TRANSMISIONES EUROPA SA", "23 CANVA SL",...). Do you think it would be worth to also eliminate these starting numbers?
              I don't think removing numbers is a good idea. They do convey information. And I can remember from the old days of telephone books, that there could easily be two different businesses whose names differ only by the number. You don't want to reduce those to indistinguishable.

              The above said, if Julio Raffo does respond to this thread, his advice about these matters would be much better informed than mine.

              Can I brefly ask you about what it does the threshold()- option? Does it improve the quality of the matching? Do you think it is worth to try it and which should be the best treshold?
              -matchit- pairs every name in the master data with every name in the user data and calculates a score for how similar they are. The scores are scaled to run between 0 and 1. By default, when it is done, it shows you those pairs where the score is 0.5 or higher. From that point on, you have to manually select which of the proposed matches are correct--some will be and some won't. And since fuzzy matching is, as a whole, an imperfect process, there may be some pairs that "should" match but didn't reach a similarity score of 0.5, so you don't get to see them. If you think you are seeing too many false-positive matches and would like -matchit- to be more selective, you can set a value of -threshold()- higher than 0.5. If you think you are seeing too many items going unmatched in the final results, you can set a lower threshold value so you will see more. There is no general rule for an optimal threshold. In my own usage of -matchit-, the default usually works very well, or occasionally I push it up a bit higher. But your mileage may vary, as they say. You will have to just do some trial and error to see what is best in your data. It also, by the way, depends on the relative seriousness of accepting a false match vs missing a true match. Remember that adjusting the -threshold()- parameter is a trade-off: with a higher threshold you get fewer false matches but miss more true matches, when you lower it, the opposite happens. There is no way to simultaneously reduce both types of problem.

              Again, Julio Raffo might have more to say about this, and if his advice contradicts mine, you will probably be better off following his.

              Comment


              • #8
                I am very glad to say William Lisowski and Clyde Schechter did a great job summarizing the main points. I agree that it is not a good idea to remove numbers when comparing company names (think of e.g. 3M). About the "SA" , "SL", etc., it is tricky. Often if removed from the trailing part of string, the risk of losing information is lower. However, this is why -matchit- has two valuable options weights() and stopwordsauto.

                - weights() is essential to improve the similarity score, definitely use it (even at the cost of increasing a little bit the whole computation)

                - stopwordsauto is not essential but it can be very helpful when you have certain grams/tokens that are too frequent in the data. If used correctly, it can change dramatically the speed of -matchit-

                Doris Rivera you may find it interesting/funny that -matchit-'s ancestor in PHP was precisely conceived to be used to disambiguate PATSTAT's applicants and match them with other sources.

                Comment


                • #9
                  Dear Clyde Schechter, William Lisowski, and Julio Raffo, I really appreciate and thank all your suggestions. I understand why it is not a good approach eliminating the SA, SL,… or even the numbers. Unfortunately, the firm financial data sometimes have instead of SL (SOCIEDAD LIMITADA), but I will leave it anyway and cross fingers.

                  About the threshold() option, I think I would prefer not to have false-positive matches (impute to a given patent the values of another different firm). So, I think that looking at the scores I show below I will increase the threshold to 0.8 at least (on top of eliminating dots, and double spaces).

                  About the weights() option, I totally would like to use it as you recommend, but, can you tell me how to compute a variable (with respect to which characteristic) for a good weight based on my example?

                  Is there a test I can do after the matching for evaluating the whole matching process, instead of doing it by looking randomly some cases?

                  Do you think that on top o using the firm's name, including another variable like the region in both datasets, could improve the matching? In such a case, how do I include that variable in the command?

                  Finally, is there an option where I can impose to obtain only 3 or 5 alternatives (base on the threshold level)? I mean, in the example below, I have obtained for a given firm, sometimes more than 100 different matches. Is it possible to reduce it?

                  Thanks a lot again for your help!

                  Code:
                  * Example generated by -dataex-. To install: ssc install dataex
                  clear
                  input long han_id str124 person_name_clean long n_nif str191 person_name_clean1 double similscore
                  762 "1919 POLO PATENT SL" 164021 "1919 POLO PATENT SL."     .9780192938436515
                  762 "1919 POLO PATENT SL"  55003 "RENTAT PADO SL"           .5913123959890826
                  762 "1919 POLO PATENT SL"  33374 "CENTENO POZO SL"          .5527707983925666
                  762 "1919 POLO PATENT SL" 117099 "ALTIMA PATENT SL"         .5504818825631803
                  762 "1919 POLO PATENT SL" 155286 "CENTER POOL SL"           .5321811563901744
                  762 "1919 POLO PATENT SL"  79565 "DE POLO A POLO TRAVEL SL" .5195887333176439
                  762 "1919 POLO PATENT SL"  51399 "PONTE PAZ SL"               .51425947722658
                  762 "1919 POLO PATENT SL"  91901 "LUCIO POLO SL"            .5128225940683707
                  762 "1919 POLO PATENT SL"  32226 "ECO PONIENTE SL"          .5128225940683707
                  762 "1919 POLO PATENT SL" 127099 "CENTRO DE PATOLOGIA SLP"                 .5
                  890 "3 D MITA SL"          44036 "ONDA MITA SL"             .7016464154456235
                  890 "3 D MITA SL"          79964 "LA MARMITA DE DOMINGO SL" .6459422414661737
                  890 "3 D MITA SL"         190794 "ADAMITA SL"               .6324555320336759
                  890 "3 D MITA SL"          97017 "ITACA SL"                 .5976143046671968
                  890 "3 D MITA SL"         196726 "MITAE SL"                 .5976143046671968
                  1979 "2T PACK SL"  53328 "POST PACK SL"  .8040302522073697
                  1979 "2T PACK SL" 129961 "PER PACK SL"   .7378647873726218
                  1979 "2T PACK SL" 198855 "VAL PACK SL"   .7378647873726218
                  1979 "2T PACK SL" 199943 "SCO PACK SL"   .7378647873726218
                  1979 "2T PACK SL" 110233 "SAICA PACK SL" .6735753140545634
                  1979 "2T PACK SL" 156123 "VASPACK SL"    .6666666666666666
                  1979 "2T PACK SL"  35209 "COMPACK SL"    .6666666666666666
                  1979 "2T PACK SL"  78050 "ZORPACK SL"    .6666666666666666
                  end

                  Comment


                  • #10
                    After playing with the command, I have ended combining the firm’s name with its postal code in the same variable. I thought this might add more differences between both datasets and the score would be higher only for the best matches, for which I would use a cutoff higher than say 0.9.
                    However, after redoing the matchit I see that only works for some of the firms. For instance, you can check below that the match is very poor in general, and specifically for han_id 16081, the “AB BIOTICS SA 08193” is paired with “ASECS LAB SL 20018” with a higher score than for the second option which fits better “AB BIOTICS SA 08172”. I mean, it is clearly more similar the second one, but it has a lower score.
                    My idea is to first match using the name of the firm plus its postal code and take those with a score higher than 0.9 or 0.95, and then, for those poorer matches (lower than the threshold) use only their names. At the end, I think I prefer to lose some matches than to match patents with other than their firms.
                    Do you have any hint or a better strategy (different steps) that help me to do the best match? I mean, I know that after this process, I need to start checking manually, but this is something that cannot be done for all firms since there are too many firms to check.
                    Can someone with experience in working with patent data advice me what is a common solution in such a situation?

                    Code:
                         +---------------------------------------------------------------------------------------------------------------------------------------------------------+
                         | han_id                                                    pat_cp_name     n_nif                                       amadeus_cp_name   similsc~e   count1 |
                         |---------------------------------------------------------------------------------------------------------------------------------------------------------|
                      1. |    890                                              3 D MITA SL 28500   1393146                       3 D MITA INGENIERIA SL 28500   .97699462        1 |
                      2. |    890                                              3 D MITA SL 28500   1024644                                  VICASA 3 SL 08503   .96925177        2 |
                      3. |    890                                              3 D MITA SL 28500   1070350                               TRIOMA 2003 SL 08500   .96843517        3 |
                         |---------------------------------------------------------------------------------------------------------------------------------------------------------|
                      4. |   3485                                    3OTP AUTENTICACION SL 28006   1083254                             3OR SOLUTIONS SL 08007   .99998281        1 |
                      5. |   3485                                    3OTP AUTENTICACION SL 28006   1570119                                     3OFUS SL 28004   .99998267        2 |
                      6. |   3485                                    3OTP AUTENTICACION SL 28006    812233                     3OES LEADING SOFTWARE SL 47151    .9999795        3 |
                         |---------------------------------------------------------------------------------------------------------------------------------------------------------|
                      7. |   4280                                 3P BIOPHARMACEUTICALS SL 31110    382581                           3POLS DECORACIO SL 17181   .99993066        1 |
                      8. |   4280                                 3P BIOPHARMACEUTICALS SL 31110    295838                            ABAST PRIME 3P SL 12005   .99992929        2 |
                      9. |   4280                                 3P BIOPHARMACEUTICALS SL 31110   1191511                      3P PLAST COMPONENTES SL 08211   .99992889        3 |
                         |---------------------------------------------------------------------------------------------------------------------------------------------------------|
                     10. |   5835                                               3DITIZE SL 20018   1165185                        3D TECH OMEGA ZETA SL 08015   .99829191        1 |
                     11. |   5835                                               3DITIZE SL 20018     43822                                        3D SA 28007    .9973368        2 |
                     12. |   5835                                               3DITIZE SL 20018   1623464                                 DIDENT 3D SL 41018   .99733053        3 |
                         |---------------------------------------------------------------------------------------------------------------------------------------------------------|
                     13. |   6541                          A INGENIERIA DE AUTOMATISMOS SA 48213    308139   INGENIERIA INSTALACIONES Y AUTOMATISMOS SL 13600   .87284637        1 |
                     14. |   6541                          A INGENIERIA DE AUTOMATISMOS SA 48213    964105                       RASESA AUTOMATISMOS SL 08203   .85022017        2 |
                     15. |   6541                          A INGENIERIA DE AUTOMATISMOS SA 48213     18583                         AUTOMATISMOS FOR S A 08210   .84712085        3 |
                         |---------------------------------------------------------------------------------------------------------------------------------------------------------|
                     16. |   6701          CONSEJO SUPERIOR DE INVESTIGACIONES CIENT FICAS 28006   1437908                         INVARATO INAREJOS SL 28003   .85200701        1 |
                     17. |   6701          CONSEJO SUPERIOR DE INVESTIGACIONES CIENT FICAS 28006   1284066         INVERSIONES FINANCIERAS CORRALEJO SL 35003   .85186328        2 |
                     18. |   6701          CONSEJO SUPERIOR DE INVESTIGACIONES CIENT FICAS 28006    964402                                     TEJON SL 08006   .84872099        3 |
                         |---------------------------------------------------------------------------------------------------------------------------------------------------------|
                     19. |  11608                          A3 ADVANCED AUTOMOTIVE ANTENNAS 08100    872066             GA3 GLOBAL ADVISING ALICANTE SLL 03007    .9995317        1 |
                     20. |  11608                          A3 ADVANCED AUTOMOTIVE ANTENNAS 08100   1148598                         INSTALA3 SOLUTION SL 08041   .99716204        2 |
                     21. |  11608                          A3 ADVANCED AUTOMOTIVE ANTENNAS 08100    133772                                 CA3 ALAVA SL 01015   .99716011        3 |
                         |---------------------------------------------------------------------------------------------------------------------------------------------------------|
                     22. |  13806   AB BIOTICS PRODUCCIONES INDUSTRIALES DE MICROBIOTAS SL 08290   1274065                                 ASECS LAB SL 20018   .84699053        1 |
                     23. |  13806   AB BIOTICS PRODUCCIONES INDUSTRIALES DE MICROBIOTAS SL 08290     93543                                AB BIOTICS SA 08172   .83242995        2 |
                     24. |  13806   AB BIOTICS PRODUCCIONES INDUSTRIALES DE MICROBIOTAS SL 08290   1227927                           MARAB LOGISTICS SL 11370   .81571461        3 |
                         |---------------------------------------------------------------------------------------------------------------------------------------------------------|
                     25. |  14850                                        ABB POWER TECH SA 28037     29597                            PROMOWEBB TEAM SA 18210   .99137045        1 |
                     26. |  14850                                        ABB POWER TECH SA 28037    926845                             BUBBLE BLOWER SL 43206   .98151741        2 |
                     27. |  14850                                        ABB POWER TECH SA 28037   1092533                          WEBB REAL ESTATE SL 08005   .95360317        3 |
                         |---------------------------------------------------------------------------------------------------------------------------------------------------------|
                     28. |  15754                                     ABENGOA SOLAR NEW TECH SA     632902             NEW ASSISTANT FOR COACH SPORT SL 35008   .98032204        1 |
                     29. |  15754                                     ABENGOA SOLAR NEW TECH SA    1034203                             STAR NEW TECH SL 08014   .97949458        2 |
                     30. |  15754                                     ABENGOA SOLAR NEW TECH SA    1508260                                  NEW GOAT SL 28050   .97655771        3 |
                         |---------------------------------------------------------------------------------------------------------------------------------------------------------|
                     31. |  16081                                            AB BIOTICS SA 08193   1274065                                 ASECS LAB SL 20018   .90590783        1 |
                     32. |  16081                                            AB BIOTICS SA 08193     93543                                AB BIOTICS SA 08172    .8882523        2 |
                     33. |  16081                                            AB BIOTICS SA 08193   1227927                           MARAB LOGISTICS SL 11370   .87193043        3 |
                         |---------------------------------------------------------------------------------------------------------------------------------------------------------|
                     34. |  16926                         ABENGOA SOLAR NEW TECNOLOGIES SA 41014   1508260                                  NEW GOAT SL 28050   .97646226        1 |
                     35. |  16926                         ABENGOA SOLAR NEW TECNOLOGIES SA 41014   1174070                                OASIS VIEW SL 08008   .97578646        2 |
                     36. |  16926                         ABENGOA SOLAR NEW TECNOLOGIES SA 41014   1590614                                 NEW NONIS SL 41400    .9663947        3 |
                         +---------------------------------------------------------------------------------------------------------------------------------------------------------+

                    Comment


                    • #11
                      Hi Rivera,
                      I just found out that you raised the question of matching patent information and financial information. If you use the function of stata matchit, I must say that you still need a lot of judgment afterwards. Taking TSMC in Taiwan as an example, The names of the patent companies listed below are all TSMC. However, many of the match results are less than 0.95. It is understood that some company names are intentional errors in company strategy, making people mistakenly think that they are patents of other companies, so they must be used with great care. Thanks!
                      2330 Taiwan Semiconductor Manufacturing Co., Ltd. SMCCorporation 0.816497
                      2330 Taiwan Semiconductor Manufacturing Co., Ltd. TAIWAN SEIMICONDUCTOR MANUFACTURING COMPANY 0.957841
                      2330 Taiwan Semiconductor Manufacturing Co., Ltd. Taiwan Semicondcutor Manufacturing Company, Ltd. 0.912159
                      2330 Taiwan Semiconductor Manufacturing Co., Ltd. Taiwan Semiconductar Manufacturing Campany 0.87831
                      2330 Taiwan Semiconductor Manufacturing Co., Ltd. Taiwan Semiconductor Manufactured Company, Ltd. 0.927634
                      2330 Taiwan Semiconductor Manufacturing Co., Ltd. Taiwan Semiconductor Manufacturing Company, Ltd., ("TSMC") 0.874083
                      2330 Taiwan Semiconductor Manufacturing Co., Ltd. TAIWAN SEMICONDUCTOR MANUFACTURING COMPANYLIMITED 1
                      2330 Taiwan Semiconductor Manufacturing Co., Ltd. TAIWAN SEMICONDUCTOR MTAIWANANUFACTURING CO., LTD. 0.928727
                      2330 Taiwan Semiconductor Manufacturing Co., Ltd. Taiwan Semiconductoring Manufacturing Company 0.942847
                      2330 Taiwan Semiconductor Manufacturing Co., Ltd. Taiwan Semincondutor Manufacturing Co. 0.914286
                      2330 Taiwan Semiconductor Manufacturing Co., Ltd. Tawian Semiconductor Manufacturing Company, Ltd. 0.914286
                      2330 Taiwan Semiconductor Manufacturing Co., Ltd. Tawiwan Semiconductor Manufacturing Co., Ltd. 0.957841
                      2330 Taiwan Semiconductor Manufacturing Co., Ltd. Tiawan Semiconductor Manufacturing Company Limited 0.914286
                      2330 Taiwan Semiconductor Manufacturing Co., Ltd. TSM Corporation 0.816497
                      2330 Taiwan Semiconductor Manufacturing Co., Ltd. TWAIWAN SEMICONDUCTOR MANUFACTORING COMPANY, LTD. 0.912871

                      Comment


                      • #12
                        Hi Rivera,
                        I just found out that you raised the question of matching patent information and financial information. If you use the function of stata matchit, I must say that you still need a lot of judgment afterwards. Taking TSMC in Taiwan as an example, The names of the patent companies listed below are all TSMC. However, many of the match results are less than 0.95. It is understood that some company names are intentional errors in company strategy, making people mistakenly think that they are patents of other companies, so they must be used with great care. Thanks!
                        2330 Taiwan Semiconductor Manufacturing Co., Ltd. SMCCorporation 0.816497
                        2330 Taiwan Semiconductor Manufacturing Co., Ltd. TAIWAN SEIMICONDUCTOR MANUFACTURING COMPANY 0.957841
                        2330 Taiwan Semiconductor Manufacturing Co., Ltd. Taiwan Semicondcutor Manufacturing Company, Ltd. 0.912159
                        2330 Taiwan Semiconductor Manufacturing Co., Ltd. Taiwan Semiconductar Manufacturing Campany 0.87831
                        2330 Taiwan Semiconductor Manufacturing Co., Ltd. Taiwan Semiconductor Manufactured Company, Ltd. 0.927634
                        2330 Taiwan Semiconductor Manufacturing Co., Ltd. Taiwan Semiconductor Manufacturing Company, Ltd., ("TSMC") 0.874083
                        2330 Taiwan Semiconductor Manufacturing Co., Ltd. TAIWAN SEMICONDUCTOR MANUFACTURING COMPANYLIMITED 1
                        2330 Taiwan Semiconductor Manufacturing Co., Ltd. TAIWAN SEMICONDUCTOR MTAIWANANUFACTURING CO., LTD. 0.928727
                        2330 Taiwan Semiconductor Manufacturing Co., Ltd. Taiwan Semiconductoring Manufacturing Company 0.942847
                        2330 Taiwan Semiconductor Manufacturing Co., Ltd. Taiwan Semincondutor Manufacturing Co. 0.914286
                        2330 Taiwan Semiconductor Manufacturing Co., Ltd. Tawian Semiconductor Manufacturing Company, Ltd. 0.914286
                        2330 Taiwan Semiconductor Manufacturing Co., Ltd. Tawiwan Semiconductor Manufacturing Co., Ltd. 0.957841
                        2330 Taiwan Semiconductor Manufacturing Co., Ltd. Tiawan Semiconductor Manufacturing Company Limited 0.914286
                        2330 Taiwan Semiconductor Manufacturing Co., Ltd. TSM Corporation 0.816497
                        2330 Taiwan Semiconductor Manufacturing Co., Ltd. TWAIWAN SEMICONDUCTOR MANUFACTORING COMPANY, LTD. 0.912871

                        Comment

                        Working...
                        X