  • String comparison within same variable

    Hi all,

    I have a variable in a dataset with the names of some firms, like this:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str142 doc_std_name
    "SEEO INC"                               
    "BOSCH GMBH ROBERT"                      
    "SAMSUNG SDI CO LTD"                     
    "NAGAI TAKAYUKI"                         
    "WESTPORT POWER INC"                     
    "SAMSUNG ELECTRONICS CO LTD"             
    "SATO TOSHIO"                            
    "SUMITOMO ELECTRIC INDUSTRIES"           
    "TOSHIBA KK"                             
    "TEIKOKU SEIYAKU KK"                     
    "MITSUBISHI ELECTRIC CORP"               
    "IHI CORP"                               
    "WEI XI"                                 
    "SIEMENS AG"                             
    "HYUNDAI MOTOR CO LTD"                   
    "COOPER TECHNOLOGIES CO"                 
    "TSUI CHENG-WEN"                         
    "UCHICAGO ARGONNE LLC"                   
    "BAYERISCHE MOTOREN WERKE AG"            
    "BAYERISCHE MOTOREN WERKE AG"            
    "YANAGIDA EIJI"                          
    "MINEBEA CO LTD"                         
    "CATERPILLAR INC"                        
    "LENOVO SINGAPORE PTE LTD"               
    "FLORIDA TURBINE TECH INC"               
    "SIEMENS AG"                             
    "TOYOTA MOTOR CO LTD"                    
    "NTT DOCOMO INC"                         
    "YANG JUN-HO"                            
    "GEN ELECTRIC"                           
    "UP RIGHT DESIGNS LLC"                   
    "CATERPILLAR INC"                        
    "CONTINENTAL AUTOMOTIVE GMBH"            
    "GM GLOBAL TECH OPERATIONS LLC"          
    "SIEMENS AG"                             
    "WIDEGREN HANS"                          
    "CPC CORP TAIWAN"                        
    "SHANGHAI TIANMA MICRO ELECT CO"         
    "SAMSUNG ELECTRONICS CO LTD"             
    "TOHOKU TECHNO ARCH CO LTD"              
    "FORD GLOBAL TECH LLC"                   
    "MEDIATEK INC"                           
    "BELL SPORTS INC"                        
    "MCI MIRROR CONTROLS INT NETHERLANDS B V"
    "VOLKSWAGEN AG"                          
    "BAYERISCHE MOTOREN WERKE AG"            
    "SIEMENS ENERGY INC"                     
    "INVENTEC CORP"                          
    "HUSQVARNA AB"                           
    "AIR LIQUIDE"                            
    "JAEGER ERICH GMBH & CO KG"              
    "TOSHIBA KK"                                                       
    "SAMSUNG LIMITED"                   
    "IBM"                                    
    "MAXWELL TECHNOLOGIES INC"               
    "BAYERISCHE MOTOREN WERKE AG"                     
    "FANUC CORP"                             
    "GM GLOBAL TECH OPERATIONS LLC"          
    "MEDIATEK INC"                           
    "SEKISUI CHEMICAL CO LTD"                
    "KISHIOKA TAKAHIRO"                      
    "EVONIK DEGUSSA GMBH"                    
    "ARAMCO SERVICES CO"                     
    "WESTERN DIGITAL TECH INC"               
    "VOLKSWAGEN AG"                          
    "AIRBUS OPERATIONS GMBH"                 
    "UNITED TECHNOLOGIES CORP"               
    "GEELY HOLDING GROUP CO LTD"             
    "3M INNOVATIVE PROPERTIES CO"            
    "BAYERISCHE MOTOREN WERKE AG"            
    "DAIMLER AG"                             
    "SAMSUNG SDI CO LTD"                     
    "SAMSUNG ELECTRONICS CO LTD"             
    "GEN ELECTRIC"                           
    "SARPERI LUCIANO PIETRO GIACOMO"         
    "MTU AERO ENGINES GMBH"                  
    "AMAZON TECH INC"                        
    "KYOCERA CORP"                           
    "MURAMATSU KENJI"                        
    "STURMAN ODED EDDIE"                     
    "SHARP KK"                               
    "TRANSOCEAN SEDCO FOREX VENTURES LTD"    
    "CANON KK"                               
    "KIM JEONGWOOK"                          
    "NOVALED AG"                             
    "ERICSSON TELEFON AB L M (PUBL)"         
    "WESTERN DIGITAL TECH INC"               
    "LANDMARK GRAPHICS CORP"                 
    "DAIMLER AG"                             
    "SUZUKI MOTOR CORP"                      
    "ST MICROELECTRONICS ASIA"               
    "DELL PRODUCTS LP"                       
    "CRYOVAC INC"                            
    "DANA HEAVY VEHICLE SYS GROUP"           
    "INIS BIOTECH LLC"                       
    "SARPERI LUCIANO PIETRO GIACOMO"         
    "MITUTOYO CORP"                          
    "NEC LAB AMERICA INC"                    
    end
    The full list has 175000 firms, but many of them are the same firm under different names (e.g. above "SAMSUNG SDI CO LTD", "SAMSUNG ELECTRONICS CO LTD", "SAMSUNG LIMITED", "SAMSUNG CO LIMITED"...).
    I would like a way that does not take too long (e.g. at most one day) to put all the names of a firm under a unique name. I tried something similar in Python, but my algorithm compares every possible pair of firms and collects the matches into one list. It takes forever to run because the number of pairs is huge (roughly 15 billion for 175000 names). I was wondering whether Stata provides a tool that does this in an already optimized way.

    Thank you

  • #2
    I think I just replied to a similar post. You could start by grouping observations on the first word of the name, but you will need to double-check the results carefully.

    Code:
    gen group=word(lower(doc_std_name),1)
    For example, "GEN ELECTRIC" will not be grouped with "GENERAL ELECTRIC" using this method. Are there other variables that you can use to identify similar firms?
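
    A minimal sketch of how first-word grouping also cuts down the pairwise comparisons mentioned in the question: names are only paired with other names in the same block, instead of with every name in the list. The variable names block and other_name and the tempfile names are hypothetical, and joinby is just one way to form the pairs.

    ```stata
    * Sketch only: build candidate pairs within first-word blocks.
    * block, other_name and the tempfile names are illustrative names.
    preserve
    keep doc_std_name
    duplicates drop                              // one row per distinct spelling
    gen block = word(lower(doc_std_name), 1)     // blocking key: first word
    tempfile names
    save `names'
    rename doc_std_name other_name
    joinby block using `names'                   // all pairs within each block
    keep if doc_std_name < other_name            // keep each unordered pair once
    count                                        // number of candidate pairs
    list block doc_std_name other_name, sepby(block)
    restore
    ```

    On the full data, browse or export the pairs rather than listing them. A block that contains very many names still produces many pairs, so the saving depends on how evenly the first words are spread.
    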



    • #3
      Daniel Shin thanks a lot. Unfortunately, there are not. Maybe I could try starting with your approach and let you know how it goes.



      • #4
        You could start by replacing " CO ", " LTD ", " LIMITED ", " AG " and " COMPANY " with blanks. That is unlikely to cause false matches, and covers most of the problems. As for matching "GM" with "General Motors", that will require handwork, unless you can find a dataset where someone has already done the handwork. Listing all the matches on the first word and fixing them by hand may be feasible. You will never achieve perfection.
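
        A minimal sketch of that replacement step, assuming the names are in doc_std_name as in the example data. The variable clean_name is a hypothetical name, and the token list is only the one given above, not a complete list of legal forms.

        ```stata
        * Sketch only: strip common legal-form tokens from the names.
        * Pad with blanks so tokens at the start or end of a name also match.
        gen clean_name = " " + upper(strtrim(doc_std_name)) + " "
        foreach tok in "CO" "LTD" "LIMITED" "AG" "COMPANY" {
            * " `tok' " matches whole words only, e.g. not the CO in "COOPER"
            replace clean_name = subinstr(clean_name, " `tok' ", " ", .)
        }
        * Remove the padding and collapse repeated internal blanks
        replace clean_name = strtrim(stritrim(clean_name))
        ```

        With this, "SAMSUNG SDI CO LTD" becomes "SAMSUNG SDI" and "SAMSUNG LIMITED" becomes "SAMSUNG". strtrim() and stritrim() need Stata 14 or later; older versions have trim() and itrim().
        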



        • #5
          One thing you could do is focus your attention on companies with variations in their names. Using your dataset above:

          Code:
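          * keep one observation per distinct name
          duplicates drop
          * group on the first word of the name
          gen group=word(lower(doc_std_name),1)
          * number of distinct names sharing that first word
          egen varcount=count(doc_std_name), by(group)
          * inspect only the groups with more than one spelling
          browse if varcount!=1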
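
          Once the groups have been checked, one possible next step is to give every name in a group a provisional unique name, for example the most frequent spelling in that group. This is only a sketch: first_word, name_freq and provisional_name are hypothetical variable names, and it should be run on the data before duplicates drop so that the frequencies mean something.

          ```stata
          * Sketch only: label each first-word group with its most frequent spelling.
          gen first_word = word(lower(doc_std_name), 1)
          * how often each exact spelling occurs within its group
          bysort first_word doc_std_name: gen long name_freq = _N
          * last row after sorting by frequency = most frequent spelling
          * (ties are broken alphabetically by the name itself)
          bysort first_word (name_freq doc_std_name): gen provisional_name = doc_std_name[_N]
          ```

          First-word groups can contain different firms (for example "SIEMENS AG" and "SIEMENS ENERGY INC" share a first word), so the label still needs checking by hand.
          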

