String comparison within same variable

Federico Nutarelli

Join Date: Sep 2018
Posts: 430

17 Feb 2023, 09:55

Hi all,

I have a variable in a dataset with the name of some firms like that:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str142 doc_std_name
"SEEO INC"                               
"BOSCH GMBH ROBERT"                      
"SAMSUNG SDI CO LTD"                     
"NAGAI TAKAYUKI"                         
"WESTPORT POWER INC"                     
"SAMSUNG ELECTRONICS CO LTD"             
"SATO TOSHIO"                            
"SUMITOMO ELECTRIC INDUSTRIES"           
"TOSHIBA KK"                             
"TEIKOKU SEIYAKU KK"                     
"MITSUBISHI ELECTRIC CORP"               
"IHI CORP"                               
"WEI XI"                                 
"SIEMENS AG"                             
"HYUNDAI MOTOR CO LTD"                   
"COOPER TECHNOLOGIES CO"                 
"TSUI CHENG-WEN"                         
"UCHICAGO ARGONNE LLC"                   
"BAYERISCHE MOTOREN WERKE AG"            
"BAYERISCHE MOTOREN WERKE AG"            
"YANAGIDA EIJI"                          
"MINEBEA CO LTD"                         
"CATERPILLAR INC"                        
"LENOVO SINGAPORE PTE LTD"               
"FLORIDA TURBINE TECH INC"               
"SIEMENS AG"                             
"TOYOTA MOTOR CO LTD"                    
"NTT DOCOMO INC"                         
"YANG JUN-HO"                            
"GEN ELECTRIC"                           
"UP RIGHT DESIGNS LLC"                   
"CATERPILLAR INC"                        
"CONTINENTAL AUTOMOTIVE GMBH"            
"GM GLOBAL TECH OPERATIONS LLC"          
"SIEMENS AG"                             
"WIDEGREN HANS"                          
"CPC CORP TAIWAN"                        
"SHANGHAI TIANMA MICRO ELECT CO"         
"SAMSUNG ELECTRONICS CO LTD"             
"TOHOKU TECHNO ARCH CO LTD"              
"FORD GLOBAL TECH LLC"                   
"MEDIATEK INC"                           
"BELL SPORTS INC"                        
"MCI MIRROR CONTROLS INT NETHERLANDS B V"
"VOLKSWAGEN AG"                          
"BAYERISCHE MOTOREN WERKE AG"            
"SIEMENS ENERGY INC"                     
"INVENTEC CORP"                          
"HUSQVARNA AB"                           
"AIR LIQUIDE"                            
"JAEGER ERICH GMBH & CO KG"              
"TOSHIBA KK"                                                       
"SAMSUNG LIMITED"                   
"IBM"                                    
"MAXWELL TECHNOLOGIES INC"               
"BAYERISCHE MOTOREN WERKE AG"                     
"FANUC CORP"                             
"GM GLOBAL TECH OPERATIONS LLC"          
"MEDIATEK INC"                           
"SEKISUI CHEMICAL CO LTD"                
"KISHIOKA TAKAHIRO"                      
"EVONIK DEGUSSA GMBH"                    
"ARAMCO SERVICES CO"                     
"WESTERN DIGITAL TECH INC"               
"VOLKSWAGEN AG"                          
"AIRBUS OPERATIONS GMBH"                 
"UNITED TECHNOLOGIES CORP"               
"GEELY HOLDING GROUP CO LTD"             
"3M INNOVATIVE PROPERTIES CO"            
"BAYERISCHE MOTOREN WERKE AG"            
"DAIMLER AG"                             
"SAMSUNG SDI CO LTD"                     
"SAMSUNG ELECTRONICS CO LTD"             
"GEN ELECTRIC"                           
"SARPERI LUCIANO PIETRO GIACOMO"         
"MTU AERO ENGINES GMBH"                  
"AMAZON TECH INC"                        
"KYOCERA CORP"                           
"MURAMATSU KENJI"                        
"STURMAN ODED EDDIE"                     
"SHARP KK"                               
"TRANSOCEAN SEDCO FOREX VENTURES LTD"    
"CANON KK"                               
"KIM JEONGWOOK"                          
"NOVALED AG"                             
"ERICSSON TELEFON AB L M (PUBL)"         
"WESTERN DIGITAL TECH INC"               
"LANDMARK GRAPHICS CORP"                 
"DAIMLER AG"                             
"SUZUKI MOTOR CORP"                      
"ST MICROELECTRONICS ASIA"               
"DELL PRODUCTS LP"                       
"CRYOVAC INC"                            
"DANA HEAVY VEHICLE SYS GROUP"           
"INIS BIOTECH LLC"                       
"SARPERI LUCIANO PIETRO GIACOMO"         
"MITUTOYO CORP"                          
"NEC LAB AMERICA INC"                    
end

The full list contains 175,000 firm names, but many of them are the same firm under different names (e.g. "SAMSUNG SDI CO LTD", "SAMSUNG ELECTRONICS CO LTD", "SAMSUNG LIMITED", "SAMSUNG CO LIMITED", ...).
I would like a way to put all the names of a firm under one unique name that does not take too long to run (e.g. at most one day). I tried something similar in Python, but my algorithm compares every possible pair of names and collects the matches in a single list. It takes forever to run because the number of pairwise comparisons is huge (roughly 175,000 × 174,999 / 2, or about 15 billion). I was wondering whether Stata provides a tool that does this in an already optimized way.

Thank you

Tags: string, Suggestion, syntax

Daniel Shin

Join Date: Mar 2020

Posts: 146
#2

17 Feb 2023, 10:59

I think I just replied to a similar post. You could start with grouping observations by the first word of the name, but you will need to double check carefully.

Code:

gen group=word(lower(doc_std_name),1)

For example, "GEN ELECTRIC" will not be grouped with "GENERAL ELECTRIC" under this method. Are there other variables that you could use to find similar names?
Federico Nutarelli

Join Date: Sep 2018

Posts: 430
#3

18 Feb 2023, 03:51

Daniel Shin thanks a lot. Unfortunately, there are not. Maybe I could try starting with your approach and let you know how it goes.
Daniel Feenberg

Join Date: Oct 2014

Posts: 334
#4

18 Feb 2023, 09:44

You could start by replacing " CO ", " LTD ", " LIMITED ", " AG " and " COMPANY " with blanks. That is unlikely to cause false matches, and it covers most of the problems. As for matching "GM" with "General Motors", that will require handwork, unless you can find a dataset where someone has already done it. Listing all the matches on the first word and fixing them by hand may be feasible. You will never achieve perfection.
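
A minimal sketch of that replacement step, assuming Stata 14 or later (for strtrim() and stritrim()). The variable name clean_name and the exact token list are illustrative choices, not something taken from the original data. Each name is first padded with a leading and a trailing blank, because a pattern such as " LTD " (with a blank on each side) would otherwise miss a token at the very end of a name, as in "SAMSUNG SDI CO LTD".

```stata
* Sketch only: clean_name and the token list are hypothetical choices
* Pad with blanks so tokens at the start or end of a name are also matched
gen clean_name = " " + upper(strtrim(doc_std_name)) + " "

* Remove each legal-form token where it appears as a whole word
foreach tok in CO LTD LIMITED AG COMPANY {
    replace clean_name = subinstr(clean_name, " `tok' ", " ", .)
}

* Collapse repeated blanks and drop the padding
replace clean_name = strtrim(stritrim(clean_name))
```

With the example data this turns "SIEMENS AG" into "SIEMENS" and "SAMSUNG SDI CO LTD" into "SAMSUNG SDI". Names such as "SAMSUNG SDI" and "SAMSUNG ELECTRONICS" would therefore still have to be reconciled by other means, for instance by grouping on the first word. One caveat: a token repeated back to back (e.g. "CO CO") is removed only once per pass, since the two matches would share a blank.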
Daniel Shin

Join Date: Mar 2020

Posts: 146
#5

18 Feb 2023, 09:56

One thing you could do is focus your attention on companies whose names vary. Using your dataset above:

Code:

duplicates drop
gen group=word(lower(doc_std_name),1)
egen varcount=count(doc_std_name), by(group)
bro if varcount!=1