Hi all,
I have a variable in a dataset containing firm names, as in the example below.
The list contains 175,000 firm names, but many of them refer to the same firm under different names (e.g. "SAMSUNG SDI CO LTD", "SAMSUNG ELECTRONICS CO LTD", "SAMSUNG LIMITED", "SAMSUNG CO LIMITED", ...).
Code:
* Example generated by -dataex-. For more info, type help dataex
clear
input str142 doc_std_name
"SEEO INC"
"BOSCH GMBH ROBERT"
"SAMSUNG SDI CO LTD"
"NAGAI TAKAYUKI"
"WESTPORT POWER INC"
"SAMSUNG ELECTRONICS CO LTD"
"SATO TOSHIO"
"SUMITOMO ELECTRIC INDUSTRIES"
"TOSHIBA KK"
"TEIKOKU SEIYAKU KK"
"MITSUBISHI ELECTRIC CORP"
"IHI CORP"
"WEI XI"
"SIEMENS AG"
"HYUNDAI MOTOR CO LTD"
"COOPER TECHNOLOGIES CO"
"TSUI CHENG-WEN"
"UCHICAGO ARGONNE LLC"
"BAYERISCHE MOTOREN WERKE AG"
"BAYERISCHE MOTOREN WERKE AG"
"YANAGIDA EIJI"
"MINEBEA CO LTD"
"CATERPILLAR INC"
"LENOVO SINGAPORE PTE LTD"
"FLORIDA TURBINE TECH INC"
"SIEMENS AG"
"TOYOTA MOTOR CO LTD"
"NTT DOCOMO INC"
"YANG JUN-HO"
"GEN ELECTRIC"
"UP RIGHT DESIGNS LLC"
"CATERPILLAR INC"
"CONTINENTAL AUTOMOTIVE GMBH"
"GM GLOBAL TECH OPERATIONS LLC"
"SIEMENS AG"
"WIDEGREN HANS"
"CPC CORP TAIWAN"
"SHANGHAI TIANMA MICRO ELECT CO"
"SAMSUNG ELECTRONICS CO LTD"
"TOHOKU TECHNO ARCH CO LTD"
"FORD GLOBAL TECH LLC"
"MEDIATEK INC"
"BELL SPORTS INC"
"MCI MIRROR CONTROLS INT NETHERLANDS B V"
"VOLKSWAGEN AG"
"BAYERISCHE MOTOREN WERKE AG"
"SIEMENS ENERGY INC"
"INVENTEC CORP"
"HUSQVARNA AB"
"AIR LIQUIDE"
"JAEGER ERICH GMBH & CO KG"
"TOSHIBA KK"
"SAMSUNG LIMITED"
"IBM"
"MAXWELL TECHNOLOGIES INC"
"BAYERISCHE MOTOREN WERKE AG"
"FANUC CORP"
"GM GLOBAL TECH OPERATIONS LLC"
"MEDIATEK INC"
"SEKISUI CHEMICAL CO LTD"
"KISHIOKA TAKAHIRO"
"EVONIK DEGUSSA GMBH"
"ARAMCO SERVICES CO"
"WESTERN DIGITAL TECH INC"
"VOLKSWAGEN AG"
"AIRBUS OPERATIONS GMBH"
"UNITED TECHNOLOGIES CORP"
"GEELY HOLDING GROUP CO LTD"
"3M INNOVATIVE PROPERTIES CO"
"BAYERISCHE MOTOREN WERKE AG"
"DAIMLER AG"
"SAMSUNG SDI CO LTD"
"SAMSUNG ELECTRONICS CO LTD"
"GEN ELECTRIC"
"SARPERI LUCIANO PIETRO GIACOMO"
"MTU AERO ENGINES GMBH"
"AMAZON TECH INC"
"KYOCERA CORP"
"MURAMATSU KENJI"
"STURMAN ODED EDDIE"
"SHARP KK"
"TRANSOCEAN SEDCO FOREX VENTURES LTD"
"CANON KK"
"KIM JEONGWOOK"
"NOVALED AG"
"ERICSSON TELEFON AB L M (PUBL)"
"WESTERN DIGITAL TECH INC"
"LANDMARK GRAPHICS CORP"
"DAIMLER AG"
"SUZUKI MOTOR CORP"
"ST MICROELECTRONICS ASIA"
"DELL PRODUCTS LP"
"CRYOVAC INC"
"DANA HEAVY VEHICLE SYS GROUP"
"INIS BIOTECH LLC"
"SARPERI LUCIANO PIETRO GIACOMO"
"MITUTOYO CORP"
"NEC LAB AMERICA INC"
end
What I would like is a reasonably fast way (e.g. at most one day of run time) to put all the name variants of a firm under a single name. I tried something similar in Python, but the algorithm I used compares every possible pair of names and collects the matches into a single list. It takes forever to run because the number of comparisons grows quadratically with the number of names (roughly 15 billion pairs for 175,000 names). I was wondering whether Stata provides a tool that does this in an already optimized way.
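To make the goal concrete, here is a minimal Python sketch of the kind of grouping I have in mind. It is not the algorithm I ran. It assumes that dropping a few legal-form tokens and keeping the first remaining word is enough to identify a firm, which is certainly too crude for the real data. The token list and the function names `blocking_key` and `group_names` are made up for illustration. The point is only that each name is mapped to a key in a single pass, so no pair of names is ever compared directly.

```python
from collections import defaultdict

# Hypothetical, incomplete list of legal-form tokens to ignore when building a key.
LEGAL_TOKENS = {"CO", "LTD", "LIMITED", "INC", "CORP", "AG",
                "GMBH", "LLC", "KK", "AB", "LP"}


def blocking_key(name):
    """Return the first token that is not a legal-form token.

    Assumption: that token identifies the firm. Falls back to the
    whole upper-cased name if every token is a legal-form token.
    """
    tokens = name.upper().split()
    core = [t for t in tokens if t not in LEGAL_TOKENS]
    return core[0] if core else name.upper()


def group_names(names):
    """Group name variants by key in one pass over the list."""
    groups = defaultdict(set)
    for name in names:
        groups[blocking_key(name)].add(name)
    return groups


sample = [
    "SAMSUNG SDI CO LTD",
    "SAMSUNG ELECTRONICS CO LTD",
    "SAMSUNG LIMITED",
    "SIEMENS AG",
    "TOSHIBA KK",
]
groups = group_names(sample)
for key, variants in sorted(groups.items()):
    print(key, sorted(variants))
```

A key like this would also merge names that may not belong together (for instance "SIEMENS AG" and "SIEMENS ENERGY INC", or two people with the same family name), so any fuzzy comparison would presumably still be needed, but only within each group rather than across all pairs.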
Thank you
