  • How to identify whether two firms' names are partially the same

    Hello!

    I'm working with a dataset in which each ID number is associated with two firms. The firms have different tickers, so they are not the same firm, but some pairs share part of their name, which suggests they may have the same parent company. For example, in ZHEJIANG WASU BROADCAST & TV NETWORK CO., LTD vs WASU MEDIA HOLDING CO., LTD, the shared WASU indicates a common parent. I have several hundred pairs of firms, and I wonder whether Stata can help me identify pairs whose names are partially the same.

    Some examples:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long ID str189 firm1name str135 firm2name
    1907119766 "CPT WYNDHAM HOLDINGS LTD"                                             "SHENZHEN ENERGY GROUP CO., LTD"                                   
    1943161276 "CHINA ELECTRONIC SYSTEM TECHNOLOGY CO., LTD"                          "SHENZHEN SED INDUSTRY CO., LTD"                                   
    1907214857 "SUNVISION ECOLOGICAL CONSTRUCTION INVESTMENT (SUZHOU) CO., LTD"       "TUNGHSU AZURE RENEWABLE ENERGY CO., LTD"                          
    1907127424 "CHINA-UNION CENTURY ARCHITECTURE DESIGN CO., LTD"                     "SHENZHEN HUAKONG SEG CO., LTD"                                    
    1601207170 "THREE JAMAICAN GOVERNMENT-OWNED SUGAR FACTORIES"                      "CHINA NATIONAL COMPLETE PLANT IMPORT & EXPORT CORPORATION LTD"    
    1907036848 "CHENGDU PUSH PHARMACEUTICAL CO., LTD"                                 "ANHUI FENGYUAN PHARMACEUTICAL CO., LTD"                           
    1943140202 "ZHEJIANG WASU BROADCAST & TV NETWORK CO., LTD"                        "WASU MEDIA HOLDING CO., LTD"                                      
    1943186573 "CLP EQUIPMENT SHANDONG ELECTRONICS CO., LTD"                          "XJ ELECTRIC CO., LTD"                                             
    1943013306 "CHONGQING SHENGBANG GAS CO., LTD"                                     "SHANDONG SHENGLI CO., LTD"                                        
    1943189095 "BEIDOU TIANDI CO., LTD"                                               "SHANDONG GEO-MINERAL CO., LTD"                                    
    1907057630 "BEIJING MINSHENG PAWNBROKING CO., LTD"                                "MINSHENG HOLDINGS CO., LTD"                                       
    1907066201 "TANGHE TIMES MINING INDUSTRY CO., LTD"                                "INNER MONGOLIA XINGYE MINING CO., LTD"                            
    1943084140 "KUNMING DONGCHUAN TONGDU MINING CO., LTD"                             "INNER MONGOLIA XINGYE MINING CO., LTD"                            
    1943150707 "SHANDONG HIGH-SPEED QILU CONSTRUCTION GROUP CO., LTD"                 "SHANDONG HI-SPEED ROAD & BRIDGE CO., LTD"                         
    1907191820 "GUANGXI INTERCONTINENTAL FORESTRY INVESTMENT CO., LTD"                "JIANGSU SIHUAN BIOENGINEERING CO., LTD"                           
    1907042699 "GUANGZHOU LINGNAN INTERNATIONAL HOTEL MANAGEMENT CO., LTD"            "GUANGZHOU DONGFANG HOTEL CO., LTD"                                
    1943059412 "HENAN FIFTH URBAN AND RURAL DEVELOPMENT CO., LTD"                     "CENTRAL PLAINS ENVIRONMENT PROTECTION CO., LTD"                   
    1907163880 "JIANGXI XINJINYE INDUSTRIAL CO., LTD"                                 "JINYUAN CEMENT CO., LTD"                                          
    1907013698 "BEIJING ALLDAY SCIENCE AND TECHNOLOGY CO., LTD"                       "CHINA SCHOLARS GROUP CO., LTD"                                    
    1907102324 "NANJING CHANGFENG AEROSPACE ELECTRONICS TECHNOLOGY CO., LTD"          "CHINA SCHOLARS GROUP CO., LTD"                                    
    1941490506 "SHANGHAI BAIF TECHNOLOGY CO., LTD"                                    "CREATE TECHNOLOGY & SCIENCE CO., LTD"                             
    1907081995 "NINGXIA NINGDONG RAILWAY CO., LTD"                                    "GUANGXIA (YINCHUAN) INDUSTRY CO., LTD"                            
    1633026104 "KUNMING NEW SOUTHWEST TRADING CO., LTD"                               "KUNMING SINOBRIGHT (GROUP) CO., LTD"                              
    1943142645 "HUNAN ROYAL SEAL CO., LTD"                                            "CCOOP GROUP CO., LTD"                                             
    1633009824 "WEINING COUNTY MEITANGOU COAL MINE"                                   "DONGGUAN WINNERWAY INDUSTRIAL ZONE CO., LTD"                      
    1943228780 "SHANGHAI KELING INDUSTRIAL DEVELOPMENT CO., LTD"                      "JIANGSU HAGONG INTELLIGENT ROBOT CO., LTD"                        
    1907182422 "BEIJING ZHONGFU KANGHUA SCENIC AREA TOURISM DEVELOPMENT CO., LTD"     "ZHONGFU STRAITS (PINGTAN) DEVELOPMENT CO., LTD"                   
    end
    Thanks a lot for any help!

  • #2
    For this problem, I recommend installing Julio Raffo's -matchit-, available from SSC.

    Just keep your expectations modest. There is no perfect solution to the "fuzzy match" problem. All such solutions require tuning some parameter(s) that will trade off between getting too many false matches and missing too many true matches. In the end, you will be left with some errors in both directions that you will either have to clean up by manual inspection, or simply accept because you don't know how to resolve them.
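
    As a minimal sketch of what that could look like with the example data in #1: this assumes -matchit-'s two-column syntax and its similmethod() and generate() options work as described in its help file, and the 0.3 cutoff is an arbitrary illustration, not a recommendation.

    ```stata
    * ssc install matchit        // one-time install from SSC
    * Score each row's pair of names on a 0-1 scale (higher = more similar)
    matchit firm1name firm2name, similmethod(token) generate(simscore)

    * Inspect the highest-scoring pairs first; the 0.3 cutoff is an assumption
    gsort -simscore
    list ID firm1name firm2name simscore if simscore > 0.3, noobs string(30)
    ```

    The cutoff is the tuning parameter mentioned above: raising it gives fewer false matches but misses more true ones. Because nearly every name here ends in CO., LTD, the scores will be inflated unless those suffixes are stripped first.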



    • #3
      The more structure you can put on the problem, the better you can use Stata to construct an algorithm.

      For instance, suppose we knew that two companies match only if they share at least one whole word. Then you might do something like this:

      Code:
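      * Strip punctuation and the corporate suffixes CO and LTD; \b limits the
      * match to whole words, so CONSTRUCTION or COAL are left intact
      foreach var of varlist firm?name {
          gen `var'_clean = ustrregexra(`var', "&|\.|,|\(|\)|\b(CO|LTD)\b", "")
      }

      * For each pair, keep the words that appear in both cleaned names
      gen common = ""
      forval i = 1/`=_N' {
          local firm1 = firm1name_clean[`i']
          local firm2 = firm2name_clean[`i']
          local common: list firm1 & firm2
          replace common = "`common'" in `i'
      }
      drop firm?name_clean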
      which gives us:

      Code:
      . li if !missing(common), noobs string(30) sep(0)
      
        +--------------------------------------------------------------------------------------------------------+
        |         ID   firm1name                          firm2name                                       common |
        |--------------------------------------------------------------------------------------------------------|
        | 1907036848   CHENGDU PUSH PHARMACEUTICAL CO..   ANHUI FENGYUAN PHARMACEUTICAL ..        PHARMACEUTICAL |
        | 1943140202   ZHEJIANG WASU BROADCAST & TV N..   WASU MEDIA HOLDING CO., LTD                       WASU |
        | 1907057630   BEIJING MINSHENG PAWNBROKING C..   MINSHENG HOLDINGS CO., LTD                    MINSHENG |
        | 1907066201   TANGHE TIMES MINING INDUSTRY C..   INNER MONGOLIA XINGYE MINING C..                MINING |
        | 1943084140   KUNMING DONGCHUAN TONGDU MININ..   INNER MONGOLIA XINGYE MINING C..                MINING |
        | 1943150707   SHANDONG HIGH-SPEED QILU CONST..   SHANDONG HI-SPEED ROAD & BRIDG..              SHANDONG |
        | 1907042699   GUANGZHOU LINGNAN INTERNATIONA..   GUANGZHOU DONGFANG HOTEL CO., ..       GUANGZHOU HOTEL |
        | 1941490506   SHANGHAI BAIF TECHNOLOGY CO., ..   CREATE TECHNOLOGY & SCIENCE CO..            TECHNOLOGY |
        | 1633026104   KUNMING NEW SOUTHWEST TRADING ..   KUNMING SINOBRIGHT (GROUP) CO...               KUNMING |
        | 1907182422   BEIJING ZHONGFU KANGHUA SCENIC..   ZHONGFU STRAITS (PINGTAN) DEVE..   ZHONGFU DEVELOPMENT |
        +--------------------------------------------------------------------------------------------------------+
      You could probably make this smarter by excluding other generic words like "MINING", "TECHNOLOGY", "PHARMACEUTICAL", etc.
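
      A rough sketch of that refinement, assuming the data from #1 are in memory. The local generic is a hypothetical stop list: the first three words come from the sentence above, while HOTEL and DEVELOPMENT are my own additions for illustration. Whether place names such as SHANDONG or KUNMING belong on it is a judgment call.

      ```stata
      * Hypothetical stop list of generic words (extend to suit your data)
      local generic MINING TECHNOLOGY PHARMACEUTICAL HOTEL DEVELOPMENT

      * Strip punctuation and the whole-word suffixes CO and LTD
      foreach var of varlist firm?name {
          gen `var'_clean = ustrregexra(`var', "&|\.|,|\(|\)|\b(CO|LTD)\b", "")
      }

      gen common_specific = ""
      forval i = 1/`=_N' {
          local firm1 = firm1name_clean[`i']
          local firm2 = firm2name_clean[`i']
          * Words shared by both names ...
          local common: list firm1 & firm2
          * ... minus anything on the stop list
          local common: list common - generic
          replace common_specific = "`common'" in `i'
      }
      drop firm?name_clean

      list ID firm1name firm2name common_specific if !missing(common_specific), noobs string(30) sep(0)
      ```

      Pairs whose only shared word is on the stop list end up with an empty common_specific and drop out of the listing.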
      Last edited by Hemanshu Kumar; 17 Oct 2022, 09:48.



      • #4
        Hi Clyde and Hemanshu,

        Thank you both for the helpful suggestions! Identifying the common part of the two names is very useful; I can then inspect the matches manually. Luckily, my sample is not too big.
