  • group string var with random names

    Hi,

    I have a list of investor names that are written inconsistently, unlike the "investor group" variable shown below. I want to create a variable like investor_group from the variable investor_name. Please suggest how to do this.
    Investor_name                       Investor_group
    Blue Ocean                          Blue Ocean Partners
    Blue Ocean Partners                 Blue Ocean Partners
    Blue Ocean Partners LLC             Blue Ocean Partners
    Breakthrough Energy                 Breakthrough Energy
    Deutsche Bank                       Deutsche Bank
    Goldman                             Goldman Sachs
    Goldman Sachs                       Goldman Sachs
    Goldman Sachs, Inc                  Goldman Sachs
    Google                              Google
    Google Ventures                     Google
    J.P. Morgan                         JP Morgan
    JP Morgan                           JP Morgan
    JP Morgan Chase                     JP Morgan
    Kleiner Perkins                     Kleiner Perkins
    Kleiner Perkins Caufield & Byers    Kleiner Perkins
    Biomet Orthopedics, LLC             Biomet
    Biomet Spine, LLC                   Biomet
    Biomet Trauma, LLC                  Biomet
    Biomet Sports Medicine, LLC         Biomet
    BIomet 3i, LLC                      Biomet
    Biomet Microfixation, LLC           Biomet
    Biomet Biologics, LLC               Biomet
    Davol Inc.                          C. R. Bard
    Bard Peripheral Vascular, Inc.      C. R. Bard
    C. R. Bard, Inc. & Subsidiaries     C. R. Bard
    Bard Access Systems, Inc.           C. R. Bard
    DePuy Synthes Products LLC          DePuy
    DePuy Mitek LLC                     DePuy
    DePuy Orthopaedics Inc.             DePuy
    Synthes USA Products LLC            DePuy
    DePuy Spine, LLC                    DePuy

  • #2
    Look into -matchit-, by Julio Raffo, available from SSC.
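
    For anyone following along, a minimal sketch of the installation step. As far as I know, -matchit- relies on the companion package -freqindex- by the same author, so installing both is the safe assumption:

    ```stata
    * Install -matchit- from SSC (one-time step)
    ssc install matchit
    * Assumption: -matchit- needs -freqindex-, also on SSC
    ssc install freqindex
    * Read the help file for the syntax and the available options
    help matchit
    ```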


    • #3
      I looked into it, but I couldn't figure out how to generate the group variable using matchit. Could you please give me the code?


      • #4
        Rezoanul, if I understand correctly, you'd like to generate the correct investor group for each investor name.

        My belief is that software, including Stata, can only do something that has a clear algorithm. In other words, if researchers cannot explain the procedure in plain words, then software can't carry it out either. In your case, the question for the algorithm is: how could I know the name of the "investor group" based only on the "investor name"? For example, the rule for "Blue Ocean" --> "Blue Ocean Partners" seems to be adding "Partners" to the investor name, but this operation isn't valid for other cases. Even worse, "Davol Inc." belonging to "C. R. Bard" is something I would never know from the "investor name" alone; it requires knowledge of the real-life business -- and Stata "thinks" similarly.

        Given that there is no "uniform" or "easy" algorithm for your case, we are only able to go through case by case. For example, any investor name including "Biomet" (seven investor names in your case) belongs to the group called "Biomet". Then the code is

        Code:
        * lower() makes the search case-insensitive, so that "BIomet 3i, LLC" is also caught
        gen investor_group = "Biomet" if strpos(lower(investor_name), "biomet") > 0
        Other groups may follow different rules and need different code.
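
        To illustrate how the case-by-case approach might extend to other groups, here is a minimal sketch on a few of the names from #1. The keywords ("bard", "davol", "depuy", "synthes") and the helper variable lname are my own assumptions for illustration; in a larger file a short keyword such as "bard" could also hit unrelated names, so the results should be checked by eye.

        ```stata
        clear
        input str40 investor_name
        "Biomet Spine, LLC"
        "BIomet 3i, LLC"
        "Davol Inc."
        "Bard Access Systems, Inc."
        "Synthes USA Products LLC"
        "DePuy Spine, LLC"
        "Google Ventures"
        end

        * Lowercased copy so that every keyword search is case-insensitive
        gen lname = lower(investor_name)

        * Start with an empty group and fill it in one rule at a time
        gen investor_group = ""
        replace investor_group = "Biomet"     if strpos(lname, "biomet") > 0
        replace investor_group = "C. R. Bard" if strpos(lname, "bard") > 0 | strpos(lname, "davol") > 0
        replace investor_group = "DePuy"      if strpos(lname, "depuy") > 0 | strpos(lname, "synthes") > 0
        drop lname

        * Names that no rule has covered yet (here, "Google Ventures")
        list investor_name if missing(investor_group)
        ```

        Each new group needs its own -replace- line, which is why this approach only scales to a modest number of groups.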


        • #5
          I may have understood the original request differently. Here is how I interpreted it. I assume that OP has two data sets. One, let's call it investors.dta, contains the investor names, which are irregular and erratic; the other, let's call it correct_groups.dta, contains the correct investor group names. The task is to match the investor names in investors.dta with the correct corresponding group from correct_groups.dta. Now, given the erratic nature of the names in investors.dta, this is an imperfect process, and what is needed is a fuzzy match that picks one or more reasonably close matches. The results will need to be reviewed afterwards to deal manually with false matches or unmatched investor names. This is what -matchit- accomplishes. The following code illustrates how it is used:

          Code:
          * Example generated by -dataex-. For more info, type help dataex
          clear
          input str19 investor_group
          "Blue Ocean Partners"
          "Breakthrough Energy"
          "Deutsche Bank"      
          "Goldman Sachs"      
          "Google"            
          "JP Morgan"          
          "Kleiner Perkins"    
          "Biomet"            
          "C. R. Bard"        
          "DePuy"              
          end
          gen long obs_no = _n
          tempfile correct_groups
          save `correct_groups'
          
          * Example generated by -dataex-. For more info, type help dataex
          clear
          input str33 investor_name
          "Blue Ocean "                      
          "Blue Ocean Partners "            
          "Blue Ocean Partners LLC "        
          "Breakthrough Energy "            
          "Deutsche Bank "                  
          "Goldman "                        
          "Goldman Sachs "                  
          "Goldman Sachs, Inc "              
          "Google "                          
          "Google Ventures "                
          "J.P. Morgan "                    
          "JP Morgan "                      
          "JP Morgan Chase "                
          "Kleiner Perkins "                
          "Kleiner Perkins Caufield & Byers "
          ""                                
          "Biomet Orthopedics, LLC "        
          "Biomet Spine, LLC "              
          "Biomet Trauma, LLC "              
          "Biomet Sports Medicine, LLC "    
          "BIomet 3i, LLC "                  
          "Biomet Microfixation, LLC "      
          "Biomet Biologics, LLC "          
          "Davol Inc. "                      
          "Bard Peripheral Vascular, Inc. "  
          "C. R. Bard, Inc. & Subsidiaries "
          "Bard Access Systems, Inc. "      
          "DePuy Synthes Products LLC "      
          "DePuy Mitek LLC "                
          "DePuy Orthopaedics Inc. "        
          "Synthes USA Products LLC "        
          "DePuy Spine, LLC "                
          end
          tempfile investors
          save `investors'
          
          use `investors'
          gen long obs_no = _n
          matchit obs_no investor_name using `correct_groups', txtusing(investor_group) ///
              idusing(obs_no) override
          Note: In the above code, instead of investors.dta and correct_groups.dta I have used tempfiles `investors' and `correct_groups'. OP should modify the code to use the actual names of whatever those files are. Note that both files require an ID number variable--which the code above supplies.

          Depending on how the results turn out, it may be necessary to rerun, experimenting with different settings of the threshold or other options available in -matchit-. This is a trial-and-error process that cannot be set out here. Perfect results should not be expected.
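
          As a rough sketch of what such experimenting might look like, the following reruns the match on a small subset of the names with a lower threshold and then keeps the best-scoring candidate for each investor name. It assumes -matchit- (and -freqindex-) are installed from SSC, that threshold() defaults to 0.5, and that the score variable is called similscore (the default name, as I understand it). The ID names name_id and group_id and the value 0.3 are chosen only for illustration.

          ```stata
          clear
          input str19 investor_group
          "Goldman Sachs"
          "Google"
          "JP Morgan"
          end
          gen long group_id = _n
          tempfile correct_groups
          save `correct_groups'

          clear
          input str33 investor_name
          "Goldman"
          "Goldman Sachs, Inc"
          "Google Ventures"
          "J.P. Morgan"
          "JP Morgan Chase"
          end
          gen long name_id = _n

          * A lower threshold admits weaker candidate matches
          * (another option to experiment with is similmethod(), e.g. similmethod(token))
          matchit name_id investor_name using `correct_groups', ///
              idusing(group_id) txtusing(investor_group) threshold(0.3) override

          * Keep only the highest-scoring candidate for each investor name
          bysort name_id (similscore): keep if _n == _N
          list investor_name investor_group similscore, noobs
          ```

          Names with no candidate above the threshold simply do not appear in the result, so it is worth merging back to the original file afterwards to see which ones were left unmatched.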

          The advice by Fei Wang in #4 is correct, but is based on a different interpretation of what OP wants to do. But, to be sure, if there is no reference file containing the correct names, the task cannot be accomplished: Stata cannot guess what those might be. If OP does not have such a file, it may be possible to find one online somewhere. In fact, I would be surprised if such a file were not generally available, although, this not being my area of interest or expertise, I can't say exactly how to go about finding it.

          In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

          When asking for help with code, always show example data. When showing example data, always use -dataex-.
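
          For reference, a minimal sketch of a -dataex- call. It uses the auto data set that ships with Stata as a stand-in; OP would instead load the actual data and name the variables investor_name and investor_group.

          ```stata
          * Load a data set that ships with Stata (stand-in for the real data)
          sysuse auto, clear

          * Show the first five observations of two variables in a form
          * that others can copy and paste straight into Stata
          dataex make price in 1/5
          ```

          The output of -dataex- can be pasted into the forum between code delimiters as is.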
