group string var with random names

Rezoanul Hoque

Join Date: Apr 2020
Posts: 17

30 Oct 2021, 23:09

Hi,

I have a list of investor names that are not the same as the names in the "investor group" variable. I want to create a variable like investor_group from the variable investor_name. Please suggest how I can do it.

Investor_name	Investor_group
Blue Ocean	Blue Ocean Partners
Blue Ocean Partners	Blue Ocean Partners
Blue Ocean Partners LLC	Blue Ocean Partners
Breakthrough Energy	Breakthrough Energy
Deutsche Bank	Deutsche Bank
Goldman	Goldman Sachs
Goldman Sachs	Goldman Sachs
Goldman Sachs, Inc	Goldman Sachs
Google	Google
Google Ventures	Google
J.P. Morgan	JP Morgan
JP Morgan	JP Morgan
JP Morgan Chase	JP Morgan
Kleiner Perkins	Kleiner Perkins
Kleiner Perkins Caufield & Byers	Kleiner Perkins

Biomet Orthopedics, LLC	Biomet
Biomet Spine, LLC	Biomet
Biomet Trauma, LLC	Biomet
Biomet Sports Medicine, LLC	Biomet
BIomet 3i, LLC	Biomet
Biomet Microfixation, LLC	Biomet
Biomet Biologics, LLC	Biomet
Davol Inc.	C. R. Bard
Bard Peripheral Vascular, Inc.	C. R. Bard
C. R. Bard, Inc. & Subsidiaries	C. R. Bard
Bard Access Systems, Inc.	C. R. Bard
DePuy Synthes Products LLC	DePuy
DePuy Mitek LLC	DePuy
DePuy Orthopaedics Inc.	DePuy
Synthes USA Products LLC	DePuy
DePuy Spine, LLC	DePuy

Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#2

30 Oct 2021, 23:17

Look into -matchit-, by Julio Raffo, available from SSC.
Rezoanul Hoque

Join Date: Apr 2020

Posts: 17
#3

30 Oct 2021, 23:20

I looked at it, but I couldn't figure out how to generate the group variable using matchit. Could you please give me the code?
Fei Wang

Join Date: Oct 2021

Posts: 726
#4

31 Oct 2021, 00:45

Rezoanul, if I understand correctly, you'd like to generate the correct investor group for each investor name.

My belief is that software, including Stata, can only do things that have a clear algorithm. In other words, if researchers cannot explain, in plain words, the procedure for doing something, then software can't do it either. In your case, the question is: how could I know the name of the "investor group" based only on the "investor name"? For example, the algorithm for "Blue Ocean" --> "Blue Ocean Partners" seems to be adding "Partners" to the investor name, but this operation isn't valid for other cases. Even worse, "Davol Inc." belonging to "C. R. Bard" is something I would never know unless I had more information about real-life business than just the "investor name" -- Stata "thinks" similarly.

Given that there is no "uniform" or "easy" algorithm for your case, we can only go through the data case by case. For example, any investor name including "Biomet" (seven investor names in your case) belongs to the group called "Biomet". Then the code is

Code:

* lower() makes the match case-insensitive, so "BIomet 3i, LLC" is caught too
gen investor_group = "Biomet" if strpos(lower(investor_name), "biomet") > 0

Other groups may follow different rules and need different code.
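As a rough sketch of how the case-by-case approach could be extended, the block below builds a small toy dataset and assigns groups with one keyword rule per group. The keywords ("bard", "davol", "synthes", and so on) are assumptions read off the example table in #1, not a general rule, and a short keyword such as "bard" could also match unrelated names in a larger file. The helper variable name_lc is hypothetical.

```stata
* Toy data taken from the example in #1 (a subset only)
clear
input str40 investor_name
"Biomet Spine, LLC"
"BIomet 3i, LLC"
"Davol Inc."
"Bard Access Systems, Inc."
"Synthes USA Products LLC"
"DePuy Mitek LLC"
"Google Ventures"
end

* Lowercase copy so that every rule is case-insensitive
gen name_lc = lower(investor_name)

* Start empty, then fill in one group at a time
gen str20 investor_group = ""
replace investor_group = "Biomet"     if strpos(name_lc, "biomet") > 0
replace investor_group = "C. R. Bard" if strpos(name_lc, "bard") > 0 | strpos(name_lc, "davol") > 0
replace investor_group = "DePuy"      if strpos(name_lc, "depuy") > 0 | strpos(name_lc, "synthes") > 0

* Names that no rule has covered yet still need a rule or manual review
list investor_name if investor_group == "", noobs
```

Listing the unassigned names at the end makes it easier to see which rules are still missing.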
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#5

31 Oct 2021, 09:35

I may have understood the original request differently. Here is how I interpreted it. I assume that OP has two data sets. One, let's call it investors.dta, contains investor_names, which are irregular and erratic, and another, let's call it correct_groups.dta, contains the correct investor group names. The task is to match the investor names in investors.dta with the correct corresponding group from correct_groups.dta. Now, given the erratic nature of the names in investors.dta, this is an imperfect process and what is needed is a fuzzy match that picks one or more reasonably close matches. The results will need to be reviewed afterwards to deal manually with false matches or unmatched investor names. This is what -matchit- accomplishes. The following code illustrates how it is used:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str19 investor_group
"Blue Ocean Partners"
"Breakthrough Energy"
"Deutsche Bank"
"Goldman Sachs"
"Google"
"JP Morgan"
"Kleiner Perkins"
"Biomet"
"C. R. Bard"
"DePuy"
end
gen long obs_no = _n
tempfile correct_groups
save `correct_groups'

* Example generated by -dataex-. For more info, type help dataex
clear
input str33 investor_name
"Blue Ocean "
"Blue Ocean Partners "
"Blue Ocean Partners LLC "
"Breakthrough Energy "
"Deutsche Bank "
"Goldman "
"Goldman Sachs "
"Goldman Sachs, Inc "
"Google "
"Google Ventures "
"J.P. Morgan "
"JP Morgan "
"JP Morgan Chase "
"Kleiner Perkins "
"Kleiner Perkins Caufield & Byers "
""
"Biomet Orthopedics, LLC "
"Biomet Spine, LLC "
"Biomet Trauma, LLC "
"Biomet Sports Medicine, LLC "
"BIomet 3i, LLC "
"Biomet Microfixation, LLC "
"Biomet Biologics, LLC "
"Davol Inc. "
"Bard Peripheral Vascular, Inc. "
"C. R. Bard, Inc. & Subsidiaries "
"Bard Access Systems, Inc. "
"DePuy Synthes Products LLC "
"DePuy Mitek LLC "
"DePuy Orthopaedics Inc. "
"Synthes USA Products LLC "
"DePuy Spine, LLC "
end
tempfile investors
save `investors'

use `investors'
gen long obs_no = _n
matchit obs_no investor_name using `correct_groups', txtusing(investor_group) ///
    idusing(obs_no) override

Note: In the above code, instead of investors.dta and correct_groups.dta I have used tempfiles `investors' and `correct_groups'. OP should modify the code to use the actual names of whatever those files are. Note that both files require an ID number variable--which the code above supplies.
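For illustration, here is a minimal sketch of the same steps with files on disk instead of tempfiles. The file names investors.dta and correct_groups.dta are the placeholder names used above, while correct_groups_id.dta and the ID variables name_id and group_id are hypothetical names introduced only for this example. It also assumes that -matchit- and -freqindex- (which, as far as I know, -matchit- relies on) need to be installed from SSC first.

```stata
* One-time installation from SSC
ssc install matchit, replace
ssc install freqindex, replace

* Add a numeric ID to the reference file and save a copy
use correct_groups, clear
gen long group_id = _n
save correct_groups_id, replace

* Add a numeric ID to the investor file, then run the fuzzy match
use investors, clear
gen long name_id = _n
matchit name_id investor_name using correct_groups_id.dta, ///
    idusing(group_id) txtusing(investor_group)
```

Using different names for the two ID variables is a choice made here so the result is easier to read; it is not a requirement stated above.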

Depending on how the results turn out, it may be necessary to rerun, experimenting with different settings of the threshold or other options available in -matchit-. This is a trial-and-error process that cannot be set out here. Perfect results should not be expected.
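A hedged sketch of what one such rerun might look like follows. It assumes it is run in the same do-file as the code above, so that the tempfiles `investors' and `correct_groups' still exist. The value 0.3 is an arbitrary illustrative threshold, not a recommendation, and name_id is a hypothetical variable name; see -help matchit- for the default threshold and the other options.

```stata
* Rerun the match with a looser similarity threshold (0.3 is illustrative only)
use `investors', clear
gen long name_id = _n
matchit name_id investor_name using `correct_groups', txtusing(investor_group) ///
    idusing(obs_no) threshold(0.3) override

* similscore is the default name of the score variable that -matchit- creates.
* Keep only the highest-scoring candidate group for each investor name.
bysort name_id (similscore): keep if _n == _N

* Inspect the surviving pairs by eye; low scores deserve the most scrutiny
list investor_name investor_group similscore, noobs
```

Investor names with no candidate above the threshold do not appear in the results at all, so they would still have to be found and handled separately.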

The advice by Fei Wang in #4 is correct, but is based on a different interpretation of what OP wants to do. But, to be sure, if there is no reference file containing the correct names, the task cannot be accomplished: Stata cannot guess what those might be. If OP does not have such a file, it may be possible to find one online somewhere. In fact, I would be surprised if such a file were not generally available somewhere, although, this not being my area of interest or expertise, I can't say exactly how to go about finding it.

In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
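As a brief illustration, assuming the variables are named investor_name and investor_group as in #1 and that the data are already loaded in memory, a typical call might look like the sketch below. The range 1/20 is an arbitrary choice.

```stata
* Only needed on older installations that do not already include -dataex-
ssc install dataex, replace

* Produce a copy-and-paste data example of the first 20 observations
dataex investor_name investor_group in 1/20
```

The output shown in the Results window can then be pasted into a forum post between code delimiters.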

When asking for help with code, always show example data. When showing example data, always use -dataex-.