Creating new dummy variable from several dummies

julie anderson

Join Date: Mar 2020
Posts: 3

01 Apr 2020, 09:18

Hi,
I have a long list of dummy variables (more than 2000) for each firm, named t1-t2100. These are company tags that I converted to 1/0 dummies.
In a separate file I have a matrix that splits these t* dummies into categories such as software and hardware:

tag_m	tagss	soft	hard	bio	OTHER	MEDICAL	telcomobile	ECOMMERCE	CYBER	FINTECH	INDUSTRY4	AGRI
	3d-technology
t30	adtech
t31	advertisers
t32	advertising
t46	agriculture
t48	agtech
t59	alert-system
t63	algorithms
t74	analytics
t87	anti-fraud

I want to create a variable in the original dataset using this category matrix, basically:
gen software = 1 if t30==1 | t31==1 | t32==1 ......

But doing this for so many categories and t* variables is very tedious. I am sure there is a neater way to do it than typing it all out manually as above?

Would appreciate any help or suggestions.
To clarify, the legend matrix I copied above is in a separate data file from the main data, which looks like this:

company_id	value	amount	age	t1	t2	t3	t4	t5	t6	t7	t8	t9	t10	t11
x	100	10	4	0	1	1	1	0	0	0	1	1	0	0
y	4000	4	8	0	1	0	0	0	0	0	1	0	0	1

I want to add columns to this dataset so that, based on the category matrix above, there are new 0/1 variables such as soft, with soft = 1 if t30==1 | t31==1 | t32==1 ......

company_id	value	amount	age	t1	t2	t3	t4	t5	t6	t7	t8	t9	t10	t11	soft	hard	bio	other
x	100	10	4	0	1	1	1	0	0	0	1	1	0	0
y	4000	4	8	0	1	0	0	0	0	0	1	0	0	1

Thanks

Clyde Schechter

Join Date: Apr 2014

Posts: 30104
#2

01 Apr 2020, 13:06

I sense from your use of terminology that you have begun to use Stata relatively recently and are steeped in habits of thought and practices appropriate to some other statistical package. One of the keys to becoming efficient in the use of Stata is to break those old habits and develop new ones that are more congenial to the way Stata works.

For example, the wide layout of your company data, with all those t* variables makes things hard to do in Stata. A more workable layout is the long one, with multiple observations per company, one for each t* that applies to it. Another example is the use of 1/missing value coding for your indicators in the categories data. When Stata evaluates logical expressions, 1 and missing value both evaluate as true: only 0 evaluates as false. So you are setting yourself up for logic errors later on when you use 1 for yes and missing value for no. It should be 1 for yes and 0 for no.
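The point about missing values in logical expressions can be checked with a small sketch. The variable name flag and its three values are made up for illustration and are not taken from the data in this thread.

```stata
clear
* Hypothetical indicator coded 1 / 0 / missing
input byte flag
1
0
.
end

* A bare -if flag- is true for every nonzero value, and missing is nonzero,
* so this counts both the 1 and the missing observation
count if flag

* An explicit comparison with 1 picks up only the real "yes"
count if flag == 1

* Recoding missing to 0 removes the ambiguity
mvencode flag, mv(0)
count if flag
```

With 1/0 coding in place, a bare `if flag` and `if flag == 1` select the same observations.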

So, given what you're starting with, some transformations of the data are required in order to put these two data sets together. Once that is done, it's very simple.

Code:

clear*
* Example generated by -dataex-. To install: ssc install dataex
clear
input str3 tag_m str13 tagss byte(var3 hard bio other medical telcomobile ecommerce cyber fintech industry4 agri)
"t6" "3d-technology" . 1 . . . . . . . . .
"t30" "adtech" 1 . . . . . . . . . .
"t31" "advertisers" 1 . . . . . . . . . .
"t32" "advertising" 1 . . . . . . . . . .
"t46" "agriculture" . . . . . . . . . . 1
"t48" "agtech" . . . . . . . . . . 1
"t59" "alert-system" 1 1 . . . . . . . . .
"t63" "algorithms" 1 . . . . . . . . . .
"t74" "analytics" 1 . . . . . . . . . .
"t87" "anti-fraud" . . . . . . . . . . .
end
ds tag_m tagss, not
tempfile categories
save `categories'
* Example generated by -dataex-. To install: ssc install dataex
clear
input str1 company_id int value byte(amount age t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11)
"x" 100 10 4 0 1 1 1 0 0 0 1 1 0 0
"y" 4000 4 8 0 1 0 0 0 0 0 1 0 0 1
end
// GO TO LONG LAYOUT
reshape long t, i(company_id) j(tag_m)
drop if t == 0
drop t
tostring tag_m, replace
replace tag_m = "t" + tag_m
frame create categories
frame change categories
use `categories'
// MAKE THE INDICATORS FOR HARD, BIO, ETC. PROPER 0/1 VARIABLES
ds tag_m tagss, not
mvencode `r(varlist)', mv(0)
// NOW PUT THEM TOGETHER
frame change default
frlink m:1 tag_m, frame(categories)
frget hard-agri, from(categories)

In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

When asking for help with code, always show example data. When showing example data, always use -dataex-.
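As a rough illustration of what that looks like in practice, the sketch below runs -dataex- on the auto dataset that ships with Stata rather than on the poster's data. The choice of variables and the five-observation range are arbitrary.

```stata
* Illustration only: the auto dataset stands in for your own data
sysuse auto, clear

* Print a few variables for the first five observations
* in a form that can be pasted straight into a forum post
dataex make price mpg in 1/5
```

The output is a block of -input- statements like the ones in the code in this post, so anyone reading can recreate the example exactly.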

Note: In your data example tableaus, the only instances of the t* variables being 1 are t2, t3, t4, t8, t9, and t11. But none of these appear in the categories data set. So applying the code to your example data turns up all zeroes for soft hard bio, etc. Generally when posting it is better to show example data that fits together and illustrates the phenomena you are trying to capture. Presumably in your full data, this problem does not arise.

Added: It dawns on me that you will actually need to revert to your original data organization with a single observation per company, perhaps using the various categories (hard, soft, etc.) as predictors in some kind of model. So the following code extends what is above to do that. In addition, I have shortened the code somewhat by eliminating some unnecessary steps in the management of the categories data set.

Code:

clear*
* Example generated by -dataex-. To install: ssc install dataex
clear
input str1 company_id int value byte(amount age t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 t11)
"x" 100 10 4 0 1 1 1 0 0 0 1 1 0 0
"y" 4000 4 8 0 1 0 0 0 0 0 1 0 0 1
end
// RETAIN A COPY OF THE DATA IN ITS ORIGINAL FORM
frame copy default original
// GO TO LONG LAYOUT
reshape long t, i(company_id) j(tag_m)
drop if t == 0
drop t
tostring tag_m, replace
replace tag_m = "t" + tag_m
frame create categories
frame change categories
* Example generated by -dataex-. To install: ssc install dataex
clear
input str3 tag_m str13 tagss byte(var3 hard bio other medical telcomobile ecommerce cyber fintech industry4 agri)
"t6" "3d-technology" . 1 . . . . . . . . .
"t30" "adtech" 1 . . . . . . . . . .
"t31" "advertisers" 1 . . . . . . . . . .
"t32" "advertising" 1 . . . . . . . . . .
"t46" "agriculture" . . . . . . . . . . 1
"t48" "agtech" . . . . . . . . . . 1
"t59" "alert-system" 1 1 . . . . . . . . .
"t63" "algorithms" 1 . . . . . . . . . .
"t74" "analytics" 1 . . . . . . . . . .
"t87" "anti-fraud" . . . . . . . . . . .
end
// MAKE THE INDICATORS FOR HARD, BIO, ETC. PROPER 0/1 VARIABLES
ds tag_m tagss, not
mvencode `r(varlist)', mv(0)
// NOW GRAB THE CATEGORIES
frame change default
frlink m:1 tag_m, frame(categories)
frget hard-agri, from(categories)
// AND REDUCE TO ONE OBSERVATION PER COMPANY
collapse (max) hard-agri, by(company_id)
// AND NOW PUT THAT TOGETHER WITH THE ORIGINAL DATA
frame change original
frlink 1:1 company_id, frame(default)
frget hard-agri, from(default)
drop default
frame drop default
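To see the approach produce non-trivial results, here is a minimal sketch on made-up data in which the tags and the categories do fit together. Everything in it is hypothetical: three firms, three tags, and two categories named soft and hard. It assumes Stata 16 or later, since it relies on frames. It follows the same steps as the code in this post, with one assumed addition at the end: a firm with no tags at all never reaches the long data, so its categories come back missing and are set to 0 explicitly.

```stata
clear all

* Hypothetical company data: x has tag t1, y has t2 and t3, z has no tags
input str1 company_id byte(t1 t2 t3)
"x" 1 0 0
"y" 0 1 1
"z" 0 0 0
end
frame copy default original

* Long layout: one observation per company and tag that applies
reshape long t, i(company_id) j(tag_m)
keep if t == 1
drop t
tostring tag_m, replace
replace tag_m = "t" + tag_m

* Hypothetical categories data, coded 1/missing and then recoded to 1/0
frame create categories
frame change categories
input str3 tag_m byte(soft hard)
"t1" 1 .
"t2" . 1
"t3" . 1
end
mvencode soft hard, mv(0)

* Attach the categories to each company-tag pair, then take the maximum
* within company so a category is 1 if any of the company's tags has it
frame change default
frlink m:1 tag_m, frame(categories)
frget soft hard, from(categories)
collapse (max) soft hard, by(company_id)

* Bring the results back to the one-observation-per-company data
frame change original
frlink 1:1 company_id, frame(default)
frget soft hard, from(default)
drop default

* Assumed extra step: firms with no tags get 0 rather than missing
replace soft = 0 if missing(soft)
replace hard = 0 if missing(hard)

list company_id t1-t3 soft hard
```

In this sketch x ends up with soft = 1 and hard = 0, y with soft = 0 and hard = 1, and z with 0 for both.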

Last edited by Clyde Schechter; 01 Apr 2020, 13:31.