Convert categorical variable to dummy variables in a large dataset

Van Anh

Join Date: Jan 2019

Posts: 8
#1

26 Jan 2019, 09:34

Dear everyone,

I have a dataset of 350 categorical variables, e.g., "How satisfied are you with your life?" with answers from 1 (Unhappy) to 10 (Very happy).
I want to run a regression on them, but my professor advised me to convert them into dummy variables first.
However, I only know one command that converts a single categorical variable into dummy variables: tabulate varname, gen(newvarname). How can I convert such a large number of variables into dummies at once?
I also tried a foreach loop, but it did not work correctly.

Could you please advise me?
Thank you very much!
Carole J. Wilson

Join Date: Jan 2015

Posts: 932
#2

26 Jan 2019, 09:45

1) You don't tell us if there is a naming convention for your variables. You'll either need to type out the 350 variables or somehow find a pattern that you can use to identify the variables in a macro.
2) Do you really want to create a new variable for each category of your 350 variables? Or do you want to create something like values 1 to 5=0 (Unhappy), and values 6-10=1 (Happy)?
3) Do you want the coding to be the same for all variables (1-5=0 for all variables) (6-10=1 for all variables)?
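
If the answer to 2) and 3) turns out to be one cutoff applied to every variable, a minimal sketch could look like the following. The toy data, the variable names q1-q3, and the _bin suffix are all made up for illustration; note that -clear- drops whatever data are in memory.

```stata
* Toy data: three hypothetical 1-10 items named q1-q3 (assumed names)
clear
set seed 2019
set obs 100
forvalues i = 1/3 {
    generate q`i' = runiformint(1, 10)
}
replace q1 = . in 1/5          // a few missing answers

* Same coding for every item: 1-5 -> 0 (Unhappy), 6-10 -> 1 (Happy)
* The if qualifier keeps missing answers missing instead of coding them 0
foreach v of varlist q1-q3 {
    generate byte `v'_bin = inrange(`v', 6, 10) if !missing(`v')
}

* Check the recoding for one item
tabulate q1 q1_bin, missing
```

If the cutoffs differ by variable, the loop would need a separate rule per variable instead.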

Stata/MP 14.1 (64-bit x86-64)
Revision 19 May 2016
Win 8.1
Van Anh

Join Date: Jan 2019

Posts: 8
#3

26 Jan 2019, 10:08

Carole J. Wilson: Thank you very much for your response!
1. My variables are named V1, V2, ... V265.
2. I think that I have to create a new variable for each category. Some variables are not ordinal, e.g., race. The happiness variable is the dependent variable.

One more question: if I want to do feature selection with lassopack, do I need to convert these variables to dummies first, or can I leave them unchanged?
Thank you very much!
Joro Kolev

Join Date: Aug 2018

Posts: 3050
#4

26 Jan 2019, 10:13

In modern Stata, -regress- and other estimation commands accept what is called "factor notation" or "factor variables".

This means that most of the time you do not need to create the dummies manually; you just use factor-variable notation, and Stata creates the dummies for you on the fly.

Type "factor variables" in the help, or -help fvvarlist-, and you will see how they are used.
David Benson

Join Date: Oct 2018

Posts: 489
#5

26 Jan 2019, 17:19

As Joro mentions, using factor notation, your regression will look something like this (I list OLS regression, but it would work for multinomial logit as well):

Code:

regress happiness age sex i.income_level i.educ_level i.marital_status i.religion
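
With hundreds of predictors, typing i. in front of each name is impractical. One possible sketch is to build the list of factor-variable terms in a local macro. The toy data, the names V1-V4, and the macro name rhs are assumptions for illustration, and -clear- drops the data in memory.

```stata
* Toy data: four hypothetical categorical predictors and an outcome
clear
set seed 2019
set obs 500
forvalues i = 1/4 {
    generate V`i' = runiformint(1, 4)
}
generate happiness = runiformint(1, 10)

* Build "i.V1 i.V2 i.V3 i.V4" in a local macro
local rhs
foreach v of varlist V1-V4 {
    local rhs `rhs' i.`v'
}

* Stata expands each i. term into indicators on the fly
regress happiness `rhs'
```

No new variables are created this way, so the dataset stays the same size.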
Van Anh

Join Date: Jan 2019

Posts: 8
#6

28 Jan 2019, 07:07

Joro Kolev David Benson: Thank you very much for your help!
Before running the regression, I want to run lasso2 (from lassopack) first to select the variables. I was advised to follow these steps:
1. Convert the categorical variables to dummy variables.
2. Standardize these dummy variables.
3. Run lasso2.
Can I also use factor notation for the standardization in step 2? Can I use a foreach loop to do the standardization? It is too complicated for me.
Thank you very much!

David Benson

Join Date: Oct 2018

Posts: 489
#7

30 Jan 2019, 22:43

I've never used lasso2 or lassopack, so I'm not going to be of much help. I also don't know what it means to standardize a dummy variable.

Also, it's not clear that standardizing really helps anything. See posts here, here, here, and here.

However, assuming you really need to do these things:

Code:

* To create indicator (dummy) variables for one categorical variable
tabulate income_level, gen(inc)

* Loop to standardize variables
* You can list the variables in the loop, or put them in a local macro and then loop over the local macro.
* The two loops below are alternatives: run only one of them, because the
* second would fail once the std_* variables already exist.

* Option 1: loop over a local macro
local my_vars "income_level educ_level marital_status religion"

foreach v of local my_vars {
     egen std_`v' = std(`v')
}

* Option 2: loop over a varlist
foreach var of varlist income_level educ_level marital_status religion {
     egen std_`var' = std(`var')
}
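
To apply both steps to many variables at once, and to standardize the indicators rather than the original categorical variables, something along these lines might work. This is only a sketch: the toy data, the names V1-V3 (following the V1, V2, ... pattern from #3), and the _d and std_ naming are assumptions, and -clear- drops the data in memory.

```stata
* Toy data: three hypothetical categorical variables
clear
set seed 2019
set obs 200
forvalues i = 1/3 {
    generate V`i' = runiformint(1, 4)
}

* Step 1: one set of indicators per variable (V1_d1 V1_d2 ..., V2_d1 ...)
foreach v of varlist V1-V3 {
    tabulate `v', generate(`v'_d)
}

* Step 2: standardize each indicator to mean 0 and sd 1
foreach d of varlist V*_d* {
    egen double std_`d' = std(`d')
}

summarize std_*
```

With a few hundred variables of up to 10 categories each, this creates thousands of new variables, so check that your edition of Stata allows that many.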


Jordan Louis

Join Date: Dec 2020

Posts: 2
#8

16 Dec 2020, 07:27

Originally posted by Carole J. Wilson

1) You don't tell us if there is a naming convention for your variables. You'll either need to type out the 350 variables or somehow find a pattern that you can use to identify the variables in a macro.
2) Do you really want to create a new variable for each category of your 350 variables? Or do you want to create something like values 1 to 5=0 (Unhappy), and values 6-10=1 (Happy)?
3) Do you want the coding to be the same for all variables (1-5=0 for all variables) (6-10=1 for all variables)?

How would you go about doing 2)? I want to create a dummy that equals 0 (unhappy) when my happy variable is 1 or 2, and 1 (happy) when it is 3 or 4. I want to do something similar for my number-of-children variable, where a participant with up to 3 children is coded 0 (few children) and one with >4 children is coded 1 (many children).

happy
tabulation:  Freq.   Numeric  Label
             6,807         1  More than usual
            41,444         2  Same as usual
             6,422         3  Less so
               991         4  Much less

child
tabulation:  Freq.   Numeric  Label
            37,862         0
             7,419         1
             7,514         2
             2,413         3
               357         4
                70         5
                26         6
                 3         7

Hope this makes sense, thanks
