
  • Recoding dummy variables: 0/1 or 1/2

    I just need a bit of clarification:

    Is it imperative to code dummy variables as 0/1, or does coding them as 1/2 not matter when carrying out regressions in Stata? I have noticed that most survey data code the dummies as 1/2.

    If 0/1 is required, what is the logic behind recoding the dummy variables as 0/1?



    And what about categorical variables? Should they also start from 0 (0, 1, 2, ...), or can they remain as 1, 2, 3, ...?



    Thank you.

  • #2
    As long as you use factor-variable notation when entering the variables in your regression model, you can use any non-negative integers for binary or categorical variables. Internally, Stata will turn those variables into (a set of) 0/1 indicator (dummy) variables. The benefit of 0/1 coding is that the constant refers to the conditional mean of the reference category. If you coded your variable 1/2, the constant would refer to the conditional mean of a group that does not exist.
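    A minimal sketch on the auto data (the 1/2 recode here is hypothetical, just for illustration):

    Code:
    sysuse auto, clear
    * hypothetical 1/2 recode of the 0/1 indicator foreign
    gen foreign12 = foreign + 1
    * factor-variable notation: Stata builds the 0/1 indicator internally
    regress price i.foreign12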
    ---------------------------------
    Maarten L. Buis
    University of Konstanz
    Department of history and sociology
    box 40
    78457 Konstanz
    Germany
    http://www.maartenbuis.nl
    ---------------------------------



    • #3
      Another benefit of (0, 1) coding is that the mean of such an indicator variable has an immediate concrete interpretation as a probability. In the auto data

      Code:
      sysuse auto, clear 
      summarize foreign
      the mean is the probability (proportion, fraction) of cars that are foreign. Note incidentally the excellent convention of naming indicators for the category coded 1. Thus use names such as female (not gender).



      • #4
        Another advantage of coding a dichotomous variable as 0/1 is that in the event you also want to use the variable as the dependent variable in a logistic or probit model, Stata always interprets those variables as 0 = false, non-zero = true. So if you code it as 1/2 and try to use it that way, Stata will complain that your outcome doesn't vary.
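        A quick sketch of what that looks like in practice (foreign2 here is a hypothetical 1/2 recode):

        Code:
        sysuse auto, clear
        logit foreign mpg          // works: foreign is coded 0/1
        gen foreign2 = foreign + 1 // hypothetical 1/2 recode
        logit foreign2 mpg         // fails: every value is nonzero, so the outcome does not vary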



        • #5
          I have a question related to the question posed here.

          I have a categorical variable "cellar" with 3 categories: 1 large basement, 2 small basement, or 3 crawl space, from which I generated dummy variables using:

          Code:
          tab cellar, gen(c)
          To use regression with the dependent variable logprice (=houseprice), should I use:

          Code:
          reg logprice i.cellar
          or

          Code:
          reg logprice c2 c3
          with c1 left out as the base group.

          Which one is correct?

          Thanks in advance.



          • #6
            Mat Sko your two choices are mathematically equivalent if you want level 1 as the reference category. However, in Stata it is much more advantageous and simpler to use factor-variable notation, as in your first regression statement. The factor notation also allows one to change the reference/base category. For example, to use level 3, you would type -ib3.cellar-.
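            For instance, a sketch with the variables described above:

            Code:
            reg logprice i.cellar    // level 1 is the base by default
            reg logprice ib3.cellar  // level 3 as the base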



            • #7
              Dear all,

              I have one question related to this issue.

              When I change the reference category, how do the results change? For example, I have a categorical independent variable with 3 levels: 1, 2, 3. When I treat 1 as the base level versus 3 as the base level, how will the results (sign and magnitude of the coefficients) change?

              I guessed that the coefficients would simply flip sign. However, my results change in sign, size, and statistical significance. What am I getting wrong here?

              Thanks a lot.



              • #8
                Chi chi

                Let's have a look on a simulated dataset.

                Code:
                clear
                set seed 17760704
                set obs 1000
                gen x=runiformint(1,4)
                gen y=rnormal(x)
                replace y=y-0.9 if x==3
                Here I create an independent variable that takes values 1-4, and a dependent variable that is normal with mean x, except when x=3: I want y to have almost the same mean as when x=2, for a reason I will explain below.

                Regression with default reference (the first category of x)

                Code:
                reg y i.x
                
                      Source |       SS           df       MS      Number of obs   =     1,000
                -------------+----------------------------------   F(3, 996)       =    425.18
                       Model |  1286.91173         3  428.970575   Prob > F        =    0.0000
                    Residual |  1004.87019       996  1.00890581   R-squared       =    0.5615
                -------------+----------------------------------   Adj R-squared   =    0.5602
                       Total |  2291.78192       999  2.29407599   Root MSE        =    1.0044
                
                ------------------------------------------------------------------------------
                           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                -------------+----------------------------------------------------------------
                           x |
                          2  |   1.001929   .0886142    11.31   0.000     .8280369    1.175821
                          3  |   1.026459   .0906117    11.33   0.000     .8486477    1.204271
                          4  |   3.087251   .0887027    34.80   0.000     2.913185    3.261317
                             |
                       _cons |   .9597223   .0622929    15.41   0.000     .8374819    1.081963
                ------------------------------------------------------------------------------
                And regression taking x=2 as the reference:

                Code:
                reg y ib2.x
                
                      Source |       SS           df       MS      Number of obs   =     1,000
                -------------+----------------------------------   F(3, 996)       =    425.18
                       Model |  1286.91173         3  428.970575   Prob > F        =    0.0000
                    Residual |  1004.87019       996  1.00890581   R-squared       =    0.5615
                -------------+----------------------------------   Adj R-squared   =    0.5602
                       Total |  2291.78192       999  2.29407599   Root MSE        =    1.0044
                
                ------------------------------------------------------------------------------
                           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                -------------+----------------------------------------------------------------
                           x |
                          1  |  -1.001929   .0886142   -11.31   0.000    -1.175821   -.8280369
                          3  |   .0245305   .0911161     0.27   0.788     -.154271     .203332
                          4  |   2.085322   .0892179    23.37   0.000     1.910245    2.260399
                             |
                       _cons |   1.961651   .0630244    31.13   0.000     1.837975    2.085327
                ------------------------------------------------------------------------------
                Let's first focus on the coefficients. Actually, you have to think in terms of "contrasts": there is a reference category because we can't estimate all parameters. And that's because the sum of all four dummies is 1, the constant regressor, whose coefficient is already estimated by the constant coefficient. You can't have both the constant and all 4 regressors. So one of them is chosen as reference, which means its coefficient is artificially chosen to be 0. But that really means the other coefficients are estimated as differences from the reference.

                So, in the first regression, the coefficient 1.001929 on _b[2.x] (I use Stata syntax here) is really the difference of coefficients _b[2.x]-_b[1.x] (where _b[1.x] is taken to be 0). In the second regression, the coefficient _b[1.x] is really the difference _b[1.x]-_b[2.x] (where _b[2.x] is taken to be 0). Since the two regressions are really the same apart from the change of reference, the difference _b[2.x]-_b[1.x] is the same, and you end up with coefficients of opposite sign.
                Not all coefficients are opposites: for _b[3.x], you have _b[3.x]-_b[1.x] in the first regression and _b[3.x]-_b[2.x] in the second. But from the first regression you can compute _b[3.x]-_b[2.x]=1.026459-1.001929=.0245305, the same coefficient as in the second regression.

                So, basically, you can estimate differences of coefficients, not the coefficients themselves. You can estimate any linear combination a1 _b[1.x] + a2 _b[2.x] +a3 _b[3.x] +a4 _b[4.x] where a1+a2+a3+a4=0 (the linear combinations are called contrasts).

                For instance, 2-3+4-3=0 and 2 _b[1.x] - 3 _b[2.x] + 4 _b[3.x] - 3 _b[4.x] = -3(_b[2.x]-_b[1.x]) + 4(_b[3.x]-_b[1.x]) - 3(_b[4.x]-_b[1.x]), and each term can be estimated using the coefficients of the first regression. Or you can write it 2(_b[1.x]-_b[2.x])+4(_b[3.x]-_b[2.x])-3(_b[4.x]-_b[2.x]) and estimate using the second regression. Stata accepts the initial linear combination and the following result is the same with both regressions:
                Code:
                lincom 2*1.x-3*2.x+4*3.x-3*4.x
                
                ------------------------------------------------------------------------------
                           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                -------------+----------------------------------------------------------------
                         (1) |  -8.161702   .3955266   -20.64   0.000    -8.937863   -7.385541
                ------------------------------------------------------------------------------
                So, what do I mean when I write that, in the first regression, _b[2.x] is really _b[2.x]-_b[1.x]? You can check with "lincom 2.x-1.x", which returns the same value as the regression output for 2.x. That's because we can only estimate contrasts.
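                In code, after the first regression:

                Code:
                lincom 2.x - 1.x   // returns 1.001929, the coefficient reported for 2.x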

                Now, the p-values. Each test applies to a difference of coefficients, so we are really testing whether two coefficients are significantly far apart. In the first regression, _b[2.x], _b[3.x], and _b[4.x] are all significantly different from _b[1.x] (i.e. the differences are significant), so all p-values are small.

                In the second regression, I wanted to show what happens when one difference is small (I arranged for _b[3.x] to be close to _b[2.x]), and indeed the p-value for _b[3.x] (which is really the p-value for _b[3.x]-_b[2.x]) is large.

                A point worth noting: if you compute contrasts with lincom, the result will not depend on the reference you have chosen (same coefficient, same p-value). If you compute a linear combination which is not a contrast, the result will depend on the reference. There is a twist here, as Stata can only ever estimate contrasts: for instance, if you want "lincom 3.x+4.x", you will really get (4.x-1.x)+(3.x-1.x) or (4.x-2.x)+(3.x-2.x) which depend on the reference (x=1 or x=2).
                Last edited by Jean-Claude Arbaut; 18 Jul 2019, 02:56.



                • #9
                  I would just add to Jean-Claude Arbaut's excellent discussion the observation that when the reference category changes, the model's predictions, that is, the values produced by -predict-, do not change. Nor is there any change in R2, nor root mean squared error. In models estimated by maximum likelihood, the log-likelihood also remains unchanged. And coefficients of the variables other than x itself do not change (unless there are interaction terms).
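                  A quick way to check that on the simulated data above:

                  Code:
                  quietly reg y i.x
                  predict yhat1
                  quietly reg y ib2.x
                  predict yhat2
                  assert abs(yhat1 - yhat2) < 1e-6   // fitted values are identical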



                  • #10
                    Jean-Claude Arbaut, Clyde Schechter: Thank you very much. Now I got it.
