Categorical Dummy Variable Coding

Ella Ki

Join Date: Mar 2017

Posts: 39
#1


05 May 2017, 10:55

Hi,

Say I am trying to see the effects of being from a different ethnicity on wages.

In my initial dataset ethnicities are coded from 1-4; white as 1, black as 2, indian as 3, and other as 4.

If I drop all missing values for ethnicity, and then create dummies manually using the code:

gen black = (race==2)
gen indian = (race==3)
gen other = (race==4)

And then run a regression of wage on these, would the coefficient on, e.g., indian be interpreted relative to the base group of white only, or relative to white, black and other together?

Similarly, if I include age in the regression, would the interpretation for that coefficient relate to someone white?

I understand these questions might seem very basic but I want to check I have coded my dummies correctly to interpret them against the base group of only white rather than the base group of all other races.

Finally, if I included other categorical dummies in the equation, such as married and divorced with never married as the base group, would the coefficient on "Indian" be compared to a never married white individual or a white individual with the same marital status?

Last edited by Ella Ki; 05 May 2017, 11:09.

Carlo Lazzaro

Join Date: Apr 2014
Posts: 17700

05 May 2017, 11:14

Ella:
A1) race: the coefficient on each dummy is interpreted against the reference category only (white, in your case), not against all the other groups pooled;
A2) age: no; the age coefficient is adjusted for race and applies to every race alike.
As far as categorical predictors are concerned, you may want to add the -baselevel- option to your regression so that the output shows which level Stata treats as the reference, as in the following example:

Code:

. sysuse auto.dta
(1978 Automobile Data)

. regress price i.foreign mpg, baselevel

      Source |       SS           df       MS      Number of obs   =        74
-------------+----------------------------------   F(2, 71)        =     14.07
       Model |   180261702         2  90130850.8   Prob > F        =    0.0000
    Residual |   454803695        71  6405685.84   R-squared       =    0.2838
-------------+----------------------------------   Adj R-squared   =    0.2637
       Total |   635065396        73  8699525.97   Root MSE        =    2530.9

------------------------------------------------------------------------------
       price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     foreign |
   Domestic  |          0  (base)
    Foreign  |   1767.292    700.158     2.52   0.014     371.2169    3163.368
             |
         mpg |  -294.1955   55.69172    -5.28   0.000    -405.2417   -183.1494
       _cons |   11905.42   1158.634    10.28   0.000     9595.164    14215.67
------------------------------------------------------------------------------

Another useful drill (for me, at least) was to call -predict- and then calculate the predicted values by hand for one observation in each group, just to check that I did it right and to get familiar with the -regress- machinery.
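To connect this with the dummies in #1, here is a minimal sketch that is not taken from the thread. It assumes the -nlsw88- dataset shipped with Stata as a stand-in for the wage data (its -race- has three levels, white=1, black=2, other=3, rather than four), and the scalar name is made up for illustration. Hand-made dummies that leave white out should reproduce the coefficients from -ib1.race-, which is why each one reads as a contrast with white only.

```stata
* Assumption: nlsw88 (shipped with Stata) stands in for the poster's data
sysuse nlsw88, clear

* Hand-made dummies; white (race==1) is the left-out base group
generate byte black = (race == 2) if !missing(race)
generate byte other = (race == 3) if !missing(race)

* Regression on the manual dummies
regress wage black other age
scalar b_black_manual = _b[black]

* Same model in factor-variable notation, base level shown explicitly
regress wage ib1.race age, baselevels

* The two coefficients for black should agree
display "manual dummy: " b_black_manual
display "ib1.race:     " _b[2.race]
```

The -if !missing(race)- qualifier matters only when missing values have not been dropped beforehand: without it, (race==2) evaluates to 0 for a missing race and silently moves those observations into the base group.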

Kind regards,
Carlo
(Stata 19.0)


Ella Ki

Join Date: Mar 2017

Posts: 39
#3

05 May 2017, 11:18

Thanks a lot Carlo!

Does this mean that age would have the same interpretation whether wage is regressed on age alone or on age and the race variables together?

And, if I included other categorical dummies in the equation, such as married and divorced with never married as the base group, would the coefficient on "Indian" be compared to a never married white individual or a white individual with the same marital status?

Carlo Lazzaro

Join Date: Apr 2014
Posts: 17700

05 May 2017, 11:35

Ella:
the best approach would be to follow the FAQ and post what you typed and what Stata gave you back. Thanks.
That said, if you specify a different regression model, the coefficients will differ too, as each one is adjusted for the remaining predictors.
Sticking with my previous example:
- the constant (US$11905.42) is the predicted price of a domestic car with -mpg-=0 (that does not exist, by the way);
- a foreign car costs US$1767.292 more than a domestic car with the same -mpg-;
- each additional unit of -mpg- lowers the predicted price by US$294.1955, for domestic and foreign cars alike.
You can check these interpretations using -predict- and recalculating everything by hand:
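On the question of whether a coefficient keeps the same interpretation across models, a small sketch with the same -auto- data may help. The scalar names are hypothetical; the only point is that the -mpg- coefficient from a model without -foreign- is unadjusted, while the one from the model above is adjusted for origin, so the two need not coincide.

```stata
sysuse auto, clear

* mpg on its own: unadjusted association with price
regress price mpg
scalar b_mpg_alone = _b[mpg]

* mpg adjusted for origin (the model used in this thread)
regress price i.foreign mpg
scalar b_mpg_adj = _b[mpg]

display "mpg alone:               " b_mpg_alone
display "mpg adjusted for origin: " b_mpg_adj
```

The same logic carries over to age with and without the race dummies: the coefficient still describes a one-year difference in age, but in the larger model it is the difference among people of the same race.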

Code:

. predict predict, xb


. bysort foreign: list foreign predict if _n==1

-----------------------------------------------------------------------------------------------------------------
-> foreign = Domestic

     +---------------------+
     |  foreign    predict |
     |---------------------|
  1. | Domestic   5433.114 |
     +---------------------+

-----------------------------------------------------------------------------------------------------------------
-> foreign = Foreign

     +--------------------+
     | foreign    predict |
     |--------------------|
  1. | Foreign   8671.384 |
     +--------------------+


. mat list e(b)

e(b)[1,4]
            0b.          1.                       
       foreign     foreign         mpg       _cons
y1           0   1767.2922  -294.19553   11905.415

. bysort foreign: list foreign predict mpg if _n==1

-----------------------------------------------------------------------------------------------------------------
-> foreign = Domestic

     +---------------------------+
     |  foreign    predict   mpg |
     |---------------------------|
  1. | Domestic   5433.114    22 |
     +---------------------------+

-----------------------------------------------------------------------------------------------------------------
-> foreign = Foreign

     +--------------------------+
     | foreign    predict   mpg |
     |--------------------------|
  1. | Foreign   8671.384    17 |
     +--------------------------+


. di _b[_cons] + _b[mpg]*22
5433.1136


. di _b[_cons] + _b[1.foreign] + _b[mpg]*17
8671.3835

.
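The same hand calculation can be used for the marital-status question. What follows is a sketch under stated assumptions, not the poster's model: it uses the -nlsw88- dataset shipped with Stata, where -married- is a 0/1 indicator rather than a three-level variable, and the scalar names and the age of 40 are arbitrary. In a model without interactions, the predicted gap between two race groups is the race coefficient whatever the marital status, so the comparison is with a white individual of the same marital status and age.

```stata
* Assumption: nlsw88 stands in for the wage data; married is 0/1 here
sysuse nlsw88, clear
regress wage ib1.race i.married age

* Predicted gap, black vs white, both single, both aged 40
scalar gap_single  = (_b[_cons] + _b[2.race] + _b[age]*40) ///
                   - (_b[_cons] + _b[age]*40)

* Predicted gap, black vs white, both married, both aged 40
scalar gap_married = (_b[_cons] + _b[2.race] + _b[1.married] + _b[age]*40) ///
                   - (_b[_cons] + _b[1.married] + _b[age]*40)

display "gap among single:  " gap_single
display "gap among married: " gap_married
display "_b[2.race]:        " _b[2.race]
```

If the race gap is expected to differ by marital status, an interaction such as i.race##i.married would be needed, and then the race coefficient alone would refer to the base marital status only.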

Kind regards,
Carlo
(Stata 19.0)


Takele Ayele

Join Date: Jun 2020

Posts: 3
#5

04 Jun 2020, 02:24

Hi everyone,

I am trying to see whether marital status affects households' enrollment in a scheme known as Community Based Health Insurance.
I have coded the variable marital status as:
Single=1
Married=2
divorced=3
widowed=4
I am planning to use Married as Reference group, so how can I code this categorical variable in Stata 16? I would appreciate any answers.

Thanks,
Rich Goldstein

Join Date: Mar 2014

Posts: 4458
#6

04 Jun 2020, 02:34

It appears to be coded already, so it is not clear what your question is; however, use it as follows in your model specification:

Code:

ib2.marstat

Replace marstat in the above with whatever the actual variable name is (it cannot be "Marital status" because Stata does not allow spaces in variable names); see

Code:

help fvvarlist
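For what it is worth, here is a minimal sketch of the ib2. prefix in use. Everything in it is invented for illustration: the data are simulated, and -marstat- and -enrolled- are hypothetical names, as is the choice of -logit-. It also shows -fvset base- as an alternative that stores the base level with the variable, so that a plain i.marstat picks it up.

```stata
* Simulated toy data; variable names are hypothetical
clear
set seed 12345
set obs 200
generate byte marstat = ceil(4*runiform())
label define marstat 1 "single" 2 "married" 3 "divorced" 4 "widowed"
label values marstat marstat
generate byte enrolled = runiform() < 0.5

* Married (level 2) as the reference, chosen in the model statement
logit enrolled ib2.marstat, baselevels
scalar b_single_ib2 = _b[1.marstat]

* Alternative: record level 2 as the default base for this variable
fvset base 2 marstat
logit enrolled i.marstat, baselevels
display "single vs married, ib2.:   " b_single_ib2
display "single vs married, fvset:  " _b[1.marstat]
```

Either way the stored codes 1 to 4 stay untouched; only the comparison group in the output changes.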
Takele Ayele

Join Date: Jun 2020

Posts: 3
#7

04 Jun 2020, 02:49

The codes single=1, married=2, divorced=3 and widowed=4 are the codes that I used in the questionnaire to collect the data. My question is how I can enter these data into Stata, using Married as the reference group?

Carlo Lazzaro

Join Date: Apr 2014
Posts: 17700

04 Jun 2020, 02:58

Takele:
although it is advisable to start from 0 rather than 1 when numbering the levels of your categorical variables, what you're after is easy to achieve by exploiting the capabilities of -fvvarlist- notation, as you can see in the following toy example:

Code:

. set obs 4
number of observations (_N) was 0, now 4

. g A=_n*2

. g Marital_Status=_n

. label define Marital_Status 1 "single" 2 "married" 3 "divorced" 4 "widowed"

. label val Marital_Status Marital_Status

. poisson A ib2.Marital_Status

Iteration 0:   log likelihood = -6.7380416 
Iteration 1:   log likelihood = -6.7374942 
Iteration 2:   log likelihood = -6.7374942 

Poisson regression                              Number of obs     =          4
                                                LR chi2(3)        =       4.26
                                                Prob > chi2       =     0.2350
Log likelihood = -6.7374942                     Pseudo R2         =     0.2401

--------------------------------------------------------------------------------
             A |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
---------------+----------------------------------------------------------------
Marital_Status |
       single  |  -.6931472   .8660254    -0.80   0.423    -2.390526    1.004231
     divorced  |   .4054651   .6454972     0.63   0.530    -.8596862    1.670616
      widowed  |   .6931472   .6123724     1.13   0.258    -.5070807    1.893375
               |
         _cons |   1.386294         .5     2.77   0.006     .4063124    2.366276
--------------------------------------------------------------------------------

.

Sidelight: the -poisson- regression above is probably far from being correctly specified.
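On the data-entry side of the question, a possible sketch follows. The eight observations, the variable names -marstat- and -enrolled-, and the label name -marstat_lbl- are all made up, and -regress- is used only to keep the toy example well behaved. The questionnaire codes are typed in exactly as collected; the reference group is not part of the data and is picked only when the model is fitted.

```stata
* Made-up questionnaire data, entered with the original codes
clear
input byte marstat byte enrolled
1 0
2 1
2 1
3 0
4 1
2 0
1 1
3 1
end

* Attach value labels so the output shows names instead of numbers
label define marstat_lbl 1 "single" 2 "married" 3 "divorced" 4 "widowed"
label values marstat marstat_lbl
tabulate marstat enrolled

* Reference group chosen at estimation time: married (code 2)
* Linear probability model, for illustration only
regress enrolled ib2.marstat, baselevels

* With a single categorical predictor, the constant is the mean of
* enrolled among the married, and each coefficient is a difference from it
display _b[_cons]
display _b[1.marstat]
```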

Kind regards,
Carlo
(Stata 19.0)
