Controlling for gender in linear regression

Anne Todd

Join Date: Dec 2018

Posts: 163
#1

13 Dec 2018, 18:09

Hello Statalist! This is my first ever post, and I am newly learning Stata as a somewhat-competent user of SPSS.

I am hoping to figure out the best way to 'control' for race in a linear regression, where I am trying to predict GPA, with my independent variables being resil, motiv, work, debat, creat (all measures of non-cognitive skills, essentially). I've recoded income and also recoded gender where 0 represents male and 1 represents female.

If someone could please help me with the code for the best way to handle my race variable (W is White, H is Hispanic, and so on), and then for the ensuing regression, I would be eternally grateful! Thank you.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input double gpa str1 race byte(resil motiv work debat creat income gender1)
2.07 "B" 3 1 1 1 1 14 0
3.03 "H" 1 1 1 1 1 4 0
3.07 "H" 1 1 5 5 5 5 1
2.66 "H" 5 1 1 3 1 13 0
3.00 "A" 3 4 1 2 1 9 1
3.04 "B" 5 3 1 1 5 14 1
3.02 "A" 1 1 4 1 1 14 1
2.04 "B" 4 1 1 1 1 5 0
3.00 "H" 4 4 1 1 3 14 0
3.02 "W" 4 1 1 1 1 4 0
3.42 "H" 1 4 1 3 5 13 1
3.66 "H" 5 3 1 4 3 7 1
3.02 "W" 4 1 1 1 1 14 0
3.04 "A" 1 2 5 1 1 14 1
3.26 "H" 5 1 1 1 1 8 1
3.45 "A" 5 1 5 1 3 14 1
2.03 "W" 3 1 1 1 4 14 1
3.01 "A" 5 1 4 1 2 9 0
end

Nick Cox

Join Date: Mar 2014
Posts: 35711
#2

13 Dec 2018, 18:55

Thanks for the data example.

race is string, so map to numeric and then ask for it to be handled as a bunch of indicator (some say dummy) variables.

Code:

. encode race, gen(n_race) 

. 
. tab race n_race 

           |                   n_race
      race |         A          B          H          W |     Total
-----------+--------------------------------------------+----------
         A |         5          0          0          0 |         5 
         B |         0          3          0          0 |         3 
         H |         0          0          7          0 |         7 
         W |         0          0          0          3 |         3 
-----------+--------------------------------------------+----------
     Total |         5          3          7          3 |        18 

. 
. regress gpa i.gender1 i.n_race 

      Source |       SS           df       MS      Number of obs   =        18
-------------+----------------------------------   F(4, 13)        =      3.07
       Model |  1.77898265         4  .444745661   Prob > F        =    0.0550
    Residual |   1.8816618        13  .144743215   R-squared       =    0.4860
-------------+----------------------------------   Adj R-squared   =    0.3278
       Total |  3.66064444        17  .215332026   Root MSE        =    .38045

------------------------------------------------------------------------------
         gpa |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   1.gender1 |   .2266584   .1939559     1.17   0.264    -.1923579    .6456747
             |
      n_race |
          B  |  -.6148927   .2922139    -2.10   0.055    -1.246182    .0163969
          H  |   .1049505   .2271379     0.46   0.652    -.3857512    .5956522
          W  |  -.3082261   .2922139    -1.05   0.311    -.9395157    .3230636
             |
       _cons |   2.922673    .230271    12.69   0.000     2.425203    3.420144
------------------------------------------------------------------------------

Here "A" is mapped to 1, "B" to 2, and so forth, and so "A" defines the base but

Code:

help fvvarlist

shows that that is just the default, which you can override.

Code:

. regress gpa i.gender1 ib3.n_race 

      Source |       SS           df       MS      Number of obs   =        18
-------------+----------------------------------   F(4, 13)        =      3.07
       Model |  1.77898265         4  .444745661   Prob > F        =    0.0550
    Residual |   1.8816618        13  .144743215   R-squared       =    0.4860
-------------+----------------------------------   Adj R-squared   =    0.3278
       Total |  3.66064444        17  .215332026   Root MSE        =    .38045

------------------------------------------------------------------------------
         gpa |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
   1.gender1 |   .2266584   .1939559     1.17   0.264    -.1923579    .6456747
             |
      n_race |
          A  |  -.1049505   .2271379    -0.46   0.652    -.5956522    .3857512
          B  |  -.7198432   .2665669    -2.70   0.018    -1.295726   -.1439604
          W  |  -.4131766   .2665669    -1.55   0.145    -.9890594    .1627063
             |
       _cons |   3.027624   .1815525    16.68   0.000     2.635403    3.419844
------------------------------------------------------------------------------
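A minimal sketch of other ways to choose the base category, assuming the -dataex- example from #1 is in memory and n_race was created with -encode- as above. -label list- shows which integer sits behind each label, which is what the 3 in ib3. refers to. The ib(last) and ib(freq) operators and -fvset base- are described in -help fvvarlist- and -help fvset-.

```stata
* Sketch only: assumes the -dataex- example from #1 is loaded
* and that n_race was created by -encode race, gen(n_race)-.
label list n_race                        // integer code behind each label

regress gpa i.gender1 ib(last).n_race    // highest code (here W) as base
regress gpa i.gender1 ib(freq).n_race    // most frequent category (here H) as base

fvset base 2 n_race                      // store code 2 (here B) as the default base
regress gpa i.gender1 i.n_race           // i. now uses the stored base
```

Whichever base is chosen, the fitted model is the same; only the comparison that each race coefficient expresses changes.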


Anne Todd

Join Date: Dec 2018

Posts: 163
#3

13 Dec 2018, 19:11

Wow, thank you Nick! I have two quick follow-up questions, if you don't mind:

Can you explain the "i" in your code "i.gender1 ib3.n_race"?

Given this, can you show me what the most appropriate (or, perhaps most efficient?) syntax would be for when I regress the non-cognitive skill variables and include race now? From what I understand, a simple linear regression would just be:

* regress gpa resil motiv work debat creat

And then to add in the income and gender variables I created, it would be this:

* regress gpa resil motiv work debat creat income gender1

And then what exactly gets included, given what you've done above, for race?

Thank you again!

Clyde Schechter

Join Date: Apr 2014

Posts: 30116
#4

13 Dec 2018, 21:32

The i. prefixes are factor variable notation. Read -help fvvarlist- and also click on the link near the top of that page to the corresponding part of the PDF documentation for a full explanation. Briefly, it tells Stata that the variable is a categorical variable and that levels of it, except for one, should be represented in the regression by a series of indicator ("dummy") variables.

For your model with additional variables, you still need the i.n_race and i.gender1. Looking over your example data, it seems that the resil-creat variables take on a relatively small number of integer values, 1 through 5. If you want those to be treated as if they were continuous variables, then just leave them as is. If you want them to be treated as categorical variables, then they, too, would get an i. prefix. Your income variable appears to range from 4 to 14 in your example data. I imagine that such a variable is either categorical or ordinal. If ordinal and you think that the ordinal categories are about equally spaced, then just enter it as income. If it is better thought of as categorical, then again, it would get an i. prefix. If you think it is ordinal but should not be treated as equally spaced, well, that gets complicated.
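As a sketch of the two choices just described, assuming the example data from #1 are in memory and n_race was created with -encode-: version (a) treats the skill measures and income as continuous, and version (b) treats one skill measure as categorical. With only 18 observations the estimates mean little; the point is the syntax.

```stata
* Sketch only: assumes the -dataex- example from #1 is loaded
* and that n_race was created by -encode race, gen(n_race)-.

* (a) skill measures and income entered as continuous predictors
regress gpa resil motiv work debat creat income i.gender1 i.n_race

* (b) same model, but resil handled as a set of indicator variables
regress gpa i.resil motiv work debat creat income i.gender1 i.n_race
```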

Added: I would not name my variable gender1. If you have to come back to this project a year from now to answer questions about what you did, will you remember that it was 0 = male and 1 = female, not the other way around? It would be more sensible to name the variable female. That way the name itself tells you that female is 1 and male is 0.
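A short sketch of that renaming, assuming gender1 is coded 0 = male and 1 = female as stated in #1. The variable label text is only a suggestion.

```stata
* Sketch only: assumes gender1 exists and is coded 0 = male, 1 = female
rename gender1 female
label variable female "1 = female, 0 = male"

* stop with an error if any non-missing value is not 0 or 1
assert inlist(female, 0, 1) if !missing(female)
```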

Nick Cox

Join Date: Mar 2014

Posts: 35711
#5

14 Dec 2018, 01:57

Clyde and I have in press for Stata Journal 19(1) a paper surveying creation of indicator variables. The last paragraph in #4 covers one of several points made in the paper.

Anne Todd

Join Date: Dec 2018
Posts: 163
#6

14 Dec 2018, 07:08

Thank you Clyde, I took your advice and renamed the variable female. That explanation makes sense to me. I think, given what you've explained about income, I've figured out the best way to handle the income and resil-creat variables. However, I still may need a little help with troubleshooting n_race, given the output when I put it into my larger dataset:

Code:

tab race n_race

           |                   n_race
      race |      Asian      Black   Hispanic      White |     Total
-----------+--------------------------------------------+----------
     Asian |        720          0          0          0 |       720
     Black |          0        797          0          0 |       797
  Hispanic |          0          0        695          0 |       695
     White |          0          0          0        731 |       731
-----------+--------------------------------------------+----------
     Total |        720        797        695        731 |     2,943

Code:

regress gpa resil motiv work debat creat i.female parent_income ib3.n_race
note: 3b.n_race identifies no observations in the sample
note: 8.n_race omitted because of collinearity

      Source |       SS           df       MS      Number of obs   =     1,790
-------------+----------------------------------   F(10, 1779)     =     19.42
       Model |  48.3825101        10  4.83825101   Prob > F        =    0.0000
    Residual |  443.217626     1,779  .249138632   R-squared       =    0.0984
-------------+----------------------------------   Adj R-squared   =    0.0934
       Total |  491.600136     1,789  .274790462   Root MSE        =    .49914

-------------------------------------------------------------------------------
          gpa |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
        resil |  -.0248507   .0070009    -3.55   0.000    -.0385815   -.0111198
        motiv |     .01498   .0087332     1.72   0.086    -.0021483    .0321084
         work |   .0107036   .0073424     1.46   0.145     -.003697    .0251042
        debat |   .0175461   .0096118     1.83   0.068    -.0013055    .0363978
        creat |  -.0025623   .0077168    -0.33   0.740    -.0176974    .0125727
     1.female |   .0250477   .0245173     1.02   0.307    -.0230381    .0731334
parent_income |    .018109   .0060302     3.00   0.003      .006282     .029936
              |
       n_race |
           H  |          0  (empty)
       Asian  |  -.0556561   .0334713    -1.66   0.097    -.1213033    .0099911
       Black  |  -.3686887    .033573   -10.98   0.000    -.4345353   -.3028421
    Hispanic  |  -.1637888   .0343016    -4.77   0.000    -.2310646   -.0965131
       White  |          0  (omitted)
              |
        _cons |   2.942213   .0905353    32.50   0.000     2.764646    3.119779
-------------------------------------------------------------------------------

Last edited by Anne Todd; 14 Dec 2018, 07:10.


Richard Williams

Join Date: Apr 2014

Posts: 5008
#7

14 Dec 2018, 07:26

Clyde Schechter wrote in #4: "Added: I would not name my variable gender1. If you have to come back to this project a year from now to answer questions about what you did, will you remember that it was 0 = male and 1 = female, not the other way around? It would be more sensible to name the variable female. That way the name itself tells you that female is 1 and male is 0."

I will second that. All too often I have had students give me output with a variable called gender and they don't know if a positive coefficient means men score higher or women score higher.

Some people can still get confused though. I once had a student who couldn't understand that, if a variable was called female, it didn't mean that the respondent had to be female.

Adding value labels also often helps, especially if a categorical variable has more than 2 categories. If race has 5 categories, is race 1 white, black, Asian, or what? If you use factor variable notation the labels will usually appear in the output.
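A minimal sketch of attaching descriptive value labels, assuming the example data from #1 are in memory and n_race was created with -encode race, gen(n_race)-, so that codes 1 to 4 stand for A, B, H, and W. The label name race_lbl is a hypothetical choice.

```stata
* Sketch only: assumes n_race is numeric with 1 = A, 2 = B, 3 = H, 4 = W
label define race_lbl 1 "Asian" 2 "Black" 3 "Hispanic" 4 "White"
label values n_race race_lbl        // replace the one-letter labels

tab n_race                          // rows now show the full names
regress gpa ib3.n_race              // output rows show Asian, Black, White
```

Because only the label changes and the underlying codes stay at 1 to 4, ib3. still refers to the Hispanic category.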

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
WWW: https://www3.nd.edu/~rwilliam

Anne Todd

Join Date: Dec 2018

Posts: 163
#8

14 Dec 2018, 07:37

Thank you all so much.

Richard, I think your point actually helps me with the error I point to in my latest post. I had tried to rename the values of my race variable so that "A" said "Asian", "H" said "Hispanic", and so on, using this:

Code:

gen race1 = ""
replace race1 = "Asian" if race == "A"
replace race1 = "Black" if race == "B"
replace race1 = "Hispanic" if race == "H"
replace race1 = "White" if race == "W"

And that gave me the issue cited above. But, when I go back and undo this and leave the "A" and "H", etc., as is, then the regression seems to work out nicely.
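One plausible reading of the notes in the earlier output, offered as an assumption because the thread does not show the commands that recreated n_race: by default -encode- names the value label after the new variable and, per -help encode-, adds to a label of that name if one already exists. If the old label n_race (A = 1 through W = 4) was still defined when race1 was encoded, the new strings would have received new codes (consistent with the 8.n_race in the note), leaving code 3 ("H") with no observations. The toy data below are made up to reproduce that pattern and show one way to start clean.

```stata
* Toy reproduction (hypothetical data, not the poster's dataset)
clear
input str1 race
"A"
"B"
"H"
"W"
end
encode race, gen(n_race)            // value label n_race: A=1 B=2 H=3 W=4

gen race1 = ""
replace race1 = "Asian"    if race == "A"
replace race1 = "Black"    if race == "B"
replace race1 = "Hispanic" if race == "H"
replace race1 = "White"    if race == "W"

drop n_race                         // drops the variable; value label n_race remains
encode race1, gen(n_race)           // existing label is reused and extended
label list n_race                   // old A/B/H/W entries plus the new names

* Starting clean: remove the variable and its label, then encode again
drop n_race
label drop n_race
encode race1, gen(n_race)           // Asian=1 Black=2 Hispanic=3 White=4
label list n_race
```

After the clean re-encode, ib3.n_race would again pick Hispanic as the base. Encoding into a new variable name, or using the label() option with a fresh label name, should avoid the clash as well.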