  • Controlling for gender in linear regression

    Hello Statalist! This is my first ever post, and I am newly learning Stata as a somewhat-competent user of SPSS.

    I am hoping to figure out the best way to 'control' for race in a linear regression, where I am trying to predict GPA, with my independent variables being resil, motiv, work, debat, creat (all measures of non-cognitive skills, essentially). I've recoded income and also recoded gender where 0 represents male and 1 represents female.

    If someone could please help me with the code for the best way to handle my race variable (W is white, H is Hispanic, and so on), and then for the ensuing regression, I would be eternally grateful! Thank you.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input double gpa str1 race byte(resil motiv work debat creat income gender1)
    2.07 "B" 3 1 1 1 1 14 0
    3.03 "H" 1 1 1 1 1 4 0
    3.07 "H" 1 1 5 5 5 5 1
    2.66 "H" 5 1 1 3 1 13 0
    3.00 "A" 3 4 1 2 1 9 1
    3.04 "B" 5 3 1 1 5 14 1
    3.02 "A" 1 1 4 1 1 14 1
    2.04 "B" 4 1 1 1 1 5 0
    3.00 "H" 4 4 1 1 3 14 0
    3.02 "W" 4 1 1 1 1 4 0
    3.42 "H" 1 4 1 3 5 13 1
    3.66 "H" 5 3 1 4 3 7 1
    3.02 "W" 4 1 1 1 1 14 0
    3.04 "A" 1 2 5 1 1 14 1
    3.26 "H" 5 1 1 1 1 8 1
    3.45 "A" 5 1 5 1 3 14 1
    2.03 "W" 3 1 1 1 4 14 1
    3.01 "A" 5 1 4 1 2 9 0
    end

  • #2
    Thanks for the data example.

    race is string, so map to numeric and then ask for it to be handled as a bunch of indicator (some say dummy) variables.

    Code:
    . encode race, gen(n_race) 
    
    . 
    . tab race n_race 
    
               |                   n_race
          race |         A          B          H          W |     Total
    -----------+--------------------------------------------+----------
             A |         5          0          0          0 |         5 
             B |         0          3          0          0 |         3 
             H |         0          0          7          0 |         7 
             W |         0          0          0          3 |         3 
    -----------+--------------------------------------------+----------
         Total |         5          3          7          3 |        18 
    
    . 
    . regress gpa i.gender1 i.n_race 
    
          Source |       SS           df       MS      Number of obs   =        18
    -------------+----------------------------------   F(4, 13)        =      3.07
           Model |  1.77898265         4  .444745661   Prob > F        =    0.0550
        Residual |   1.8816618        13  .144743215   R-squared       =    0.4860
    -------------+----------------------------------   Adj R-squared   =    0.3278
           Total |  3.66064444        17  .215332026   Root MSE        =    .38045
    
    ------------------------------------------------------------------------------
             gpa |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
       1.gender1 |   .2266584   .1939559     1.17   0.264    -.1923579    .6456747
                 |
          n_race |
              B  |  -.6148927   .2922139    -2.10   0.055    -1.246182    .0163969
              H  |   .1049505   .2271379     0.46   0.652    -.3857512    .5956522
              W  |  -.3082261   .2922139    -1.05   0.311    -.9395157    .3230636
                 |
           _cons |   2.922673    .230271    12.69   0.000     2.425203    3.420144
    ------------------------------------------------------------------------------
    Here "A" is mapped to 1, "B" to 2, and so forth, and so "A" defines the base but

    Code:
    help fvvarlist
    shows that that is just the default, which you can override.

    Code:
    . regress gpa i.gender1 ib3.n_race 
    
          Source |       SS           df       MS      Number of obs   =        18
    -------------+----------------------------------   F(4, 13)        =      3.07
           Model |  1.77898265         4  .444745661   Prob > F        =    0.0550
        Residual |   1.8816618        13  .144743215   R-squared       =    0.4860
    -------------+----------------------------------   Adj R-squared   =    0.3278
           Total |  3.66064444        17  .215332026   Root MSE        =    .38045
    
    ------------------------------------------------------------------------------
             gpa |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
       1.gender1 |   .2266584   .1939559     1.17   0.264    -.1923579    .6456747
                 |
          n_race |
              A  |  -.1049505   .2271379    -0.46   0.652    -.5956522    .3857512
              B  |  -.7198432   .2665669    -2.70   0.018    -1.295726   -.1439604
              W  |  -.4131766   .2665669    -1.55   0.145    -.9890594    .1627063
                 |
           _cons |   3.027624   .1815525    16.68   0.000     2.635403    3.419844
    ------------------------------------------------------------------------------
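
    A minimal sketch of how one might check which integer -encode- assigned to each category and try other ways of choosing the base level. It assumes the example data from #1 (only the variables needed here are re-entered) and uses the ib(last). and ib(freq). operators documented in -help fvvarlist-. Which base is most useful depends on the comparison you care about.

    ```stata
    * Re-enter the gpa, race and gender1 columns of the example data from #1
    clear
    input double gpa str1 race byte gender1
    2.07 "B" 0
    3.03 "H" 0
    3.07 "H" 1
    2.66 "H" 0
    3.00 "A" 1
    3.04 "B" 1
    3.02 "A" 1
    2.04 "B" 0
    3.00 "H" 0
    3.02 "W" 0
    3.42 "H" 1
    3.66 "H" 1
    3.02 "W" 0
    3.04 "A" 1
    3.26 "H" 1
    3.45 "A" 1
    2.03 "W" 1
    3.01 "A" 0
    end

    encode race, gen(n_race)

    * Show the integer code behind each category (encode sorts alphabetically)
    label list n_race

    * Use the last level ("W" here) as the base
    regress gpa i.gender1 ib(last).n_race

    * Use the most frequent level ("H" in these data) as the base
    regress gpa i.gender1 ib(freq).n_race
    ```

    Changing the base only changes which category the others are compared with; the fit of the model is the same.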



    • #3
      Wow, thank you Nick! I have two quick follow-up questions, if you don't mind:

      Can you explain the "i" in your code "i.gender1 ib3.n_race"?

      Given this, can you show me what the most appropriate (or, perhaps most efficient?) syntax would be for when I regress the non-cognitive skill variables and include race now? From what I understand, a simple linear regression would just be:

      * regress gpa resil motiv work debat creat

      And then to add in the income and gender variables I created, it would be this:

      * regress gpa resil motiv work debat creat income gender1

      And then what exactly gets included, given what you've done above, for race?

      Thank you again!



      • #4
        The i. prefixes are factor variable notation. Read -help fvvarlist- and also click on the link near the top of that page to the corresponding part of the PDF documentation for a full explanation. Briefly, it tells Stata that the variable is a categorical variable and that levels of it, except for one, should be represented in the regression by a series of indicator ("dummy") variables.
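
        To make the description above concrete, here is a small sketch, assuming the example data from #1, that fits the same model twice: once with i. notation and once with indicators built by hand. The prefix r_ and the scalar names are made up for this illustration.

        ```stata
        * Re-enter the gpa, race and gender1 columns of the example data from #1
        clear
        input double gpa str1 race byte gender1
        2.07 "B" 0
        3.03 "H" 0
        3.07 "H" 1
        2.66 "H" 0
        3.00 "A" 1
        3.04 "B" 1
        3.02 "A" 1
        2.04 "B" 0
        3.00 "H" 0
        3.02 "W" 0
        3.42 "H" 1
        3.66 "H" 1
        3.02 "W" 0
        3.04 "A" 1
        3.26 "H" 1
        3.45 "A" 1
        2.03 "W" 1
        3.01 "A" 0
        end

        encode race, gen(n_race)

        * Factor-variable notation: Stata creates the indicators on the fly
        regress gpa i.gender1 i.n_race
        * Save the coefficient for level 2 ("B") relative to the base ("A")
        scalar b_fv = _b[2.n_race]

        * Hand-made indicators: r_1 = A, r_2 = B, r_3 = H, r_4 = W
        tabulate n_race, generate(r_)
        * Leave out r_1 so that "A" is again the base category
        regress gpa gender1 r_2 r_3 r_4
        scalar b_manual = _b[r_2]

        * The two coefficients should agree
        display b_fv
        display b_manual
        ```

        The i. version avoids cluttering the dataset with extra variables and keeps the value labels in the output.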

        For your model with additional variables, you still need the i.n_race and i.gender1. Looking over your example data, it seems that the resil-creat variables take on a relatively small number of integer values, 1 through 5. If you want those to be treated as if they were continuous variables, then just leave them as is. If you want them to be treated as categorical variables, then they, too, would get an i. prefix. Your income variable appears to take integer values from 4 to 14 in the example data. I imagine that such a variable is either categorical or ordinal. If it is ordinal and you think that the ordinal categories are about equally spaced, then just enter it as income. If it is better thought of as categorical, then again, it would get an i. prefix. If you think it is ordinal but should not be treated as equally spaced, well, that gets complicated.
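
        As a sketch of the syntax only, assuming the example data from #1 and n_race created as in #2, the two choices might look like this. With 18 observations the estimates mean little; the point is where the i. prefix goes.

        ```stata
        * Example data from #1
        clear
        input double gpa str1 race byte(resil motiv work debat creat income gender1)
        2.07 "B" 3 1 1 1 1 14 0
        3.03 "H" 1 1 1 1 1 4 0
        3.07 "H" 1 1 5 5 5 5 1
        2.66 "H" 5 1 1 3 1 13 0
        3.00 "A" 3 4 1 2 1 9 1
        3.04 "B" 5 3 1 1 5 14 1
        3.02 "A" 1 1 4 1 1 14 1
        2.04 "B" 4 1 1 1 1 5 0
        3.00 "H" 4 4 1 1 3 14 0
        3.02 "W" 4 1 1 1 1 4 0
        3.42 "H" 1 4 1 3 5 13 1
        3.66 "H" 5 3 1 4 3 7 1
        3.02 "W" 4 1 1 1 1 14 0
        3.04 "A" 1 2 5 1 1 14 1
        3.26 "H" 5 1 1 1 1 8 1
        3.45 "A" 5 1 5 1 3 14 1
        2.03 "W" 3 1 1 1 4 14 1
        3.01 "A" 5 1 4 1 2 9 0
        end

        encode race, gen(n_race)

        * Skill scores and income treated as continuous: no prefix
        regress gpa resil motiv work debat creat income i.gender1 i.n_race

        * Same model with one skill score (resil) treated as categorical instead;
        * the same i. prefix would apply to income if it is better seen as categorical
        regress gpa i.resil motiv work debat creat income i.gender1 i.n_race
        ```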

        Added: I would not name my variable gender1. If you have to come back to this project a year from now to answer questions about what you did, will you remember that it was 0 = male and 1 = female, not the other way around? It would be more sensible to name the variable female. That way the name itself tells you that female is 1 and male is 0.



        • #5
          Clyde and I have in press for Stata Journal 19(1) a paper surveying creation of indicator variables. The last paragraph in #4 covers one of several points made in the paper.



          • #6
            Thank you Clyde, I took your advice and renamed the variable female. That explanation makes sense to me. I think, given what you've explained about income, I've figured out the best way to handle the income and resil-creat variables. However, I still may need a little help with troubleshooting n_race, given the output when I put it into my larger dataset:

            Code:
            tab race n_race

                       |                   n_race
                  race |     Asian      Black   Hispanic      White |     Total
            -----------+--------------------------------------------+----------
                 Asian |       720          0          0          0 |       720
                 Black |         0        797          0          0 |       797
              Hispanic |         0          0        695          0 |       695
                 White |         0          0          0        731 |       731
            -----------+--------------------------------------------+----------
                 Total |       720        797        695        731 |     2,943

            Code:
            regress gpa resil motiv work debat creat i.female parent_income ib3.n_race
            note: 3b.n_race identifies no observations in the sample
            note: 8.n_race omitted because of collinearity

                  Source |       SS           df       MS      Number of obs   =     1,790
            -------------+----------------------------------   F(10, 1779)     =     19.42
                   Model |  48.3825101        10  4.83825101   Prob > F        =    0.0000
                Residual |  443.217626     1,779  .249138632   R-squared       =    0.0984
            -------------+----------------------------------   Adj R-squared   =    0.0934
                   Total |  491.600136     1,789  .274790462   Root MSE        =    .49914

            -------------------------------------------------------------------------------
                      gpa |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
            --------------+----------------------------------------------------------------
                    resil |  -.0248507   .0070009    -3.55   0.000    -.0385815   -.0111198
                    motiv |     .01498   .0087332     1.72   0.086    -.0021483    .0321084
                     work |   .0107036   .0073424     1.46   0.145     -.003697    .0251042
                    debat |   .0175461   .0096118     1.83   0.068    -.0013055    .0363978
                    creat |  -.0025623   .0077168    -0.33   0.740    -.0176974    .0125727
                 1.female |   .0250477   .0245173     1.02   0.307    -.0230381    .0731334
            parent_income |    .018109   .0060302     3.00   0.003      .006282     .029936
                          |
                   n_race |
                       H  |          0  (empty)
                   Asian  |  -.0556561   .0334713    -1.66   0.097    -.1213033    .0099911
                   Black  |  -.3686887    .033573   -10.98   0.000    -.4345353   -.3028421
                Hispanic  |  -.1637888   .0343016    -4.77   0.000    -.2310646   -.0965131
                   White  |          0  (omitted)
                          |
                    _cons |   2.942213   .0905353    32.50   0.000     2.764646    3.119779
            -------------------------------------------------------------------------------
            Last edited by Anne Todd; 14 Dec 2018, 07:10.



            • #7
              "Added: I would not name my variable gender1. If you have to come back to this project a year from now to answer questions about what you did, will you remember that it was 0 = male and 1 = female, not the other way around? It would be more sensible to name the variable female. That way the name itself tells you that female is 1 and male is 0."

              I will second that. All too often I have had students give me output with a variable called gender and they don't know if a positive coefficient means men score higher or women score higher.

              Some people can still get confused though. I once had a student who couldn't understand that, if a variable was called female, it didn't mean that the respondent had to be female.

              Adding value labels also often helps, especially if a categorical variable has more than 2 categories. If race has 5 categories, is race 1 white, black, Asian, or what? If you use factor variable notation the labels will usually appear in the output.
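
              A minimal sketch of the renaming and labelling suggested in this thread, assuming the example data from #1. The label name femlbl is made up for this illustration.

              ```stata
              * Re-enter the gpa, race and gender1 columns of the example data from #1
              clear
              input double gpa str1 race byte gender1
              2.07 "B" 0
              3.03 "H" 0
              3.07 "H" 1
              2.66 "H" 0
              3.00 "A" 1
              3.04 "B" 1
              3.02 "A" 1
              2.04 "B" 0
              3.00 "H" 0
              3.02 "W" 0
              3.42 "H" 1
              3.66 "H" 1
              3.02 "W" 0
              3.04 "A" 1
              3.26 "H" 1
              3.45 "A" 1
              2.03 "W" 1
              3.01 "A" 0
              end

              * Name the indicator after the category coded 1
              rename gender1 female

              * Attach value labels so tables and output show words, not 0/1
              label define femlbl 0 "Male" 1 "Female"
              label values female femlbl

              tabulate female
              regress gpa i.female
              ```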
              -------------------------------------------
              Richard Williams, Notre Dame Dept of Sociology
              StataNow Version: 19.5 MP (2 processor)

              WWW: https://www3.nd.edu/~rwilliam



              • #8
                Thank you all so much.

                Richard, I think your point actually helps me with the error I pointed to in my latest post. I had tried to recode the values of my race variable so that "A" became "Asian", "H" became "Hispanic", and so on, using this:

                Code:
                gen race1 = ""
                replace race1 = "Asian" if race == "A"
                replace race1 = "Black" if race == "B"
                replace race1 = "Hispanic" if race == "H"
                replace race1 = "White" if race == "W"
                And that gave me the issue cited above. But, when I go back and undo this and leave the "A" and "H", etc., as is, then the regression seems to work out nicely.
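
                One possible reading of the notes in #6, which is an assumption the thread does not confirm: the value label used for n_race still held A, B, H, W at codes 1 to 4 when the full names were encoded, so the full names were added at codes 5 to 8. Base level 3 ("H") then matched no observations, and with no usable base the four remaining indicators sum to one, so Stata dropped one of them (8, White). If that is what happened, a sketch of one way to get the longer names without changing the codes is to leave race as is and swap the value label on n_race. It assumes the example data from #1; the label name race_full is made up for this illustration.

                ```stata
                * Re-enter the gpa, race and gender1 columns of the example data from #1
                clear
                input double gpa str1 race byte gender1
                2.07 "B" 0
                3.03 "H" 0
                3.07 "H" 1
                2.66 "H" 0
                3.00 "A" 1
                3.04 "B" 1
                3.02 "A" 1
                2.04 "B" 0
                3.00 "H" 0
                3.02 "W" 0
                3.42 "H" 1
                3.66 "H" 1
                3.02 "W" 0
                3.04 "A" 1
                3.26 "H" 1
                3.45 "A" 1
                2.03 "W" 1
                3.01 "A" 0
                end

                * As in #2: A = 1, B = 2, H = 3, W = 4
                encode race, gen(n_race)

                * Define longer labels for the same codes and attach them
                label define race_full 1 "Asian" 2 "Black" 3 "Hispanic" 4 "White"
                label values n_race race_full

                * Check that each original letter lines up with the intended name
                tabulate race n_race

                * Level 3 is still Hispanic, so ib3. picks the intended base
                regress gpa i.gender1 ib3.n_race
                ```

                Running -label list- before a regression is a quick way to confirm which code the base level refers to.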

