
  • Multiple linear regression

    Hi all,

    I am a newcomer struggling to specify a regression model using a cross-sectional data set in Stata. I have a few questions, and any help would be much appreciated. I'm looking to analyse the returns to education. Should I include a variable for the square of experience? I also want to check for differences in the return to education by gender and race. I have available a dummy for gender and two dummies for two ethnicities. Should all these variables be included in the specification?

    Ralph

  • #2
    Welcome to Statalist.

    Can you clarify exactly what your dependent variable is and how it is measured? For that matter, it would help to be clear on what all your variables are and how they are measured.
    -------------------------------------------
    Richard Williams, Notre Dame Dept of Sociology
    Stata Version: 17.0 MP (2 processor)

    EMAIL: [email protected]
    WWW: https://www3.nd.edu/~rwilliam



    • #3
      Ralph:
      Richard's wise advice about being clearer on your research goals (and on what you typed and what Stata gave you back, even better if topped off with an example/excerpt of your dataset shared via -dataex-, as the FAQ reminds all of us) cannot be overstated.
      For instance, I do not understand the need to have two separate dummy variables for -ethnicity-. Can't you group them together in a single categorical variable and let -fvvarlist- notation do the rest? This way, the -fvvarlist- machinery will automatically protect you from the so-called dummy variable trap (https://en.wikipedia.org/wiki/Dummy_...le_(statistics)).
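      One way to build such a variable is sketched below; the dummy names -black- and -hispanic- are only my assumption, so adjust them to whatever your dataset actually uses:
      Code:
      * Sketch only: assumes two 0/1 dummies named black and hispanic,
      * with people belonging to neither group coded 0 on both
      * (no handling of missing values here).
      generate byte ethnicity = 0
      replace ethnicity = 1 if black == 1
      replace ethnicity = 2 if hispanic == 1
      label define ethlbl 0 "Neither" 1 "Black" 2 "Hispanic"
      label values ethnicity ethlbl
      * -i.ethnicity- can then be used directly in -regress-; Stata omits the base level itself.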
      That said, the following toy-example can hopefully be helpful:
      Code:
      . use "C:\Program Files\Stata17\ado\base\a\auto.dta"
      (1978 automobile data)
      
      . regress price c.weight##c.weight i.foreign i.rep78
      
            Source |       SS           df       MS      Number of obs   =        69
      -------------+----------------------------------   F(7, 61)        =     11.58
             Model |   329138744         7  47019820.6   Prob > F        =    0.0000
          Residual |   247658215        61  4059970.73   R-squared       =    0.5706
      -------------+----------------------------------   Adj R-squared   =    0.5214
             Total |   576796959        68  8482308.22   Root MSE        =    2014.9
      
      -----------------------------------------------------------------------------------
                  price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
      ------------------+----------------------------------------------------------------
                 weight |  -4.594996   2.572273    -1.79   0.079    -9.738574    .5485819
                        |
      c.weight#c.weight |   .0012672   .0004029     3.14   0.003     .0004615    .0020729
                        |
                foreign |
               Foreign  |    3208.35   804.0958     3.99   0.000     1600.462    4816.239
                        |
                  rep78 |
                     2  |   446.8182   1596.598     0.28   0.781    -2745.776    3639.413
                     3  |   324.6413   1487.146     0.22   0.828    -2649.091    3298.373
                     4  |  -225.7175    1562.19    -0.14   0.886    -3349.509    2898.074
                     5  |   472.3776   1657.326     0.29   0.777    -2841.651    3786.406
                        |
                  _cons |   6457.738   4381.542     1.47   0.146    -2303.697    15219.17
      -----------------------------------------------------------------------------------
      
      . mat list e(b)
      
      e(b)[1,10]
                        c.weight#         0b.          1.         1b.          2.          3.          4.          5.            
              weight    c.weight     foreign     foreign       rep78       rep78       rep78       rep78       rep78       _cons
      y1  -4.5949961    .0012672           0   3208.3505           0   446.81819   324.64128  -225.71747   472.37758   6457.7375
      
      
      . test _b[0b.foreign] = _b[1.foreign]
      
       ( 1)  0b.foreign - 1.foreign = 0
      
             F(  1,    61) =   15.92
                  Prob > F =    0.0002
      
      . lincom  _b[0b.foreign] -  _b[1.foreign]
      
       ( 1)  0b.foreign - 1.foreign = 0
      
      ------------------------------------------------------------------------------
             price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
      -------------+----------------------------------------------------------------
               (1) |   -3208.35   804.0958    -3.99   0.000    -4816.239   -1600.462
      ------------------------------------------------------------------------------
      
      . di 3.99^2
      15.9201
      
      . di F(1,61,15.9201)
      .9998203
      
      . di 1-F(1,61,15.9201)
      .0001797
      
      .
      The results of this OLS tell us that it makes sense to include both a linear and a squared term for -weight-, as this predictor shows a non-linear relationship with the regressand (i.e., price).
      In addition, you can compare the coefficients of your regression via -test- and -lincom- (the latter allows linear combinations of two or more coefficients; see the -test- and -lincom- entries in the Stata PDF manual for more details on these really useful postestimation commands). As expected, in this case -lincom- simply repeats what is already reported in the -regress- output table.
      As a sidelight, you can see how these two commands are linked (see the p-value calculation above).
      Kind regards,
      Carlo
      (Stata 18.0 SE)



      • #4
        Originally posted by Richard Williams:
        Welcome to Statalist.

        Can you clarify exactly what your dependent variable is and how it is measured? For that matter, it would help to be clear on what all your variables are and how they are measured.
        My dependent variable is earnings, and there is a variety of variables available, including level of education, age, experience, a dummy for male, premade dummies for black and hispanic (I did not choose this dataset, but a look at the data shows there are individuals who are neither black nor hispanic, hence my thought to use both), sector of work, whether the person lived in an urban area, etc. The specification I was thinking of using was ln(earnings) = b0 + b1*level of education + b2*exp + b3*exp^2 + b4*male + b5*black + b6*hispanic + b7*sector. The task asks to check for differences in returns to education between genders, races and sectors of work.
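        In Stata I suppose this would translate into something like the following (the variable names here are just placeholders, since I am not sure yet what the dataset calls them):
        Code:
        * sketch only; earnings, educ, exp, male, black, hispanic and sector are placeholder names
        generate ln_earnings = ln(earnings)
        regress ln_earnings c.educ c.exp##c.exp i.male i.black i.hispanic i.sector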



        • #5
          Originally posted by Carlo Lazzaro:
          Ralph:
          Richard's wise advice about being clearer on your research goals (and on what you typed and what Stata gave you back, even better if topped off with an example/excerpt of your dataset shared via -dataex-, as the FAQ reminds all of us) cannot be overstated.
          For instance, I do not understand the need to have two separate dummy variables for -ethnicity-. Can't you group them together in a single categorical variable and let -fvvarlist- notation do the rest? This way, the -fvvarlist- machinery will automatically protect you from the so-called dummy variable trap (https://en.wikipedia.org/wiki/Dummy_...le_(statistics)).
          Thank you for the advice; I have read through the commands and have a better understanding. To compare the difference in returns to education between races, for example, do I have to interact the race dummy variables with the level of education and use the interaction term in my regression?



          • #6
            Ralph:
            That's the way to go, usually.
            My reply is tentative, as I cannot take a look at an excerpt of your dataset.
            Be careful with the way you deal with categorical variables: the previous advice to exploit the wonderful capabilities of -fvvarlist- still holds.
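            For instance, with a single -ethnicity- variable the interaction can be sketched as follows (variable names are, again, just placeholders):
            Code:
            * sketch only; ln_earnings, educ, exp, male and ethnicity are assumed names
            * -##- includes both the main effects and the interaction in one step
            regress ln_earnings c.educ##i.ethnicity c.exp##c.exp i.male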
            Kind regards,
            Carlo
            (Stata 18.0 SE)



            • #7
              Originally posted by Carlo Lazzaro:
              Ralph:
              That's the way to go, usually.
              My reply is tentative, as I cannot take a look at an excerpt of your dataset.
              Be careful with the way you deal with categorical variables: the previous advice to exploit the wonderful capabilities of -fvvarlist- still holds.
              Could you please let me know whether my strategy and specification are correct for comparing the differences in returns between gender, ethnicities and sector. The two ethnicity dummies given in the dataset are black and hispanic; there is also a group of people who belong to neither. The specification I've used is ln(earnings) = b0 + b1*level of education + b2*exp + b3*exp^2 + b4*male + b5*black + b6*hispanic + b7*sector. I've added the sector variable (classgov, a dummy for government-sector employees) because I have to compare returns between sectors. I didn't add age because of collinearity with experience. I plan to interact the level of education with the target variables (male, the ethnicity dummies and the sector dummy) to extract the differences. Is this the right way to go about it? What tests could I run to strengthen the results?
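              In Stata terms, what I have in mind is something along these lines (variable names are placeholders apart from classgov, black and hispanic):
              Code:
              * sketch only; educ, exp, male and ln_earnings are placeholder names
              regress ln_earnings c.educ c.exp##c.exp i.male i.black i.hispanic i.classgov ///
                  i.male#c.educ i.black#c.educ i.hispanic#c.educ i.classgov#c.educ
              * the coefficient on 1.male#c.educ is then the male-female difference in the return to education
              test _b[1.male#c.educ] = 0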



              • #8
                Ralph:
                1) your code can be written more efficiently, provided that you switch to -fvvarlist- notation for creating categorical variables and interactions. It is also worth getting familiar with the -i.- and -c.- prefixes available in -fvvarlist-;
                2) as already advised, why not create a single categorical variable for ethnicity (-i.ethnicity-)?;
                3)
                Code:
                regress ln_earnings c.exp##c.exp i.male i.ethnicity i.classgov
                4) too many interactions make results difficult to disseminate. Be parsimonious with them;
                5) test your model for heteroskedasticity (-estat hettest-), autocorrelation of the error term and the specification of the functional form of the regressand (-linktest- is to be preferred to -estat ovtest-); a short sketch follows below;
                6) use -test- and -lincom- when appropriate in postestimation.
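                As a sketch of point 5), run right after the command in point 3) (variable names as assumed there):
                Code:
                * sketch only; the diagnostics are postestimation commands, so run them right after -regress-
                regress ln_earnings c.exp##c.exp i.male i.ethnicity i.classgov
                estat hettest      // Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
                linktest           // functional-form / specification check based on _hat and _hatsq
                * if heteroskedasticity shows up, robust standard errors are one option:
                regress ln_earnings c.exp##c.exp i.male i.ethnicity i.classgov, vce(robust)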
                Kind regards,
                Carlo
                (Stata 18.0 SE)



                • #9
                  Originally posted by Carlo Lazzaro:
                  Ralph:
                  1) your code can be written more efficiently, provided that you switch to -fvvarlist- notation for creating categorical variables and interactions. It is also worth getting familiar with the -i.- and -c.- prefixes available in -fvvarlist-;
                  2) as already advised, why not create a single categorical variable for ethnicity (-i.ethnicity-)?;
                  3)
                  Code:
                  regress ln_earnings c.exp##c.exp i.male i.ethnicity i.classgov
                  4) too many interactions make results difficult to disseminate. Be parsimonious with them;
                  5) test your model for heteroskedasticity (-estat hettest-), autocorrelation of the error term and the specification of the functional form of the regressand (-linktest- is to be preferred to -estat ovtest-);
                  6) use -test- and -lincom- when appropriate in postestimation.
                  Thank you for your help, it is much appreciated.

