  • Adding Independent Variables in Regression

    Hi there! I am new to Stata and am eager to learn.

    I am currently doing an OLS regression on cross-sectional data. As I add more independent variables to the regression, some of my p-values and coefficients change. Is this normal?

    I am trying to check if my OLS regression is a good model. Are there any basic indicators of a concerning/problematic model that I should be aware of?

    Thank you in advance.

  • #2
    Originally posted by chelle anny:
    Part 1: Yes, that is entirely normal. As you add more variables, more of the variation in your dependent variable is being explained, and that is why the coefficients and p-values change.
    Part 2: That is a very broad question. There are six or so assumptions underlying OLS: normality of the errors, homoskedasticity, and so on. I suggest you do an online search to find out what they are, how to test for them, and how to deal with violations.
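
    As a starting point, here is a minimal do-file sketch of the standard built-in checks after -regress- (the data set and covariates are purely illustrative):
    Code:
    * common post-estimation checks (illustrative sketch)
    sysuse auto, clear
    regress mpg weight length

    estat hettest    // Breusch-Pagan test for heteroskedasticity
    estat ovtest     // Ramsey RESET test for functional-form misspecification
    estat vif        // variance inflation factors (multicollinearity)

    predict double r, residuals
    swilk r          // Shapiro-Wilk test of residual normality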



    • #3
      Chelle Anny:
      welcome to this forum.
      As Oscar wisely pointed out, there are different issues to be checked after any regression model.
      As far as -regress- is concerned, I would investigate possible heteroskedasticity and, more importantly, model misspecification (the following toy example suffers from both):
      Code:
      . use https://www.stata-press.com/data/r16/auto
      (1978 Automobile Data)
      
      . regress mpg weight i.foreign
      
            Source |       SS           df       MS      Number of obs   =        74
      -------------+----------------------------------   F(2, 71)        =     69.75
             Model |   1619.2877         2  809.643849   Prob > F        =    0.0000
          Residual |  824.171761        71   11.608053   R-squared       =    0.6627
      -------------+----------------------------------   Adj R-squared   =    0.6532
             Total |  2443.45946        73  33.4720474   Root MSE        =    3.4071
      
      ------------------------------------------------------------------------------
               mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
            weight |  -.0065879   .0006371   -10.34   0.000    -.0078583   -.0053175
                   |
           foreign |
          Foreign  |  -1.650029   1.075994    -1.53   0.130      -3.7955    .4954422
             _cons |    41.6797   2.165547    19.25   0.000     37.36172    45.99768
      ------------------------------------------------------------------------------
      
      . estat hettest
      
      Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
               Ho: Constant variance
               Variables: fitted values of mpg
      
               chi2(1)      =     7.89
               Prob > chi2  =   0.0050
      
      . linktest
      
            Source |       SS           df       MS      Number of obs   =        74
      -------------+----------------------------------   F(2, 71)        =     77.36
             Model |  1674.86992         2  837.434958   Prob > F        =    0.0000
          Residual |  768.589544        71  10.8252048   R-squared       =    0.6855
      -------------+----------------------------------   Adj R-squared   =    0.6766
             Total |  2443.45946        73  33.4720474   Root MSE        =    3.2902
      
      ------------------------------------------------------------------------------
               mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
              _hat |  -.4684797   .6532021    -0.72   0.476    -1.770928    .8339683
            _hatsq |   .0352427   .0155532     2.27   0.027     .0042305    .0662549
             _cons |   14.51826    6.65057     2.18   0.032     1.257397    27.77912
      ------------------------------------------------------------------------------
      The statistical significance of the squared fitted values (_hatsq) raises concerns about model misspecification.
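
      If these checks fail, two common responses can be sketched on the same toy data (a hedged sketch, not a prescription): robust standard errors to guard against the heteroskedasticity, and a richer functional form to address the nonlinearity that -linktest- flags:
      Code:
      * robust (Huber-White) standard errors address the heteroskedasticity
      regress mpg weight i.foreign, vce(robust)

      * a quadratic term in weight is one way to tackle the misspecification
      regress mpg c.weight##c.weight i.foreign
      linktest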

      As an aside, please take a look at the FAQ on how to post more effectively and, in so doing, increase your chances of getting (more) helpful replies.
      Kind regards,
      Carlo
      (Stata 19.0)



      • #4
        It is indeed normal and expected for regression coefficients, standard errors and p-values to change when you add variables to a multiple regression. The only circumstance in which the coefficients would not change is when all of the independent variables are perfectly uncorrelated with one another.

        I suggest looking up the topics "multicollinearity" and "omitted variable bias" in your statistics book.

        Using the -auto.dta- data as an example, you might be interested in estimating the impact of a vehicle's weight on its price. With a simple regression of -price- on -weight-, you estimate that the impact of an additional pound of weight on price is about $2.04 and is highly statistically significant, with a t-statistic of 5.4, an R2 of 0.29 and a p-value less than 0.0005. Here's the output:

        Code:
        . sysuse auto, clear
        (1978 Automobile Data)
        
        . regress price weight
        
              Source |       SS           df       MS      Number of obs   =        74
        -------------+----------------------------------   F(1, 72)        =     29.42
               Model |   184233937         1   184233937   Prob > F        =    0.0000
            Residual |   450831459        72  6261548.04   R-squared       =    0.2901
        -------------+----------------------------------   Adj R-squared   =    0.2802
               Total |   635065396        73  8699525.97   Root MSE        =    2502.3
        
        ------------------------------------------------------------------------------
               price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
              weight |   2.044063   .3768341     5.42   0.000     1.292857    2.795268
               _cons |  -6.707353    1174.43    -0.01   0.995     -2347.89    2334.475
        ------------------------------------------------------------------------------
        But you might wonder if this estimate is correct. After all, other features of a vehicle also affect price. Two candidate variables in the data are the vehicle's engine displacement and its repair record. By adding these two variables to the regression, we are "controlling" for them. Because they are correlated with the weight variable, we expect the results to change. Here is the result:

        Code:
        . regress price weight displacement rep78
        
              Source |       SS           df       MS      Number of obs   =        69
        -------------+----------------------------------   F(3, 65)        =     13.20
               Model |   218371825         3  72790608.4   Prob > F        =    0.0000
            Residual |   358425134        65  5514232.83   R-squared       =    0.3786
        -------------+----------------------------------   Adj R-squared   =    0.3499
               Total |   576796959        68  8482308.22   Root MSE        =    2348.2
        
        ------------------------------------------------------------------------------
               price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
              weight |   1.171487   .9895068     1.18   0.241    -.8046943    3.147669
        displacement |    11.5273   8.470084     1.36   0.178    -5.388624    28.44323
               rep78 |   841.7272   316.0884     2.66   0.010     210.4551    1472.999
               _cons |  -2555.098   2135.043    -1.20   0.236    -6819.073    1708.878
        ------------------------------------------------------------------------------
        This example illustrates three ways that adding independent variables can change your results. First, you have included the variable -rep78-, which has missing values on five of the 74 observations, so you are omitting five observations from your analysis. The absence of those five observations may have affected the estimated coefficient, t-statistic and p-value of -weight-. Second, the added variables have improved the equation's fit, as measured by the increase in the R2 to 0.38. Third, the repair record now eclipses weight in statistical significance. The p-value on weight is now 0.24. Although the best unbiased estimate of the impact of an extra pound on price is still positive at $1.17, you can no longer reject the hypothesis that weight has a zero impact on price.
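
        As a quick check on the first point, the standard -misstable- and -count- commands will confirm where the lost observations come from (a minimal sketch):
        Code:
        * confirm that -rep78- accounts for the five dropped observations
        misstable summarize price weight displacement rep78
        count if missing(rep78)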

        A program you might find helpful in diagnosing this situation, especially in cross-sectional data, is -bivariate-, which you can install from inside Stata by typing:
        Code:
        . view net describe bivariate, from("http://digital.cgdev.org/doc/stata/MO/Misc")
        The syntax for -bivariate- is similar to that for -regress-. Here is what it looks like with the option -obsgain-:
        Code:
        . bivariate price weight displacement rep78, obsgain
        
        Results for the dependent variable: price and each of the independent variables:
            weight displacement rep78
        
        Casewise deletion drops: 5 observations.
        
        The analysis uses :              69 observations.
        The variance inflation factor is: Centered
        Without the variable rep78 N would be: 74
        
        
        Table of bivariate correlation coefficients for the dependent variable: price
        
                         Coef. of       Means of variables:
                     | Correlation       t-stat      p-value          VIF   Obs Gained 
        -------------+-----------------------------------------------------------------
              weight |     0.54784      5.36021      0.00000      7.59006      0.00000 
        displacement |     0.54792      5.36138      0.00000      7.67618      0.00000 
               rep78 |     0.00655      0.05364      0.95738      1.20740      5.00000
        The bivariate results with the -obsgain- option show that you would gain five observations if you omitted the variable -rep78- from the regression. But you probably don't want to do that, since it makes theoretical sense that a car's repair record has a strong effect on price.

        In the bivariate results we see that, despite performing poorly in the multiple regression, the variables -weight- and -displacement- both have high and very statistically significant bivariate correlations with -price-. Why are they statistically significant in the bivariate results, but not in the multiple regression?

        Look in the column labeled "VIF", which stands for "Variance Inflation Factor". You will see that the two variables -weight- and -displacement- have high values of VIF, indicating that each is highly correlated with the other independent variables. And since the variable -rep78- has a much smaller VIF, -weight- and -displacement- must be highly correlated with each other. This makes sense logically, because a heavier car requires an engine with a larger displacement. The two variables are measuring almost the same thing.
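
        You can corroborate this with Stata's built-in diagnostics (a minimal sketch; -estat vif- and -correlate- are standard commands):
        Code:
        * variance inflation factors after the three-variable regression
        quietly regress price weight displacement rep78
        estat vif

        * pairwise correlation between the two suspect regressors
        correlate weight displacement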

        A suggested remedy is to keep the -rep78- variable together with only one of the two correlated variables. If we keep -weight-, the resulting regression is:
        Code:
        . regress price weight rep78
        
              Source |       SS           df       MS      Number of obs   =        69
        -------------+----------------------------------   F(2, 66)        =     18.63
               Model |   208158551         2   104079275   Prob > F        =    0.0000
            Residual |   368638408        66  5585430.43   R-squared       =    0.3609
        -------------+----------------------------------   Adj R-squared   =    0.3415
               Total |   576796959        68  8482308.22   Root MSE        =    2363.4
        
        ------------------------------------------------------------------------------
               price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
              weight |      2.408   .3944697     6.10   0.000     1.620416    3.195584
               rep78 |   791.3852   315.9366     2.50   0.015     160.5974    1422.173
               _cons |  -3850.381   1923.469    -2.00   0.049    -7690.711   -10.05079
        ------------------------------------------------------------------------------
        which makes sense on theoretical grounds and reveals that the estimated impact of weight is even larger than it appeared to be in the other regressions, and more statistically significant. Note that the R2 is almost as high as it was with -displacement- also in the regression. This example illustrates the possibility that including an additional variable in a regression (like -rep78- in this example) sometimes improves the performance of a variable already in the regression (like -weight-).
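
        To see the correlation structure behind this effect, a short sketch using the built-in -pwcorr- command (output omitted here):
        Code:
        * pairwise correlations with significance levels and sample sizes
        pwcorr price weight displacement rep78, sig obs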




        • #5
          Thank you so much all! I didn't expect such prompt and helpful responses. Thanks again for helping me with my learning process. Appreciate it.
