  • Adding Independent Variables in Regression

    Hi there! I am new to Stata and am eager to learn.

    I am currently doing an OLS regression on cross-sectional data. As I add more independent variables to the regression, some of my p-values and coefficients change. Is this normal?

    I am trying to check if my OLS regression is a good model. Are there any basic indicators of a concerning/problematic model that I should be aware of?

    Thank you in advance.

  • #2
    Originally posted by chelle anny:
    Part 1: Yes, that is entirely normal. As you add more variables, more of the variation in your dependent variable is being explained, and that is why the coefficients and p-values change.
    Part 2: That is a very broad question. There are six or so assumptions underlying OLS: normality of the errors, homoskedasticity, and so on. I suggest you do an online search to find out what they are, how to test for them, and how to deal with violations.
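
    As a starting point, here is a minimal do-file sketch of the standard built-in checks after -regress- (the data set and covariates are purely illustrative):
    Code:
    * common post-estimation checks (illustrative sketch)
    sysuse auto, clear
    regress mpg weight length

    estat hettest    // Breusch-Pagan test for heteroskedasticity
    estat ovtest     // Ramsey RESET test for functional-form misspecification
    estat vif        // variance inflation factors (multicollinearity)

    predict double r, residuals
    swilk r          // Shapiro-Wilk test of residual normality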



    • #3
      Chelle Anny:
      welcome to this forum.
      As Oscar wisely pointed out, there are different issues to be checked after any regression model.
      As far as -regress- is concerned, I would investigate possible heteroskedasticity and, more importantly, model misspecification (the following toy example suffers from both):
      Code:
      . use https://www.stata-press.com/data/r16/auto
      (1978 Automobile Data)
      
      . regress mpg weight i.foreign
      
            Source |       SS           df       MS      Number of obs   =        74
      -------------+----------------------------------   F(2, 71)        =     69.75
             Model |   1619.2877         2  809.643849   Prob > F        =    0.0000
          Residual |  824.171761        71   11.608053   R-squared       =    0.6627
      -------------+----------------------------------   Adj R-squared   =    0.6532
             Total |  2443.45946        73  33.4720474   Root MSE        =    3.4071
      
      ------------------------------------------------------------------------------
               mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
            weight |  -.0065879   .0006371   -10.34   0.000    -.0078583   -.0053175
                   |
           foreign |
          Foreign  |  -1.650029   1.075994    -1.53   0.130      -3.7955    .4954422
             _cons |    41.6797   2.165547    19.25   0.000     37.36172    45.99768
      ------------------------------------------------------------------------------
      
      . estat hettest
      
      Breusch-Pagan / Cook-Weisberg test for heteroskedasticity
               Ho: Constant variance
               Variables: fitted values of mpg
      
               chi2(1)      =     7.89
               Prob > chi2  =   0.0050
      
      . linktest
      
            Source |       SS           df       MS      Number of obs   =        74
      -------------+----------------------------------   F(2, 71)        =     77.36
             Model |  1674.86992         2  837.434958   Prob > F        =    0.0000
          Residual |  768.589544        71  10.8252048   R-squared       =    0.6855
      -------------+----------------------------------   Adj R-squared   =    0.6766
             Total |  2443.45946        73  33.4720474   Root MSE        =    3.2902
      
      ------------------------------------------------------------------------------
               mpg |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
              _hat |  -.4684797   .6532021    -0.72   0.476    -1.770928    .8339683
            _hatsq |   .0352427   .0155532     2.27   0.027     .0042305    .0662549
             _cons |   14.51826    6.65057     2.18   0.032     1.257397    27.77912
      ------------------------------------------------------------------------------
      The statistical significance of the squared fitted values (_hatsq) raises concerns about model misspecification.
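
      If these checks fail, two common responses can be sketched on the same toy data (a hedged sketch, not a prescription): robust standard errors to guard against the heteroskedasticity, and a richer functional form to address the nonlinearity that -linktest- flags:
      Code:
      * robust (Huber-White) standard errors address the heteroskedasticity
      regress mpg weight i.foreign, vce(robust)

      * a quadratic term in weight is one way to tackle the misspecification
      regress mpg c.weight##c.weight i.foreign
      linktest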

      As an aside, please take a look at the FAQ on how to post more effectively and, in so doing, increase your chances of getting (more) helpful replies.
      Kind regards,
      Carlo
      (Stata 19.0)



      • #4
        It is indeed normal and expected for regression coefficients, standard errors and p-values to change when you add variables to a multiple regression. The only circumstance in which the coefficients would not change is when all of the independent variables are perfectly uncorrelated with one another.

        I suggest looking up the topics "multicollinearity" and "omitted variable bias" in your statistics book.

        Using the -auto.dta- data as an example, you might be interested in estimating the impact of a vehicle's weight on its price. With a simple regression of -price- on -weight-, you estimate that the impact of an additional pound of weight on price is about $2.04 and is highly statistically significant, with a t-statistic of 5.4, an R2 of 0.29 and a p-value less than 0.0005. Here's the output:

        Code:
        . sysuse auto, clear
        (1978 Automobile Data)
        
        . regress price weight
        
              Source |       SS           df       MS      Number of obs   =        74
        -------------+----------------------------------   F(1, 72)        =     29.42
               Model |   184233937         1   184233937   Prob > F        =    0.0000
            Residual |   450831459        72  6261548.04   R-squared       =    0.2901
        -------------+----------------------------------   Adj R-squared   =    0.2802
               Total |   635065396        73  8699525.97   Root MSE        =    2502.3
        
        ------------------------------------------------------------------------------
               price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
              weight |   2.044063   .3768341     5.42   0.000     1.292857    2.795268
               _cons |  -6.707353    1174.43    -0.01   0.995     -2347.89    2334.475
        ------------------------------------------------------------------------------
        But you might wonder if this estimate is correct. After all, other features of a vehicle also affect price. Two candidate variables in the data are the vehicle's engine displacement and its repair record. By adding these two variables to the regression, we are "controlling" for them. Because they are correlated with the weight variable, we expect the results to change. Here is the result:

        Code:
        . regress price weight displacement rep78
        
              Source |       SS           df       MS      Number of obs   =        69
        -------------+----------------------------------   F(3, 65)        =     13.20
               Model |   218371825         3  72790608.4   Prob > F        =    0.0000
            Residual |   358425134        65  5514232.83   R-squared       =    0.3786
        -------------+----------------------------------   Adj R-squared   =    0.3499
               Total |   576796959        68  8482308.22   Root MSE        =    2348.2
        
        ------------------------------------------------------------------------------
               price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
              weight |   1.171487   .9895068     1.18   0.241    -.8046943    3.147669
        displacement |    11.5273   8.470084     1.36   0.178    -5.388624    28.44323
               rep78 |   841.7272   316.0884     2.66   0.010     210.4551    1472.999
               _cons |  -2555.098   2135.043    -1.20   0.236    -6819.073    1708.878
        ------------------------------------------------------------------------------
        This example illustrates three ways that adding independent variables can change your results. First, you have included the variable -rep78-, which has missing values on five of the 74 observations, so you are omitting five observations from your analysis. The absence of those five observations may have affected the estimated coefficient, t-statistic and p-value of -weight-. Second, the added variables have improved the equation's fit, as measured by the increase in the R2 to 0.38. Third, the repair record now eclipses weight in statistical significance. The p-value on weight is now 0.24. Although the best unbiased estimate of the impact of an extra pound on price is still positive at $1.17, you can no longer reject the hypothesis that weight has a zero impact on price.
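
        As a quick check on the first point, the standard -misstable- and -count- commands will confirm where the lost observations come from (a minimal sketch):
        Code:
        * confirm that -rep78- accounts for the five dropped observations
        misstable summarize price weight displacement rep78
        count if missing(rep78)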

        A program you might find helpful in diagnosing this situation, especially in cross-sectional data, is -bivariate-, which you can install from inside Stata by typing:
        Code:
        . view net describe bivariate, from("http://digital.cgdev.org/doc/stata/MO/Misc")
        The syntax for -bivariate- is similar to that for -regress-. Here is what it looks like with the option -obsgain-:
        Code:
        . bivariate price weight displacement rep78, obsgain
        
        Results for the dependent variable: price and each of the independent variables:
            weight displacement rep78
        
        Casewise deletion drops: 5 observations.
        
        The analysis uses :              69 observations.
        The variance inflation factor is: Centered
        Without the variable rep78 N would be: 74
        
        
        Table of bivariate correlation coefficients for the dependent variable: price
        
                         Coef. of       Means of variables:
                     | Correlation       t-stat      p-value          VIF   Obs Gained 
        -------------+-----------------------------------------------------------------
              weight |     0.54784      5.36021      0.00000      7.59006      0.00000 
        displacement |     0.54792      5.36138      0.00000      7.67618      0.00000 
               rep78 |     0.00655      0.05364      0.95738      1.20740      5.00000
        The bivariate results with the -obsgain- option show that you would gain five observations if you omitted the variable -rep78- from the regression. But you probably don't want to do that, since it makes theoretical sense that a car's repair record has a strong effect on price.

        In the bivariate results we see that, despite performing poorly in the multiple regression, the variables -weight- and -displacement- both have high and very statistically significant bivariate correlations with -price-. Why are they statistically significant in the bivariate results, but not in the multiple regression?

        Look in the column labeled "VIF", which stands for "Variance Inflation Factor". You will see that the two variables -weight- and -displacement- have high values of VIF, indicating that each is highly correlated with the other independent variables. And since the variable -rep78- has a much smaller VIF, -weight- and -displacement- must be highly correlated with each other. This makes sense logically, because a heavier car requires an engine with a larger displacement. The two variables are measuring almost the same thing.
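
        You can corroborate this with Stata's built-in diagnostics (a minimal sketch; -estat vif- and -correlate- are standard commands):
        Code:
        * variance inflation factors after the three-variable regression
        quietly regress price weight displacement rep78
        estat vif

        * pairwise correlation between the two suspect regressors
        correlate weight displacement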

        A suggested remedy is to keep the -rep78- variable together with only one of the two correlated variables. If we keep -weight-, the resulting regression is:
        Code:
        . regress price weight rep78
        
              Source |       SS           df       MS      Number of obs   =        69
        -------------+----------------------------------   F(2, 66)        =     18.63
               Model |   208158551         2   104079275   Prob > F        =    0.0000
            Residual |   368638408        66  5585430.43   R-squared       =    0.3609
        -------------+----------------------------------   Adj R-squared   =    0.3415
               Total |   576796959        68  8482308.22   Root MSE        =    2363.4
        
        ------------------------------------------------------------------------------
               price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
              weight |      2.408   .3944697     6.10   0.000     1.620416    3.195584
               rep78 |   791.3852   315.9366     2.50   0.015     160.5974    1422.173
               _cons |  -3850.381   1923.469    -2.00   0.049    -7690.711   -10.05079
        ------------------------------------------------------------------------------
        which makes sense on theoretical grounds and reveals that the estimated impact of weight is even larger than it appeared to be in the other regressions, and more statistically significant. Note that the R2 is almost as high as it was with -displacement- also in the regression. This example illustrates the possibility that including an additional variable in a regression (like -rep78- in this example) sometimes improves the performance of a variable already in the regression (like -weight-).
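
        To see the correlation structure behind this effect, a short sketch using the built-in -pwcorr- command (output omitted here):
        Code:
        * pairwise correlations with significance levels and sample sizes
        pwcorr price weight displacement rep78, sig obs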




        • #5
          Thank you so much all! I didn't expect such prompt and helpful responses. Thanks again for helping me with my learning process. Appreciate it.
