Should I use vce(robust) for my OLS regression?

Erin Mead-Morse

Join Date: Aug 2021

Posts: 12
#1

Should I use vce(robust) for my OLS regression?

25 Aug 2021, 14:49

Hi everyone, this is my first time posting, so I hope that I include all the relevant information. I am using regress in Stata17.0 for a multivariable regression, and unsure if I should include the option vce(robust). There are 61 observations used in the regression. In case this is useful, below is a sample of the dataset using dataex.

Code:

* Example generated by -dataex-. For more info, type help dataex clear input float cigar_mceq_satisfaction byte cigar_intensity double(cigar_bp cigar_omax cigar_finalpmax cigar_alpha) 2.3333333 2 1.9999999999999998 1.2 .6 .05395 7 10 3 9 .9 .007364 6.333333 4 1.9999999999999998 4 1 .01633 5.666667 5 5 6 2.5 .008666 3 1 3 1.9999999999999998 1.9999999999999998 .0472 end

The significance of one variable--cigar_omax--changes when I use vce(robust):

. regress cigar_mceq_sat cigar_intensity cigar_bp cigar_omax cigar_finalpmax cigar_alpha

Source | SS df MS Number of obs = 61
-------------+---------------------------------- F(5, 55) = 2.31
Model | 18.5321839 5 3.70643678 Prob > F = 0.0564
Residual | 88.2802022 55 1.60509459 R-squared = 0.1735
-------------+---------------------------------- Adj R-squared = 0.0984
Total | 106.812386 60 1.78020644 Root MSE = 1.2669

---------------------------------------------------------------------------------
cigar_mceq_sa~n | Coefficient Std. err. t P>|t| [95% conf. interval]
----------------+----------------------------------------------------------------
cigar_intensity | .0043516 .01773 0.25 0.807 -.0311802 .0398834
cigar_bp | .0013295 .0462788 0.03 0.977 -.0914152 .0940743
cigar_omax | .0630785 .0338445 1.86 0.068 -.0047474 .1309043
cigar_finalpmax | .0305393 .1443719 0.21 0.833 -.2587884 .3198671
cigar_alpha | .7171226 9.763465 0.07 0.942 -18.8493 20.28354
_cons | 3.78496 .4116146 9.20 0.000 2.960066 4.609854
---------------------------------------------------------------------------------

.
. regress cigar_mceq_sat cigar_intensity cigar_bp cigar_omax cigar_finalpmax cigar_alpha, vce(r
> obust)

Linear regression Number of obs = 61
F(5, 55) = 4.64
Prob > F = 0.0013
R-squared = 0.1735
Root MSE = 1.2669

---------------------------------------------------------------------------------
| Robust
cigar_mceq_sa~n | Coefficient std. err. t P>|t| [95% conf. interval]
----------------+----------------------------------------------------------------
cigar_intensity | .0043516 .0143781 0.30 0.763 -.0244629 .033166
cigar_bp | .0013295 .0516327 0.03 0.980 -.1021447 .1048038
cigar_omax | .0630785 .0182502 3.46 0.001 .0265043 .0996526
cigar_finalpmax | .0305393 .1549317 0.20 0.844 -.2799508 .3410294
cigar_alpha | .7171226 5.977279 0.12 0.905 -11.26161 12.69586
_cons | 3.78496 .3438619 11.01 0.000 3.095845 4.474074
---------------------------------------------------------------------------------

I thought that this could be an indication of heteroskedasticity. I ran estat hettest and imtest after I performed the first regression (without vce(robust)), but both tests show that I fail to reject the null hypothesis:

. estat hettest

Breusch–Pagan/Cook–Weisberg test for heteroskedasticity
Assumption: Normal error terms
Variable: Fitted values of cigar_mceq_satisfaction

H0: Constant variance

chi2(1) = 0.22
Prob > chi2 = 0.6358

. estat imtest

Cameron & Trivedi's decomposition of IM-test

--------------------------------------------------
Source | chi2 df p
---------------------+----------------------------
Heteroskedasticity | 10.29 20 0.9626
Skewness | 1.89 5 0.8642
Kurtosis | 0.11 1 0.7446
---------------------+----------------------------
Total | 12.29 26 0.9894
--------------------------------------------------

I also used rvfplot. Visually I believe there does appear to be heteroskedasticity. In particular one ID (26) is notable.

Code:

rvfplot, yline(0) mlabel(pid)

If I drop that observation from the model, cigar_omax is not statistically significant, with or without vce(robust):

. regress cigar_mceq_sat cigar_intensity cigar_bp cigar_omax cigar_finalpmax cigar_alpha if pid!=26

Source | SS df MS Number of obs = 60
-------------+---------------------------------- F(5, 54) = 1.44
Model | 11.7738544 5 2.35477088 Prob > F = 0.2246
Residual | 88.2465161 54 1.63419474 R-squared = 0.1177
-------------+---------------------------------- Adj R-squared = 0.0360
Total | 100.02037 59 1.69526052 Root MSE = 1.2784

---------------------------------------------------------------------------------
cigar_mceq_sa~n | Coefficient Std. err. t P>|t| [95% conf. interval]
----------------+----------------------------------------------------------------
cigar_intensity | .0047858 .0181438 0.26 0.793 -.0315904 .041162
cigar_bp | .0038414 .0498663 0.08 0.939 -.0961345 .1038173
cigar_omax | .0567864 .0555594 1.02 0.311 -.0546035 .1681762
cigar_finalpmax | .029057 .1460401 0.20 0.843 -.2637358 .3218499
cigar_alpha | .0352273 10.93667 0.00 0.997 -21.89148 21.96194
_cons | 3.825782 .5033311 7.60 0.000 2.816664 4.8349
---------------------------------------------------------------------------------

. regress cigar_mceq_sat cigar_intensity cigar_bp cigar_omax cigar_finalpmax cigar_alpha if pid!=26, vce(robust)

Linear regression Number of obs = 60
F(5, 54) = 1.68
Prob > F = 0.1543
R-squared = 0.1177
Root MSE = 1.2784

---------------------------------------------------------------------------------
| Robust
cigar_mceq_sa~n | Coefficient std. err. t P>|t| [95% conf. interval]
----------------+----------------------------------------------------------------
cigar_intensity | .0047858 .0140686 0.34 0.735 -.0234201 .0329916
cigar_bp | .0038414 .0557472 0.07 0.945 -.107925 .1156078
cigar_omax | .0567864 .0462053 1.23 0.224 -.0358497 .1494224
cigar_finalpmax | .029057 .1563093 0.19 0.853 -.2843242 .3424383
cigar_alpha | .0352273 7.080134 0.00 0.996 -14.15959 14.23004
_cons | 3.825782 .4503772 8.49 0.000 2.92283 4.728734
---------------------------------------------------------------------------------

I previously examined my data set for points of high leverage, residuals, and influence. This observation does have high leverage, but very low residuals, so it did not appear to have considerable influence. Also, when I calculated dfbetas for each predictor, ID 26 had a high influence for cigar_omax but it wasn't the observation with the highest influence.

. quietly: regress cigar_mceq_sat cigar_intensity cigar_bp cigar_omax cigar_finalpmax cigar_alpha
.
. for var cigar_intensity cigar_bp cigar_omax cigar_finalpmax cigar_alpha: predict DFn_X, dfbeta(X)

-> predict DFn_cigar_intensity, dfbeta(cigar_intensity)
(4 missing values generated)

-> predict DFn_cigar_bp, dfbeta(cigar_bp)
(4 missing values generated)

-> predict DFn_cigar_omax, dfbeta(cigar_omax)
(4 missing values generated)

-> predict DFn_cigar_finalpmax, dfbeta(cigar_finalpmax)
(4 missing values generated)

-> predict DFn_cigar_alpha, dfbeta(cigar_alpha)
(4 missing values generated)

.
. quietly lv DFn_cigar_omax

. sort DFn_cigar_omax

. list pid DFn_cigar_omax cigar_omax if ( (DFn_cigar_omax >=( r(u_F) + 1.5*(r(u_F) - r(l_F)))) | (DFn_cigar_omax <=( r(l_F) - 1.5*(r(u_F) - r(l_F)))) ) & DFn_cigar_omax!=.

+----------------------------+
| pid DFn_~omax c~r_omax |
|----------------------------|
1. | 34 -.3107229 18 |
2. | 56 -.2782602 20 |
3. | 11 -.1311674 3 |
4. | 7 -.113518 3 |
56. | 72 .1330477 15 |
|----------------------------|
57. | 12 .1347013 .5 |
58. | 62 .1422614 15 |
59. | 17 .1662419 10 |
60. | 26 .1842496 45 |
61. | 55 .2723851 20 |
+----------------------------+

. lvr2plot, mlabel(pid)

So I am not sure what is a better course of action. I do not know why vce(robust) would change a statistical significance test if there was no heterskedasticity. I am also not sure why dropping that observation changes the test so much when it does not appear to have high influence.

I appreciate any thoughts or advice. Thank you!
Tags: None
Jackson Monroe

Join Date: Jul 2019

Posts: 60
#2

25 Aug 2021, 17:34

Heteroskedasticity is about the variance of the residuals (the y-axis of your graph, so ID 26 doesn't indicate much to me), and it seems plausible that there is some. In any case using the robust version of the covariance estimator almost always makes more sense because homoskedasticity rarely makes sense. If your results turn on the difference I would note that in your write up, something like "the data are compatible with cigar_omax having a positive effect on cigar_mseq_sat and with it having no effect." The methods Stata is employing here are consistent with very large amounts of data, how they perform differently with n=61 and what that indicates is not really something deserving strong scrutiny in my opinion.

As a side note, ID 26 is increasing the variance of your design matrix in the cigar_omax dimension, which reduces se_beta_cigar_omax, remember beta=cov(y,x)/var(x), and when you drop ID 26 var(x) is drastically declining, increasing se_beta_cigar_omax, decreasing t etc.
1 like
Comment
Joro Kolev

Join Date: Aug 2018

Posts: 3050
#3

26 Aug 2021, 04:01

It seems to me that your residual vs fitted plot clearly shows heteroskedasticity, and that you should use robust variance.
1 like
Comment

Announcement

Should I use vce(robust) for my OLS regression?

Comment

Comment