Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Should I use vce(robust) for my OLS regression?

    Hi everyone, this is my first time posting, so I hope that I include all the relevant information. I am using regress in Stata17.0 for a multivariable regression, and unsure if I should include the option vce(robust). There are 61 observations used in the regression. In case this is useful, below is a sample of the dataset using dataex.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float cigar_mceq_satisfaction byte cigar_intensity double(cigar_bp cigar_omax cigar_finalpmax cigar_alpha)
    2.3333333  2 1.9999999999999998                1.2                 .6  .05395
            7 10                  3                  9                 .9 .007364
     6.333333  4 1.9999999999999998                  4                  1  .01633
     5.666667  5                  5                  6                2.5 .008666
            3  1                  3 1.9999999999999998 1.9999999999999998   .0472
    end


    The significance of one variable--cigar_omax--changes when I use vce(robust):


    . regress cigar_mceq_sat cigar_intensity cigar_bp cigar_omax cigar_finalpmax cigar_alpha

    Source | SS df MS Number of obs = 61
    -------------+---------------------------------- F(5, 55) = 2.31
    Model | 18.5321839 5 3.70643678 Prob > F = 0.0564
    Residual | 88.2802022 55 1.60509459 R-squared = 0.1735
    -------------+---------------------------------- Adj R-squared = 0.0984
    Total | 106.812386 60 1.78020644 Root MSE = 1.2669

    ---------------------------------------------------------------------------------
    cigar_mceq_sa~n | Coefficient Std. err. t P>|t| [95% conf. interval]
    ----------------+----------------------------------------------------------------
    cigar_intensity | .0043516 .01773 0.25 0.807 -.0311802 .0398834
    cigar_bp | .0013295 .0462788 0.03 0.977 -.0914152 .0940743
    cigar_omax | .0630785 .0338445 1.86 0.068 -.0047474 .1309043
    cigar_finalpmax | .0305393 .1443719 0.21 0.833 -.2587884 .3198671
    cigar_alpha | .7171226 9.763465 0.07 0.942 -18.8493 20.28354
    _cons | 3.78496 .4116146 9.20 0.000 2.960066 4.609854
    ---------------------------------------------------------------------------------

    .
    . regress cigar_mceq_sat cigar_intensity cigar_bp cigar_omax cigar_finalpmax cigar_alpha, vce(r
    > obust)

    Linear regression Number of obs = 61
    F(5, 55) = 4.64
    Prob > F = 0.0013
    R-squared = 0.1735
    Root MSE = 1.2669

    ---------------------------------------------------------------------------------
    | Robust
    cigar_mceq_sa~n | Coefficient std. err. t P>|t| [95% conf. interval]
    ----------------+----------------------------------------------------------------
    cigar_intensity | .0043516 .0143781 0.30 0.763 -.0244629 .033166
    cigar_bp | .0013295 .0516327 0.03 0.980 -.1021447 .1048038
    cigar_omax | .0630785 .0182502 3.46 0.001 .0265043 .0996526
    cigar_finalpmax | .0305393 .1549317 0.20 0.844 -.2799508 .3410294
    cigar_alpha | .7171226 5.977279 0.12 0.905 -11.26161 12.69586
    _cons | 3.78496 .3438619 11.01 0.000 3.095845 4.474074
    ---------------------------------------------------------------------------------



    I thought that this could be an indication of heteroskedasticity. I ran estat hettest and imtest after I performed the first regression (without vce(robust)), but both tests show that I fail to reject the null hypothesis:

    . estat hettest

    Breusch–Pagan/Cook–Weisberg test for heteroskedasticity
    Assumption: Normal error terms
    Variable: Fitted values of cigar_mceq_satisfaction

    H0: Constant variance

    chi2(1) = 0.22
    Prob > chi2 = 0.6358

    . estat imtest

    Cameron & Trivedi's decomposition of IM-test

    --------------------------------------------------
    Source | chi2 df p
    ---------------------+----------------------------
    Heteroskedasticity | 10.29 20 0.9626
    Skewness | 1.89 5 0.8642
    Kurtosis | 0.11 1 0.7446
    ---------------------+----------------------------
    Total | 12.29 26 0.9894
    --------------------------------------------------


    I also used rvfplot. Visually I believe there does appear to be heteroskedasticity. In particular one ID (26) is notable.

    Code:
    rvfplot, yline(0) mlabel(pid)
    Click image for larger version

Name:	Graph.png
Views:	1
Size:	45.0 KB
ID:	1624891


    If I drop that observation from the model, cigar_omax is not statistically significant, with or without vce(robust):

    . regress cigar_mceq_sat cigar_intensity cigar_bp cigar_omax cigar_finalpmax cigar_alpha if pid!=26

    Source | SS df MS Number of obs = 60
    -------------+---------------------------------- F(5, 54) = 1.44
    Model | 11.7738544 5 2.35477088 Prob > F = 0.2246
    Residual | 88.2465161 54 1.63419474 R-squared = 0.1177
    -------------+---------------------------------- Adj R-squared = 0.0360
    Total | 100.02037 59 1.69526052 Root MSE = 1.2784

    ---------------------------------------------------------------------------------
    cigar_mceq_sa~n | Coefficient Std. err. t P>|t| [95% conf. interval]
    ----------------+----------------------------------------------------------------
    cigar_intensity | .0047858 .0181438 0.26 0.793 -.0315904 .041162
    cigar_bp | .0038414 .0498663 0.08 0.939 -.0961345 .1038173
    cigar_omax | .0567864 .0555594 1.02 0.311 -.0546035 .1681762
    cigar_finalpmax | .029057 .1460401 0.20 0.843 -.2637358 .3218499
    cigar_alpha | .0352273 10.93667 0.00 0.997 -21.89148 21.96194
    _cons | 3.825782 .5033311 7.60 0.000 2.816664 4.8349
    ---------------------------------------------------------------------------------

    . regress cigar_mceq_sat cigar_intensity cigar_bp cigar_omax cigar_finalpmax cigar_alpha if pid!=26, vce(robust)

    Linear regression Number of obs = 60
    F(5, 54) = 1.68
    Prob > F = 0.1543
    R-squared = 0.1177
    Root MSE = 1.2784

    ---------------------------------------------------------------------------------
    | Robust
    cigar_mceq_sa~n | Coefficient std. err. t P>|t| [95% conf. interval]
    ----------------+----------------------------------------------------------------
    cigar_intensity | .0047858 .0140686 0.34 0.735 -.0234201 .0329916
    cigar_bp | .0038414 .0557472 0.07 0.945 -.107925 .1156078
    cigar_omax | .0567864 .0462053 1.23 0.224 -.0358497 .1494224
    cigar_finalpmax | .029057 .1563093 0.19 0.853 -.2843242 .3424383
    cigar_alpha | .0352273 7.080134 0.00 0.996 -14.15959 14.23004
    _cons | 3.825782 .4503772 8.49 0.000 2.92283 4.728734
    ---------------------------------------------------------------------------------


    I previously examined my data set for points of high leverage, residuals, and influence. This observation does have high leverage, but very low residuals, so it did not appear to have considerable influence. Also, when I calculated dfbetas for each predictor, ID 26 had a high influence for cigar_omax but it wasn't the observation with the highest influence.

    . quietly: regress cigar_mceq_sat cigar_intensity cigar_bp cigar_omax cigar_finalpmax cigar_alpha

    .
    . for var cigar_intensity cigar_bp cigar_omax cigar_finalpmax cigar_alpha: predict DFn_X, dfbeta(X)

    -> predict DFn_cigar_intensity, dfbeta(cigar_intensity)
    (4 missing values generated)

    -> predict DFn_cigar_bp, dfbeta(cigar_bp)
    (4 missing values generated)

    -> predict DFn_cigar_omax, dfbeta(cigar_omax)
    (4 missing values generated)

    -> predict DFn_cigar_finalpmax, dfbeta(cigar_finalpmax)
    (4 missing values generated)

    -> predict DFn_cigar_alpha, dfbeta(cigar_alpha)
    (4 missing values generated)

    .
    . quietly lv DFn_cigar_omax

    . sort DFn_cigar_omax


    . list pid DFn_cigar_omax cigar_omax if ( (DFn_cigar_omax >=( r(u_F) + 1.5*(r(u_F) - r(l_F)))) | (DFn_cigar_omax <=( r(l_F) - 1.5*(r(u_F) - r(l_F)))) ) & DFn_cigar_omax!=.

    +----------------------------+
    | pid DFn_~omax c~r_omax |
    |----------------------------|
    1. | 34 -.3107229 18 |
    2. | 56 -.2782602 20 |
    3. | 11 -.1311674 3 |
    4. | 7 -.113518 3 |
    56. | 72 .1330477 15 |
    |----------------------------|
    57. | 12 .1347013 .5 |
    58. | 62 .1422614 15 |
    59. | 17 .1662419 10 |
    60. | 26 .1842496 45 |
    61. | 55 .2723851 20 |
    +----------------------------+



    . lvr2plot, mlabel(pid)

    Click image for larger version

Name:	Graph2.png
Views:	1
Size:	39.7 KB
ID:	1624892


    So I am not sure what is a better course of action. I do not know why vce(robust) would change a statistical significance test if there was no heterskedasticity. I am also not sure why dropping that observation changes the test so much when it does not appear to have high influence.

    I appreciate any thoughts or advice. Thank you!

  • #2
    Heteroskedasticity is about the variance of the residuals (the y-axis of your graph, so ID 26 doesn't indicate much to me), and it seems plausible that there is some. In any case using the robust version of the covariance estimator almost always makes more sense because homoskedasticity rarely makes sense. If your results turn on the difference I would note that in your write up, something like "the data are compatible with cigar_omax having a positive effect on cigar_mseq_sat and with it having no effect." The methods Stata is employing here are consistent with very large amounts of data, how they perform differently with n=61 and what that indicates is not really something deserving strong scrutiny in my opinion.

    As a side note, ID 26 is increasing the variance of your design matrix in the cigar_omax dimension, which reduces se_beta_cigar_omax, remember beta=cov(y,x)/var(x), and when you drop ID 26 var(x) is drastically declining, increasing se_beta_cigar_omax, decreasing t etc.

    Comment


    • #3
      It seems to me that your residual vs fitted plot clearly shows heteroskedasticity, and that you should use robust variance.

      Comment

      Working...
      X