  • Sample split vs. interactions, different t-values

    Dear Statalisters,

    Let's say we are interested in the effect of mpg on price for domestic and foreign cars separately. We could split the sample:
    Code:
    sysuse auto, clear
    reg price mpg if foreign == 0
    reg price mpg if foreign == 1
    Or we could include an interaction:
    Code:
    reg price c.mpg##i.foreign
    margins foreign, dydx(mpg)
    The coefficients are exactly the same; however, the t-values and standard errors differ. Interestingly, for the mpg slope among foreign cars the sample split yields smaller standard errors, whereas for the mpg slope among domestic cars the interaction model yields smaller standard errors. What are the mechanics behind these differing standard errors? Results posted below.
    Code:
    . sysuse auto, clear
    (1978 Automobile Data)
    
    . reg price mpg if foreign == 0
    
          Source |       SS           df       MS      Number of obs   =        52
    -------------+----------------------------------   F(1, 50)        =     17.05
           Model |   124392956         1   124392956   Prob > F        =    0.0001
        Residual |   364801844        50  7296036.89   R-squared       =    0.2543
    -------------+----------------------------------   Adj R-squared   =    0.2394
           Total |   489194801        51  9592054.92   Root MSE        =    2701.1
    
    ------------------------------------------------------------------------------
           price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             mpg |  -329.2551   79.74034    -4.13   0.000    -489.4183   -169.0919
           _cons |   12600.54   1624.773     7.76   0.000     9337.085    15863.99
    ------------------------------------------------------------------------------
    
    . reg price mpg if foreign == 1
    
          Source |       SS           df       MS      Number of obs   =        22
    -------------+----------------------------------   F(1, 20)        =     13.25
           Model |  57534941.7         1  57534941.7   Prob > F        =    0.0016
        Residual |  86828271.1        20  4341413.55   R-squared       =    0.3985
    -------------+----------------------------------   Adj R-squared   =    0.3685
           Total |   144363213        21   6874438.7   Root MSE        =    2083.6
    
    ------------------------------------------------------------------------------
           price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             mpg |  -250.3668   68.77435    -3.64   0.002    -393.8276    -106.906
           _cons |   12586.95   1760.689     7.15   0.000     8914.217    16259.68
    ------------------------------------------------------------------------------
    Code:
    . reg price c.mpg##i.foreign
    
          Source |       SS           df       MS      Number of obs   =        74
    -------------+----------------------------------   F(3, 70)        =      9.48
           Model |   183435281         3  61145093.6   Prob > F        =    0.0000
        Residual |   451630115        70  6451858.79   R-squared       =    0.2888
    -------------+----------------------------------   Adj R-squared   =    0.2584
           Total |   635065396        73  8699525.97   Root MSE        =    2540.1
    
    -------------------------------------------------------------------------------
            price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    --------------+----------------------------------------------------------------
              mpg |  -329.2551   74.98545    -4.39   0.000    -478.8088   -179.7013
                  |
          foreign |
         Foreign  |  -13.58741   2634.664    -0.01   0.996    -5268.258    5241.084
                  |
    foreign#c.mpg |
         Foreign  |   78.88826   112.4812     0.70   0.485    -145.4485     303.225
                  |
            _cons |   12600.54   1527.888     8.25   0.000     9553.261    15647.81
    -------------------------------------------------------------------------------
    
    . margins foreign, dydx(mpg)
    
    Average marginal effects                        Number of obs     =         74
    Model VCE    : OLS
    
    Expression   : Linear prediction, predict()
    dy/dx w.r.t. : mpg
    
    ------------------------------------------------------------------------------
                 |            Delta-method
                 |      dy/dx   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
    mpg          |
         foreign |
       Domestic  |  -329.2551   74.98545    -4.39   0.000    -478.8088   -179.7013
        Foreign  |  -250.3668    83.8404    -2.99   0.004    -417.5812    -83.1524
    ------------------------------------------------------------------------------

  • #2
    The standard error, and hence the t-statistic, for a regression slope is a function of such things as the residual sum of squares, the sample size, and the sum of squares of the explanatory variable. These can differ between the two methods of analysis. One source for the relevant formulae is https://www3.nd.edu/~rwilliam/stats1/x91.pdf
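    Concretely, for a simple regression within one group, the slope's standard error works out to

    $$\widehat{\text{SE}}(\hat{\beta}_{1})=\sqrt{\frac{s^{2}}{\sum_{i=1}^{N}(x_{i}-\bar{x})^{2}}},\qquad s^{2}=\frac{\text{RSS}}{N-2},$$

    so a different residual sum of squares or different degrees of freedom moves the standard error even when the coefficient itself is unchanged.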
    Last edited by Mike Lacy; 14 Jul 2019, 09:11.



    • #3
      The sample size differs between the individual and joint regressions. Recall that the variance of the OLS estimator is

      $$\text{Var}(\beta)= \sigma^{2}(X^{\prime}X)^{-1}$$

      where we substitute \(\sigma^{2}\) with

      $$s^{2}= \frac{1}{N-K} \sum_{i=1}^{N} e_{i}^{2}.$$

      Thus \(s^{2}\) is a function of \(N\), the sample size; and \((X^{\prime}X)^{-1}\) also shrinks as observations are added, so both pieces of the variance depend on \(N\).

      Here is a simpler example, where the same regression yields the same coefficient estimates but the sample size differs.

      Code:
      sysuse auto, clear
      reg price mpg
      mat list e(V)
      * build the design matrix X = [mpg, constant]
      gen cons = 1
      mkmat mpg cons, mat(X)
      mat invxpx = invsym(X'*X)
      * residuals, then s^2 = e'e/(N-2) and V = s^2 * (X'X)^-1
      predict res, r
      mkmat res, mat(e)
      mat S2_1 = (1/(e(N)-2))*(e'*e)
      mat list S2_1
      mat V_1 = S2_1*invxpx
      mat list V_1
      * duplicate every observation and repeat the same calculation
      expand 2
      reg price mpg
      mat list e(V)
      mkmat mpg cons, mat(X)
      mat invxpx = invsym(X'*X)
      predict res2, r
      mkmat res2, mat(e)
      mat S2_2 = (1/(e(N)-2))*(e'*e)
      mat list S2_2
      mat V_2 = S2_2*invxpx
      mat list V_2
      Results:

      Same coefficient, different standard errors, due to the different sample size: duplicating the data leaves \(s^{2}\) almost unchanged, but \(X^{\prime}X\) doubles, so the variance roughly halves.

      Code:
      . reg price mpg
      
            Source |       SS           df       MS      Number of obs   =        74
      -------------+----------------------------------   F(1, 72)        =     20.26
             Model |   139449474         1   139449474   Prob > F        =    0.0000
          Residual |   495615923        72  6883554.48   R-squared       =    0.2196
      -------------+----------------------------------   Adj R-squared   =    0.2087
             Total |   635065396        73  8699525.97   Root MSE        =    2623.7
      
      ------------------------------------------------------------------------------
             price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
               mpg |  -238.8943   53.07669    -4.50   0.000    -344.7008   -133.0879
             _cons |   11253.06   1170.813     9.61   0.000     8919.088    13587.03
      ------------------------------------------------------------------------------
      
      . expand 2
      (74 observations created)
      
      .
      . reg price mpg
      
            Source |       SS           df       MS      Number of obs   =       148
      -------------+----------------------------------   F(1, 146)       =     41.08
             Model |   278898947         1   278898947   Prob > F        =    0.0000
          Residual |   991231845       146  6789259.21   R-squared       =    0.2196
      -------------+----------------------------------   Adj R-squared   =    0.2142
             Total |  1.2701e+09       147  8640345.53   Root MSE        =    2605.6
      
      ------------------------------------------------------------------------------
             price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
               mpg |  -238.8943   37.27294    -6.41   0.000    -312.5586   -165.2301
             _cons |   11253.06   822.1996    13.69   0.000      9628.11    12878.01
      ------------------------------------------------------------------------------

      Calculation of variance.

      Code:
      . mat list e(V)
      
      symmetric e(V)[2,2]
                    mpg       _cons
        mpg   2817.1347
      _cons  -59997.356   1370802.5
      
      . mat list S2_1
      
      symmetric S2_1[1,1]
                 res
      res  6883554.3
      
      
      . mat list V_1
      
      symmetric V_1[2,2]
                   mpg        cons
       mpg   2817.1346
      cons  -59997.354   1370802.5
      
      . mat list e(V)
      
      symmetric e(V)[2,2]
                    mpg       _cons
        mpg   1389.2719
      _cons  -29587.737   676012.21
      
      . mat list S2_2
      
      symmetric S2_2[1,1]
               res2
      res2  6789259
      
       
      . mat list V_2
      
      symmetric V_2[2,2]
                   mpg        cons
       mpg   1389.2719
      cons  -29587.736   676012.19



      • #4
        Thanks a lot Mike and Andrew.

        I expected the results to differ because of the sample size, but I can't intuitively get my head around how exactly the sample size affects the foreign and domestic standard errors for mpg, especially because one gets smaller while the other gets bigger. In the case of doubling the sample it's intuitively easier to see that the standard errors shrink. I must admit I'm a bit rusty on the formulas for standard errors, so I will use the reference and the code above to see if I can reproduce the results of the example in #1 and work out exactly what's going on.
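        For instance, the foreign-group standard errors from #1 can be rebuilt from \(\text{Var}(\text{slope}) = s^{2}/SS_{x}\), with the interaction model pooling \(s^{2}\) across groups while the split regressions use their own (a sketch; the scalar names are just illustrative):

        Code:
        sysuse auto, clear
        * pooled s^2 from the interaction model (70 residual df)
        quietly reg price c.mpg##i.foreign
        scalar s2_pooled = e(rmse)^2
        * sum of squares of mpg among foreign cars
        quietly summarize mpg if foreign == 1
        scalar ssx_f = (r(N) - 1)*r(Var)
        * interaction model: pooled s^2 over the group's SSx (~ 83.84)
        display "interaction SE (foreign):  " sqrt(s2_pooled/ssx_f)
        * split sample: the group's own s^2 over the same SSx (~ 68.77)
        quietly reg price mpg if foreign == 1
        scalar s2_split = e(rmse)^2
        display "split-sample SE (foreign): " sqrt(s2_split/ssx_f)

        Because the pooled \(s^{2}\) is larger than the foreign group's own \(s^{2}\), the interaction model reports the larger standard error for the foreign slope; for the domestic group the comparison runs the other way, which is exactly the pattern in #1.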



        • #5
          I believe the difference arises because when you estimate the equations separately, you're allowing for a kind of heteroskedasticity: the error variance can differ between foreign and domestic. When you include an interaction, you are imposing homoskedasticity: the variance is the same across foreign and domestic. Thus, the standard errors are computed under different assumptions about the error variance. And I don't think using the robust option everywhere will give you the same answer, because with the separate regressions you're allowing the variances to differ by foreign and then also making the standard errors robust to heteroskedasticity as a function of mpg. I believe the most robust standard errors are from

          Code:
          reg price c.mpg##foreign, robust
          margins foreign, dydx(mpg)
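          For comparison, here is a quick way to put robust standard errors on both approaches side by side (a sketch, not from the original post; per the above, the two sets should not be expected to agree exactly either):

          Code:
          sysuse auto, clear
          * split samples with robust SEs
          reg price mpg if foreign == 0, vce(robust)
          reg price mpg if foreign == 1, vce(robust)
          * interaction model with robust SEs
          reg price c.mpg##i.foreign, vce(robust)
          margins foreign, dydx(mpg)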



          • #6
            I second Jeff Wooldridge's explanation. One could go further in allowing for heteroskedasticity by adopting a mixed model and letting the residual variance vary by foreign. The resulting standard errors are a bit smaller, thanks to the added information from the extra observations, and the point estimates are exactly the same.

            Separate subset regressions:

            Code:
            * Domestic
            . reg price mpg if 0.foreign
            
                  Source |       SS           df       MS      Number of obs   =        52
            -------------+----------------------------------   F(1, 50)        =     17.05
                   Model |  1.24392961         1  1.24392961   Prob > F        =    0.0001
                Residual |   3.6480186        50  .072960372   R-squared       =    0.2543
            -------------+----------------------------------   Adj R-squared   =    0.2394
                   Total |  4.89194821        51  .095920553   Root MSE        =    .27011
            
            ------------------------------------------------------------------------------
                   price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
                     mpg |  -.0329255    .007974    -4.13   0.000    -.0489418   -.0169092
                   _cons |   1.260054   .1624773     7.76   0.000     .9337086    1.586399
            ------------------------------------------------------------------------------
            
            * Foreign
            . reg price mpg if 1.foreign
            
                  Source |       SS           df       MS      Number of obs   =        22
            -------------+----------------------------------   F(1, 20)        =     13.25
                   Model |  .575349439         1  .575349439   Prob > F        =    0.0016
                Residual |  .868282678        20  .043414134   R-squared       =    0.3985
            -------------+----------------------------------   Adj R-squared   =    0.3685
                   Total |  1.44363212        21  .068744387   Root MSE        =    .20836
            
            ------------------------------------------------------------------------------
                   price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
                     mpg |  -.0250367   .0068774    -3.64   0.002    -.0393828   -.0106906
                   _cons |   1.258695   .1760689     7.15   0.000     .8914217    1.625968
            ------------------------------------------------------------------------------
            Now fit a single model to the whole sample, with an interaction with foreign, and allow independent residual variances by foreign.

            Code:
            . mixed price c.mpg##i.foreign , resid(independent, by(foreign))
            
            // output omitted
            
            Mixed-effects ML regression                     Number of obs     =         74
            Group variable: _all                            Number of groups  =          1
            
                                                            Obs per group:
                                                                          min =         74
                                                                          avg =       74.0
                                                                          max =         74
            
                                                            Wald chi2(3)      =      32.62
            Log likelihood = -.36281878                     Prob > chi2       =     0.0000
            
            -------------------------------------------------------------------------------
                    price |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
            --------------+----------------------------------------------------------------
                      mpg |  -.0329255   .0078192    -4.21   0.000    -.0482508   -.0176002
                          |
                  foreign |
                 Foreign  |  -.0013587   .2314424    -0.01   0.995    -.4549775    .4522601
                          |
            foreign#c.mpg |
                 Foreign  |   .0078888   .0102048     0.77   0.439    -.0121123      .02789
                          |
                    _cons |   1.260054   .1593221     7.91   0.000     .9477882    1.572319
            -------------------------------------------------------------------------------
            
            ------------------------------------------------------------------------------
              Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
            -----------------------------+------------------------------------------------
            _all:                (empty) |
            -----------------------------+------------------------------------------------
            Residual: Independent,       |
                by foreign               |
                        Domestic: var(e) |   .0701542   .0137584       .047766    .1030358
                         Foreign: var(e) |   .0394674   .0118999       .021857    .0712665
            ------------------------------------------------------------------------------
            LR test vs. linear model: chi2(1) = 2.35                  Prob > chi2 = 0.1256
            
            Note: The reported degrees of freedom assumes the null hypothesis is not on the boundary of the parameter space.  If this is
                  not true, then the reported test is conservative.
            
            . margins foreign, dydx(mpg)
            
            Average marginal effects                        Number of obs     =         74
            
            Expression   : Linear prediction, fixed portion, predict()
            dy/dx w.r.t. : mpg
            
            ------------------------------------------------------------------------------
                         |            Delta-method
                         |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
            mpg          |
                 foreign |
               Domestic  |  -.0329255   .0078192    -4.21   0.000    -.0482508   -.0176002
                Foreign  |  -.0250367   .0065574    -3.82   0.000    -.0378889   -.0121845
            ------------------------------------------------------------------------------
            Notice the following:
            • all 74 observations are used
            • the estimated average marginal effects of mpg within each level of foreign (-0.033 and -0.025) are the same as in the subset regressions, but the mixed model estimates slightly smaller standard errors for those marginal effects
            • the residual variances for domestic cars (0.073) and foreign cars (0.043) in the subset regressions are slightly larger than their respective estimates (0.070 and 0.039) from the mixed model (see the quick check below)
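            The last point is just the ML divisor at work: the mixed ML estimates divide each group's residual sum of squares by \(N_g\) rather than \(N_g-2\) (e.g. \(0.0730 \times 50/52 \approx 0.0702\)). A quick check, assuming price was rescaled to units of $10,000 as the output above suggests (the variable name is illustrative):

            Code:
            sysuse auto, clear
            gen double price10k = price/10000   // assumed rescaling, to match the output above
            quietly reg price10k mpg if foreign == 0
            display e(rss)/e(N)                 // ~ .0702, the Domestic var(e) from mixed
            quietly reg price10k mpg if foreign == 1
            display e(rss)/e(N)                 // ~ .0395, the Foreign var(e) from mixed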



            • #7
              Thank you Jeff and Leonardo for your insights.

              Very interesting to see the results of the mixed model with the residual variance varying between the two groups. If you'll allow me one follow-up question: when estimating effects across groups in the presence of groupwise heteroskedasticity, is the mixed-model approach with varying residual variance an alternative to estimating with robust standard errors? And is it preferred, given the smaller standard errors?



              • #8
                No, I don't think that the mixed approach is any more robust to heteroskedasticity. One could use robust variance estimators, but what is recommended depends on your specific case. There's very little practical difference in this toy example.



                • #9
                  Originally posted by Leonardo Guizzetti:
                  No, I don't think that the mixed approach is any more robust to heteroskedasticity. One could use robust variance estimators, but what is recommended depends on your specific case. There's very little practical difference in this toy example.
                  Using the mixed model, where the heteroskedasticity is assumed to depend only on the split variable, is not as general as making the standard errors robust. It's kind of cheating compared with fully robust standard errors. However, there is an argument to be made for modeling relatively simple forms of heteroskedasticity, as in a mixed model, and then making inference robust. In fact, in my introductory econometrics book, in the material on weighted least squares, I propose exactly that: use some model of heteroskedasticity to try to improve on the efficiency of OLS, but always use robust standard errors. Unfortunately, I can't get that to work with the mixed command. But one could do the WLS by hand, of course.
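                  A by-hand version of that recipe might look like the following (a sketch under the groupwise-variance assumption; the variable names are just illustrative):

                  Code:
                  sysuse auto, clear
                  * step 1: OLS with the full interaction; get residuals
                  quietly reg price c.mpg##i.foreign
                  predict double ehat, residuals
                  * step 2: model the heteroskedasticity as groupwise only
                  gen double ehat2 = ehat^2
                  bysort foreign: egen double sig2 = mean(ehat2)
                  * step 3: WLS with weights 1/sig2, keeping robust inference
                  reg price c.mpg##i.foreign [aweight = 1/sig2], vce(robust)
                  margins foreign, dydx(mpg)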

                  JW
