  • Sample split vs. interactions, different t-values

    Dear Statalisters,

    Let's say we are interested in the effect of mpg on price for domestic and foreign cars separately. We could split the sample:
    Code:
    sysuse auto, clear
    reg price mpg if foreign == 0
    reg price mpg if foreign == 1
    Or we could include an interaction:
    Code:
    reg price c.mpg##i.foreign
    margins foreign, dydx(mpg)
    The coefficients are exactly the same; however, the t-values and standard errors differ. Interestingly, for the mpg slope among foreign cars the sample split yields smaller standard errors, whereas for the mpg slope among domestic cars the interaction model yields smaller standard errors. What are the mechanics behind these differing standard errors? Results posted below.
    Code:
    . sysuse auto, clear
    (1978 Automobile Data)
    
    . reg price mpg if foreign == 0
    
          Source |       SS           df       MS      Number of obs   =        52
    -------------+----------------------------------   F(1, 50)        =     17.05
           Model |   124392956         1   124392956   Prob > F        =    0.0001
        Residual |   364801844        50  7296036.89   R-squared       =    0.2543
    -------------+----------------------------------   Adj R-squared   =    0.2394
           Total |   489194801        51  9592054.92   Root MSE        =    2701.1
    
    ------------------------------------------------------------------------------
           price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             mpg |  -329.2551   79.74034    -4.13   0.000    -489.4183   -169.0919
           _cons |   12600.54   1624.773     7.76   0.000     9337.085    15863.99
    ------------------------------------------------------------------------------
    
    . reg price mpg if foreign == 1
    
          Source |       SS           df       MS      Number of obs   =        22
    -------------+----------------------------------   F(1, 20)        =     13.25
           Model |  57534941.7         1  57534941.7   Prob > F        =    0.0016
        Residual |  86828271.1        20  4341413.55   R-squared       =    0.3985
    -------------+----------------------------------   Adj R-squared   =    0.3685
           Total |   144363213        21   6874438.7   Root MSE        =    2083.6
    
    ------------------------------------------------------------------------------
           price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             mpg |  -250.3668   68.77435    -3.64   0.002    -393.8276    -106.906
           _cons |   12586.95   1760.689     7.15   0.000     8914.217    16259.68
    ------------------------------------------------------------------------------
    Code:
    . reg price c.mpg##i.foreign
    
          Source |       SS           df       MS      Number of obs   =        74
    -------------+----------------------------------   F(3, 70)        =      9.48
           Model |   183435281         3  61145093.6   Prob > F        =    0.0000
        Residual |   451630115        70  6451858.79   R-squared       =    0.2888
    -------------+----------------------------------   Adj R-squared   =    0.2584
           Total |   635065396        73  8699525.97   Root MSE        =    2540.1
    
    -------------------------------------------------------------------------------
            price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    --------------+----------------------------------------------------------------
              mpg |  -329.2551   74.98545    -4.39   0.000    -478.8088   -179.7013
                  |
          foreign |
         Foreign  |  -13.58741   2634.664    -0.01   0.996    -5268.258    5241.084
                  |
    foreign#c.mpg |
         Foreign  |   78.88826   112.4812     0.70   0.485    -145.4485     303.225
                  |
            _cons |   12600.54   1527.888     8.25   0.000     9553.261    15647.81
    -------------------------------------------------------------------------------
    
    . margins foreign, dydx(mpg)
    
    Average marginal effects                        Number of obs     =         74
    Model VCE    : OLS
    
    Expression   : Linear prediction, predict()
    dy/dx w.r.t. : mpg
    
    ------------------------------------------------------------------------------
                 |            Delta-method
                 |      dy/dx   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
    mpg          |
         foreign |
       Domestic  |  -329.2551   74.98545    -4.39   0.000    -478.8088   -179.7013
        Foreign  |  -250.3668    83.8404    -2.99   0.004    -417.5812    -83.1524
    ------------------------------------------------------------------------------

  • #2
    The standard error, and hence the t-statistic, for a regression slope is a function of such things as the residual sum of squares, the sample size, and the sum of squares of the explanatory variable. These can differ between the two methods of analysis. One source for the relevant formulae is https://www3.nd.edu/~rwilliam/stats1/x91.pdf
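    Concretely, for a simple regression within one group, the slope's standard error works out to

    $$\widehat{\text{SE}}(\hat{\beta}_{1})=\sqrt{\frac{s^{2}}{\sum_{i=1}^{N}(x_{i}-\bar{x})^{2}}},\qquad s^{2}=\frac{\text{RSS}}{N-2},$$

    so a different residual sum of squares or different degrees of freedom moves the standard error even when the coefficient itself is unchanged.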
    Last edited by Mike Lacy; 14 Jul 2019, 09:11.



    • #3
      The sample size differs between the individual and joint regressions. Recall that the variance of the OLS estimator is

      $$\text{Var}(\beta)= \sigma^{2}(X^{\prime}X)^{-1}$$

      where we substitute \(\sigma^{2}\) with

      $$s^{2}= \frac{1}{N-K} \sum_{i=1}^{N} e_{i}^{2}.$$

      Thus \(s^{2}\) is a function of \(N\), the sample size; and \((X^{\prime}X)^{-1}\) also shrinks as observations are added, so both pieces of the variance depend on \(N\).

      Here is a simpler example, where the same regression yields the same coefficient estimates but the sample size differs.

      Code:
      sysuse auto, clear
      reg price mpg
      mat list e(V)
      * build the design matrix X = [mpg, constant]
      gen cons = 1
      mkmat mpg cons, mat(X)
      mat invxpx = invsym(X'*X)
      * residuals, then s^2 = e'e/(N-2) and V = s^2 * (X'X)^-1
      predict res, r
      mkmat res, mat(e)
      mat S2_1 = (1/(e(N)-2))*(e'*e)
      mat list S2_1
      mat V_1 = S2_1*invxpx
      mat list V_1
      * duplicate every observation and repeat the same calculation
      expand 2
      reg price mpg
      mat list e(V)
      mkmat mpg cons, mat(X)
      mat invxpx = invsym(X'*X)
      predict res2, r
      mkmat res2, mat(e)
      mat S2_2 = (1/(e(N)-2))*(e'*e)
      mat list S2_2
      mat V_2 = S2_2*invxpx
      mat list V_2
      Results:

      Same coefficient, different standard errors, due to the different sample size: duplicating the data leaves \(s^{2}\) almost unchanged, but \(X^{\prime}X\) doubles, so the variance roughly halves.

      Code:
      . reg price mpg
      
            Source |       SS           df       MS      Number of obs   =        74
      -------------+----------------------------------   F(1, 72)        =     20.26
             Model |   139449474         1   139449474   Prob > F        =    0.0000
          Residual |   495615923        72  6883554.48   R-squared       =    0.2196
      -------------+----------------------------------   Adj R-squared   =    0.2087
             Total |   635065396        73  8699525.97   Root MSE        =    2623.7
      
      ------------------------------------------------------------------------------
             price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
               mpg |  -238.8943   53.07669    -4.50   0.000    -344.7008   -133.0879
             _cons |   11253.06   1170.813     9.61   0.000     8919.088    13587.03
      ------------------------------------------------------------------------------
      
      . expand 2
      (74 observations created)
      
      .
      . reg price mpg
      
            Source |       SS           df       MS      Number of obs   =       148
      -------------+----------------------------------   F(1, 146)       =     41.08
             Model |   278898947         1   278898947   Prob > F        =    0.0000
          Residual |   991231845       146  6789259.21   R-squared       =    0.2196
      -------------+----------------------------------   Adj R-squared   =    0.2142
             Total |  1.2701e+09       147  8640345.53   Root MSE        =    2605.6
      
      ------------------------------------------------------------------------------
             price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
               mpg |  -238.8943   37.27294    -6.41   0.000    -312.5586   -165.2301
             _cons |   11253.06   822.1996    13.69   0.000      9628.11    12878.01
      ------------------------------------------------------------------------------

      Calculation of variance.

      Code:
      . mat list e(V)
      
      symmetric e(V)[2,2]
                    mpg       _cons
        mpg   2817.1347
      _cons  -59997.356   1370802.5
      
      . mat list S2_1
      
      symmetric S2_1[1,1]
                 res
      res  6883554.3
      
      
      . mat list V_1
      
      symmetric V_1[2,2]
                   mpg        cons
       mpg   2817.1346
      cons  -59997.354   1370802.5
      
      . mat list e(V)
      
      symmetric e(V)[2,2]
                    mpg       _cons
        mpg   1389.2719
      _cons  -29587.737   676012.21
      
      . mat list S2_2
      
      symmetric S2_2[1,1]
               res2
      res2  6789259
      
       
      . mat list V_2
      
      symmetric V_2[2,2]
                   mpg        cons
       mpg   1389.2719
      cons  -29587.736   676012.19



      • #4
        Thanks a lot Mike and Andrew.

        I expected the results to differ because of the sample size, but I can't intuitively get my head around how exactly the sample size affects the foreign and domestic standard errors for mpg, especially because one gets smaller while the other gets bigger. In the case of doubling the sample it's intuitively easier to see that the standard errors shrink. I must admit I'm a bit rusty on the formulas for standard errors, so I will use the reference and the code above to see if I can reproduce the results of the example in #1 and work out exactly what's going on.
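        For instance, the foreign-group standard errors from #1 can be rebuilt from \(\text{Var}(\text{slope}) = s^{2}/SS_{x}\), with the interaction model pooling \(s^{2}\) across groups while the split regressions use their own (a sketch; the scalar names are just illustrative):

        Code:
        sysuse auto, clear
        * pooled s^2 from the interaction model (70 residual df)
        quietly reg price c.mpg##i.foreign
        scalar s2_pooled = e(rmse)^2
        * sum of squares of mpg among foreign cars
        quietly summarize mpg if foreign == 1
        scalar ssx_f = (r(N) - 1)*r(Var)
        * interaction model: pooled s^2 over the group's SSx (~ 83.84)
        display "interaction SE (foreign):  " sqrt(s2_pooled/ssx_f)
        * split sample: the group's own s^2 over the same SSx (~ 68.77)
        quietly reg price mpg if foreign == 1
        scalar s2_split = e(rmse)^2
        display "split-sample SE (foreign): " sqrt(s2_split/ssx_f)

        Because the pooled \(s^{2}\) is larger than the foreign group's own \(s^{2}\), the interaction model reports the larger standard error for the foreign slope; for the domestic group the comparison runs the other way, which is exactly the pattern in #1.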



        • #5
          I believe the difference arises because when you estimate the equations separately, you're allowing for a kind of heteroskedasticity: the error variance can differ between foreign and domestic. When you include an interaction, you are imposing homoskedasticity: the variance is the same across foreign and domestic. Thus, the standard errors are computed under different assumptions about the error variance. And I don't think using the robust option everywhere will give you the same answer, because with the separate regressions you're allowing the variances to differ by foreign and then also making the standard errors robust to heteroskedasticity as a function of mpg. I believe the most robust standard errors are from

          Code:
          reg price c.mpg##foreign, robust
          margins foreign, dydx(mpg)
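          For comparison, here is a quick way to put robust standard errors on both approaches side by side (a sketch, not from the original post; per the above, the two sets should not be expected to agree exactly either):

          Code:
          sysuse auto, clear
          * split samples with robust SEs
          reg price mpg if foreign == 0, vce(robust)
          reg price mpg if foreign == 1, vce(robust)
          * interaction model with robust SEs
          reg price c.mpg##i.foreign, vce(robust)
          margins foreign, dydx(mpg)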



          • #6
            I second Jeff Wooldridge's explanation. One could go further in allowing for heteroskedasticity by adopting a mixed model and letting the residual variance vary by foreign. The resulting standard errors are a bit smaller, thanks to the added information from the extra observations, and the point estimates are exactly the same.

            Separate subset regressions:

            Code:
            * Domestic
            . reg price mpg if 0.foreign
            
                  Source |       SS           df       MS      Number of obs   =        52
            -------------+----------------------------------   F(1, 50)        =     17.05
                   Model |  1.24392961         1  1.24392961   Prob > F        =    0.0001
                Residual |   3.6480186        50  .072960372   R-squared       =    0.2543
            -------------+----------------------------------   Adj R-squared   =    0.2394
                   Total |  4.89194821        51  .095920553   Root MSE        =    .27011
            
            ------------------------------------------------------------------------------
                   price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
                     mpg |  -.0329255    .007974    -4.13   0.000    -.0489418   -.0169092
                   _cons |   1.260054   .1624773     7.76   0.000     .9337086    1.586399
            ------------------------------------------------------------------------------
            
            * Foreign
            . reg price mpg if 1.foreign
            
                  Source |       SS           df       MS      Number of obs   =        22
            -------------+----------------------------------   F(1, 20)        =     13.25
                   Model |  .575349439         1  .575349439   Prob > F        =    0.0016
                Residual |  .868282678        20  .043414134   R-squared       =    0.3985
            -------------+----------------------------------   Adj R-squared   =    0.3685
                   Total |  1.44363212        21  .068744387   Root MSE        =    .20836
            
            ------------------------------------------------------------------------------
                   price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
                     mpg |  -.0250367   .0068774    -3.64   0.002    -.0393828   -.0106906
                   _cons |   1.258695   .1760689     7.15   0.000     .8914217    1.625968
            ------------------------------------------------------------------------------
            Now fit a single model to the whole sample, with an interaction with foreign, and allow independent residual variances by foreign.

            Code:
            . mixed price c.mpg##i.foreign , resid(independent, by(foreign))
            
            // output omitted
            
            Mixed-effects ML regression                     Number of obs     =         74
            Group variable: _all                            Number of groups  =          1
            
                                                            Obs per group:
                                                                          min =         74
                                                                          avg =       74.0
                                                                          max =         74
            
                                                            Wald chi2(3)      =      32.62
            Log likelihood = -.36281878                     Prob > chi2       =     0.0000
            
            -------------------------------------------------------------------------------
                    price |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
            --------------+----------------------------------------------------------------
                      mpg |  -.0329255   .0078192    -4.21   0.000    -.0482508   -.0176002
                          |
                  foreign |
                 Foreign  |  -.0013587   .2314424    -0.01   0.995    -.4549775    .4522601
                          |
            foreign#c.mpg |
                 Foreign  |   .0078888   .0102048     0.77   0.439    -.0121123      .02789
                          |
                    _cons |   1.260054   .1593221     7.91   0.000     .9477882    1.572319
            -------------------------------------------------------------------------------
            
            ------------------------------------------------------------------------------
              Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
            -----------------------------+------------------------------------------------
            _all:                (empty) |
            -----------------------------+------------------------------------------------
            Residual: Independent,       |
                by foreign               |
                        Domestic: var(e) |   .0701542   .0137584       .047766    .1030358
                         Foreign: var(e) |   .0394674   .0118999       .021857    .0712665
            ------------------------------------------------------------------------------
            LR test vs. linear model: chi2(1) = 2.35                  Prob > chi2 = 0.1256
            
            Note: The reported degrees of freedom assumes the null hypothesis is not on the boundary of the parameter space.  If this is
                  not true, then the reported test is conservative.
            
            . margins foreign, dydx(mpg)
            
            Average marginal effects                        Number of obs     =         74
            
            Expression   : Linear prediction, fixed portion, predict()
            dy/dx w.r.t. : mpg
            
            ------------------------------------------------------------------------------
                         |            Delta-method
                         |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
            -------------+----------------------------------------------------------------
            mpg          |
                 foreign |
               Domestic  |  -.0329255   .0078192    -4.21   0.000    -.0482508   -.0176002
                Foreign  |  -.0250367   .0065574    -3.82   0.000    -.0378889   -.0121845
            ------------------------------------------------------------------------------
            Notice the following:
            • all 74 observations are used
            • the estimated average marginal effects of mpg within each level of foreign (-0.033 and -0.025) are the same as in the subset regressions, but the mixed model estimates slightly smaller standard errors for those marginal effects
            • the residual variances for domestic cars (0.073) and foreign cars (0.043) in the subset regressions are slightly larger than their respective estimates (0.070 and 0.039) from the mixed model (see the quick check below)
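            The last point is just the ML divisor at work: the mixed ML estimates divide each group's residual sum of squares by \(N_g\) rather than \(N_g-2\) (e.g. \(0.0730 \times 50/52 \approx 0.0702\)). A quick check, assuming price was rescaled to units of $10,000 as the output above suggests (the variable name is illustrative):

            Code:
            sysuse auto, clear
            gen double price10k = price/10000   // assumed rescaling, to match the output above
            quietly reg price10k mpg if foreign == 0
            display e(rss)/e(N)                 // ~ .0702, the Domestic var(e) from mixed
            quietly reg price10k mpg if foreign == 1
            display e(rss)/e(N)                 // ~ .0395, the Foreign var(e) from mixed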



            • #7
              Thank you Jeff and Leonardo for your insights.

              Very interesting to see the results of the mixed model with the residual variance varying between the two groups. If you'll allow me one follow-up question: when estimating effects across groups in the presence of groupwise heteroskedasticity, is the mixed-model approach with varying residual variance an alternative to estimating with robust standard errors? And is it preferred, given the smaller standard errors?



              • #8
                No, I don't think that the mixed approach is any more robust to heteroskedasticity. One could use robust variance estimators, but what is recommended depends on your specific case. There's very little practical difference in this toy example.



                • #9
                  Originally posted by Leonardo Guizzetti:
                  No, I don't think that the mixed approach is any more robust to heteroskedasticity. One could use robust variance estimators, but what is recommended depends on your specific case. There's very little practical difference in this toy example.
                  Using the mixed model, where the heteroskedasticity is assumed to depend only on the split variable, is not as general as making the standard errors robust. It's kind of cheating compared with fully robust standard errors. However, there is an argument to be made for modeling relatively simple forms of heteroskedasticity, as in a mixed model, and then making inference robust. In fact, in my introductory econometrics book, in the material on weighted least squares, I propose exactly that: use some model of heteroskedasticity to try to improve on the efficiency of OLS, but always use robust standard errors. Unfortunately, I can't get that to work with the mixed command. But one could do the WLS by hand, of course.
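                  A by-hand version of that recipe might look like the following (a sketch under the groupwise-variance assumption; the variable names are just illustrative):

                  Code:
                  sysuse auto, clear
                  * step 1: OLS with the full interaction; get residuals
                  quietly reg price c.mpg##i.foreign
                  predict double ehat, residuals
                  * step 2: model the heteroskedasticity as groupwise only
                  gen double ehat2 = ehat^2
                  bysort foreign: egen double sig2 = mean(ehat2)
                  * step 3: WLS with weights 1/sig2, keeping robust inference
                  reg price c.mpg##i.foreign [aweight = 1/sig2], vce(robust)
                  margins foreign, dydx(mpg)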

                  JW
