  • Standard errors for "regress ... , robust" depend on how data are sorted

    Hi,

    I am using a census dataset and ran OLS regressions with a binary dependent variable ("employed"), a binary explanatory variable (yob_1919_age, equal to one if the respondent was born in 1919), and two continuous explanatory variables (yob_age and yob2_age, the year of birth and the year of birth squared; please ignore the "_age" suffix in the variable names, which has no meaning).

    I found that the standard errors and p-values differ quite a lot depending on whether the dataset is sorted by age in ascending or descending order (please see the screenshot of the regression output below). This only happens when I use the "robust" option.
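
    In essence, the two runs compare something like the following (just a sketch: "age" stands in for the sorting variable, and the sample restriction described below is omitted):

    Code:
    * ascending sort, then the regression with robust standard errors
    sort age
    regress employed yob_1919_age yob_age yob2_age, robust

    * descending sort, then the same regression
    gsort - age
    regress employed yob_1919_age yob_age yob2_age, robust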

    The F statistic is missing, which could indicate that there is something wrong with the model. I limited the regression to those born from 1912 to 1919, and my guess is that the combination of a 1919 dummy and a binary outcome leaves insufficient variation in one of the cells. However, I am still very surprised that the result depends on how the data are sorted (rather than the standard errors simply being reported as dots). I'd be thankful for any hint as to what may be causing this.
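
    For reference, a tabulation along these lines would show whether one of those cells is in fact empty or nearly so:

    Code:
    * cross-tabulate the outcome against the 1919 dummy within the estimation sample
    tabulate employed yob_1919_age if inrange(yob_age, 1912, 1919)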

    I'm using Stata 16.1 on a MacBook, but I could replicate the problem on a Windows PC.

    Best regards,
    Christian

    [Screenshot: regression output with the robust option, under ascending and descending sort]

  • #2
    I have to correct myself: even when omitting "robust", there are still slight differences between the two regression outputs (please see below). In particular, the coefficients differ slightly.

    Best regards,
    Christian

    [Screenshot: regression outputs under both sort orders, without the robust option]



    • #3
      How are those variables scaled? Is the predicted probability of being employed for someone born in 1919 really approximately 5,854,600 percent? Those models appear to be seriously flawed, not least in their specification.

      My guess is that the variables are scaled in such an awkward way that the order in which the sums of squares are calculated matters and leads to precision problems. Stata's regress usually does a good job of rescaling the variables internally, so that precision problems do not arise; perhaps there are limits to that.


      Here is an awkward example, merely to illustrate the point of precision problems when calculating sums:

      Code:
      . clear all
      
      . set obs 26411
      Number of observations (_N) was 0, now 26,411.
      
      . generate double x = 999999999991+_n
      
      . 
      . gsort - x
      
      . generate double sum1 = sum(x)
      
      . 
      . gsort + x
      
      . generate double sum2 = sum(x)
      
      . 
      . su sum1 , meanonly
      
      . scalar max1 = r(max)
      
      . su sum2 , meanonly
      
      . scalar max2 = r(max)
      
      . 
      . display %21x max1
      +1.7752a8d7c0d6aX+036
      
      . display %21x max2
      +1.7752a8d7bfc6cX+036
      
      . 
      . assert max1 == max2
      assertion is false
      r(9);
      
      end of do-file
      
      r(9);
      Last edited by daniel klein; 25 Nov 2022, 09:40.



      • #4
        That's very odd. I can't reproduce that problem with a couple of data sets I tried.



        • #5
          Here is another silly example closer to what is presented in #1:

          Code:
          . version 17
          
          . set seed 42
          
          .
          . clear
          
          .
          . set obs 26411
          Number of observations (_N) was 0, now 26,411.
          
          . generate double yob = runiformint(1912, 1919)
          
          . generate yob_1919 = yob == 1919
          
          . generate double yob2 = yob^2
          
          . generate employed = runiform() > .5
          
          .
          . sort yob
          
          . regress employed yob_1919 yob yob2
          
                Source |       SS           df       MS      Number of obs   =    26,411
          -------------+----------------------------------   F(3, 26407)     =      0.59
                 Model |  .442991525         3  .147663842   Prob > F        =    0.6211
              Residual |  6602.05551    26,407  .250011569   R-squared       =    0.0001
          -------------+----------------------------------   Adj R-squared   =   -0.0000
                 Total |   6602.4985    26,410  .249999943   Root MSE        =    .50001
          
          ------------------------------------------------------------------------------
              employed | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
          -------------+----------------------------------------------------------------
              yob_1919 |   -.017137   .0160737    -1.07   0.286    -.0486423    .0143683
                   yob |  -4.459614   3.620588    -1.23   0.218    -11.55616    2.636933
                  yob2 |   .0011643   .0009453     1.23   0.218    -.0006886    .0030172
                 _cons |   4270.877   3466.726     1.23   0.218    -2524.092    11065.85
          ------------------------------------------------------------------------------
          
          .
          . gsort - yob
          
          . regress employed yob_1919 yob yob2
          
                Source |       SS           df       MS      Number of obs   =    26,411
          -------------+----------------------------------   F(3, 26407)     =      0.59
                 Model |  .442991525         3  .147663842   Prob > F        =    0.6211
              Residual |  6602.05551    26,407  .250011569   R-squared       =    0.0001
          -------------+----------------------------------   Adj R-squared   =   -0.0000
                 Total |   6602.4985    26,410  .249999943   Root MSE        =    .50001
          
          ------------------------------------------------------------------------------
              employed | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
          -------------+----------------------------------------------------------------
              yob_1919 |   -.017137   .0160737    -1.07   0.286    -.0486423    .0143683
                   yob |  -4.459606   3.620585    -1.23   0.218    -11.55615    2.636935
                  yob2 |   .0011643   .0009453     1.23   0.218    -.0006886    .0030172
                 _cons |    4270.87   3466.723     1.23   0.218    -2524.094    11065.83
          ------------------------------------------------------------------------------



          • #6
            Some related themes are picked up in https://www.stata-journal.com/articl...article=st0394. It took me some years from thinking up the title "Species of origin" to knowing what would go in the paper.

            John Tukey used to urge "estimate centercepts, not intercepts". The urging seems to have been in teaching and advising and not written down, but Howard Wainer and Roger Koenker have both echoed the advice. Anyone who knows a printed source for this by Tukey would place me in their debt.

            Sir David Cox and many others used to emphasise similarly that regression is better thought of as starting from (y - mean of y) = b (x - mean of x).
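
            As a minimal sketch of that idea (using the auto toy data rather than anything from this thread), centring the predictor leaves the slope unchanged but makes the constant interpretable, and centring the outcome as well reproduces the form above:

            Code:
            * a minimal sketch of centring before regression, with the auto data
            sysuse auto, clear

            summarize weight, meanonly
            generate double weight_c = weight - r(mean)   // centred predictor

            summarize mpg, meanonly
            generate double mpg_c = mpg - r(mean)         // centred outcome

            regress mpg weight       // intercept refers to weight == 0, far outside the data
            regress mpg weight_c     // same slope; _cons is now the mean of mpg
            regress mpg_c weight_c   // same slope; _cons is essentially zero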
            Last edited by Nick Cox; 25 Nov 2022, 11:12.



            • #7
              Thanks for checking, Daniel. The variables in my dataset are scaled in the way you scaled them for your example. I understand that the linear probability model I used has problems, but my main worry is that there might be a bug somewhere in the regress command. Now that you managed to reproduce the problem, would you still go with your guess that this is a scaling issue (such that rounding errors occur when calculating the sum of squares)?



              • #8
                Originally posted by Christian Bommer:
                Now that you managed to reproduce the problem, would you still go with your guess that this is a scaling issue (such that rounding errors occur when calculating the sum of squares)?
                Yes.

                Code:
                . version 17
                
                . set seed 42
                
                . 
                . clear
                
                . 
                . set obs 26411
                Number of observations (_N) was 0, now 26,411.
                
                . generate double yob = runiformint(1912, 1919)
                
                . generate yob_1919 = yob == 1919
                
                . generate double yob2 = yob^2
                
                . generate employed = runiform() > .5
                
                . 
                . generate double yob_rescaled = yob - 1912
                
                . generate double yob_rescaled2 = yob_rescaled^2
                
                . 
                . sort yob
                
                . regress employed yob_1919 yob_rescaled yob_rescaled2
                
                      Source |       SS           df       MS      Number of obs   =    26,411
                -------------+----------------------------------   F(3, 26407)     =      0.59
                       Model |  .442991525         3  .147663842   Prob > F        =    0.6211
                    Residual |  6602.05551    26,407  .250011569   R-squared       =    0.0001
                -------------+----------------------------------   Adj R-squared   =   -0.0000
                       Total |   6602.4985    26,410  .249999943   Root MSE        =    .50001
                
                -------------------------------------------------------------------------------
                     employed | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
                --------------+----------------------------------------------------------------
                     yob_1919 |  -.0171371   .0160737    -1.07   0.286    -.0486425    .0143682
                 yob_rescaled |  -.0073039   .0059211    -1.23   0.217    -.0189096    .0043018
                yob_rescaled2 |   .0011643   .0009453     1.23   0.218    -.0006886    .0030172
                        _cons |   .5042408   .0076283    66.10   0.000     .4892889    .5191928
                -------------------------------------------------------------------------------
                
                . 
                . gsort - yob
                
                . regress employed yob_1919 yob_rescaled yob_rescaled2
                
                      Source |       SS           df       MS      Number of obs   =    26,411
                -------------+----------------------------------   F(3, 26407)     =      0.59
                       Model |  .442991525         3  .147663842   Prob > F        =    0.6211
                    Residual |  6602.05551    26,407  .250011569   R-squared       =    0.0001
                -------------+----------------------------------   Adj R-squared   =   -0.0000
                       Total |   6602.4985    26,410  .249999943   Root MSE        =    .50001
                
                -------------------------------------------------------------------------------
                     employed | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
                --------------+----------------------------------------------------------------
                     yob_1919 |  -.0171371   .0160737    -1.07   0.286    -.0486425    .0143682
                 yob_rescaled |  -.0073039   .0059211    -1.23   0.217    -.0189096    .0043018
                yob_rescaled2 |   .0011643   .0009453     1.23   0.218    -.0006886    .0030172
                        _cons |   .5042408   .0076283    66.10   0.000     .4892889    .5191928
                -------------------------------------------------------------------------------
                Last edited by daniel klein; 25 Nov 2022, 11:29. Reason: generated variables in float now double; results remain stable



                • #9
                  Thanks, that's very helpful!
