  • Standard errors for "regress ... , robust" depend on how data are sorted

    Hi,

    I am using a census dataset and ran OLS regressions with a binary dependent variable ("employed"), a binary explanatory variable (yob_1919_age, equal to one if the respondent was born in 1919), and two continuous explanatory variables (yob_age and yob2_age, the year of birth and the year of birth squared; please ignore the "_age" suffix in the variable names, which has no meaning).

    I found that the standard errors and p-values differ quite a lot depending on whether the dataset is sorted by age in ascending or descending order (please see the screenshot of the regression output below). This only happens when I use the "robust" option.
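
    In essence, the two runs compare something like the following (just a sketch: "age" stands in for the sorting variable, and the sample restriction described below is omitted):

    Code:
    * ascending sort, then the regression with robust standard errors
    sort age
    regress employed yob_1919_age yob_age yob2_age, robust

    * descending sort, then the same regression
    gsort - age
    regress employed yob_1919_age yob_age yob2_age, robust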

    The F statistic is missing, which could indicate that there is something wrong with the model. I limited the regression to those born from 1912 to 1919, and my guess is that the combination of a 1919 dummy and a binary outcome leaves insufficient variation in one of the cells. However, I am still very surprised that the result depends on how the data are sorted (rather than the standard errors simply being reported as dots). I'd be thankful for any hint as to what may be causing this.
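
    For reference, a tabulation along these lines would show whether one of those cells is in fact empty or nearly so:

    Code:
    * cross-tabulate the outcome against the 1919 dummy within the estimation sample
    tabulate employed yob_1919_age if inrange(yob_age, 1912, 1919)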

    I'm using Stata 16.1 on a MacBook, but I could replicate the problem on a Windows PC.

    Best regards,
    Christian

    [Screenshot: regression output with the robust option, under ascending and descending sort]

  • #2
    I have to correct myself: even when omitting "robust", there are still slight differences between the two regression outputs (please see below). In particular, the coefficients differ slightly.

    Best regards,
    Christian

    [Screenshot: regression outputs under both sort orders, without the robust option]



    • #3
      How are those variables scaled? Is the predicted probability of being employed for someone born in 1919 really approximately 5,854,600 percent? Those models appear to be seriously flawed, not least in their specification.

      My guess is that the variables are scaled in such an awkward way that the order in which the sums of squares are calculated matters and leads to precision problems. Stata's regress usually does a good job of rescaling the variables internally, so that precision problems do not arise; perhaps there are limits to that.


      Here is an awkward example, merely to illustrate the point of precision problems when calculating sums:

      Code:
      . clear all
      
      . set obs 26411
      Number of observations (_N) was 0, now 26,411.
      
      . generate double x = 999999999991+_n
      
      . 
      . gsort - x
      
      . generate double sum1 = sum(x)
      
      . 
      . gsort + x
      
      . generate double sum2 = sum(x)
      
      . 
      . su sum1 , meanonly
      
      . scalar max1 = r(max)
      
      . su sum2 , meanonly
      
      . scalar max2 = r(max)
      
      . 
      . display %21x max1
      +1.7752a8d7c0d6aX+036
      
      . display %21x max2
      +1.7752a8d7bfc6cX+036
      
      . 
      . assert max1 == max2
      assertion is false
      r(9);
      
      end of do-file
      
      r(9);
      Last edited by daniel klein; 25 Nov 2022, 09:40.



      • #4
        That's very odd. I can't reproduce that problem with a couple of data sets I tried.



        • #5
          Here is another silly example closer to what is presented in #1:

          Code:
          . version 17
          
          . set seed 42
          
          .
          . clear
          
          .
          . set obs 26411
          Number of observations (_N) was 0, now 26,411.
          
          . generate double yob = runiformint(1912, 1919)
          
          . generate yob_1919 = yob == 1919
          
          . generate double yob2 = yob^2
          
          . generate employed = runiform() > .5
          
          .
          . sort yob
          
          . regress employed yob_1919 yob yob2
          
                Source |       SS           df       MS      Number of obs   =    26,411
          -------------+----------------------------------   F(3, 26407)     =      0.59
                 Model |  .442991525         3  .147663842   Prob > F        =    0.6211
              Residual |  6602.05551    26,407  .250011569   R-squared       =    0.0001
          -------------+----------------------------------   Adj R-squared   =   -0.0000
                 Total |   6602.4985    26,410  .249999943   Root MSE        =    .50001
          
          ------------------------------------------------------------------------------
              employed | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
          -------------+----------------------------------------------------------------
              yob_1919 |   -.017137   .0160737    -1.07   0.286    -.0486423    .0143683
                   yob |  -4.459614   3.620588    -1.23   0.218    -11.55616    2.636933
                  yob2 |   .0011643   .0009453     1.23   0.218    -.0006886    .0030172
                 _cons |   4270.877   3466.726     1.23   0.218    -2524.092    11065.85
          ------------------------------------------------------------------------------
          
          .
          . gsort - yob
          
          . regress employed yob_1919 yob yob2
          
                Source |       SS           df       MS      Number of obs   =    26,411
          -------------+----------------------------------   F(3, 26407)     =      0.59
                 Model |  .442991525         3  .147663842   Prob > F        =    0.6211
              Residual |  6602.05551    26,407  .250011569   R-squared       =    0.0001
          -------------+----------------------------------   Adj R-squared   =   -0.0000
                 Total |   6602.4985    26,410  .249999943   Root MSE        =    .50001
          
          ------------------------------------------------------------------------------
              employed | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
          -------------+----------------------------------------------------------------
              yob_1919 |   -.017137   .0160737    -1.07   0.286    -.0486423    .0143683
                   yob |  -4.459606   3.620585    -1.23   0.218    -11.55615    2.636935
                  yob2 |   .0011643   .0009453     1.23   0.218    -.0006886    .0030172
                 _cons |    4270.87   3466.723     1.23   0.218    -2524.094    11065.83
          ------------------------------------------------------------------------------



          • #6
            Some related themes are picked up in https://www.stata-journal.com/articl...article=st0394. It took me some years from thinking up the title "Species of origin" to knowing what would go in the paper.

            John Tukey used to urge "estimate centercepts, not intercepts". The urging seems to have been in teaching and advising and not written down, but Howard Wainer and Roger Koenker have both echoed the advice. Anyone who knows a printed source for this by Tukey would place me in their debt.

            Sir David Cox and many others used to emphasise similarly that regression is better thought of as starting from (y - mean of y) = b (x - mean of x).
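
            As a minimal sketch of that idea (using the auto toy data rather than anything from this thread), centring the predictor leaves the slope unchanged but makes the constant interpretable, and centring the outcome as well reproduces the form above:

            Code:
            * a minimal sketch of centring before regression, with the auto data
            sysuse auto, clear

            summarize weight, meanonly
            generate double weight_c = weight - r(mean)   // centred predictor

            summarize mpg, meanonly
            generate double mpg_c = mpg - r(mean)         // centred outcome

            regress mpg weight       // intercept refers to weight == 0, far outside the data
            regress mpg weight_c     // same slope; _cons is now the mean of mpg
            regress mpg_c weight_c   // same slope; _cons is essentially zero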
            Last edited by Nick Cox; 25 Nov 2022, 11:12.



            • #7
              Thanks for checking, Daniel. The variables in my dataset are scaled in the way you scaled them for your example. I understand that the linear probability model I used has problems, but my main worry is that there might be a bug somewhere in the regress command. Now that you managed to reproduce the problem, would you still go with your guess that this is a scaling issue (such that rounding errors occur when calculating the sum of squares)?



              • #8
                Originally posted by Christian Bommer:
                Now that you managed to reproduce the problem, would you still go with your guess that this is a scaling issue (such that rounding errors occur when calculating the sum of squares)?
                Yes.

                Code:
                . version 17
                
                . set seed 42
                
                . 
                . clear
                
                . 
                . set obs 26411
                Number of observations (_N) was 0, now 26,411.
                
                . generate double yob = runiformint(1912, 1919)
                
                . generate yob_1919 = yob == 1919
                
                . generate double yob2 = yob^2
                
                . generate employed = runiform() > .5
                
                . 
                . generate double yob_rescaled = yob - 1912
                
                . generate double yob_rescaled2 = yob_rescaled^2
                
                . 
                . sort yob
                
                . regress employed yob_1919 yob_rescaled yob_rescaled2
                
                      Source |       SS           df       MS      Number of obs   =    26,411
                -------------+----------------------------------   F(3, 26407)     =      0.59
                       Model |  .442991525         3  .147663842   Prob > F        =    0.6211
                    Residual |  6602.05551    26,407  .250011569   R-squared       =    0.0001
                -------------+----------------------------------   Adj R-squared   =   -0.0000
                       Total |   6602.4985    26,410  .249999943   Root MSE        =    .50001
                
                -------------------------------------------------------------------------------
                     employed | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
                --------------+----------------------------------------------------------------
                     yob_1919 |  -.0171371   .0160737    -1.07   0.286    -.0486425    .0143682
                 yob_rescaled |  -.0073039   .0059211    -1.23   0.217    -.0189096    .0043018
                yob_rescaled2 |   .0011643   .0009453     1.23   0.218    -.0006886    .0030172
                        _cons |   .5042408   .0076283    66.10   0.000     .4892889    .5191928
                -------------------------------------------------------------------------------
                
                . 
                . gsort - yob
                
                . regress employed yob_1919 yob_rescaled yob_rescaled2
                
                      Source |       SS           df       MS      Number of obs   =    26,411
                -------------+----------------------------------   F(3, 26407)     =      0.59
                       Model |  .442991525         3  .147663842   Prob > F        =    0.6211
                    Residual |  6602.05551    26,407  .250011569   R-squared       =    0.0001
                -------------+----------------------------------   Adj R-squared   =   -0.0000
                       Total |   6602.4985    26,410  .249999943   Root MSE        =    .50001
                
                -------------------------------------------------------------------------------
                     employed | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
                --------------+----------------------------------------------------------------
                     yob_1919 |  -.0171371   .0160737    -1.07   0.286    -.0486425    .0143682
                 yob_rescaled |  -.0073039   .0059211    -1.23   0.217    -.0189096    .0043018
                yob_rescaled2 |   .0011643   .0009453     1.23   0.218    -.0006886    .0030172
                        _cons |   .5042408   .0076283    66.10   0.000     .4892889    .5191928
                -------------------------------------------------------------------------------
                Last edited by daniel klein; 25 Nov 2022, 11:29. Reason: generated variables in float now double; results remain stable



                • #9
                  Thanks, that's very helpful!
