
  • Boottest: comparing coefficients to baseline category in factor variable regression.

    Hello,
    I am running a regression that looks like this.

    Code:
    sysuse auto, clear
    gen brand = word(make, 1)              // take the first word of make as the brand
    encode brand, generate(brand_en)       // numeric factor variable (AMC is encoded as 1)
    reg price ib1.brand_en                 // brand 1 (AMC) is the base category
    These are my results:

    Code:
          Source |       SS           df       MS      Number of obs   =        74
    -------------+----------------------------------   F(22, 51)       =      9.52
           Model |   510687926        22  23213087.6   Prob > F        =    0.0000
        Residual |   124377470        51  2438773.92   R-squared       =    0.8042
    -------------+----------------------------------   Adj R-squared   =    0.7197
           Total |   635065396        73  8699525.97   Root MSE        =    1561.7

    ------------------------------------------------------------------------------
           price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
        brand_en |
           Audi  |   3776.833   1425.592     2.65   0.011     914.8386    6638.828
            BMW  |   5519.333   1803.247     3.06   0.004     1899.165    9139.502
          Buick  |   1859.619   1077.646     1.73   0.090    -303.8456    4023.084
           Cad.  |   9714.667   1275.088     7.62   0.000     7154.821    12274.51
          Chev.  |   156.6667   1104.259     0.14   0.888    -2060.225    2373.558
         Datsun  |   1790.833   1192.736     1.50   0.139    -603.6832     4185.35
          Dodge  |   839.8333   1192.736     0.70   0.485    -1554.683     3234.35
           Fiat  |   80.33333   1803.247     0.04   0.965    -3539.835    3700.502
           Ford  |   72.33333   1425.592     0.05   0.960    -2789.661    2934.328
          Honda  |   933.3333   1425.592     0.65   0.516    -1928.661    3795.328
          Linc.  |   8636.667   1275.088     6.77   0.000     6076.821    11196.51
          Mazda  |  -220.6667   1803.247    -0.12   0.903    -3840.835    3399.502
          Merc.  |   698.1667   1104.259     0.63   0.530    -1518.725    2915.058
           Olds  |    1835.19   1077.646     1.70   0.095    -328.2742    3998.655
        Peugeot  |   8774.333   1803.247     4.87   0.000     5154.165     12394.5
          Plym.  |   604.3333   1140.473     0.53   0.598    -1685.262    2893.929
          Pont.  |   663.1667   1104.259     0.60   0.551    -1553.725    2880.058
        Renault  |  -320.6667   1803.247    -0.18   0.860    -3940.835    3299.502
         Subaru  |  -417.6667   1803.247    -0.23   0.818    -4037.835    3202.502
         Toyota  |   906.3333   1275.088     0.71   0.480    -1653.513    3466.179
             VW  |   1805.333   1192.736     1.51   0.136    -589.1832     4199.85
          Volvo  |   7779.333   1803.247     4.31   0.000     4159.165     11399.5
                 |
           _cons |   4215.667   901.6233     4.68   0.000     2405.582    6025.751
    ------------------------------------------------------------------------------
    I would like to test whether my coefficients are significantly different from one another, using bootstrapped standard errors and accounting for the Bonferroni adjustment.

    I do not understand, however, why I obtain different results when I use 'boottest {4.brand_en=_cons}' rather than 'boottest {4.brand_en}'.

    Here is an example. If anyone could help me understand this, it would be ideal.


    Code:
    . boottest {4.brand_en=_cons} ///
    > {4.brand_en=10.brand_en}, ///
    > madjust(bonferroni) nograph reps(99) seed(123)
    
    Wild bootstrap-t, null imposed, 99 replications, Wald test, Rademacher weights:
      4.brand_en=_cons
    
                               t(51) =    -1.2417
                            Prob>|t| =     0.1717
      Bonferroni-adjusted prob =     0.3434
    
    95% confidence set for null hypothesis expression: [−30073, 22243]
    
    Wild bootstrap-t, null imposed, 99 replications, Wald test, Rademacher weights:
      4.brand_en=10.brand_en
    
                               t(51) =     1.4274
                            Prob>|t| =     0.2222
      Bonferroni-adjusted prob =     0.4444
    
    95% confidence set for null hypothesis expression: [−7635, 11210]
    
    .
    . boottest {4.brand_en} ///
    > {4.brand_en=10.brand_en}, /// export private vs export state *** no-diff
    > madjust(bonferroni) nograph reps(99) seed(123)
    
    Wild bootstrap-t, null imposed, 99 replications, Wald test, Rademacher weights:
      4.brand_en
    
                               t(51) =     1.7256
                            Prob>|t| =     0.1313
      Bonferroni-adjusted prob =     0.2626
    
    95% confidence set for null hypothesis expression: [−5296, 9254]
    
    Wild bootstrap-t, null imposed, 99 replications, Wald test, Rademacher weights:
      4.brand_en=10.brand_en
    
                               t(51) =     1.4274
                            Prob>|t| =     0.2222
      Bonferroni-adjusted prob =     0.4444
    
    95% confidence set for null hypothesis expression: [−7635, 11210]
    As you can see, the Prob>|t| value differs depending on whether I specify 'boottest {4.brand_en=_cons}' or 'boottest {4.brand_en}'.

  • #2
    For those who are interested, I believe Tom is referring to this package from SSC:

    Code:
    ssc describe boottest
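
    For anyone who wants to try the examples in this thread (my note, not part of the original post), installing from SSC is the usual one-liner:

    Code:
    ssc install boottest, replace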
    --
    Bruce Weaver
    Email: [email protected]
    Version: Stata/MP 18.5 (Windows)

    Comment


    • #3
      Originally posted by Tom Ford View Post

      I do not understand, however, why I obtain different results when I use 'boottest {4.brand_en=_cons}' rather than 'boottest {4.brand_en}'.
      In the first instance, you are testing

      Code:
      4.brand_en-_cons=0
      whereas in the second instance, you are testing


      Code:
      4.brand_en=0
      As these are distinct tests, you should not expect to get the same result given a nonzero estimate of the intercept. Notice that for the latter, you obtain the same t-statistic as on 4.brand_en (Buick) in the regression table.
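
      For intuition, here is a minimal sketch of the two nulls using Stata's built-in lincom (my addition, not part of the original reply); boottest bootstraps the corresponding Wald statistics.

      Code:
      * sketch: express the two nulls with lincom after the same regression
      sysuse auto, clear
      gen brand = word(make, 1)
      encode brand, generate(brand_en)
      reg price ib1.brand_en
      lincom 4.brand_en - _cons   // analogue of {4.brand_en=_cons}: tests 4.brand_en - _cons = 0
      lincom 4.brand_en           // analogue of {4.brand_en}: same t as the Buick row of the table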





      Last edited by Andrew Musau; 10 Sep 2023, 23:04.

      Comment


      • #4
        Dear Andrew,
        Thanks, this is what I thought and what I was initially doing.
        The problem, however, is that with my real data this interpretation appears to be incorrect.

        Take the following code using my data

        Code:
        reg y ib4.x,cluster(factory_code)
        
        Linear regression                               Number of obs     =      2,482
                                                        F(4, 49)          =       4.79
                                                        Prob > F          =     0.0024
                                                        R-squared         =     0.0065
                                                        Root MSE          =     .76774
        
                                              (Std. err. adjusted for 50 clusters in factory_code)
        --------------------------------------------------------------------------------------
                             |               Robust
                           y | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
        ---------------------+----------------------------------------------------------------
                           x |
                 First treat |   -.002304    .050211    -0.05   0.964    -.1032069    .0985988
                Second treat |   .0683877   .0459239     1.49   0.143    -.0238999    .1606753
                 Third treat |   .1641617   .0429564     3.82   0.000     .0778377    .2504858
                 Fifth treat |   .0946966   .0513413     1.84   0.071    -.0084776    .1978708
                             |
                       _cons |   .4542056   .0434892    10.44   0.000     .3668109    .5416003
        --------------------------------------------------------------------------------------
        The regression results suggest that the first treatment is not significantly different from the fourth treatment (_cons).

        However, if I run the boottest command specifying {1.x=_cons}, it finds a strongly significant difference between the two coefficients. How is this possible?


        Code:
        . boottest {1.x=_cons} ///
        > {1.x=2.x}, /// 
        > madjust(bonferroni) nograph reps(9999) seed(123)
        
        Wild bootstrap-t, null imposed, 9999 replications, Wald test, bootstrap clustering by factory_code, Rademacher weights:
          1.x=_cons
        
                                   t(49) =    -5.3436
                                Prob>|t| =     0.0000
          Bonferroni-adjusted prob =     0.0000
        
        95% confidence set for null hypothesis expression: [−.6513, −.2584]
        
        Wild bootstrap-t, null imposed, 9999 replications, Wald test, bootstrap clustering by factory_code, Rademacher weights:
          1.x=2.x
        
                                   t(49) =    -1.5642
                                Prob>|t| =     0.1258
          Bonferroni-adjusted prob =     0.2516
        
        95% confidence set for null hypothesis expression: [−.1754, .03339]
        This seems incorrect, or at least boottest is not doing what I think it is.

        Conversely, if I run boottest using simply {1.x}, the results are as expected.


        Code:
        . boottest {1.x} /// Export vs no export *** no-diff
        > {1.x=2.x}, /// export private vs export state *** no-diff
        > madjust(bonferroni) nograph reps(9999) seed(123)
        
        Wild bootstrap-t, null imposed, 9999 replications, Wald test, bootstrap clustering by factory_code, Rademacher weights:
          1.x
        
                                   t(49) =    -0.0459
                                Prob>|t| =     0.9635
          Bonferroni-adjusted prob =     1.0000
        
        95% confidence set for null hypothesis expression: [−.1192, .1166]
        
        Wild bootstrap-t, null imposed, 9999 replications, Wald test, bootstrap clustering by factory_code, Rademacher weights:
          1.x=2.x
        
                                   t(49) =    -1.5642
                                Prob>|t| =     0.1258
          Bonferroni-adjusted prob =     0.2516
        
        95% confidence set for null hypothesis expression: [−.1754, .03339]
        Note that in the first case, the t-statistic is the same as that on the regression coefficient, as I would expect.

        Can you help me interpret this? Is it possible that, with factor-variable regressions, using "{1.x}" rather than "{1.x=_cons}" is the correct way to check whether the coefficient of interest is equal to the baseline coefficient?


        Thanks a lot

        Comment


        • #5
          OK, I think that I see your confusion. You are correct that the coefficient on the intercept term represents the base category that is omitted because of collinearity. But the coefficients on the non-base categories are already expressed as differences from that base, so subtracting the coefficient on the intercept term from them does not compare the two category means. To make that comparison, omit the intercept and include all levels of the factor variable. You can see the relation highlighted below, where \(t= \sqrt{F}\).

          Code:
          sysuse auto, clear
          gen brand = word(make, 1)
          encode brand, generate(brand_en)
          reg price ib1.brand_en, baselevels    // base (AMC) absorbed into _cons; other coefficients are differences from it
          reg price ibn.brand_en, nocons        // no intercept: each coefficient is that brand's mean price
          test 4.brand_en=1.brand_en            // compare the Buick and AMC means directly
          display "t= " sqrt(r(F))
          Res.:

          Code:
          . reg price ib1.brand_en, baselevels
          
                Source |       SS           df       MS      Number of obs   =        74
          -------------+----------------------------------   F(22, 51)       =      9.52
                 Model |   510687926        22  23213087.6   Prob > F        =    0.0000
              Residual |   124377470        51  2438773.92   R-squared       =    0.8042
          -------------+----------------------------------   Adj R-squared   =    0.7197
                 Total |   635065396        73  8699525.97   Root MSE        =    1561.7
          
          ------------------------------------------------------------------------------
                 price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
          -------------+----------------------------------------------------------------
              brand_en |
                  AMC  |          0  (base)
                 Audi  |   3776.833   1425.592     2.65   0.011     914.8386    6638.828
                  BMW  |   5519.333   1803.247     3.06   0.004     1899.165    9139.502
                Buick  |   1859.619   1077.646     1.73   0.090    -303.8456    4023.084
                 Cad.  |   9714.667   1275.088     7.62   0.000     7154.821    12274.51
                Chev.  |   156.6667   1104.259     0.14   0.888    -2060.225    2373.558
               Datsun  |   1790.833   1192.736     1.50   0.139    -603.6832     4185.35
                Dodge  |   839.8333   1192.736     0.70   0.485    -1554.683     3234.35
                 Fiat  |   80.33333   1803.247     0.04   0.965    -3539.835    3700.502
                 Ford  |   72.33333   1425.592     0.05   0.960    -2789.661    2934.328
                Honda  |   933.3333   1425.592     0.65   0.516    -1928.661    3795.328
                Linc.  |   8636.667   1275.088     6.77   0.000     6076.821    11196.51
                Mazda  |  -220.6667   1803.247    -0.12   0.903    -3840.835    3399.502
                Merc.  |   698.1667   1104.259     0.63   0.530    -1518.725    2915.058
                 Olds  |    1835.19   1077.646     1.70   0.095    -328.2742    3998.655
              Peugeot  |   8774.333   1803.247     4.87   0.000     5154.165     12394.5
                Plym.  |   604.3333   1140.473     0.53   0.598    -1685.262    2893.929
                Pont.  |   663.1667   1104.259     0.60   0.551    -1553.725    2880.058
              Renault  |  -320.6667   1803.247    -0.18   0.860    -3940.835    3299.502
               Subaru  |  -417.6667   1803.247    -0.23   0.818    -4037.835    3202.502
               Toyota  |   906.3333   1275.088     0.71   0.480    -1653.513    3466.179
                   VW  |   1805.333   1192.736     1.51   0.136    -589.1832     4199.85
                Volvo  |   7779.333   1803.247     4.31   0.000     4159.165     11399.5
                       |
                 _cons |   4215.667   901.6233     4.68   0.000     2405.582    6025.751
          ------------------------------------------------------------------------------
          
          . 
          . reg price ibn.brand_en, nocons
          
                Source |       SS           df       MS      Number of obs   =        74
          -------------+----------------------------------   F(23, 51)       =     59.25
                 Model |  3.3235e+09        23   144498124   Prob > F        =    0.0000
              Residual |   124377470        51  2438773.92   R-squared       =    0.9639
          -------------+----------------------------------   Adj R-squared   =    0.9477
                 Total |  3.4478e+09        74  46592355.7   Root MSE        =    1561.7
          
          ------------------------------------------------------------------------------
                 price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
          -------------+----------------------------------------------------------------
              brand_en |
                  AMC  |   4215.667   901.6233     4.68   0.000     2405.582    6025.751
                 Audi  |     7992.5   1104.259     7.24   0.000     5775.608    10209.39
                  BMW  |       9735   1561.657     6.23   0.000     6599.842    12870.16
                Buick  |   6075.286    590.251    10.29   0.000     4890.307    7260.264
                 Cad.  |   13930.33   901.6233    15.45   0.000     12120.25    15740.42
                Chev.  |   4372.333    637.544     6.86   0.000      3092.41    5652.256
               Datsun  |     6006.5   780.8287     7.69   0.000     4438.921    7574.079
                Dodge  |     5055.5   780.8287     6.47   0.000     3487.921    6623.079
                 Fiat  |       4296   1561.657     2.75   0.008     1160.842    7431.158
                 Ford  |       4288   1104.259     3.88   0.000     2071.108    6504.892
                Honda  |       5149   1104.259     4.66   0.000     2932.108    7365.892
                Linc.  |   12852.33   901.6233    14.25   0.000     11042.25    14662.42
                Mazda  |       3995   1561.657     2.56   0.014     859.8419    7130.158
                Merc.  |   4913.833    637.544     7.71   0.000      3633.91    6193.756
                 Olds  |   6050.857    590.251    10.25   0.000     4865.879    7235.836
              Peugeot  |      12990   1561.657     8.32   0.000     9854.842    16125.16
                Plym.  |       4820   698.3944     6.90   0.000     3417.915    6222.085
                Pont.  |   4878.833    637.544     7.65   0.000      3598.91    6158.756
              Renault  |       3895   1561.657     2.49   0.016     759.8419    7030.158
               Subaru  |       3798   1561.657     2.43   0.019     662.8419    6933.158
               Toyota  |       5122   901.6233     5.68   0.000     3311.916    6932.084
                   VW  |       6021   780.8287     7.71   0.000     4453.421    7588.579
                Volvo  |      11995   1561.657     7.68   0.000     8859.842    15130.16
          ------------------------------------------------------------------------------
          
          . 
          . test 4.brand_en=1.brand_en
          
           ( 1)  - 1bn.brand_en + 4.brand_en = 0
          
                 F(  1,    51) =    2.98
                      Prob > F =    0.0905
          
          . 
          . display "t= " sqrt(r(F))
          t= 1.7256307
          
          .
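
          As a possible follow-up (a sketch of my own, not from the original reply), the same no-intercept parameterization can be handed to boottest so that the bootstrapped constraint compares the two brand means directly; the option values below are illustrative.

          Code:
          * sketch (assumption): bootstrap the mean comparison after the no-intercept fit
          reg price ibn.brand_en, nocons
          boottest {4.brand_en=1.brand_en}, nograph reps(999) seed(123)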

          Comment


          • #6
            Thanks Andrew, this is amazing and makes a lot of sense. You solved my problem!

            Just to clarify: the issue with using
            Code:
            reg price ib4.brand_en
            test _cons=1.brand_en
            is that, in this context, 1.brand_en is the difference between the coefficient on 1.brand_en and the constant, while with the nocons approach I am comparing the two coefficient values themselves (rather than a difference with a value). Thanks again

            Comment


            • #7
              Originally posted by Tom Ford View Post

              Just to clarify. the issue with using
              Code:
              reg price ib4.brand_en
              test _cons=1.brand_en
              is that, in this context, 1.brand_en is the difference between the coefficient on 1.brand_en and the constant, while with the nocons approach I am comparing the two coefficient values themselves (rather than a difference with a value)
              The test calculates the difference between the coefficient on the intercept and the coefficient on 1.brand_en. The coefficient on 1.brand_en is the difference between 1.brand_en and the base level, in this case 4.brand_en, as you have "ib4.brand_en" on the RHS. The rest is correct.
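
              To make the distinction concrete, here is a small sketch (my addition, not from the original reply) of what each quantity is under the ib4. parameterization, where the base is Buick:

              Code:
              * sketch: what each expression means when Buick (level 4) is the base
              sysuse auto, clear
              gen brand = word(make, 1)
              encode brand, generate(brand_en)
              reg price ib4.brand_en
              * _cons      = mean price for Buick (the base)
              * 1.brand_en = mean(AMC) - mean(Buick)
              lincom 1.brand_en - _cons   // = mean(AMC) - 2*mean(Buick), not the AMC-vs-Buick comparison
              lincom 1.brand_en           // = mean(AMC) - mean(Buick), the comparison to the base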
              Last edited by Andrew Musau; 11 Sep 2023, 12:27.

              Comment


              • #8
                Thanks, this is great!

                Comment
