
  • Boottest: comparing coefficients to baseline category in factor variable regression.

    Hello,
    I am running a regression that looks like this.

    Code:
    sysuse auto, clear
    gen brand = word(make, 1)              // take the first word of make as the brand
    encode brand, generate(brand_en)       // numeric factor variable (AMC is encoded as 1)
    reg price ib1.brand_en                 // brand 1 (AMC) is the base category
    These are my results:

    Code:
          Source |       SS           df       MS      Number of obs   =        74
    -------------+----------------------------------   F(22, 51)       =      9.52
           Model |   510687926        22  23213087.6   Prob > F        =    0.0000
        Residual |   124377470        51  2438773.92   R-squared       =    0.8042
    -------------+----------------------------------   Adj R-squared   =    0.7197
           Total |   635065396        73  8699525.97   Root MSE        =    1561.7

    ------------------------------------------------------------------------------
           price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
        brand_en |
           Audi  |   3776.833   1425.592     2.65   0.011     914.8386    6638.828
            BMW  |   5519.333   1803.247     3.06   0.004     1899.165    9139.502
          Buick  |   1859.619   1077.646     1.73   0.090    -303.8456    4023.084
           Cad.  |   9714.667   1275.088     7.62   0.000     7154.821    12274.51
          Chev.  |   156.6667   1104.259     0.14   0.888    -2060.225    2373.558
         Datsun  |   1790.833   1192.736     1.50   0.139    -603.6832     4185.35
          Dodge  |   839.8333   1192.736     0.70   0.485    -1554.683     3234.35
           Fiat  |   80.33333   1803.247     0.04   0.965    -3539.835    3700.502
           Ford  |   72.33333   1425.592     0.05   0.960    -2789.661    2934.328
          Honda  |   933.3333   1425.592     0.65   0.516    -1928.661    3795.328
          Linc.  |   8636.667   1275.088     6.77   0.000     6076.821    11196.51
          Mazda  |  -220.6667   1803.247    -0.12   0.903    -3840.835    3399.502
          Merc.  |   698.1667   1104.259     0.63   0.530    -1518.725    2915.058
           Olds  |    1835.19   1077.646     1.70   0.095    -328.2742    3998.655
        Peugeot  |   8774.333   1803.247     4.87   0.000     5154.165     12394.5
          Plym.  |   604.3333   1140.473     0.53   0.598    -1685.262    2893.929
          Pont.  |   663.1667   1104.259     0.60   0.551    -1553.725    2880.058
        Renault  |  -320.6667   1803.247    -0.18   0.860    -3940.835    3299.502
         Subaru  |  -417.6667   1803.247    -0.23   0.818    -4037.835    3202.502
         Toyota  |   906.3333   1275.088     0.71   0.480    -1653.513    3466.179
             VW  |   1805.333   1192.736     1.51   0.136    -589.1832     4199.85
          Volvo  |   7779.333   1803.247     4.31   0.000     4159.165     11399.5
                 |
           _cons |   4215.667   901.6233     4.68   0.000     2405.582    6025.751
    ------------------------------------------------------------------------------
    I would like to test whether my coefficients are significantly different from one another, using bootstrapped standard errors and accounting for the Bonferroni adjustment.

    I do not understand, however, why I obtain different results when I use 'boottest {4.brand_en=_cons}' rather than 'boottest {4.brand_en}'.

    Here is an example. If anyone could help me understand this, it would be ideal.


    Code:
    . boottest {4.brand_en=_cons} ///
    > {4.brand_en=10.brand_en}, ///
    > madjust(bonferroni) nograph reps(99) seed(123)
    
    Wild bootstrap-t, null imposed, 99 replications, Wald test, Rademacher weights:
      4.brand_en=_cons
    
                               t(51) =    -1.2417
                            Prob>|t| =     0.1717
      Bonferroni-adjusted prob =     0.3434
    
    95% confidence set for null hypothesis expression: [−30073, 22243]
    
    Wild bootstrap-t, null imposed, 99 replications, Wald test, Rademacher weights:
      4.brand_en=10.brand_en
    
                               t(51) =     1.4274
                            Prob>|t| =     0.2222
      Bonferroni-adjusted prob =     0.4444
    
    95% confidence set for null hypothesis expression: [−7635, 11210]
    
    .
    . boottest {4.brand_en} ///
    > {4.brand_en=10.brand_en}, /// export private vs export state *** no-diff
    > madjust(bonferroni) nograph reps(99) seed(123)
    
    Wild bootstrap-t, null imposed, 99 replications, Wald test, Rademacher weights:
      4.brand_en
    
                               t(51) =     1.7256
                            Prob>|t| =     0.1313
      Bonferroni-adjusted prob =     0.2626
    
    95% confidence set for null hypothesis expression: [−5296, 9254]
    
    Wild bootstrap-t, null imposed, 99 replications, Wald test, Rademacher weights:
      4.brand_en=10.brand_en
    
                               t(51) =     1.4274
                            Prob>|t| =     0.2222
      Bonferroni-adjusted prob =     0.4444
    
    95% confidence set for null hypothesis expression: [−7635, 11210]
    As you can see, the Prob>|t| value differs depending on whether I specify 'boottest {4.brand_en=_cons}' or 'boottest {4.brand_en}'.

  • #2
    For those who are interested, I believe Tom is referring to this package from SSC:

    Code:
    ssc describe boottest
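
    For anyone who wants to try the examples in this thread (my note, not part of the original post), installing from SSC is the usual one-liner:

    Code:
    ssc install boottest, replace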
    --
    Bruce Weaver
    Email: [email protected]
    Version: Stata/MP 18.5 (Windows)

    Comment


    • #3
      Originally posted by Tom Ford View Post

      I do not understand, however, why I obtain different results when I use 'boottest {4.brand_en=_cons}' rather than 'boottest {4.brand_en}'.
      In the first instance, you are testing

      Code:
      4.brand_en-_cons=0
      whereas in the second instance, you are testing


      Code:
      4.brand_en=0
      As these are distinct tests, you should not expect to get the same result given a nonzero estimate of the intercept. Notice that for the latter, you obtain the same t-statistic as on 4.brand_en (Buick) in the regression table.
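
      For intuition, here is a minimal sketch of the two nulls using Stata's built-in lincom (my addition, not part of the original reply); boottest bootstraps the corresponding Wald statistics.

      Code:
      * sketch: express the two nulls with lincom after the same regression
      sysuse auto, clear
      gen brand = word(make, 1)
      encode brand, generate(brand_en)
      reg price ib1.brand_en
      lincom 4.brand_en - _cons   // analogue of {4.brand_en=_cons}: tests 4.brand_en - _cons = 0
      lincom 4.brand_en           // analogue of {4.brand_en}: same t as the Buick row of the table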





      Last edited by Andrew Musau; 10 Sep 2023, 23:04.

      Comment


      • #4
        Dear Andrew,
        Thanks, this is what I thought and what I was initially doing.
        The problem, however, is that with my real data this interpretation appears to be incorrect.

        Take the following code using my data

        Code:
        reg y ib4.x,cluster(factory_code)
        
        Linear regression                               Number of obs     =      2,482
                                                        F(4, 49)          =       4.79
                                                        Prob > F          =     0.0024
                                                        R-squared         =     0.0065
                                                        Root MSE          =     .76774
        
                                              (Std. err. adjusted for 50 clusters in factory_code)
        --------------------------------------------------------------------------------------
                             |               Robust
                           y | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
        ---------------------+----------------------------------------------------------------
                           x |
                 First treat |   -.002304    .050211    -0.05   0.964    -.1032069    .0985988
                Second treat |   .0683877   .0459239     1.49   0.143    -.0238999    .1606753
                 Third treat |   .1641617   .0429564     3.82   0.000     .0778377    .2504858
                 Fifth treat |   .0946966   .0513413     1.84   0.071    -.0084776    .1978708
                             |
                       _cons |   .4542056   .0434892    10.44   0.000     .3668109    .5416003
        --------------------------------------------------------------------------------------
        The regression results suggest that the first treatment is not significantly different from the fourth treatment (_cons).

        However, if I run the boottest command specifying {1.x=_cons}, it finds a strongly significant difference between the two coefficients. How is this possible?


        Code:
        . boottest {1.x=_cons} ///
        > {1.x=2.x}, /// 
        > madjust(bonferroni) nograph reps(9999) seed(123)
        
        Wild bootstrap-t, null imposed, 9999 replications, Wald test, bootstrap clustering by factory_code, Rademacher weights:
          1.x=_cons
        
                                   t(49) =    -5.3436
                                Prob>|t| =     0.0000
          Bonferroni-adjusted prob =     0.0000
        
        95% confidence set for null hypothesis expression: [−.6513, −.2584]
        
        Wild bootstrap-t, null imposed, 9999 replications, Wald test, bootstrap clustering by factory_code, Rademacher weights:
          1.x=2.x
        
                                   t(49) =    -1.5642
                                Prob>|t| =     0.1258
          Bonferroni-adjusted prob =     0.2516
        
        95% confidence set for null hypothesis expression: [−.1754, .03339]
        This seems incorrect, or at least boottest is not doing what I think it is.

        Conversely, if I run boottest using simply {1.x}, the results are as expected.


        Code:
        . boottest {1.x} /// Export vs no export *** no-diff
        > {1.x=2.x}, /// export private vs export state *** no-diff
        > madjust(bonferroni) nograph reps(9999) seed(123)
        
        Wild bootstrap-t, null imposed, 9999 replications, Wald test, bootstrap clustering by factory_code, Rademacher weights:
          1.x
        
                                   t(49) =    -0.0459
                                Prob>|t| =     0.9635
          Bonferroni-adjusted prob =     1.0000
        
        95% confidence set for null hypothesis expression: [−.1192, .1166]
        
        Wild bootstrap-t, null imposed, 9999 replications, Wald test, bootstrap clustering by factory_code, Rademacher weights:
          1.x=2.x
        
                                   t(49) =    -1.5642
                                Prob>|t| =     0.1258
          Bonferroni-adjusted prob =     0.2516
        
        95% confidence set for null hypothesis expression: [−.1754, .03339]
        Note that in the first case, the t-statistic is the same as that on the regression coefficient, as I would expect.

        Can you help me interpret this? Is it possible that, with factor-variable regressions, using "{1.x}" rather than "{1.x=_cons}" is the correct way to check whether the coefficient of interest is equal to the baseline coefficient?


        Thanks a lot

        Comment


        • #5
          OK, I think that I see your confusion. You are correct that the coefficient on the intercept term represents the base category that is omitted because of collinearity. But the coefficients on the non-base categories are already expressed as differences from that base, so subtracting the coefficient on the intercept term from them does not compare the two category means. To make that comparison, omit the intercept and include all levels of the factor variable. You can see the relation highlighted below, where \(t= \sqrt{F}\).

          Code:
          sysuse auto, clear
          gen brand = word(make, 1)
          encode brand, generate(brand_en)
          reg price ib1.brand_en, baselevels    // base (AMC) absorbed into _cons; other coefficients are differences from it
          reg price ibn.brand_en, nocons        // no intercept: each coefficient is that brand's mean price
          test 4.brand_en=1.brand_en            // compare the Buick and AMC means directly
          display "t= " sqrt(r(F))
          Res.:

          Code:
          . reg price ib1.brand_en, baselevels
          
                Source |       SS           df       MS      Number of obs   =        74
          -------------+----------------------------------   F(22, 51)       =      9.52
                 Model |   510687926        22  23213087.6   Prob > F        =    0.0000
              Residual |   124377470        51  2438773.92   R-squared       =    0.8042
          -------------+----------------------------------   Adj R-squared   =    0.7197
                 Total |   635065396        73  8699525.97   Root MSE        =    1561.7
          
          ------------------------------------------------------------------------------
                 price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
          -------------+----------------------------------------------------------------
              brand_en |
                  AMC  |          0  (base)
                 Audi  |   3776.833   1425.592     2.65   0.011     914.8386    6638.828
                  BMW  |   5519.333   1803.247     3.06   0.004     1899.165    9139.502
                Buick  |   1859.619   1077.646     1.73   0.090    -303.8456    4023.084
                 Cad.  |   9714.667   1275.088     7.62   0.000     7154.821    12274.51
                Chev.  |   156.6667   1104.259     0.14   0.888    -2060.225    2373.558
               Datsun  |   1790.833   1192.736     1.50   0.139    -603.6832     4185.35
                Dodge  |   839.8333   1192.736     0.70   0.485    -1554.683     3234.35
                 Fiat  |   80.33333   1803.247     0.04   0.965    -3539.835    3700.502
                 Ford  |   72.33333   1425.592     0.05   0.960    -2789.661    2934.328
                Honda  |   933.3333   1425.592     0.65   0.516    -1928.661    3795.328
                Linc.  |   8636.667   1275.088     6.77   0.000     6076.821    11196.51
                Mazda  |  -220.6667   1803.247    -0.12   0.903    -3840.835    3399.502
                Merc.  |   698.1667   1104.259     0.63   0.530    -1518.725    2915.058
                 Olds  |    1835.19   1077.646     1.70   0.095    -328.2742    3998.655
              Peugeot  |   8774.333   1803.247     4.87   0.000     5154.165     12394.5
                Plym.  |   604.3333   1140.473     0.53   0.598    -1685.262    2893.929
                Pont.  |   663.1667   1104.259     0.60   0.551    -1553.725    2880.058
              Renault  |  -320.6667   1803.247    -0.18   0.860    -3940.835    3299.502
               Subaru  |  -417.6667   1803.247    -0.23   0.818    -4037.835    3202.502
               Toyota  |   906.3333   1275.088     0.71   0.480    -1653.513    3466.179
                   VW  |   1805.333   1192.736     1.51   0.136    -589.1832     4199.85
                Volvo  |   7779.333   1803.247     4.31   0.000     4159.165     11399.5
                       |
                 _cons |   4215.667   901.6233     4.68   0.000     2405.582    6025.751
          ------------------------------------------------------------------------------
          
          . 
          . reg price ibn.brand_en, nocons
          
                Source |       SS           df       MS      Number of obs   =        74
          -------------+----------------------------------   F(23, 51)       =     59.25
                 Model |  3.3235e+09        23   144498124   Prob > F        =    0.0000
              Residual |   124377470        51  2438773.92   R-squared       =    0.9639
          -------------+----------------------------------   Adj R-squared   =    0.9477
                 Total |  3.4478e+09        74  46592355.7   Root MSE        =    1561.7
          
          ------------------------------------------------------------------------------
                 price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
          -------------+----------------------------------------------------------------
              brand_en |
                  AMC  |   4215.667   901.6233     4.68   0.000     2405.582    6025.751
                 Audi  |     7992.5   1104.259     7.24   0.000     5775.608    10209.39
                  BMW  |       9735   1561.657     6.23   0.000     6599.842    12870.16
                Buick  |   6075.286    590.251    10.29   0.000     4890.307    7260.264
                 Cad.  |   13930.33   901.6233    15.45   0.000     12120.25    15740.42
                Chev.  |   4372.333    637.544     6.86   0.000      3092.41    5652.256
               Datsun  |     6006.5   780.8287     7.69   0.000     4438.921    7574.079
                Dodge  |     5055.5   780.8287     6.47   0.000     3487.921    6623.079
                 Fiat  |       4296   1561.657     2.75   0.008     1160.842    7431.158
                 Ford  |       4288   1104.259     3.88   0.000     2071.108    6504.892
                Honda  |       5149   1104.259     4.66   0.000     2932.108    7365.892
                Linc.  |   12852.33   901.6233    14.25   0.000     11042.25    14662.42
                Mazda  |       3995   1561.657     2.56   0.014     859.8419    7130.158
                Merc.  |   4913.833    637.544     7.71   0.000      3633.91    6193.756
                 Olds  |   6050.857    590.251    10.25   0.000     4865.879    7235.836
              Peugeot  |      12990   1561.657     8.32   0.000     9854.842    16125.16
                Plym.  |       4820   698.3944     6.90   0.000     3417.915    6222.085
                Pont.  |   4878.833    637.544     7.65   0.000      3598.91    6158.756
              Renault  |       3895   1561.657     2.49   0.016     759.8419    7030.158
               Subaru  |       3798   1561.657     2.43   0.019     662.8419    6933.158
               Toyota  |       5122   901.6233     5.68   0.000     3311.916    6932.084
                   VW  |       6021   780.8287     7.71   0.000     4453.421    7588.579
                Volvo  |      11995   1561.657     7.68   0.000     8859.842    15130.16
          ------------------------------------------------------------------------------
          
          . 
          . test 4.brand_en=1.brand_en
          
           ( 1)  - 1bn.brand_en + 4.brand_en = 0
          
                 F(  1,    51) =    2.98
                      Prob > F =    0.0905
          
          . 
          . display "t= " sqrt(r(F))
          t= 1.7256307
          
          .
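
          As a possible follow-up (a sketch of my own, not from the original reply), the same no-intercept parameterization can be handed to boottest so that the bootstrapped constraint compares the two brand means directly; the option values below are illustrative.

          Code:
          * sketch (assumption): bootstrap the mean comparison after the no-intercept fit
          reg price ibn.brand_en, nocons
          boottest {4.brand_en=1.brand_en}, nograph reps(999) seed(123)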

          Comment


          • #6
            Thanks Andrew, this is amazing and makes a lot of sense. You solved my problem!

            Just to clarify: the issue with using
            Code:
            reg price ib4.brand_en
            test _cons=1.brand_en
            is that, in this context, 1.brand_en is the difference between the coefficient on 1.brand_en and the constant, while with the nocons approach I am comparing the two coefficient values themselves (rather than a difference with a value). Thanks again

            Comment


            • #7
              Originally posted by Tom Ford View Post

              Just to clarify. the issue with using
              Code:
              reg price ib4.brand_en
              test _cons=1.brand_en
              is that, in this context, 1.brand_en is the difference between the coefficient on 1.brand_en and the constant, while with the nocons approach I am comparing the two coefficient values themselves (rather than a difference with a value)
              The test calculates the difference between the coefficient on the intercept and the coefficient on 1.brand_en. The coefficient on 1.brand_en is the difference between 1.brand_en and the base level, in this case 4.brand_en, as you have "ib4.brand_en" on the RHS. The rest is correct.
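
              To make the distinction concrete, here is a small sketch (my addition, not from the original reply) of what each quantity is under the ib4. parameterization, where the base is Buick:

              Code:
              * sketch: what each expression means when Buick (level 4) is the base
              sysuse auto, clear
              gen brand = word(make, 1)
              encode brand, generate(brand_en)
              reg price ib4.brand_en
              * _cons      = mean price for Buick (the base)
              * 1.brand_en = mean(AMC) - mean(Buick)
              lincom 1.brand_en - _cons   // = mean(AMC) - 2*mean(Buick), not the AMC-vs-Buick comparison
              lincom 1.brand_en           // = mean(AMC) - mean(Buick), the comparison to the base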
              Last edited by Andrew Musau; 11 Sep 2023, 12:27.

              Comment


              • #8
                Thanks, this is great!

                Comment
