  • Testing differences between multiple categorical variables' effects on a dependent variable

    I am dealing with unbalanced panel data. Regressions are done using reghdfe from SSC.

    I have quite a few categorical variables (7 in total, and I want to know the differences among 6 of them)

    So let's say I have:
    A
    B
    C
    D
    E
    F
    G

    I want to know whether the effect of A on the dep. var is different from that of B, C, D, E, and F. I want to do the same for B vs. A, C, D, E, and F, and so forth.
    This would give me 30 pairs if my math is correct.

    How could I go about this without running 30 different regressions? Also, what do I do with group G: do I just leave it out? Observations in that category are still part of my sample, but I have already obtained results showing that A through F differ from G by grouping them together in one dummy. Now I want to know the effect of each group separately.

    My data and models are quite elaborate, but I can attach them if that would help.

  • #2
    I'm not sure I understand what you want, but is it this?

    Code:
    clear*
    
    sysuse auto
    
    local 7_vars mpg headroom trunk weight length turn displacement
    
    regress price `7_vars'
    
    forvalues i = 1/7 {
        forvalues j = `=`i'+1'/7 {
            local vari: word `i' of `7_vars'
            local varj: word `j' of `7_vars'
            test `vari' = `varj'
        }
    }



    • #3
      Originally posted by Clyde Schechter (post #2)
      Honestly, I'm not sure; it leads to "unexpected end of file" when I run it.

      Code:
          gen BB = 0 
          replace BB = 1 if rating == "BB"
          gen BB_MIN = 0 
          replace BB_MIN = 1 if rating == "BB-"
          gen BB_PLUS = 0 
          replace BB_PLUS = 1 if rating == "BB+"
          gen BBB_PLUS = 0 
          replace BBB_PLUS = 1 if rating == "BBB+"
          gen BBB = 0 
          replace BBB = 1 if rating == "BBB"
          gen BBB_MIN = 0 
          replace BBB_MIN = 1 if rating == "BBB-"
          replace rating = "CCC" if inlist(rating, "CC", "C")
          gen CCC = 0 
          replace CCC = 1 if rating == "CCC"
          g rest = 0 
          replace rest = 1 if rating == "CCC" | rating == "B" | rating == "A" | rating == "AAA" | rating == "AA"
      
          label variable BBB_PLUS "BBB+"    
          label variable BBB_MIN "BBB-"    
          label variable BB_PLUS "BB+"    
          label variable BB_MIN "BB-"    
          label variable BBB_PLUS_POST "BBB+_POST"    
          label variable BBB_MIN_POST "BBB-_POST"    
          label variable BB_PLUS_POST "BB+_POST"    
          label variable BB_MIN_POST "BB-_POST"        
      
      
      
      local 7_vars BBB_PLUS BBB BBB_MIN BB_PLUS BB BB_MIN rest 
      
      regress ABCFO BORDERLINE POST BORDERLINE_POST DSIZE DLEV DBTM DROA DMKTCAP ABACC  `BBB_PLUS BBB BBB_MIN BB_PLUS BB BB_MIN rest'
      
      forvalues i = 1/7 {
          forvalues j = `=`i'+1'/7 {
              local vari: word `i' of `7_vars'
              local varj: word `j' of `7_vars'
              test `vari' = `varj'
          }
      }



      • #4
        I can't reproduce your difficulty here. It runs on my setup:

        Code:
        . clear*
        
        .
        . sysuse auto
        (1978 automobile data)
        
        .
        . local 7_vars mpg headroom trunk weight length turn displacement
        
        .
        . regress price `7_vars'
        
              Source |       SS           df       MS      Number of obs   =        74
        -------------+----------------------------------   F(7, 66)        =      7.41
               Model |   279445483         7  39920783.3   Prob > F        =    0.0000
            Residual |   355619913        66   5388180.5   R-squared       =    0.4400
        -------------+----------------------------------   Adj R-squared   =    0.3806
               Total |   635065396        73  8699525.97   Root MSE        =    2321.2
        
        ------------------------------------------------------------------------------
               price | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
        -------------+----------------------------------------------------------------
                 mpg |  -96.52796   80.92228    -1.19   0.237    -258.0945    65.03858
            headroom |  -758.6516   432.5522    -1.75   0.084     -1622.27    104.9667
               trunk |   98.39413   106.2801     0.93   0.358    -113.8009    310.5892
              weight |   4.647519   1.399998     3.32   0.001     1.852332    7.442706
              length |  -70.92467   43.49361    -1.63   0.108    -157.7625    15.91313
                turn |  -332.5483   127.3763    -2.61   0.011    -586.8633   -78.23326
        displacement |   3.669968   6.734176     0.54   0.588    -9.775247    17.11518
               _cons |   20895.35    6177.24     3.38   0.001     8562.095    33228.61
        ------------------------------------------------------------------------------
        
        .
        . forvalues i = 1/7 {
          2. forvalues j = `=`i'+1'/7 {
          3. local vari: word `i' of `7_vars'
          4. local varj: word `j' of `7_vars'
          5. test `vari' = `varj'
          6. }
          7. }
        
         ( 1)  mpg - headroom = 0
        
               F(  1,    66) =    2.27
                    Prob > F =    0.1369
        
         ( 1)  mpg - trunk = 0
        
               F(  1,    66) =    2.19
                    Prob > F =    0.1439
        
         ( 1)  mpg - weight = 0
        
               F(  1,    66) =    1.58
                    Prob > F =    0.2137
        
         ( 1)  mpg - length = 0
        
               F(  1,    66) =    0.09
                    Prob > F =    0.7699
        
         ( 1)  mpg - turn = 0
        
               F(  1,    66) =    2.55
                    Prob > F =    0.1152
        
         ( 1)  mpg - displacement = 0
        
               F(  1,    66) =    1.51
                    Prob > F =    0.2237
        
         ( 1)  headroom - trunk = 0
        
               F(  1,    66) =    3.03
                    Prob > F =    0.0866
        
         ( 1)  headroom - weight = 0
        
               F(  1,    66) =    3.11
                    Prob > F =    0.0822
        
         ( 1)  headroom - length = 0
        
               F(  1,    66) =    2.49
                    Prob > F =    0.1196
        
         ( 1)  headroom - turn = 0
        
               F(  1,    66) =    0.91
                    Prob > F =    0.3445
        
         ( 1)  headroom - displacement = 0
        
               F(  1,    66) =    3.09
                    Prob > F =    0.0832
        
         ( 1)  trunk - weight = 0
        
               F(  1,    66) =    0.78
                    Prob > F =    0.3807
        
         ( 1)  trunk - length = 0
        
               F(  1,    66) =    1.78
                    Prob > F =    0.1863
        
         ( 1)  trunk - turn = 0
        
               F(  1,    66) =    7.07
                    Prob > F =    0.0098
        
         ( 1)  trunk - displacement = 0
        
               F(  1,    66) =    0.79
                    Prob > F =    0.3769
        
         ( 1)  weight - length = 0
        
               F(  1,    66) =    2.90
                    Prob > F =    0.0933
        
         ( 1)  weight - turn = 0
        
               F(  1,    66) =    6.99
                    Prob > F =    0.0103
        
         ( 1)  weight - displacement = 0
        
               F(  1,    66) =    0.02
                    Prob > F =    0.8982
        
         ( 1)  length - turn = 0
        
               F(  1,    66) =    3.17
                    Prob > F =    0.0798
        
         ( 1)  length - displacement = 0
        
               F(  1,    66) =    2.97
                    Prob > F =    0.0895
        
         ( 1)  turn - displacement = 0
        
               F(  1,    66) =    6.89
                    Prob > F =    0.0108



        • #5
          Originally posted by Clyde Schechter (post #4)
          I found my mistake, thanks.

          Another question: is it possible to do a one-sided t-test this way?

          I tried replacing test `vari' = `varj' with test `vari' > `varj', but alas, that is not the way to go.



          • #6
            There is no option on the -test- command for a one-sided test. Note, however, that -test- does not return a judgment about "statistical significance." Rather, it just calculates a test statistic and gives a two-tailed p-value. Now, F-statistics are generally not "tailed" because they are usually about multiple dimensions. But in this case, these are all 1-df hypotheses, so the F-statistics are just the squares of the t-statistics for the same hypotheses. That means you can interpret these as one-tailed tests on your own: your results will be 1-tailed "statistically significant" at the .05 level if they are 2-tailed "statistically significant" at the .10 level. In other words, you can just use a doubled cutoff on the p-value.
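
            A minimal sketch of how such a one-sided p-value could be computed by hand after -test-, using the returned r(p) (trunk and turn come from the auto example above; the direction check via cond() is an added detail, not from the posts):

            Code:
            * sketch: one-sided test of H1: trunk > turn
            sysuse auto, clear
            quietly regress price mpg headroom trunk weight length turn displacement
            test trunk = turn                      // two-sided F test; leaves r(p) behind
            local diff = _b[trunk] - _b[turn]      // sign of the estimated difference
            * halve the two-sided p only when the estimate is in the hypothesized direction
            local p1 = cond(`diff' > 0, r(p)/2, 1 - r(p)/2)
            display "one-sided p-value (trunk > turn) = " `p1'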



            • #7
              Originally posted by Clyde Schechter (post #6)
              Thank you. I don't think I quite understand.

              If I look at the example you provided, your first statistically significant case is:
              trunk - turn = 0, F(1, 66) = 7.07, Prob > F = 0.0098
              Would this be the equivalent of testing trunk > turn, which is significant at 1% because 0.0098/2 = 0.0049?



              • #8
                Yes, that's right.

                That said, you are aware that you are doing multiple hypothesis testing here on an industrial scale, so these nominal p-values (either 1-tailed or 2-tailed) can be seriously challenged. This kind of thing is a limitation of p-values and significance testing. You might want to read up on the American Statistical Association's position that statistical significance as a concept should be abandoned. See https://www.tandfonline.com/doi/full...5.2019.1583913 for the "executive summary" and https://www.tandfonline.com/toc/utas20/73/sup1 for all 43 supporting articles. Or https://www.nature.com/articles/d41586-019-00857-9 for the tl;dr.
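
                If one nonetheless wanted to keep the pairwise tests while acknowledging the multiplicity, here is a minimal sketch of a Bonferroni-style flagging of the loop from #2 (the 0.05 family-wise level and the use of r(p) are illustrative choices, not from the posts):

                Code:
                * sketch: flag pairwise tests that survive a Bonferroni-adjusted threshold
                sysuse auto, clear
                local 7_vars mpg headroom trunk weight length turn displacement
                quietly regress price `7_vars'
                local k : word count `7_vars'
                local ntests = `k' * (`k' - 1) / 2          // 7 choose 2 = 21 comparisons
                forvalues i = 1/`k' {
                    forvalues j = `=`i'+1'/`k' {
                        local vari : word `i' of `7_vars'
                        local varj : word `j' of `7_vars'
                        quietly test `vari' = `varj'
                        display "`vari' vs `varj': p = " %6.4f r(p) ///
                            cond(r(p) < 0.05/`ntests', "   <- flagged at Bonferroni 0.05", "")
                    }
                }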



                • #9
                  Originally posted by Clyde Schechter (post #8)
                  Thank you so much Clyde. Very helpful as always!



                  • #10
                    Clyde Schechter

                    Is there a way to do the same thing but with i.industry terms included?

                    Specifically, I want to know the differences between the industries (40 different sic codes).

                    I have tried modifying your code by doing:
                    Code:
                        
                    local sic_vars i.sic i.sic#POST i.sic#BORDERLINE_POST
                    
                    reghdfe ABPROD BORDERLINE BORDERLINE_POST i.sic i.sic#BORDERLINE i.sic#BORDERLINE_POST DSIZE DLEV DBTM DROA DMKTCAP ABACC, noabsorb cluster(gvkey)  `sic_vars i.sic i.sic#POST i.sic#BORDERLINE_POST'
                    outreg2 using hddhd.xls, replace ctitle(ABPROD) bdec(3) adjr2 label addtext(Firm FE, Yes)
                    
                    forvalues i = 1/14 {
                        forvalues j = `=`i'+1'/7 {
                            local vari: word `i' of `sic_vars'
                            local varj: word `j' of `sic_vars'
                            test `vari' = `varj'
                        }
                    }
                    However, I get:
                    invalid options: i.sic i.sic # POST i.sic # BORDERLINE_POST

                    I'm guessing that including i. operators this way is not possible? However, I am not looking to generate 120 dummy variables. Is there another approach to this?



                    • #11
                      This is different because of the factor variable notation. i.sic is not a variable: it is an abbreviation for 40 (hidden) variables, one for each sic code. So the code would have to be modified to take that into account. And I could write that code. But, first:

                      1. I cannot think of any context in which it would make sense to test the equality of i.sic#POST with any or all of the i.sic coefficients. (Nor i.sic#BORDERLINE_POST with any of the i.sic coefficients.) Can you explain what you are trying to get at with these tests?

                      2. As for the i.sic variables themselves, since you have 40 industries, you will end up with 741 tests. What on earth are you going to do with that?
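
                      For reference, a minimal sketch of how factor-level coefficients are referenced in -test- (shown on the auto data; rep78 stands in for sic, and levels 3 and 4 are just illustrative):

                      Code:
                      * sketch: coefficients on factor levels are referenced as level.varname
                      sysuse auto, clear
                      regress price i.rep78 weight
                      regress, coeflegend           // shows the coefficient names, e.g. 3.rep78
                      test 3.rep78 = 4.rep78        // equality of two factor-level coefficients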



                      • #12
                        Originally posted by Clyde Schechter (post #11)
                        I am researching the effect that the passage of a certain Act has on the earnings management behavior of firms that are close to the boundary between investment grade and speculative grade (which is the BORDERLINE variable), the rationale being that these firms have higher incentives to inflate their earnings to obtain (or retain) investment-grade status.

                        Now I want to see whether managers in certain industries have higher incentives to inflate earnings after the passage of the Act (i.sic#BORDERLINE_POST is the variable of interest; the others are included just to capture the main effects).

                        I have run:

                        Code:
                        reghdfe ABPROD BORDERLINE POST BORDERLINE_POST i.sic i.sic#POST i.sic#BORDERLINE_POST SIZE LEV BTM ROA MKTCAP ABACC, noabsorb cluster(gvkey)
                        But the coefficients on i.sic#BORDERLINE_POST would all be relative to industry #40, so I cannot conclude that these industries manage their earnings disproportionately more.



                        • #13
                          Oh, so you don't actually want to compare all of those things with each other. You want to compare the marginal effects of BORDERLINE_POST in each of the different industries, which you can't get directly from the regression output, because the relevant quantity is the coefficient on BORDERLINE_POST plus the corresponding i.sic#BORDERLINE_POST coefficient. But you can get all of those marginal effects from the -margins- command. After you run your regression:
                          Code:
                          margins sic, dydx(BORDERLINE_POST) saving(marginal_effects, replace)
                          to get the average marginal effect of BORDERLINE_POST in each industry.

                          Now, in principle you can also get all pairwise comparisons among these by following that with:
                          Code:
                          margins sic, dydx(BORDERLINE_POST) pwcompare saving(marginal_effects_contrasts, replace)
                          But I don't recommend that because you will get 780 pairwise comparisons (when I said 741 above that was a mistake) and I cannot see any way you can make use of that mass of data. Also, it will be a slow calculation, and if your data set is large I wonder if it would finish in your lifetime. (Yes, I'm being hyperbolic.)

                          I would just look at the marginal effects themselves and pick out a certain number of industries where it is highest, and a certain number where it is lowest, and then identify which industries those are. Or perhaps just create a rank-ordered list of the industries by marginal effect. Doing all 780 comparisons really doesn't seem practical, nor does it actually contribute directly to your stated research goal.

                          Because I have specified the -saving- option in the -margins- command, you will have the -margins- results in a data set that you can then use to sort them. The names of the variables in that data set will be a bit odd, but if you open the browser, it will be easy to spot the sic variable and the marginal effect variable.
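
                          A minimal sketch of how that saved file might be used to rank the industries (marginal_effects.dta is the file created by the -saving()- call above; _margin is the name -margins, saving()- normally gives the estimates, but run -describe- first to confirm):

                          Code:
                          * sketch: rank industries by the saved marginal effects
                          preserve
                          use marginal_effects, clear
                          describe                 // inspect the (somewhat odd) variable names first
                          gsort -_margin           // _margin holds the estimated marginal effect
                          list in 1/10, clean      // industries with the largest marginal effects
                          restore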



                          • #14
                            Originally posted by Clyde Schechter (post #13)
                            Thanks.

                            I am running:
                            Code:
                            reghdfe ABPROD BORDERLINE POST BORDERLINE_POST i.sic i.sic#POST i.sic#BORDERLINE_POST SIZE LEV BTM ROA MKTCAP ABACC, noabsorb cluster(gvkey)
                            margins sic, dydx(BORDERLINE_POST) saving(marginal_effects, replace)
                            I get the following error:
                            invalid dydx() option;
                            variable BORDERLINE_POST may not be present in model as factor and continuous predictor

                            (which is true of course; both BORDERLINE and POST are dummies)

                            How do I proceed?



                            • #15
                              In your -reghdfe- you need to specify BORDERLINE_POST (and POST, too) with the i. prefix. By default, -margins- interprets any variable that appears outside of an interaction as continuous unless it has an i. prefix. And, to make things confusing, it interprets any variable that appears in an interaction as discrete unless it has a c. prefix. Your variables POST and BORDERLINE_POST have both appeared in these conflicting ways. While -reghdfe- doesn't care, -margins- does because it has to handle them differently, so it chokes.

                              Actually, you can simplify the code and avoid problems like this by using the ## operator for interactions instead of the # operator. The ## operator automatically includes the constituent effects of the interactions it appears in, so you will never inadvertently omit one, nor include one without properly specifying its status as a categorical variable. And you can take it a step farther by taking advantage of the fact that factor-variable notation is an algebra.

                              Code:
                              reghdfe ABPROD BORDERLINE i.sic##i.(POST BORDERLINE_POST) SIZE LEV BTM ROA MKTCAP ABACC, noabsorb cluster(gvkey)
                              margins sic, dydx(BORDERLINE_POST) saving(marginal_effects, replace)
                              Not only is this notation safer because it prevents inadvertent specification of invalid models, I think it also makes the code much easier to read.
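
                              To see concretely that ## brings in the main effects automatically, a small sketch on the auto data (foreign and rep78 are just stand-ins for the variables above):

                              Code:
                              * sketch: fvexpand shows what a ## specification expands to
                              sysuse auto, clear
                              fvexpand i.foreign##i.rep78
                              display "`r(varlist)'"    // main-effect indicators plus the interaction cells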

