Fixed-effects model using simultaneous ANOVA

Thomas Lindner

Join Date: Apr 2015

Posts: 14
#1

Fixed-effects model using simultaneous ANOVA

21 Apr 2015, 01:41

Hello everyone,

I’d like to measure the CEO-effect (i.e. proportion of variance in a firm-level outcome variable that can be attributed to the presence of individual CEOs) within my unbalanced panel data using Stata 12.0 for Windows.

My model looks as follows:

FirmPerformance_ijkt = GrandMean + Year_i + Industry_j + Company_k + CEO_t + Error_ijkt

In accordance with previous papers, I would like to estimate the effects with a fixed-effect model using simultaneous ANOVA, which is supposed to be “… a modified version of sequential ANOVA. While sequential ANOVA does not estimate the covariance among categories, and instead attributes covariance between two categories to whichever is entered earlier in the model, simultaneous ANOVA estimates the covariance between different categories and adjusts the category estimates accordingly.” (Crossland/Hambrick 2007, p. 780). Another article states “… simultaneous ANOVA allows for a full set of covariance effects but does not assume randomness in the model errors.” (Mackey 2008, p. 1360).

Unfortunately, I couldn’t find anything in how to do a simultaneous ANOVA in Stata so far. Does anyone has an idea?
Is it maybe just something like:

. anova a b c a#b a#c b#c a#b#c a##b##c ?

Thanks in advance!
---------------------------------------

Crossland, C. und D. C. Hambrick (2007): How National Systems Differ in Their Constraints on Corporate Executives: A Study of CEO Effects in Three Countries; in: Strategic Management Journal 28 (8); p. 767–789.

Mackey, A. (2008): The Effect of CEOs on Firm Performance; in: Strategic Management Journal 29 (12); p. 1357-1367.
Tags: None
Sergio Correia

Join Date: Apr 2014

Posts: 420
#2

21 Apr 2015, 15:38

I'm not an anova expert, but is it possible to frame the problem in terms of a linear regression, and then run e.g. reghdfe (ssc) or areg and compute the results?
Note though that if you go that way, the CEOs fixed effects would be inconsistent, and so your estimate for their influence. Maybe you can run two models, one without absorbing for CEO FEs, one absorbing for them, and compare the R2s.
Comment
Thomas Lindner

Join Date: Apr 2015

Posts: 14
#3

13 May 2015, 11:30

Thanks a lot for the answer.

After some more research, I found out that apparently authors in this field refer to “simultaneous” instead of “sequential” ANOVA, when they make us of partial (sometime also referred to as unique or simultaneous) sums of squares instead of sequential sums of squares. Therefore, the answer is quit trivial: anova performance year industry company ceo

In this regard I have an additional question. For further calculations, I would like to know the intercepts (constant terms) for each CEO of the underlying regression model of the anova. With the command “regress” after the anova run I’ll get the results. Unfortunately, for some intercepts I just get the result “0 (omitted)”.
Does someone know why this happens und what I have to do in order to get all intercept values?

Thanks in advance!
Comment
Roman Mostazir

Join Date: Apr 2014

Posts: 874
#4

13 May 2015, 18:05

By default Anova produces partial SS. If you want Sequential SS, just use 'sequential' as an option after 'anova' command. Saying that, whichever you think is justified, at least one problem can be identified from your code which resulting 'zero' estimates:

anova a b c a#b a#c b#c a#b#c a##b##c

In the code above, this portion:

Code:

anova a b c a#b a#c b#c a#b#c

is already a full factorial model which is equivalent to:

Code:

anova a##b##c

No wonder the redundant parameters will return 'zero'. Also you might be loosing degrees of freedom if either a, b or c is continuous variable. In that case you need to tell Stata that which variable is continuous i.e. c.a#c.b means they are continuous variable. see

Code:

help fvvarlist

Just for your reference, you omitted the 'y' from your equation code, so did I. For sure we Know there is a "y' in the equation. But as per rules of this forum you are expected to show 'exact' code you used and 'exact' results Stata returned.

Last edited by Roman Mostazir; 13 May 2015, 18:25.

Roman
Comment
Thomas Lindner

Join Date: Apr 2015

Posts: 14
#5

14 May 2015, 05:58

Thanks for your reply.

I think you got my answer wrong. My ‘final’ anova command I use for the analysis is:

anova performance i.year i.industry i.company i.year

with ‘performance’ being my continuous variable (dependent variable) and ‘year’, ‘industry’, ‘company’ and ‘ceo’ being the categorical variables. I’m also aware of the differences in using partial or sequential SS.

As said before, I would now like to know all intercepts of the ‘ceo’ variable of the underlying regression model after the anova run, but for a lot values I just get “0 (omitted)” by using the ‘regress’ command after the anova run. Does anyone know why and how to fix it?
Comment
Roman Mostazir

Join Date: Apr 2014

Posts: 874
#6

14 May 2015, 06:31

I think you got my answer wrong. My ‘final’ anova command I use for the analysis is:

anova performance i.year i.industry i.company i.year

Hmm..I was referring to your codes in #1. But where is 'ceo' in your final model? I can't see it. That is why the FAQ section suggests to provide the 'exact' codes you have used and provide the Stata outputs so that people can easily spot your problem and help you out more efficiently. Now looking at your reply#5, my guess is you need 'margins'. Tried that? Though not sure why 'reg' will provide lots of 'zero' for 'ceo' as it should provide the coefficients for other categories in compared to the reference category. My guess cannot go further unless some outputs are presented.

Best.

Roman
Comment
Thomas Lindner

Join Date: Apr 2015

Posts: 14
#7

14 May 2015, 06:59

Sorry, you are of course totally right, I typed in twice ‘year’ instead of ‘ceo’ in the presented code. Here the ‘correct’ command I used:

anova performance i.year i.industry i.company i.ceo

Then, I get normally looking anova results. Afterwards I type in ‘regress’ in order to find out the intercepts for the variables of the underlying regression model. For both the ‘year’ and ‘industry’ variables the tables are completely filled, whereas for ‘company’ some values are missing and for ‘ceo’ quite a lot. Unfortunately, I couldn’t figure out a pattern why some values are omitted. For the sake of completeness, I attached a screenshot of the results of the regress command.

Thanks in advance for any help!

2 Photos
Comment
Roman Mostazir

Join Date: Apr 2014

Posts: 874
#8

14 May 2015, 07:17

I need to catch up a meeting, so making a quick reply hoping you can sort it out. 'reg' is not giving you the intercepts for each category. It is providing you the difference (coefficient) for each category compared to the reference category, in this case ceo = 1. Try:

Code:

anova performance i.year i.industry i.company i.ceo margins i.year i.ceo

See the results. For future,: it is more convenient and legible if you use code delimiters to post the outputs rather screenshots. Click the A button in the right hand side and click '#' and copy and paste the outputs, Stata codes in between the CODE delimiters.

Best,

Roman
Comment

Thomas Lindner

Join Date: Apr 2015
Posts: 14

16 May 2015, 05:00

Thanks, concerning the stata commands that is exactly what I was looking for. However, if I try to use the margins command for the ‘ceo’ variable, I get a “not estimable” notification (the same occurs for ‘industry’ and ‘company’). See the following stata output:

Code:

 anova performance i.year i.industry i.company i.ceo

                           Number of obs =    1013     R-squared     =  0.4853
                           Root MSE      = 12.7618     Adj R-squared =  0.2999

                  Source |  Partial SS    df       MS           F     Prob > F
              -----------+----------------------------------------------------
                   Model |  114242.222   268  426.276947       2.62     0.0000
                         |
                    year |  1304.54193     9  144.949104       0.89     0.5337
                industry |  646.920433    11  58.8109485       0.36     0.9703
                 company |  72992.8089   126  579.308007       3.56     0.0000
                     ceo |  26083.3129   122  213.797646       1.31     0.0194
                         |
                Residual |  121171.066   744  162.864336  
              -----------+----------------------------------------------------
                   Total |  235413.288  1012  232.621826  

. margins i.ceo

Predictive margins                                Number of obs   =       1013

Expression   : Linear prediction, predict()

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         ceo |
          1  |          .  (not estimable)
          2  |          .  (not estimable)
          3  |          .  (not estimable)
          4  |          .  (not estimable)
          5  |          .  (not estimable)

...

As far as I can tell, it seems like I have problems with collinearity. Using the ‘noestimcheck’ option for the ‘margins’ command, results are calculated. However, for all ‘ceo’ values that where labeled “0 (omitted)” by using the ‘regress’ command, I’ll now get the identical numerical values for each ‘ceo’ value. See the following stata output:

Code:

. margins i.ceo, noestimcheck

Predictive margins                                Number of obs   =       1013

Expression   : Linear prediction, predict()

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         ceo |
          1  |   1.677583   .6844242     2.45   0.014      .336136    3.019029
          2  |   .0656615   12.82456     0.01   0.996    -25.07001    25.20133
          3  |   .9846664   14.00152     0.07   0.944     -26.4578    28.42713
          4  |   1.677583   .6844242     2.45   0.014      .336136    3.019029
          5  |  -.2455593   8.153771    -0.03   0.976    -16.22666    15.73554
          6  |   .0731999   18.16001     0.00   0.997    -35.51977    35.66617
          7  |   33.69258   10.71732     3.14   0.002     12.68703    54.69814
          8  |   22.90153    14.3586     1.59   0.111    -5.240803    51.04387
          9  |   1.677583   .6844242     2.45   0.014      .336136    3.019029

.....
        253  |   1.677583   .6844242     2.45   0.014      .336136    3.019029
        254  |   1.677583   .6844242     2.45   0.014      .336136    3.019029
        255  |   1.677583   .6844242     2.45   0.014      .336136    3.019029
        256  |   1.677583   .6844242     2.45   0.014      .336136    3.019029
        257  |   1.677583   .6844242     2.45   0.014      .336136    3.019029
        258  |   1.677583   .6844242     2.45   0.014      .336136    3.019029
------------------------------------------------------------------------------

Does anyone know how to fix this? Are my data just too sparse to correctly conduct this kind of analysis? For the sake of completness see an overview of the data:

Code:

. summarize year industry company ceo

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
        year |      1013    2006.803    2.788406       2002       2011
    industry |      1013    6.023692    2.928568          1         12
     company |      1013    125.0237    124.1759          1        559
         ceo |      1013    129.1125     74.7792          1        258

Thanks in advance for any help!

Last edited by Thomas Lindner; 16 May 2015, 05:15.

Comment

Roman Mostazir

Join Date: Apr 2014

Posts: 874
#10

17 May 2015, 20:07

To be honest, I don't know as I don't have the data for further diagnosis. However, looking at your output, my first guess would be because the 'ceo' variable includes 258 categories !! you may have `zero' in 'performance' values in relation to some of the 'ceo' values. But I am not sure if this is the source of the problem. This can be checked with some descriptive stats. Could you please check it for the identical estimates related to 'ceo' values. Run the following to check it:

Code:

foreach j of numlist 1 4 9 253 254 255 256 257 258 { tab ceo if ceo == `j' ,sum(performance) }

See the returned results.

Saying that, another possibility popping in my mind. I don't know what kind of variable 'ceo' is but could you treat it as a continuous variable? Would it make more sense? Interpreting and presenting estimates from 258 categories seems far away from being a decent idea unless you have strong reasons behind that. If a linear fit treating 'ceo' as continuous variable make sense, you can use 'margins' for post estimation means at certain intervals of 'ceo' and plot the effects..for example:

Code:

anova performance ceo margins,at(ceo=(1 (50) 250) marginsplot,xdim(ceo)

Not sure if that solves your problems.

Best,

Roman
Comment

Thomas Lindner

Join Date: Apr 2015
Posts: 14

#11

18 May 2015, 04:03

Thanks again for your answer.

The ‘ceo’ variable is the name of a CEO in charge for a respective company, which conducts business in the related industry and year. So the first value of the ‘ceo’ variable might be “Steve Jobs”, working for “Apple” (‘company’ variable) which is in the “Technology Sector” (‘industry’ variable) in “2008” (‘year’ variable) and the company had an return of sales in that year of “15,7 %” (‘performance’ variable). So the 'ceo' variable should be a categorical variable in my opinion. The ‘performance’ variable is at least in my sample never 0 or identical to another value.

I don’t want to report the different ‘ceo’ intercepts directly, but I would like to know them for further calculations. Therefore, your suggested margins plot doesn’t really solve my problem.

My stata output for running your for-loop looks as follows:

Code:

. foreach j of numlist 1 4 9 253 254 255 256 257 258 {
  2. 
. tab ceo if ceo == `j' ,sum(performance)
  3. 
. }

            |       Summary of performance
        ceo |        Mean   Std. Dev.       Freq.
------------+------------------------------------
          1 |   13.868823   4.3292114           3
------------+------------------------------------
      Total |   13.868823   4.3292114           3

            |       Summary of performance
        ceo |        Mean   Std. Dev.       Freq.
------------+------------------------------------
         4  |   2.4502015   1.3382299           9
------------+------------------------------------
      Total |   2.4502015   1.3382299           9

            |       Summary of performance
        ceo |        Mean   Std. Dev.       Freq.
------------+------------------------------------
          9 |   3.5016346   3.1326687           7
------------+------------------------------------
      Total |   3.5016346   3.1326687           7

            |       Summary of performance
        ceo |        Mean   Std. Dev.       Freq.
------------+------------------------------------
       253  |   4.9944388   .39269566           7
------------+------------------------------------
      Total |   4.9944388   .39269566           7

            |       Summary of performance
        ceo |        Mean   Std. Dev.       Freq.
------------+------------------------------------
        254  |   2.6932446   1.2328878          10
------------+------------------------------------
      Total |   2.6932446   1.2328878          10

            |       Summary of performance
        ceo |        Mean   Std. Dev.       Freq.
------------+------------------------------------
       255  |   1.8880489   7.5637416           2
------------+------------------------------------
      Total |   1.8880489   7.5637416           2

            |       Summary of performance
        ceo |        Mean   Std. Dev.       Freq.
------------+------------------------------------
        256 |   .70206482   2.3068254           5
------------+------------------------------------
      Total |   .70206482   2.3068254           5

            |       Summary of performance
        ceo |        Mean   Std. Dev.       Freq.
------------+------------------------------------
       257  |  -1.9966718   1.7356451           4
------------+------------------------------------
      Total |  -1.9966718   1.7356451           4

            |       Summary of performance
        ceo |        Mean   Std. Dev.       Freq.
------------+------------------------------------
        258 |  -.97718828   2.8261166           2
------------+------------------------------------
      Total |  -.97718828   2.8261166           2

Looking at the raw data I just can't figure out a particular pattern why some values of 'ceo' are properly calculated and others not. Anyone an idea?

Thanks in advance for any help!

Comment

Roman Mostazir

Join Date: Apr 2014

Posts: 874
#12

18 May 2015, 10:32

First of all, you are correct. 'ceo' should be categorical. Now we have a basic problem here. Because I focused on the the anova command and partial/sequential-SS issue, I didn't notice what kind of dataset you have. Now from post-1 and from the post above #11 it is clear that you have a panel dataset, therefore, your model is misspecified. You are fitting a single-level model with multi-level structure (firm – year–ceo) . You either need 'repeated measured anova' or 'mixed' ('mixed' is preferred, see help mixed). After fitting the right model then try estimating the `ceo' means with `margins'.

Last edited by Roman Mostazir; 18 May 2015, 10:35.

Roman
Comment

Thomas Lindner

Join Date: Apr 2015
Posts: 14

#13

20 May 2015, 14:45

You are totally right, I have an unbalanced panel dataset over the timespan of 10 years. I should have mentioned that right away.

Within the greater context of my analysis, I was going to estimate the so called “CEO-effect” (i.e. the portion in firm performance variance explained by the respective CEO in charge) not only with an ANOVA as above (which most papers do), but also with a multi-level model using ‘xtmixed’ in order to explicitly account for the nested structure of the data. Articles applying this MLM approach used the following Stata command:

Code:

 xtmixed performance || industry: || firm: || ceo: || year: , var

As you can see, no cross-level covariances are modeled. If, however, I run the model with my data, the “year-effect” (i.e. the portion in firm performance variance explained by the year) is unusually high compared to my ANOVA estimates. In other articles the different effects were at least in comparable range. Therefore, I’m not sure if the MLM model is correctly specified for my dataset. Any suggestions?

For the sake of completeness see my Stata output for the partial ANOVA in comparison to the MLM approach:

Code:

. xtmixed performance || industry: || company: || ceo:  || year: , var technique(bfgs)

...

Computing standard errors:

Mixed-effects ML regression                     Number of obs      =      1013

-----------------------------------------------------------
                |   No. of       Observations per Group
 Group Variable |   Groups    Minimum    Average    Maximum
----------------+------------------------------------------
       industry |       12         20       84.4        308
        company |      138          1        7.3         10
            ceo |      258          1        3.9         10
           year |     1013          1        1.0          1
-----------------------------------------------------------

                                                Wald chi2(0)       =         .
Log likelihood = -4129.8384                     Prob > chi2        =         .

------------------------------------------------------------------------------
 performance |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _cons |   1.448331   .8931908     1.62   0.105    -.3022908    3.198953
------------------------------------------------------------------------------

------------------------------------------------------------------------------
  Random-effects Parameters  |   Estimate   Std. Err.     [95% Conf. Interval]
-----------------------------+------------------------------------------------
industry: Identity           |
                  var(_cons) |   .9782162   3.447573      .0009784    978.0579
-----------------------------+------------------------------------------------
company: Identity            |
                  var(_cons) |   47.34626   14.25791       26.2394    85.43141
-----------------------------+------------------------------------------------
ceo: Identity                |
                  var(_cons) |   28.49681   13.00188      11.65277    69.68884
-----------------------------+------------------------------------------------
year: Identity               |
                  var(_cons) |   141.8553   1694.618      9.62e-09    2.09e+12
-----------------------------+------------------------------------------------
               var(Residual) |   20.26191   1694.602      1.31e-70    3.14e+72
------------------------------------------------------------------------------
LR test vs. linear regression:       chi2(4) =   134.35   Prob > chi2 = 0.0000



anova performance i.year i.industry i.company i.ceo

                           Number of obs =    1013     R-squared     =  0.4861
                           Root MSE      = 12.7608     Adj R-squared =  0.3000

                  Source |  Partial SS    df       MS           F     Prob > F
              -----------+----------------------------------------------------
                   Model |  114425.437   269  425.373373       2.61     0.0000
                         |
                    year |  1397.10497     9  155.233885       0.95     0.4778
                industry |  2149.56357    11   195.41487       1.20     0.2828
                 company |  82560.7515   126   655.24406       4.02     0.0000
                     ceo |  26266.5283   123  213.549011       1.31     0.0195
                         |
                Residual |   120987.85   743  162.836945   
              -----------+----------------------------------------------------
                   Total |  235413.288  1012  232.621826

Thanks for any help!

Comment

Roman Mostazir

Join Date: Apr 2014

Posts: 874
#14

20 May 2015, 15:57

Your 'mixed-model' is incorrectly specified no doubt. I am guessing you have a three-level model, therefore two random effects but your specifications are causing confusions. Therefore, before suggesting a model, a sample of your data will be useful to define the hierarchical structure of the dataset. Could you please provide a sample of the data set. Just run the command below and copy the output and paste here with the code-delimiters.

Code:

list industry firm ceo year performance in 1/30,clean

Roman
Comment

Thomas Lindner

Join Date: Apr 2015
Posts: 14

#15

21 May 2015, 02:05

Thanks again for your answer Roman! Here is my Stata output:

Code:

. list industry company ceo year performance in 1/30, clean

       industry   company   ceo   year   perform~e  
  1.          1         1     1   2002   5.3644982  
  2.          1         1     1   2003   6.2105323  
  3.          1         1     1   2004   7.0976729  
  4.          1         1     1   2005    7.420506  
  5.          1         1     1   2006    5.764411  
  6.          1         1     1   2007   6.6186186  
  7.          1         1     1   2008   6.7345012  
  8.          1         1     1   2009   2.7605634  
  9.          1         1     1   2010   5.3399887  
 10.          1         1     1   2011   5.8963093  
 11.          2         3     2   2002   12.060431  
 12.          2         3     2   2003   .88963312  
 13.          2         3     2   2004   7.9966868  
 14.          2         3     2   2005   5.8898665  
 15.          2         3     2   2006   .61083222  
 16.          2         3     3   2007   7.8003133  
 17.          2         3     3   2008   7.3470911  
 18.          2         3     3   2009   14.298538  
 19.          2         3     3   2010   13.635128  
 20.          2         3     3   2011    14.47492  
 21.          2         4     4   2002   8.4291881  
 22.          2         4     5   2003    7.098982  
 23.          2         4     5   2004   8.6865757  
 24.          2         4     5   2005    5.725521  
 25.          2         4     5   2006   2.7502536  
 26.          2         4     5   2007   7.6726214  
 27.          2         4     5   2008   10.115243  
 28.          2         4     5   2009   4.2877742  
 29.          2         4     5   2010   2.7087993  
 30.          2         4     6   2011   5.5520171

As you might see, there is not much variation in this first couple entries. However, my data gets more sparse and variable for other companies as for instance no performance indicators were available. See exemplary the following output:

Code:

. list industry company ceo year performance in 980/1010, clean

        industry   company   ceo   year   performa~e  
 980.          5       455   246   2008    -4.505062  
 981.          5       455   247   2009   -4.1491119  
 982.          5       455   247   2010   -14.133701  
 983.          3       469   248   2002     2.963495  
 984.          3       469   248   2003     3.491723  
 985.          3       469   248   2004    6.3760305  
 986.          3       469   248   2005    6.2084257  
 987.          3       469   248   2006    4.6043165  
 988.          3       478   249   2002   -1.0733204  
 989.          3       478   249   2003     .7910689  
 990.          3       478   249   2004     5.238406  
 991.          3       478   250   2005    .65075203  
 992.         10       496   251   2003   -3.6533432  
 993.         10       505   252   2003    7.4573977  
 994.         10       505   252   2004    16.992098  
 995.         10       505   252   2005    9.5537914  
 996.         10       505   252   2006    12.289505  
 997.          8       509   253   2002    5.5996051  
 998.          8       509   253   2003   -5.5830062  
 999.          1       511   254   2002    4.9023817  
1000.          1       511   254   2003    4.3374542  
1001.          8       534   255   2008    7.6471916  
1002.          8       534   255   2009   -4.2525691  
1003.          8       534   255   2010    .55300577  
1004.          8       534   255   2011    .27123184  
1005.          8       535   256   2008     8.397254  
1006.          8       535   256   2009    7.9749495  
1007.          8       535   256   2010    4.3267016  
1008.          8       535   257   2011    -5.956808  
1009.          8       536   258   2008    25.454743  
1010.          8       536   258   2009    22.419817

The "general idea" behind my data is described in post #11.

Thanks for any help!

Announcement