Panel data - count model: FE vs RE different predicted counts

Marvin Aliaga

Join Date: Feb 2015
Posts: 255

12 Apr 2017, 09:14

Hello everyone,

I have a panel dataset in which the panels are classrooms and the time component is weeks. I am interested in whether the proportion of adolescents (main IV) influences the number of fights in classrooms (DV). I am also including the total number of students in the classroom as a covariate. My data are unbalanced.
Since I am interested in analyzing the impact of a variable that varies over time and want to control for between-panel variation, I decided to use fixed-effects (FE) models. After running my model, I ran the margins and marginsplot commands to see the predicted counts. The predicted values looked inflated to me. I reran the model using RE to compare predicted values and observed big differences in the predicted values as well as in the CIs. After doing some reading, I learned that FE models tend to have larger standard errors than RE models, especially when there is little within-subject variation compared to between-subject variation. This is the case in my data (see below). However, the coefficients from the two models are close.

Why am I seeing such a big difference in the predicted incident counts, although the coefficients are very similar between FE and RE?
How can we test which model is giving better predictions? Residuals?
Which model is preferable?
Is there an easy way to run a Hausman test for count data to see which model to use?
I know that the number of fights has increased over time. Is this affecting the model? Should I include time as a covariate in the model?

Thank you in advance,
Marvin

Code:

xtnbreg WeekFight AgeYAp All, fe irr  

Conditional FE negative binomial regression     Number of obs     =      4,133
Group variable: ha                              Number of groups  =        141

                                                Obs per group:
                                                              min =          2
                                                              avg =       29.3
                                                              max =         52

                                                Wald chi2(2)      =       8.89
Log likelihood  = -1630.2392                    Prob > chi2       =     0.0118

------------------------------------------------------------------------------
   WeekFight |        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      AgeYAp |   1.011179   .0039194     2.87   0.004     1.003526     1.01889
         All |   1.005263   .0080922     0.65   0.514     .9895268    1.021249
       _cons |   1.525927   .7998618     0.81   0.420     .5461977     4.26302
------------------------------------------------------------------------------

xtnbreg WeekFight AgeYAp All, re irr  

Random-effects negative binomial regression     Number of obs     =      5,043
Group variable: ha                              Number of groups  =        243

Random effects u_i ~ Beta                       Obs per group:
                                                              min =          1
                                                              avg =       20.8
                                                              max =         52

                                                Wald chi2(2)      =      87.28
Log likelihood  = -2048.9312                    Prob > chi2       =     0.0000

------------------------------------------------------------------------------
   WeekFight |        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      AgeYAp |   1.012833    .001718     7.52   0.000     1.009472    1.016206
         All |   .9982385   .0044208    -0.40   0.691     .9896113    1.006941
       _cons |   1.614235    .717855     1.08   0.282     .6752036    3.859211
-------------+----------------------------------------------------------------
       /ln_r |   3.880139   .4167189                      3.063385    4.696893
       /ln_s |   .9124844   .2173875                      .4864128    1.338556
-------------+----------------------------------------------------------------
           r |   48.43095   20.18209                      21.39987    109.6061
           s |   2.490502    .541404                      1.626471    3.813533
------------------------------------------------------------------------------



. xtsum AgeYAp

Variable         |      Mean   Std. Dev.       Min        Max |    Observations
-----------------+--------------------------------------------+----------------
AgeYAp   overall |  33.01448   42.01591          1        100 |     N =    5043
         between |             36.16705          1        100 |     n =     243
         within  |             10.04124  -59.38552   121.3693 | T-bar = 20.7531


* Commands to generate predicted values
* Random effects (the default when neither fe nor re is specified)
xtnbreg WeekFight AgeYAp c.All##c.All if hatype==1, irr
margins, at(AgeYAp=(0(10)100)) atmeans vsquish predict(nu0)
marginsplot, noci

* Fixed effects
xtnbreg WeekFight AgeYAp if hatype==1, fe irr
margins, at(AgeYAp=(0(10)100)) atmeans vsquish predict(nu0)
marginsplot

[Figure: FE vs RE.png]

Last edited by Marvin Aliaga; 12 Apr 2017, 10:04.


Joao Santos Silva

Join Date: Apr 2014

Posts: 3063
#2

12 Apr 2017, 09:32

Dear Marvin,

I do not think you are doing what you think you are doing; there is no proper NB-based FE estimator. Just use -xtpoisson- with FE and you should be OK.

Best wishes,

Joao

Clyde Schechter

Join Date: Apr 2014

Posts: 30353
#3

12 Apr 2017, 09:47

First, there is, in general, the possibility that the fixed-effects (conditional) and random-effects estimators will give different results. But more important here, since the IRRs came out pretty similar, notice that you have different estimation samples. I don't know if you have omitted some additional commands and output that modified the data set between those two regressions. Fixed-effects estimators sometimes exclude panels from the estimation sample because of, for example, a lack of variation in the outcome variable in the panel. For example, if there is any panel (ha) where the outcome variable is always zero, it will be omitted from the estimation sample by -xtnbreg, fe-, but not by -xtnbreg, re-. That would also explain why the counts are higher in -fe-: it is a biased estimation sample that excludes zero outcomes. But it is a bit perplexing because when Stata omits panels this way, it gives you a message saying it has done so, and I don't see that in what you posted.

Finally, a suggestion about your modeling. It seems to me it might make sense to make AllInmates an exposure variable, rather than an ordinary covariate, in these models. I'm not sure about that--I'm speculating about what the variable means and what the science of this phenomenon is. But think about it.

Added: Crossed with #2.

Joao is correct that what Stata calls -xtnbreg, fe- is not a fixed-effects estimator in the same sense that, say, -xtreg, fe- or -xtlogit, fe- is. Rather the fixed effects in -xtnbreg, fe- are used to model the overdispersion parameter, not the regular coefficients of the model. It is also true that the fixed-effects Poisson estimator is rather robust and should give you good results. But you will run into the same problem there: -xtpoisson, fe- also omits panels where the outcome is constantly zero and the results will be upwardly biased.
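
The sample difference described above can be checked directly. The following is a minimal sketch, assuming the data are already -xtset- with ha as the panel variable; max_fights, n_weeks, tag_ha, and fe_sample are hypothetical helper variables, not names from the original data set.

```stata
* Flag panels whose outcome is zero in every week, and panels with one row
bysort ha: egen max_fights = max(WeekFight)
bysort ha: generate n_weeks = _N
egen tag_ha = tag(ha)

* Count panels that a conditional FE count model cannot use
count if tag_ha & max_fights == 0      // all-zero outcomes
count if tag_ha & n_weeks == 1         // only one observation

* Refit the RE model on the FE estimation sample to compare like with like
xtnbreg WeekFight AgeYAp All, fe irr
generate byte fe_sample = e(sample)
xtnbreg WeekFight AgeYAp All if fe_sample, re irr
```

If the RE predictions on the restricted sample move toward the FE predictions, the gap is likely driven by the dropped all-zero panels rather than by the estimator itself.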

Last edited by Clyde Schechter; 12 Apr 2017, 09:50.

Marvin Aliaga

Join Date: Feb 2015

Posts: 255
#4

12 Apr 2017, 10:23

Thank you both for the replies. I modified my original post, namely the name of the covariate: All = the number of students in the classroom.

1.

I don't know if you have omitted some additional commands and output that modified the data set between those two regressions.

I did not modify anything. As you explained, there are fewer panels in the fixed-effects model because some panels have no outcomes (all 0) and some panels have only one observation (I omitted this output from my first post for better display).

Code:

. xtnbreg WeekFight AgeYAp All, fe irr
note: 14 groups (14 obs) dropped because of only one obs per group
note: 88 groups (896 obs) dropped because of all zero outcomes

2.

I'm speculating about what the variable means and what the science of this phenomenon is

The All variable is the number of students in the classroom. The idea is that classrooms with more students will have more fights. However, for this project, I am only interested in the effect of the percentage of adolescents in the classroom.

3.

Joao is correct that what Stata calls -xtnbreg, fe- is not a fixed-effects estimator in the same sense that, say, -xtreg, fe- or -xtlogit, fe- is. Rather the fixed effects in -xtnbreg, fe- are used to model the overdispersion parameter, not the regular coefficients of the model.

So is there a way to predict the number of fights after running the regression?

4. Overall, I think the best approach is to use xtnbreg, re because of the panels with all-zero outcomes and the panels with only one observation. Does this make sense?

Thank you,
Marvin

Clyde Schechter

Join Date: Apr 2014

Posts: 30353
#5

12 Apr 2017, 10:42

Re 2. So the question is: is the number of students in the classroom actually an exposure measure? For example, in other contexts one might be interested in the rate of accidents per mile on certain highways. The length of the highway is then an exposure measure. Otherwise put, if you could reasonably think of the phenomenon of fights here in terms of a rate of fights per student in the classroom, then that All variable should be an exposure, not a regular covariate.

Re 3. If you use -xtpoisson-, the same -predict(nu0)- option is available.

Re 4. Yes, I agree a random effects count model is appropriate here. Remember, that this entails the assumption that the random effects are independent of the other predictors in order to get consistent estimates. Now, given that your fixed and random effects models produced very similar coefficients, it is unlikely that you have a problem here. So I think you can go ahead.
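
As a sketch of how the points above might be put together with -xtpoisson-, assuming the data are already -xtset- with ha as the panel variable (pois_fe and pois_re are hypothetical names for the stored estimates):

```stata
* Fixed-effects and random-effects Poisson on the same specification
xtpoisson WeekFight AgeYAp All, fe
estimates store pois_fe
xtpoisson WeekFight AgeYAp All, re
estimates store pois_re

* Predicted counts from the RE fit, with the panel effect set to zero
margins, at(AgeYAp=(0(10)100)) atmeans predict(nu0)
marginsplot

* Hausman test: always-consistent estimator (FE) first, efficient one (RE) second
hausman pois_fe pois_re
```

A large Hausman statistic would cast doubt on the independence assumption behind the RE model. The test uses the default (non-robust) variance estimates and can fail to produce a usable statistic in some samples, so it is a check rather than a decision rule.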

Marvin Aliaga

Join Date: Feb 2015
Posts: 255
#6

12 Apr 2017, 12:36

Hi Clyde,

Thank you so much for the explanation.

1. So after the discussion, I decided to use xtnbreg, re to get the main coefficient. Then, to predict the counts, I will use xtpoisson, re followed by the margins command with predict(nu0). In any case, the coefficient of my IV of interest does not change considerably from negative binomial to Poisson.

IRR: 1.012 for xtnbreg and 1.013 for xtpoisson

2. When Number of students are included as exposure variable, the predictor coefficient increased from (IR=1.3 to 2.6). The interpretation of the results is similar when using the exposure option, correct? Can I say that for every additional percentage-point increase in the adolescent population, there is an increase of 2.2% in the count of fights, controlling for the total population of the classroom?

Code:

xtnbreg WeekFight AgeYAp All, re irr
------------------------------------------------------------------------------
   WeekFight |        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      AgeYAp |   1.012833    .001718     7.52   0.000     1.009472    1.016206
         All |   .9982385   .0044208    -0.40   0.691     .9896113    1.006941
       _cons |   1.614235    .717855     1.08   0.282     .6752036    3.859211
-------------+----------------------------------------------------------------
       /ln_r |   3.880139   .4167189                      3.063385    4.696893
       /ln_s |   .9124844   .2173875                      .4864128    1.338556
-------------+----------------------------------------------------------------
           r |   48.43095   20.18209                      21.39987    109.6061
           s |   2.490502    .541404                      1.626471    3.813533
------------------------------------------------------------------------------
LR test vs. pooled: chibar2(01) = 102.24               Prob >= chibar2 = 0.000


xtnbreg WeekFight AgeYAp, re irr exposure(All)

------------------------------------------------------------------------------
     WeekFight |        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      AgeYAp |    1.02127   .0016532    13.00   0.000     1.018035    1.024515
       _cons |   .0459844    .019373    -7.31   0.000     .0201374    .1050067
ln(All     ) |          1  (exposure)
-------------+----------------------------------------------------------------
       /ln_r |   3.616217   .4193659                      2.794275    4.438159
       /ln_s |   .5627317   .1896028                       .191117    .9343464
-------------+----------------------------------------------------------------
           r |   37.19657   15.59897                      16.35076    84.61898
           s |   1.755461   .3328404                      1.210601    2.545549
------------------------------------------------------------------------------
LR test vs. pooled: chibar2(01) = 158.81               Prob >= chibar2 = 0.000

Thank you,
Marvin


Clyde Schechter

Join Date: Apr 2014

Posts: 30353
#7

12 Apr 2017, 12:51

When Number of students are included as exposure variable, the predictor coefficient increased from (IR=1.3 to 2.6).

But I don't see either of those numbers, nor anything like them, in the outputs you posted.

Anyway, let's say the IRR is 1.02 (as in the last of your outputs). Then the implication is that a difference in AgeYAp of 1 unit (not 1%, but 1 unit, whatever the unit is--isn't it the count of adolescents in the class?) is associated with a 2% difference (same direction) in the number of fights, adjusted for the number of students in the class.

Marvin Aliaga

Join Date: Feb 2015

Posts: 255
#8

12 Apr 2017, 13:01

Hi Clyde,

1.

But I don't see either of those numbers, nor anything like them, in the outputs you posted.

It is in the output in my previous post, although I made a little mistake with one of the coefficients (it should be 2.1%). Doesn't IRR = 1.012833 correspond to 1.28% and IRR = 1.02127 to 2.1%? (I mistakenly posted 2.6% in my previous post.)

2.

(not 1%, but 1 unit, whatever the unit is--isn't it count of adolescents in the class)

No, the variable AgeYAp is the percentage of adolescents in the classroom, not the number of adolescents. It goes from 1% to 100%. A classroom with 100% adolescents means that all the students in the classroom are adolescents, 50% means that half of the students in the classroom are adolescents, and so on.

Clyde Schechter

Join Date: Apr 2014

Posts: 30353
#9

12 Apr 2017, 13:10

1. So, yes 1.02127 is 2.1%. It is also 2.13%, or 2.127%. I said 2% because I was using one less significant figure than you were. If you look at the 95% CI, you can see that calling the result the 2% figure would be an appropriate rounding of the result anywhere within that confidence interval, but the 2.1% figure is claiming a level of precision that is not really supported by the data. That said, I am not always that consistent in how many figures I keep when I round. I should be, but I'm not.

2. Thanks for that clarification.
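
The conversion from IRR to percent difference discussed in point 1 is simply 100*(IRR - 1). A small sketch using the IRR values copied from the outputs earlier in the thread:

```stata
* Percent difference in the expected count per 1-unit difference in AgeYAp
display "covariate model: " %5.2f 100*(1.012833 - 1) " percent"
display "exposure model:  " %5.2f 100*(1.02127 - 1) " percent"

* Over a 10-unit difference the IRRs multiply rather than add
display "10-unit difference: " %5.2f 100*(1.02127^10 - 1) " percent"
```

The first two lines give 1.28 and 2.13 percent. The last line illustrates that a 10-unit difference corresponds to about 23 percent, slightly more than 10 times the per-unit figure, because the model is multiplicative.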

Marvin Aliaga

Join Date: Feb 2015
Posts: 255

#10

13 Apr 2017, 10:14

Clyde,

I am sorry to come back to this.

2. So is it valid to have the percentage of adolescents as the independent variable when using the exposure option with all students (raw number)? Wouldn't it be better to have the actual number of adolescent students as the predictor variable if we are controlling (via exposure) for the total number of students? Just to test this, I ran two models, one using the percentage of adolescents as the predictor variable and another using the actual number of adolescents, and got two different results. The percentage of adolescents is significant but the actual number of adolescents is not. How can you explain this? Am I missing something?

Code:

. xtnbreg WeekFight AgeYAp if hatype==1,  fe irr exposure(All) // sig
note: 14 groups (14 obs) dropped because of only one obs per group
note: 88 groups (896 obs) dropped because of all zero outcomes

Iteration 0:   log likelihood = -1781.0404  
Iteration 1:   log likelihood = -1641.6473  
Iteration 2:   log likelihood = -1637.7526  
Iteration 3:   log likelihood = -1637.6336  
Iteration 4:   log likelihood = -1637.6331  
Iteration 5:   log likelihood = -1637.6331  

Conditional FE negative binomial regression     Number of obs     =      4,133
Group variable: ha                              Number of groups  =        141

                                                Obs per group:
                                                              min =          2
                                                              avg =       29.3
                                                              max =         52

                                                Wald chi2(1)      =      27.23
Log likelihood  = -1637.6331                    Prob > chi2       =     0.0000

------------------------------------------------------------------------------
     WeekFight |        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      AgeYAp |    1.01883   .0036422     5.22   0.000     1.011717    1.025994
       _cons |   .0556554   .0249497    -6.44   0.000     .0231166    .1339958
ln(All     ) |          1  (exposure)
------------------------------------------------------------------------------


. xtnbreg WeekFight AgeYA if hatype==1,  fe irr exposure(All)
note: 14 groups (14 obs) dropped because of only one obs per group
note: 88 groups (896 obs) dropped because of all zero outcomes

Iteration 0:   log likelihood = -1832.6785  
Iteration 1:   log likelihood = -1657.2678  
Iteration 2:   log likelihood = -1652.6635  
Iteration 3:   log likelihood = -1652.6592  
Iteration 4:   log likelihood = -1652.6592  

Conditional FE negative binomial regression     Number of obs     =      4,133
Group variable: ha                              Number of groups  =        141

                                                Obs per group:
                                                              min =          2
                                                              avg =       29.3
                                                              max =         52

                                                Wald chi2(1)      =       0.66
Log likelihood  = -1652.6592                    Prob > chi2       =     0.4163

------------------------------------------------------------------------------
   WeekFight |        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       AgeYA |   1.008226   .0101614     0.81   0.416     .9885051     1.02834
       _cons |    .129572   .0525454    -5.04   0.000     .0585225    .2868793
ln(All     ) |          1  (exposure)
------------------------------------------------------------------------------

3. Also, I ran two RE models: one using the All variable (all students in the classroom) as a covariate and one using it as an exposure variable (see output below). In the first model (covariate) my IV is not significant, and in the second (exposure) it is. How can I interpret this? What is the difference between having the All variable as a covariate vs. an exposure? Both control for the effect of the classroom population, no?

Code:

xtnbreg WeekFight AgeYAp All if hatype==2,  re irr  

Random-effects negative binomial regression     Number of obs     =        622
Group variable: ha                              Number of groups  =         27

Random effects u_i ~ Beta                       Obs per group:
                                                              min =          1
                                                              avg =       23.0
                                                              max =         52

                                                Wald chi2(2)      =       0.26
Log likelihood  = -368.07282                    Prob > chi2       =     0.8783

------------------------------------------------------------------------------
   WeekFight |        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      AgeYAp |   .9972539   .0062042    -0.44   0.658     .9851678    1.009488
         All |   .9986238   .0170515    -0.08   0.936     .9657566     1.03261
       _cons |   285600.3   1.50e+08     0.02   0.981            0           .
-------------+----------------------------------------------------------------
       /ln_r |   14.54055   525.1086                     -1014.653    1043.734
       /ln_s |   .6526267   .4113489                     -.1536024    1.458856
-------------+----------------------------------------------------------------
           r |    2064813   1.08e+09                             0           .
           s |   1.920579   .7900281                      .8576129    4.301035
------------------------------------------------------------------------------
LR test vs. pooled: chibar2(01) = 40.57                Prob >= chibar2 = 0.000

xtnbreg WeekFight AgeYAp if hatype==2,  re irr exposure(All) // sig



Random-effects negative binomial regression     Number of obs     =        622
Group variable: ha                              Number of groups  =         27

Random effects u_i ~ Beta                       Obs per group:
                                                              min =          1
                                                              avg =       23.0
                                                              max =         52

                                                Wald chi2(1)      =       4.09
Log likelihood  = -367.67731                    Prob > chi2       =     0.0431

------------------------------------------------------------------------------
   WeekFight |        IRR   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      AgeYAp |   1.010462   .0051982     2.02   0.043     1.000325    1.020701
       _cons |    5888.06    2878673     0.02   0.986            0           .
ln(All     ) |          1  (exposure)
-------------+----------------------------------------------------------------
       /ln_r |   14.03036   488.8989                     -944.1938    972.2545
       /ln_s |   .4969159   .3729823                      -.234116    1.227948
-------------+----------------------------------------------------------------
           r |    1239678   6.06e+08                             0           .
           s |   1.643644   .6130503                        .79127    3.414216
------------------------------------------------------------------------------
LR test vs. pooled: chibar2(01) = 59.43                Prob >= chibar2 = 0.000

Thank you in advance,
Marvin


Clyde Schechter

Join Date: Apr 2014

Posts: 30353
#11

13 Apr 2017, 12:13

So it is valid to have the percent of adolescent as the independent variable when using a exposure option with all students (raw number)? it wouldn't be better to have actual number of adolescent students as the predictor variable if we are controlling (exposure) for total number of students?

The two models answer different questions. Both are "valid." You have to pick the one that answers the question you are asking. And the answers can certainly be different. It gets a little murkier to understand because you are also adjusting the analysis for the total number of students. But both approaches are still meaningful:

1. When you use the number of adolescents in the classroom, also adjusting (whether as exposure or covariate) for the total number of students in the classroom, you are saying that the risk of a fight breaking out is associated with the overall class size, and, on top of that, each adolescent in the classroom adds (or subtracts) additional risk. The extent to which adolescents in the classroom multiply the risk that any classroom of that size would experience depends purely on how many adolescents there are: whether it is 5 out of 6 or 5 out of 100, there is a certain multiplication of risk associated with having 5 adolescents in the class, over and above the risk associated with whatever the class size is.

2. When you use the percent of adolescents in the classroom, also adjusting (whether as exposure or covariate) for total number of students in the classroom, you are saying that the risk of a fight breaking out is associated with the overall class size, and, on top of that, a higher concentration of adolescents in the classroom further multiplies the risk of fights beyond what any class of that size would experience. In this case, the impact of 3 adolescents in a class of 30 would be the same as the impact of 10 adolescents in a class of 100, or 1 adolescent in a class of 10.

Which of these is a more appropriate way to model the risk of fights is a scientific, not a statistical question. If there is no background literature you can draw on for this, then it is reasonable to try both models and to see which model gives a better fit to the data. It is not reasonable to try both models and see which one gives a "statistically significant" result. The latter is not science at all, it is p-hacking, and some even consider it scientific misconduct (though I think that position is a bit extreme given that so many researchers don't understand that it produces false results and do it with innocent intentions).
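
One way the fit comparison mentioned above might be carried out is with information criteria. This is only a sketch: m_percent and m_count are hypothetical names for the stored estimates, and the comparison is meaningful only if both models are fit to the same observations.

```stata
* Percent of adolescents as the predictor
xtnbreg WeekFight AgeYAp, re exposure(All)
estimates store m_percent

* Number of adolescents as the predictor
xtnbreg WeekFight AgeYA, re exposure(All)
estimates store m_count

* Side-by-side log likelihood, AIC, and BIC (lower AIC/BIC indicates better fit)
estimates stats m_percent m_count
```

Because the two models have the same number of parameters, the ranking by AIC or BIC is the same as the ranking by log likelihood.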

As for the difference between using number of students as a covariate and using it as an exposure: when you use it as an exposure, it (actually, its logarithm) is included as a covariate in the model but with its coefficient constrained to be 1. In other words, using it as an exposure, you are implicitly saying that, all else equal, the risk of a fight is doubled when the class size is doubled. The risk of a fight is tripled when the class size is tripled. The risk of a fight is halved when the class size is halved. (I am using causal language for brevity here, but no causal intent is implied.) When you introduce it as a covariate, you are saying that the risk changes with class size, but you are not putting the extent of the changes in exact lockstep.

Variables are usually used as exposures when you are trying to model the rate of occurrence of discrete events, the number of events per something else. In such models, that something else should be specified as the exposure, and the count of observed events is the dependent variable in a poisson, or nbreg, or other count model with log link. Thus in traffic safety studies, one might use number of accidents as the dependent variable and vehicle-miles driven as the exposure to estimate rates of accidents per vehicle-mile. If studying the distribution of chocolate chips in cookies, one might use a count of the chips in the cookies in the bag as the dependent variable, and the weight of the cookies in the bag as the exposure to model the chocolate chips per pound (or kg) of cookies.

So if you are thinking of this as fights per student in the class, then students should be an exposure. But if you're not thinking about it as a rate of fights per student, but just as risk of a fight in a classroom, and the total number of students is one of the determinants of that, then number of students should be an ordinary covariate, not an exposure. Again, which is the right way to think about it is a scientific question. If there is no literature about this to guide you, then it is reasonable to use the model that best fits the data; but you should not choose a model based on whether it gives "statistically significant" results.
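
The constraint described above can be made explicit in code. The following sketch assumes All is strictly positive in every observation; lnAll is a hypothetical new variable, and m_exposure and m_free are hypothetical names for the stored estimates.

```stata
* exposure(All) enters ln(All) with its coefficient fixed at 1
xtnbreg WeekFight AgeYAp, re irr exposure(All)
estimates store m_exposure

* The same model written with an explicit offset
generate double lnAll = ln(All)
xtnbreg WeekFight AgeYAp, re irr offset(lnAll)

* Relax the constraint: estimate the coefficient on ln(All) freely
xtnbreg WeekFight AgeYAp lnAll, re
estimates store m_free

* Wald test of whether the freely estimated coefficient is compatible with 1
test lnAll = 1
```

The first two fits should agree. If the freely estimated coefficient on lnAll is far from 1, the strict proportionality implied by the exposure specification may not describe the data well, although the choice between the two remains a substantive question as noted above.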

I'll take you a step farther down this road. Presumably you are modeling this because the organization that hired you for this is interested in finding ways to reduce the number of fights in their classrooms. If they are going to try to design experiments for this, or even adopt policies directly from observational data without experimenting first, the relevant question is how much a change in some variable they can control would affect the frequency of fights. So the relevant statistics in your outputs are the incidence rate ratios themselves (or the predicted fight counts that you could get from -margins-), and some sense of how precise those estimates are (confidence intervals). P-values would simply test the hypothesis that there is no effect at all. But presumably the variables you are looking at were chosen because reasonable people think they are relevant determinants of the number of fights. The null hypothesis of no effect is, I think, a straw man here. Nobody thinks that the number or percent of adolescents in the classroom has zero effect on the risk of fights. That's not plausible (at least not to me--I have no expertise in criminology, but I know something about adolescent behaviors). So the question is not whether there is such an effect or not (what a p-value purports to tell you); the question is how large the effect is. Is it large enough that you might be able to test or implement a reasonably effective strategy around it? You won't get the answer to that question from a statistical significance test. You will get that from incidence rate ratios, predicted probabilities or counts of events, and their respective confidence intervals.

Marvin Aliaga

Join Date: Feb 2015

Posts: 255
#12

13 Apr 2017, 13:36

Clyde,

Thank you so much for your great detailed explanation. I really appreciate it! There is so much to learn from it.
