
  • References about heteroskedasticity in Zero-Inflated Negative Binomial models

    Hello!

    I am trying to set up a zero-inflated negative binomial model using count data. I am not sure whether the presence of heteroskedasticity in such data would bias my estimation, making the estimated coefficients inconsistent. The traditional econometrics textbooks I usually rely on (e.g. Wooldridge, Cameron and Trivedi, Hayashi, etc.) do not say much about count data models and potential heteroskedasticity issues.

    I would be pleased if someone could provide me with some useful references about this.

    Many thanks!
    Gastón


  • #2
    Dear Gaston Fernandez,

    Typically, there is heteroskedasticity in regressions where the dependent variable is a count, so that is taken as a given. Now, I do not know what you are trying to do, but are you sure you need a zero inflated model? Those models are often misused and, additionally, the ZINB likelihood can have multiple maxima and Stata often converges to a local maximum. So, you should be very careful if you really want to use a ZINB model.

    Best wishes,

    Joao



    • #3
      Dear Joao Santos Silva,

      Thanks for your answer. A few comments on your reply.

      there is heteroskedasticity in regressions where the dependent variable is a count, so that is taken as a given
      Does heteroskedasticity bias the estimators, so that it is necessary to model it (e.g. as the hetprobit command does for binary outcomes), or is it sufficient to use heteroskedasticity-robust standard errors?

      Those models are often misused
      What would be the main reason why these models are often misused?

      the ZINB likelihood can have multiple maxima and Stata often converges to a local maximum
      Could this be addressed by setting the starting values for the likelihood maximization, for example?
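      For instance, something along the following lines (just a sketch of what I have in mind; I am assuming that zinb accepts the usual maximize options such as from() and difficult):
      Code:
      * fit the model once and store the converged parameter vector
      zinb vgarp crtstd $X, inflate(crtstd $X)
      matrix b1 = e(b)
      
      * perturb the converged values and re-fit from there; returning to the
      * same log likelihood is reassuring (though not proof of a global maximum)
      matrix b2 = b1*1.5
      zinb vgarp crtstd $X, inflate(crtstd $X) from(b2, copy) difficult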

      Moreover, what I am trying to do is to set up a model for the following dependent variable, which counts the number of times an event occurs in my data:
      Code:
      tab vgarp, m
      
            vgarp |      Freq.     Percent        Cum.
      ------------+-----------------------------------
                0 |        111       53.88       53.88
                1 |          5        2.43       56.31
                2 |         31       15.05       71.36
                3 |         11        5.34       76.70
                4 |          7        3.40       80.10
                5 |         18        8.74       88.83
                6 |          8        3.88       92.72
                7 |          3        1.46       94.17
                8 |          2        0.97       95.15
                9 |          5        2.43       97.57
               10 |          2        0.97       98.54
               11 |          3        1.46      100.00
      ------------+-----------------------------------
            Total |        206      100.00
      
      tabstat vgarp, s(mean var)
      
          variable |      mean  variance
      -------------+--------------------
             vgarp |  1.946602  7.514208
      ----------------------------------
      Since 54% of the observations are equal to zero, I was fitting the following model:
      Code:
      zinb vgarp crtstd $X, inflate(crtstd $X)
      
      Fitting constant-only model:
      
      Iteration 0:   log likelihood = -348.89538  (not concave)
      Iteration 1:   log likelihood = -329.47975  (not concave)
      Iteration 2:   log likelihood = -318.65738  
      Iteration 3:   log likelihood =  -314.4651  
      Iteration 4:   log likelihood = -314.30746  
      Iteration 5:   log likelihood = -314.28332  
      Iteration 6:   log likelihood = -314.27921  
      Iteration 7:   log likelihood = -314.27827  
      Iteration 8:   log likelihood = -314.27804  
      Iteration 9:   log likelihood =   -314.278  
      Iteration 10:  log likelihood = -314.27799  
      
      Fitting full model:
      
      Iteration 0:   log likelihood = -314.27799  
      Iteration 1:   log likelihood = -311.15034  
      Iteration 2:   log likelihood =  -310.5663  
      Iteration 3:   log likelihood =  -310.3594  
      Iteration 4:   log likelihood = -310.35725  
      Iteration 5:   log likelihood = -310.35725  
      
      Zero-inflated negative binomial regression      Number of obs     =        195
                                                      Nonzero obs       =         87
                                                      Zero obs          =        108
      
      Inflation model = logit                         LR chi2(9)        =       7.84
      Log likelihood  = -310.3573                     Prob > chi2       =     0.5502
      
      ------------------------------------------------------------------------------
             vgarp |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
      vgarp        |
            crtstd |   .0616272   .0794811     0.78   0.438    -.0941529    .2174073
         age_years |   -.006631   .0256227    -0.26   0.796    -.0568506    .0435885
              age2 |   .0000702   .0003347     0.21   0.834    -.0005858    .0007261
                   |
               fem |
                0  |          0  (base)
                1  |   .1822168   .1687236     1.08   0.280    -.1484754     .512909
                   |
            region |
              AME  |          0  (base)
              EUR  |  -.3997596   .1946401    -2.05   0.040    -.7812471    -.018272
              ASI  |   .3559845   .3460471     1.03   0.304    -.3222554    1.034224
                   |
             peduc |  -.0347125   .0526516    -0.66   0.510    -.1379078    .0684828
                   |
             math2 |
                1  |          0  (base)
                2  |  -.1601537   .1960938    -0.82   0.414    -.5444904    .2241831
                3  |      .0119   .2367397     0.05   0.960    -.4521012    .4759012
                   |
             _cons |   1.803591   .5207295     3.46   0.001     .7829802    2.824202
      -------------+----------------------------------------------------------------
      inflate      |
            crtstd |   .2811264   .1735864     1.62   0.105    -.0590967    .6213496
         age_years |   .1349144   .0720917     1.87   0.061    -.0063827    .2762116
              age2 |  -.0015847   .0009182    -1.73   0.084    -.0033843    .0002148
                   |
               fem |
                0  |          0  (base)
                1  |  -.4768687   .3538094    -1.35   0.178    -1.170322     .216585
                   |
            region |
              AME  |          0  (base)
              EUR  |   .2909378    .402207     0.72   0.469    -.4973735    1.079249
              ASI  |    -20.225   13124.36    -0.00   0.999    -25743.49    25703.04
                   |
             peduc |   .0961251   .1214731     0.79   0.429    -.1419578     .334208
                   |
             math2 |
                1  |          0  (base)
                2  |   .4459254     .48198     0.93   0.355     -.498738    1.390589
                3  |    .388145   .5467879     0.71   0.478    -.6835396     1.45983
                   |
             _cons |  -3.156989    1.53912    -2.05   0.040    -6.173609   -.1403683
      -------------+----------------------------------------------------------------
          /lnalpha |  -2.059337   .5275167    -3.90   0.000    -3.093251   -1.025423
      -------------+----------------------------------------------------------------
             alpha |   .1275385   .0672787                      .0453543    .3586446
      ------------------------------------------------------------------------------
      Any advice on this? Someone also recommended that I look at hurdle models.
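      If it helps to make that part of the question concrete, my understanding is that a hurdle alternative would look roughly like the sketch below (a logit for any violation versus none, plus a zero-truncated negative binomial for the positive counts; I am assuming tnbreg with the ll(0) option is the right tool for the truncated part):
      Code:
      * indicator for observing at least one violation
      gen byte any_v = vgarp > 0 if !missing(vgarp)
      
      * hurdle part 1: any violation vs. none
      logit any_v crtstd $X
      
      * hurdle part 2: positive counts, modelled as truncated at zero
      tnbreg vgarp crtstd $X if vgarp > 0, ll(0)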

      Thanks!

      Gastón



      • #4
        Dear Gaston Fernandez,

        1 - It is not true that with count data heteroskedasticity biases the estimators; the hetprobit you mention is for binary data.

        2 - The fact that 54% of observations are zero tells us nothing about the suitability of a ZI model. ZI models are often misused because people tend to use them just because there are many zeros in the sample, and that is wrong. ZI models should be used when there is a subpopulation for which the dependent variable is always zero.

        3 - To know whether to use a ZI model, a hurdle model, or a "plain" model, you need to think about how your data are generated and collected, and model that process. Descriptive statistics of the data tell us nothing about which is the best approach.

        Best wishes,

        Joao



        • #5
          Dear Joao Santos Silva,

          Many thanks for your reply.

          I mentioned hetprobit to point to a situation (i.e. binary choice) where heteroskedasticity can be modelled so as to avoid biased estimators. With count data, then, it is different, since heteroskedasticity does not bias the estimated coefficients; that is exactly what I was wondering about.

          Regarding how my data were generated: in fact, there is no subpopulation for which the dependent variable is always zero. My dependent variable counts the number of times axioms of economic rationality were violated in observed consumption choices. If, for example, my observations were sampled again, it is not guaranteed that the same observations would take a value of zero on the dependent variable; it is plausible, but it does not have to be that way.



          • #6
            Dear Gaston Fernandez,

            Unless you want to estimate probabilities such as the probability that no axiom is violated, you do not have to worry about heteroskedasticity or things like that, and can safely use Poisson with robust standard errors at least as a benchmark.
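            For example (just a sketch, using the variable names from your earlier post):
            Code:
            * Poisson regression with robust (sandwich) standard errors;
            * consistent for the conditional mean even if the data are not Poisson
            poisson vgarp crtstd $X, vce(robust)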

            Best wishes,

            Joao



            • #7
              Dear Joao Santos Silva,

              In fact, I am interested in computing the marginal effects of one of my covariates on the probability of observing zero violations of the axioms (or, alternatively, at least one violation). Do you suggest that computing these marginal effects would require more complicated corrections for the presence of heteroskedasticity?

              Thanks,
              Gastón



              • #8
                Dear Gaston Fernandez,

                If you care about that probability, things are more complex, as you will need to get the distribution right. There is a large variety of distributions for count data that you can try, but the Poisson and the different flavours of the negative binomial are a good place to start. The other thing you may want to consider is to look at zero versus a positive number of violations; for that you can use all the binary choice models. In any case, ZI and hurdle models do not look particularly suitable in your context.
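                For instance, along these lines (only a sketch with the variable names from your posts; it assumes the pr(0) prediction is available after the count model you settle on):
                Code:
                * (a) marginal effect of crtstd on Pr(vgarp = 0) after a count model
                nbreg vgarp crtstd $X
                margins, dydx(crtstd) predict(pr(0))
                
                * (b) zero versus a positive number of violations with a binary choice model
                capture drop any_v
                gen byte any_v = vgarp > 0 if !missing(vgarp)
                probit any_v crtstd $X
                margins, dydx(crtstd)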

                Best wishes,

                Joao



                • #9
                  Dear Joao Santos Silva,

                  I see. Many thanks for your comments.
