
  • References about heteroskedasticity in Zero-Inflated Negative Binomial models

    Hello!

    I am trying to set up a zero-inflated negative binomial model using count data. I am not sure whether the presence of heteroskedasticity in such data would bias my estimation, making the estimated coefficients inconsistent. The traditional econometrics textbooks I usually rely on (e.g. Wooldridge, Cameron and Trivedi, Hayashi, etc.) do not say much about count data models and potential heteroskedasticity issues.

    I would be pleased if someone could provide me with some useful references about this.

    Many thanks!
    Gastón


  • #2
    Dear Gaston Fernandez,

    Typically, there is heteroskedasticity in regressions where the dependent variable is a count, so that is taken as a given. Now, I do not know what you are trying to do, but are you sure you need a zero inflated model? Those models are often misused and, additionally, the ZINB likelihood can have multiple maxima and Stata often converges to a local maximum. So, you should be very careful if you really want to use a ZINB model.

    Best wishes,

    Joao



    • #3
      Dear Joao Santos Silva,

      Thanks for your answer. A few comments on your reply.

      there is heteroskedasticity in regressions where the dependent variable is a count, so that is taken as a given
      Does heteroskedasticity bias the estimators, so that it is necessary to model it (e.g. as the hetprobit command does for binary outcomes), or is it sufficient to use heteroskedasticity-robust standard errors?

      Those models are often misused
      What would be the main reason why these models are often misused?

      the ZINB likelihood can have multiple maxima and Stata often converges to a local maximum
      Could this be addressed by setting the starting values for the likelihood maximization, for example?
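      For instance, something along the following lines (just a sketch of what I have in mind; I am assuming that zinb accepts the usual maximize options such as from() and difficult):
      Code:
      * fit the model once and store the converged parameter vector
      zinb vgarp crtstd $X, inflate(crtstd $X)
      matrix b1 = e(b)
      
      * perturb the converged values and re-fit from there; returning to the
      * same log likelihood is reassuring (though not proof of a global maximum)
      matrix b2 = b1*1.5
      zinb vgarp crtstd $X, inflate(crtstd $X) from(b2, copy) difficult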

      Moreover, what I am trying to do is to set up a model for the following dependent variable, which counts the number of times an event occurs in my data:
      Code:
      tab vgarp, m
      
            vgarp |      Freq.     Percent        Cum.
      ------------+-----------------------------------
                0 |        111       53.88       53.88
                1 |          5        2.43       56.31
                2 |         31       15.05       71.36
                3 |         11        5.34       76.70
                4 |          7        3.40       80.10
                5 |         18        8.74       88.83
                6 |          8        3.88       92.72
                7 |          3        1.46       94.17
                8 |          2        0.97       95.15
                9 |          5        2.43       97.57
               10 |          2        0.97       98.54
               11 |          3        1.46      100.00
      ------------+-----------------------------------
            Total |        206      100.00
      
      tabstat vgarp, s(mean var)
      
          variable |      mean  variance
      -------------+--------------------
             vgarp |  1.946602  7.514208
      ----------------------------------
      Since 54% of the observations are equal to zero, I was fitting the following model:
      Code:
      zinb vgarp crtstd $X, inflate(crtstd $X)
      
      Fitting constant-only model:
      
      Iteration 0:   log likelihood = -348.89538  (not concave)
      Iteration 1:   log likelihood = -329.47975  (not concave)
      Iteration 2:   log likelihood = -318.65738  
      Iteration 3:   log likelihood =  -314.4651  
      Iteration 4:   log likelihood = -314.30746  
      Iteration 5:   log likelihood = -314.28332  
      Iteration 6:   log likelihood = -314.27921  
      Iteration 7:   log likelihood = -314.27827  
      Iteration 8:   log likelihood = -314.27804  
      Iteration 9:   log likelihood =   -314.278  
      Iteration 10:  log likelihood = -314.27799  
      
      Fitting full model:
      
      Iteration 0:   log likelihood = -314.27799  
      Iteration 1:   log likelihood = -311.15034  
      Iteration 2:   log likelihood =  -310.5663  
      Iteration 3:   log likelihood =  -310.3594  
      Iteration 4:   log likelihood = -310.35725  
      Iteration 5:   log likelihood = -310.35725  
      
      Zero-inflated negative binomial regression      Number of obs     =        195
                                                      Nonzero obs       =         87
                                                      Zero obs          =        108
      
      Inflation model = logit                         LR chi2(9)        =       7.84
      Log likelihood  = -310.3573                     Prob > chi2       =     0.5502
      
      ------------------------------------------------------------------------------
             vgarp |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
      vgarp        |
            crtstd |   .0616272   .0794811     0.78   0.438    -.0941529    .2174073
         age_years |   -.006631   .0256227    -0.26   0.796    -.0568506    .0435885
              age2 |   .0000702   .0003347     0.21   0.834    -.0005858    .0007261
                   |
               fem |
                0  |          0  (base)
                1  |   .1822168   .1687236     1.08   0.280    -.1484754     .512909
                   |
            region |
              AME  |          0  (base)
              EUR  |  -.3997596   .1946401    -2.05   0.040    -.7812471    -.018272
              ASI  |   .3559845   .3460471     1.03   0.304    -.3222554    1.034224
                   |
             peduc |  -.0347125   .0526516    -0.66   0.510    -.1379078    .0684828
                   |
             math2 |
                1  |          0  (base)
                2  |  -.1601537   .1960938    -0.82   0.414    -.5444904    .2241831
                3  |      .0119   .2367397     0.05   0.960    -.4521012    .4759012
                   |
             _cons |   1.803591   .5207295     3.46   0.001     .7829802    2.824202
      -------------+----------------------------------------------------------------
      inflate      |
            crtstd |   .2811264   .1735864     1.62   0.105    -.0590967    .6213496
         age_years |   .1349144   .0720917     1.87   0.061    -.0063827    .2762116
              age2 |  -.0015847   .0009182    -1.73   0.084    -.0033843    .0002148
                   |
               fem |
                0  |          0  (base)
                1  |  -.4768687   .3538094    -1.35   0.178    -1.170322     .216585
                   |
            region |
              AME  |          0  (base)
              EUR  |   .2909378    .402207     0.72   0.469    -.4973735    1.079249
              ASI  |    -20.225   13124.36    -0.00   0.999    -25743.49    25703.04
                   |
             peduc |   .0961251   .1214731     0.79   0.429    -.1419578     .334208
                   |
             math2 |
                1  |          0  (base)
                2  |   .4459254     .48198     0.93   0.355     -.498738    1.390589
                3  |    .388145   .5467879     0.71   0.478    -.6835396     1.45983
                   |
             _cons |  -3.156989    1.53912    -2.05   0.040    -6.173609   -.1403683
      -------------+----------------------------------------------------------------
          /lnalpha |  -2.059337   .5275167    -3.90   0.000    -3.093251   -1.025423
      -------------+----------------------------------------------------------------
             alpha |   .1275385   .0672787                      .0453543    .3586446
      ------------------------------------------------------------------------------
      Any advice on this? Someone also recommended that I look at hurdle models.
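      If it helps to make that part of the question concrete, my understanding is that a hurdle alternative would look roughly like the sketch below (a logit for any violation versus none, plus a zero-truncated negative binomial for the positive counts; I am assuming tnbreg with the ll(0) option is the right tool for the truncated part):
      Code:
      * indicator for observing at least one violation
      gen byte any_v = vgarp > 0 if !missing(vgarp)
      
      * hurdle part 1: any violation vs. none
      logit any_v crtstd $X
      
      * hurdle part 2: positive counts, modelled as truncated at zero
      tnbreg vgarp crtstd $X if vgarp > 0, ll(0)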

      Thanks!

      Gastón



      • #4
        Dear Gaston Fernandez,

        1 - It is not true that with count data heteroskedasticity biases the estimators; the hetprobit you mention is for binary data.

        2 - The fact that 54% of observations are zero tells us nothing about the suitability of a ZI model. ZI models are often misused because people tend to use them just because there are many zeros in the sample, and that is wrong. ZI models should be used when there is a subpopulation for which the dependent variable is always zero.

        3 - To know whether to use a ZI model, a hurdle model, or a "plain" model, you need to think about how your data are generated and collected, and model that process. Descriptive statistics of the data tell us nothing about which is the best approach.

        Best wishes,

        Joao



        • #5
          Dear Joao Santos Silva,

          Many thanks for your reply.

          I mentioned hetprobit to point to a situation (i.e. binary choice) where heteroskedasticity can be modelled so as to avoid biased estimators. With count data, then, it is different, since heteroskedasticity does not bias the estimated coefficients; that is exactly what I was wondering about.

          Regarding how my data were generated: in fact, there is no subpopulation for which the dependent variable is always zero. My dependent variable counts the number of times axioms of economic rationality were violated in observed consumption choices. If, for example, my observations were sampled again, it is not guaranteed that the same observations would take a value of zero on the dependent variable; it is plausible, but it does not have to be that way.



          • #6
            Dear Gaston Fernandez,

            Unless you want to estimate probabilities such as the probability that no axiom is violated, you do not have to worry about heteroskedasticity or things like that, and can safely use Poisson with robust standard errors at least as a benchmark.
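            For example (just a sketch, using the variable names from your earlier post):
            Code:
            * Poisson regression with robust (sandwich) standard errors;
            * consistent for the conditional mean even if the data are not Poisson
            poisson vgarp crtstd $X, vce(robust)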

            Best wishes,

            Joao



            • #7
              Dear Joao Santos Silva,

              In fact, I am interested in computing the marginal effects of one of my covariates on the probability of observing zero violations of the axioms (or, alternatively, at least one violation). Do you suggest that computing these marginal effects would require more complicated corrections for the presence of heteroskedasticity?

              Thanks,
              Gastón



              • #8
                Dear Gaston Fernandez,

                If you care about that probability, things are more complex, as you will need to get the distribution right. There is a large variety of distributions for count data that you can try, but the Poisson and the different flavours of the negative binomial are a good place to start. The other thing you may want to consider is to look at zero versus a positive number of violations; for that you can use all the binary choice models. In any case, ZI and hurdle models do not look particularly suitable in your context.
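                For instance, along these lines (only a sketch with the variable names from your posts; it assumes the pr(0) prediction is available after the count model you settle on):
                Code:
                * (a) marginal effect of crtstd on Pr(vgarp = 0) after a count model
                nbreg vgarp crtstd $X
                margins, dydx(crtstd) predict(pr(0))
                
                * (b) zero versus a positive number of violations with a binary choice model
                capture drop any_v
                gen byte any_v = vgarp > 0 if !missing(vgarp)
                probit any_v crtstd $X
                margins, dydx(crtstd)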

                Best wishes,

                Joao



                • #9
                  Dear Joao Santos Silva,

                  I see. Many thanks for your comments.
