  • How to use and interpret Zero Inflated Poisson

    I am working on an academic research project that analyzes the influence of precipitation on the occurrence of traffic accidents. I'm using Poisson regression because it suits count data. I have data on municipalities in the state of Minas Gerais, Brazil, and the number of accidents that occurred in these municipalities from 2010 to 2015. It turns out that the dependent variable, "accidents", has many zeros, which causes a subdispersion (the data exhibit less variation than expected) in the chart. I thought of using Zero Inflated Poisson to eliminate the zeros, but I do not know how to use it. What should I select in Stata?

    [Attached screenshots: STATA 01.jpg, STATA 02.jpg]

  • #2
    First, I would not use the graphical user interface for anything but playing around with Stata or for a "quick and dirty" one-off to get a rough sense of some result that you do not plan to share with anyone else. For any analysis you intend for others to see, you need to document your code in a do file and the output in a log file.

    The zero-inflated Poisson command estimates a model in which the distribution of the outcome is a two-component mixture. One component is a distribution that is all zero. The other component is a Poisson distribution. The command estimates the rate parameter of the Poisson distribution (or coefficients of a linear expression which give the rate parameter) and also the proportion of the mixture that comes from each component. The syntax of the -zip- command itself reflects this.

    Code:
    zip numacc precipitacao tempo temposq, inflate(...) // AND PERHAPS OTHER OPTIONS SUCH AS exposure(), offset(), vce(), etc.
    The -inflate()- option specifies the variables that predict the probability of an observation being in the all-zero component of the mixture. That is, you can model the probability of an observation's being an all-zero observation in terms of other variables in the model. You list those variables in the -inflate()- option and Stata will estimate the probability of being in the all-zero component for each observation by fitting a logistic regression model. (There is also an option to use a probit model instead.) So, for example, if you think that precipitacao affects the probability of being in the all-zero component of the distribution you would code
    Code:
    zip numacc precipitacao tempo temposq, inflate(precipitacao)
    If you don't have any variables that you think predict that, you can just specify -inflate(_cons)-, and then Stata will calculate that probability on the assumption that it is the same for all observations.
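
    Written out, the two-component mixture described above can be sketched as follows. The symbols are notation introduced here for illustration, not Stata's: \pi_i is the probability that observation i belongs to the all-zero component (modeled through -inflate()-, with a logit link by default), and \lambda_i is the Poisson rate (modeled through the main equation with a log link).

    ```latex
    \begin{aligned}
    \Pr(Y_i = 0) &= \pi_i + (1 - \pi_i)\, e^{-\lambda_i}, \\
    \Pr(Y_i = k) &= (1 - \pi_i)\, \frac{e^{-\lambda_i} \lambda_i^{k}}{k!}, \qquad k = 1, 2, \dots, \\
    \lambda_i &= \exp(x_i'\beta), \qquad \pi_i = \frac{\exp(z_i'\gamma)}{1 + \exp(z_i'\gamma)}, \\
    \mathrm{E}(Y_i) &= (1 - \pi_i)\, \lambda_i .
    \end{aligned}
    ```

    With -inflate(_cons)-, z_i contains only a constant, so \pi_i is the same for every observation.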

    Do read the chapter on -zip- in the [R] section of your on-line manual for more information.
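
    As a complement to the manual, a small simulation may help show what -inflate(_cons)- estimates. This is only a sketch on made-up data: the variable names (precip, always0, y), the seed, and the true values (30% structural zeros, a rate of exp(0.5 + 0.3*precip)) are all assumptions chosen for illustration, not anything from the original data set.

    ```stata
    * Simulate a zero-inflated Poisson outcome (all values hypothetical)
    clear
    set seed 12345
    set obs 2000
    generate precip = rnormal()
    * each observation is a structural zero with probability 0.3
    generate byte always0 = runiform() < 0.3
    * otherwise the count is Poisson with a log-linear rate
    generate y = cond(always0, 0, rpoisson(exp(0.5 + 0.3*precip)))

    * constant-only inflation equation: one mixing probability for everyone
    zip y precip, inflate(_cons)

    * the inflate intercept is on the logit scale; convert it to a probability
    display invlogit(_b[inflate:_cons])

    * per-observation probability of belonging to the all-zero component
    predict pzero, pr
    summarize pzero
    ```

    The displayed probability should land near the 0.3 used in the simulation, and pzero is constant across observations because the inflation equation contains only a constant.
    
    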

    Finally, if temposq is the square of the variable tempo, then you should use factor variable notation in your command instead of calculating a squared variable:
    Code:
    zip numacc precipitacao c.tempo##c.tempo, inflate(precipitacao)
    That way, after estimating your zip model you can use the -margins- command to easily and correctly calculate adjusted mean outcomes and marginal effects. See -help fvvarlist- and the associated chapter in the on-line manual.
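
    A minimal sketch of that workflow, run on simulated data so that it is self-contained. The variable names follow the thread, but the data-generating values, the seed, and the grid of tempo values passed to -at()- are assumptions made up for illustration.

    ```stata
    * Simulated stand-in for the real data (all values hypothetical)
    clear
    set seed 2017
    set obs 3000
    generate precipitacao = rgamma(2, 50)
    generate tempo = runiform(0, 10)
    generate numacc = rpoisson(exp(0.2 + 0.002*precipitacao + 0.3*tempo - 0.02*tempo^2))
    * add structural zeros whose probability depends on precipitacao
    replace numacc = 0 if runiform() < invlogit(-1 + 0.003*precipitacao)

    * quadratic in tempo via factor-variable notation
    zip numacc precipitacao c.tempo##c.tempo, inflate(precipitacao)

    * average marginal effect of tempo on the expected count;
    * margins knows that tempo enters both linearly and squared
    margins, dydx(tempo)

    * adjusted mean counts over a grid of tempo values, then a plot
    margins, at(tempo = (0(2)10))
    marginsplot
    ```

    With a hand-made temposq variable, -margins- would treat tempo and temposq as unrelated and the marginal effect of tempo would come out wrong.
    
    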

    In the future, please do not post screen shots to show information (except perhaps a graph--but those are best done as .png attachments). They are difficult to work with. For example, it isn't possible to copy and paste your model variables.
    Last edited by Clyde Schechter; 15 Jan 2017, 18:06.



    • #3
      Hello Jessica,

      After Clyde's insightful remarks, I just wish to add a comment, based on a sentence of yours:

      It turns out that the dependent variable being "accidents" occurs that many zeros appear causing a subdispersion (the data exhibit less Variation than expected) in the chart. I thought of using Zero Inflated Poisson to eliminate the zeros. But I do not know how to use. What should I select in Stata?
      If by "subdispersion" you mean "underdispersion", I believe you also should think about using a generalized Poisson model. Specifically, a zero-inflated generalized Poisson model. There is the ado-file "zigp" written by Hilbe, as demonstrated in the book Modeling Count Data (http://www.stata.com/bookstore/modeling-count-data/).

      Best,

      Marcos



      • #4
        Jessica:
        I share all the previous skillful remarks.
        Just an aside: looking at one of your screenshots (which are strongly discouraged on this forum, as per the FAQ and Clyde's wise comment), it seems that you have both a linear and a squared predictor for -time- (-tempo- and -temposq-; I easily spotted them for the trivial reason that in Italian the same word means -time- and, unfortunately for translation into languages that distinguish the two, also means -weather-).
        Linguistic issues aside, creating two predictors the way you did is highly inefficient (and potentially misleading, as Stata cannot recognize that -temposq- is the square of -tempo-). Factor-variable notation (see -help fvvarlist-) can do that job for you, letting you exploit the capabilities of two other wonderful built-in Stata commands, -margins- and -marginsplot-.
        Last edited by Carlo Lazzaro; 16 Jan 2017, 09:22.
        Kind regards,
        Carlo
        (Stata 18.0 SE)



        • #5
          A few comments.

          1. Given that you have panel data, you may want to account for unobserved heterogeneity. While weather variables might be assumed exogenous with respect to the heterogeneity, that is not always obvious. Maybe there are features of traffic safety that vary by municipality that also happen to be correlated with weather.

          2. If you want to account for heterogeneity, the fixed effects Poisson estimator is head and shoulders above anything else. It is fully robust to violation of the Poisson assumption as well as unmodeled serial correlation in accidents within a municipality. If you are interested in effects on the mean -- and, in my experience, this describes 90% of count applications -- then you don't care whether the Poisson distribution is correctly specified.

          3. Finding that there are municipalities with few traffic accidents across the years is not proof of the relevant underdispersion. If you allow for municipality "fixed effects," then some municipalities may have Poisson distributions with very small means. You cannot learn this by looking at the raw data. In any case, as stated above, to use Poisson FE does not require the Poisson distribution be true.

          4. If you have a sufficient number of municipalities you can also include year dummies to account for secular trends in accidents and weather. It's unlikely this will matter much for the estimates, but it could actually help reduce standard errors.

          5. Using a zero-inflated Poisson and ignoring the panel structure is less appealing, I think, than using FE Poisson and recognizing the panel structure.

          As a general comment, applied researchers seem too eager to latch onto something more complicated than they need. The FE Poisson estimator is to count outcomes as the linear FE estimator is to continuous outcomes. At a minimum, one can try both. But the ZIP estimates will be much more difficult to interpret because you need to turn the estimates into marginal effects.
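
          For readers who want to try the FE Poisson route, here is a minimal sketch on simulated panel data. The identifiers municipio and ano, the seed, and all data-generating values are hypothetical; the point is only to show the shape of the commands (-xtset-, then -xtpoisson- with the fe and vce(robust) options, plus year dummies as in point 4).

          ```stata
          * Simulated panel: 200 municipalities observed 2010-2015 (all values hypothetical)
          clear
          set seed 42
          set obs 200
          generate municipio = _n
          generate alpha = rnormal(0, 1)            // unobserved municipality effect
          expand 6                                   // six yearly rows per municipality
          bysort municipio: generate ano = 2009 + _n
          * precipitation is deliberately correlated with the unobserved effect
          generate precipitacao = rnormal(100, 30) + 10*alpha
          generate numacc = rpoisson(exp(-1 + alpha + 0.005*precipitacao))

          * declare the panel structure
          xtset municipio ano

          * fixed effects Poisson with year dummies and robust standard errors
          xtpoisson numacc precipitacao i.ano, fe vce(robust)
          ```

          Note that -xtpoisson, fe- drops municipalities whose counts are zero in every year, since they carry no information about the coefficients in the conditional likelihood.
          
          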



          • #6
            First of all, I would like to thank you for the great explanations and suggestions. Secondly, I'd like to apologize for the screenshots. I'm totally new to the forum and did not know about that rule. I do not have much knowledge of statistics; I am studying economics and have not yet studied econometrics (where we see Poisson regression and panel data). I am having great difficulty with my research, so I will take all the comments to my advisor, who has even taught me some basic Stata functions. Using Zero-Inflated Poisson was a suggestion from my advisor, but we are also thinking about fixed effects; I wanted to see which best suits my dataset. In any case, I would like to thank everyone again and apologize if the explanation of my work has been vague or a bit confusing. I am not fluent in English and had to write this post with the help of an online translator.

            Regards,
            Jessica.



            • #7
              In reply to Jeff Wooldridge (#5):
              Adding to what Jeff said. Two of my statistics professors advised me that if you are using a Poisson model, then you need to worry much more about overdispersion than underdispersion. If your data are under-dispersed, then your standard errors get over-estimated. They are biased conservatively. That is not a terrible problem to have.

              I've also been advised that you want to think about why you are trying to fit a zero-inflated Poisson model. As Clyde noted, a ZIP model is most suitable for a situation where there are two discrete populations. In your case, maybe one subgroup of municipalities had either systematically not reported the number of accidents or had almost no motor vehicles, and so they more or less could not have had any (reported) accidents, and the other set of towns is more "normal" and is a reasonable fit for a Poisson model. If your situation is like that, then maybe you do want to go and read up more on what a zero-inflated model does. If not, then maybe it's better not to worry. And, of course, if you have repeated measures, it is probably more important for the model to account for the repeated measurements (i.e. use xtpoisson).
              Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

              When presenting code or results, please use the code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.



              • #8
                In reply to Weiwen Ng (#7):
                It is essentially trivial to make the standard errors robust to overdispersion, underdispersion, or neither, so I'm not sure what the issue is with Poisson regression. We know a lot about the good properties of Poisson regression now. It's too bad that in both statistics and econometrics it's still being covered in a somewhat misleading way.

                If the goal is to estimate probabilities of different outcomes, rather than the effect on the mean outcome, then the Poisson distribution is deficient. But, in my experience, the focus is heavily on the mean, even in more complicated models. So, if the mean is of interest, Poisson regression with robust inference makes a lot of sense.
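
                A small sketch of what robust inference looks like in practice, using simulated counts that are deliberately underdispersed. Everything here is an assumption for illustration: the variable names, the seed, and the trick of drawing from a binomial so that the mean is exp(0.5 + x) while the variance is smaller than the mean.

                ```stata
                * Underdispersed counts with a log-linear mean (all values hypothetical)
                clear
                set seed 99
                set obs 1000
                generate x = runiform()
                generate mu = exp(0.5 + x)
                * binomial(8, mu/8) has mean mu and variance mu*(1 - mu/8) < mu
                generate y = rbinomial(8, mu/8)

                * default (Poisson-based) standard errors
                poisson y x
                estimates store m_default

                * same point estimates, standard errors robust to over- or underdispersion
                poisson y x, vce(robust)
                estimates store m_robust

                * side-by-side coefficients and standard errors
                estimates table m_default m_robust, se
                ```

                The coefficients are identical across the two columns; only the standard errors differ, and with underdispersed data like these the robust ones should come out smaller. With panel data, vce(cluster panelvar) in place of vce(robust) would additionally allow for serial correlation within a unit, where panelvar stands for whatever variable identifies the municipality.
                
                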



                • #9
                  In reply to Jeff Wooldridge (#5):

                  Dear Jeff,

                  Thank you for posting this, it helps a lot. Perhaps out of habit, reviewers often ask for something other than FE Poisson regression even though, as you explained, in many situations FE Poisson is a better fit. Forgive my ignorance, but is there a paper of yours that can be cited to support what you said here? This would help a lot with reviewers.



                  • #10
                    Hi Michael. Glad you find it helpful. I discuss all of these things in my 1999 Journal of Econometrics paper, "Distribution-Free Estimation of Some Nonlinear Panel Data Models." I also discuss this in Chapter 18 of my 2010 MIT Press book.
