  • Handling large percentage of zero-valued observations in the dependent variable in a panel dataset

    I am writing a paper using a panel dataset in which my dependent variable has a large percentage of zero-valued observations. Those zeros are real zeros; they are not missing data. I have looked at the literature, and many models can be applied in this case. I am aware of the following: Tobit (Tobin, 1958), Two-Stage Model (Heckman, 1979), Two-Part Model (Duan et al., 1984), PPML (Silva & Tenreyro, 2006) and Double-Hurdle models (Dong & Kaiser, 2008). Which one should I use, and how do I justify the choice?

  • #2
    Bruno:
    welcome to this forum.
    As usual, the substantive issue is what a real zero means. As per your description, those zeros did not replace any missing values (and this is good to know). However, it may well be that the patient did not need any visit during the time span covered by your panel dataset (first type of real zero); but it may also be that the patient badly needed a physician's assistance but could not get it because she/he had not signed up with a physician beforehand (second type of real zero).
    Obviously, those real zeros mean different things, and that should be addressed in your analysis.
    The Stata textbook: https://www.stata.com/bookstore/regr...ent-variables/ offers an example of a hurdle model that might be useful to read.
    You might also be interested in: https://www.statalist.org/forums/for...ing-with-zeros.
    Last but not least, I warmly recommend reading John Mullahy's "Specification and Testing of Some Modified Count Data Models." Journal of Econometrics 1986; 33:341-65.
    Kind regards,
    Carlo
    (Stata 19.0)

    • #3
      Carlo,
      Thank you for your answer. However, I did not explain that my dependent variable (DV) is in fact a continuous variable, not a count or categorical one. Do you have references on continuous dependent variables with many real zeros? I know PPML can be used, as this paper does (http://dx.doi.org/10.1016/j.enpol.2013.07.072), but I am looking for references that support my choice of model. This paper (http://dx.doi.org/10.1016/j.econlet.2011.05.008) states that PPML performs better than some other models when the DV has many zeros. However, it did not compare PPML to a panel double-hurdle model, for example. How should I base my decision?

      • #4
        Bruno:
        you may also be interested in https://blog.stata.com/2011/08/22/us...tell-a-friend/.
        The issue, however, is what those real zeros actually mean.
        The usual advice is to follow the approach that is in line with what others did/do in your research field.
        Kind regards,
        Carlo
        (Stata 19.0)

        • #5
          With panel data, if the response is nonnegative and takes a lot of zeros, the fixed effects Poisson estimator is often the most convincing. The outcome need not be a count. Because of the multiplicative heterogeneity, it accommodates units with lots of zeros as well as those with large outcomes. Its properties are discussed in my 1999 Journal of Econometrics paper, “Distribution-free estimation of some nonlinear panel data models.”
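
As a rough illustration of the point about units with many zeros, here is a minimal Python sketch (not taken from the paper cited above) of the fixed effects Poisson estimator with a single regressor. It assumes the conditional mean E[y | x, c] = c * exp(beta * x) and takes Newton steps on the conditional (multinomial) quasi-log-likelihood, in which the unit effect c cancels. The function names, the simulated data, and all parameter values are hypothetical, and no standard errors are computed. In Stata the corresponding command is xtpoisson with the fe option.

```python
import math
import random


def fe_poisson_slope(panels, beta=0.0, tol=1e-10, max_iter=200):
    """Estimate one slope by Newton steps on the conditional log-likelihood.

    panels: list of units, each a list of (y, x) pairs; y >= 0, any real value.
    """
    for _ in range(max_iter):
        score = info = 0.0
        for unit in panels:
            total_y = sum(y for y, _ in unit)
            if total_y == 0:
                continue  # an all-zero unit contributes nothing to the score
            # p_t = exp(x_t*beta) / sum_s exp(x_s*beta): the unit effect cancels.
            shift = max(x * beta for _, x in unit)
            weights = [math.exp(x * beta - shift) for _, x in unit]
            denom = sum(weights)
            x_bar = sum(w * x for w, (_, x) in zip(weights, unit)) / denom
            x_var = sum(w * (x - x_bar) ** 2 for w, (_, x) in zip(weights, unit)) / denom
            score += sum(y * (x - x_bar) for y, x in unit)
            info += total_y * x_var
        if info == 0.0:
            raise ValueError("no within-unit variation in x among units with y > 0")
        step = max(-1.0, min(1.0, score / info))  # crude guard against overshooting
        beta += step
        if abs(step) < tol:
            break
    return beta


def poisson_draw(rng, mean):
    """Knuth's multiplication method; adequate for small means."""
    limit, count, prod = math.exp(-mean), 0, rng.random()
    while prod > limit:
        count += 1
        prod *= rng.random()
    return count


def simulate(rng, n_units=1000, n_periods=6, beta=0.5):
    """Continuous y >= 0 with many zeros and E[y | x, c] = c * exp(beta * x)."""
    panels = []
    for _ in range(n_units):
        c = rng.choice([0.05, 2.0])  # low-c units are zero in most periods
        unit = []
        for _ in range(n_periods):
            x = rng.gauss(0.0, 1.0)
            # A Poisson count times a mean-one positive shock keeps the mean intact.
            y = poisson_draw(rng, c * math.exp(beta * x)) * rng.expovariate(1.0)
            unit.append((y, x))
        panels.append(unit)
    return panels


rng = random.Random(12345)
panels = simulate(rng)
n_obs = sum(len(unit) for unit in panels)
zero_share = sum(y == 0 for unit in panels for y, _ in unit) / n_obs
print(f"share of zeros: {zero_share:.2f}")
print(f"estimated slope (true value 0.5): {fe_poisson_slope(panels):.3f}")
```

Units whose outcome is zero in every period drop out of the score, so adding or removing them leaves the estimate unchanged, while the simulated outcome is continuous rather than a count.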

          • #6
            Dear Bruno Moreno,

            Trying to add to the great advice you already received, I suggest that you think whether there are separate processes leading to zeros and the positive observations, and whether you care about separately identifying the effect of the regressors on each part. In case you think there are two processes and you care about each of them, you may have to use an approach that takes that into account (this happens a lot in health economics); otherwise PPML with fixed effects is indeed probably the best option as it is very robust and likely to at least be a good benchmark. Of course, there are also formal statistical tests you can use to decide, but that may not be necessary.

            Best wishes,

            Joao

            • #7
              Thank you all for the answers.

              Dear Joao Santos Silva ,

              My area is renewable energy (RE) production technology adoption by households. Yes, I recognize that in my field there are two processes. The first is whether a household is an adopter or not, and the second is the intensity of adoption, that is, the size of the RE production technology system. So which approach do you think I should use? A panel double-hurdle model? I am not familiar with the health economics literature. Do you happen to know a paper applying a model to a panel dataset with a continuous, zero-inflated DV?

              Thank you in advance.

              • #8
                Cross-posted at https://stats.stackexchange.com/ques...uous-dependent

                Please note our policy on cross-posting, which is that you should tell us about it.

                • #9
                  Dear Bruno Moreno,

                  It is not clear to me that adoption and participation are different processes.

                  In health economics, zero demand is an individual decision, but positive demand involves the health provider. Here there are clearly two processes with two different sets of actors. Likewise, the decision to become a parent is very different from the decision to have additional children, and the decision to join the labour force may be different from the decision on how much to work. These are examples where two-part models are worth considering as alternatives to PPML.

                  The case you describe does not sound very different from what we see in international trade, where the decisions to export and how much to export are often seen as the same process. That is, the firm decides how much to export and one of the possible outcomes is to export zero (a corner-solution).

                  About papers doing related things in health economics, like Carlo above I also recommend that you look at papers by John Mullahy.

                  Best wishes,

                  Joao

                  • #10
                    Dear Joao Santos Silva,

                    Thank you for your answer.

                    Actually, there is a lack of literature on that. I haven't found authors who did an empirical study on residential renewable energy adoption using a panel data approach in which the target market was an emerging one. That is also why it is a zero-inflated case.

                    I might be wrong, but I am inclined to think that it is a two-step decision. First, the household will decide to adopt a renewable energy technology only if the technology has reached grid parity (the point where the cost of consuming electricity from the grid and the cost of self-consumption are the same). This decision does not depend, for example, on how much electricity the household consumes per month. The second step in the household's decision is the size (electricity production capacity) of the renewable energy system, and this does depend on the household's consumption. Choosing the size of the system is an optimization process in which the minimum-cost point will depend on the size of the system and the incentive scheme adopted by the authorities.

                    Cheers,

                    Bruno
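
The two-step description above can be explored descriptively before choosing a model. The Python sketch below uses made-up capacity figures and hypothetical group labels. It splits the mean of a nonnegative outcome into a participation part (share of adopters) and an intensity part (mean size among adopters), using E[y] = P(y > 0) * E[y | y > 0]. This is only an accounting identity, not an estimator of either process.

```python
def two_part_summary(values):
    """Return (share with y > 0, mean of y among those with y > 0)."""
    positives = [v for v in values if v > 0]
    share = len(positives) / len(values)
    intensity = sum(positives) / len(positives) if positives else 0.0
    return share, intensity


# Hypothetical installed capacity in kW per household; 0 means no system installed.
capacity_kw = {
    "before_grid_parity": [0, 0, 0, 0, 0, 0, 0, 0, 0, 3.0],
    "after_grid_parity": [0, 0, 2.5, 0, 4.0, 0, 0, 3.5, 0, 0],
}

for label, values in capacity_kw.items():
    share, intensity = two_part_summary(values)
    overall = sum(values) / len(values)
    print(f"{label}: adopters {share:.0%}, mean size if adopted {intensity:.2f} kW, "
          f"overall mean {overall:.2f} kW")
```

In this made-up example the adoption share triples between the two groups while the mean size among adopters changes little, which is the kind of pattern that would make the two margins worth modelling separately.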

                    • #11
                      Dear Nick Cox

                      I am sorry, I did not know. This question is also cross-posted at ResearchGate (https://www.researchgate.net/post/Ho...ata_regression).

                      Cheers,

                      Bruno

                      • #12
                        Best then to read the FAQ Advice, as we ask new posters to do, both on our home page and in every prompt on starting a new thread.

                        • #13
                          Originally posted by Jeff Wooldridge
                          With panel data, if the response is nonnegative and takes a lot of zeros, the fixed effects Poisson estimator is often the most convincing. The outcome need not be a count. Because of the multiplicative heterogeneity, it accommodates units with lots of zeros as well as those with large outcomes. Its properties are discussed in my 1999 Journal of Econometrics paper, “Distribution-free estimation of some nonlinear panel data models.”
                          Dear Sir
                          In my model, there are too many zeros in the dependent variable. The dependent variable is the number of days in a month that a woman works in a given activity. Therefore, if in a given month the individual doesn't work in a given activity, the dependent variable takes the value 0.
                          My data set is at individual level. These are monthly observations for 6 years.
                          In this case, I wanted to use a fixed effects Tobit model, but there is no Stata command for it.
                          Would the fixed effects Poisson estimator be a good idea for my model?
                          Kindly suggest.
                          Thank you
