
  • Heavy-tailed dependent variable - what to do?

    Hi,

    I am replicating a study in which the dependent variable is a count with a lot of zeroes. Since it is panel data, the author has used OLS-PCSE, without making any changes to the dependent variable. The variable is a count, measuring "number of peacekeepers". But most often the selected countries do not deploy peacekeepers, so there are many zeroes.

    In fact, 86.2 percent of the values of the dependent variable are zeroes. This makes it heavy-tailed, and it creates problems when choosing the right estimator. I do not believe an ordinary OLS-PCSE is the correct choice of model when the dependent variable is so heavy-tailed.

    I have considered different options:
    - Zero-inflated negative binomial regression (ZINBR). But the dependent variable cannot take negative values, only zero and positive values, and therefore it has a floor effect. If I have understood it correctly, ZINBR is not a good estimator when the variable has a floor effect.

    - Then, I was advised to transform the heavy-tailed dependent variable into an inverse hyperbolic sine. Would this be helpful? And when it is transformed, how do I best use my new transformed variable? Can/should I use an inverse hyperbolic sine in an OLS-PCSE, or is this for some reason not recommended?

    - I have also considered transforming the dependent variable into a dummy where it is either 0 (no peacekeepers deployed) or 1 (>0 peacekeepers deployed). This will not tell us how much the number of peacekeepers increases when an independent variable changes value, but it will tell us how the likelihood that a country sends at least one peacekeeper changes when the IV changes. Could this be helpful in any way?

    And, if it is possible to say, how do I know whether I have found the best model for my data-set?

    As these questions might reveal, I am quite new to statistics, and this replication is a part of my introduction class.

    If you have other suggestions on how to solve this problem, it is appreciated. (The independent variables in the data-set are either counts or dummies.)

    Thanks

  • #2
    asinh() in principle could be a link function for xtgee, I think. I've not seen that done. But in principle there is a tuning parameter which should be estimated rather than guessed.
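As a rough illustration of what such a transformation does, the sketch below applies a scaled inverse hyperbolic sine, asinh(theta*y)/theta. Reading the "tuning parameter" as this scale theta is an assumption, and the function name and example counts are made up for illustration; nothing here comes from the data set in #1.

```python
import math

def scaled_asinh(y: float, theta: float = 1.0) -> float:
    """Scaled inverse hyperbolic sine: asinh(theta * y) / theta.

    theta is a hypothetical scale ("tuning") parameter;
    theta = 1 gives the plain asinh transformation.
    """
    if theta <= 0:
        raise ValueError("theta must be positive")
    return math.asinh(theta * y) / theta

counts = [0, 1, 10, 100, 1000]  # made-up example counts

for theta in (0.1, 1.0, 10.0):
    transformed = [scaled_asinh(y, theta) for y in counts]
    print(f"theta={theta}:", [round(t, 3) for t in transformed])
```

Two things are visible from this: zero always maps to zero, so the mass of zeroes remains after the transformation, and the transformed values of the positive counts depend on theta, which is why guessing it is not innocuous.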



    • #3
      "But the dependent variable can not take negative values, only zero and positive values, and therefore it has a floor effect. If I have understood it correctly, ZINBR is not a good estimator when the variable has a floor effect."


      But you say your variable is a count of the number of peace-keepers deployed, which is inherently >= 0. So it seems to be quite appropriate as a DV for ZINBR. What am I missing here?



      • #4
        Vegard: Just because there are "many zeros" does not mean that standard count data models (Poisson, Neg-Bin) are inappropriate. E.g. a Poisson distribution with lambda=.15 will give Prob(y=0)=.86. There is nothing per se heavy-tailed about such a distribution.
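The arithmetic behind that figure can be checked directly. The sketch below assumes a plain Poisson distribution, for which Prob(y=0) = exp(-lambda); the function names are illustrative only, and 0.862 is the share of zeroes quoted in #1.

```python
import math

def poisson_zero_prob(lam: float) -> float:
    """P(y = 0) for a Poisson variable with mean lam."""
    return math.exp(-lam)

def poisson_mean_for_zero_share(share: float) -> float:
    """Mean of the Poisson distribution whose P(y = 0) equals share."""
    return -math.log(share)

p0 = poisson_zero_prob(0.15)
lam_862 = poisson_mean_for_zero_share(0.862)

print(f"P(y=0 | lambda=0.15)       = {p0:.4f}")
print(f"lambda giving 86.2% zeroes = {lam_862:.4f}")
```

This only shows that a large share of zeroes is compatible with a small Poisson mean; whether a Poisson or negative binomial model actually fits the data is a separate question.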

        So before appealing to zero-inflated models or to dichotomizing your dependent variable (and thus discarding information) or to transformations of your dependent variable, my recommendation would be to consider at least initially standard panel-count-data approaches like xtpoisson and xtnbreg. Or (as hinted by Nick Cox in #2) consider using xtgee with link(log) and some appropriate choice of family(...) (e.g. Poisson).

        As for determining "the best model for my data-set" that is an altogether different matter and really depends on what you mean by "best".

        Good luck on your project.



        • #5
          I will add that I agree completely with John Mullahy here--your distribution might very well fit a simple Poisson model, or negative binomial without zero inflation. It is worth exploring these options.

          My purpose in #3 was to make a different point, which, in retrospect, did not come across clearly, so I will try to restate it here.

          There are (at least) three reasons why the values of a variable in data take on only values greater than or equal to zero, and they have different implications.

          A. The nature of what the variable represents is inherently always greater than or equal to zero. If your variable represents a count of things, then it falls into this category: counts are necessarily >= 0.

          B. In nature the variable might in principle take on negative values, but when it does so we simply don't get to observe its actual value. We will know that the true value is < 0, but that is all. This is called left-censored data. Consider a situation where we are measuring the concentration of some analyte in blood specimens. The measurement apparatus usually has some lower limit below which it cannot detect the analyte. Conventionally these will be reported as zero, but the zero is just an encoding of "left-censored at the lowest value the apparatus can detect." A floor effect is similar to this, although sometimes with a floor effect we cannot distinguish a true zero from a negative masquerading as zero.

          C. In nature the variable can in principle take on negative values, but when there really is a negative value, that observation simply can't make it into our data set (either because we deliberately exclude such observations when we find them, or because our method of ascertainment is unable to find them in the first place). For example, if we are studying the number of children people have and we (foolishly) decide to do that by looking at birth records listing them as a parent, then people with no children will never appear in our data set because there will not be any birth records listing them as a parent. This is called truncated data (truncated at 1).

          Different statistical approaches are appropriate for these different situations. The point I wanted to make in #3 is that your situation is like A, not C.
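The difference between B and C can be made concrete with a small simulation. This is only a sketch on made-up data: the latent normal variable, its mean of 0.5, the seed, and the sample size are all assumptions chosen for illustration. Case A needs no such mechanism, because a count has no unobserved negative values to begin with.

```python
import random

random.seed(12345)  # arbitrary seed so the illustration is repeatable

# Hypothetical latent values that can be negative
# (for example, a true signal relative to a detection limit).
latent = [random.gauss(0.5, 1.0) for _ in range(10_000)]

# B. Left-censored at zero: every observation is kept,
#    but negative values are recorded as 0.
censored = [max(x, 0.0) for x in latent]

# C. Truncated at zero: observations with negative values
#    never enter the data set at all.
truncated = [x for x in latent if x >= 0.0]

def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

print("latent    mean:", round(mean(latent), 3), "n =", len(latent))
print("censored  mean:", round(mean(censored), 3), "n =", len(censored))
print("truncated mean:", round(mean(truncated), 3), "n =", len(truncated))
```

Censoring keeps the sample size but piles observations up at zero, while truncation shrinks the sample and removes the low values entirely; both push the naive mean above the latent mean, by different amounts.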

