  • Dealing with zeros

    I have lots of zeros in both my dependent and independent variables.

    One way that I was dealing with this is by adding 1 to all of the values. However, this makes each of the variables right-skewed. So I took the natural log to create a normal distribution. But when I do this I get a spike to the left followed by a normal distribution (see an example below). As I believe this violates the assumption of normal distribution, I tried dropping the zeros, which reduces the sample size too much, and then I don't get significance in my models. I read that I could impute the zero values with the mean, but I know that would misrepresent my data. I also read that I could take the square root instead of the log for transformation, but the data is still right-skewed rather than having a normal distribution. Any other thoughts on how I might deal with this issue would be much appreciated!
    Attached Files

  • #2
    Jessica:
    - it is not clear what you are planning to do with your data. If your goal is a regression, it would be helpful to know which type of regressand you are dealing with;
    - it is relevant to get more details about your zeros: during a given span of time, in some health care systems a given individual could record 0 visits because she does not need to see a doctor, or 0 visits because, although she needs to see a doctor, she cannot, for lack of valid insurance;
    - taking the natural log of 0 leaves you at square one, because the result is missing:
    Code:
    di ln(0)
    .
    ;
    - omitting zeros from your sample is actually making up your sample (and biasing your results);
    - hunting for statistically significant results is not that scientific: try to give a fair and true view of the data generating process, instead.
    Kind regards,
    Carlo
    (Stata 18.0 SE)


    • #3
      I agree with Carlo Lazzaro. Key is what zero means. A graph like yours makes me guess that zeros have a qualitatively different meaning from positive values. If that's so, adding 1 is usually a bad idea and taking logarithms of the result can't rescue a bad idea.

      In various fields, there are two-part models in which the first issue is whether someone does something and if so how much and why and if not then why not. So, non-smokers consume zero amounts and smokers consume positive amounts.

      Perhaps that's what you need, but only substantive context can help here. I don't recognise a standard problem for which Stata code suggests itself.
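As a rough illustration of the two-part idea described above, here is a minimal sketch fitted by hand on simulated data. Everything in it is an assumption made for illustration: the variable names (`x`, `any`, `y`), the simulated data-generating process, and the choice of a logit first part with a gamma GLM (log link) second part. Whether any of this suits a particular dataset depends on what the zeros mean.

```stata
* Simulated data, purely for illustration (all variable names are hypothetical)
clear
set seed 12345
set obs 1000
generate x = rnormal()

* Part 1 of the simulated process: whether y is positive at all
generate byte any = runiform() < invlogit(-0.5 + 0.8*x)

* Part 2 of the simulated process: how much, given a positive amount
generate y = cond(any, exp(1 + 0.5*x + rnormal(0, 0.5)), 0)

* First part: probability of a positive outcome
logit any x

* Second part: size of the outcome among the positives only
glm y x if y > 0, family(gamma) link(log) vce(robust)
```

The two equations are estimated separately, so the predictors and the functional form can differ between the "whether" and the "how much" parts.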


      • #4
        Jessica:
        you might be interested in taking a look (if you have access to the whole article) at John Mullahy's pivotal contribution in the field of heterogeneity in count data regression models: https://onlinelibrary.wiley.com/doi/...3E3.0.CO%3B2-G.
        Kind regards,
        Carlo
        (Stata 18.0 SE)


        • #5
          Jessica: In light of Carlo's comment in #2 and Nick's comment in #3, you might want to check out twopm.
          Code:
          ssc describe twopm
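For readers who have not used the command, a minimal sketch of what a call might look like follows. The data are simulated and the names `x` and `y` are hypothetical; the logit first part and the gamma GLM second part are one possible specification among several, not a recommendation for this dataset. Check `help twopm` after installing for the full set of options.

```stata
* twopm is user-written: install it from SSC
ssc install twopm, replace

* Simulated data, for illustration only (x and y are hypothetical names)
clear
set seed 4321
set obs 1000
generate x = rnormal()
generate y = cond(runiform() < invlogit(0.3 + 0.6*x), exp(1 + 0.4*x + rnormal(0, 0.5)), 0)

* Logit for whether y > 0, then a gamma GLM with log link for y given y > 0
twopm y x, firstpart(logit) secondpart(glm, family(gamma) link(log))

* Marginal effect of x on the predicted outcome, combining both parts
margins, dydx(x)
```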


          • #6
            Thank you everyone for your thoughts on this. To provide more context, the goal is to use this data in a fixed effects model. I have the zero issue in both my DV and IVs. My DV is production data. So if they have a zero it means they don't produce anything or didn't report. My independent variables are various financials. So in this case, $0 would mean they didn't spend or generate any money in a particular area.

            For my DV, I did confirm that I have selection bias. I separated my sample into those that produce (y>0) and those that don't produce (y=0), and found a significant difference. I then tried to fit a Heckman two-step model, but the model won't converge. As I'm new to Heckman models, I'm not sure what the problem might be or whether there are other two-step models that I could consider.


            • #7
              Jessica:
              some comments about your last post:
              - DV: disentangling potentially missing zeros (those companies that do not report) from zeros due to lack of production is probably impossible (but obviously those zeros have different substantive meanings). It may also be that the difference between those with y=0 vs y>0 is due to missing values (but who knows?);
              - IV: my last challenges with corporate finance date back to 30 years ago. However, I would check, if feasible, whether those zeros are, again, genuine or placeholders for missing values;
              - fixed effect regression: as you know, the -fe- estimator wipes out time-invariant predictors (like industry): have you already checked whether this is the way to go with your data?
              - Heckman model: it is difficult to say anything about that without taking a look at what you typed and what Stata gave you back (as recommended by the FAQ).
              Last edited by Carlo Lazzaro; 10 Oct 2019, 01:05.
              Kind regards,
              Carlo
              (Stata 18.0 SE)


              • #8
                Jessica: A true zero, and setting the value to zero because you don't observe it, are very different. The latter should never be done. Can you tell which is the case? If so, you should replace zeros that are actually missing with the missing data indicator.
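A minimal sketch of that recoding step, assuming the data contain some way to tell the two kinds of zero apart. Here that is a hypothetical indicator `reported`; the toy data and all variable names are made up for illustration.

```stata
* Hypothetical toy data: y is production, reported flags whether the firm filed a report
clear
input id year y reported
1 2018 0 1
1 2019 5 1
2 2018 0 0
2 2019 3 1
end

* A zero recorded for a non-reporting firm is not a true zero: set it to missing
replace y = . if y == 0 & reported == 0

* Firm 1's zero in 2018 is kept; firm 2's zero in 2018 is now missing
list, sepby(id)
```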

                You can't tell if there is "selection" based on splitting the sample because selection is inherently about unobservables. Assuming that you can resolve the missing data problem, the best solution for you is to use Poisson fixed effects estimation. Contrary to its name, the Poisson distribution is not needed. Any variance-mean relationship is allowed, and any serial correlation. By allowing a multiplicative fixed effect you can easily account for the fact that some firms almost always, or always, have zeros. This may not entirely account for the "selection" you're worried about but it likely goes a long way. Just make sure you use robust standard errors. The coefficients have percentage interpretations: it's similar to using log(y) but this works with any number of zeros.

                Code:
                xtset id year
                xtpoisson y i.year x1 x2 ... xk, fe vce(robust)
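To make the percentage interpretation concrete, here is a small sketch on a simulated panel. The data-generating process, the seed, and the names `id`, `year`, `a`, `x1`, and `y` are all assumptions for illustration. For a one-unit increase in a regressor with coefficient b, the implied change in the conditional mean is 100*(exp(b) - 1) percent, which is close to 100*b when b is small.

```stata
* Simulated panel, for illustration only: 200 units observed over 5 years
clear
set seed 2019
set obs 200
generate id = _n
generate a = rnormal()                  // unit-specific effect, unobserved in practice
expand 5
bysort id: generate year = 2014 + _n
generate x1 = rnormal() + 0.5*a         // regressor correlated with the unit effect
generate y = rpoisson(exp(0.2*x1 + a))  // outcome with many zeros

xtset id year
* Units whose outcome is zero in every period drop out of the fixed effects estimation
xtpoisson y i.year x1, fe vce(robust)

* Percentage change in E(y|x) for a one-unit increase in x1
display 100*(exp(_b[x1]) - 1)
```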


                • #9
                  Jeff Wooldridge: I have a similar question, yet I am not working with count data. My dependent variable is an observed flight delay in minutes (thus, I have many zero delays and many positive delays in my dataset). I am estimating a fixed effects model, but I also want to account for the 'different processes' underlying positive and zero delays. My first attempt was to proceed with the selection model you proposed (Wooldridge, 1995). However, I believe that such a method is mostly suitable for cases where the 'zero' observations are not observed and the data are therefore truncated, which is not the case here. Could you clarify what would be the most suitable approach to account for zero delays? Thank you in advance.
