  • Dealing with zero values and log variables

    Hi all,

    I'm looking at some log real income and log real wage data as outcome variables in a difference-in-differences specification. I'm trying to figure out the best way to deal with zero income/wage values. What I have currently done is to convert them to 1 when I do the log transformation. Is this the best approach, or should I consider an alternative like a Poisson model (which would enable me to retain the zeros as zeros) or any other model?

    Many thanks,
    Ashani

  • #2
    Adding an arbitrary constant before taking logarithms is sometimes a good idea for visualization, provided the constant is carefully chosen -- but it's not obviously the best solution for modelling. In some circumstances it is a very bad solution.

    For simplicity, let's consider logarithms to base 10. Any other base raises the same issues, but the mental arithmetic is harder for me. Suppose you have some zeros, but otherwise your wages are hundreds or thousands of your currency units. Then your zeros map to 0 under log10(wage + 1), while the other wages map to, say, 2, 3 or 4, and for those other values log10(wage + 1) is essentially the same as log10(wage). So, you've created a bunch of outliers from the zeros. The point is that log(x + 1) is a very gentle nudge when x is large compared with 1, but a violent nudge when x is 0 or close to it.
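
    A minimal sketch of that arithmetic in Python, using made-up wage values (the numbers are assumptions for illustration, not anyone's data):

```python
import math

# Hypothetical wages: two zeros, the rest in the hundreds to ten-thousands.
wages = [0, 0, 100, 1000, 10000]

# log10(wage + 1) is defined for every value, including the zeros.
shifted = [math.log10(w + 1) for w in wages]

# log10(wage) is only defined for the positive wages.
positive = [w for w in wages if w > 0]
plain = [math.log10(w) for w in positive]

for w, s in zip(wages, shifted):
    print(f"wage={w:>6}  log10(wage + 1)={s:.4f}")

# Largest discrepancy between the two transforms among positive wages.
max_diff = max(math.log10(w + 1) - math.log10(w) for w in positive)
print(f"max difference for positive wages: {max_diff:.4f}")
```

    The zeros sit at 0 while every positive wage sits at 2 or above, and adding 1 barely moves the positive values.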

    To see the effect of your transformation, plot log(wage + 1) against wage and check whether outliers pop out. Of necessity the zeros will all fall on the same point of any graph, so count them too. For this purpose whatever base you want to use is fine: as said, any other base would show the same pattern.

    If outliers do pop out, then the problem would be improved a little by using a different constant, but never removed.
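
    One way to see why a different constant helps only a little is to compare, for several constants c, the gap between where the zeros land and where the smallest positive wage lands, alongside how far that smallest positive wage is moved. This is only a sketch: the smallest positive wage of 100 and the list of constants are assumptions for illustration.

```python
import math

smallest_positive_wage = 100       # hypothetical
constants = [0.01, 1, 10, 100]     # hypothetical choices of c in log10(wage + c)

gaps = {}         # distance between transformed zero and transformed smallest wage
distortions = {}  # how far log10(wage + c) moves the smallest wage from log10(wage)

for c in constants:
    zero_maps_to = math.log10(0 + c)
    smallest_maps_to = math.log10(smallest_positive_wage + c)
    gaps[c] = smallest_maps_to - zero_maps_to
    distortions[c] = smallest_maps_to - math.log10(smallest_positive_wage)
    print(f"c={c:>6}: gap={gaps[c]:.3f}  distortion={distortions[c]:.4f}")
```

    A larger constant shrinks the gap, but only by pushing the genuine small wages around more; the gap itself is positive for every c.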

    John Mullahy and I have often discussed these issues both on and off the list. I think he would say that my warning understates the problems!

    Sometimes Poisson is a good answer. Sometimes there is good reason to separate the problem into predicting who does or does not receive wages and then predicting the wages of the waged.

    For wages read income too.

    There are many analogues in health or medical fields, for example predicting who smokes (tobacco) and predicting how much the smokers smoke.
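
    The two-part idea can be sketched as a split of the data, here in Python with made-up wages (an assumption for illustration). The actual models for each part (say, a binary-outcome model and then a regression for the waged) are left out; this only shows which observations feed which part.

```python
import math
from statistics import mean

# Hypothetical wages, including zeros.
wages = [0, 0, 0, 250, 400, 1200, 3000]

# Part 1: who receives any wage at all (a binary outcome).
is_waged = [1 if w > 0 else 0 for w in wages]
share_waged = mean(is_waged)

# Part 2: wages of the waged only, so logs need no added constant.
log_wages_of_waged = [math.log(w) for w in wages if w > 0]
mean_log_wage = mean(log_wages_of_waged)

print(f"share with positive wages: {share_waged:.3f}")
print(f"mean log wage among the waged: {mean_log_wage:.3f}")
```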


    • #3
      For diff-in-diffs this recent paper should be instructive:
      https://academic.oup.com/qje/article.../2/891/7473710

      This paper may also be useful to consult:
      https://onlinelibrary.wiley.com/doi/...bes.12583?af=R


      • #4
        Thank you very much for the feedback and resources. Much appreciated.


        • #5
          Related to this: if I decide to go ahead with fixed-effects Poisson models, would I then use my real income variable in levels as the outcome (without taking logs), omitting negative real income values?

          Many thanks.


          • #6
            My instinct would be to use linear regression (linear conditional mean specification) rather than Poisson regression (exponential, and therefore positive, conditional mean specification) if your sample's outcome variable had negative, zero, and positive values.
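
            To illustrate the difference between the two conditional mean specifications, here is a small Python sketch. The index values are hypothetical stand-ins for x'b, not estimates from any model.

```python
import math

# Hypothetical values of the linear index x'b.
linear_index = [-2.0, -0.5, 0.0, 0.5, 2.0]

# Linear conditional mean: E[y | x] = x'b, which can take any sign.
linear_mean = list(linear_index)

# Exponential conditional mean (as in Poisson regression): E[y | x] = exp(x'b),
# which is strictly positive whatever the index is.
exponential_mean = [math.exp(v) for v in linear_index]

for v, lin, ex in zip(linear_index, linear_mean, exponential_mean):
    print(f"x'b={v:>5}: linear mean={lin:>5}  exponential mean={ex:.4f}")
```

            The exponential mean can never be negative, so it cannot describe an outcome whose expected value is below zero for some observations.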

            Omitting negative values would seem to me problematic.
