  • Dealing w/ 0 values when taking log transformation of variable

    I'm working with a right-skewed variable: monthly new COVID cases per 100,000 people. My analysis includes pre-COVID months, when naturally the value would be 0. I don't want to lose these observations from my analysis, but because log(0) is undefined, the transformed variable simply shows them as missing. So, how should I deal with these 0 values? Thanks so much

  • #2
    The first question is why you want to log transform this variable in the first place. The mere fact that the variable can take on zero values suggests that this is not an appropriate model anyway. Have you considered instead using the untransformed variable in a Poisson model, or, given that you have pre-pandemic data included, a zero-inflated Poisson model? The Poisson model uses a logarithmic link function, so it is, in a sense, similar to log transformation of the outcome variable and provides many of the advantages that people often hope to gain by log-transforming the outcome. But it has no problem accommodating zero values. And Poisson models are widely useful for modeling count outcomes.
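    A minimal Stata sketch of that suggestion, with made-up variable names (cases_per100k as the outcome, x1 and x2 as covariates) purely for illustration:

    * Poisson regression uses a log link, so zeros in the outcome are no problem
    poisson cases_per100k x1 x2, vce(robust)

    * if the pre-pandemic months produce structural zeros, a zero-inflated
    * Poisson is one alternative
    zip cases_per100k x1 x2, inflate(x1 x2) vce(robust)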



    • #3
      So the transformed state-by-month COVID variable is a control variable in a logit model, where the outcome is reemployment or not (1/0). What if I don't transform it?



      • #4
        OK, that's a different story. Given that the logit model inherently squashes the effects of extreme values of predictor variables, I would give the untransformed COVID variable a try and see how the logit model works out. It might fit very well. Remember, there are no distributional requirements for the explanatory variables in generalized linear models. If the model ends up fitting poorly, then you can consider other ways of dealing with it.
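        A minimal sketch of that first try, with made-up names (reemployed as the 1/0 outcome, covid_rate as the untransformed control, x1 and x2 as the other covariates):

        * untransformed COVID rate entered directly as a control
        logit reemployed covid_rate x1 x2

        * one rough check of calibration afterwards
        estat gof, group(10)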

        For example, the cube root function and the inverse hyperbolic sine function (-asinh()- in Stata) are both good transformations for reducing skewness. You refer to this COVID variable as a "control" variable, by which I understand you to mean that it is included in the model to adjust for its nuisance contribution to outcome variation (known, variously by discipline, as adjusting for confounding or reducing omitted variable bias), but answering your research question does not require estimating its effect on the outcome. That's even better, because the main drawback of these two transformations is that their regression coefficients are difficult to interpret. But if this is really just a nuisance variable, that doesn't matter, as there is no research interest in those effects: they are included just to keep them from biasing the effects you really are interested in.
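        A sketch of those two transformations in Stata, again with made-up variable names:

        * cube root, written to preserve sign (though a rate cannot be negative)
        gen double covid_cbrt = sign(covid_rate) * abs(covid_rate)^(1/3)

        * inverse hyperbolic sine, which is defined at zero
        gen double covid_asinh = asinh(covid_rate)

        logit reemployed covid_cbrt x1 x2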



        • #5
          log(x + 1) is in some ways akin to asinh(x) and has (in some contexts, not all) the minor advantage that for x >> 0 it is close to log x and for x close to 0 it is close to x.

          I would never be in favour of advocating its blind use, and there are many partial checks, such as looking at

          1. The univariate distribution of log(x + 1) to see whether the distribution looks reasonable. Don't ask for a precise definition of reasonable, as the opposite will be obvious.

          2. Checks of y vs log(x + 1) and of residuals vs log(x + 1) ditto.

          3. Sensitivity analysis considering the transformation as a special case of log(x + c) and considering other choices of c. But there is a pitfall to watch out for. log(x + smidgen) is even closer to log x for x >> 0 but can all too easily create massive outliers "out in left field" near x = 0.
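          A rough Stata sketch of those checks, with x and y standing in for whatever the variables actually are:

          gen double lx = log(x + 1)
          histogram lx                      // check 1: does the distribution look reasonable?
          scatter y lx                      // check 2: outcome against the transform

          * check 3: sensitivity to the added constant
          gen double lx_small = log(x + 0.1)
          gen double lx_big   = log(x + 5)
          summarize lx lx_small lx_big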

          Also note that log(x + 1) is sensitive to measurement unit choice, as Dimitriy V. Masterov recently reminded me elsewhere. log(x + 1) if x is in millions of dollars and log(x + 1) if x is in cents are different transformations.

          There is an easy generalization to include negative arguments, sign(x) log(1 + abs(x)), and this (like cube roots and asinh) has the nice feature that it preserves sign, so the result is negative, zero, or positive precisely when the argument is.

          This transformation has been called the neglog, an ugly name in my view, but there it is. It is worth musing on how far useful functions tend to acquire distinctive names, but this can take a while. log[p / (1 - p)] has been around for a few centuries but statistical people's name logit is much more recent.
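          For completeness, the sign-preserving version in Stata, with x again a stand-in name:

          * "neglog": negative, zero, or positive exactly when x is
          gen double neglog_x = sign(x) * log(1 + abs(x))

          * and the unit-sensitivity point: these differ by more than an additive constant
          gen double lx_units1 = log(x + 1)
          gen double lx_units2 = log(100*x + 1)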



          • #6
            I've had good luck using sqrt(x) with right-skewed predictors. It's easy to understand and has a simple derivative.



            • #7
              Originally posted by Clyde Schechter View Post
              But if this is really just a nuisance variable, that doesn't matter, as there is no research interest in those effects: they are included just to keep them from biasing the effects you really are interested in.
              That's right; it's a control, not one of my main IVs. I might stick with the untransformed version, but these replies have given me plenty to think about. Thank you all.

