  • Logged variables w/ value of 0: drop observations or weaken the model?

    [Note: PhD student new to Stata and still somewhat of a beginner with stats analysis]

    In my dataset, the variable 'Inputs' reflects monetary values, and some observations are 0. I have logged all values of 'Inputs' for running regressions, but of course Stata drops the roughly 25 observations for which 'Inputs' = 0. I would prefer not to lose those observations because my sample is only n = 147.

    On the advice of my supervisor, I replaced 'Inputs' = 0 with 'Inputs' = 1 for those observations so as not to drop them from the sample, then logged the values again. Now, instead of being dropped, those observations remain in the sample with 'Log_Inputs' = 0. However, this lowers the R-squared and therefore weakens the model.
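
    In Stata terms, what I did was roughly this (a sketch, using the variable names above; I have not posted the exact commands):

    * replace the zeros with 1, then log everything, so the former zeros become 0
    replace Inputs = 1 if Inputs == 0
    generate Log_Inputs = log(Inputs)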

    Which is the better choice: Drop the observations that cannot be logged, or weaken the model but maintain the sample size?

  • #2
    Amy:
    welcome to this forum.
    I would avoid adding an (arbitrary) additive constant and keep the sample as it originally was (n = 147).
    The issue is whether you're forced to go log-linear by tribal tradition or for some other reason.
    That said, in your future posts please share what you typed and what Stata gave you back (as per the FAQ). Thanks.
    Kind regards,
    Carlo
    (Stata 19.0)



    • #3
      Taking logarithms of (variable + constant) is one work-around.

      Let's assume the minimum value other than the zeros is 1 unit. Then use as predictors

      cond(x == 0, log(x + 1), log(x))

      and the indicator that is 1 if x == 0 and 0 otherwise.

      The first is a fudge, but the second allows some quantification of the effect of the predictor being 0, not 1.

      Stata drops nothing here; better to say that it omits missing values from model fits.
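
      A minimal sketch of this in Stata, assuming an outcome variable y (the names here are illustrative):

      * transformed predictor: log(1) = 0 where x is 0, log(x) otherwise
      generate logx = cond(x == 0, log(x + 1), log(x))
      * indicator for the zeros, so their effect can be quantified separately
      generate zerox = (x == 0)
      regress y logx zerox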



      • #4
        The choice of what constant to add can have substantial effects on your parameter estimates. I'd try Nick's suggestion, but see what happens when you use some different values of the constant (0.01, 0.1, 1.0, etc., as relevant in your situation). Perhaps you will find that the choice of constant doesn't matter much, and that the results are similar to those when the observations are omitted due to missing values, which would support that approach. However, in confronting a similar situation myself, I once found that the value of the constant *did* matter. I proceeded to try something in the direction Carlo implied, i.e., trying another transformation instead of log(). I used sqrt(), and obtained a better fit than I did with log(), so I think that's also worth a try in your situation.
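
        One way to run that comparison in Stata (a sketch; y and the candidate constants are placeholders):

        local i = 0
        foreach c of numlist 0.01 0.1 1 {
            local ++i
            * candidate constant `c' added before logging, so the zeros are kept
            generate logc`i' = log(Inputs + `c')
            quietly regress y logc`i'
            display "constant = `c'   R-squared = " %6.4f e(r2)
        }
        * the alternative transformation mentioned above; sqrt(0) is defined
        generate sqrtx = sqrt(Inputs)
        regress y sqrtx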



        • #5
          Note that log(x + smidgen) for 0 < smidgen << 1 is likely to create massive outliers. My suggestion in #3 pivots on 1 being the smallest feasible positive value but may be generalised accordingly.
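
          A quick numerical illustration (Stata's log() is the natural log):

          display log(0 + 0.001)   // about -6.91: the former zeros land far below the rest
          display log(0 + 1)       // 0, as in #3
          display log(5)           // about 1.61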



          • #6
            There’s a recent paper on this topic (dealing with logs and zeros in regression models), which you might find useful. The authors discuss some of the common practices and have also made the Stata implementation of their approach to address the issue publicly available on GitHub.


            https://github.com/ldpape/iOLS


            Bellego, Christophe, and Louis-Daniel Pape. "Dealing with logs and zeros in regression models." Série des Documents de Travail 2019-13 (2019).



            • #7
              Thanks so much for all of your input. I will give it a go with Nick Cox's suggestion and play around with Mike Lacy's advice. And I will absolutely review that paper, Justin Niakamal!

