  • Problem with negative values in log-transformation

    Negative values and zeros are set to missing when a variable is log-transformed. How can negative values and zeros be log-transformed without losing observations? Is it wise to make them all positive by adding the same positive constant to every observation before the log transformation? I learned from the answers to my last question, about log transformation of a ratio variable, that it is not a good idea to add a value to the original values. However, log transformation of negative values raises a different issue (missing values).

    Specifically, I want to log-transform x, shown below, in order to address the potential problem of outliers. In my field, log(x + 6) is a typical choice in this case, where 6 offsets the smallest (most negative) value, -6. Do you agree with this? Or do you have any other suggestion? I provide detailed information on variable x as follows.

    Code:
    sum x, det
                                  x
    -------------------------------------------------------------
          Percentiles      Smallest
     1%           -5             -6
     5%           -1             -6
    10%           -1             -6       Obs                 712
    25%            0             -6       Sum of Wgt.         712

    50%            0                      Mean           .2373596
                            Largest       Std. Dev.       1.21111
    75%            1              4
    90%            1              4       Variance       1.466788
    95%            2              5       Skewness      -.8423147
    99%            4              5       Kurtosis       10.86292

    Code:
    graph box x
    [Graph.png: box plot of x]

    Code:
    dataex x
    ----------------------- copy starting from the next line -----------------------
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float x
     1
     0
     1
     0
     0
     0
     0
    -1
    -1
     0
     0
     0
    -1
    -1
    -3
     0
     0
     0
     0
     1
     1
     0
     1
     1
     0
    -2
    -2
    -2
     0
     0
     0
     0
     0
     1
     1
     1
     1
     0
     0
    -1
     0
     0
     0
     0
     0
     0
     0
     1
     0
     1
     0
     0
     0
     0
     0
     0
     0
     0
     0
     0
     1
     0
     0
     0
     0
     0
     0
     0
     0
     1
     1
     0
     0
     0
     0
     0
     0
     1
     1
     2
     2
     0
     1
     2
     0
     0
     0
     0
     0
     0
     0
     1
     0
     0
     0
     0
     0
    -3
    -3
    -1
    end
    ------------------ copy up to and including the previous line ------------------

    Listed 100 out of 712 observations
    Last edited by Sang-Bum Park; 22 Feb 2019, 23:00.

  • #2
    I think log(y + constant) with constant chosen ad hoc so that all y + constant > 0 in your data is just about the worst possible solution you could choose. For one, it makes your results really hard to compare with anybody else's, because typically they would choose a different constant. For another, if zero and negative values have meaning because your scale isn't arbitrary, then that meaning deserves respect.

    I have to say that your data look fine as they are. They are approximately symmetric and I wouldn't call any of the extreme data points outliers. Indeed, log(y + constant) is going to create stronger outliers!
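
    A minimal sketch of why a shifted log can create stronger outliers, under stated assumptions: the values are illustrative only (one observation at each integer from -6 to 5, the range mentioned in this thread, not the poster's 712 observations), and the constant 7 is hypothetical, chosen only so that x + 7 > 0 everywhere.

    ```stata
    * Illustrative values only: one observation at each integer from -6 to 5
    clear
    set obs 12
    generate x = _n - 7

    * 7 is a hypothetical constant, picked only so that x + 7 > 0 for every x
    generate logx7 = ln(x + 7)

    * Compare skewness before and after the shifted log
    summarize x, detail
    display "skewness of x:         " r(skewness)
    summarize logx7, detail
    display "skewness of ln(x + 7): " r(skewness)

    * The lowest values are pulled far away from the rest
    list x logx7
    ```

    On the log scale the step from -6 to -5 becomes ln(2) - ln(1), about 0.69, while the step from 4 to 5 becomes ln(12) - ln(11), about 0.09. The low end is stretched and the high end compressed, so the most negative values stand out more than they did on the original scale.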

    However, much depends on the substance, which you don't explain at all. Indeed, why use an anonymous variable name like x? Is it a response or a predictor? What are you going to do with these data? Do any of the data points appear to be outliers in bivariate or multivariate analyses too?

    Further, the indications are that your data are all integers. If so, that may affect good advice.

    Note: Your graph shows integers between -6 and 5. But log(y + 6) is undefined for y = -6 (Stata returns missing), so your transformation fails.
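
    The failure at -6 can be checked directly. This is a sketch on illustrative values only (one observation at each integer from -6 to 5), not on the original data; the variable name logx6 is made up for the example.

    ```stata
    * Illustrative values only: one observation at each integer from -6 to 5
    clear
    set obs 12
    generate x = _n - 7

    * ln(0) is not defined, so Stata stores missing for x == -6
    generate logx6 = ln(x + 6)

    * Count and show the observations that the transformation loses
    count if missing(logx6)
    list x logx6 if missing(logx6)
    ```

    In the posted summary the four smallest values are all -6, so at least four observations would become missing under log(x + 6).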

    I would be interested perversely in literature references to this practice in your field, as on the face of it people need to be told how bad an idea this is. (Sorry if that seems arrogant, but I do give the arguments here.)
    Last edited by Nick Cox; 23 Feb 2019, 01:20.

    • #3
      Thank you, Nick Cox. Using the original values seems best. This variable is just one of the control variables. Log transformation seems necessary only for a dependent variable that is highly skewed or has outliers.
