
  • Interpretation of pairwise correlation between a variable and a logged variable

    Dear Statalists,

    I would like to show a simple pwcorr of two variables (as in the example below), one of which is logged. What difference do I have to explain when showing the relationship between the variable and the logged variable, given that the results differ from those I get when I do not log my variable?


    Code:
     . pwcorr mpg price, sig
    
                 |      mpg    price
    -------------+------------------
             mpg |   1.0000 
                 |
                 |
           price |  -0.4686   1.0000 
                 |   0.0000
                 |
    Code:
    . pwcorr mpg price_log, sig
    
                 |      mpg price_~g
    -------------+------------------
             mpg |   1.0000
                 |
                 |
       price_log |  -0.4910   1.0000
                 |   0.0000
                 |
    Best,
    Alexander

  • #2
    -pwcorr- calculates the Pearson correlation coefficient, which takes into account the actual magnitudes of the values of the variables, not just their rank ordering. When you take the log of a variable whose values are greater than 1, you compress the range of the variable. If you think of the scatterplot, log-transforming one of the variables results in the points on the scatterplot being squeezed closer together in the direction of the logged variable. Consequently the shape of the scatterplot is altered, and so how tightly the points cluster around the best-fit line will shift as well. With price and mpg you do not see it much, because the logged variable (price) doesn't have a very wide range to start with, so log doesn't compress it that much. But if we use mpg and displacement, it is a little bit easier to visualize, because displacement has a wider range. Try running this code:
    Code:
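    sysuse auto, clear
    gen log_displacement = log(displacement)
    // Pearson correlations of mpg with raw and logged displacement
    pwcorr mpg displacement log_displacement
    // Same data twice: linear x-axis (left panel) and log-scaled x-axis (right panel)
    graph twoway scatter mpg displacement, name(untransformed, replace)
    graph twoway scatter mpg displacement, xscale(log) name(logged, replace)
    graph combine untransformed logged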
    You can see that in the left panel the points sort of outline a parabolic curve from top left to lower right, but in the right panel, where displacement has been logged, the points look more like they are outlining a straight line--there is less "bowing" in the mass of points. Correspondingly, the correlation coefficient has gone from -.71 (unlogged) to -.75 (logged), as a coefficient whose magnitude is closer to 1 corresponds to a more linear relationship.
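
    The point about magnitudes versus rank ordering can be checked directly. Because log is a monotone transformation, a rank-based coefficient should not change when displacement is logged, while the Pearson coefficient does. A minimal sketch, assuming the auto data and the variable name used above; the Spearman comparison and the scalar names are added here for illustration and were not part of the original reply:

```stata
sysuse auto, clear
gen log_displacement = log(displacement)

// Pearson: depends on the actual values, so logging changes it
quietly corr mpg displacement
scalar r_raw = r(rho)
quietly corr mpg log_displacement
scalar r_log = r(rho)

// Spearman: depends only on ranks, which log() preserves
quietly spearman mpg displacement
scalar rho_raw = r(rho)
quietly spearman mpg log_displacement
scalar rho_log = r(rho)

display "Pearson:  " %6.4f r_raw "  vs  " %6.4f r_log
display "Spearman: " %6.4f rho_raw "  vs  " %6.4f rho_log
```

    If the two Spearman values agree while the two Pearson values differ, the change in the pwcorr result comes from the reshaping of the scatter, not from any change in the ordering of the observations.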





    • #3
      Clyde Schechter, thank you very much for the detailed explanation!

      What does this mean for the interpretation of my analysis? In my specific case, with my data set, the correlation of the two unlogged variables is insignificant, whereas the correlation with the one logged variable is significant (that variable has a very wide range, as displacement does). Is there any case where it is "wrong" to take the log of a variable? If I understood correctly, in my case I would argue in the study that the data had a very wide range, and that's why I took the log of this variable.



      • #4
        The original example isn't very good because the logged and unlogged correlations are nearly identical. But it sounds like the difference is more dramatic in the actual data.

        There are lots of discussions on the web about logging variables, e.g.

        https://www.asanet.org/asa-communiti...ard-recipients

        Some other thoughts: you wouldn't log variables that have zero or negative values, since the log is undefined there. It should also make sense to think of the variable in terms of percentage increases.
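
        Both of these points can be checked before transforming anything. The sketch below assumes the auto data from post #1; the variable `shifted` is hypothetical, built only so that some values are zero or negative, and the 10% increase is an arbitrary choice for illustration:

```stata
sysuse auto, clear

// Hypothetical variable that contains non-positive values
gen shifted = price - 5000

// log() of a non-positive number is missing in Stata,
// so those observations silently drop out of later analyses
quietly count if shifted <= 0
scalar n_nonpos = r(N)
gen log_shifted = log(shifted)
quietly count if missing(log_shifted)
scalar n_missing = r(N)
display "non-positive: " n_nonpos "   missing after log: " n_missing

// Percentage reading: a 10% increase in price adds the same
// constant, log(1.10), to log(price) for every observation
gen log_price = log(price)
gen log_price_up = log(price * 1.10)
gen diff = log_price_up - log_price
summarize diff
```

        The constant shift in the second part is what makes a logged variable natural to read in percentage terms: equal ratios on the original scale become equal differences on the log scale.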
        -------------------------------------------
        Richard Williams, Notre Dame Dept of Sociology
        Stata Version: 17.0 MP (2 processor)

        WWW: https://www3.nd.edu/~rwilliam



        • #5
          The difference between statistically significant and not statistically significant is, itself, not statistically significant. The 0.05 significance cutoff (or whatever other cutoff you might be using) is just an arbitrary number and you shouldn't attach any importance to different analyses giving p-values on opposite sides of the cutoff, particularly if the correlation coefficient itself has only changed by a very small amount, as in the example you show.
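
          One rough way to see this is to put approximate confidence intervals around both correlations and compare them. The sketch below assumes the auto data and the variable name `price_log` from post #1, and uses the Fisher z transformation; it is an added illustration that nobody in this thread ran, the scalar names are arbitrary, and overlapping intervals computed from the same sample are only an informal check, not a formal test of the difference:

```stata
sysuse auto, clear
gen price_log = log(price)

// Approximate 95% CI for each Pearson correlation via Fisher's z
foreach v in price price_log {
    quietly corr mpg `v'
    scalar n   = r(N)
    scalar rho = r(rho)
    scalar se  = 1 / sqrt(n - 3)
    scalar lo_`v' = tanh(atanh(rho) - invnormal(0.975) * se)
    scalar hi_`v' = tanh(atanh(rho) + invnormal(0.975) * se)
    display "`v': rho = " %6.4f rho "  95% CI [" %6.4f lo_`v' ", " %6.4f hi_`v' "]"
}
```

          When the two intervals largely overlap, as the near-identical coefficients in post #1 suggest they should, a p-value falling on one side of the cutoff or the other carries little information about which specification is better.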

          The real issue is which model is a better model of the actual relationships between the variables. If the range of variation of the variable that you logged is small, it probably will be difficult or impossible to tell which model is better just by looking at a scatterplot, but if the range is large, it may become obvious that one of the graphs is more like a straight line than the other. That would be your better model, if the parameter you are interested in is a correlation coefficient.

          Finally, let me note that an editorial in The American Statistician, a journal of the American Statistical Association, has recently recommended that the concept of statistical significance be abandoned. That is a stronger position than what I described in my first paragraph here, and one that I agree with. See https://www.tandfonline.com/doi/full...5.2019.1583913 for the editorial.
