Interpretation of pairwise correlation between a variable and a logged variable

Alexander Cortolezis

Join Date: Feb 2019

Posts: 6
#1

Interpretation of pairwise correlation between a variable and a logged variable

17 Jun 2019, 15:34

Dear Statalists,

I would like to show a simple pwcorr of two variables (as in the example below), one of which is logged. What difference do I have to explain when showing the relationship between the variable and the logged variable, given that the results differ from those I get when I do not log the variable?

HTML Code:

. pwcorr mpg price, sig

             |      mpg    price
-------------+------------------
         mpg |   1.0000
             |
             |
       price |  -0.4686   1.0000
             |   0.0000
             |

HTML Code:

. pwcorr mpg price_log, sig

             |      mpg price_~g
-------------+------------------
         mpg |   1.0000
             |
             |
   price_log |  -0.4910   1.0000
             |   0.0000
             |

Best,
Alexander
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#2

17 Jun 2019, 17:49

-pwcorr- calculates the Pearson correlation coefficient, which takes into account the actual magnitudes of the values of the variables, not just their rank ordering. When you take the log of a variable whose values are greater than 1, you compress the range of the variable. If you think of the scatterplot, log-transforming one of the variables squeezes the points closer together in the direction of the logged variable. Consequently the shape of the scatterplot is altered, and so the steepness of the best-fit line will shift as well. With price and mpg you do not see it much, because the price variable (the one being logged) doesn't have a very wide range to start with, so log doesn't compress it that much. But if we use mpg and displacement, it is a little bit easier to visualize, because displacement has a wider range. Try running this code:

Code:

sysuse auto, clear
gen log_displacement = log(displacement)
pwcorr mpg displacement log_displacement
graph twoway scatter mpg displacement, name(untransformed, replace)
graph twoway scatter mpg displacement, xscale(log) name(logged, replace)
graph combine untransformed logged

You can see that in the left panel the points sort of outline a parabolic curve from top left to lower right, but in the right panel, where displacement has been logged, the points look more like they are outlining a straight line--there is less "bowing" in the mass of points. Correspondingly, the correlation coefficient has gone from -.71 (unlogged) to -.75 (logged), as a coefficient whose magnitude is closer to 1 corresponds to a more linear relationship.
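To make the "magnitudes, not just rank ordering" point concrete, here is a minimal sketch that puts a rank-based coefficient next to the Pearson one. It assumes the same auto dataset; the use of -spearman- and the scalar name rho_raw are additions for illustration and are not part of the original example. Because log is a monotone transformation, the ranks of displacement do not change, so the two Spearman coefficients should agree while the Pearson coefficients differ.

```stata
sysuse auto, clear
gen log_displacement = log(displacement)

* Pearson correlation uses the actual values, so it changes under the log
pwcorr mpg displacement log_displacement

* Spearman correlation uses only ranks, which a monotone transform preserves
spearman mpg displacement
scalar rho_raw = r(rho)          // hypothetical scalar name, for comparison only
spearman mpg log_displacement

* The difference should be zero (up to rounding)
display "Difference in Spearman rho: " rho_raw - r(rho)
```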
Comment
Alexander Cortolezis

Join Date: Feb 2019

Posts: 6
#3

18 Jun 2019, 03:45

Clyde Schechter, thank you very much for the detailed explanation!

What does this mean for the interpretation of my analysis? In my own data, the correlation between the two unlogged variables is not significant, whereas the correlation with the one variable logged is significant (that variable has a very wide range, as displacement does in your example). Is there any case where it is "wrong" to take the log of a variable? If I understood correctly, I would argue in my study that the data had a very wide range, and that's why I took the log of this variable.
Comment
Richard Williams

Join Date: Apr 2014

Posts: 5008
#4

18 Jun 2019, 07:52

The original example isn't very good because the logged and unlogged correlations are nearly identical. But it sounds like the difference is more dramatic in the actual data.

There are lots of discussions on the web about logging variables, e.g.

https://www.asanet.org/asa-communiti...ard-recipients

Some other thoughts: you wouldn't log variables that have negative values. It should make sense to think of the variable in terms of percentage increases.
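As a rough sketch of how one might check this before transforming: in Stata, log() of a zero or negative value returns missing, so those observations silently drop out of any later correlation. The example below assumes the auto dataset and uses price only as a stand-in for your own variable; log_price is a name made up for the illustration.

```stata
sysuse auto, clear

* How many observations cannot be logged (zero, negative, or already missing)?
count if price <= 0 | missing(price)

* Log only the strictly positive values; the rest become missing
gen log_price = log(price) if price > 0

* Confirm how many observations were lost to the transformation
count if missing(log_price)

* pwcorr uses all available pairs, so check the sample size it reports
pwcorr mpg price log_price, obs
```

If the first count is nonzero in your own data, the logged and unlogged correlations are computed on different sets of observations, which is worth knowing before comparing them.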

-------------------------------------------
Richard Williams, Notre Dame Dept of Sociology
StataNow Version: 19.5 MP (2 processor)
WWW: https://www3.nd.edu/~rwilliam
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#5

18 Jun 2019, 09:50

The difference between statistically significant and not statistically significant is, itself, not statistically significant. The 0.05 significance cutoff (or whatever other cutoff you might be using) is just an arbitrary number and you shouldn't attach any importance to different analyses giving p-values on opposite sides of the cutoff, particularly if the correlation coefficient itself has only changed by a very small amount, as in the example you show.

The real issue is which model is a better model of the actual relationships between the variables. If the range of variation of the variable that you logged is small, it probably will be difficult or impossible to tell which model is better just by looking at a scatterplot, but if the range is large, it may become obvious that one of the graphs is more like a straight line than the other. That would be your better model, if the parameter you are interested in is a correlation coefficient.
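One way to make the "which graph is more like a straight line" comparison easier on the eye is to overlay a linear fit and a lowess smoother on each scatterplot; where the two curves nearly coincide, a straight line is describing the data well. This is only a sketch using the auto dataset, and the graph names raw and logged are arbitrary. The /// line continuations assume the code is run from a do-file rather than typed interactively.

```stata
sysuse auto, clear
gen log_displacement = log(displacement)

* Untransformed: compare the straight-line fit with the lowess curve
twoway (scatter mpg displacement) (lfit mpg displacement) ///
    (lowess mpg displacement), name(raw, replace) legend(off)

* Logged: the closer the lowess curve tracks the line, the more linear
twoway (scatter mpg log_displacement) (lfit mpg log_displacement) ///
    (lowess mpg log_displacement), name(logged, replace) legend(off)

graph combine raw logged
```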

Finally, let me note that the American Statistical Association has recently recommended that the concept of statistical significance be abandoned--a stronger position than what I described in my first paragraph here, and one that I agree with. See https://www.tandfonline.com/doi/full...5.2019.1583913 for their position paper.
Comment
