point biserial correlation vs pearson's r

natalia malancu

Join Date: Apr 2014

Posts: 110
#1

18 Dec 2014, 08:18

hi guys!

i have one continuous variable and one dichotomous one. predictably, the point biserial correlation (esize twosample) and pearson's r (correlate) give me a similar value, yet of different sign. my intuition is that when computing the point biserial, group 2 (corresponding to value 1 of the dichotomous variable) is compared with group 1 (corresponding to value 0 of the dichotomous variable). is that correct? which sign then correctly indicates the association? my money is on pearson's r.

cheers,
natalia

Last edited by natalia malancu; 18 Dec 2014, 08:21.
natalia malancu

Join Date: Apr 2014

Posts: 110
#2

18 Dec 2014, 08:53

Update: flipped the dummy, did the pearson r and came to the same conclusion as for the point biserial. My intuition has no substance, but it also means the point biserial estimate is misleading.
ben earnhart

Join Date: May 2014

Posts: 1027
#3

18 Dec 2014, 10:19

All I can say is "interesting"! and hope somebody can chime in on why it would reverse the sign. Have you considered the package -polychoric-, which gives the biserial, with a higher value (as one would expect) and in the expected direction? -findit polychoric-

Code:

clear
set seed 1971
set obs 1000
gen x1=rnormal()
gen x2=rnormal()+x1
* dichotomize x2 at zero
replace x2=0 if x2<0
replace x2=1 if x2>0
esize twosample x1, by(x2) pbc
corr
polychoric x1 x2

Last edited by ben earnhart; 18 Dec 2014, 10:24.
natalia malancu

Join Date: Apr 2014

Posts: 110
#4

18 Dec 2014, 14:55

hi ben!

tried polychoric as you suggested and the direction is aligned with correlate x1 x2
also tried a different continuous-dichotomous variable combination and noticed the same thing: correlate and polychoric give the same direction and more or less the same magnitude (as expected), while esize twosample x1, by(x2), with either the pbc or the all option, gives the same magnitude with the opposite sign. so yeah... interesting is the right word.
ben earnhart

Join Date: May 2014

Posts: 1027
#5

18 Dec 2014, 20:15

I found the package -pbis-, which is dedicated to the point-biserial. It agrees with -corr- and -polychoric-. So it seems like esize is broken? Hard to believe for a default Stata package, but I tried it with several different real and generated datasets, and it (esize) got the direction wrong. Big triple-hmm.
natalia malancu

Join Date: Apr 2014

Posts: 110
#6

20 Dec 2014, 03:10

Hopefully stata.corp will provide us with an answer
Joseph Coveney

Join Date: Apr 2014

Posts: 4420
#7

22 Dec 2014, 00:14

Originally posted by natalia malancu

Hopefully stata.corp will provide us with an answer

StataCorp already did, more or less.

Take a look at Example 1 in the user's manual entry for esize. You can see that the command is set up like ttest: it reports the difference in the outcome variable as the mean of the first group minus the mean of the second (value 0 minus value 1 of a 0/1 by() variable), just as ttest does.

The parallel will be evident after running the following.

Code:

sysuse auto, clear
table foreign, contents(mean mpg)
ttest mpg, by(foreign)
// From the Methods and Formulas section of the manual's entry for esize:
display in smcl as text "rPB = " as result r(t) / sqrt(r(t)^2 + r(df_t))
esize twosample mpg, by(foreign) pbcorr
correlate mpg foreign
exit
ben earnhart

Join Date: May 2014

Posts: 1027
#8

22 Dec 2014, 01:45

Code:

clear
set obs 1000
set seed 1971
gen x1=rnormal()
gen x2=rnormal()+x1
* dichotomize x2 at zero
replace x2=0 if x2<=0
replace x2=1 if x2>0
ttest x1, by(x2)

So esize and ttest "test" in what seems to me to be a counter-intuitive direction. When using a t-test or effect size, one generally looks at the substance, then the size, then the significance. So it might take a second glance, but the direction is arbitrary. But for a "correlation" it's pretty darn counter-intuitive. x1 higher generally is interpreted as x2 higher. Here I stand, I can do no other. For a correlation, + goes with +, - goes with -.
Joseph Coveney

Join Date: Apr 2014

Posts: 4420
#9

22 Dec 2014, 02:27

I'm not disagreeing with you, just pointing out that the direction / sign for esize twosample, by() pbcorr appears to be intended, per the documentation: Example 1 in the user's manual entry for the command mentions the sign and direction of the difference, and I'm guessing that is where the command's convention for direction / sign originates.
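
For readers who want the point-biserial from esize to carry the same sign as correlate, one possible workaround is to reverse-code the grouping variable before calling esize. The following is only a sketch using the auto dataset from the earlier post; the variable name domestic is a hypothetical helper introduced here for illustration, not something from the manual or the thread.

```stata
* Sketch only: reverse-code the 0/1 grouping variable so that the sign
* reported by esize matches the sign reported by correlate.
sysuse auto, clear

correlate mpg foreign                     // Pearson r: positive in this dataset
esize twosample mpg, by(foreign) pbcorr   // same magnitude, opposite sign

* "domestic" is a hypothetical helper variable (assumption, not in the thread)
generate byte domestic = 1 - foreign
esize twosample mpg, by(domestic) pbcorr  // sign now agrees with correlate
```

Because the difference is taken as the first group's mean minus the second group's, swapping which group is coded 0 flips the sign and leaves the magnitude unchanged.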
natalia malancu

Join Date: Apr 2014

Posts: 110
#10

09 Sep 2016, 12:22

hi guys!

sorry to get back on this after so long, but I have one other related question.

i'm having a hard time understanding a difference in magnitude:
correlate x1 x2 gives me a value of -0.5494
esize twosample x1, by(x2) pbc gives a value of .5494018
pbis x2 x1 gives a value of -0.5493
polychoric x1 x2 gives a Rho of -.04736496 (S.e. .00312193)

what am I missing here with regard to the magnitude returned by polychoric?
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2403
#11

09 Sep 2016, 17:17

Correct me if I am wrong, but these two effect sizes should be exactly equal: with one continuous and one dichotomous variable, the Pearson product-moment correlation should yield an equivalent result. I wonder why these two are different?
natalia malancu

Join Date: Apr 2014

Posts: 110
#12

10 Sep 2016, 01:58

hence my question. polychoric should return more or less a similar value
natalia malancu

Join Date: Apr 2014

Posts: 110
#13

14 Sep 2016, 14:59

changed machine, changed stata version, to no avail. any suggestions whatsoever? what am I missing?
Joseph Coveney

Join Date: Apr 2014

Posts: 4420
#14

14 Sep 2016, 21:27

Well, here's something to consider:

First, the two commands compute fundamentally different things—one is a point-biserial correlation coefficient and the other a biserial (polyserial) correlation coefficient.

Second, while the latter is typically larger than the former, they have different assumptions regarding properties of the distribution of the data.

So, maybe your particular dataset violates one or both of the coefficients' distributional assumptions and the unexpected values that you've got reflect that.
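
To make the difference between the two coefficients concrete, here is a rough sketch of the textbook conversion from a point-biserial to a biserial correlation. It assumes the 0/1 variable is a dichotomized latent normal variable, and it is not what -polychoric- does internally (that command estimates the coefficient by maximum likelihood, so its result will generally differ somewhat). The scalar names r_pb, p and r_b are illustrative choices, and the auto dataset stands in for real data.

```stata
* Sketch only: textbook conversion from point-biserial to biserial,
* assuming the 0/1 variable dichotomizes a latent normal variable.
sysuse auto, clear

quietly correlate mpg foreign
scalar r_pb = r(rho)                 // point-biserial equals Pearson r here

quietly summarize foreign
scalar p = r(mean)                   // proportion of observations coded 1

* r_b = r_pb * sqrt(p*q) / phi(z_p), with z_p the normal quantile of p
scalar r_b = r_pb * sqrt(p*(1 - p)) / normalden(invnormal(p))

display "point-biserial = " %6.4f r_pb "   biserial = " %6.4f r_b
```

Under this formula the multiplier sqrt(p*q)/phi(z_p) is never smaller than about 1.25 (its value at p = 0.5), so the biserial should be larger in magnitude than the point-biserial. A much smaller value is a sign that something about the data or the estimation deserves a closer look.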
natalia malancu

Join Date: Apr 2014

Posts: 110
#15

10 Oct 2016, 05:55

hi joseph!

thanks for this.
re assumption: definitely a problem. i'm dealing with 2 time periods, and in one of them the continuous variable doesn't take the top 3 values it does in the other one (say in period 0: values 1-20, in period 1: values 1-17). given the nature of the variables, i thought the biserial correlation made most sense, but given the distribution, i don't know which to use anymore.