Methods for computing the point-biserial correlation

Bruce Weaver

Join Date: May 2014
Posts: 1139

Methods for computing the point-biserial correlation

18 Nov 2022, 15:33

After reading this question on ResearchGate the other day, I installed -pbis- (net describe sg20, from(http://www.stata.com/stb/stb17)) and tried it. I was surprised to find that it gave a different result than I got using -pwcorr- and -corrci- (net sj 21-3 pr0041_4). I became a bit obsessed with figuring out why that was happening. To make a long story short, I believe that the -pbis- package uses the sample standard deviation (with n-1 in the denominator) in an equation that requires the descriptive (or population) SD (with n in the denominator).

But while I was tinkering around to trace that problem, I discovered something else a bit peculiar: When I compute a point-biserial correlation via -esize-, the sign of the correlation is reversed compared to what I get via -pwcorr- or -corrci-. Also, the 95% CIs from -esize- and -corrci- are not the same. Both of these things puzzle me. Perhaps someone can explain what is going on!

I'll paste below some of my output and the code that generated it.

Cheers,
Bruce

Code:

. sysuse auto, clear
(1978 automobile data)

. * r_pb = Pearson r for one dichotomous and one metric variable.
. * Therefore, it can be obtained via any commmand that computes Pearson r.
. * First, show that -pbis- gives the wrong result.
. pwcorr foreign mpg

             |  foreign      mpg
-------------+------------------
     foreign |   1.0000
         mpg |   0.3934   1.0000

. pbis foreign mpg

(obs= 74)
Np= 22  p= 0.30
Nq= 52  q= 0.70
------------------+------------------+------------------+------------------+
Coef.= 0.3907          t= 3.6018        P>|t| = 0.0006        df=     72

. * I believe that -pbis- uses the sample SD in an equation that
. * requires the descriptive (or population) SD.  If so, I should
. * be able to duplicate that incorrect result.
. * Use stored results from the -ttest- command
. quietly ttest mpg, by(foreign)

. local M1 = r(mu_2)

. local n1 = r(N_2)

. local M0 = r(mu_1)

. local n0 = r(N_1)

. local n = r(N_1) + r(N_2)

. local sigma = r(sd)*sqrt((`n'-1)/`n') // descriptive SD

. local s = r(sd)                       // sample SD

. local rpb1 = ((`M1'-`M0')/`sigma')*sqrt(`n1'*`n0'/`n'^2)

. local rpb2 = ((`M1'-`M0')/`s')*sqrt(`n1'*`n0'/`n'^2)

. display _newline ///
> "Using descriptive SD, r_pb = " `rpb1' _newline ///
> "Using the sample SD,  r_pb = " `rpb2'

Using descriptive SD, r_pb = .39339742
Using the sample SD,  r_pb = .39073028

. * The first result matches what I get with -pwcorr-;
. * the second result matches what I get with -pbis-.
.
. * Now try -esize-
. esize twosample mpg, by(foreign) pbcorr

Effect size based on mean comparison

                               Obs per group:
                                    Domestic =         52
                                     Foreign =         22
---------------------------------------------------------
        Effect size |   Estimate     [95% conf. interval]
--------------------+------------------------------------
   Point-biserial r |  -.3933974     -.555367   -.1821459
---------------------------------------------------------

. * -esize- gives the same magnitude, but with the opposite sign.
. * If I reverse-code, presumably, I'll get the same sign too.
. generate byte domestic = !foreign

. esize twosample mpg, by(domestic) pbcorr

Effect size based on mean comparison

                               Obs per group:
                                 domestic==0 =         22
                                 domestic==1 =         52
---------------------------------------------------------
        Effect size |   Estimate     [95% conf. interval]
--------------------+------------------------------------
   Point-biserial r |   .3933974     .1821459     .555367
---------------------------------------------------------

. * Finally, use -corrci- to get the 95% CI
. corrci mpg foreign  

(obs=74)

                  correlation and 95% limits
mpg     foreign      0.393    0.181    0.571

. matrix list r(corr) // show more decimals

symmetric r(corr)[2,2]
               mpg    foreign
    mpg          1
foreign  .39339742          1

. * The correlation matches what I get from -corrci-, but the CI does not.
. * I do not understand why the CIs are different.

Code:

sysuse auto, clear
* r_pb = Pearson r for one dichotomous and one metric variable.
* Therefore, it can be obtained via any commmand that computes Pearson r.
* First, show that -pbis- gives the wrong result.
pwcorr foreign mpg
pbis foreign mpg
* I believe that -pbis- uses the sample SD in an equation that
* requires the descriptive (or population) SD.  If so, I should
* be able to duplicate that incorrect result.
* Use stored results from the -ttest- command
quietly ttest mpg, by(foreign)
local M1 = r(mu_2)
local n1 = r(N_2)
local M0 = r(mu_1)
local n0 = r(N_1)
local n = r(N_1) + r(N_2)
local sigma = r(sd)*sqrt((`n'-1)/`n') // descriptive SD
local s = r(sd)                       // sample SD
local rpb1 = ((`M1'-`M0')/`sigma')*sqrt(`n1'*`n0'/`n'^2)
local rpb2 = ((`M1'-`M0')/`s')*sqrt(`n1'*`n0'/`n'^2)
display _newline ///
"Using descriptive SD, r_pb = " `rpb1' _newline ///
"Using the sample SD,  r_pb = " `rpb2'
* The first result matches what I get with -pwcorr-;
* the second result matches what I get with -pbis-.

* Now try -esize-
esize twosample mpg, by(foreign) pbcorr
* -esize- gives the same magnitude, but with the opposite sign.
* If I reverse-code, presumably, I'll get the same sign too.
generate byte domestic = !foreign
esize twosample mpg, by(domestic) pbcorr
* Finally, use -corrci- to get the 95% CI
corrci mpg foreign  
matrix list r(corr) // show more decimals
* The correlation matches what I get from -corrci-, but the CI does not.
* I do not understand why the CIs are different.

--
Bruce Weaver
Email: [email protected]
Version: Stata/MP 19.5 (Windows)

Tags: None

Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2403
#2

18 Nov 2022, 16:09

The point-biserial correlation is just a special case of the product-moment correlation (Pearson's correlation) where one variable is binary.

You're right that there is a difference in using the sample vs population standard deviation estimate, which will cause the point estimate the change. This is inconsequential with large samples. The differences in the CI from -esize- can be explained by the use of the non-central t-distribution (see the Methods and Formulas section of the PDF help for -esize-) versus the other commands using the central t-distribution. The non-central t-distribution is more exact from a coverage perspective. Again, this will be inconsequential in large samples.

-esize- uses -ttest- in the underlying subroutine, because it is viewing the correlation as a difference in means. As such, the default ordering of the group variable from -ttest- is applied in -esize-.

Last edited by Leonardo Guizzetti; 18 Nov 2022, 16:12.
Comment
Bruce Weaver

Join Date: May 2014

Posts: 1139
#3

19 Nov 2022, 07:12

Thanks Leonardo Guizzetti. Your reply helped me clarify in my own mind that the point estimate of r_pb is exactly the same as the point estimate for Pearson r computed on the same data. But that does not imply that the CIs must be the same. Another example of this can be seen when comparing Pearson r to the slope of the least squares regression line after both X and Y have been converted to z-scores. The point estimates are the same, but the 95% CI for the slope in the regression output is not the correct CI for Pearson r (despite the terrible advice given in this YouTube video*). For one thing, it is symmetrical, and can have limits that fall outside the range -1 to 1.

Thanks to your explanation, I now also understand why -esize- is reporting r_pb with the opposite sign to what I get with pwcorr. But surely, r_pb is a correlation. So like any correlation, it ought to be positive when an increase in X is associated with an increase in the fitted value of Y, as it is in my example using X=foreign and Y=mpg. If I generate a scatter-plot for that example...

Code:

scatter mpg foreign || lfit mpg foreign , xlabel(0(1)1)

...the slope is positive. IMO, therefore, -esize- is reporting the wrong sign for r_pb. YMMV.

Cheers,
Bruce

* In case anyone is wondering, since that YouTube video was made, a Python extension command (STATS CORRELATIONS) was added to SPSS to compute CIs for Pearson correlations. It was written by now-retired IBM employee Jon Peck. The help file is pasted below. (Jon was very kind to acknowledge me for some minimal assistance.)

Bivariate Correlations with Confidence Intervals

This procedure computes bivariate correlations and asymptotic (Fisher) confidence intervals. Variables with a nominal measurement level are excluded.

Variables Choose the variables for a square correlation matrix, i.e., the correlation of each variable with every other is produced.

With Variables If variables are put in this list, then only the correlations of the variables in the first list with the variables listed here are produced. For example, if there are five variables in the first list and two in the second, ten correlations are produced.

Confidence Level Enter the desired confidence level for the confidence interval as a percentage.

Confidence Interval Estimation Choose Fisher to use the Fisher transformation based on the approximate normality of the Z-transform of the correlation. Choose Bootstrap to calculate bootstrap confidence intervals.

•Bootstrap requires that the bootstrap option is licensed. Otherwise, an error will occur.

User Missing Values Choose pairwise deletion to maximize the use of the data or listwise deletion for a consistent case base for each correlation. You can alternatively choose to treat missing values as valid. System-missing values (sysmis) are always excluded.

SPLIT FILES is supported.

Thanks to Bruce Weaver for assistance with this procedure.

Additional Features

Run the command
STATS CORRELATIONS /HELP.
or in Statistics version 23 or later press F1 with the cursor in the command to display the syntax help.

TO and ALL are supported in variable lists in Statistics 23 or later.

Requirements

This command requires the Python Programmability Plug-in,

© Copyright IBM Corp. 2016

Last edited by Bruce Weaver; 19 Nov 2022, 07:29. Reason: Fixed some formatting.

--
Bruce Weaver
Email: [email protected]
Version: Stata/MP 19.5 (Windows)
Comment
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2403
#4

19 Nov 2022, 08:19

Originally posted by Bruce Weaver View Post

But that does not imply that the CIs must be the same.

I didn't say that it did. Two examples of this are the Fisher's Z (atanh) transformation and the use of a non-central t-distribution, both will produce asymmetric intervals on the correlation scale, while a "naive" Wald-type interval can also produce intervals that exceed -1 or +1.

CIs will also differ because the point estimate is used in the computation of the intervals, yet one more underlying source of difference.

For -esize-, I understand that it's really just computing Cohen's d effect size (or some transformations thereof) so in that sense, a negative value is not unreasonable. But, if you were expecting a correlation, this can be misleading.
Comment
Bruce Weaver

Join Date: May 2014

Posts: 1139
#5

19 Nov 2022, 09:36

Originally posted by Leonardo Guizzetti View Post

I didn't say that it did.

Leonardo, I think you misunderstood me. I was not claiming that you did say that. My statement was addressed (mainly) to myself as a reminder that just because two different ways of calculating a point estimate yield exactly the same point estimate (e.g., Pearson r and OLS slope after X and Y have been converted to z-scores), that does not imply that the CI computed using one method will be correct for the point estimate of interest. Again, I certainly did not mean to suggest that you said otherwise.

Re the other point, yes, I think it is very misleading and confusing to have a "correlation" for which a positive value means that in increase in X is associated with a decrease in the fitted value of Y.

Cheers,
Bruce

--
Bruce Weaver
Email: [email protected]
Version: Stata/MP 19.5 (Windows)
Comment
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2403
#6

19 Nov 2022, 10:37

Ah I understand you now.
1 like
Comment

Bruce Weaver

Join Date: May 2014
Posts: 1139

01 May 2023, 13:31

UPDATE: I now have Stata 18, and esize with pbcorr is still showing the wrong sign on the point-biserial correlation. This should be a simple problem to fix, so I hope to see it done soon.

Code:

. clear *

. sysuse auto
(1978 automobile data)

. twoway scatter mpg foreign || lfit mpg foreign, ///
> xlab(0 "Domestic" 1 "Foreign")

Click image for larger version

Name: rpb_plot.png
Views: 1
Size: 36.8 KB
ID: 1711942

Code:

. esize twosample mpg, by(foreign) pbcorr

Effect size based on mean comparison

                               Obs per group:
                                    Domestic =         52
                                     Foreign =         22
---------------------------------------------------------
        Effect size |   Estimate     [95% conf. interval]
--------------------+------------------------------------
   Point-biserial r |  -.3933974     -.555367   -.1821459
---------------------------------------------------------

. * Scatter-plot shows a positive point-biserial correlation,
. * but -esize- shows a negative correlation.
. * I have to reverse-code the dichotomous variable
. * to get the correct sign on the pb correlation!
. generate byte domestic = !foreign

. esize twosample mpg, by(domestic) pbcorr

Effect size based on mean comparison

                               Obs per group:
                                 domestic==0 =         22
                                 domestic==1 =         52
---------------------------------------------------------
        Effect size |   Estimate     [95% conf. interval]
--------------------+------------------------------------
   Point-biserial r |   .3933974     .1821459     .555367
---------------------------------------------------------

Code:

clear *
sysuse auto
twoway scatter mpg foreign || lfit mpg foreign, ///
xlab(0 "Domestic" 1 "Foreign")
esize twosample mpg, by(foreign) pbcorr
* Scatter-plot shows a positive point-biserial correlation,
* but -esize- shows a negative correlation.
* I have to reverse-code the dichotomous variable
* to get the correct sign on the pb correlation!
generate byte domestic = !foreign
esize twosample mpg, by(domestic) pbcorr

--
Bruce Weaver
Email: [email protected]
Version: Stata/MP 19.5 (Windows)

Announcement

Methods for computing the point-biserial correlation

Comment

Comment

Comment

Comment

Comment

Comment