xicor - A new coefficient of correlation

ericmelse

Join Date: May 2014

Posts: 436
#1

xicor - A new coefficient of correlation

26 Apr 2024, 08:15

Is anyone aware of work being done in the Stata community to implement xicor - a measure of association through cross rank increments, supposedly A new coefficient of correlation, published by Sourav Chatterjee in the Journal of the American Statistical Association - in a Stata ado package?
I am interested should it be available.

It is available for Python and for R.

Also note the post about it by Tim Sumner in Towards Data Science.

http://publicationslist.org/eric.melse
Tags: None
Dirk Enzmann

Join Date: Apr 2014

Posts: 574
#2

26 Apr 2024, 16:23

This should definitely be something for the Wishlist (i.e. for Stata Corp.)!
Comment

Scott Merryman

Join Date: Mar 2014
Posts: 896

26 Apr 2024, 20:12

In the Data Science article it lists the R code as:

Code:

xicor <- function(X, Y, ties = TRUE){
  n <- length(X)
  r <- rank(Y[order(X)], ties.method = "random")
  set.seed(42)
  if(ties){
    l <- rank(Y[order(X)], ties.method = "max")
    return( 1 - n*sum( abs(r[-1] - r[-n]) ) / (2*sum(l*(n - l))) )
  } else {
    return( 1 - 3 * sum( abs(r[-1] - r[-n]) ) / (n^2 - 1) )    
  }
}

In Mata I think it would be (using Ben Jann's moremata functions and excluding if there are ties):

Code:

mata: 
rseed(1234)
    n = length(x)
    r = mm_ranks(y[mm_order(x,1,1)],1,0)
    xicor = (1 - (3 * colsum(abs(r[2..n] - r[1..n-1])))/ (n^2 -1))
    xicor
end

Testing, using figure 2 (g) from the article

Code:

. range x -1 1 100
Number of observations (_N) was 0, now 100.

. // gen y = x  /* xicor = 0.97 */
. //gen y = x^2 /* xicor = 0.941 */
. gen y = sin(6.1*x)  /* xicor = 0.885 */
. putmata x = x y = y
(2 vectors posted)

. mata: 
------------------------------------------------- mata (type end to exit) ----------------------
: rseed(1234)

:         n = length(x)

:         r = mm_ranks(y[mm_order(x,1,1)],1,0)

:         xicor = (1 - (3 * colsum(abs(r[2..n] - r[1..n-1])))/ (n^2 -1))

:         xicor
  .8850885089

: end
------------------------------------------------------------------------------------------------

Testing the same data in R:

Code:

. rcall: (y = st.var(x))
. rcall: (x = st.var(y))
. rcall: library(XICOR)

. rcall: xicor(y, x)
[1] 0.8850885

Comment

daniel klein

Join Date: Mar 2014

Posts: 3886
#4

29 Apr 2024, 06:11

I thought this might be a fun exercise for the weekend. Turns out, it isn't so much fun.

I believe the R code in the blog that Scott Merryman cites got it wrong. As I understand the original authors as well as their R code, ties in the ranks of \(Y\), (\(r\)), should not be broken randomly. Instead, they are assigned the highest rank in the group, just like the \(l\). The difference is that the \(r\) are sorted in ascending order while the \(l\) are sorted in descending order. The only ties broken randomly are the ties in \(X\). Consequentially, the cited R code yields random results even when variables are correlated with themselves.

By the way, I can't wrap my head around a (supposedly standardized) measure of association that does not equal 1 when variables are compared to themselves, but that might be just me.

Another thing that worries me is the coefficient's dependence on sort order. Note that in Stata, setting the random seed would not suffice for reproducible results. You will have to set sortseed; something you really should not do. Even then, the results when computing more than one coefficient will depend on the implementation details. I have written a version that produces a matrix of results. Thus results depend on the order in which I fill the matrix.

Here are the results for the examples in #3 (lower triangle):

Code:

. clear . range x -1 1 100 number of observations (_N) was 0, now 100 . . gen y1 = x /* xicor = 0.97 */ . gen y2 = x^2 /* xicor = 0.941 */ . gen y3 = sin(6.1*x) /* xicor = 0.885 */ . . . xicor x y1 (Obs=100) | x y1 -------------+---------------------- x | .970297 y1 | .970297 .970297 . xicor x y2 (Obs=100) | x y2 -------------+---------------------- x | .970297 -.2586259 y2 | .9411765 .9705882 . xicor x y3 (Obs=100) | x y3 -------------+---------------------- x | .970297 .0546055 y3 | .8850885 .970297 . . xicor x y1 y2 y3 (Obs=100) | x y1 y2 y3 -------------+-------------------------------------------- x | .970297 .970297 -.1059106 .0546055 y1 | .970297 .970297 -.1059106 .0546055 y2 | .9411765 .9411765 .9705882 -.1104442 y3 | .8850885 .8850885 -.1248125 .970297 . end of do-file

Notice that when ties are present, as in y2, not only will results differ from run to run -- which you could "fix" with set sortseed --; results also differ between obtaining one (or two) coefficients at a time, and the complete matrix of coefficients.

I have no ide against which results I should certify the thing but here is the code, anyway.
Attached Files

xicor.ado (2.3 KB, 1 view)
1 like
Comment

Dirk Enzmann

Join Date: Apr 2014
Posts: 574

29 Apr 2024, 10:43

I tried to replicate the example from the Chatterjee (2021) article:

Code:

// Example: Galton's peas data (R package psychTools 2.4.3)
//          (see Chatterjee, 2021, p. 2011-2012)

clear
input parent child freq
  15  13.77  46  
  15  14.77  14  
  15  15.77   9  
  15  16.77  11  
  15  17.77  14  
  15  18.77   4  
  15  19.77   2  
  16  14.28  34  
  16  15.28  15  
  16  16.28  18  
  16  17.28  16  
  16  18.28  13  
  16  19.28   3  
  16  20.28   1  
  17  13.92  37  
  17  14.92  16  
  17  15.92  13  
  17  16.92  16  
  17  17.92  13  
  17  18.92   4  
  17  19.92   1  
  18  14.35  34  
  18  15.35  12  
  18  16.35  13  
  18  17.35  17  
  18  18.35  16  
  18  19.35   6  
  18  20.35   2  
  19  14.07  35  
  19  15.07  16  
  19  16.07  12  
  19  17.07  13  
  19  18.07  11  
  19  19.07  10  
  19  20.07   2  
  19  22.07   1  
  20  14.66  23  
  20  15.66  10  
  20  16.66  12  
  20  17.66  17  
  20  18.66  20  
  20  19.66  13  
  20  20.66   3  
  20  22.66   2  
  21  14.67  22  
  21  15.67   8  
  21  16.67  10  
  21  17.67  18  
  21  18.67  21  
  21  19.67  13  
  21  20.67   6  
  21  22.67   2  
end

expand freq

corr parent child
xicor parent child

The result is:

Code:

. corr parent child
(obs=700)

             |   parent    child
-------------+------------------
      parent |   1.0000
       child |   0.3463   1.0000


. xicor parent child
(Obs=700)

             |    parent      child
-------------+---------------------
      parent |    .99625      .9225
       child |  .0944753   .9958986

Note that the size of the coefficient xi depends on the sample size: Increasing n increases xi. Hence, with "small" n the correlation of (X, X) is less than 1. Is this useful?

Reference: Chatterjee, S. (2021). A new coefficient of correlation. Journal of the American Statistical Association, 116(536), 2009–2022. https://doi.org/10.1080/01621459.2020.1758115

Comment

daniel klein

Join Date: Mar 2014

Posts: 3886
#6

29 Apr 2024, 12:59

Running the (expanded) example in #5 10,000 times, I obtain \(\xi(X,Y)\) between 0.04 and 0.22 as a result of randomly breaking ties (assuming my code is correct). I wonder how useful that is.
Comment
Felix Bittmann

Join Date: Aug 2018

Posts: 737
#7

30 Apr 2024, 02:14

Very interesting discussion but the results so far are highly suspicious IMHO. Personally, I think a correlation coefficient should not need any randomness or seed setting... If this is an inherent feature of xi, I would be reluctant to use it in substantive research.

Best wishes

Stata 18.0 MP | ORCID | Google Scholar
Comment
Andrew Musau

Join Date: Oct 2014

Posts: 10282
#8

30 Apr 2024, 02:34

Originally posted by daniel klein View Post

By the way, I can't wrap my head around a (supposedly standardized) measure of association that does not equal 1 when variables are compared to themselves, but that might be just me.

Another thing that worries me is the coefficient's dependence on sort order. Note that in Stata, setting the random seed would not suffice for reproducible results. You will have to set sortseed; something you really should not do. Even then, the results when computing more than one coefficient will depend on the implementation details. I have written a version that produces a matrix of results. Thus results depend on the order in which I fill the matrix.

I haven't followed the details closely, but these are good points. Maybe worth sending a comment to JASA with these illustrations.
Comment
daniel klein

Join Date: Mar 2014

Posts: 3886
#9

30 Apr 2024, 03:03

I have skimmed through related literature, and from what I understand (which, granted, is not all that much), the mathematical foundations are solid. The author also discusses some of my points in their paper. I guess JASA is fine with this. Still, I fail to see how and when to apply \(\xi\), given its fluctuation even in moderate sample sizes and the difficulties of reproducing results without going through the software implementation details.
2 likes
Comment
daniel klein

Join Date: Mar 2014

Posts: 3886
#10

30 Apr 2024, 06:32

Anyone interested in playing around with this can now download an updated version of the command in #4 from GitHub. In Stata type

Code:

net install xicor , from(https://raw.githubusercontent.com/kleindaniel81/xicor/master)

to do this.

The update fixes a bug with constants and changes the orientation of the results matrix; rows are now Xs and columns are Ys. Thus, you will find \(\xi(X_i,Y_j)\) in the ith row and jth column of r(xi). Yes, the returned matrix has a different name now (all lowercase).

Last edited by daniel klein; 30 Apr 2024, 06:35.
1 like
Comment

Announcement