Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • xicor - A new coefficient of correlation

    Is anyone aware of work being done in the Stata community to implement xicor - a measure of association through cross rank increments, supposedly A new coefficient of correlation, published by Sourav Chatterjee in the Journal of the American Statistical Association - in a Stata ado package?
    I am interested should it be available.

    It is available for Python and for R.

    Also note the post about it by Tim Sumner in Towards Data Science.
    http://publicationslist.org/eric.melse

  • #2
    This should definitely be something for the Wishlist (i.e. for Stata Corp.)!

    Comment


    • #3
      In the Data Science article it lists the R code as:
      Code:
      xicor <- function(X, Y, ties = TRUE){
        n <- length(X)
        r <- rank(Y[order(X)], ties.method = "random")
        set.seed(42)
        if(ties){
          l <- rank(Y[order(X)], ties.method = "max")
          return( 1 - n*sum( abs(r[-1] - r[-n]) ) / (2*sum(l*(n - l))) )
        } else {
          return( 1 - 3 * sum( abs(r[-1] - r[-n]) ) / (n^2 - 1) )    
        }
      }
      In Mata I think it would be (using Ben Jann's moremata functions and excluding if there are ties):
      Code:
      mata: 
      rseed(1234)
          n = length(x)
          r = mm_ranks(y[mm_order(x,1,1)],1,0)
          xicor = (1 - (3 * colsum(abs(r[2..n] - r[1..n-1])))/ (n^2 -1))
          xicor
      end
      Testing, using figure 2 (g) from the article
      Code:
      . range x -1 1 100
      Number of observations (_N) was 0, now 100.
      
      . // gen y = x  /* xicor = 0.97 */
      . //gen y = x^2 /* xicor = 0.941 */
      . gen y = sin(6.1*x)  /* xicor = 0.885 */
      . putmata x = x y = y
      (2 vectors posted)
      
      . mata: 
      ------------------------------------------------- mata (type end to exit) ----------------------
      : rseed(1234)
      
      :         n = length(x)
      
      :         r = mm_ranks(y[mm_order(x,1,1)],1,0)
      
      :         xicor = (1 - (3 * colsum(abs(r[2..n] - r[1..n-1])))/ (n^2 -1))
      
      :         xicor
        .8850885089
      
      : end
      ------------------------------------------------------------------------------------------------
      Testing the same data in R:
      Code:
      . rcall: (y = st.var(x))
      . rcall: (x = st.var(y))
      . rcall: library(XICOR)
      
      . rcall: xicor(y, x)
      [1] 0.8850885

      Comment


      • #4
        I thought this might be a fun exercise for the weekend. Turns out, it isn't so much fun.

        I believe the R code in the blog that Scott Merryman cites got it wrong. As I understand the original authors as well as their R code, ties in the ranks of \(Y\), (\(r\)), should not be broken randomly. Instead, they are assigned the highest rank in the group, just like the \(l\). The difference is that the \(r\) are sorted in ascending order while the \(l\) are sorted in descending order. The only ties broken randomly are the ties in \(X\). Consequentially, the cited R code yields random results even when variables are correlated with themselves.

        By the way, I can't wrap my head around a (supposedly standardized) measure of association that does not equal 1 when variables are compared to themselves, but that might be just me.

        Another thing that worries me is the coefficient's dependence on sort order. Note that in Stata, setting the random seed would not suffice for reproducible results. You will have to set sortseed; something you really should not do. Even then, the results when computing more than one coefficient will depend on the implementation details. I have written a version that produces a matrix of results. Thus results depend on the order in which I fill the matrix.

        Here are the results for the examples in #3 (lower triangle):

        Code:
        . clear
        
        . range x -1 1 100
        number of observations (_N) was 0, now 100
        
        .
        . gen y1 = x          /* xicor = 0.97 */
        
        . gen y2 = x^2        /* xicor = 0.941 */
        
        . gen y3 = sin(6.1*x) /* xicor = 0.885 */
        
        .
        .
        . xicor x y1
        (Obs=100)
        
                     |         x         y1
        -------------+----------------------
                   x |   .970297            
                  y1 |   .970297    .970297
        
        . xicor x y2
        (Obs=100)
        
                     |         x         y2
        -------------+----------------------
                   x |   .970297  -.2586259
                  y2 |  .9411765   .9705882
        
        . xicor x y3
        (Obs=100)
        
                     |         x         y3
        -------------+----------------------
                   x |   .970297   .0546055
                  y3 |  .8850885    .970297
        
        .
        . xicor x y1 y2 y3
        (Obs=100)
        
                     |         x         y1         y2         y3
        -------------+--------------------------------------------
                   x |   .970297    .970297  -.1059106   .0546055
                  y1 |   .970297    .970297  -.1059106   .0546055
                  y2 |  .9411765   .9411765   .9705882  -.1104442
                  y3 |  .8850885   .8850885  -.1248125    .970297
        
        .
        end of do-file
        Notice that when ties are present, as in y2, not only will results differ from run to run -- which you could "fix" with set sortseed --; results also differ between obtaining one (or two) coefficients at a time, and the complete matrix of coefficients.

        I have no ide against which results I should certify the thing but here is the code, anyway.
        Attached Files

        Comment


        • #5
          I tried to replicate the example from the Chatterjee (2021) article:
          Code:
          // Example: Galton's peas data (R package psychTools 2.4.3)
          //          (see Chatterjee, 2021, p. 2011-2012)
          
          clear
          input parent child freq
            15  13.77  46  
            15  14.77  14  
            15  15.77   9  
            15  16.77  11  
            15  17.77  14  
            15  18.77   4  
            15  19.77   2  
            16  14.28  34  
            16  15.28  15  
            16  16.28  18  
            16  17.28  16  
            16  18.28  13  
            16  19.28   3  
            16  20.28   1  
            17  13.92  37  
            17  14.92  16  
            17  15.92  13  
            17  16.92  16  
            17  17.92  13  
            17  18.92   4  
            17  19.92   1  
            18  14.35  34  
            18  15.35  12  
            18  16.35  13  
            18  17.35  17  
            18  18.35  16  
            18  19.35   6  
            18  20.35   2  
            19  14.07  35  
            19  15.07  16  
            19  16.07  12  
            19  17.07  13  
            19  18.07  11  
            19  19.07  10  
            19  20.07   2  
            19  22.07   1  
            20  14.66  23  
            20  15.66  10  
            20  16.66  12  
            20  17.66  17  
            20  18.66  20  
            20  19.66  13  
            20  20.66   3  
            20  22.66   2  
            21  14.67  22  
            21  15.67   8  
            21  16.67  10  
            21  17.67  18  
            21  18.67  21  
            21  19.67  13  
            21  20.67   6  
            21  22.67   2  
          end
          
          expand freq
          
          corr parent child
          xicor parent child
          The result is:
          Code:
          . corr parent child
          (obs=700)
          
                       |   parent    child
          -------------+------------------
                parent |   1.0000
                 child |   0.3463   1.0000
          
          
          . xicor parent child
          (Obs=700)
          
                       |    parent      child
          -------------+---------------------
                parent |    .99625      .9225
                 child |  .0944753   .9958986
          Note that the size of the coefficient xi depends on the sample size: Increasing n increases xi. Hence, with "small" n the correlation of (X, X) is less than 1. Is this useful?

          Reference: Chatterjee, S. (2021). A new coefficient of correlation. Journal of the American Statistical Association, 116(536), 2009–2022. https://doi.org/10.1080/01621459.2020.1758115

          Comment


          • #6
            Running the (expanded) example in #5 10,000 times, I obtain \(\xi(X,Y)\) between 0.04 and 0.22 as a result of randomly breaking ties (assuming my code is correct). I wonder how useful that is.

            Comment


            • #7
              Very interesting discussion but the results so far are highly suspicious IMHO. Personally, I think a correlation coefficient should not need any randomness or seed setting... If this is an inherent feature of xi, I would be reluctant to use it in substantive research.
              Best wishes

              (Stata 16.1 MP)

              Comment


              • #8
                Originally posted by daniel klein View Post
                By the way, I can't wrap my head around a (supposedly standardized) measure of association that does not equal 1 when variables are compared to themselves, but that might be just me.

                Another thing that worries me is the coefficient's dependence on sort order. Note that in Stata, setting the random seed would not suffice for reproducible results. You will have to set sortseed; something you really should not do. Even then, the results when computing more than one coefficient will depend on the implementation details. I have written a version that produces a matrix of results. Thus results depend on the order in which I fill the matrix.
                I haven't followed the details closely, but these are good points. Maybe worth sending a comment to JASA with these illustrations.

                Comment


                • #9
                  I have skimmed through related literature, and from what I understand (which, granted, is not all that much), the mathematical foundations are solid. The author also discusses some of my points in their paper. I guess JASA is fine with this. Still, I fail to see how and when to apply \(\xi\), given its fluctuation even in moderate sample sizes and the difficulties of reproducing results without going through the software implementation details.

                  Comment


                  • #10
                    Anyone interested in playing around with this can now download an updated version of the command in #4 from GitHub. In Stata type

                    Code:
                    net install xicor , from(https://raw.githubusercontent.com/kleindaniel81/xicor/master)
                    to do this.

                    The update fixes a bug with constants and changes the orientation of the results matrix; rows are now Xs and columns are Ys. Thus, you will find \(\xi(X_i,Y_j)\) in the ith row and jth column of r(xi). Yes, the returned matrix has a different name now (all lowercase).
                    Last edited by daniel klein; 30 Apr 2024, 06:35.

                    Comment

                    Working...
                    X