  • point biserial correlation vs pearson's r

    hi guys!

    i have one continuous variable and one dichotomous one. predictably, the point biserial correlation (esize twosample) and pearson's r (correlate) give me a similar value, yet of opposite sign. my intuition is that when computing the point biserial what is compared is group 2 (corresponding to value 1 of the dichotomous) with group 1 (corresponding to value 0 of the dichotomous). is that correct? which sign is then correctly indicating the association? my money is on pearson's r.

    cheers,
    natalia
    Last edited by natalia malancu; 18 Dec 2014, 08:21.

  • #2
    Update: flipped the dummy, reran Pearson's r, and came to the same conclusion as for the point biserial. My intuition has no substance, but it also means the point biserial estimate is misleading.


    • #3
      All I can say is "interesting"! and hope somebody can chime in on why it would reverse the sign. Have you considered the package -polychoric-, which gives the biserial, with a higher value (as one would expect) and in the expected direction? -findit polychoric-

      Code:
      clear
      set seed 1971
      set obs 1000
      gen x1=rnormal()
      gen x2=rnormal()+x1
      replace x2=0 if x2<0
      replace x2=1 if x2>0
      esize twosample x1, by(x2) pbc
      corr
      polychoric x1 x2
      Last edited by ben earnhart; 18 Dec 2014, 10:24.


      • #4
        hi ben!

        tried polychoric as you suggested and the direction is aligned with correlate x1 x2
        also tried a different continuous-dichotomous variable combination and noticed the same thing - correlate and polychoric give the same direction and more or less the same magnitude (as expected), while esize twosample x1, by(x2), with either the pbc or the all option, gives the same magnitude with the opposite sign - so yeah... interesting is the right word.


        • #5
          I found the package -pbis-, which is dedicated to the point-biserial. It agrees with -corr- and -polychoric-. So it seems like esize is broken? Hard to believe for a default Stata package, but I tried it with several different real and generated datasets, and it (esize) got the direction wrong. Big triple-hmm.


          • #6
            Hopefully stata.corp will provide us with an answer


            • #7
              Originally posted by natalia malancu
              Hopefully stata.corp will provide us with an answer

              StataCorp already did, more or less.

              Take a look at Example 1 in the user's manual entry for esize. You can see that the command is set up like ttest: it takes the difference in the outcome variable as the first group minus the second, so the sign of the point-biserial correlation follows the sign of that difference.

              The parallel will be evident after running the following.
              Code:
              sysuse auto, clear
              
              table foreign, contents(mean mpg)
              
              ttest mpg, by(foreign)
              // From the Methods and Formulas section of the manual's entry for esize:
              display in smcl as text "rPB = " as result r(t) / sqrt(r(t)^2 + r(df_t))
              
              esize twosample mpg, by(foreign) pbcorr
              
              correlate mpg foreign
              
              exit
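
              A minimal sketch of why the signs differ, assuming the auto dataset used above and the default pooled-variance ttest. ttest reports mean(group 0) - mean(group 1), so r(t), and the point-biserial value built from it, is negative whenever the group coded 1 has the higher mean, whereas correlate gives a positive r in that case. The scalar names r_pb and r_xy are only illustrative.

              ```stata
              sysuse auto, clear

              // Point-biserial from the t statistic (formula quoted above);
              // its sign follows mean(foreign==0) - mean(foreign==1)
              quietly ttest mpg, by(foreign)
              scalar r_pb = r(t) / sqrt(r(t)^2 + r(df_t))

              // Pearson's r treats foreign as a 0/1 numeric variable,
              // so its sign follows mean(foreign==1) - mean(foreign==0)
              quietly correlate mpg foreign
              scalar r_xy = r(rho)

              display "r_pb from ttest     = " scalar(r_pb)
              display "Pearson r           = " scalar(r_xy)
              display "sum (should be ~ 0) = " scalar(r_pb) + scalar(r_xy)
              ```

              If the two values are equal in magnitude and opposite in sign, as this sketch should show, then negating the esize result (or recoding the grouping variable) recovers the sign that correlate reports.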


              • #8
                Code:
                clear
                set obs 1000
                set seed 1971
                gen x1=rnormal()
                gen x2=rnormal()+x1
                replace x2=0 if x2<=0
                replace x2=1 if x2>0
                ttest x1, by(x2)
                So esize and ttest "test" in what seems to me to be a counter-intuitive direction. When using a t-test or effect size, one generally looks at the substance, then the size, then the significance. So it might take a second glance, but the direction is arbitrary. But for a "correlation" it's pretty darn counter-intuitive. x1 higher generally is interpreted as x2 higher. Here I stand, I can do no other. For a correlation, + goes with +, - goes with -.


                • #9
                  I'm not disagreeing with you, just pointing out that the direction / sign for esize twosample, by() pbcorr appears to be intended, per the documentation (namely, Example 1 in the user's manual entry for the command, where the sign and direction of the difference are mentioned), and I'm guessing that is where the command's convention for direction / sign comes from.


                  • #10
                    hi guys!

                    sorry to get back on this after so long, but I have one other related question.

                    i'm having a hard time understanding a difference in magnitude:
                    correlate x1 x2 gives me a value of -0.5494
                    esize twosample x1, by(x2) pbc a value of .5494018
                    pbis x2 x1 a value of -0.5493
                    polychoric x1 x2 a Rho of -.04736496 (S.e. .00312193)

                    What am I missing here with regard to the magnitude returned by polychoric?


                    • #11
                      Correct me if I am wrong, but these two effect sizes should be exactly equal, as in the Pearson product-moment correlation should yield an equivalent result with one continuous and one dichotomous variable. I wonder why these two are different?


                      • #12
                        hence my question. polychoric should return more or less a similar value


                        • #13
                          changed machine, changed stata version, to no avail. any suggestions whatsoever? what am I missing?


                          • #14
                            Well, here's something to consider:

                            First, the two commands compute fundamentally different things—one is a point-biserial correlation coefficient and the other a biserial (polyserial) correlation coefficient.

                            Second, while the latter is typically larger than the former, they have different assumptions regarding properties of the distribution of the data.

                            So, maybe your particular dataset violates one or both of the coefficients' distributional assumptions and the unexpected values that you've got reflect that.
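
                            To make the "typically larger" point concrete, here is a rough sketch using the classical textbook conversion r_b = r_pb * sqrt(p*q) / h, where p is the proportion coded 1, q = 1 - p, and h is the standard normal density at the cut point. This is an assumption for illustration: -polychoric- estimates its coefficient differently, so its output need not match this value exactly. The simulated data and the scalar names are hypothetical.

                            ```stata
                            clear
                            set seed 1971
                            set obs 1000
                            generate x1 = rnormal()
                            // dichotomize a latent normal variable that is correlated with x1
                            generate byte x2 = (rnormal() + x1 > 0)

                            // point-biserial = Pearson's r with a 0/1 variable
                            quietly correlate x1 x2
                            scalar r_pb = r(rho)

                            // proportion coded 1
                            quietly summarize x2
                            scalar p = r(mean)

                            // classical biserial conversion; h = normal density at the cut point
                            scalar h   = normalden(invnormal(scalar(p)))
                            scalar r_b = scalar(r_pb) * sqrt(scalar(p) * (1 - scalar(p))) / scalar(h)

                            display "point-biserial      = " scalar(r_pb)
                            display "biserial (classical) = " scalar(r_b)
                            ```

                            Because sqrt(p*q)/h is never below about 1.25, this conversion always gives a biserial value at least as large in magnitude as the point-biserial, so a biserial estimate far smaller than the point-biserial is worth a closer look at the data and the command's assumptions.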


                            • #15
                              hi joseph!

                              thanks for this.
                              re assumption: definitely a problem. i'm dealing with 2 time periods, and in one of them the continuous variable doesn't take the top 3 values it does in the other one (say in period 0: values 1-20, in period 1: values 1-17). given the nature of the variables, i thought the biserial correlation made most sense, but given the distribution, i don't know which to use anymore.
