  • Correlation and cluster command

    Hi Everyone!
    I am using Stata SE 14.0. I would like to run a correlation between variables (e.g. alcoholuse and depression), but I have a twin sample, which means that my observations are non-independent (twins belonging together are identified by the variable familyID). I need the value of the Pearson's correlation and its p-value, but I cannot use cluster(familyID) together with the correlate command. I know that I can use regressions with cluster(familyID) (the regress command), but then I don't get the value of the Pearson's correlation... any ideas how to work around that problem? Or is that simply not possible? Thank you very much for your help!

  • #2
    With just one predictor, the Pearson correlation is just the square root of the coefficient of determination (R-squared) from the regression, carrying the sign of the slope. Not having a P-value for a calculation that doesn't make full sense does not strike me as a loss. The regression provides more appropriate inferential results.
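    A quick way to see that relationship (a minimal sketch using Stata's auto dataset; the variables are for illustration only, and both are positively correlated so the signs agree):
    Code:
    sysuse auto, clear
    quietly regress price weight
    display "sqrt(R-squared) = " sqrt(e(r2))
    quietly correlate price weight
    display "Pearson r       = " r(rho)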

    • #3
      You can standardize the variables to have mean 0 and unit standard deviation. The regression coefficient will then equal the correlation.
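      A minimal sketch of that idea, borrowing the variable names from the original question (alcoholuse, depression, familyID) purely as placeholders:
      Code:
      * standardize both variables to mean 0 and standard deviation 1
      egen z_alc = std(alcoholuse)
      egen z_dep = std(depression)

      * the slope now equals the Pearson correlation;
      * cluster() gives standard errors that allow for within-family dependence
      regress z_dep z_alc, cluster(familyID)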

      Best
      Daniel

      • #4
        Thank you both very much for your help! That is very useful! Would you also have any idea how to obtain the p-value, or alternatively the confidence intervals, for the Pearson's correlation when I use the regression instead of the correlation? (Silly question @Daniel: would the p-value of the regression be identical to the p-value of the Pearson's correlation if I standardize the variables to have mean 0 and unit standard deviation?)
        Best,
        Mel

        • #5
          I don't see that there can be a separate P-value for the correlation that accounts for the clustering. That was part of my point in #2.

          You need a model for the data-generating process; that model is the regression, and it has a P-value. It also has a corresponding correlation.

          As I understand it, that's as much as you need and as much as you can get.



          • #6
            The p-values will be the same in this case.*

            But I do not fully understand why you want correlations in the first place. Without more information on your ultimate goal, it seems like a family fixed-effects regression approach, or something similar, would probably be preferable.
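            For example, one reading of that suggestion (a minimal sketch, again using the variable names from the original question as placeholders) would be:
            Code:
            * within-family (fixed-effects) regression
            xtset familyID
            xtreg depression alcoholuse, fe vce(cluster familyID)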

            Best
            Daniel

            Edit:

            * This statement needs explanation. If you run a simple regression of y on x, where both y and x have mean 0 and variance 1, then the coefficient will equal the correlation of the two variables. Further, the estimated p-value from the regression model will equal the p-value calculated by Stata's pwcorr command, i.e. the p-value for the correlation. With clustered standard errors this no longer holds true, of course.
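            That equivalence is easy to check, e.g. with the auto data (a minimal sketch; any two variables will do):
            Code:
            sysuse auto, clear
            egen z_price  = std(price)
            egen z_weight = std(weight)

            * with ordinary standard errors, the slope and its p-value
            * match the correlation and the p-value from pwcorr
            regress z_price z_weight
            pwcorr price weight, sig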
            Last edited by daniel klein; 21 Mar 2016, 08:15.

            • #7
              One problem with regress for clustered data is that inference depends on which variable is the predictor and which is the outcome. sem offers one approach to inference about the correlation.
              Code:
              clear
              input pairid  x  y
              1  1 3
              1  2 3
              2  2 6
              2  3 5
              3  4 2
              3  2 8
              4  7 9
              4  9 9
              5  4 12
              5  8 10
              end
              . corr x y

                           |        x        y
              -------------+------------------
                         x |   1.0000
                         y |   0.5861   1.0000

              . reg y x, cluster(pairid)
              
              Linear regression                               Number of obs     =         10
                                                              F(1, 4)           =      13.30
                                                              Prob > F          =     0.0218
                                                              R-squared         =     0.3435
                                                              Root MSE          =     2.9228
                                               (Std. Err. adjusted for 5 clusters in pairid)
              ------------------------------------------------------------------------------
                           |               Robust
                         y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
              -------------+----------------------------------------------------------------
                         x |   .7067039   .1937797     3.65   0.022     .1686852    1.244723
                     _cons |   3.731844   1.276101     2.92   0.043     .1888205    7.274867
              ------------------------------------------------------------------------------
              
              . reg x y, cluster(pairid)
              
              Linear regression                               Number of obs     =         10
                                                              F(1, 4)           =       4.16
                                                              Prob > F          =     0.1111
                                                              R-squared         =     0.3435
                                                              Root MSE          =      2.424
                                               (Std. Err. adjusted for 5 clusters in pairid)
              ------------------------------------------------------------------------------
                           |               Robust
                         x |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
              -------------+----------------------------------------------------------------
                         y |   .4860711   .2383802     2.04   0.111    -.1757784    1.147921
                     _cons |   .9433237   1.386112     0.68   0.534    -2.905141    4.791788
              ------------------------------------------------------------------------------
              /* SEM approach */
              sem (<- y x), vce(cluster pairid) standardized

              Exogenous variables
              Observed:  y x

              Structural equation model                       Number of obs     =         10
              Estimation method    = ml
              Log pseudolikelihood = -47.830929

                                               (Std. Err. adjusted for 5 clusters in pairid)
              ------------------------------------------------------------------------------
                           |               Robust
              Standardized |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
              -------------+----------------------------------------------------------------
                   mean(y) |   2.076584   .4772121     4.35   0.000     1.141265    3.011902
                   mean(x) |   1.569614   .3041468     5.16   0.000     .9734969    2.165731
              -------------+----------------------------------------------------------------
                    var(y) |          1          .                             .           .
                    var(x) |          1          .                             .           .
              -------------+----------------------------------------------------------------
                  cov(y,x) |   .5860958   .1731163     3.39   0.001     .2467941    .9253976
              ------------------------------------------------------------------------------
              One problem: the test and confidence limits are based on a Gaussian distribution. For clustered data, a more appropriate reference distribution may be a t distribution with degrees of freedom equal to the number of clusters minus one.
              Code:
              matrix A = r(table)
              matrix B = A["b","cov(y,x):"] \ A["se","cov(y,x):"]
              scalar r  = B[1,1]                        // correlation (standardized covariance)
              scalar se = B[2,1]                        // cluster-robust standard error
              scalar df = e(N_clust) - 1                // degrees of freedom: clusters minus one
              scalar pvalue = 2*ttail(df, abs(r/se))    // t-based p-value
              scalar k = invttail(df, (100 - c(level))/200)   // t critical value
              scalar llim = r - k*se
              scalar ulim = r + k*se
              The confidence limits above are symmetric about the estimate, and they can fall outside the theoretical range of [-1,1]; here the upper limit does exceed 1. Below is a delta-method approach to an asymmetric interval via Fisher's Z transformation. The applicability of the delta method with such a small n is certainly suspect, but the back-transformed limits do stay within [-1,1].
              Code:
              scalar fish_z  = atanh(r)
              scalar fish_se = se/((1+r)*(1-r))

              /* confidence limits on the transformed scale */
              scalar fish_ll = fish_z - k*fish_se
              scalar fish_ul = fish_z + k*fish_se

              /* back-transformed to the correlation scale */
              scalar rz_ll = tanh(fish_ll)
              scalar rz_ul = tanh(fish_ul)
              scalar list llim rz_ll ulim rz_ul
                    llim =  .1054479
                   rz_ll = -.0603745
                    ulim =  1.0667437
                   rz_ul =  .8861798
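              As a small optional add-on (not part of the code above), the scalars can be shown together in one place:
              Code:
              display "cluster-robust r = " %6.4f r ",  t-based p-value = " %6.4f pvalue
              display c(level) "% CI, symmetric:  " %7.4f llim " to " %7.4f ulim
              display c(level) "% CI, Fisher's Z: " %7.4f rz_ll " to " %7.4f rz_ul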
              Steve Samuels
              Statistical Consulting
              [email protected]

              Stata 14.2

              • #8
                Thank you all for your responses! Sorry for the delay in replying; I was travelling. To clarify, I am doing research in behavioural genetics (structural equation modelling using twin data sets). Here it is standard to report the phenotypic correlations with p-values or confidence intervals for this kind of population before presenting the results of the multivariate models, which is why I need that kind of information. The approach Steve suggests looks interesting! I will give it a try!
                Best
                Mel

                • #9
                  Let us know how it turns out. Since \((1-r)(1+r)=1-r^2\), the code for the delta-method standard error of Fisher's Z can also be written as:
                  Code:
                  scalar fish_se = se/(1-r^2)
                  Steve Samuels
                  Statistical Consulting
                  [email protected]

                  Stata 14.2
