  • Correlation and cluster command

    Hi Everyone!
    I am using Stata SE 14.0. I would like to run a correlation between variables (e.g. alcoholuse and depression), but I have a twin sample, which means that my observations are non-independent (twins belonging together are identified by the variable familyID). I need the value of the Pearson's correlation and its p-value, but I cannot use cluster(familyID) together with the correlate command. I know that I can use regressions with cluster(familyID) (the regress command), but then I don't get the value of the Pearson's correlation... any ideas how to work around that problem? Or is that simply not possible? Thank you very much for your help!

  • #2
    With just one predictor, the Pearson correlation is just the square root of the coefficient of determination (R-squared) from the regression, carrying the sign of the slope. Not having a P-value for a calculation that doesn't make full sense does not strike me as a loss. The regression provides more appropriate inferential results.
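    A quick way to see that relationship (a minimal sketch using Stata's auto dataset; the variables are for illustration only, and both are positively correlated so the signs agree):
    Code:
    sysuse auto, clear
    quietly regress price weight
    display "sqrt(R-squared) = " sqrt(e(r2))
    quietly correlate price weight
    display "Pearson r       = " r(rho)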

    • #3
      You can standardize the variables to have mean 0 and unit standard deviation. The regression coefficient will then equal the correlation.
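      A minimal sketch of that idea, borrowing the variable names from the original question (alcoholuse, depression, familyID) purely as placeholders:
      Code:
      * standardize both variables to mean 0 and standard deviation 1
      egen z_alc = std(alcoholuse)
      egen z_dep = std(depression)

      * the slope now equals the Pearson correlation;
      * cluster() gives standard errors that allow for within-family dependence
      regress z_dep z_alc, cluster(familyID)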

      Best
      Daniel

      • #4
        Thank you both very much for your help! That is very useful! Would you also have any idea how to obtain the p-value, or alternatively the confidence intervals, for the Pearson's correlation when I use the regression instead of the correlation? (Silly question @Daniel: would the p-value of the regression be identical to the p-value of the Pearson's correlation if I standardize the variables to have mean 0 and unit standard deviation?)
        Best,
        Mel

        • #5
          I don't see that there can be a separate P-value for the correlation that accounts for the clustering. That was part of my point in #2.

          You need a model for the data-generating process; that model is the regression, and it has a P-value. It also has a corresponding correlation.

          As I understand it, that's as much as you need and as much as you can get.



          • #6
            The p-values will be the same in this case.*

            But I do not fully understand why you want correlations in the first place. Without more information on your ultimate goal, it seems like a family fixed-effects regression approach, or something similar, would probably be preferable.
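            For example, one reading of that suggestion (a minimal sketch, again using the variable names from the original question as placeholders) would be:
            Code:
            * within-family (fixed-effects) regression
            xtset familyID
            xtreg depression alcoholuse, fe vce(cluster familyID)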

            Best
            Daniel

            Edit:

            * This statement needs explanation. If you run a simple regression of y on x, where both y and x have mean 0 and variance 1, then the coefficient will equal the correlation of the two variables. Further, the estimated p-value from the regression model will equal the p-value calculated by Stata's pwcorr command, i.e. the p-value for the correlation. With clustered standard errors this no longer holds true, of course.
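            That equivalence is easy to check, e.g. with the auto data (a minimal sketch; any two variables will do):
            Code:
            sysuse auto, clear
            egen z_price  = std(price)
            egen z_weight = std(weight)

            * with ordinary standard errors, the slope and its p-value
            * match the correlation and the p-value from pwcorr
            regress z_price z_weight
            pwcorr price weight, sig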
            Last edited by daniel klein; 21 Mar 2016, 08:15.

            • #7
              One problem with regress for clustered data is that inference depends on which variable is the predictor and which is the outcome. sem offers one approach to inference about the correlation.
              Code:
              clear
              input pairid  x  y
              1  1 3
              1  2 3
              2  2 6
              2  3 5
              3  4 2
              3  2 8
              4  7 9
              4  9 9
              5  4 12
              5  8 10
              end
              . corr x y

                           |        x        y
              -------------+------------------
                         x |   1.0000
                         y |   0.5861   1.0000

              . reg y x, cluster(pairid)
              
              Linear regression                               Number of obs     =         10
                                                              F(1, 4)           =      13.30
                                                              Prob > F          =     0.0218
                                                              R-squared         =     0.3435
                                                              Root MSE          =     2.9228
                                               (Std. Err. adjusted for 5 clusters in pairid)
              ------------------------------------------------------------------------------
                           |               Robust
                         y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
              -------------+----------------------------------------------------------------
                         x |   .7067039   .1937797     3.65   0.022     .1686852    1.244723
                     _cons |   3.731844   1.276101     2.92   0.043     .1888205    7.274867
              ------------------------------------------------------------------------------
              
              . reg x y, cluster(pairid)
              
              Linear regression                               Number of obs     =         10
                                                              F(1, 4)           =       4.16
                                                              Prob > F          =     0.1111
                                                              R-squared         =     0.3435
                                                              Root MSE          =      2.424
                                               (Std. Err. adjusted for 5 clusters in pairid)
              ------------------------------------------------------------------------------
                           |               Robust
                         x |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
              -------------+----------------------------------------------------------------
                         y |   .4860711   .2383802     2.04   0.111    -.1757784    1.147921
                     _cons |   .9433237   1.386112     0.68   0.534    -2.905141    4.791788
              ------------------------------------------------------------------------------
              /* SEM approach */
              sem (<- y x), vce(cluster pairid) standardized

              Exogenous variables
              Observed:  y x

              Structural equation model                       Number of obs     =         10
              Estimation method    = ml
              Log pseudolikelihood = -47.830929

                                               (Std. Err. adjusted for 5 clusters in pairid)
              ------------------------------------------------------------------------------
                           |               Robust
              Standardized |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
              -------------+----------------------------------------------------------------
                   mean(y) |   2.076584   .4772121     4.35   0.000     1.141265    3.011902
                   mean(x) |   1.569614   .3041468     5.16   0.000     .9734969    2.165731
              -------------+----------------------------------------------------------------
                    var(y) |          1          .                             .           .
                    var(x) |          1          .                             .           .
              -------------+----------------------------------------------------------------
                  cov(y,x) |   .5860958   .1731163     3.39   0.001     .2467941    .9253976
              ------------------------------------------------------------------------------
              One problem: the test and confidence limits are based on a Gaussian distribution. For clustered data, a more appropriate reference distribution may be a t distribution with degrees of freedom equal to the number of clusters minus one.
              Code:
              matrix A = r(table)
              matrix B = A["b","cov(y,x):"] \ A["se","cov(y,x):"]
              scalar r  = B[1,1]                        // correlation (standardized covariance)
              scalar se = B[2,1]                        // cluster-robust standard error
              scalar df = e(N_clust) - 1                // degrees of freedom: clusters minus one
              scalar pvalue = 2*ttail(df, abs(r/se))    // t-based p-value
              scalar k = invttail(df, (100 - c(level))/200)   // t critical value
              scalar llim = r - k*se
              scalar ulim = r + k*se
              The confidence limits above are symmetric about the estimate, and they can fall outside the theoretical range of [-1,1]; here the upper limit does exceed 1. Below is a delta-method approach to an asymmetric interval via Fisher's Z transformation. The applicability of the delta method with such a small n is certainly suspect, but the back-transformed limits do stay within [-1,1].
              Code:
              scalar fish_z  = atanh(r)
              scalar fish_se = se/((1+r)*(1-r))

              /* confidence limits on the transformed scale */
              scalar fish_ll = fish_z - k*fish_se
              scalar fish_ul = fish_z + k*fish_se

              /* back-transformed to the correlation scale */
              scalar rz_ll = tanh(fish_ll)
              scalar rz_ul = tanh(fish_ul)
              scalar list llim rz_ll ulim rz_ul
                    llim =  .1054479
                   rz_ll = -.0603745
                    ulim =  1.0667437
                   rz_ul =  .8861798
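              As a small optional add-on (not part of the code above), the scalars can be shown together in one place:
              Code:
              display "cluster-robust r = " %6.4f r ",  t-based p-value = " %6.4f pvalue
              display c(level) "% CI, symmetric:  " %7.4f llim " to " %7.4f ulim
              display c(level) "% CI, Fisher's Z: " %7.4f rz_ll " to " %7.4f rz_ul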
              Steve Samuels
              Statistical Consulting
              [email protected]

              Stata 14.2

              • #8
                Thank you all for your responses! Sorry for the delay in replying; I was travelling. To clarify, I am doing research in behavioural genetics (structural equation modelling using twin data sets). Here it is standard to report the phenotypic correlations with p-values or confidence intervals for this kind of population before presenting the results of the multivariate models, which is why I need that kind of information. The approach Steve suggests looks interesting! I will give it a try!
                Best
                Mel

                • #9
                  Let us know how it turns out. Since \((1-r)(1+r)=1-r^2\), the code for the delta-method standard error of Fisher's Z can also be written as:
                  Code:
                  scalar fish_se = se/(1-r^2)
                  Steve Samuels
                  Statistical Consulting
                  [email protected]

                  Stata 14.2
