
  • help with difference in proportions in multilevel data

I have a dataset where eyes are nested within subjects. I would like to know whether the proportions for var1 and var2 are different, such that p<0.05 would indicate a significant difference (so the null hypothesis is that var1 = var2). If the data were not nested, I would run
    prtest var1=var2

    However, the p-value from prtest does not account for the fact that some subjects have data on both eyes.

    Is there a way to bootstrap a 95% CI and p-value for prtest that would account for the fact that some subjects have both eyes in the study? Or is there another statistical test one can use to assess whether var1 and var2 differ significantly while accounting for the multilevel nature of the data?

    I realize that a GEE or mixed model could account for the multilevel nature of the data, but these test a different hypothesis: that var1 is associated with var2, so p<0.05 would indicate a significant association. That isn't really the question we are trying to answer, which is whether the two proportions are significantly different.

    Thank you for your help.
    Last edited by AC Thompson; 22 May 2020, 11:33.

  • #2
    I think you have to -cluster()- the standard errors on subjects.
    I hope that the following toy example is helpful:
    Code:
    set obs 10
    g id=_n
    g eyes=0
    g x=runiform()
    expand 2
    bysort id: replace eyes=1 if _n==2 & id<=6
    bysort id: replace eyes=. if _n==2 & id>6
    label define eyes 0 "right" 1 "left"
    label val eyes eyes
    . logistic eyes x, vce(cluster id)
    
    Logistic regression                             Number of obs     =         16
                                                    Wald chi2(1)      =       0.32
                                                    Prob > chi2       =     0.5728
    Log pseudolikelihood = -10.525824               Pseudo R2         =     0.0056
    
                                        (Std. Err. adjusted for 10 clusters in id)
    ------------------------------------------------------------------------------
                 |               Robust
            eyes | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
               x |   .5231353   .6009756    -0.56   0.573       .05505     4.97131
           _cons |    .756365   .3143038    -0.67   0.502     .3349799    1.707828
    ------------------------------------------------------------------------------
    Note: _cons estimates baseline odds.
    
    
    
    .
    Kind regards,
    Carlo
    (Stata 19.0)



    • #3
      Without contradicting Carlo's suggestion, I think that knowing something about the frequency of various data patterns would be helpful here. I'd suggest you post tabulations for var1 X var2 for each of the following subsets of observations:

      1) Subjects with only one eye in the study, tabulation for the right eye, and a separate one for the left eye;
      2) Subjects with two eyes in the study, same two tabulations.
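
      For instance, those tabulations might be produced along these lines, assuming one row per eye with an id variable and a 0/1 eye indicator (variable names here are hypothetical, not from the original post):

      Code:
      * count how many eyes each subject contributes
      bysort id: gen n_eyes = _N

      * 1) subjects with only one eye in the study, one tabulation per eye
      tab var1 var2 if n_eyes == 1 & eye == 0   // right eye only
      tab var1 var2 if n_eyes == 1 & eye == 1   // left eye only

      * 2) subjects with both eyes in the study, same two tabulations
      tab var1 var2 if n_eyes == 2 & eye == 0   // right eyes
      tab var1 var2 if n_eyes == 2 & eye == 1   // left eyes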



      • #4
        Hi, I think that a logistic regression with clustered standard errors (like xtgee or mixed) tests a different null hypothesis. In that case the null hypothesis is that var1 is not associated with var2; if p<0.05, we reject the null and conclude that var1 is significantly associated with var2. The output is an odds ratio.

        But what we want to show is the opposite: that the proportion for var1 is significantly different from the proportion for var2, so that p<0.05 means the two proportions differ significantly (not that the variables are significantly associated). The way to do this with just one eye per subject would be prtest. But since this is multilevel data with more than one eye per subject, one would need to somehow adjust the 95% CI and p-value.

        I'm just wondering if someone might have code to do that, or if there is another statistical test that could provide an adjusted 95% CI and p-value for the difference in proportions in multilevel data?

        I'll show the breakdown here in case this is helpful:

        For left eye:

                   |      var1
              var2 |    No    Yes |  Total
        -----------+--------------+-------
                No |    65     22 |     87
               Yes |     8      3 |     11
        -----------+--------------+-------
             Total |    73     25 |     98

        For right eye:

                   |      var1
              var2 |    No    Yes |  Total
        -----------+--------------+-------
                No |    66     17 |     83
               Yes |     7      2 |      9
        -----------+--------------+-------
             Total |    73     19 |     92

        For both eyes retained:

                   |      var1
              var2 |    No    Yes |  Total
        -----------+--------------+-------
                No |   131     39 |    170
               Yes |    15      5 |     20
        -----------+--------------+-------
             Total |   146     44 |    190



        • #5
          I'm assuming there is a way to bootstrap the 95% CI for the difference in proportions while taking into account the clustering of eyes within subjects, but I have not been able to find code to do this.




            • #7
              As an example, I tried:

              Code:
              seed 200
              cl(id eye)
              bootstrap, reps(1000): prtest var1=var2
              estat bootstrap, all

              This does not work; I get an error:
              Code:
              expression list required
              r(100);

              There is probably a way to get the 95% CI and p-value for a difference in proportions, clustering at the ID level, by writing a program for bootstrapping. Does anyone know how to do this?
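
              [A note on the error: prtest is an r-class command, so bootstrap needs an explicit expression list naming which returned scalar to resample; also, the random seed is set with -set seed- and clustering is passed as a -cluster()- option to bootstrap, not as standalone commands. A minimal sketch along those lines, with the stored result r(P_diff) holding the difference in proportions:]

              Code:
              * reproducible cluster bootstrap of the difference in proportions
              set seed 200
              * r(P_diff) is the scalar prtest stores for prop(var1) - prop(var2);
              * cluster(id) resamples whole subjects, so both eyes move together
              bootstrap diff=r(P_diff), reps(1000) cluster(id) nodots: prtest var1 == var2
              estat bootstrap, all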




              • #8
                I do hope that the following toy example can help you somehow:
                Code:
                . set obs 10
                number of observations (_N) was 0, now 10
                
                . g id=_n
                
                . g var1=1 in 1/6
                (4 missing values generated)
                
                . replace var1=0 if var1==.
                (4 real changes made)
                
                . g var2=0 in 1/3
                (7 missing values generated)
                
                
                . replace var2=1 in 4/10
                (7 real changes made)
                
                . expand 2
                (10 observations created)
                
                . prtest var1==var2
                
                Two-sample test of proportions                  var1: Number of obs =       20
                                                                var2: Number of obs =       20
                ------------------------------------------------------------------------------
                    Variable |       Mean   Std. Err.      z    P>|z|     [95% Conf. Interval]
                -------------+----------------------------------------------------------------
                        var1 |         .6   .1095445                      .3852967    .8147033
                        var2 |         .7   .1024695                      .4991635    .9008365
                -------------+----------------------------------------------------------------
                        diff |        -.1        .15                     -.3939946    .1939946
                             |  under Ho:    .150831    -0.66   0.507
                ------------------------------------------------------------------------------
                        diff = prop(var1) - prop(var2)                            z =  -0.6630
                    Ho: diff = 0
                
                    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
                 Pr(Z < z) = 0.2537         Pr(|Z| > |z|) = 0.5073          Pr(Z > z) = 0.7463
                
                . return list
                
                scalars:
                                 r(N1) =  20
                                 r(N2) =  20
                                 r(P1) =  .6
                                 r(P2) =  .7
                             r(P_diff) =  -.1
                                r(se1) =  .1095445115010332
                                r(se2) =  .102469507659596
                           r(se_diff0) =  .1508310312899836
                            r(se_diff) =  .15
                                r(lb1) =  .3852967027539412
                                r(ub1) =  .8147032972460588
                                r(lb2) =  .4991634554736407
                                r(ub2) =  .9008365445263593
                            r(lb_diff) =  -.393994597681008
                            r(ub_diff) =  .1939945976810081
                                  r(z) =  -.6629935441317957
                                r(p_l) =  .2536673443907445
                                  r(p) =  .5073346887814889
                                r(p_u) =  .7463326556092555
                              r(level) =  95
                
                . bootstrap r(P_diff), reps(1000) cluster(id) nodots : prtest var1==var2
                
                Warning:  Because prtest is not an estimation command or does not set e(sample), bootstrap has no way to determine which observations
                          are used in calculating the statistics and so assumes that all observations are used.  This means that no observations will
                          be excluded from the resampling because of missing values or other reasons.
                
                          If the assumption is not true, press Break, save the data, and drop the observations that are to be excluded.  Be sure that
                          the dataset in memory contains only the relevant data.
                
                Bootstrap results                               Number of obs     =         20
                                                                Replications      =      1,000
                
                      command:  prtest var1==var2
                        _bs_1:  r(P_diff)
                
                                                     (Replications based on 10 clusters in id)
                ------------------------------------------------------------------------------
                             |   Observed   Bootstrap                         Normal-based
                             |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
                -------------+----------------------------------------------------------------
                       _bs_1 |        -.1    .265759    -0.38   0.707     -.620878     .420878
                ------------------------------------------------------------------------------
                
                . estat bootstrap, all
                
                Bootstrap results                               Number of obs     =         20
                                                                Replications      =       1000
                
                      command:  prtest var1==var2
                        _bs_1:  r(P_diff)
                
                                                     (Replications based on 10 clusters in id)
                ------------------------------------------------------------------------------
                             |    Observed               Bootstrap
                             |       Coef.       Bias    Std. Err.  [95% Conf. Interval]
                -------------+----------------------------------------------------------------
                       _bs_1 |         -.1      .0159   .26575895    -.620878    .420878   (N)
                             |                                            -.6         .4   (P)
                             |                                            -.5         .6  (BC)
                ------------------------------------------------------------------------------
                (N)    normal confidence interval
                (P)    percentile confidence interval
                (BC)   bias-corrected confidence interval
                
                .
                Kind regards,
                Carlo
                (Stata 19.0)



                • #9
                  Originally posted by Carlo Lazzaro View Post
                  Yes, this code works:
                  Code:
                  bootstrap r(P_diff), reps(1000) cluster(id) nodots : prtest var1==var2

                  Thank you for your help!
