  • Estimating mean difference between two samples that partly overlap

    Dear all,

    I have a dataset with a large sample of 2,000 observations that contains a smaller subsample of 500 observations (so 2,000 observations in total). I want to test whether the means of the 60 variables in this dataset differ significantly between the overall sample and the subsample, and also compute an effect size that takes the standard deviations into account. I can do this when a clear dichotomous variable identifies each observation as, say, either control or treatment. However, I cannot do it in this case because one sample includes the other. I tried to solve this using a variable with more than two categories, but that does not work. Does anyone maybe have an idea? Many thanks!

  • #2
    You could turn this into a regression problem with indicator variables for each condition and very likely an interaction term too.
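    To make that suggestion concrete, here is a minimal sketch of one way to set it up (my illustration, not from the post; it assumes the data have been reshaped long so each row is one variable-observation pair, with the hypothetical names x for the value, which for the variable indicator, subpop2 flagging duplicated subsample rows, and id linking each copy to its original observation):

    Code:
    * Sketch only: the interaction lets each variable have its own
    * sample effect. For variable j, the subsample-minus-overall mean
    * difference is _b[1.subpop2] + _b[j.which#1.subpop2]; clustering
    * on id accounts for the duplicated rows.
    regress x i.which##i.subpop2, vce(cluster id)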



    • #3
      Does this help? https://www.statalist.org/forums/for...59#post1450959
      David Radwin
      Senior Researcher, California Competes
      californiacompetes.org
      Pronouns: He/Him



      • #4
        Alternatively, I believe you can loop over your variables, collect the requisite summary statistics for each group, then run a ttest like so:

        Code:
        clear
        input int v1 v2 v3 subgroup
        0 0 0 1
        0 0 1 1
        0 1 0 0
        0 1 1 0
        1 0 0 1
        1 0 1 1
        1 1 0 0
        1 1 1 0
        end
        
        foreach var of varlist v* {
            display "t-test for `var'..."
            // Collect information for full sample
            quietly summarize `var'
            local full_n = r(N)
            local full_mean = r(mean)
            local full_sd = r(sd)
            // Collect information for subsample
            quietly summarize `var' if subgroup == 1
            local subgroup_n = r(N)
            local subgroup_mean = r(mean)
            local subgroup_sd = r(sd)
            // Ttest
            ttesti `full_n' `full_mean' `full_sd' `subgroup_n' `subgroup_mean' `subgroup_sd'
        }
        Last edited by Daniel Schaefer; 04 Aug 2022, 12:50. Reason: removed debug line
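        A small follow-up on the loop above (my addition, not part of the original code): -ttesti- leaves its results in r(), so the loop could also collect the two-sided p-values rather than only displaying the tables. A sketch with illustrative numbers (note that, like the t-tests themselves, this treats the two samples as independent):

        Code:
        * r(p) holds the two-sided p-value after -ttest-/-ttesti-;
        * r(p_l) and r(p_u) hold the one-sided versions.
        quietly ttesti 2000 0.1 1 500 0.25 1
        display "two-sided p = " %6.4f r(p)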



        • #5
          Here is how you could do such tests using dstat (see ssc describe dstat):

          Code:
          . // data
          . set seed 1221348
          
          . drop _all
          
          . set obs 2000
          Number of observations (_N) was 0, now 2,000.
          
          . gen byte subpop = _n<=500
          
          . drawnorm x1-x10
          
          . foreach x of var x1-x10 {
            2.     local offset = runiform()-.5
            3.     qui replace `x' = `x' + `offset' if subpop==1
            4. }
          
          .
          . // means in subpop and in overall sample
          . dstat x1-x10, over(subpop, select(1)) total
          
          mean                              Number of obs   =      2,000
          
                      1: subpop = 1
          
          --------------------------------------------------------------
                       | Coefficient  Std. err.     [95% conf. interval]
          -------------+------------------------------------------------
          1            |
                    x1 |  -.2487229   .0451513     -.3372715   -.1601742
                    x2 |   -.425834   .0442562     -.5126271   -.3390409
                    x3 |    .272505   .0434875      .1872195    .3577906
                    x4 |   .2799301   .0436161      .1943923    .3654678
                    x5 |   .3145234   .0468535      .2226366    .4064102
                    x6 |   -.387549   .0427548     -.4713976   -.3037003
                    x7 |   .0785741   .0446668     -.0090243    .1661726
                    x8 |  -.2181905   .0466019     -.3095839   -.1267972
                    x9 |     .05156   .0466761      -.039979    .1430989
                   x10 |  -.2816894   .0439925     -.3679654   -.1954134
          -------------+------------------------------------------------
          total        |
                    x1 |  -.0553058   .0224608     -.0993548   -.0112568
                    x2 |  -.1034344   .0226641     -.1478821   -.0589867
                    x3 |   .0828425   .0220403      .0396181    .1260669
                    x4 |    .069545   .0225834      .0252555    .1138344
                    x5 |   .1004449   .0223098      .0566921    .1441978
                    x6 |  -.0830756   .0222872     -.1267842    -.039367
                    x7 |   .0151667   .0223798     -.0287235    .0590569
                    x8 |  -.0980133   .0225949     -.1423253   -.0537013
                    x9 |   .0181936   .0229536     -.0268218     .063209
                   x10 |  -.0475958   .0220863     -.0909104   -.0042812
          --------------------------------------------------------------
          
          .
          . // mean difference between subpop and overall sample including test
          . dstat x1-x10, over(subpop, select(1) contrast) total
          
          Difference in mean                              Number of obs     =      2,000
                                                          Contrast          =      total
          
                      1: subpop = 1
          
          ------------------------------------------------------------------------------
                       | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
          -------------+----------------------------------------------------------------
          1            |
                    x1 |   -.193417    .039036    -4.95   0.000    -.2699725   -.1168615
                    x2 |  -.3223996   .0386389    -8.34   0.000    -.3981764   -.2466228
                    x3 |   .1896625   .0378333     5.01   0.000     .1154658    .2638593
                    x4 |   .2103851   .0382255     5.50   0.000     .1354191    .2853511
                    x5 |   .2140785   .0399418     5.36   0.000     .1357465    .2924105
                    x6 |  -.3044734   .0375594    -8.11   0.000     -.378133   -.2308138
                    x7 |   .0634074   .0387094     1.64   0.102    -.0125077    .1393224
                    x8 |  -.1201773   .0399549    -3.01   0.003    -.1985349   -.0418196
                    x9 |   .0333664   .0402019     0.83   0.407    -.0454757    .1122085
                   x10 |  -.2340936   .0381507    -6.14   0.000    -.3089129   -.1592744
          ------------------------------------------------------------------------------
          An alternative would be to stack the data and cluster on observations:

          Code:
          . preserve
          
          . gen id = _n
          
          . expand 2 if subpop==1
          (500 observations created)
          
          . bysort id: gen byte subpop2 = _n==2
          
          . dstat x1-x10, over(subpop2, contrast) vce(cluster id)
          
          Difference in mean                              Number of obs     =      2,500
                                                          Contrast          =  0.subpop2
          
                      0: subpop2 = 0
                      1: subpop2 = 1
          
                                           (Std. err. adjusted for 2,000 clusters in id)
          ------------------------------------------------------------------------------
                       |               Robust
                       | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
          -------------+----------------------------------------------------------------
          1            |
                    x1 |   -.193417    .039036    -4.95   0.000    -.2699725   -.1168615
                    x2 |  -.3223996   .0386389    -8.34   0.000    -.3981764   -.2466228
                    x3 |   .1896625   .0378333     5.01   0.000     .1154658    .2638593
                    x4 |   .2103851   .0382255     5.50   0.000     .1354191    .2853511
                    x5 |   .2140785   .0399418     5.36   0.000     .1357465    .2924105
                    x6 |  -.3044734   .0375594    -8.11   0.000     -.378133   -.2308138
                    x7 |   .0634074   .0387094     1.64   0.102    -.0125077    .1393224
                    x8 |  -.1201773   .0399549    -3.01   0.003    -.1985349   -.0418196
                    x9 |   .0333664   .0402019     0.83   0.407    -.0454757    .1122085
                   x10 |  -.2340936   .0381507    -6.14   0.000    -.3089129   -.1592744
          ------------------------------------------------------------------------------
          
          . restore
          Such an approach would also work with other commands such as mean or regress. Here's an illustration using mean.

          Code:
          . preserve
          
          . gen id = _n
          
          . expand 2 if subpop==1
          (500 observations created)
          
          . bysort id: gen byte subpop2 = _n==2
          
          . qui mean x1-x10, over(subpop2) vce(cluster id)
          
          . local exp
          
          . foreach x of var x1-x10 {
            2.     local exp `exp' (`x': _b[c.`x'@1.subpop2]-_b[c.`x'@0bn.subpop2])
            3. }
          
          . nlcom `exp'
          
          x1: _b[c.x1@1.subpop2]-_b[c.x1@0bn.subpop2]
          x2: _b[c.x2@1.subpop2]-_b[c.x2@0bn.subpop2]
          x3: _b[c.x3@1.subpop2]-_b[c.x3@0bn.subpop2]
          x4: _b[c.x4@1.subpop2]-_b[c.x4@0bn.subpop2]
          x5: _b[c.x5@1.subpop2]-_b[c.x5@0bn.subpop2]
          x6: _b[c.x6@1.subpop2]-_b[c.x6@0bn.subpop2]
          x7: _b[c.x7@1.subpop2]-_b[c.x7@0bn.subpop2]
          x8: _b[c.x8@1.subpop2]-_b[c.x8@0bn.subpop2]
          x9: _b[c.x9@1.subpop2]-_b[c.x9@0bn.subpop2]
         x10: _b[c.x10@1.subpop2]-_b[c.x10@0bn.subpop2]
          
          ------------------------------------------------------------------------------
                  Mean | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
          -------------+----------------------------------------------------------------
                    x1 |   -.193417    .039036    -4.95   0.000    -.2699262   -.1169079
                    x2 |  -.3223996   .0386389    -8.34   0.000    -.3981305   -.2466687
                    x3 |   .1896625   .0378333     5.01   0.000     .1155107    .2638144
                    x4 |   .2103851   .0382255     5.50   0.000     .1354645    .2853057
                    x5 |   .2140785   .0399418     5.36   0.000     .1357939    .2923631
                    x6 |  -.3044734   .0375594    -8.11   0.000    -.3780884   -.2308584
                    x7 |   .0634074   .0387094     1.64   0.101    -.0124617    .1392765
                    x8 |  -.1201773   .0399549    -3.01   0.003    -.1984875   -.0418671
                    x9 |   .0333664   .0402019     0.83   0.407     -.045428    .1121608
                   x10 |  -.2340936   .0381507    -6.14   0.000    -.3088676   -.1593197
          ------------------------------------------------------------------------------
          
          . restore
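          The regress variant mentioned above can be sketched in one line (my illustration, reusing the stacked data and the id variable from the preceding block): the intercept is the overall mean and the coefficient on 1.subpop2 is the subsample-minus-total mean difference.

          Code:
          * On the stacked data, _b[1.subpop2] reproduces the -dstat-
          * contrast, with cluster-robust standard errors from
          * vce(cluster id).
          regress x1 i.subpop2, vce(cluster id)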


          ..., including an effect size test that indicates the standard deviations.
          Don't know what exactly you mean by that.
          ben



          • #6
            Dear all, thank you very much for your replies. Unfortunately, I have not been able to solve my problem in any of these ways yet. However, I just had an idea I wanted to share with you. It may be a bit out of the box, but maybe it works anyway. My idea is to duplicate every observation in the dataset and then create a new dummy variable that is 0 for the 2,000 observations of the original overall sample and 1 for the 500 duplicated observations that are copies of the original subsample. This way I would avoid the problem of one sample including the other (right?). In my head this makes sense, but somehow I worry that I am missing something. Am I making sense?

            ..., including an effect size test that indicates the standard deviations.


            Don't know what exactly you mean by that.
            Sorry that I was not making this more clear, I mean the Cohen's d test.
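            As an aside (my addition, not part of the reply): Stata's built-in -esize- command, available since Stata 13, also reports Cohen's d, though it treats the two groups as independent and offers no clustered standard errors, so it does not account for the overlap. A sketch, assuming a stacked dataset with a group flag subpop2:

            Code:
            * Sketch: Cohen's d of x1 between the groups defined by
            * subpop2. Note that -esize- assumes independent groups.
            esize twosample x1, by(subpop2) cohensd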



            • #7
              Hi Peter, I believe something like what you are describing will work, although in the interests of computational efficiency I would only duplicate the 500 observations in the subsample rather than the entire dataset. Since you are mutating your dataset, you might want to make use of -preserve- and -restore- for this. Alternatively, if you are on Stata 16 or later, I would suggest using frames to take advantage of the convenient syntax for copying observations. A quick word of warning: when mutating data like this, it is important to be careful about how you manage your data. You wouldn't want to run any other analyses on the data with the duplicate observations by mistake.
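              A minimal sketch of the frames route (my own illustration; frames require Stata 16 or later, and the variable names are taken from the simulated example earlier in this thread):

              Code:
              * Copy the flagged observations into a separate frame,
              * leaving the master data untouched; analyze the copy,
              * then discard it.
              frame put x1-x10 subpop if subpop==1, into(subcopy)
              frame subcopy: summarize x1
              frame drop subcopy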

              Finally, since you have several alternative solutions, I would recommend that you implement at least two of them and make sure that they agree. This extra validation step will help you identify any silent errors. In the world of programming, it is a cardinal sin to assume that your code is correct just because it runs without producing an error. It is always a good idea to systematically test your code.



              • #8
                However, I just had an idea ...
                Hm, isn't that what I suggested above ("An alternative would be to stack the data and cluster on observations")?



                • #9
                  Ben Jann

                  Hm, isn't that what I suggested above ("An alternative would be to stack the data and cluster on observations")?
                  I am sorry, you are absolutely right! Seems like I was not able to process any more information yesterday (at least not consciously). For me, this part of your code did the trick:

                  Code:
                  gen id = _n
                  expand 2 if subpop==1
                  bysort id: gen byte subpop2 = _n==2
                  Thank you so much!

                  Daniel Schaefer
                  Thank you for your advice, it is very much appreciated. I was not aware that Stata now even supports data frames. Very interesting. I will also keep in mind to validate my results.
                  Again, thank you all!



                  • #10
                    Using regress with indicators, as hinted at in #2 and #5, will allow you to use the suest command to combine the estimates and subsequently test cross-model hypotheses. The results are equivalent to the previously outlined approaches.

                    Code:
                    set seed 1221348
                    drop _all
                    set obs 2000
                    gen byte subpop = _n<=500
                    drawnorm x1-x10
                    foreach x of var x1-x10 {
                        local offset = runiform()-.5
                        qui replace `x' = `x' + `offset' if subpop==1
                    }
                    gen obs_no=_n
                    reshape long x, i(obs_no) j(which)
                    regress x ibn.which if subpop, nocons
                    est sto m1
                    regress x ibn.which, nocons
                    est sto m2
                    suest m1 m2
                    
                    forval i=1/10{
                        display "diff`i' = `: di %10.9f `=[m1_mean]`i'.which- [m2_mean]`i'.which''"
                        test [m1_mean]`i'.which=[m2_mean]`i'.which
                    }
                    Res.:

                    Code:
                    . regress x ibn.which if subpop, nocons
                    
                          Source |       SS           df       MS      Number of obs   =     5,000
                    -------------+----------------------------------   F(10, 4990)     =     38.79
                           Model |  390.362528        10  39.0362528   Prob > F        =    0.0000
                        Residual |  5021.20488     4,990  1.00625348   R-squared       =    0.0721
                    -------------+----------------------------------   Adj R-squared   =    0.0703
                           Total |  5411.56741     5,000  1.08231348   Root MSE        =    1.0031
                    
                    ------------------------------------------------------------------------------
                               x |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                    -------------+----------------------------------------------------------------
                           which |
                              1  |  -.2487229    .044861    -5.54   0.000    -.3366701   -.1607756
                              2  |   -.425834    .044861    -9.49   0.000    -.5137812   -.3378868
                              3  |    .272505    .044861     6.07   0.000     .1845578    .3604523
                              4  |   .2799301    .044861     6.24   0.000     .1919828    .3678773
                              5  |   .3145234    .044861     7.01   0.000     .2265762    .4024707
                              6  |   -.387549    .044861    -8.64   0.000    -.4754962   -.2996018
                              7  |   .0785741    .044861     1.75   0.080    -.0093731    .1665213
                              8  |  -.2181905    .044861    -4.86   0.000    -.3061378   -.1302433
                              9  |     .05156    .044861     1.15   0.250    -.0363873    .1395072
                             10  |  -.2816894    .044861    -6.28   0.000    -.3696367   -.1937422
                    ------------------------------------------------------------------------------
                    
                    . 
                    . est sto m1
                    
                    . 
                    . regress x ibn.which, nocons
                    
                          Source |       SS           df       MS      Number of obs   =    20,000
                    -------------+----------------------------------   F(10, 19990)    =     10.90
                           Model |  109.761047        10  10.9761047   Prob > F        =    0.0000
                        Residual |  20127.6674    19,990  1.00688681   R-squared       =    0.0054
                    -------------+----------------------------------   Adj R-squared   =    0.0049
                           Total |  20237.4285    20,000  1.01187142   Root MSE        =    1.0034
                    
                    ------------------------------------------------------------------------------
                               x |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                    -------------+----------------------------------------------------------------
                           which |
                              1  |  -.0553058   .0224375    -2.46   0.014    -.0992853   -.0113264
                              2  |  -.1034344   .0224375    -4.61   0.000    -.1474139    -.059455
                              3  |   .0828425   .0224375     3.69   0.000     .0388631    .1268219
                              4  |    .069545   .0224375     3.10   0.002     .0255655    .1135244
                              5  |   .1004449   .0224375     4.48   0.000     .0564655    .1444244
                              6  |  -.0830756   .0224375    -3.70   0.000     -.127055   -.0390961
                              7  |   .0151667   .0224375     0.68   0.499    -.0288127    .0591462
                              8  |  -.0980133   .0224375    -4.37   0.000    -.1419927   -.0540338
                              9  |   .0181936   .0224375     0.81   0.417    -.0257859     .062173
                             10  |  -.0475958   .0224375    -2.12   0.034    -.0915752   -.0036163
                    ------------------------------------------------------------------------------
                    
                    . 
                    . est sto m2
                    
                    . 
                    . suest m1 m2
                    
                    Simultaneous results for m1, m2
                    
                                                                    Number of obs     =     20,000
                    
                    ------------------------------------------------------------------------------
                                 |               Robust
                                 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
                    -------------+----------------------------------------------------------------
                    m1_mean      |
                           which |
                              1  |  -.2487229   .0451412    -5.51   0.000    -.3371979   -.1602478
                              2  |   -.425834   .0442463    -9.62   0.000    -.5125551   -.3391129
                              3  |    .272505   .0434777     6.27   0.000     .1872903    .3577198
                              4  |   .2799301   .0436062     6.42   0.000     .1944634    .3653967
                              5  |   .3145234   .0468429     6.71   0.000     .2227129    .4063339
                              6  |   -.387549   .0427452    -9.07   0.000     -.471328     -.30377
                              7  |   .0785741   .0446568     1.76   0.078    -.0089516    .1660998
                              8  |  -.2181905   .0465914    -4.68   0.000     -.309508   -.1268731
                              9  |     .05156   .0466656     1.10   0.269     -.039903    .1430229
                             10  |  -.2816894   .0439826    -6.40   0.000    -.3678938   -.1954851
                    -------------+----------------------------------------------------------------
                    m1_lnvar     |
                           _cons |    .006234   .0195972     0.32   0.750    -.0321758    .0446438
                    -------------+----------------------------------------------------------------
                    m2_mean      |
                           which |
                              1  |  -.0553058   .0224557    -2.46   0.014    -.0993183   -.0112934
                              2  |  -.1034344    .022659    -4.56   0.000    -.1478452   -.0590236
                              3  |   .0828425   .0220354     3.76   0.000      .039654     .126031
                              4  |    .069545   .0225783     3.08   0.002     .0252923    .1137976
                              5  |   .1004449   .0223048     4.50   0.000     .0567284    .1441615
                              6  |  -.0830756   .0222822    -3.73   0.000    -.1267479   -.0394033
                              7  |   .0151667   .0223748     0.68   0.498     -.028687    .0590205
                              8  |  -.0980133   .0225898    -4.34   0.000    -.1422885   -.0537381
                              9  |   .0181936   .0229484     0.79   0.428    -.0267844    .0631716
                             10  |  -.0475958   .0220813    -2.16   0.031    -.0908744   -.0043172
                    -------------+----------------------------------------------------------------
                    m2_lnvar     |
                           _cons |   .0068632   .0099142     0.69   0.489    -.0125683    .0262948
                    ------------------------------------------------------------------------------
                    
                    . 
                    . 
                    . 
                    . forval i=1/10{
                      2. 
                    .     display "diff`i' = `: di %10.9f `=[m1_mean]`i'.which- [m2_mean]`i'.which''"
                      3. 
                    .     test [m1_mean]`i'.which=[m2_mean]`i'.which
                      4. 
                    . }
                    diff1 = -0.193417041
                    
                     ( 1)  [m1_mean]1bn.which - [m2_mean]1bn.which = 0
                    
                               chi2(  1) =   24.56
                             Prob > chi2 =    0.0000
                    diff2 = -0.322399578
                    
                     ( 1)  [m1_mean]2.which - [m2_mean]2.which = 0
                    
                               chi2(  1) =   69.65
                             Prob > chi2 =    0.0000
                    diff3 = 0.189662542
                    
                     ( 1)  [m1_mean]3.which - [m2_mean]3.which = 0
                    
                               chi2(  1) =   25.14
                             Prob > chi2 =    0.0000
                    diff4 = 0.210385088
                    
                     ( 1)  [m1_mean]4.which - [m2_mean]4.which = 0
                    
                               chi2(  1) =   30.31
                             Prob > chi2 =    0.0000
                    diff5 = 0.214078503
                    
                     ( 1)  [m1_mean]5.which - [m2_mean]5.which = 0
                    
                               chi2(  1) =   28.74
                             Prob > chi2 =    0.0000
                    diff6 = -0.304473410
                    
                     ( 1)  [m1_mean]6.which - [m2_mean]6.which = 0
                    
                               chi2(  1) =   65.74
                             Prob > chi2 =    0.0000
                    diff7 = 0.063407386
                    
                     ( 1)  [m1_mean]7.which - [m2_mean]7.which = 0
                    
                               chi2(  1) =    2.68
                             Prob > chi2 =    0.1013
                    diff8 = -0.120177271
                    
                     ( 1)  [m1_mean]8.which - [m2_mean]8.which = 0
                    
                               chi2(  1) =    9.05
                             Prob > chi2 =    0.0026
                    diff9 = 0.033366393
                    
                     ( 1)  [m1_mean]9.which - [m2_mean]9.which = 0
                    
                               chi2(  1) =    0.69
                             Prob > chi2 =    0.4065
                    diff10 = -0.234093649
                    
                     ( 1)  [m1_mean]10.which - [m2_mean]10.which = 0
                    
                               chi2(  1) =   37.67
                             Prob > chi2 =    0.0000
                    
                    .



                    • #11
                      I now added Cohen's d to the list of statistics supported by dstat. Type

                      Code:
                      . ssc install dstat, replace
                      to install the new version.

                      You can type

                      Code:
                      dstat (cohend) varlist, by(groupvar)
                      to obtain Cohen's d with respect to groupvar for each variable in varlist, including consistent standard errors and confidence intervals.

                      The example above would be:

                      Code:
                      // Data
                      set seed 1221348
                      drop _all
                      set obs 2000
                      gen byte subpop = _n<=500
                      drawnorm x1-x10
                      foreach x of var x1-x10 {
                          local offset = runiform()-.5
                          qui replace `x' = `x' + `offset' if subpop==1
                      }
                      
                      // Compute Cohen's d
                      preserve
                      gen id = _n
                      expand 2 if subpop==1
                      bysort id: gen byte group = _n==2
                      dstat (cohend) x1-x10, by(group) vce(cluster id)
                      restore
                      Result:

                      Code:
                      . preserve
                      
                      . gen id = _n
                      
                      . expand 2 if subpop==1
                      (500 observations created)
                      
                      . bysort id: gen byte group = _n==2
                      
                      . dstat (cohend) x1-x10, by(group) vce(cluster id)
                      
                      cohend                            Number of obs   =      2,500
                                                        By variable     =      group
                      
                                       (Std. err. adjusted for 2,000 clusters in id)
                      --------------------------------------------------------------
                                   |               Robust
                                   | Coefficient  std. err.     [95% conf. interval]
                      -------------+------------------------------------------------
                                x1 |  -.1923289   .0384871     -.2678079   -.1168498
                                x2 |   -.319533   .0374729     -.3930229    -.246043
                                x3 |   .1929067   .0382167       .117958    .2678555
                                x4 |   .2096991   .0378185      .1355314    .2838669
                                x5 |   .2123661   .0395732      .1347571    .2899752
                                x6 |  -.3079036    .037162      -.380784   -.2350232
                                x7 |   .0633699   .0386097     -.0123495    .1390894
                                x8 |  -.1181661   .0389534     -.1945596   -.0417725
                                x9 |   .0323904   .0390108     -.0441156    .1088965
                               x10 |  -.2371593   .0382381       -.31215   -.1621686
                      --------------------------------------------------------------
                      
                      . restore
                      Results are equivalent to the following "manual" calculation based on the non-expanded data:

                      Code:
                      . dstat (mean sd0 count) x1, over(subpop, select(1)) total
                      
                      (...)
                      
                      . nlcom (x1: (_b[1:mean]-_b[total:mean]) /*
                      >     */ / sqrt((_b[1:count]*_b[1:sd0]^2 + _b[total:count]*_b[total:sd0]^2) /*
                      >     */ / (_b[1:count]+_b[total:count]-2))) /*
                      >     */, noheader
                      
                      ------------------------------------------------------------------------------
                                x1 | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
                      -------------+----------------------------------------------------------------
                                x1 |  -.1923289   .0384871    -5.00   0.000    -.2677621   -.1168956
                      ------------------------------------------------------------------------------
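                       Written out, the -nlcom- expression computes the pooled-SD form of Cohen's d, using the uncorrected standard deviations (sd0) and group counts stored by -dstat- (my transcription of the code above):

                       Code:
                       d = \frac{\bar{x}_1 - \bar{x}_{\mathrm{total}}}
                                {\sqrt{\dfrac{n_1 s_{0,1}^2 + n_{\mathrm{total}} s_{0,\mathrm{total}}^2}
                                             {n_1 + n_{\mathrm{total}} - 2}}}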
                      ben

