  • Estimating mean difference between two samples that partly overlap

    Dear all,

    I have a dataset with a large sample of 2,000 observations that contains a smaller subsample of 500 observations (so 2,000 observations in total). I want to test whether the means of the 60 variables in this dataset differ significantly between the overall sample and the subsample, and also compute an effect size that takes the standard deviations into account. I can do this when a clear dichotomous variable identifies each observation as, say, either control or treatment. However, I cannot do it in this case because one sample includes the other. I tried to solve this using a variable with more than two categories, but that does not work. Does anyone maybe have an idea? Many thanks!

  • #2
    You could turn this into a regression problem with indicator variables for each condition and very likely an interaction term too.
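    To make that suggestion concrete, here is a minimal sketch of one way to set it up (my illustration, not from the post; it assumes the data have been reshaped long so each row is one variable-observation pair, with the hypothetical names x for the value, which for the variable indicator, subpop2 flagging duplicated subsample rows, and id linking each copy to its original observation):

    Code:
    * Sketch only: the interaction lets each variable have its own
    * sample effect. For variable j, the subsample-minus-overall mean
    * difference is _b[1.subpop2] + _b[j.which#1.subpop2]; clustering
    * on id accounts for the duplicated rows.
    regress x i.which##i.subpop2, vce(cluster id)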



    • #3
      Does this help? https://www.statalist.org/forums/for...59#post1450959
      David Radwin
      Senior Researcher, California Competes
      californiacompetes.org
      Pronouns: He/Him



      • #4
        Alternatively, I believe you can loop over your variables, collect the requisite summary statistics for each group, then run a ttest like so:

        Code:
        clear
        input int v1 v2 v3 subgroup
        0 0 0 1
        0 0 1 1
        0 1 0 0
        0 1 1 0
        1 0 0 1
        1 0 1 1
        1 1 0 0
        1 1 1 0
        end
        
        foreach var of varlist v* {
            display "t-test for `var'..."
            // Collect information for full sample
            quietly summarize `var'
            local full_n = r(N)
            local full_mean = r(mean)
            local full_sd = r(sd)
            // Collect information for subsample
            quietly summarize `var' if subgroup == 1
            local subgroup_n = r(N)
            local subgroup_mean = r(mean)
            local subgroup_sd = r(sd)
            // Ttest
            ttesti `full_n' `full_mean' `full_sd' `subgroup_n' `subgroup_mean' `subgroup_sd'
        }
        Last edited by Daniel Schaefer; 04 Aug 2022, 12:50. Reason: removed debug line
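        A small follow-up on the loop above (my addition, not part of the original code): -ttesti- leaves its results in r(), so the loop could also collect the two-sided p-values rather than only displaying the tables. A sketch with illustrative numbers (note that, like the t-tests themselves, this treats the two samples as independent):

        Code:
        * r(p) holds the two-sided p-value after -ttest-/-ttesti-;
        * r(p_l) and r(p_u) hold the one-sided versions.
        quietly ttesti 2000 0.1 1 500 0.25 1
        display "two-sided p = " %6.4f r(p)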



        • #5
          Here is how you could do such tests using dstat (see ssc describe dstat):

          Code:
          . // data
          . set seed 1221348
          
          . drop _all
          
          . set obs 2000
          Number of observations (_N) was 0, now 2,000.
          
          . gen byte subpop = _n<=500
          
          . drawnorm x1-x10
          
          . foreach x of var x1-x10 {
            2.     local offset = runiform()-.5
            3.     qui replace `x' = `x' + `offset' if subpop==1
            4. }
          
          .
          . // means in subpop and in overall sample
          . dstat x1-x10, over(subpop, select(1)) total
          
          mean                              Number of obs   =      2,000
          
                      1: subpop = 1
          
          --------------------------------------------------------------
                       | Coefficient  Std. err.     [95% conf. interval]
          -------------+------------------------------------------------
          1            |
                    x1 |  -.2487229   .0451513     -.3372715   -.1601742
                    x2 |   -.425834   .0442562     -.5126271   -.3390409
                    x3 |    .272505   .0434875      .1872195    .3577906
                    x4 |   .2799301   .0436161      .1943923    .3654678
                    x5 |   .3145234   .0468535      .2226366    .4064102
                    x6 |   -.387549   .0427548     -.4713976   -.3037003
                    x7 |   .0785741   .0446668     -.0090243    .1661726
                    x8 |  -.2181905   .0466019     -.3095839   -.1267972
                    x9 |     .05156   .0466761      -.039979    .1430989
                   x10 |  -.2816894   .0439925     -.3679654   -.1954134
          -------------+------------------------------------------------
          total        |
                    x1 |  -.0553058   .0224608     -.0993548   -.0112568
                    x2 |  -.1034344   .0226641     -.1478821   -.0589867
                    x3 |   .0828425   .0220403      .0396181    .1260669
                    x4 |    .069545   .0225834      .0252555    .1138344
                    x5 |   .1004449   .0223098      .0566921    .1441978
                    x6 |  -.0830756   .0222872     -.1267842    -.039367
                    x7 |   .0151667   .0223798     -.0287235    .0590569
                    x8 |  -.0980133   .0225949     -.1423253   -.0537013
                    x9 |   .0181936   .0229536     -.0268218     .063209
                   x10 |  -.0475958   .0220863     -.0909104   -.0042812
          --------------------------------------------------------------
          
          .
          . // mean difference between subpop and overall sample including test
          . dstat x1-x10, over(subpop, select(1) contrast) total
          
          Difference in mean                              Number of obs     =      2,000
                                                          Contrast          =      total
          
                      1: subpop = 1
          
          ------------------------------------------------------------------------------
                       | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
          -------------+----------------------------------------------------------------
          1            |
                    x1 |   -.193417    .039036    -4.95   0.000    -.2699725   -.1168615
                    x2 |  -.3223996   .0386389    -8.34   0.000    -.3981764   -.2466228
                    x3 |   .1896625   .0378333     5.01   0.000     .1154658    .2638593
                    x4 |   .2103851   .0382255     5.50   0.000     .1354191    .2853511
                    x5 |   .2140785   .0399418     5.36   0.000     .1357465    .2924105
                    x6 |  -.3044734   .0375594    -8.11   0.000     -.378133   -.2308138
                    x7 |   .0634074   .0387094     1.64   0.102    -.0125077    .1393224
                    x8 |  -.1201773   .0399549    -3.01   0.003    -.1985349   -.0418196
                    x9 |   .0333664   .0402019     0.83   0.407    -.0454757    .1122085
                   x10 |  -.2340936   .0381507    -6.14   0.000    -.3089129   -.1592744
          ------------------------------------------------------------------------------
          An alternative would be to stack the data and cluster on observations:

          Code:
          . preserve
          
          . gen id = _n
          
          . expand 2 if subpop==1
          (500 observations created)
          
          . bysort id: gen byte subpop2 = _n==2
          
          . dstat x1-x10, over(subpop2, contrast) vce(cluster id)
          
          Difference in mean                              Number of obs     =      2,500
                                                          Contrast          =  0.subpop2
          
                      0: subpop2 = 0
                      1: subpop2 = 1
          
                                           (Std. err. adjusted for 2,000 clusters in id)
          ------------------------------------------------------------------------------
                       |               Robust
                       | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
          -------------+----------------------------------------------------------------
          1            |
                    x1 |   -.193417    .039036    -4.95   0.000    -.2699725   -.1168615
                    x2 |  -.3223996   .0386389    -8.34   0.000    -.3981764   -.2466228
                    x3 |   .1896625   .0378333     5.01   0.000     .1154658    .2638593
                    x4 |   .2103851   .0382255     5.50   0.000     .1354191    .2853511
                    x5 |   .2140785   .0399418     5.36   0.000     .1357465    .2924105
                    x6 |  -.3044734   .0375594    -8.11   0.000     -.378133   -.2308138
                    x7 |   .0634074   .0387094     1.64   0.102    -.0125077    .1393224
                    x8 |  -.1201773   .0399549    -3.01   0.003    -.1985349   -.0418196
                    x9 |   .0333664   .0402019     0.83   0.407    -.0454757    .1122085
                   x10 |  -.2340936   .0381507    -6.14   0.000    -.3089129   -.1592744
          ------------------------------------------------------------------------------
          
          . restore
          Such an approach would also work with other commands such as mean or regress. Here's an illustration using mean.

          Code:
          . preserve
          
          . gen id = _n
          
          . expand 2 if subpop==1
          (500 observations created)
          
          . bysort id: gen byte subpop2 = _n==2
          
          . qui mean x1-x10, over(subpop2) vce(cluster id)
          
          . local exp
          
          . foreach x of var x1-x10 {
            2.     local exp `exp' (`x': _b[c.`x'@1.subpop2]-_b[c.`x'@0bn.subpop2])
            3. }
          
          . nlcom `exp'
          
          x1: _b[c.x1@1.subpop2]-_b[c.x1@0bn.subpop2]
          x2: _b[c.x2@1.subpop2]-_b[c.x2@0bn.subpop2]
          x3: _b[c.x3@1.subpop2]-_b[c.x3@0bn.subpop2]
          x4: _b[c.x4@1.subpop2]-_b[c.x4@0bn.subpop2]
          x5: _b[c.x5@1.subpop2]-_b[c.x5@0bn.subpop2]
          x6: _b[c.x6@1.subpop2]-_b[c.x6@0bn.subpop2]
          x7: _b[c.x7@1.subpop2]-_b[c.x7@0bn.subpop2]
          x8: _b[c.x8@1.subpop2]-_b[c.x8@0bn.subpop2]
          x9: _b[c.x9@1.subpop2]-_b[c.x9@0bn.subpop2]
         x10: _b[c.x10@1.subpop2]-_b[c.x10@0bn.subpop2]
          
          ------------------------------------------------------------------------------
                  Mean | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
          -------------+----------------------------------------------------------------
                    x1 |   -.193417    .039036    -4.95   0.000    -.2699262   -.1169079
                    x2 |  -.3223996   .0386389    -8.34   0.000    -.3981305   -.2466687
                    x3 |   .1896625   .0378333     5.01   0.000     .1155107    .2638144
                    x4 |   .2103851   .0382255     5.50   0.000     .1354645    .2853057
                    x5 |   .2140785   .0399418     5.36   0.000     .1357939    .2923631
                    x6 |  -.3044734   .0375594    -8.11   0.000    -.3780884   -.2308584
                    x7 |   .0634074   .0387094     1.64   0.101    -.0124617    .1392765
                    x8 |  -.1201773   .0399549    -3.01   0.003    -.1984875   -.0418671
                    x9 |   .0333664   .0402019     0.83   0.407     -.045428    .1121608
                   x10 |  -.2340936   .0381507    -6.14   0.000    -.3088676   -.1593197
          ------------------------------------------------------------------------------
          
          . restore
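          The regress variant mentioned above can be sketched in one line (my illustration, reusing the stacked data and the id variable from the preceding block): the intercept is the overall mean and the coefficient on 1.subpop2 is the subsample-minus-total mean difference.

          Code:
          * On the stacked data, _b[1.subpop2] reproduces the -dstat-
          * contrast, with cluster-robust standard errors from
          * vce(cluster id).
          regress x1 i.subpop2, vce(cluster id)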


          ..., including an effect size test that indicates the standard deviations.
          Don't know what exactly you mean by that.
          ben



          • #6
            Dear all, thank you very much for your replies. Unfortunately, I have not been able to solve my problem in any of these ways yet. However, I just had an idea I wanted to share with you. It may be a bit out of the box, but maybe it works anyway. My idea is to duplicate every observation in the dataset and then create a new dummy variable that is 0 for the 2,000 observations of the original overall sample and 1 for the 500 duplicated observations that are copies of the original subsample. This way I would avoid the problem of one sample including the other (right?). In my head this makes sense, but somehow I worry that I am missing something. Am I making sense?

            ..., including an effect size test that indicates the standard deviations.


            Don't know what exactly you mean by that.
            Sorry that I was not making this more clear, I mean the Cohen's d test.
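            As an aside (my addition, not part of the reply): Stata's built-in -esize- command, available since Stata 13, also reports Cohen's d, though it treats the two groups as independent and offers no clustered standard errors, so it does not account for the overlap. A sketch, assuming a stacked dataset with a group flag subpop2:

            Code:
            * Sketch: Cohen's d of x1 between the groups defined by
            * subpop2. Note that -esize- assumes independent groups.
            esize twosample x1, by(subpop2) cohensd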



            • #7
              Hi Peter, I believe something like what you are describing will work, although in the interests of computational efficiency I would only duplicate the 500 observations in the subsample rather than the entire dataset. Since you are mutating your dataset, you might want to make use of -preserve- and -restore- for this. Alternatively, if you are on Stata 16 or later, I would suggest using frames to take advantage of the convenient syntax for copying observations. A quick word of warning: when mutating data like this, it is important to be careful about how you manage your data. You wouldn't want to run any other analyses on the data with the duplicate observations by mistake.
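              A minimal sketch of the frames route (my own illustration; frames require Stata 16 or later, and the variable names are taken from the simulated example earlier in this thread):

              Code:
              * Copy the flagged observations into a separate frame,
              * leaving the master data untouched; analyze the copy,
              * then discard it.
              frame put x1-x10 subpop if subpop==1, into(subcopy)
              frame subcopy: summarize x1
              frame drop subcopy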

              Finally, since you have several alternative solutions, I would recommend that you implement at least two of them and make sure that they agree. This extra validation step will help you identify any silent errors. In the world of programming, it is a cardinal sin to assume that your code is correct just because it runs without producing an error. It is always a good idea to systematically test your code.



              • #8
                However, I just had an idea ...
                Hm, isn't that what I suggested above ("An alternative would be to stack the data and cluster on observations")?



                • #9
                  Ben Jann

                  Hm, isn't that what I suggested above ("An alternative would be to stack the data and cluster on observations")?
                  I am sorry, you are absolutely right! Seems like I was not able to process any more information yesterday (at least not consciously). For me, this part of your code did the trick:

                  Code:
                  gen id = _n
                  expand 2 if subpop==1
                  bysort id: gen byte subpop2 = _n==2
                  Thank you so much!

                  Daniel Schaefer
                  Thank you for your advice, it is very much appreciated. I was not aware that Stata now even supports data frames. Very interesting. I will also keep in mind to validate my results.
                  Again, thank you all!



                  • #10
                    Using regress with indicators, as hinted at in #2 and #5, will allow you to use the suest command to combine the estimates and subsequently test cross-model hypotheses. The results are equivalent to the previously outlined approaches.

                    Code:
                    set seed 1221348
                    drop _all
                    set obs 2000
                    gen byte subpop = _n<=500
                    drawnorm x1-x10
                    foreach x of var x1-x10 {
                        local offset = runiform()-.5
                        qui replace `x' = `x' + `offset' if subpop==1
                    }
                    gen obs_no=_n
                    reshape long x, i(obs_no) j(which)
                    regress x ibn.which if subpop, nocons
                    est sto m1
                    regress x ibn.which, nocons
                    est sto m2
                    suest m1 m2
                    
                    forval i=1/10{
                        display "diff`i' = `: di %10.9f `=[m1_mean]`i'.which- [m2_mean]`i'.which''"
                        test [m1_mean]`i'.which=[m2_mean]`i'.which
                    }
                    Res.:

                    Code:
                    . regress x ibn.which if subpop, nocons
                    
                          Source |       SS           df       MS      Number of obs   =     5,000
                    -------------+----------------------------------   F(10, 4990)     =     38.79
                           Model |  390.362528        10  39.0362528   Prob > F        =    0.0000
                        Residual |  5021.20488     4,990  1.00625348   R-squared       =    0.0721
                    -------------+----------------------------------   Adj R-squared   =    0.0703
                           Total |  5411.56741     5,000  1.08231348   Root MSE        =    1.0031
                    
                    ------------------------------------------------------------------------------
                               x |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                    -------------+----------------------------------------------------------------
                           which |
                              1  |  -.2487229    .044861    -5.54   0.000    -.3366701   -.1607756
                              2  |   -.425834    .044861    -9.49   0.000    -.5137812   -.3378868
                              3  |    .272505    .044861     6.07   0.000     .1845578    .3604523
                              4  |   .2799301    .044861     6.24   0.000     .1919828    .3678773
                              5  |   .3145234    .044861     7.01   0.000     .2265762    .4024707
                              6  |   -.387549    .044861    -8.64   0.000    -.4754962   -.2996018
                              7  |   .0785741    .044861     1.75   0.080    -.0093731    .1665213
                              8  |  -.2181905    .044861    -4.86   0.000    -.3061378   -.1302433
                              9  |     .05156    .044861     1.15   0.250    -.0363873    .1395072
                             10  |  -.2816894    .044861    -6.28   0.000    -.3696367   -.1937422
                    ------------------------------------------------------------------------------
                    
                    . 
                    . est sto m1
                    
                    . 
                    . regress x ibn.which, nocons
                    
                          Source |       SS           df       MS      Number of obs   =    20,000
                    -------------+----------------------------------   F(10, 19990)    =     10.90
                           Model |  109.761047        10  10.9761047   Prob > F        =    0.0000
                        Residual |  20127.6674    19,990  1.00688681   R-squared       =    0.0054
                    -------------+----------------------------------   Adj R-squared   =    0.0049
                           Total |  20237.4285    20,000  1.01187142   Root MSE        =    1.0034
                    
                    ------------------------------------------------------------------------------
                               x |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                    -------------+----------------------------------------------------------------
                           which |
                              1  |  -.0553058   .0224375    -2.46   0.014    -.0992853   -.0113264
                              2  |  -.1034344   .0224375    -4.61   0.000    -.1474139    -.059455
                              3  |   .0828425   .0224375     3.69   0.000     .0388631    .1268219
                              4  |    .069545   .0224375     3.10   0.002     .0255655    .1135244
                              5  |   .1004449   .0224375     4.48   0.000     .0564655    .1444244
                              6  |  -.0830756   .0224375    -3.70   0.000     -.127055   -.0390961
                              7  |   .0151667   .0224375     0.68   0.499    -.0288127    .0591462
                              8  |  -.0980133   .0224375    -4.37   0.000    -.1419927   -.0540338
                              9  |   .0181936   .0224375     0.81   0.417    -.0257859     .062173
                             10  |  -.0475958   .0224375    -2.12   0.034    -.0915752   -.0036163
                    ------------------------------------------------------------------------------
                    
                    . 
                    . est sto m2
                    
                    . 
                    . suest m1 m2
                    
                    Simultaneous results for m1, m2
                    
                                                                    Number of obs     =     20,000
                    
                    ------------------------------------------------------------------------------
                                 |               Robust
                                 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
                    -------------+----------------------------------------------------------------
                    m1_mean      |
                           which |
                              1  |  -.2487229   .0451412    -5.51   0.000    -.3371979   -.1602478
                              2  |   -.425834   .0442463    -9.62   0.000    -.5125551   -.3391129
                              3  |    .272505   .0434777     6.27   0.000     .1872903    .3577198
                              4  |   .2799301   .0436062     6.42   0.000     .1944634    .3653967
                              5  |   .3145234   .0468429     6.71   0.000     .2227129    .4063339
                              6  |   -.387549   .0427452    -9.07   0.000     -.471328     -.30377
                              7  |   .0785741   .0446568     1.76   0.078    -.0089516    .1660998
                              8  |  -.2181905   .0465914    -4.68   0.000     -.309508   -.1268731
                              9  |     .05156   .0466656     1.10   0.269     -.039903    .1430229
                             10  |  -.2816894   .0439826    -6.40   0.000    -.3678938   -.1954851
                    -------------+----------------------------------------------------------------
                    m1_lnvar     |
                           _cons |    .006234   .0195972     0.32   0.750    -.0321758    .0446438
                    -------------+----------------------------------------------------------------
                    m2_mean      |
                           which |
                              1  |  -.0553058   .0224557    -2.46   0.014    -.0993183   -.0112934
                              2  |  -.1034344    .022659    -4.56   0.000    -.1478452   -.0590236
                              3  |   .0828425   .0220354     3.76   0.000      .039654     .126031
                              4  |    .069545   .0225783     3.08   0.002     .0252923    .1137976
                              5  |   .1004449   .0223048     4.50   0.000     .0567284    .1441615
                              6  |  -.0830756   .0222822    -3.73   0.000    -.1267479   -.0394033
                              7  |   .0151667   .0223748     0.68   0.498     -.028687    .0590205
                              8  |  -.0980133   .0225898    -4.34   0.000    -.1422885   -.0537381
                              9  |   .0181936   .0229484     0.79   0.428    -.0267844    .0631716
                             10  |  -.0475958   .0220813    -2.16   0.031    -.0908744   -.0043172
                    -------------+----------------------------------------------------------------
                    m2_lnvar     |
                           _cons |   .0068632   .0099142     0.69   0.489    -.0125683    .0262948
                    ------------------------------------------------------------------------------
                    
                    . 
                    . 
                    . 
                    . forval i=1/10{
                      2. 
                    .     display "diff`i' = `: di %10.9f `=[m1_mean]`i'.which- [m2_mean]`i'.which''"
                      3. 
                    .     test [m1_mean]`i'.which=[m2_mean]`i'.which
                      4. 
                    . }
                    diff1 = -0.193417041
                    
                     ( 1)  [m1_mean]1bn.which - [m2_mean]1bn.which = 0
                    
                               chi2(  1) =   24.56
                             Prob > chi2 =    0.0000
                    diff2 = -0.322399578
                    
                     ( 1)  [m1_mean]2.which - [m2_mean]2.which = 0
                    
                               chi2(  1) =   69.65
                             Prob > chi2 =    0.0000
                    diff3 = 0.189662542
                    
                     ( 1)  [m1_mean]3.which - [m2_mean]3.which = 0
                    
                               chi2(  1) =   25.14
                             Prob > chi2 =    0.0000
                    diff4 = 0.210385088
                    
                     ( 1)  [m1_mean]4.which - [m2_mean]4.which = 0
                    
                               chi2(  1) =   30.31
                             Prob > chi2 =    0.0000
                    diff5 = 0.214078503
                    
                     ( 1)  [m1_mean]5.which - [m2_mean]5.which = 0
                    
                               chi2(  1) =   28.74
                             Prob > chi2 =    0.0000
                    diff6 = -0.304473410
                    
                     ( 1)  [m1_mean]6.which - [m2_mean]6.which = 0
                    
                               chi2(  1) =   65.74
                             Prob > chi2 =    0.0000
                    diff7 = 0.063407386
                    
                     ( 1)  [m1_mean]7.which - [m2_mean]7.which = 0
                    
                               chi2(  1) =    2.68
                             Prob > chi2 =    0.1013
                    diff8 = -0.120177271
                    
                     ( 1)  [m1_mean]8.which - [m2_mean]8.which = 0
                    
                               chi2(  1) =    9.05
                             Prob > chi2 =    0.0026
                    diff9 = 0.033366393
                    
                     ( 1)  [m1_mean]9.which - [m2_mean]9.which = 0
                    
                               chi2(  1) =    0.69
                             Prob > chi2 =    0.4065
                    diff10 = -0.234093649
                    
                     ( 1)  [m1_mean]10.which - [m2_mean]10.which = 0
                    
                               chi2(  1) =   37.67
                             Prob > chi2 =    0.0000
                    
                    .



                    • #11
                      I now added Cohen's d to the list of statistics supported by dstat. Type

                      Code:
                      . ssc install dstat, replace
                      to install the new version.

                      You can type

                      Code:
                      dstat (cohend) varlist, by(groupvar)
                      to obtain Cohen's d with respect to groupvar for each variable in varlist, including consistent standard errors and confidence intervals.

                      The example above would be:

                      Code:
                      // Data
                      set seed 1221348
                      drop _all
                      set obs 2000
                      gen byte subpop = _n<=500
                      drawnorm x1-x10
                      foreach x of var x1-x10 {
                          local offset = runiform()-.5
                          qui replace `x' = `x' + `offset' if subpop==1
                      }
                      
                      // Compute Cohen's d
                      preserve
                      gen id = _n
                      expand 2 if subpop==1
                      bysort id: gen byte group = _n==2
                      dstat (cohend) x1-x10, by(group) vce(cluster id)
                      restore
                      Result:

                      Code:
                      . preserve
                      
                      . gen id = _n
                      
                      . expand 2 if subpop==1
                      (500 observations created)
                      
                      . bysort id: gen byte group = _n==2
                      
                      . dstat (cohend) x1-x10, by(group) vce(cluster id)
                      
                      cohend                            Number of obs   =      2,500
                                                        By variable     =      group
                      
                                       (Std. err. adjusted for 2,000 clusters in id)
                      --------------------------------------------------------------
                                   |               Robust
                                   | Coefficient  std. err.     [95% conf. interval]
                      -------------+------------------------------------------------
                                x1 |  -.1923289   .0384871     -.2678079   -.1168498
                                x2 |   -.319533   .0374729     -.3930229    -.246043
                                x3 |   .1929067   .0382167       .117958    .2678555
                                x4 |   .2096991   .0378185      .1355314    .2838669
                                x5 |   .2123661   .0395732      .1347571    .2899752
                                x6 |  -.3079036    .037162      -.380784   -.2350232
                                x7 |   .0633699   .0386097     -.0123495    .1390894
                                x8 |  -.1181661   .0389534     -.1945596   -.0417725
                                x9 |   .0323904   .0390108     -.0441156    .1088965
                               x10 |  -.2371593   .0382381       -.31215   -.1621686
                      --------------------------------------------------------------
                      
                      . restore
                      Results are equivalent to the following "manual" calculation based on the non-expanded data:

                      Code:
                      . dstat (mean sd0 count) x1, over(subpop, select(1)) total
                      
                      (...)
                      
                      . nlcom (x1: (_b[1:mean]-_b[total:mean]) /*
                      >     */ / sqrt((_b[1:count]*_b[1:sd0]^2 + _b[total:count]*_b[total:sd0]^2) /*
                      >     */ / (_b[1:count]+_b[total:count]-2))) /*
                      >     */, noheader
                      
                      ------------------------------------------------------------------------------
                                x1 | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
                      -------------+----------------------------------------------------------------
                                x1 |  -.1923289   .0384871    -5.00   0.000    -.2677621   -.1168956
                      ------------------------------------------------------------------------------
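                       Written out, the -nlcom- expression computes the pooled-SD form of Cohen's d, using the uncorrected standard deviations (sd0) and group counts stored by -dstat- (my transcription of the code above):

                       Code:
                       d = \frac{\bar{x}_1 - \bar{x}_{\mathrm{total}}}
                                {\sqrt{\dfrac{n_1 s_{0,1}^2 + n_{\mathrm{total}} s_{0,\mathrm{total}}^2}
                                             {n_1 + n_{\mathrm{total}} - 2}}}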
                      ben

