
  • Paired T-test for multiply imputed data

    Dear Stata sages,

    I have conducted multiple imputation using the -mi- command to fill the gaps in a large (N = 3000+) survey dataset. I am using this to run a multivariate regression model (which runs just fine), and also to provide descriptive stats on some peripheral variables. The latter are simple Likert-scale measures (measuring beliefs).

    For these descriptive variables, I would like to indicate in a table which means differ, and for this I would like to conduct paired-sample t-tests. However, -mi estimate- does not support the -ttest- command.

    How could one conduct a paired-samples t-test (comparing the means of two different variables) on multiply imputed data? (There are great suggestions on here for conducting unpaired t-tests with multiple imputation, but I found none for paired t-tests.)

    My deepest thanks for your time!
    Last edited by Chris Reinders Folmer; 15 Dec 2022, 08:28.

  • #2
    You can emulate the paired ttest using -xtreg, fe-. You will first need to reshape your data to long so that your x1 and x2 in -ttest x1 = x2- are a single variable and another variable indicates the 1 vs 2 distinction. -mi estimate- does support -xtreg-. Here's an illustration of the emulation:

    Code:
    . clear*
    
    .
    . set obs 25
    Number of observations (_N) was 0, now 25.
    
    . set seed 1234
    
    . gen long obs_no = _n
    
    . gen z = rnormal()
    
    . forvalues i = 1/2 {
      2.     gen x`i' = z + rnormal(0, 0.2)
      3. }
    
    .
    . ttest x1 = x2
    
    Paired t test
    ------------------------------------------------------------------------------
    Variable |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
    ---------+--------------------------------------------------------------------
          x1 |      25    .0763758    .2087469    1.043734   -.3544566    .5072082
          x2 |      25    .0874277     .198443    .9922152   -.3221386     .496994
    ---------+--------------------------------------------------------------------
        diff |      25    -.011052    .0436359    .2181797   -.1011121    .0790082
    ------------------------------------------------------------------------------
         mean(diff) = mean(x1 - x2)                                   t =  -0.2533
     H0: mean(diff) = 0                              Degrees of freedom =       24
    
     Ha: mean(diff) < 0           Ha: mean(diff) != 0           Ha: mean(diff) > 0
     Pr(T < t) = 0.4011         Pr(|T| > |t|) = 0.8022          Pr(T > t) = 0.5989
    
    .
    . reshape long x, i(obs_no) j(_j)
    (j = 1 2)
    
    Data                               Wide   ->   Long
    -----------------------------------------------------------------------------
    Number of observations               25   ->   50          
    Number of variables                   4   ->   4          
    j variable (2 values)                     ->   _j
    xij variables:
                                      x1 x2   ->   x
    -----------------------------------------------------------------------------
    
    . xtset obs_no _j
    
    Panel variable: obs_no (strongly balanced)
     Time variable: _j, 1 to 2
             Delta: 1 unit
    
    . xtreg x i._j, fe
    
    Fixed-effects (within) regression               Number of obs     =         50
    Group variable: obs_no                          Number of groups  =         25
    
    R-squared:                                      Obs per group:
         Within  = 0.0027                                         min =          2
         Between =      .                                         avg =        2.0
         Overall = 0.0000                                         max =          2
    
                                                    F(1,24)           =       0.06
    corr(u_i, Xb) = -0.0000                         Prob > F          =     0.8022
    
    ------------------------------------------------------------------------------
               x | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
            2._j |    .011052   .0436359     0.25   0.802    -.0790082    .1011121
           _cons |   .0763758   .0308553     2.48   0.021     .0126936    .1400579
    -------------+----------------------------------------------------------------
         sigma_u |  1.0124405
         sigma_e |  .15427638
             rho |  .97730705   (fraction of variance due to u_i)
    ------------------------------------------------------------------------------
    F test that all u_i=0: F(24, 24) = 86.13                     Prob > F = 0.0000
    That same thing will work in the multiple imputation context.
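    For the -mi- case, the same steps run under the -mi- prefixes. A minimal sketch, assuming the imputed variables are x1 and x2 and the data are already -mi set- (variable names are placeholders for your own):

    Code:
    gen long obs_no = _n
    mi reshape long x, i(obs_no) j(_j)
    mi xtset obs_no _j
    mi estimate: xtreg x i._j, fe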



    • #3
      A paired sample t-test can easily be converted to a single sample test by computing the difference between the two variables and then testing whether the mean is zero. For example, without mi, some ways you could do it are

      Code:
      use "C:\Users\rwilliam\Downloads\2sample-IV.dta", clear
      ttest hscore = wscore
      gen diff = hscore - wscore
      ttest diff = 0
      reg diff
      mean diff
      test diff = 0
      With mi, you have to decide whether to compute diff and then impute it (the "Just Another Variable" approach), or whether to first impute v1 and v2 and then compute the difference between the imputed vars (passive imputation). I discuss the pros and cons of each on pp. 10-11 of

      https://www3.nd.edu/~rwilliam/xsoc73994/MD02.pdf
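      As a minimal sketch of the passive route (assuming hscore and wscore are registered as imputed and the data are already -mi set-), the difference can be created with -mi passive- and then tested under -mi estimate-:

      Code:
      mi passive: gen diff = hscore - wscore
      mi estimate: regress diff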
      -------------------------------------------
      Richard Williams, Notre Dame Dept of Sociology
      StataNow Version: 19.5 MP (2 processor)

      EMAIL: [email protected]
      WWW: https://www3.nd.edu/~rwilliam



      • #4
        Another equivalent modeling option, if of interest.

        Code:
        . mixed x i._j || obs_no : , reml dfmethod(satt) nolog nolrtest
        
        Mixed-effects REML regression                   Number of obs     =         50
        Group variable: obs_no                          Number of groups  =         25
                                                        Obs per group:
                                                                      min =          2
                                                                      avg =        2.0
                                                                      max =          2
        DF method: Satterthwaite                        DF:           min =      24.00
                                                                      avg =      24.28
                                                                      max =      24.56
                                                        F(1,    24.00)    =       0.06
        Log restricted-likelihood = -35.086189          Prob > F          =     0.8022
        
        ------------------------------------------------------------------------------
                   x | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
        -------------+----------------------------------------------------------------
                2._j |    .011052   .0436359     0.25   0.802    -.0790082    .1011121
               _cons |   .0763758   .2036601     0.38   0.711     -.343454    .4962055
        ------------------------------------------------------------------------------
        
        ------------------------------------------------------------------------------
          Random-effects parameters  |   Estimate   Std. err.     [95% conf. interval]
        -----------------------------+------------------------------------------------
        obs_no: Identity             |
                          var(_cons) |   1.013135   .2959223       .571536    1.795937
        -----------------------------+------------------------------------------------
                       var(Residual) |   .0238012   .0068708      .0135169    .0419102
        ------------------------------------------------------------------------------
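        Under multiple imputation, the same model could be fit as, for instance (a sketch; -mi estimate- computes its own degrees of freedom, so dfmethod() is omitted here):

        Code:
        mi estimate: mixed x i._j || obs_no : , reml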



        • #5
          Dear all,

          Thank you so much for responding! These are really useful. One follow-up question for Prof. Schechter, if I may:

          I have 13 variables (rather than 2) which all share the same stem: BI_1, BI_2, BI_3, etc. When I run your code, it separates these into 13 groups. I presume -xtreg- (which is currently running) will test for the overall difference between them. That is also useful, but how can I break them down into pairs (BI_1 vs BI_2, etc.)? I can do this manually via the difference scores suggested by Prof. Williams, but I would also like to understand how to do this when using -xtreg-.

          Again many thanks to you all!



          • #6
            Well, if you want all 78 possible pairwise comparisons (I don't know what you'll do with all that but....):
            Code:
            forvalues i = 1/13 {
                forvalues k = `=`i'+1'/13 {
                    xtreg BI i._j if inlist(_j, `i', `k'), fe
                }
            }
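            In the -mi- setting, each fit would carry the -mi estimate- prefix. A sketch, assuming the reshaped long variable is named BI_ (matching the -mi reshape- stub used later in this thread):

            Code:
            forvalues i = 1/13 {
                forvalues k = `=`i'+1'/13 {
                    mi estimate: xtreg BI_ i._j if inlist(_j, `i', `k'), fe
                }
            }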



            • #7
              Clyde’s looping approach could be tweaked to use my suggested methods. But are you sure you really want to do that? If you do, remember that with 78 tests you would expect about 4 to be significant at the .05 level even if the means were all the same.
              -------------------------------------------
              Richard Williams, Notre Dame Dept of Sociology
              StataNow Version: 19.5 MP (2 processor)

              EMAIL: [email protected]
              WWW: https://www3.nd.edu/~rwilliam



              • #8
                Helpful, thank you both! The purpose is to indicate in a table (with superscripts) which of the means statistically differ. I will correct for the number of tests.



                • #9
                  Here is a "use at your own risk" quick hack as an alternative:

                  Code:
                    program mi_paired_t , eclass

                        version 16.1

                        // run the paired t-test on the two variables passed to the program
                        ttest `0'

                        tempname b V

                        // collect the difference in means and its squared standard error
                        // as 1x1 coefficient and variance matrices
                        matrix `b' = r(mu_1)-r(mu_2)
                        matrix `V' = r(se)^2

                        matrix rownames `b' = "diff"
                        matrix colnames `b' = "diff"

                        matrix rownames `V' = "diff"
                        matrix colnames `V' = "diff"

                        // post the results as e(b) and e(V) so -mi estimate- can pool them
                        ereturn post `b' `V'

                        ereturn local  cmd    "mi_paired_t"
                        // return the complete-data df so -mi estimate- applies the small-sample correction
                        ereturn scalar df_r = r(df_t)

                    end
                  Define the program in memory, then call as

                  Code:
                  mi estimate , cmdok : mi_paired_t varname == varname
                  Last edited by daniel klein; 16 Dec 2022, 02:38. Reason: added returned DF to force small sample correction and match results from -xtreg- and -regress- (on the calculated difference)
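
                    With the 13 variables from #5, the program could likewise be called once per pair without any reshaping. A sketch, assuming the variables are named BI_1 ... BI_13:

                    Code:
                    forvalues i = 1/13 {
                        forvalues k = `=`i'+1'/13 {
                            mi estimate , cmdok : mi_paired_t BI_`i' == BI_`k'
                        }
                    }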



                  • #10
                    Thanks everyone! This is great. These responses will also be invaluable for anyone else who's facing the same question!



                    • #11
                       By the way, building on the idea of differences (see #3), you can obtain the difference more directly:

                      Code:
                      mi estimate (diff : _b[varname_1] - _b[varname_2]) : mean varname_1 varname_2
                      Using the loop approach from #6, you could build a list of difference expressions, then make one call to mi:

                      Code:
                       mi estimate (diff_1_2 : _b[varname_1] - _b[varname_2]) (diff_1_3 : _b[varname_1] - _b[varname_3]) ... : mean varname_1 varname_2 varname_3 ...
                      No reshaping, no new variables, no ad-hoc programs; it's all readily supported within the mi machinery.

                      Code:
                      forvalues i = 1/13 {
                          forvalues j = `=`i'+1'/13 {
                              local Diff `Diff' (diff_`i'_`j' : _b[BI_`i']-_b[BI_`j'])
                          }
                          local Vars `Vars' BI_`i'
                      }
                      
                      mi estimate `Diff' : mean `Vars'
                      Last edited by daniel klein; 16 Dec 2022, 03:25. Reason: spelled out the loop



                      • #12
                        Again my thanks for all your helpful advice!

                        Might I perhaps ask a final follow-up question about the -xtreg- approach as suggested in #2 by prof. Schechter?

                        I ran this analysis on the 13 BI variables, and also on a second set of (again 13) variables. Curiously, the F test (which I presume is an omnibus test of differences between the 13 means) shows different denominator df for the two analyses: F(12, 39710) = 301.30 vs F(12, 39695) = 592.66. Otherwise, the output refers to the same number of observations (3317) and groups (13). With everything else identical, I can't figure out the reason for the difference in df. Am I missing something here?



                        • #13
                          Are you running this with or without MI? Could missing data account for the discrepancies?
                          -------------------------------------------
                          Richard Williams, Notre Dame Dept of Sociology
                          StataNow Version: 19.5 MP (2 processor)

                          EMAIL: [email protected]
                          WWW: https://www3.nd.edu/~rwilliam



                          • #14
                             I do not recall all the details, but in the multiple imputation framework the degrees of freedom depend on the within- and between-imputation variances. These variances are probably different for different sets of variables. The Methods and formulas section of the -mi estimate- documentation provides the details.
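
                             For reference, a sketch of Rubin's large-sample formula, with M imputations, average within-imputation variance $\bar{U}$, and between-imputation variance $B$:

                             $\nu = (M - 1)\left[1 + \frac{\bar{U}}{(1 + M^{-1})\,B}\right]^{2}$

                             so variable sets with different within/between variance ratios get different df. Stata additionally applies the Barnard-Rubin small-sample adjustment when the complete-data degrees of freedom are returned (see the edit note in #9).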



                            • #15
                              With MI, and missing values have been imputed for all these variables. This is the code:

                              Code:
                              gen long obs_no = _n
                              mi reshape long BI_, i(obs_no) j(_j)
                              mi xtset obs_no _j
                              mi estimate: xtreg BI_ i._j, fe
                              and

                              Code:
                              gen long obs_no = _n
                              mi reshape long Advice_, i(obs_no) j(_j)
                              mi xtset obs_no _j
                               mi estimate: xtreg Advice_ i._j, fe
                              I will paste the output later but the analyses need to complete first (reshaping tends to take some time on my computer).
                              Last edited by Chris Reinders Folmer; 16 Dec 2022, 10:35.

