
  • simulation-based power analysis for linear mixed model/multilevel model

    Dear all,

    I was wondering whether I can run a simulation study in Stata to perform a power analysis (i.e., determine the sample size) for a longitudinal randomized controlled trial with 2 treatment arms and 4 time points (pre, post, and two follow-ups). It doesn't look like the -power- command can do this. Any suggestions (for Stata or other software)?

    Many thanks!

  • #2
    You can do this in Stata; I do it frequently. But you have to write your own detailed code. There is nothing analogous to the -sampsi- command that lets you just plug in values and get an answer.

    The general approach to the problem is to write a brief Stata program that simulates one replication of your trial, with random sampling of the error terms (and random slopes if any) at the participant and observation levels, followed by the appropriate analysis and hypothesis test. The program should be r-class and return the p-value in r(). It should take as arguments the various parameters of your trial such as the expected outcome difference under the alternate hypothesis, the assumed standard deviations of the random intercepts (and slopes if any) at each level, as well as the sample size in one arm and the ratio of arm sample sizes.
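
    Just to make the structure concrete, here is a minimal sketch of what such an r-class program might look like. It is deliberately stripped down: the name one_rep and its arguments are placeholders, it uses a single measurement per participant, and an ordinary -regress- stands in for the mixed-model analysis; in a real application the data-generation and analysis steps would reflect your actual multilevel design.

    Code:
    capture program drop one_rep
    program define one_rep, rclass
        // placeholder arguments: per-arm sample size, effect under the
        // alternative hypothesis, subject-level and residual SDs
        args n_per_arm delta sigma_u sigma_e
        drop _all
        set obs `=2*`n_per_arm''
        gen byte arm = _n <= `n_per_arm'
        gen u = rnormal(0, `sigma_u')    // subject-level random component
        gen y = `delta'*arm + u + rnormal(0, `sigma_e')
        // stand-in analysis: replace with the mixed model your design requires
        regress y i.arm
        // return the p-value for the treatment effect in r()
        tempname M
        matrix `M' = r(table)
        return scalar p = `M'[4, 2]
    end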

    Then for a given set of assumed parameters, you replicate the study using the -simulate- command, with the -saving()- option specified. Note that if you are hoping to find a power in the 80-95% range, and you want that estimated power to have a confidence interval half-width of 5 percentage points or less, you will need about 750 replications. After -simulate- runs, you load the saved simulation results, calculate a dichotomous variable indicating a statistically significant p-value (conventionally p < 0.05, but it need not be that), and apply -ci proportions- to that dichotomous variable. Voila, you have your power and a confidence interval around it.
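
    As a rough illustration of that step, and assuming an r-class program like the one_rep sketch above, the -simulate- call and the power calculation for a single scenario might look something like this (the file name, replication count, and parameter values are arbitrary):

    Code:
    // run 750 replications of one scenario, saving the p-values
    simulate p = r(p), reps(750) saving(one_scenario, replace): ///
        one_rep 100 0.5 1.1 1.6

    // load the saved results, flag significant replications, estimate power
    use one_scenario, clear
    gen byte sig = (p < 0.05)
    ci proportions sig
    display "estimated power = " %4.3f r(proportion)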

    If you need to try several different sample sizes, perhaps in combination with several different assumptions about the effect under the alternate hypothesis and the standard deviations of the random effects (which is nearly always the case in my line of work), then you take all of the above and wrap it inside some loops over those values.
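
    Sketching that outer layer under the same assumptions (the scenario values below are purely illustrative), one might loop over candidate per-arm sizes and effect sizes, tag each -simulate- run with its scenario, and stack the results at the end:

    Code:
    local i = 0
    foreach n of numlist 100 125 150 {
        foreach d of numlist 0.40 0.50 {
            local ++i
            // -simulate- leaves the replication results in memory
            simulate p = r(p), reps(750): one_rep `n' `d' 1.1 1.6
            gen int n_per_arm = `n'
            gen float delta = `d'
            save scenario_`i', replace
        }
    }
    // stack all scenarios and tabulate estimated power by cell
    clear
    forvalues j = 1/`i' {
        append using scenario_`j'
    }
    gen byte sig = (p < 0.05)
    table n_per_arm delta, contents(mean sig)

    A confidence interval for any particular cell can be obtained by running -ci proportions- restricted to that cell, as above.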

    Note that the entire process can be very time consuming. I'm fairly experienced at it, and it usually takes me several hours to get the code right. (When testing the code, you should run only a small number of replications, to save time.) Then actually running it for all the combinations I need to test is typically 8 or more hours of computation on my computer, sometimes even extending to a couple of days if there are enough scenarios to try. Of course, if there is little uncertainty about your parameters and only one or two sample sizes under consideration, it could go very quickly. (But that seldom happens in my life.)

    There are a few stand-alone programs out there that will do this kind of work in a more "plug in the numbers, wait for the computations a while, and get an answer" form. I have tried a few of the free ones and never found them satisfactory: either they require the parameters be specified in unusual metrics that take time to compute, or they run even more slowly than doing it "by hand" in Stata. And none I have seen enables you to just write loops to automate trying different combinations of parameter estimates, so you have to wait for each scenario to run and then hand-enter the next set of assumptions, rinse, and repeat. There are some commercial sample size packages that probably overcome these limitations. I don't know, as they're expensive enough that I've never been tempted by them, perhaps because I actually enjoy doing this kind of programming.






    • #3
      Thank you so much for the detailed reply! Very helpful indeed. I'm trying this out in Stata. Could you point me to some references or detailed examples for this approach? Thanks!



      • #4
        You might want to look at ipdpower, available from SSC. It handles simulation for two level models of various kinds. The help file will give you a link to a working paper which describes the software in great detail.
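
        If you want to try it, it can be installed and its documentation viewed from within Stata (assuming you have an internet connection):

        Code:
        ssc install ipdpower
        help ipdpower
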
        Richard T. Campbell
        Emeritus Professor of Biostatistics and Sociology
        University of Illinois at Chicago



        • #5
          Thanks for the pointer! The working paper on ipdpower discusses a few meta-analysis models; I wasn't entirely sure it applies to my case, but I'm trying it out.



          • #6
            If -ipdpower- doesn't work out for you, here is a power analysis I did just recently for a two-arm trial with one baseline and one follow-up measurement for each subject. The sampling ratio between the arms is fixed at 1:1. Feasible sample sizes ranged between 320 and 350 per arm. The expected difference considered minimally clinically significant would be somewhere around 0.5, so I tried values ranging from 0.45 to 0.60. The standard deviation of the random intercepts is assumed to be 1.1, and the standard deviation of the residual error is assumed to be 1.6 (based on a literature review and a previous similar study carried out by the same investigators using similar methods in a similar population).

            It does not quite follow the outline I showed above in that I do not use -simulate-. The innermost -forvalues- loop with posting to the postfile takes the place of that. The reason for not using -simulate- here is to avoid having to keep appending different -saving()- files: this combination of loops does it all at once. It's not necessarily better than using -simulate-. Depends on your comfort with -postfile-s. Another deviation from the outline I gave earlier is that I calculate the dichotomous statistical significance variable in the one_run program and return it in r(), rather than doing that later. Makes no difference really.

            You might wonder why I have one_run return the estimated effect sizes when I never use them later. The answer is that after this program ran, as a check on my work, I then did some statistical analyses (not shown here) on those values to be sure that they were in fact consistent with the effect sizes that we were trying to model.

            These 28 scenarios (7 arm sizes × 4 effect sizes) took a total of about 5.5 hours of run time on my computer (Stata 14.1 MP2, 64-bit, Windows 7, 8GB RAM, 2.7GHz dual CPUs).

            Code:
            capture log close
            set more off
            log using an_reconsider_sample_size, replace
            
            version 14.1
            
            //    SIMULATE POWER FOR TOTAL SAMPLE SIZES RANGING FROM 640 TO 700
            
            clear *
            
            //    SIMULATION OF ONE REPLICATION OF THE STUDY
            //    ONE BASELINE AND ONE FOLLOW-UP MEASUREMENT PER SUBJECT
            //    TWO ARMS OF EQUAL SIZES
            //    ANALYZE AS MIXED MODEL WITH INTERACTION OF PRE-POST#STUDY ARM
            capture program drop one_run
            program define one_run, rclass
                args n_per_group delta sigma_u sigma_e
                assert `n_per_group' > 0 & `delta' > 0 & `sigma_u' > 0 & `sigma_e' > 0
                //    CREATE A DATA SET FOR STUDY POPULATION
                drop _all
                set obs `=2*`n_per_group''
                gen byte group = _n <= `n_per_group'
                gen int obs_no = _n
                //    SAMPLE SUBJECT LEVEL RANDOM INTERCEPTS
                gen u = rnormal(0, `sigma_u')
                //    CREATE PRE-POST OBSERVATIONS FOR EACH SUBJECT
                expand 2
                //    SAMPLE RESIDUAL ERRORS
                by obs_no, sort: gen e = rnormal(0, `sigma_e')
                by obs_no: gen byte pre_post = _n-1
                //    CALCULATE FIXED EFFECTS COMPONENT
                gen xb = cond(group & pre_post, `delta', 0)
                //    CALCULATE STUDY OUTCOME
                gen y = xb + u + e
                //    ANALYZE THE SYNTHETIC DATA
                xtmixed y i.group##i.pre_post || obs_no:
                //    RETURN ESTIMATED EFFECT SIZE, P VALUE AND SIGNIFICANCE DICHOTOMY
                tempname M
                matrix `M' = r(table)
                return scalar b = `M'[1, 8]
                return scalar p = `M'[4, 8]
                return scalar sig = (`M'[4, 8] < 0.05)
                exit
            end
            
            local arm_size 320(5)350 // RANGE OF # OF SUBJECTS PER ARM
            local sigma_u = 1.1 // SUBJECT LEVEL RANDOM INTERCEPT S.D.
            local sigma_e = 1.6 // RESIDUAL ERROR S.D.
            local deltas 45(5)60 // EXPECTED DIFFERENCES UNDER ALTERNATIVE HYPOTHESIS (*100)
            local reps 750
            
            //    SET UP POSTFILE TO COLLECT SIMULATION RESULTS    
            tempfile results
            capture postutil clear
            postfile handle int arm_size float delta b p byte sig using `results'
            
            //    DO SIMULATIONS, LOOPING OVER SIZE AND EXPECTED DIFFERENCES
            foreach a of numlist `arm_size' {
                display as text "Arm size " as result "`a'"
                foreach d of numlist `deltas' {
                    local delta = `d'/100
                    display as text _col(4) "Delta " as result %3.2f `delta'
                    forvalues i = 1/`reps' {
                        quietly one_run `a' `delta' `sigma_u' `sigma_e'
                        post handle (`a') (`delta') (`r(b)') (`r(p)') (`r(sig)')
                    }
                }
            }
            postclose handle
            
            //    OPEN RESULTS FILE, CALCULATE POWER & DISPLAY FINDINGS
            use `results', clear
            
            levelsof arm_size, local(arm_sizes)
            levelsof delta, local(deltas)
            matrix drop _all
            //    LOOP OVER COMBINATIONS OF ARM SIZE AND EFFECT SIZE
            //    ACCUMULATING POWER RESULTS IN A MATRIX
            foreach a of local arm_sizes {
                foreach d of local deltas {
                    quietly ci proportions sig if arm_size == `a' & float(delta) == float(`d')
                    matrix M = nullmat(M) \ (`a', `d', `r(proportion)', `r(lb)', `r(ub)')
                }
            }
            
            //    DISPLAY THE MATRIX
            matrix colnames M = arm_size delta power power_lb power_ub
            
            matrix list M, noheader format(%3.2f)
            
            log close
            exit



            • #7
              From the help file for ipdpower:

              ipdpower is a simulations-based command that calculates power for complex
              mixed effects two-level data structures. The command was developed having
              individual patient data meta-analyses in mind, where patients are nested
              within studies, but the methods apply to any two-level structure (say
              patients within general practices in a primary care database).

              Richard T. Campbell
              Emeritus Professor of Biostatistics and Sociology
              University of Illinois at Chicago



              • #8
                Dr. Schechter- Thank you so much for posting your own script as an example. This is extremely helpful, as I was able to adapt it to fit my case. I have a few additional questions and would be so grateful for any input and suggestions:

                1. So far I have only been able to calculate the power (for a given sample size) for the pre-post change compared between the two treatment conditions, based on my adaptation of your example. As I mentioned, our design has four measurements (pre, post, 6-month follow-up, and 12-month follow-up). In this case, should I carry out 3 separate power analyses: pre-post, pre-FU6, and pre-FU12?

                2. For the expected effect sizes, I plugged the observed means and SDs (between- and within-subject) from our pilot study into the script. I know there is literature advising against using pilot data to inform power analysis (e.g., Kraemer et al.). However, in our case there is no prior literature other than our own pilot study to suggest what the difference between groups over time may be. I wonder what you think about this approach of using observed values from the pilot data in the power analysis.

                Thanks!


                Dr. Campbell- It looks like ipdpower only lets me add a continuous covariate, but the covariate I have is categorical. It would otherwise be a promising route and save much time! Thanks!



                • #9
                  You'd typically power on the basis of the study's primary outcome measure. What does the protocol say it is? When is it measured? What are the assumptions regarding patient attrition until that time point, including so-called unavailable cases? What does the protocol say about how these will be handled?

                  In your power analysis, you can assess how sensitive your sample size estimate is to the assumptions drawn from whatever prior information you make use of.



                  • #10
                    unavailable -> unevaluable



                    • #11
                      The primary outcome measure is a symptom rating score (continuous), measured by participant self-report, and assumed to have a total attrition of 20% by time 4 (about 15% is typical in the literature). The study compares two active treatments on symptom reduction, hoping to confirm that treatment A is more effective than treatment B (the current first-line/gold-standard treatment). The analysis plan is an ITT approach (is this what you were asking?).

                      Previously, we calculated a sample size (using software where we just plug in numbers) and then inflated it by the overall attrition rate of 20% to obtain the sample size aimed for in recruitment. Does this seem typical? Or should I be accounting for attrition in the power analysis in some way? Attrition is assumed to be the same in the two groups (no group difference should be observed).

                      I attached my adaptation of Dr. Schechter's script, in case it is helpful to see what I was doing. The problem with using a small pilot study to inform the power analysis: the follow-up observations for treatment B deviate a bit from the prior literature (while I do not have prior literature, other than this pilot, to inform the difference between treatments A and B).

                      Thanks very much for the input!



                      • #12
                        I think the point of Joseph's question about the primary outcome measure is this: is your primary outcome measure the result at time 4, or is it the result at time 3? Your power analysis should focus on whichever is primary. If you are doing this for the purpose of submitting a grant application, you should scrutinize the Specific Aims section to see what it says you are going to do. The power analysis should focus on that. If the Specific Aims section is too vague to tell you what the primary outcome is, then it needs to be rewritten or the reviewers will eat you alive. It may also be that the Specific Aims section specifies a joint test for both times. In that case, you need to revise one_run to test that joint hypothesis. In any case, the Specific Aims section should determine this.
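
                        In case it helps, here is a rough sketch only of what such a joint test might look like inside one_run, assuming your adaptation uses a four-level time variable (coded 1 = pre through 4 = the final follow-up) and placeholder names y, group, and id; substitute whatever your actual variables are:

                        Code:
                        // inside one_run, after generating the simulated data
                        xtmixed y i.group##i.time || id:
                        // joint Wald test of the treatment effect at times 3 and 4
                        test 1.group#3.time 1.group#4.time
                        return scalar p_joint = r(p)
                        return scalar sig_joint = (r(p) < 0.05)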

                        Most reviewers will accept handling attrition either by first calculating a sample size assuming no attrition (which is what you have done so far), and then multiplying up to reflect anticipated attrition, or by building probabilistic attrition into the simulation. The only general exception to that which I'm aware of is if the analysis is a survival analysis. In that case, attrition represents censoring and it affects the analysis in a more profound way. So for survival analyses, the power calculations should typically build in probabilistic attrition based on reasonable expectations.
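
                        If you do decide to build attrition into the simulation itself, a hedged sketch, assuming the two-period data layout from #6 (one follow-up row per subject) and attrition that is completely at random, is simply to delete follow-up records inside one_run before the analysis:

                        Code:
                        // illustrative only: drop each follow-up observation with
                        // probability 0.20 to mimic 20% anticipated attrition
                        drop if pre_post == 1 & runiform() < 0.20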

                        As Joseph Coveney points out, if all you have to go on for parameter estimates is a single pilot study of your own, you should do a sensitivity analysis. Bear in mind that there are two different kinds of parameters that inform the power calculations. There are the parameters of the data-generating process, such as the standard deviations of the random effects. These are, when possible, estimated from the literature and available data. But the effect size is not an empirical quantity and should not be based on what has been found in previous studies. The effect size should represent the minimum difference between groups that would be clinically meaningful. It is a value judgment and, ideally, it should be made by the clinical investigators, not the statistician (although usually the statistician has to work with the clinicians to explain the concept and elicit a suitable judgment).

                        For example, for the study for which I posted the power analysis code, we chose a minimum clinically significant effect size of 0.45 because other treatments for the same condition have generally produced an improvement of about 0.35; if we couldn't get to 0.45, our treatment would not be sufficiently better to warrant its use, given that it is more complex. We also tried additional, larger effect sizes because we were not confident, before the power analysis, that we could put together a large enough sample with the available resources to detect an effect size of only 0.45. So we also wanted to know whether we could at least detect something in the 0.5 to 0.6 range with a sample size that was easy enough to get.



                        • #13
                          I see. The primary outcome would be pre-post symptom reduction, and then to see whether this treatment gain can be maintained through the two follow-ups.

                          So the 0.45 is the pre-post difference for the treatment of interest. I didn't understand why you did not need to specify the expected ES of 0.35 for the other treatment in the power analysis script (I thought your other treatment might not be an active treatment but just a control group). In my case, we expect the difference between the two treatment groups at post to be a medium-sized effect. Thanks!



                          • #14
                            The primary outcome would be pre-post symptom reduction, and then to see whether this treatment gain can be maintained through the two follow-ups.
                            So I would think that you need two power analyses: one based on the effect at time 3, and the other based on the difference in effect between times 3 and 4. So your code would need to add the latter to the one_run program (and correspondingly change what is in the postfile and what analyses are done at the end). Then one run of the simulation will give you both power analyses together.
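
                            As a hedged sketch of those additions, assuming a time variable coded 1 = pre through 4 = the final follow-up and placeholder names y, group, and id in your adapted one_run, the two tests might look like this:

                            Code:
                            // inside one_run, after generating the simulated data
                            xtmixed y i.group##i.time || id:
                            // p-value for the treatment effect at time 3
                            test 1.group#3.time
                            return scalar p_t3 = r(p)
                            // p-value for the change in the treatment effect between times 3 and 4
                            test 1.group#4.time = 1.group#3.time
                            return scalar p_maint = r(p)

                            Each of those returned p-values would then be posted to the postfile and summarized separately at the end.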

                            I didn't understand why you did not need to specify the expected ES of 0.35 for the other treatment in the power analysis script (I thought your other treatment might not be an active treatment but just a control group).
                            You thought correctly: the study this power simulation represents has an inactive control group; it is not a comparison with other effective treatments, so we expect a 0 response in the control group here.

                            By the way, the effect size in my study is not an effect size in the sense of Cohen's d or the like. It is the result on a certain measurement that happens to be scaled in such a way that it looks like it could be a Cohen's d. But in the end that doesn't really matter for this kind of analysis: it is the relationships among the effect size, sigma_u, and sigma_e that drive the results, no matter what scale or unit of measurement they are denominated in.



                            • #15
                              Dear all

                              I've run the code provided by @Clyde in his second post in this thread (#6), and even after reducing the number of replications to 100 or fewer, Stata seems to freeze after around 500 simulations.
                              The same has happened to me in the past when using bootstrapping in loops or with other simulations, even using the -simulate- command (happy to share code). I've waited for hours and nothing seems to happen, although the Stata spinning wheel continues to spin. No error message is given, and the only thing I get is a frozen screen showing something like this for hours:

                              Code:
                              ...
                              Arm size 320
                                 Delta 0.45
                                 Delta 0.50
                                 Delta 0.55
                                 Delta 0.60
                              Arm size 325
                                 Delta 0.45
                                 Delta 0.50
                              I am running Stata/IC 15 on a MacBook Pro (2.7 GHz Intel Core i7, 16 GB 2133 MHz LPDDR3), and I have set segmentsize to 3GB.

                              Does anyone have any idea why this happens?

                              Thanks!

                              PS: also, in cases like this, the loops tend to become slower and slower as they progress.

