Monte Carlo simulation: puzzling slowdown in runtime when number of reps reaches 3k-4k

Stephen Jenkins

Join Date: Apr 2014

Posts: 1480
#1

Monte Carlo simulation: puzzling slowdown in runtime when number of reps reaches 3k-4k

13 Feb 2026, 04:04

A colleague and I are undertaking Monte-Carlo simulation analysis to assess and compare the performance in terms of size of 4 different variance estimators for a form of difference in means test. The simulation code is necessarily bespoke –- we are not using -simulate- -- and output is accumulated at each MC repetition using -post- and its siblings. We need to run 10,000 MC replications.

Issue: the code is fast until it hits around 3,000 to 4,000 replications, and then slows down markedly.

Question: Any tips please regarding why this slowdown occurs? And how it can be avoided?

We don’t think the issue related to the size of the postfile. One comment we’ve had, from Claude AI, is that the slowdown is due to the accumulation of entries in Stata's tempvar and tempfile macro “registry”. (I don't think Stata has a specific macro registry; presumably it's referring to current working memory. We have large amounts available - Stata 19/MP with ~64Mb RAM.) The argument was that even though names are cleared, the registry grows indefinitely: each new tempvar checks against all previous names to ensure uniqueness, causing slowdown." We don’t know how to check the veracity of this remark -- any comments? . -help limits- doesn’t provide information to indicate that there are limits of the kind suggested.

One of the options suggested to address the problem was to use -file write- instead of -post-. We've not heard of this before and wonder if others have used it. Also, is it likely that -frame post- would address the problem? (Perhaps not if the fundamental problem is to do with the accumulation of entries in the macro registry.)
Tags: None
Joseph Coveney

Join Date: Apr 2014

Posts: 4540
#2

13 Feb 2026, 20:18

Originally posted by Stephen Jenkins View Post

One comment we’ve had, from Claude AI, is that the slowdown is due to the accumulation of entries in Stata's tempvar and tempfile macro “registry”. . . . The argument was that even though names are cleared, the registry grows indefinitely: each new tempvar checks against all previous names to ensure uniqueness, causing slowdown." We don’t know how to check the veracity of this remark -- any comments?

Assuming that you're invoking tempvar and tempfile only within a program (or ado-file) that you call for the repetition, then results from executing the code below do not corroborate Claude AI's take on things—there's no prolonging of the execution time over three progressive epochs of 2500 repetitions each, despite invoking all three of tempvar, tempfile and tempname in each repetition. Elapsed time for the third epoch (repetitions 5001 through 7500) is within a second and a half of that for the first (repetitions 1 through 2500), and indeed is fortuitously shorter. (Complete do-file and log file are attached.)

One of the options suggested to address the problem was to use -file write- instead of -post-. We've not heard of this before and wonder if others have used it. Also, is it likely that -frame post- would address the problem? (Perhaps not if the fundamental problem is to do with the accumulation of entries in the macro registry.)

I doubt that getting down and dirty with file will make your life easier. The example below uses the frame post approach, but if you're opening only a single postfile for the 10 000 repetitions, then I can't see how that could be the source of the slowdown.

Code:

version 19 clear * // seedem set seed 1898961433 program define simem version 19 syntax , Frame(string) drop _all set obs 14 // -tempvar- check tempvar y x generate byte `x' = mod(_n, 2) generate double `y' = rexponential(`x'+1) // -tempfile- check tempfile dataset save `dataset' foreach vce in ols robust hc2 hc3 { regress `y' i.`x', vce(`vce') tempname `vce' scalar define ``vce'' = r(table)["pvalue", "1.`x'"] } foreach dfa in unequal welch { ttest `y', by(`x') `dfa' if "`dfa'" == "unequal" { tempname satterthwaite scalar define `satterthwaite' = r(p) } else { tempname `dfa' scalar define ``dfa'' = r(p) } } frame post `frame' (`ols') (`robust') (`hc2') (`hc3') /// (`satterthwaite') (`welch') end frame create Store double(ols robust hc2 hc3 satterthwaite welch) timer clear forvalues timer = 1/3 { timer on `timer' forvalues i = 1/2500 { quietly simem , f(Store) } timer off `timer' } timer list exit

Attached Files

slowdown.do (1.1 KB, 1 view)

slowdown.smcl (2.2 KB, 1 view)
Comment
Stephen Jenkins

Join Date: Apr 2014

Posts: 1480
#3

14 Feb 2026, 04:24

Joseph Coveney Many thanks for giving our problem your close attention - helpful. My colleague will discuss things further in the light of your info. (In the meantime, we've been splitting the 10 000 reps into smaller batches, so as to avoid the problem, and then combining results.) Thanks again
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2449
#4

15 Feb 2026, 07:23

Is there by chance anywhere in the code some place where c(matsize) is changed? I have a long shot idea in that regard. Some years ago, I discovered (after considerably puzzlement and swearing) that using larger values of c(matsize), even when not actually used in the code, could drastically slow down simulations like this. (I posted here about it and somewhere in the list archives, a discussion of that occurred.) That was probably Stata 12 or so, and the changes to how c(matsize) is managed probably make this comment irrelevant strictly speaking, but I thought I'd throw it out. Perhaps there is something analogous to this going on.
Comment
Joseph Coveney

Join Date: Apr 2014

Posts: 4540
#5

16 Feb 2026, 02:18

Originally posted by Stephen Jenkins View Post

My colleague will discuss things further in the light of your info.

Your colleague wouldn't happen to be programming something analogous to this somewhere in the bespoke simulation code?

Code:

version 19 clear * // seedem set seed 388361189 program define simem version 19 syntax , m(name) tempname Row matrix define `Row' = J(1, 4, .) forvalues i = 1/4 { matrix define `Row'[1, `i'] = runiform() } matrix define `m' = (nullmat(`m') \ `Row') end tempname M timer clear forvalues epoch = 1/3 { timer on `epoch' forvalues rep = 1/2500 { simem , m(`M') } timer off `epoch' } timer list display in smcl as text "`c(matsize)'" exit

Although c(matsize) is not changed in this code (it might be deprecated by now), Mike does bring up an interesting point regarding the potential for in-memory matrix operations to affect speed in that if you're accreting rows or columns to a matrix, then Stata has to do more and more to-and-fro from memory to on-chip cache. I'm not sure, but it might even try asking the operating system repeatedly to locate contiguous patches of memory to store the ever-growing matrix.

Either way (or both ways), that overhead would progressively slow things down as the matrix grows: in this example (complete do-file and log file attached), the execution time grows threefold each epoch of 2500 repetitions, from 7½ seconds at first to over 46 seconds in the third.

(Likewise, writing and reading back from an accumulating disc file, then the growing I/O operations would analogously affect execution time, but I assume that that point of weakness would be obvious to your colleague.)
Attached Files

slowdown2.do (570 Bytes, 1 view)

slowdown2.smcl (1.2 KB, 1 view)
Comment
Stephen Jenkins

Join Date: Apr 2014

Posts: 1480
#6

16 Feb 2026, 02:39

Joseph Coveney Mike Lacy Thanks for your pointers! I'm passing them on to my colleague and will rreport back
Comment

Stephen Jenkins

Join Date: Apr 2014
Posts: 1480

16 Feb 2026, 03:07

My colleague doesn't think those matrix issues you raised are the issue. In the meantime, he has identified the specific part of the sim code that appears to make the slowdown appear. Here's an extract

Code:

        forval r = 1/$rep {

    //To post DF results
    tempname results2
    postfile `results2' step rep bsrep ///
        theilboot theilbootV theil_Wstar  ///
        giniboot ginibootV gini_Wstar  ///
        mldboot mldbootV mld_Wstar  ///
        using "$out\bs\df_${B1}_${B2}_simbs_driver_M${M}_`year'_r${rep}_s`step'.dta", replace  // ####    

        ...

        local nreps = $B2 - $B1 + 1

        _dots 0, title(Loop running) reps(`nreps')
        qui forvalues b = $B1/$B2 {
                use "`temp`r''", clear // ###
                preserve
                    keep if R == 1 // Rich
                    sort inc
                    drop inc
                    tempvar y1
                    gen `y1' = rpareto(alpha, x0)
                    egen inc = clsort(`y1')
                    save "`temp'", replace  // #####
                restore
                    bsample if R == 0 // Non-rich 
                    append using "`temp'"  // ####
                    
            dstat ($mlist) inc [pw = wgt]
            local i = 1
            foreach meas of global mlist {    
                local `meas'boot = e(b)[1,`i']
                local `meas'bootV = e(V)[`i',`i']
                local `meas'_Wstar = (``meas'boot' - `meas'_0star)/sqrt(``meas'bootV')    
                local i = `i' +1
            }
        
            post `results2' (`step') (`r') (`b') ///
                (`theilboot') (`theilbootV') (`theil_Wstar')  ///
                (`giniboot') (`ginibootV') (`gini_Wstar')  ///
                (`mldboot') (`mldbootV') (`mld_Wstar')                      
                        
            noisily _dots `b' 0
        }    //BS loop  

    postclose `results2'          

    } //rep loop

PS in case you're curious, what we are doing is assessing the performance of a two-sample test that 2 scalar income inequality indices are equal (for multiple indices). In this case, the test statistic is based on a semiparametric percentile-t bootstrap approach in which the top x% of incomes are replaced by draws from a Pareto distribution fit to that x% (Davidson and Flachaire, Journal of Econometrics, 2007). We are undertaking a performance 'beauty contest' between this approach and 3 others.

Comment

Daniel Feenberg

Join Date: Oct 2014

Posts: 334
#8

16 Feb 2026, 07:41

Is the tmp directory accumulating a new file entry for each replication? Perhaps creating or referenceing a new file is slow in such a large directory. Can you delete the temporary file before starting the next replication?
Comment
Stephen Jenkins

Join Date: Apr 2014

Posts: 1480
#9

16 Feb 2026, 09:02

Apologies for a mistake made in copy-pasting that code extract. The -postfile- command comes before the rep loop, not inside it. Thanks, Daniel Feenberg. I presume you referring to the "tempvar y1" line in the -forvalues- loop. We'll check it out.
Comment
Daniel Feenberg

Join Date: Oct 2014

Posts: 334
#10

16 Feb 2026, 09:45

I meant these lines:

Code:

tempfile dataset save `dataset'

but I don't know how many times that is executed, or if the tempfiles acculate to a problematic degree.
Comment
Joseph Coveney

Join Date: Apr 2014

Posts: 4540
#11

17 Feb 2026, 19:15

Originally posted by Stephen Jenkins View Post

My colleague . . . has identified the specific part of the sim code that appears to make the slowdown appear. Here's an extract

Code:

forval r = 1/$rep { . . . qui forvalues b = $B1/$B2 { use "`temp`r''", clear // ### . . . } //BS loop . . . } //rep loop

So, if r is the repetition loop counter and it goes from 1 to 10 000, then it appears that you accumulate to ten thousand "`temp`r''" temporary files that stick around until looping ends. If so, then that seems where the gradual slowdown occurs.

You can always home in on the location in the code by placing a couple of timers, with their activation (timer on # & timer off #) made conditional on the loop counter. Something like the following.

Code:

timer clear // outside the loop forvalues rep = 1/$rep { if `rep' < 1000 timer on 1 if `rep' > 3000 timer on 2 . . . // <- lines of suspected code #1 timer off 1 // <- these won't throw an exception even if they're not on timer off 2 }
Comment
Stephen Jenkins

Join Date: Apr 2014

Posts: 1480
#12

18 Feb 2026, 02:21

Thanks again, Joseph Coveney . As it happens, we anticipated your remarks telepathically and placed various timers in the code. It turns out that most probable source of the slowdown issue is an accumulation of Stata matrices as in the code snippet below. That bit of code is not in the extract I posted earlier in the thread unfortunately.

Code:

foreach meas of global mlist { dstat (`meas') inc [pw = wgt], over(`groupa') *mat list e(b) tempname est mat `est' = e(b) mat list `est' <output omitted>

Relatedly, have a look too at the screenshot of the output from the -memory- command from a couple of points along the repetitions trail. Observe the count of Stata matrices for the later repetition (LHS) compared to an earlier repetition (RHS) and the timers' output

We owe many thanks to Ben Jann for his insights on this issue, as well as his advice about how to avoid it ... and improve our program more generally. Ben also observed that "My personal opinion is that the number of matrices and the accumulated memory they consume is too small to justify such a drastic performance effect. I would suggest forwarding this exchange to Stata Corp. Maybe there is an issue they will want to fix." My colleague and I are looking into doing this.
Thanks again to everyone for helping us identify and resolve the issue.
Comment

Ben Jann

Join Date: Sep 2014
Posts: 292

#13

18 Feb 2026, 02:59

Yeah, the accumulation of matrices in memory is responsible for the slowdown. Below is a brief simulation using the mean command. If results are collected in separate matrices, execution of mean gets slower and slower.

Code:

set seed 23754987
local reps 300
clear all
set obs 100
gen y = exp(rnormal())
gen w = runiform()
mean y [pw=w]
matrix T = J(`reps',2,.)
// version 1: single results matrix
forv r=1/`reps' {
    quietly {
        timer clear
        forv i = 1/10 {
            timer on 1
            mean y [pw=w]
            timer off 1
            mat b = e(b)
        }
        timer list 1
        mat T[`r',1] = r(t1)
    }
    if !mod(`r',10) di as txt "." _c
}
// version 2: results matrices piling up
forv r=1/`reps' {
    quietly {
        timer clear
        forv i = 1/10 {
            timer on 1
            mean y [pw=w]
            timer off 1
            mat b_`r'_`i' = e(b)
        }
        timer list 1
        mat T[`r',2] = r(t1)
    }
    if !mod(`r',10) di as txt "." _c
}
// plot
drop _all
svmat T
matrix drop _all
gen n = _n
lab var T1 "single results matrix"
lab var T2 "results matrices piling up"
scatter T1 T2 n, ms(o o) yti("Execution time")

Click image for larger version

Name: Graph.png
Views: 1
Size: 44.3 KB
ID: 1784948

Comment

Felix Bittmann

Join Date: Aug 2018

Posts: 838
#14

18 Feb 2026, 06:31

Thanks for this nice example. It seems that the increase in the red dots is nonlinear. Does this explain the sudden increase at around 3k? It would be nice to extend this simulation but it might take some time to compute...

Last edited by Felix Bittmann; 18 Feb 2026, 07:11. Reason: added graph

Best wishes

Stata 18.0 MP | ORCID | Google Scholar
Comment
Jeff Pitblado (StataCorp)

StataCorp Employee

Join Date: Mar 2014

Posts: 747
#15

04 Mar 2026, 14:03

We have found where Stata is spending extra time when the number of Stata matrices gets moderately large.

Thank you to Stephen Jenkins for bringing this to our attention.

Thank you to Ben Jann for providing an example that greatly helped us diagnose the issue.

We hope to improve Stata's performance in this case in a future Stata 19 update.
5 likes
Comment

Announcement