Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Monte Carlo simulation: puzzling slowdown in runtime when number of reps reaches 3k-4k

    A colleague and I are undertaking Monte-Carlo simulation analysis to assess and compare the performance in terms of size of 4 different variance estimators for a form of difference in means test. The simulation code is necessarily bespoke –- we are not using -simulate- -- and output is accumulated at each MC repetition using -post- and its siblings. We need to run 10,000 MC replications.

    Issue: the code is fast until it hits around 3,000 to 4,000 replications, and then slows down markedly.

    Question: Any tips please regarding why this slowdown occurs? And how it can be avoided?

    We don’t think the issue related to the size of the postfile. One comment we’ve had, from Claude AI, is that the slowdown is due to the accumulation of entries in Stata's tempvar and tempfile macro “registry”. (I don't think Stata has a specific macro registry; presumably it's referring to current working memory. We have large amounts available - Stata 19/MP with ~64Mb RAM.) The argument was that even though names are cleared, the registry grows indefinitely: each new tempvar checks against all previous names to ensure uniqueness, causing slowdown." We don’t know how to check the veracity of this remark -- any comments? . -help limits- doesn’t provide information to indicate that there are limits of the kind suggested.

    One of the options suggested to address the problem was to use -file write- instead of -post-. We've not heard of this before and wonder if others have used it. Also, is it likely that -frame post- would address the problem? (Perhaps not if the fundamental problem is to do with the accumulation of entries in the macro registry.)

  • #2
    Originally posted by Stephen Jenkins View Post
    One comment we’ve had, from Claude AI, is that the slowdown is due to the accumulation of entries in Stata's tempvar and tempfile macro “registry”. . . . The argument was that even though names are cleared, the registry grows indefinitely: each new tempvar checks against all previous names to ensure uniqueness, causing slowdown." We don’t know how to check the veracity of this remark -- any comments?
    Assuming that you're invoking tempvar and tempfile only within a program (or ado-file) that you call for the repetition, then results from executing the code below do not corroborate Claude AI's take on things—there's no prolonging of the execution time over three progressive epochs of 2500 repetitions each, despite invoking all three of tempvar, tempfile and tempname in each repetition. Elapsed time for the third epoch (repetitions 5001 through 7500) is within a second and a half of that for the first (repetitions 1 through 2500), and indeed is fortuitously shorter. (Complete do-file and log file are attached.)

    One of the options suggested to address the problem was to use -file write- instead of -post-. We've not heard of this before and wonder if others have used it. Also, is it likely that -frame post- would address the problem? (Perhaps not if the fundamental problem is to do with the accumulation of entries in the macro registry.)
    I doubt that getting down and dirty with file will make your life easier. The example below uses the frame post approach, but if you're opening only a single postfile for the 10 000 repetitions, then I can't see how that could be the source of the slowdown.
    Code:
    version 19
    
    clear *
    
    // seedem
    set seed 1898961433
    
    program define simem
        version 19
        syntax , Frame(string)
    
        drop _all
        set obs 14
    
        // -tempvar- check
        tempvar y x
        generate byte `x' = mod(_n, 2)
        generate double `y' = rexponential(`x'+1)
    
        // -tempfile- check
        tempfile dataset
        save `dataset'
    
        foreach vce in ols robust hc2 hc3 {
            regress `y' i.`x', vce(`vce')
            tempname `vce'
            scalar define ``vce'' = r(table)["pvalue", "1.`x'"]
        }
    
        foreach dfa in unequal welch {
            ttest `y', by(`x') `dfa'
            if "`dfa'" == "unequal" {
                tempname satterthwaite
                scalar define `satterthwaite' = r(p)
            }
            else {
                tempname `dfa'
                scalar define ``dfa'' = r(p)
            }
        }
    
        frame post `frame' (`ols') (`robust') (`hc2') (`hc3') ///
            (`satterthwaite') (`welch')
    end
    
    frame create Store double(ols robust hc2 hc3 satterthwaite welch)
    
    timer clear
    
    forvalues timer = 1/3 {
    
        timer on `timer'
    
        forvalues i = 1/2500 {
            quietly simem , f(Store)
        }
    
        timer off `timer'
    }
    
    timer list
    
    exit
    Attached Files

    Comment


    • #3
      Joseph Coveney Many thanks for giving our problem your close attention - helpful. My colleague will discuss things further in the light of your info. (In the meantime, we've been splitting the 10 000 reps into smaller batches, so as to avoid the problem, and then combining results.) Thanks again

      Comment


      • #4
        Is there by chance anywhere in the code some place where c(matsize) is changed? I have a long shot idea in that regard. Some years ago, I discovered (after considerably puzzlement and swearing) that using larger values of c(matsize), even when not actually used in the code, could drastically slow down simulations like this. (I posted here about it and somewhere in the list archives, a discussion of that occurred.) That was probably Stata 12 or so, and the changes to how c(matsize) is managed probably make this comment irrelevant strictly speaking, but I thought I'd throw it out. Perhaps there is something analogous to this going on.

        Comment


        • #5
          Originally posted by Stephen Jenkins View Post
          My colleague will discuss things further in the light of your info.
          Your colleague wouldn't happen to be programming something analogous to this somewhere in the bespoke simulation code?
          Code:
          version 19
          
          clear *
          
          // seedem
          set seed 388361189
          
          program define simem
              version 19
              syntax , m(name)
           
              tempname Row
              matrix define `Row' = J(1, 4, .)
              forvalues i = 1/4 {
                  matrix define `Row'[1, `i'] = runiform() 
              }
              matrix define `m' = (nullmat(`m') \ `Row')
          end
          
          tempname M
          
          timer clear
          
          forvalues epoch = 1/3 {
          
              timer on `epoch'
          
              forvalues rep = 1/2500 {
                  simem , m(`M')
              }
          
              timer off `epoch'
          }
          
          timer list
          
          display in smcl as text "`c(matsize)'"
          
          exit
          Although c(matsize) is not changed in this code (it might be deprecated by now), Mike does bring up an interesting point regarding the potential for in-memory matrix operations to affect speed in that if you're accreting rows or columns to a matrix, then Stata has to do more and more to-and-fro from memory to on-chip cache. I'm not sure, but it might even try asking the operating system repeatedly to locate contiguous patches of memory to store the ever-growing matrix.

          Either way (or both ways), that overhead would progressively slow things down as the matrix grows: in this example (complete do-file and log file attached), the execution time grows threefold each epoch of 2500 repetitions, from 7½ seconds at first to over 46 seconds in the third.

          (Likewise, writing and reading back from an accumulating disc file, then the growing I/O operations would analogously affect execution time, but I assume that that point of weakness would be obvious to your colleague.)
          Attached Files

          Comment


          • #6
            Joseph Coveney Mike Lacy Thanks for your pointers! I'm passing them on to my colleague and will rreport back

            Comment


            • #7
              My colleague doesn't think those matrix issues you raised are the issue. In the meantime, he has identified the specific part of the sim code that appears to make the slowdown appear. Here's an extract

              Code:
                      forval r = 1/$rep {
              
                  //To post DF results
                  tempname results2
                  postfile `results2' step rep bsrep ///
                      theilboot theilbootV theil_Wstar  ///
                      giniboot ginibootV gini_Wstar  ///
                      mldboot mldbootV mld_Wstar  ///
                      using "$out\bs\df_${B1}_${B2}_simbs_driver_M${M}_`year'_r${rep}_s`step'.dta", replace  // ####    
              
                      ...
              
                      local nreps = $B2 - $B1 + 1
              
                      _dots 0, title(Loop running) reps(`nreps')
                      qui forvalues b = $B1/$B2 {
                              use "`temp`r''", clear // ###
                              preserve
                                  keep if R == 1 // Rich
                                  sort inc
                                  drop inc
                                  tempvar y1
                                  gen `y1' = rpareto(alpha, x0)
                                  egen inc = clsort(`y1')
                                  save "`temp'", replace  // #####
                              restore
                                  bsample if R == 0 // Non-rich 
                                  append using "`temp'"  // ####
                                  
                          dstat ($mlist) inc [pw = wgt]
                          local i = 1
                          foreach meas of global mlist {    
                              local `meas'boot = e(b)[1,`i']
                              local `meas'bootV = e(V)[`i',`i']
                              local `meas'_Wstar = (``meas'boot' - `meas'_0star)/sqrt(``meas'bootV')    
                              local i = `i' +1
                          }
                      
                          post `results2' (`step') (`r') (`b') ///
                              (`theilboot') (`theilbootV') (`theil_Wstar')  ///
                              (`giniboot') (`ginibootV') (`gini_Wstar')  ///
                              (`mldboot') (`mldbootV') (`mld_Wstar')                      
                                      
                          noisily _dots `b' 0
                      }    //BS loop  
              
                  postclose `results2'          
              
                  } //rep loop
              PS in case you're curious, what we are doing is assessing the performance of a two-sample test that 2 scalar income inequality indices are equal (for multiple indices). In this case, the test statistic is based on a semiparametric percentile-t bootstrap approach in which the top x% of incomes are replaced by draws from a Pareto distribution fit to that x% (Davidson and Flachaire, Journal of Econometrics, 2007). We are undertaking a performance 'beauty contest' between this approach and 3 others.

              Comment


              • #8
                Is the tmp directory accumulating a new file entry for each replication? Perhaps creating or referenceing a new file is slow in such a large directory. Can you delete the temporary file before starting the next replication?

                Comment


                • #9
                  Apologies for a mistake made in copy-pasting that code extract. The -postfile- command comes before the rep loop, not inside it. Thanks, Daniel Feenberg. I presume you referring to the "tempvar y1" line in the -forvalues- loop. We'll check it out.

                  Comment


                  • #10
                    I meant these lines:

                    Code:
                    tempfile dataset
                    save `dataset'
                    but I don't know how many times that is executed, or if the tempfiles acculate to a problematic degree.

                    Comment


                    • #11
                      Originally posted by Stephen Jenkins View Post
                      My colleague . . . has identified the specific part of the sim code that appears to make the slowdown appear. Here's an extract
                      Code:
                      forval r = 1/$rep {
                          . . .
                          qui forvalues b = $B1/$B2 {
                              use "`temp`r''", clear // ###
                              . . . 
                          } //BS loop
                          . . .
                      
                      } //rep loop
                      So, if r is the repetition loop counter and it goes from 1 to 10 000, then it appears that you accumulate to ten thousand "`temp`r''" temporary files that stick around until looping ends. If so, then that seems where the gradual slowdown occurs.

                      You can always home in on the location in the code by placing a couple of timers, with their activation (timer on # & timer off #) made conditional on the loop counter. Something like the following.
                      Code:
                      timer clear // outside the loop
                      
                      forvalues rep = 1/$rep {
                      
                          if `rep' < 1000 timer on 1
                          if `rep' > 3000 timer on 2
                      
                          . . .  // <- lines of suspected code #1
                      
                          timer off 1 // <- these won't throw an exception even if they're not on
                          timer off 2
                      }

                      Comment


                      • #12
                        Thanks again, Joseph Coveney . As it happens, we anticipated your remarks telepathically and placed various timers in the code. It turns out that most probable source of the slowdown issue is an accumulation of Stata matrices as in the code snippet below. That bit of code is not in the extract I posted earlier in the thread unfortunately.

                        Code:
                                foreach meas of global mlist {
                                
                                    dstat (`meas') inc [pw = wgt], over(`groupa')            
                                    *mat list e(b)
                                    tempname est
                                    mat `est' = e(b)
                                    mat list `est'
                        
                        <output omitted>
                        Relatedly, have a look too at the screenshot of the output from the -memory- command from a couple of points along the repetitions trail. Observe the count of Stata matrices for the later repetition (LHS) compared to an earlier repetition (RHS) and the timers' output
                        Click image for larger version

Name:	memory_use.png
Views:	1
Size:	59.0 KB
ID:	1784945



                        We owe many thanks to Ben Jann for his insights on this issue, as well as his advice about how to avoid it ... and improve our program more generally. Ben also observed that "My personal opinion is that the number of matrices and the accumulated memory they consume is too small to justify such a drastic performance effect. I would suggest forwarding this exchange to Stata Corp. Maybe there is an issue they will want to fix." My colleague and I are looking into doing this.
                        Thanks again to everyone for helping us identify and resolve the issue.

                        Comment


                        • #13
                          Yeah, the accumulation of matrices in memory is responsible for the slowdown. Below is a brief simulation using the mean command. If results are collected in separate matrices, execution of mean gets slower and slower.

                          Code:
                          set seed 23754987
                          local reps 300
                          clear all
                          set obs 100
                          gen y = exp(rnormal())
                          gen w = runiform()
                          mean y [pw=w]
                          matrix T = J(`reps',2,.)
                          // version 1: single results matrix
                          forv r=1/`reps' {
                              quietly {
                                  timer clear
                                  forv i = 1/10 {
                                      timer on 1
                                      mean y [pw=w]
                                      timer off 1
                                      mat b = e(b)
                                  }
                                  timer list 1
                                  mat T[`r',1] = r(t1)
                              }
                              if !mod(`r',10) di as txt "." _c
                          }
                          // version 2: results matrices piling up
                          forv r=1/`reps' {
                              quietly {
                                  timer clear
                                  forv i = 1/10 {
                                      timer on 1
                                      mean y [pw=w]
                                      timer off 1
                                      mat b_`r'_`i' = e(b)
                                  }
                                  timer list 1
                                  mat T[`r',2] = r(t1)
                              }
                              if !mod(`r',10) di as txt "." _c
                          }
                          // plot
                          drop _all
                          svmat T
                          matrix drop _all
                          gen n = _n
                          lab var T1 "single results matrix"
                          lab var T2 "results matrices piling up"
                          scatter T1 T2 n, ms(o o) yti("Execution time")
                          Click image for larger version

Name:	Graph.png
Views:	1
Size:	44.3 KB
ID:	1784948

                          Comment


                          • #14
                            Thanks for this nice example. It seems that the increase in the red dots is nonlinear. Does this explain the sudden increase at around 3k? It would be nice to extend this simulation but it might take some time to compute...

                            Click image for larger version

Name:	sim.png
Views:	1
Size:	75.1 KB
ID:	1784952
                            Last edited by Felix Bittmann; 18 Feb 2026, 07:11. Reason: added graph
                            Best wishes

                            Stata 18.0 MP | ORCID | Google Scholar

                            Comment


                            • #15
                              We have found where Stata is spending extra time when the number of Stata matrices gets moderately large.

                              Thank you to Stephen Jenkins for bringing this to our attention.

                              Thank you to Ben Jann for providing an example that greatly helped us diagnose the issue.

                              We hope to improve Stata's performance in this case in a future Stata 19 update.

                              Comment

                              Working...
                              X