  • Make many regressions run faster

    Most posts I've seen here about making something run faster are in reference to folks with very large datasets. I have a different problem: a fairly small dataset but extremely large do files w/ many regressions. I'm doing some simulation work and have a do file with roughly 1,000 regression models, and I run this a few different times (for different groups), so I end up running something like 4,000 total regressions at once. This seems to take roughly 30 minutes on my computer.

    My question is: does anyone have advice for trying to get this to run faster? I've added -qui- before both the -reg- command and the -regsave- command (since I'm saving the output from each model), which helps. Any other ideas?

  • #2
    It’s not really clear how much room there is to improve speed. Weakly identified models can dramatically increase runtime because the estimator searches the parameter space for longer, so there may be room for improvement through a different parameterization. Some estimators are faster than others, but let’s assume you are using -regress- and the model is well behaved, in which case that’s already a fast algorithm. Factors like single-core processing speed and a higher Stata MP license will help, but the expense is probably not worth it for relatively modest time savings; you might improve timing by up to 2x at most. It may also be that some post-estimation commands (such as margins) are adding to the overall run time but are not necessary or could be streamlined.

    General advice with any stats software is that it will always get slower as either the dataset or the number of repetitive tasks grows. It’s not unheard of or surprising that, even with small to moderate datasets, simulations that use only a few thousand repetitions can take several hours or even days. This isn’t the same situation as yours, of course, but it is similar: roughly 4,000 regressions in 30 minutes means individual regressions clock in at around half a second with your data.
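To see whether the estimation itself or a post-estimation step dominates the run time, the two can be timed separately. This is only a minimal sketch: it uses the shipped auto dataset, 50 repetitions, and -margins- as a stand-in for whatever post-estimation step a real do-file runs, all of which are assumptions for illustration.

```stata
* Sketch: time the estimation and a post-estimation step separately.
* The dataset, the model, and the 50 repetitions are illustrative assumptions.
sysuse auto, clear
timer clear
forvalues i = 1/50 {
    timer on 1
    quietly regress price weight length
    timer off 1
    timer on 2
    quietly margins, dydx(weight)
    timer off 2
}
* timer 1 = regress, timer 2 = margins
timer list
```

If timer 2 turns out to be much larger than timer 1, the post-estimation step is the first place to look.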


    • #3
      In addition to what is in #2, you should show us your do file; maybe someone will then have a concrete suggestion.


      • #4
        Replying to Leonardo Guizzetti (#2):
        Thanks, this basically confirms what I suspected, which is that there isn't much one can do (at least from a coding standpoint...I don't want to get a new and faster computer).

        Rich Goldstein, the do file has nothing in it but -qui reg- with different combinations of variables, and a -qui regsave- after each model to store estimates.


        • #5
          -regsave- is a user-written command. I am not familiar with it. But it is possible that writing your own code to save the desired regression results in a data file would result in improved performance. However, even if that is the case, the time it would take you to write such code probably outweighs the savings in execution time given that you are only talking about 30 minutes here. Unless this same do-file is going to be run many, many times, it would not be worthwhile to try this.
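For what it is worth, a hand-rolled version of "save the results from each model" can be quite short with the official -postfile- commands. The following is a sketch only, not a drop-in replacement for -regsave-: the dataset, the list of predictors, and the variable names model, b and se are assumptions made for illustration.

```stata
* Sketch: collect one coefficient and standard error per model with -postfile-.
* Predictors and result variable names are illustrative assumptions.
sysuse auto, clear
tempname handle
tempfile results
postfile `handle' str32 model double b double se using "`results'", replace
foreach x of varlist weight length mpg {
    quietly regress price `x'
    * one row per model: predictor name, its coefficient, its standard error
    post `handle' ("`x'") (_b[`x']) (_se[`x'])
}
postclose `handle'

* load the collected results
use "`results'", clear
list
```

Whether this beats -regsave- in practice would have to be timed; the point is only that the results file is written by a built-in command with very little overhead per model.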


          • #6
            I think it's possible, though not likely, that some simple parallelizing could help. I have had the experience that two identical jobs with multiple repetitions, run in separate instances of Stata, take only about 10% more time than running either one of them alone. Whether this kind of trick would help in this situation depends on how well -regress- is parallelized internally, among many other things. I'd admit it's a long shot, but experimenting should be easy: just -set processors n- in each do-file (where n is 1/2 of the processors in your Stata version), and give each do-file 1/2 of the repetitions to do. Start one instance running, then open another instance of Stata, and start that one working.

            There is at least one community-contributed parallelizing package for Stata (-ssc describe parallel-), but I'd experiment in this simple way first.
            Last edited by Mike Lacy; 17 Jul 2022, 13:23.
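A rough sketch of how the work might be divided between two instances follows. The program name run_chunk, its two arguments, and the list of predictors are hypothetical and only stand in for the real list of models; -set processors- is only attempted when the running flavor is Stata/MP.

```stata
* Sketch: each Stata instance runs only the models assigned to its chunk.
* run_chunk, its arguments, and the predictor list are hypothetical.
capture program drop run_chunk
program define run_chunk, rclass
    args chunk nchunks
    local xvars weight length mpg headroom trunk turn
    local i 0
    local done 0
    foreach x of local xvars {
        local ++i
        * model i is assigned to chunk mod(i - 1, nchunks) + 1
        if mod(`i' - 1, `nchunks') + 1 != `chunk' {
            continue
        }
        quietly regress price `x'
        display "chunk `chunk': price on `x', b = " %9.4f _b[`x']
        local ++done
    }
    return scalar nmodels = `done'
end

sysuse auto, clear
if c(MP) {
    * give this instance about half of the available processors
    local half = max(1, floor(c(processors_max) / 2))
    set processors `half'
}
* first instance; the second instance would call: run_chunk 2 2
run_chunk 1 2
```

Each instance would also need to write its results to its own file, to be appended afterwards.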


            • #7
              ok (re: #4) - I was just thinking that, for example, use of "if" in your command would very much slow down what was going on - but the implication of your answer is that there is nothing like this
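If the models were restricted with "if", one thing to try would be to subset the data once per group and then run the unrestricted commands. The sketch below merely times both variants on the auto dataset; the group condition and the 200 repetitions are assumptions, and on data this small the difference may well be negligible.

```stata
* Sketch: "if" in every command versus subsetting the data once.
* The group condition and repetition count are illustrative assumptions.
sysuse auto, clear
timer clear
forvalues i = 1/200 {
    timer on 1
    quietly regress price weight if foreign == 1
    timer off 1
}
preserve
keep if foreign == 1
forvalues i = 1/200 {
    timer on 2
    quietly regress price weight
    timer off 2
}
restore
* timer 1 = with "if", timer 2 = after -keep if-
timer list
```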


              • #8
                Thanks all.


                • #9
                  There are no easy solutions here, as unfortunately there is no secret and undocumented setting (imaginary code follows):

                  Code:
                  set runcode fast, permanent
                  In your situation it is probably too late to apologise, because probably you would have to rewrite your do file completely to speed it up.

                  If you decide to rewrite, there are a couple of things you can do to speed things up.

                  1. The advice of Mike in #6. You can split your job into multiple do files and manually run multiple Statas in parallel, one per do file. I call this "poor man's multiprocessor Stata", and the rule I use is to run as many Statas in parallel as there are physical cores on my computer. For parallelisable tasks such as yours the speed-up is tremendous.

                  2. You can check whether the internal -_regress- does the job faster than the standard -regress-. I have found the former to be faster in some instances, but this might only hold for old Statas; I have not experimented with this since I got Stata 17 MP.

                  3. You can check out the user-contributed package -gtools-, in particular -gregress- (which is in beta status as of now). The author writes his contributions as C plugins, and his tools are fast as lightning. The problem is that he does not really polish his contributions, so the burden of making this work for practical purposes would fall heavily on you, the user.
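Point 2 is easy to try on your own machine. A minimal sketch, assuming that -_regress- accepts the same basic syntax as -regress- for a plain OLS model (it is an internal command, so treat that as an assumption and check the results before relying on it):

```stata
* Sketch: compare -regress- with the internal -_regress- on the same model.
* Assumes -_regress- takes the same basic syntax for plain OLS.
sysuse auto, clear
timer clear
forvalues i = 1/200 {
    timer on 1
    quietly regress price weight length
    timer off 1
    timer on 2
    quietly _regress price weight length
    timer off 2
}
* timer 1 = regress, timer 2 = _regress
timer list
```

Any difference is likely to depend on the Stata version, so the numbers from one machine should not be generalised.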


                  • #10
                    Thanks Joro. Something else I've been wondering is if there is an even more advanced form of -qui-, which would suppress both the results and the Stata commands themselves--essentially, so that nothing would show up in the terminal window. For a few thousand models, if it saves even a half second to print out the command in the terminal window, this would save a bit of time.


                    • #11
                      Replying to Anne Todd (#10):
                      You can run your do file in something called batch mode, which runs Stata from the command line without its “head” (its graphical user interface). You would need to log your results or else you won’t see any output. I’ve never tried to benchmark whether this is any faster than the GUI version of Stata, but it may be somewhat faster.


                      • #12
                        You don't need to use batch mode to do this, although that will work. You can just launch Stata and then execute your do-file with the -run- command instead of the -do- command. -run- suppresses echoing of commands.
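A small self-contained sketch of the idea: it writes a two-line do-file (the file name quiet_sketch.do is hypothetical and is created in the current working directory), executes it with -run-, and then shows that the estimation results are still available even though nothing was echoed.

```stata
* Sketch: -run- executes a do-file without echoing commands or output.
* quiet_sketch.do is a hypothetical file written to the current directory.
tempname fh
file open `fh' using "quiet_sketch.do", write text replace
file write `fh' "sysuse auto, clear" _n
file write `fh' "regress price weight" _n
file close `fh'

* nothing from the do-file appears in the Results window
run quiet_sketch.do

* the stored results are still there afterwards
display "N = " e(N)
erase quiet_sketch.do
```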

                        That said, it sounds like you have already spent more than the 30 minutes of execution time for this file trying to squeeze a bit more efficiency out of it. I don't understand that. Is this some kind of production file that will be used many, many times, so that the small saving you might eke out from this will add up over time and justify the effort you are putting into it? Is it a program that needs to run in real time? (If the latter, it probably makes more sense to use a different application altogether, one that is compiled rather than interpreted and allows you to control the computer at a low level, like C++.)

                        I suppose my reaction to this is in part based on my own experiences. In my workflow, a 30 minute execution is something I consider normal. I am quite accustomed to programs that run for weeks, and, occasionally, months. It's hard for me to understand the fuss over 30 minutes.


                        • #13
                          I'll likely end up running similar files in the future, though not too many times. At this point it's mostly just that my curiosity has been piqued!


                          • #14
                            Here's an alternative that meets your description of a "more advanced form of quietly".
                            Code:
                            sysuse auto, clear
                            quietly regress price weight
                            quietly regress price length
                            quietly {
                                regress price weight
                                regress price length
                            }
                            produces in the Stata Results window
                            Code:
                            . sysuse auto, clear
                            (1978 automobile data)
                            
                            . quietly regress price weight
                            
                            . quietly regress price length
                            
                            . quietly {
                            
                            .
                            end of do-file
                            Added in edit: the noisily prefix overrides the effect of an enclosing quietly.
                            Code:
                            sysuse auto, clear
                            quietly regress price weight
                            quietly regress price length
                            quietly {
                                regress price weight
                                regress price length
                                noisily display "done!"
                            }
                            Code:
                            . sysuse auto, clear
                            (1978 automobile data)
                            
                            . quietly regress price weight
                            
                            . quietly regress price length
                            
                            . quietly {
                            done!
                            
                            .
                            end of do-file
                            Last edited by William Lisowski; 18 Jul 2022, 09:38.


                            • #15
                              Depending on what exactly you want to do, significant speed gains may result from switching to Mata. The moremata package (see https://github.com/benjann/moremata/; type ssc install moremata to install the package) offers an efficient and precise implementation of least-squares estimation; see help mata mm_ls() after installing moremata. Depending on the situation, mm_ls() can be substantially faster than regress (less overhead etc.). Here's a comparison (1000 regressions on 1000 observations and 10 predictors):

                              Code:
                              timer clear
                              
                              // regress
                              forv i=1/1000 {
                                  qui drawnorm y x1-x10, double clear n(1000)
                                  timer on 1
                                  qui regress y x1-x10
                                  timer off 1
                              }
                              
                              // mata mm_ls()
                              mata:
                                  n = 1000
                                  for (i=1;i<=1000;i++) {
                                      y = rnormal(n,1,0,1)
                                      X = rnormal(n,10,0,1)
                                      timer_on(2)
                                      b = mm_lsfit(y, X)
                                      timer_off(2)
                                  }
                              end
                              
                              timer list
                              Result on my computer:

                              Code:
                              . timer list
                                 1:      8.57 /     1000 =       0.0086
                                 2:      0.66 /     1000 =       0.0007
                              regress used 8.57 seconds for the 1000 regressions; mm_ls() only used 0.66 seconds.

                              ben
