Make many regressions run faster

Anne Todd

Join Date: Dec 2018

Posts: 163
#1


17 Jul 2022, 10:19

Most posts I've seen here about making something run faster are in reference to folks with very large datasets. I have a different problem: a fairly small dataset but extremely large do-files with many regressions. I'm doing some simulation work and have a do-file with roughly 1,000 regression models, and I run this a few different times (for different groups), so I end up running something like 4,000 total regressions at once. This takes roughly 30 minutes on my computer.

My question is: does anyone have advice for trying to get this to run faster? I've added -qui- before both the -reg- command and the -regsave- command (since I'm saving the output from each model), which helps. Any other ideas?
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2403
#2

17 Jul 2022, 11:15

It’s not really clear how much room there is to improve speed. Weakly identified models can dramatically increase runtime because the estimator searches the parameter space for longer, so there may be room for improvement through a different parameterization. Some estimators are faster than others, but let’s assume you are using -regress- and the model is well behaved, in which case that’s already a fast algorithm. Factors like single-core processing speed and a higher-core Stata/MP license will help, but they are probably not worth the expense for relatively modest time savings; you might improve timing by up to 2x at most. It may also be that some post-estimation commands (such as -margins-) are adding to the overall run time and are either unnecessary or could be streamlined.

General advice with any stats software is that it will always be slower as either the dataset size or the number of repetitive tasks increases. It’s not unheard of or surprising that, even with small to moderate datasets, simulations that use only a few thousand repetitions can take several hours or even days. This isn’t the same situation as yours, of course, but it is similar, and individual regressions clock in at around half a second with your data.
Rich Goldstein

Join Date: Mar 2014

Posts: 4466
#3

17 Jul 2022, 11:34

In addition to what is in #2, you should show us your do-file; maybe someone will have a concrete suggestion.
Anne Todd

Join Date: Dec 2018

Posts: 163
#4

17 Jul 2022, 11:48

Originally posted by Leonardo Guizzetti

It’s not really clear how much room there is to improve speed. Weakly identified models can dramatically increase runtime because the estimator searches the parameter space for longer, so there may be room for improvement through a different parameterization. Some estimators are faster than others, but let’s assume you are using -regress- and the model is well behaved, in which case that’s already a fast algorithm. Factors like single-core processing speed and a higher-core Stata/MP license will help, but they are probably not worth the expense for relatively modest time savings; you might improve timing by up to 2x at most. It may also be that some post-estimation commands (such as -margins-) are adding to the overall run time and are either unnecessary or could be streamlined.

General advice with any stats software is that it will always be slower as either the dataset size or the number of repetitive tasks increases. It’s not unheard of or surprising that, even with small to moderate datasets, simulations that use only a few thousand repetitions can take several hours or even days. This isn’t the same situation as yours, of course, but it is similar, and individual regressions clock in at around half a second with your data.

Thanks, this basically confirms what I suspected, which is that there isn't much one can do (at least from a coding standpoint...I don't want to get a new and faster computer).

Rich Goldstein, the do file has nothing in it but -qui reg- with different combinations of variables, and a -qui regsave- after each model to store estimates.
Clyde Schechter

Join Date: Apr 2014

Posts: 30118
#5

17 Jul 2022, 12:39

-regsave- is a user-written command. I am not familiar with it. But it is possible that writing your own code to save the desired regression results in a data file would result in improved performance. However, even if that is the case, the time it would take you to write such code probably outweighs the savings in execution time given that you are only talking about 30 minutes here. Unless this same do-file is going to be run many, many times, it would not be worthwhile to try this.
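For what it's worth, one way to "write your own code" for this is official Stata's -postfile- suite. What follows is only a minimal sketch under assumptions: the auto dataset, the three predictors, and the results file name are placeholders, and whether this is actually faster than -regsave- would have to be timed on the real do-file.

```stata
* Sketch only: collect one coefficient and its standard error per model with -postfile-.
* The dataset, the predictor list, and the file name "results" are illustrative assumptions.
sysuse auto, clear
tempname handle
postfile `handle' str32 xvar double(b se) long n using results, replace
foreach x of varlist weight length mpg {
    quietly regress price `x'
    * one row per model: predictor name, coefficient, standard error, sample size
    post `handle' ("`x'") (_b[`x']) (_se[`x']) (e(N))
}
postclose `handle'

* inspect the collected estimates
use results, clear
list
```

Each call to -post- appends a single row to the results file, so the per-model overhead is limited to evaluating the posted expressions.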
Mike Lacy

Join Date: Apr 2014

Posts: 2417
#6

17 Jul 2022, 12:40

I think it's possible, though not likely, that some simple parallelizing could help. I have had the experience that two identical jobs with multiple repetitions, run in separate instances of Stata, take only about 10% more time than running either one of them alone. Whether this kind of trick would help in this situation would depend on how well -regress- is parallelized internally, among many other things. I'd admit it's a long shot as to whether this would help, but experimenting should be easy: just -set processors n- in each do-file (where n is 1/2 of the processors in your Stata version), and give each do-file 1/2 of the repetitions to do. Start one instance running, then open another instance of Stata, and start that one working.

There is at least one community-contributed parallelizing package for Stata (-ssc describe parallel-), but I'd experiment in this simple way first.
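A minimal sketch of what one of the two do-files might look like, assuming the models can be indexed by position in a list. The file names part1.do and part1_results, the auto dataset, and the predictor list are all hypothetical; the second do-file would be identical except for the index range and the output file name.

```stata
* part1.do (hypothetical name): first half of the models.
* part2.do would use first = 4, last = 6 and write to part2_results.
sysuse auto, clear
capture set processors 1      // Stata/MP only; pick about half of your licensed cores
local xvars weight length mpg trunk headroom turn
local first 1
local last  3
tempname handle
postfile `handle' str32 xvar double(b se) using part1_results, replace
forvalues j = `first'/`last' {
    local x : word `j' of `xvars'
    quietly regress price `x'
    post `handle' ("`x'") (_b[`x']) (_se[`x'])
}
postclose `handle'
```

Once both instances have finished, the two result files can be combined with -use part1_results, clear- followed by -append using part2_results-.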

Last edited by Mike Lacy; 17 Jul 2022, 13:23.
Rich Goldstein

Join Date: Mar 2014

Posts: 4466
#7

17 Jul 2022, 12:41

ok (re: #4) - I was just thinking that, for example, use of "if" in your command would very much slow down what was going on - but the implication of your answer is that there is nothing like this
Anne Todd

Join Date: Dec 2018

Posts: 163
#8

17 Jul 2022, 14:00

Thanks all.
Joro Kolev

Join Date: Aug 2018

Posts: 3050
#9

18 Jul 2022, 00:11

There are no easy solutions here, as unfortunately there is no secret and undocumented setting (imaginary code follows):

Code:

set runcode fast, permanent

In your situation it is probably too late to apologise, because probably you would have to rewrite your do file completely to speed it up.

If you decide to rewrite, there are a couple of things you can do to speed things up.

1. The advice of Mike in #6. You can split your job into multiple do-files and manually run multiple instances of Stata in parallel, one per do-file. I call this "poor man's multiprocessor Stata", and the rule I use is that I run as many instances of Stata in parallel as there are physical cores on my computer. For parallelisable tasks such as yours the speed-up is tremendous.

2. You can check whether the internal -_regress- does the job faster than the standard -regress-. I have found the former to be faster in some instances, but this might only hold for old versions of Stata; I have not experimented with this since I got Stata 17 MP.

3. You can check out the user-contributed package -gtools-, in particular -gregress- (which is in beta status as of now). The author writes his contributions as C plugins, and his tools are fast as lightning. The problem is that he does not really polish his contributions, so a heavy burden would fall on you, the user, to make this work for practical purposes.
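For point 2, a quick way to check on your own machine is to time both commands on the same model. This is only a sketch: -_regress- is an internal command, so its behaviour may differ across Stata versions, and the auto dataset and the model are placeholders.

```stata
* Sketch: time -regress- against the internal -_regress- on the same model.
* The dataset and the model are placeholders; -_regress- is internal, so treat results with care.
sysuse auto, clear
timer clear
forvalues i = 1/1000 {
    timer on 1
    quietly regress price weight length
    timer off 1
    timer on 2
    quietly _regress price weight length
    timer off 2
}
timer list    // timer 1 = regress, timer 2 = _regress
```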
Anne Todd

Join Date: Dec 2018

Posts: 163
#10

18 Jul 2022, 05:54

Thanks Joro. Something else I've been wondering is if there is an even more advanced form of -qui-, which would suppress both the results and the Stata commands themselves--essentially, so that nothing would show up in the terminal window. For a few thousand models, if it saves even a half second to print out the command in the terminal window, this would save a bit of time.
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2403
#11

18 Jul 2022, 06:09

Originally posted by Anne Todd

Thanks Joro. Something else I've been wondering is if there is an even more advanced form of -qui-, which would suppress both the results and the Stata commands themselves--essentially, so that nothing would show up in the terminal window. For a few thousand models, if it saves even a half second to print out the command in the terminal window, this would save a bit of time.

You can run your do-file in something called batch mode, which is to run Stata from the command line without its “head” (its graphical user interface). You would need to log your results or else you won’t see any output. I’ve never tried to benchmark whether this is any faster than the GUI version of Stata, but it may be somewhat faster.
Clyde Schechter

Join Date: Apr 2014

Posts: 30118
#12

18 Jul 2022, 08:58

You don't need to use batch mode to do this, although that will work. You can just launch Stata and then execute your do-file with the -run- command instead of the -do- command. -run- suppresses echoing of commands.
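A small sketch of how this could be timed, assuming the regressions live in a do-file called manyregs.do (a placeholder name) that does not itself use timer 1:

```stata
* "manyregs.do" is a placeholder for your own do-file.
timer clear
timer on 1
run manyregs.do      // like -do-, but nothing is shown in the Results window
timer off 1
timer list           // compare against the same timing with -do manyregs.do-
```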

That said, it sounds like you have already spent more than the 30 minutes of execution time for this file trying to squeeze a bit more efficiency out of it. I don't understand that. Is this some kind of production file that will be used many, many times, so that the small saving you might eke out from this will add up over time and justify the effort you are putting into it? Is it a program that needs to run in real time? (If the latter, it probably makes more sense to use a different application altogether, one that is compiled rather than interpreted and allows you to control the computer at a low level, like C++.)

I suppose my reaction to this is in part based on my own experiences. In my workflow, a 30 minute execution is something I consider normal. I am quite accustomed to programs that run for weeks, and, occasionally, months. It's hard for me to understand the fuss over 30 minutes.
Anne Todd

Join Date: Dec 2018

Posts: 163
#13

18 Jul 2022, 09:27

I'll likely end up running similar files in the future, though not too many times. At this point it's mostly just that my curiosity has been piqued!

William Lisowski

Join Date: Dec 2014
Posts: 10150

#14

18 Jul 2022, 09:33

Here's an alternative that meets your description of a "more advanced form of quietly".

Code:

sysuse auto, clear
quietly regress price weight
quietly regress price length
quietly {
    regress price weight
    regress price length
}

produces in the Stata Results window

Code:

. sysuse auto, clear
(1978 automobile data)

. quietly regress price weight

. quietly regress price length

. quietly {

.
end of do-file

Added in edit: the noisily prefix overrides the effect of an enclosing quietly.

Code:

sysuse auto, clear
quietly regress price weight
quietly regress price length
quietly {
    regress price weight
    regress price length
    noisily display "done!"
}

Code:

. sysuse auto, clear
(1978 automobile data)

. quietly regress price weight

. quietly regress price length

. quietly {
done!

.
end of do-file

Last edited by William Lisowski; 18 Jul 2022, 09:38.


Ben Jann

Join Date: Sep 2014

Posts: 262
#15

19 Jul 2022, 06:34

Depending on what exactly you want to do, significant speed gains may result from switching to Mata. The moremata package (see https://github.com/benjann/moremata/; type ssc install moremata to install the package) offers an efficient and precise implementation of least-squares estimation; see help mata mm_ls() after installing moremata. Depending on the situation, mm_ls() can be substantially faster than regress (less overhead etc.). Here's a comparison (1000 regressions on 1000 observations and 10 predictors):

Code:

timer clear
// regress
forv i=1/1000 {
    qui drawnorm y x1-x10, double clear n(1000)
    timer on 1
    qui regress y x1-x10
    timer off 1
}
// mata mm_ls()
mata:
n = 1000
for (i=1;i<=1000;i++) {
    y = rnormal(n,1,0,1)
    X = rnormal(n,10,0,1)
    timer_on(2)
    b = mm_lsfit(y, X)
    timer_off(2)
}
end
timer list

Result on my computer:

Code:

. timer list
   1:      8.57 /     1000 =       0.0086
   2:      0.66 /     1000 =       0.0007

regress used 8.57 seconds for the 1000 regressions; mm_ls() only used 0.66 seconds.

ben