Change a Statsby regression to a faster program

Charlie Clarke

Join Date: Apr 2014

Posts: 19
#1

15 May 2015, 19:59

The following command does everything I need it to:

statsby _b _se, saving(File, replace) by(panelid) verbose nodots: regress y x1 x2 x3 x4

Unfortunately, it is pretty slow. I was wondering whether there is a way to speed it up with a simple program. I found the program below on Statalist; it computes rolling betas. I think it would be faster (is that right?). Is there a way to adapt it to my simpler setting, where I am running regressions by panelid?

The code below seems to work with tsset to do rolling regressions. How does it do that? Is it the s`w'. command? I have not seen that before.

program rolling_beta
version 11.2
syntax varlist(numeric), window(real)

* get dependent and independent vars from varlist
tempvar x y x2 y2 xy xs ys xys x2s y2s covxy varx vary
tokenize "`varlist'"
generate `y' = `1'
generate `x' = `2'
local w = `window'

* generate products
generate `xy' = `x'*`y'
generate `x2' = `x'*`x'
generate `y2' = `y'*`y'

* generate cumulative sums
generate `xs' = sum(`x')
generate `ys' = sum(`y')
generate `xys' = sum(`xy')
generate `x2s' = sum(`x2')
generate `y2s' = sum(`y2')

* generate variances and covariances
generate `covxy' = (s`w'.`xys' - s`w'.`xs'*s`w'.`ys'/`w')/`w'
generate `varx' = (s`w'.`x2s' - s`w'.`xs'*s`w'.`xs'/`w')/`w'
generate `vary' = (s`w'.`y2s' - s`w'.`ys'*s`w'.`ys'/`w')/`w'

* generate alpha, beta, r2, s2
generate beta = `covxy'/`varx'
generate alpha = (s`w'.`ys' - beta*s`w'.`xs')/`w'
generate r2 = `covxy'*`covxy'/`varx'/`vary'
generate s2 = `vary'*`w'*(1 - r2)/(`w' - 2)
end
Clyde Schechter

Join Date: Apr 2014

Posts: 30097
#2

16 May 2015, 09:50

You can try it if you like and compare the two for speed. Personally, I'd be very surprised if this is faster. The regress command, while it does have some overhead, does its calculations in compiled, built-in code and is very fast. It is probably faster than doing it the way you propose.

The slowness of your process has more to do with other operations entailed by -statsby-: the results are being aggregated and written to disk (as a temporary file if you don't use the -saving()- option, as a permanent file if you do), and each regression is hobbled with an -if- condition that forces Stata to scan through the entire data set and examine each observation to see if it has the right panelid value to be included in the current regression. Both of these are very slow processes, and you will not avoid them by hacking your own -regress-.

You might be able to get a speedup if there is a simple way to identify which observations correspond to which value of panelid in your data. E.g., if the panel is strongly balanced, then panel 1's observations might be in observations 1 through 100 (say), panel 2's in 101 through 200, etc. If something like that is the case, then you could write a loop over panelid where the -regress- condition is specified as -in `begin'/`end'- (begin and end being appropriately calculated through the loop), and you can capture the results as you go using the -postfile- machinery. You might notice a material speed-up this way because -in- is much faster than -if-, especially in large data sets: with -in-, Stata goes straight to the specified range of observations, whereas an -if- condition has to be evaluated for every observation in the data set, so each regression costs O(N) just to select its sample.
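The loop described above could be sketched roughly as follows. This is only an illustration on made-up data, assuming a strongly balanced panel that is already sorted by panelid; the toy data set, the temporary results file, and the names b_x1 and se_x1 are hypothetical and not part of the original question.

```stata
* Toy data, purely illustrative: 3 balanced panels of 100 observations each
clear
set seed 12345
set obs 300
generate long panelid = ceil(_n/100)
forvalues k = 1/4 {
    generate x`k' = rnormal()
}
generate y = 1 + x1 + rnormal()

* Number the panels 1, 2, ... and work out the common panel size
sort panelid
egen long panel_num = group(panelid)
summarize panel_num, meanonly
local n_panels = r(max)
local panel_size = _N/`n_panels'

* Collect results with -postfile-; file and variable names are made up
tempfile results
tempname handle
postfile `handle' long panel_num double(b_x1 se_x1) using "`results'"

forvalues j = 1/`n_panels' {
    * Observation range occupied by panel j (balanced panel assumed)
    local end = `j'*`panel_size'
    local begin = `end' - `panel_size' + 1
    quietly regress y x1 x2 x3 x4 in `begin'/`end'
    post `handle' (`j') (_b[x1]) (_se[x1])
}
postclose `handle'

* One row per panel: panel number, coefficient on x1, its standard error
use "`results'", clear
list
```

Only the coefficient and standard error of x1 are posted here to keep the sketch short; further -post- expressions would be needed to keep everything that -statsby _b _se- saves.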
Charlie Clarke

Join Date: Apr 2014

Posts: 19
#3

16 May 2015, 13:07

Thanks for the reply Clyde. Unfortunately, the panel is not strongly balanced.

I got the idea from a couple of threads started by Richard Herron: one on Stack Overflow (http://stackoverflow.com/questions/7...sions-in-stata) and another on Statalist (http://www.stata.com/statalist/archi.../msg01172.html). The threads were about making the -rolling- command work faster, and Richard reports that this code runs a lot faster than -rolling-.

I thought the code could perhaps be modified for my situation, which seems simpler. The code works through tsset, but I'm not sure "how it knows" to run separate regressions for each panelid. My situation is basically a special case where the window is the sample size of the panelid.

Clyde Schechter

Join Date: Apr 2014

Posts: 30097
#4

16 May 2015, 16:12

Thanks for the links. It looks to me like the main thing Richard Herron's code does is not so much hand-code -regress- (which it does) as eliminate all of the overhead that -rolling- goes through to organize the regressions and carry them out in order. The other responders there agreed with me that -regress- is super-fast and that hand-coding your own regress is not going to speed things up appreciably, if at all.

Herron's code "knows" to do it separately for each panel because once the data are -tsset-, the s`w'. operators (seasonal difference) that appear in the code operate only within the scope of a single panel: if asked to produce, for example, s30.x in any of the first 30 observations of a panel, Stata returns a missing value rather than reaching back into the previous panel. That is the genius of time-series operators in Stata.
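A small illustration of that behavior, on a made-up panel (the variable names t, xs, s2x, and roll2 are hypothetical): the seasonal difference of a running sum gives a within-panel rolling sum, which appears to be the trick the rolling_beta program relies on.

```stata
* Toy panel, purely illustrative: 2 panels with 5 periods each
clear
set obs 10
generate panelid = ceil(_n/5)
generate t = _n - 5*(panelid - 1)
generate x = _n
tsset panelid t

* S2.x is x - L2.x; it is missing wherever L2.x would fall outside the panel
generate s2x = S2.x

* Running sum over the whole data set, then a seasonal difference of it:
* S2.xs = xs[t] - xs[t-2] = x[t] + x[t-1], a 2-period rolling sum
generate xs = sum(x)
generate roll2 = S2.xs

list panelid t x s2x xs roll2, sepby(panelid)
```

Note that the running sum itself runs straight across the panel boundary; it is the S2. operator that confines each difference to one panel and leaves the first two observations of every panel missing.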

I have one thought how you might solve your problem. Your panel is not balanced, but you can make it so with -tsfill, full-, which will add enough observations (with missing values in all but the panel and time variables) to make it strongly balanced. Then you can calculate the start and end points for each panel and loop through regressions guarded by -in-. Something like this:

Code:

tsset panelid timevar
tsfill, full

// GENERATE A NEW PANEL ID WITH CONSECUTIVE NUMBERS
// STARTING AT 1
egen long panel_num = group(panelid)
summ panel_num, meanonly
local n_panels `r(max)'
local panel_size = `r(N)'/`n_panels'

gen coef = .
gen se = .
gen r2 = .

// LOOP OVER PANELS
forvalues j = 1/`n_panels' {
    local end = `j'*`panel_size'
    local begin = `end' - `panel_size' + 1
    regress y x in `begin'/`end'
    replace coef = _b[x] in `begin'/`end'
    replace se = _se[x] in `begin'/`end'
    replace r2 = e(r2) in `begin'/`end'
}

I haven't tested this approach, but I believe it will work. Of course you should replace the -regress- statement by the actual regression you want to carry out. And depending on what regression results you want to keep, you may want more or different variables than coef, se, and r2. The full array of output from the -regress- command is available in r(table) after the regression, so you can pull that out as a matrix and then reference the appropriate cells in it for whatever statistics you need.
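As a rough sketch of pulling statistics out of r(table), here is an example using the auto data set that ships with Stata; the matrix name T and the scalar names are arbitrary choices for illustration.

```stata
sysuse auto, clear
quietly regress price mpg weight

* r(table) is overwritten by the next r-class command, so copy it first
matrix T = r(table)
matrix list T

* Pick cells by row and column name rather than by position
scalar b_mpg  = T[rownumb(T, "b"),  colnumb(T, "mpg")]
scalar se_mpg = T[rownumb(T, "se"), colnumb(T, "mpg")]
scalar p_mpg  = T[rownumb(T, "pvalue"), colnumb(T, "mpg")]
display "b[mpg] = " b_mpg "   se[mpg] = " se_mpg "   p = " p_mpg
```

Inside the loop over panels, the same lookups would be done right after each -regress- and the values stored with -replace- or -post-.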

This should be a lot faster than -statsby- even though it uses -regress-, because it avoids writing to disk (unless you are using virtual memory), and uses -in- qualifiers instead of -if- qualifiers. The most obvious downside of this approach is that you may run up against the limits of your available memory when you run -tsfill-.
Charlie Clarke

Join Date: Apr 2014

Posts: 19
#5

16 May 2015, 16:21

That's a great idea. Thanks!