
  • Stata MP and -statsby-

    Does anyone know why the statsby command is not parallelized in Stata/MP? Running the same command separately over many by-groups seems like the simplest inherently parallel operation in Stata, and yet statsby appears to be implemented in an entirely sequential way.
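
    For concreteness, a toy example of the kind of by-group job at issue (illustrative only, using the auto dataset that ships with Stata); each group's regression is independent of every other group's:

    Code:
    sysuse auto, clear
    * one regression per value of foreign; the by-groups could in principle run in parallel
    statsby _b _se, by(foreign) clear: regress mpg weight length
    list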

  • #2
    To hazard a guess, it's not clear that Stata is prepared to run multiple copies of the same command in parallel. The gains in Stata/MP were achieved by modifying commands to split out parallelizable work and distribute it across multiple cores - but all under the control of the command itself, which understood which parts might be running in parallel and took care to avoid potential conflict. You really don't want multiple cores attempting to sort the same in-memory dataset at the same time, for example.



    • #3
      I think Malcolm has an interesting idea. Maybe, as William indicates, it does not work that way, but it might be worth considering.

      Malcolm - by the way, you can run multiple Stata sessions at the same time. Occasionally, when I need an immense number of similar analyses, I'll parcel them out and run them in simultaneous Stata sessions.
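
      One rough way to parcel out a job like this is to have each batch session work on its own slice of the by-groups. A sketch only; the do-file name, the chunking rule, and the batch invocation below are assumptions to adapt (the Grunfeld data are used just for concreteness):

      Code:
      * chunk_job.do (hypothetical name) -- launch one batch session per chunk,
      * e.g. from a shell:  stata-mp -b do chunk_job.do 1   and   stata-mp -b do chunk_job.do 2
      args chunk
      webuse grunfeld, clear
      tsset company year
      * assume chunk 1 handles companies 1-5 and chunk 2 handles companies 6-10
      keep if inrange(company, (`chunk' - 1)*5 + 1, `chunk'*5)
      statsby _b _se, by(company) saving(results_`chunk', replace): arima invest, ar(1)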



      • #4
        -statsby- can even have conflicts with -parallel- from https://github.com/gvegayon/parallel. Is there a better replacement for -statsby- when the goal is, for example, to collect the rmse from regressions run by group on a large dataset?
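
        To be concrete, the sort of task I mean, sketched with statsby itself (illustrative only; the Grunfeld data and regressors are just placeholders):

        Code:
        * collect e(rmse) from a separate regression per company
        webuse grunfeld, clear
        statsby rmse=e(rmse), by(company) clear: regress invest mvalue kstock
        list in 1/5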



        • #5

          If you try the code below, you will find that it fails with the error message "variable __000006 not found / an error occurred when statsby executed arima / r(111);":

          Code:
          webuse grunfeld, clear
          tsset company year
          parallel setclusters 2
          parallel, by(company): statsby _b _se, basepop(company==1) by(company) clear: arima invest , ar(1)



          • #6
            I do not know anything about parallel, but I know that a new program called runby (from SSC) will run code on groups of observations faster than statsby. The more by-groups there are, the larger runby's advantage over statsby (by orders of magnitude). Here's a quick example based on the example in #5:

            Code:
            * create a demonstration dataset
            clear all
            set seed 213123
            set obs 100
            gen long company = _n
            expand 20
            bysort company: gen year = _n
            gen invest = runiform()
            tempfile master
            save "`master'"
            
            * ---------- using runby ------------------
            timer on 1
            * runby runs this program once per company, with only that
            * company's observations in memory
            program my_arima
              tsset company year
              arima invest , ar(1)
              * store the estimates as variables and keep one observation
              * per company; whatever data remain become the results
              gen b_invest = _b[invest:_cons]
              gen b_ar     = _b[ARMA:L.ar]
              gen b_sigma  = _b[sigma:_cons]
              gen se_invest = _se[invest:_cons]
              gen se_ar     = _se[ARMA:L.ar]
              gen se_sigma  = _se[sigma:_cons]
              keep company b_* se_*
              keep in 1
            end
            runby my_arima, by(company) status
            timer off 1
            
            list in 1/10
            
            * ---------- using statsby ------------------
            use "`master'", clear
            timer on 2
            tsset company year
            statsby _b _se, basepop(company==1) by(company) clear: arima invest , ar(1)
            timer off 2
            
            list in 1/10
            
            timer list
            and the results on my computer (running MP 15.1):
            Code:
            . runby my_arima, by(company) status
            
              elapsed ----------- by-groups ----------    ------- observations ------       time
                 time      count     errors    no-data        processed         saved  remaining
            ------------------------------------------------------------------------------------
             00:00:01          9          0          0              180             9   00:00:11
             00:00:02         18          0          0              360            18   00:00:10
             00:00:03         28          0          0              560            28   00:00:08
             00:00:04         37          0          0              740            37   00:00:07
            (now reporting every 5 seconds)
             00:00:09         79          0          0            1,580            79   00:00:02
             00:00:12        100          0          0            2,000           100   00:00:00
            
            --------------------------------------
            Number of by-groups    =           100
            by-groups with errors  =             0
            by-groups with no data =             0
            Observations processed =         2,000
            Observations saved     =           100
            --------------------------------------
            
            . timer off 1
            
            . 
            . list in 1/10
            
                 +----------------------------------------------------------------------------+
                 | company   b_invest        b_ar    b_sigma   se_inv~t      se_ar   se_sigma |
                 |----------------------------------------------------------------------------|
              1. |       1   .4645525    .0136948   .2858653   .0652041   .2135187   .0636512 |
              2. |       2    .424218    .3304663   .3099041   .1093342   .2754071     .09213 |
              3. |       3   .4960087   -.5463693   .2582569   .0427073   .2131937   .0820881 |
              4. |       4   .5297394   -.1042499   .3055238   .0629797   .2478953   .0840082 |
              5. |       5   .3520795   -.2084872   .2388056   .0507218   .2522112    .070042 |
                 |----------------------------------------------------------------------------|
              6. |       6   .4180342   -.3460465   .2749622    .048511   .2820182    .063372 |
              7. |       7   .5519149    -.176004   .2659995   .0586299   .2971022   .0540392 |
              8. |       8   .5655345   -.3378355   .2683479    .047193    .245184   .0642803 |
              9. |       9   .5521508    .5478458   .2388245   .1392444   .2043547   .0493356 |
             10. |      10   .4346327    .0485568   .2891552    .104015   .2296528   .1017445 |
                 +----------------------------------------------------------------------------+
            
            . 
            . * ---------- using statsby ------------------
            . use "`master'", clear
            
            . timer on 2
            
            . tsset company year
                   panel variable:  company (strongly balanced)
                    time variable:  year, 1 to 20
                            delta:  1 unit
            
            . statsby _b _se, basepop(company==1) by(company) clear: arima invest , ar(1)
            (running arima on estimation sample)
            
                  command:  arima invest, ar(1)
                       by:  company
            
            Statsby groups
            ----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 
            ..................................................    50
            ..................................................   100
            
            . timer off 2
            
            . 
            . list in 1/10
            
                 +----------------------------------------------------------------------------+
                 | company   i~b_cons     _stat_2   s~b_cons   i~e_cons    _stat_5   s~e_cons |
                 |----------------------------------------------------------------------------|
              1. |       1   .4645525    .0136948   .2858653   .0652041   .2135187   .0636512 |
              2. |       2    .424218    .3304663   .3099041   .1093342   .2754071     .09213 |
              3. |       3   .4960087   -.5463693   .2582569   .0427073   .2131937   .0820881 |
              4. |       4   .5297394   -.1042499   .3055238   .0629797   .2478953   .0840082 |
              5. |       5   .3520795   -.2084872   .2388056   .0507218   .2522112    .070042 |
                 |----------------------------------------------------------------------------|
              6. |       6   .4180342   -.3460465   .2749622    .048511   .2820182    .063372 |
              7. |       7   .5519149    -.176004   .2659995   .0586299   .2971022   .0540392 |
              8. |       8   .5655345   -.3378355   .2683479    .047193    .245184   .0642803 |
              9. |       9   .5521508    .5478458   .2388245   .1392444   .2043547   .0493356 |
             10. |      10   .4346327    .0485568   .2891552    .104015   .2296528   .1017445 |
                 +----------------------------------------------------------------------------+
            
            . 
            . timer list
               1:     12.06 /        1 =      12.0590
               2:    157.09 /        1 =     157.0930



            • #7
              Dear Robert,

              Thank you for your suggestion. I think it works, although it will probably take more than an hour to do the job. Is there any way to make it faster, such as replacing status with allocate(100000000)? Another issue is that I get a lot of errors with my data, although your example runs perfectly well.
              Code:
              program my_runby
                sort tempid
                xtset permno n
                reg nightret L.nightret [aweight=sqrt(betweenhours)]
                gen rmse = e(rmse)
                keep in 1
              end

              . runby my_runby, by(tempid) status

                elapsed ----------- by-groups ----------    ------- observations ------       time
                   time      count     errors    no-data        processed         saved  remaining
              ------------------------------------------------------------------------------------
              00:00:01 111 70 0 2,340 41 02:56:26
              00:00:02 192 71 0 4,039 121 03:24:31
              00:00:03 272 71 0 5,716 201 03:36:00
              00:00:04 389 125 0 8,172 264 03:21:08
              (now reporting every 5 seconds)
              00:00:09 951 367 0 19,904 584 03:05:12
              00:00:14 1,533 581 0 32,083 952 02:58:25
              00:00:19 2,170 812 0 45,393 1,358 02:51:04
              00:00:24 2,637 983 0 55,070 1,654 02:57:56
              00:00:29 3,161 1,189 0 66,063 1,972 02:59:06
              00:00:34 3,585 1,256 0 74,939 2,329 03:05:01
              00:00:39 4,119 1,479 0 86,081 2,640 03:04:38
              00:00:44 4,669 1,751 0 97,583 2,918 03:03:40
              00:00:49 5,320 2,115 0 111,154 3,205 02:59:29
              00:00:54 5,979 2,517 0 124,948 3,462 02:55:51
              00:00:59 6,567 2,806 0 137,035 3,761 02:55:06
              (now reporting every 15 seconds)
              00:01:14 8,247 3,613 0 171,793 4,634 02:54:54
              00:01:29 9,055 4,013 0 188,485 5,042 03:11:36
              00:01:44 9,118 4,024 0 189,811 5,094 03:42:17
              00:01:59 9,536 4,171 0 198,527 5,365 04:03:04
              00:02:14 10,973 4,805 0 228,475 6,168 03:57:32
              00:02:29 12,540 5,554 0 261,107 6,986 03:50:48
              00:02:44 14,002 5,977 0 291,753 8,025 03:47:04
              00:02:59 15,954 6,840 0 332,535 9,114 03:37:03
              00:03:14 16,116 6,902 0 335,945 9,214 03:52:49
              00:03:29 16,280 7,043 0 339,358 9,237 04:08:17
              00:03:44 17,482 7,261 0 364,500 10,221 04:07:29
              00:03:59 19,027 7,947 0 396,668 11,080 04:02:19
              00:04:14 20,450 8,512 0 426,092 11,938 03:59:27
              00:04:29 22,292 9,328 0 464,217 12,964 03:52:24
              00:04:44 24,358 10,196 0 507,107 14,162 03:44:12
              00:04:59 26,485 11,198 0 551,424 15,287 03:36:40
              00:05:14 28,334 11,870 0 589,761 16,464 03:32:24
              00:05:29 30,556 13,053 0 636,234 17,503 03:25:53
              00:05:44 32,446 13,964 0 675,678 18,482 03:22:21
              00:05:59 34,182 14,710 0 711,947 19,472 03:20:07
              00:06:14 35,853 15,372 0 746,860 20,481 03:18:26
              00:06:29 37,642 16,224 0 784,251 21,418 03:16:15
              00:06:44 39,594 17,287 0 824,916 22,307 03:13:26
              00:06:59 41,174 17,743 0 857,872 23,431 03:12:38
              00:07:14 42,652 18,125 0 888,760 24,527 03:12:21
              00:07:29 44,382 18,859 0 924,462 25,523 03:11:01
              00:07:44 46,654 20,311 0 971,871 26,343 03:07:23
              00:07:59 48,683 21,212 0 1,014,291 27,471 03:05:01
              00:08:14 50,625 22,075 0 1,054,607 28,550 03:03:12
              00:08:29 52,840 23,175 0 1,100,793 29,665 03:00:29
              00:08:44 54,847 24,039 0 1,142,599 30,808 02:58:41
              00:08:59 56,601 24,516 0 1,179,246 32,085 02:57:49
              00:09:14 58,449 25,261 0 1,217,766 33,188 02:56:41
              00:09:29 60,784 26,614 0 1,266,233 34,170 02:54:10
              00:09:44 62,481 27,203 0 1,301,693 35,278 02:53:37
              00:09:59 64,417 28,259 0 1,342,070 36,158 02:52:25



              • #8
                The status option is one of the great features of runby. The longer it takes, the more accurate the estimated time remaining becomes. Why would you want to remove this option and run such a big job blind?

                No, the allocate() option will not speed things up. You are saving 1 observation per group; are you suggesting that you have 100 million groups?

                Your program does not need to sort and xtset. You just need to store the lagged nightret in a regular variable.

                I suspect that tempid is just another incarnation of permno. If you want suggestions, you need to better explain what you are doing.

                With respect to the errors you are encountering, the most likely culprit is insufficient observations for the regression. Here's an example that shows how you can test what you are trying to do on a small subsample of your big ol' dataset:

                Code:
                * create demonstration dataset
                clear all
                set seed 31231
                set obs 100
                gen long permno = runiformint(1111111,9999999)
                expand runiformint(1,100)
                bysort permno: gen n = _n
                gen nightret = runiform()
                replace nightret = . if runiform() < .2
                drop if runiform() < .2
                gen betweenhours = runiform()
                
                * create sequential group id
                egen long gid = group(permno)
                
                * learn how this works on a small subsample; pick first 5 groups
                keep if inrange(gid,1,5)
                
                * precompute the lag of nightret
                xtset gid n
                gen double Lnightret = L.nightret
                
                * perform the task for each of these 5 groups using standard Stata commands
                by gid: reg nightret Lnightret [aweight=sqrt(betweenhours)]
                
                * repeat, this time with runby
                program my_runby
                  reg nightret Lnightret [aweight=sqrt(betweenhours)]
                  gen rmse=e(rmse)
                  keep in 1
                end
                runby my_runby, by(gid)
                
                list
                and the final part of the results:
                Code:
                . runby my_runby, by(gid)
                
                --------------------------------------
                Number of by-groups    =             5
                by-groups with errors  =             1
                by-groups with no data =             0
                Observations processed =           155
                Observations saved     =             4
                --------------------------------------
                
                . 
                . list
                
                     +---------------------------------------------------------------+
                     |  permno   n   nightret   betwee~s   gid   Lnight~t       rmse |
                     |---------------------------------------------------------------|
                  1. | 1405036   1   .4981674   .9445046     1          .    .287572 |
                  2. | 1766318   1   .9967703   .7038013     3          .   .2516169 |
                  3. | 1780215   1   .8783187   .5729671     4          .   .2803902 |
                  4. | 1856470   2   .0929261   .0373204     5          .   .3531582 |
                     +---------------------------------------------------------------+
                
                .
                So there are no results for gid==2. If you scroll back in the Results window, you'll see that the regression output for gid == 2 is:
                Code:
                -> gid = 2
                insufficient observations
                If you would like to have results for by-groups with errors, you can capture the error. Here's one way to handle these:
                Code:
                program my_runby
                  capture reg nightret Lnightret [aweight=sqrt(betweenhours)]
                  gen rc = _rc
                  if _rc == 0 gen rmse=e(rmse)
                  keep in 1
                end
                runby my_runby, by(gid)
                which will produce this:
                Code:
                . list
                
                     +----------------------------------------------------------------------+
                     |  permno   n   nightret   betwee~s   gid   Lnight~t     rc       rmse |
                     |----------------------------------------------------------------------|
                  1. | 1405036   1   .4981674   .9445046     1          .      0    .287572 |
                  2. | 1742325   1   .9870269   .0540559     2          .   2001          . |
                  3. | 1766318   1   .9967703   .7038013     3          .      0   .2516169 |
                  4. | 1780215   1   .8783187   .5729671     4          .      0   .2803902 |
                  5. | 1856470   2   .0929261   .0373204     5          .      0   .3531582 |
                     +----------------------------------------------------------------------+
                Finally, if runby is not fast enough for you, you can always write your own Mata routine to perform the weighted regression and use rangestat. See the following thread to get you started.
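
                For what it's worth, the weighted-regression piece of such a Mata routine might look something like the sketch below (not from the linked thread; the function name is made up, and hooking it into rangestat is left to that thread):

                Code:
                * minimal sketch: weighted least squares of y on X (plus a constant)
                * with analytic weights w, returning the coefficients followed by the rmse
                mata:
                real rowvector wls_rmse(real colvector y, real matrix X, real colvector w)
                {
                    real matrix    Xc
                    real colvector b, e, wn

                    Xc = X, J(rows(X), 1, 1)           // append the constant term
                    wn = w :/ mean(w)                  // rescale aweights to sum to N
                    b  = invsym(quadcross(Xc, wn, Xc)) * quadcross(Xc, wn, y)
                    e  = y - Xc*b
                    return((b', sqrt(quadcross(e, wn, e) / (rows(y) - cols(Xc)))))
                }
                end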



                • #9
                  Dear Robert,

                  Thank you again for your suggestion. I am keeping the -status- option. Yes, it is indeed faster than with -allocate()-. I have 0.3 million groups, and runby takes a little under one hour. tempid is indeed another incarnation of permno: I get it from egen tempid = group(permno month), as I need to run the regressions on daily data within each month in the first place.

                  Yes, the errors are due to missing values in some months; thanks for pointing that out.

                  Best
                  Cheng

