
  • Stata MP and -statsby-

    Does anyone know why the statsby command is not parallelized in Stata/MP? Running the same command separately over many by-groups seems like the simplest inherently parallel operation in Stata, and yet statsby appears to be implemented in an entirely sequential way.
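
    For concreteness, a toy example of the kind of by-group job at issue (illustrative only, using the auto dataset that ships with Stata); each group's regression is independent of every other group's:

    Code:
    sysuse auto, clear
    * one regression per value of foreign; the by-groups could in principle run in parallel
    statsby _b _se, by(foreign) clear: regress mpg weight length
    list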

  • #2
    To hazard a guess, it's not clear that Stata is prepared to run multiple copies of the same command in parallel. The gains in Stata/MP were achieved by modifying commands to split out parallelizable work and distribute it across multiple cores - but all under the control of the command itself, which understood which parts might be running in parallel and took care to avoid potential conflict. You really don't want multiple cores attempting to sort the same in-memory dataset at the same time, for example.



    • #3
      I think Malcolm has an interesting idea. Maybe, as William indicates, it does not work that way, but it might be worth considering.

      Malcolm - by the way, you can run multiple Stata sessions at the same time. Occasionally, when I need an immense number of similar analyses, I'll parcel them out and run them in simultaneous Stata sessions.
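
      One rough way to parcel out a job like this is to have each batch session work on its own slice of the by-groups. A sketch only; the do-file name, the chunking rule, and the batch invocation below are assumptions to adapt (the Grunfeld data are used just for concreteness):

      Code:
      * chunk_job.do (hypothetical name) -- launch one batch session per chunk,
      * e.g. from a shell:  stata-mp -b do chunk_job.do 1   and   stata-mp -b do chunk_job.do 2
      args chunk
      webuse grunfeld, clear
      tsset company year
      * assume chunk 1 handles companies 1-5 and chunk 2 handles companies 6-10
      keep if inrange(company, (`chunk' - 1)*5 + 1, `chunk'*5)
      statsby _b _se, by(company) saving(results_`chunk', replace): arima invest, ar(1)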



      • #4
        -statsby- can even have conflicts with -parallel- from https://github.com/gvegayon/parallel. Is there a better replacement for -statsby- when the goal is, for example, to collect the rmse from regressions run by group on a large dataset?
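
        To be concrete, the sort of task I mean, sketched with statsby itself (illustrative only; the Grunfeld data and regressors are just placeholders):

        Code:
        * collect e(rmse) from a separate regression per company
        webuse grunfeld, clear
        statsby rmse=e(rmse), by(company) clear: regress invest mvalue kstock
        list in 1/5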



        • #5

          If you try the code below, you will find that it fails with the error message "variable __000006 not found / an error occurred when statsby executed arima / r(111);":

          Code:
          webuse grunfeld, clear
          tsset company year
          parallel setclusters 2
          parallel, by(company): statsby _b _se, basepop(company==1) by(company) clear: arima invest , ar(1)



          • #6
            I do not know anything about parallel, but I know that a new program called runby (from SSC) will run code on groups of observations faster than statsby. The more by-groups there are, the larger runby's advantage over statsby (by orders of magnitude). Here's a quick example based on the example in #5:

            Code:
            * create a demonstration dataset
            clear all
            set seed 213123
            set obs 100
            gen long company = _n
            expand 20
            bysort company: gen year = _n
            gen invest = runiform()
            tempfile master
            save "`master'"
            
            * ---------- using runby ------------------
            timer on 1
            * runby runs this program once per company, with only that
            * company's observations in memory
            program my_arima
              tsset company year
              arima invest , ar(1)
              * store the estimates as variables and keep one observation
              * per company; whatever data remain become the results
              gen b_invest = _b[invest:_cons]
              gen b_ar     = _b[ARMA:L.ar]
              gen b_sigma  = _b[sigma:_cons]
              gen se_invest = _se[invest:_cons]
              gen se_ar     = _se[ARMA:L.ar]
              gen se_sigma  = _se[sigma:_cons]
              keep company b_* se_*
              keep in 1
            end
            runby my_arima, by(company) status
            timer off 1
            
            list in 1/10
            
            * ---------- using statsby ------------------
            use "`master'", clear
            timer on 2
            tsset company year
            statsby _b _se, basepop(company==1) by(company) clear: arima invest , ar(1)
            timer off 2
            
            list in 1/10
            
            timer list
            and the results on my computer (running MP 15.1):
            Code:
            . runby my_arima, by(company) status
            
              elapsed ----------- by-groups ----------    ------- observations ------       time
                 time      count     errors    no-data        processed         saved  remaining
            ------------------------------------------------------------------------------------
             00:00:01          9          0          0              180             9   00:00:11
             00:00:02         18          0          0              360            18   00:00:10
             00:00:03         28          0          0              560            28   00:00:08
             00:00:04         37          0          0              740            37   00:00:07
            (now reporting every 5 seconds)
             00:00:09         79          0          0            1,580            79   00:00:02
             00:00:12        100          0          0            2,000           100   00:00:00
            
            --------------------------------------
            Number of by-groups    =           100
            by-groups with errors  =             0
            by-groups with no data =             0
            Observations processed =         2,000
            Observations saved     =           100
            --------------------------------------
            
            . timer off 1
            
            . 
            . list in 1/10
            
                 +----------------------------------------------------------------------------+
                 | company   b_invest        b_ar    b_sigma   se_inv~t      se_ar   se_sigma |
                 |----------------------------------------------------------------------------|
              1. |       1   .4645525    .0136948   .2858653   .0652041   .2135187   .0636512 |
              2. |       2    .424218    .3304663   .3099041   .1093342   .2754071     .09213 |
              3. |       3   .4960087   -.5463693   .2582569   .0427073   .2131937   .0820881 |
              4. |       4   .5297394   -.1042499   .3055238   .0629797   .2478953   .0840082 |
              5. |       5   .3520795   -.2084872   .2388056   .0507218   .2522112    .070042 |
                 |----------------------------------------------------------------------------|
              6. |       6   .4180342   -.3460465   .2749622    .048511   .2820182    .063372 |
              7. |       7   .5519149    -.176004   .2659995   .0586299   .2971022   .0540392 |
              8. |       8   .5655345   -.3378355   .2683479    .047193    .245184   .0642803 |
              9. |       9   .5521508    .5478458   .2388245   .1392444   .2043547   .0493356 |
             10. |      10   .4346327    .0485568   .2891552    .104015   .2296528   .1017445 |
                 +----------------------------------------------------------------------------+
            
            . 
            . * ---------- using statsby ------------------
            . use "`master'", clear
            
            . timer on 2
            
            . tsset company year
                   panel variable:  company (strongly balanced)
                    time variable:  year, 1 to 20
                            delta:  1 unit
            
            . statsby _b _se, basepop(company==1) by(company) clear: arima invest , ar(1)
            (running arima on estimation sample)
            
                  command:  arima invest, ar(1)
                       by:  company
            
            Statsby groups
            ----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5 
            ..................................................    50
            ..................................................   100
            
            . timer off 2
            
            . 
            . list in 1/10
            
                 +----------------------------------------------------------------------------+
                 | company   i~b_cons     _stat_2   s~b_cons   i~e_cons    _stat_5   s~e_cons |
                 |----------------------------------------------------------------------------|
              1. |       1   .4645525    .0136948   .2858653   .0652041   .2135187   .0636512 |
              2. |       2    .424218    .3304663   .3099041   .1093342   .2754071     .09213 |
              3. |       3   .4960087   -.5463693   .2582569   .0427073   .2131937   .0820881 |
              4. |       4   .5297394   -.1042499   .3055238   .0629797   .2478953   .0840082 |
              5. |       5   .3520795   -.2084872   .2388056   .0507218   .2522112    .070042 |
                 |----------------------------------------------------------------------------|
              6. |       6   .4180342   -.3460465   .2749622    .048511   .2820182    .063372 |
              7. |       7   .5519149    -.176004   .2659995   .0586299   .2971022   .0540392 |
              8. |       8   .5655345   -.3378355   .2683479    .047193    .245184   .0642803 |
              9. |       9   .5521508    .5478458   .2388245   .1392444   .2043547   .0493356 |
             10. |      10   .4346327    .0485568   .2891552    .104015   .2296528   .1017445 |
                 +----------------------------------------------------------------------------+
            
            . 
            . timer list
               1:     12.06 /        1 =      12.0590
               2:    157.09 /        1 =     157.0930



            • #7
              Dear Robert,

              Thank you for your suggestion. I think it works, although it will probably take more than an hour to do the job. Is there any way to make it faster, such as replacing status with allocate(100000000)? Another issue is that I get a lot of errors with my data, although your example runs perfectly well.
              Code:
              program my_runby
                sort tempid
                xtset permno n
                reg nightret L.nightret [aweight=sqrt(betweenhours)]
                gen rmse = e(rmse)
                keep in 1
              end

              . runby my_runby, by(tempid) status

                elapsed ----------- by-groups ----------    ------- observations ------       time
                   time      count     errors    no-data        processed         saved  remaining
              ------------------------------------------------------------------------------------
              00:00:01 111 70 0 2,340 41 02:56:26
              00:00:02 192 71 0 4,039 121 03:24:31
              00:00:03 272 71 0 5,716 201 03:36:00
              00:00:04 389 125 0 8,172 264 03:21:08
              (now reporting every 5 seconds)
              00:00:09 951 367 0 19,904 584 03:05:12
              00:00:14 1,533 581 0 32,083 952 02:58:25
              00:00:19 2,170 812 0 45,393 1,358 02:51:04
              00:00:24 2,637 983 0 55,070 1,654 02:57:56
              00:00:29 3,161 1,189 0 66,063 1,972 02:59:06
              00:00:34 3,585 1,256 0 74,939 2,329 03:05:01
              00:00:39 4,119 1,479 0 86,081 2,640 03:04:38
              00:00:44 4,669 1,751 0 97,583 2,918 03:03:40
              00:00:49 5,320 2,115 0 111,154 3,205 02:59:29
              00:00:54 5,979 2,517 0 124,948 3,462 02:55:51
              00:00:59 6,567 2,806 0 137,035 3,761 02:55:06
              (now reporting every 15 seconds)
              00:01:14 8,247 3,613 0 171,793 4,634 02:54:54
              00:01:29 9,055 4,013 0 188,485 5,042 03:11:36
              00:01:44 9,118 4,024 0 189,811 5,094 03:42:17
              00:01:59 9,536 4,171 0 198,527 5,365 04:03:04
              00:02:14 10,973 4,805 0 228,475 6,168 03:57:32
              00:02:29 12,540 5,554 0 261,107 6,986 03:50:48
              00:02:44 14,002 5,977 0 291,753 8,025 03:47:04
              00:02:59 15,954 6,840 0 332,535 9,114 03:37:03
              00:03:14 16,116 6,902 0 335,945 9,214 03:52:49
              00:03:29 16,280 7,043 0 339,358 9,237 04:08:17
              00:03:44 17,482 7,261 0 364,500 10,221 04:07:29
              00:03:59 19,027 7,947 0 396,668 11,080 04:02:19
              00:04:14 20,450 8,512 0 426,092 11,938 03:59:27
              00:04:29 22,292 9,328 0 464,217 12,964 03:52:24
              00:04:44 24,358 10,196 0 507,107 14,162 03:44:12
              00:04:59 26,485 11,198 0 551,424 15,287 03:36:40
              00:05:14 28,334 11,870 0 589,761 16,464 03:32:24
              00:05:29 30,556 13,053 0 636,234 17,503 03:25:53
              00:05:44 32,446 13,964 0 675,678 18,482 03:22:21
              00:05:59 34,182 14,710 0 711,947 19,472 03:20:07
              00:06:14 35,853 15,372 0 746,860 20,481 03:18:26
              00:06:29 37,642 16,224 0 784,251 21,418 03:16:15
              00:06:44 39,594 17,287 0 824,916 22,307 03:13:26
              00:06:59 41,174 17,743 0 857,872 23,431 03:12:38
              00:07:14 42,652 18,125 0 888,760 24,527 03:12:21
              00:07:29 44,382 18,859 0 924,462 25,523 03:11:01
              00:07:44 46,654 20,311 0 971,871 26,343 03:07:23
              00:07:59 48,683 21,212 0 1,014,291 27,471 03:05:01
              00:08:14 50,625 22,075 0 1,054,607 28,550 03:03:12
              00:08:29 52,840 23,175 0 1,100,793 29,665 03:00:29
              00:08:44 54,847 24,039 0 1,142,599 30,808 02:58:41
              00:08:59 56,601 24,516 0 1,179,246 32,085 02:57:49
              00:09:14 58,449 25,261 0 1,217,766 33,188 02:56:41
              00:09:29 60,784 26,614 0 1,266,233 34,170 02:54:10
              00:09:44 62,481 27,203 0 1,301,693 35,278 02:53:37
              00:09:59 64,417 28,259 0 1,342,070 36,158 02:52:25



              • #8
                The status option is one of the great features of runby. The longer it takes, the more accurate the estimated time remaining becomes. Why would you want to remove this option and run such a big job blind?

                No, the allocate() option will not speed things up. You are saving 1 observation per group; are you suggesting that you have 100 million groups?

                Your program does not need to sort and xtset. You just need to store the lagged nightret in a regular variable.

                I suspect that tempid is just another incarnation of permno. If you want suggestions, you need to better explain what you are doing.

                With respect to the errors you are encountering, the most likely culprit is insufficient observations for the regression. Here's an example that shows how you can test what you are trying to do on a small subsample of your big ol' dataset:

                Code:
                * create demonstration dataset
                clear all
                set seed 31231
                set obs 100
                gen long permno = runiformint(1111111,9999999)
                expand runiformint(1,100)
                bysort permno: gen n = _n
                gen nightret = runiform()
                replace nightret = . if runiform() < .2
                drop if runiform() < .2
                gen betweenhours = runiform()
                
                * create sequential group id
                egen long gid = group(permno)
                
                * learn how this works on a small subsample; pick first 5 groups
                keep if inrange(gid,1,5)
                
                * precompute the lag of nightret
                xtset gid n
                gen double Lnightret = L.nightret
                
                * perform the task for each of these 5 groups using standard Stata commands
                by gid: reg nightret Lnightret [aweight=sqrt(betweenhours)]
                
                * repeat, this time with runby
                program my_runby
                  reg nightret Lnightret [aweight=sqrt(betweenhours)]
                  gen rmse=e(rmse)
                  keep in 1
                end
                runby my_runby, by(gid)
                
                list
                and the final part of the results:
                Code:
                . runby my_runby, by(gid)
                
                --------------------------------------
                Number of by-groups    =             5
                by-groups with errors  =             1
                by-groups with no data =             0
                Observations processed =           155
                Observations saved     =             4
                --------------------------------------
                
                . 
                . list
                
                     +---------------------------------------------------------------+
                     |  permno   n   nightret   betwee~s   gid   Lnight~t       rmse |
                     |---------------------------------------------------------------|
                  1. | 1405036   1   .4981674   .9445046     1          .    .287572 |
                  2. | 1766318   1   .9967703   .7038013     3          .   .2516169 |
                  3. | 1780215   1   .8783187   .5729671     4          .   .2803902 |
                  4. | 1856470   2   .0929261   .0373204     5          .   .3531582 |
                     +---------------------------------------------------------------+
                
                .
                So there are no results for gid==2. If you scroll back in the Results window, you'll see that the regression output for gid == 2 is:
                Code:
                -> gid = 2
                insufficient observations
                If you would like to have results for by-groups with errors, you can capture the error. Here's one way to handle these:
                Code:
                program my_runby
                  capture reg nightret Lnightret [aweight=sqrt(betweenhours)]
                  gen rc = _rc
                  if _rc == 0 gen rmse=e(rmse)
                  keep in 1
                end
                runby my_runby, by(gid)
                which will produce this:
                Code:
                . list
                
                     +----------------------------------------------------------------------+
                     |  permno   n   nightret   betwee~s   gid   Lnight~t     rc       rmse |
                     |----------------------------------------------------------------------|
                  1. | 1405036   1   .4981674   .9445046     1          .      0    .287572 |
                  2. | 1742325   1   .9870269   .0540559     2          .   2001          . |
                  3. | 1766318   1   .9967703   .7038013     3          .      0   .2516169 |
                  4. | 1780215   1   .8783187   .5729671     4          .      0   .2803902 |
                  5. | 1856470   2   .0929261   .0373204     5          .      0   .3531582 |
                     +----------------------------------------------------------------------+
                Finally, if runby is not fast enough for you, you can always write your own Mata routine to perform the weighted regression and use rangestat. See the following thread to get you started.
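
                For what it's worth, the weighted-regression piece of such a Mata routine might look something like the sketch below (not from the linked thread; the function name is made up, and hooking it into rangestat is left to that thread):

                Code:
                * minimal sketch: weighted least squares of y on X (plus a constant)
                * with analytic weights w, returning the coefficients followed by the rmse
                mata:
                real rowvector wls_rmse(real colvector y, real matrix X, real colvector w)
                {
                    real matrix    Xc
                    real colvector b, e, wn

                    Xc = X, J(rows(X), 1, 1)           // append the constant term
                    wn = w :/ mean(w)                  // rescale aweights to sum to N
                    b  = invsym(quadcross(Xc, wn, Xc)) * quadcross(Xc, wn, y)
                    e  = y - Xc*b
                    return((b', sqrt(quadcross(e, wn, e) / (rows(y) - cols(Xc)))))
                }
                end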



                • #9
                  Dear Robert,

                  Thank you again for your suggestion. I am keeping the -status- option. Yes, it is indeed faster than with -allocate()-. I have 0.3 million groups, and runby takes a little under one hour. tempid is indeed another incarnation of permno: I get it from egen tempid = group(permno month), as I need to run the regressions on daily data within each month in the first place.

                  Yes, the errors are due to missing values in some months; thanks for pointing that out.

                  Best
                  Cheng

