Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Speeding up loop of time series regressions in mata

    Dear Statalist,
    I have a code (loop) to run time series regressions for each company (Identifier: lpermno) of prices (prccq) on earnings per share (epspiq). The code looks like this:

    Code:
    gen alpha =.
    gen alpha_stderr =.
    gen beta =.
    gen beta_stderr =.
     
     levelsof lpermno, local(levels)
     
    foreach i of local levels {
     
            reg prccq epspiq if lpermno== `i',robust
            replace alpha = _b[_cons] if lpermno == `i'
                                       replace alpha_stderr = _se[_cons] if lpermno == `i'
                                       replace beta = _b[epspiq] if lpermno == `i'
                                       replace beta_stderr = _se[epspiq] if lpermno == `i'
    }
    gen alpha_tstat = alpha / alpha_stderr
    gen beta_tstat = beta / beta_stderr
     
    save alphas_per_lpermno, replace
    It gives back the alpha, beta and their standard errors from the regressions in new columns of my original data.
    Unfortunately my code is way too slow. To run through once it takes 1.5 to 2 hours (I have almost 700,000 observations).

    Is it possible to speed this up in mata? I’ve been trying to adapt a rangestat code to my needs (I used this one from Robert Picard as a starting point: https://www.statalist.org/forums/for...tas-for-stocks) but unfortunately was unsuccessful.

    Any advice is appreciated.

    Thanks in advance,
    Timo

  • #2
    I don't think Mata is relevant here, and I'd suggest you move your question to the Stata forum. It's possible the slowness comes from your "replace ... if lpermno ..." qualifier, which often is slow. It's also possible that the simple effort of doing a large number of regressions (you didn't mention the number of levels of lpermno, but I'm guessing) that slows things down.

    In any event, what you are doing looks to me like a task for -statsby-. It would accumulate the statistics in another file, and you can then merge that with your data file. It's possible that -rangestat- could help, too, but you might as well try something built-in first.

    Comment


    • #3
      The (reg) function of rangestat can perform regressions on by-groups like this but not with the robust option. However, I showed how to do this in a recent thread. Adapted to the nomenclature of this thread, this would look like (runs in less than 2 seconds on my computer):

      Code:
      * come on finance people, it's not that hard to create a demonstration dataset!
      clear all
      set seed 123
      set obs 7000
      gen lpermno = _n
      expand 100
      bysort lpermno: gen time = _n
      gen prccq = runiform()
      gen epspiq = runiform()
      
      
      * this is a copy of the rs_reg() regression function used by rangestat,
      * modified to calculate robust standard errors
      
      mata:
      // http://blog.stata.com/2016/01/12/programming-an-estimation-command-in-stata-an-ols-command-using-mata/
      real rowvector reg_robust(real matrix Xall)
      {
          real colvector y, b, e, e2, se
          real matrix X, XpX, XpXi, V, M
          real scalar n, k, ymean, tss, mss, r2, r2a
          
          y = Xall[.,1]                // dependent var is first column of Xall
          X = Xall[.,2::cols(Xall)]    // the remaining cols are the independent variables
          n = rows(X)                  // the number of observations
          X = X,J(n,1,1)               // add a constant
          k = cols(X)                      // number of independent variables
          
          if (n > k) {    // need more than k obs to estimate model and standard errors
          
              // compute the OLS point estimates
              XpX  = quadcross(X, X)
              XpXi = invsym(XpX)
          
              // proceed only if no variable is omitted
              if (diag0cnt(XpXi)==0) {
              
      
                  b    = XpXi*quadcross(X, y)
              
                  // compute robust standard errors
                  e  = y - X*b
                  e2 = e:^2
                  M    = quadcross(X, e2, X)
                  V    = (n/(n-k))*XpXi*M*XpXi
                  se   = sqrt(diagonal(V))
              
                  // r2 and adjusted r2
                  ymean = mean(y)
                  tss   = sum((y :- ymean) :^ 2)        // total sum of squares
                  mss   = sum( (X * b :- ymean)  :^ 2)  // model sum of squares    
                  r2    = mss / tss
                  r2a   = 1 - (1 - r2) * (n - 1) / (n - k)
              }
          }
          
          return(rows(X), r2, r2a, b', se')
      }
      end
      
      rangestat (reg_robust) prccq epspiq, interval(time . .) by(lpermno)
      rename reg_robust1 reg_nobs
      rename reg_robust2 reg_r2
      rename reg_robust3 reg_r2a
      rename reg_robust4 beta
      rename reg_robust5 alpha
      rename reg_robust6 beta_stderr
      rename reg_robust7 alpha_stderr
      Here's a spot check for company 2
      Code:
      . * spot check for company 2
      . reg prccq epspiq if lpermno== 2,robust
      
      Linear regression                               Number of obs     =        100
                                                      F(1, 98)          =       0.24
                                                      Prob > F          =     0.6237
                                                      R-squared         =     0.0022
                                                      Root MSE          =     .28681
      
      ------------------------------------------------------------------------------
                   |               Robust
             prccq |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
            epspiq |   .0454931   .0924296     0.49   0.624    -.1379305    .2289167
             _cons |   .5679693   .0545472    10.41   0.000     .4597222    .6762163
      ------------------------------------------------------------------------------
      
      . list in 101
      
           +------------------------------------------------------------------------------------------------------------------------+
           | lpermno   time      prccq     epspiq   reg_nobs     reg_r2      reg_r2a       beta       alpha   beta_st~r   alpha_s~r |
           |------------------------------------------------------------------------------------------------------------------------|
      101. |       2      1   .7755345   .7378782        100   .0022436   -.00793759   .0454931   .56796925   .09242964   .05454715 |
           +------------------------------------------------------------------------------------------------------------------------+
      .

      Comment


      • #4
        Thank you for your prompt responses. Robert Picard the rangestat code you provided works perfectly well. It now takes about 10 seconds to do the regressions for my complete dataset, which is an improvement I did not expect but I am very grateful for. As far as I could check until now the regressions are calculated correctly. Thank you very much!

        Comment


        • #5
          Great! For completeness, I'll add that you can use runby (from SSC) to do the same calculations. runby is a general tool to run Stata code on by-groups of observations and is more efficient than statsby. It will be a bit slower than the rangestat solution above due to the extra overhead related to using regress. The code below repeats the same as above and runs in about 35 seconds on my computer.

          Code:
          * come on finance people, it's not that hard to create a demonstration dataset!
          clear all
          set seed 123
          set obs 7000
          gen lpermno = _n
          expand 100
          bysort lpermno: gen time = _n
          gen prccq = runiform()
          gen epspiq = runiform()
          
          program my_reg
            reg prccq epspiq, robust
            gen alpha        = _b[_cons]
            gen alpha_stderr = _se[_cons]
            gen beta        = _b[epspiq]
            gen beta_stderr = _se[epspiq]
          end
          
          runby my_reg, by(lpermno) status

          Comment


          • #6
            Hi again,

            if I use the above mentioned code twice in a Do-file (I acutally change it to run regressions on different variables and also change the names of the variables containing the results), I receive the following error message:

            "reg_robust() already exists
            (41 lines skipped)".

            The command "clear" which I usually used does not seem to do the trick here. Only restarting Stata worked for me until now.

            What can I do to be able to run the rangestat code twice in a Do-file, without having to restart Stata?

            Thanks in advance.
            Best,
            Timo

            Comment


            • #7
              Hint: the example starts with
              Code:
              clear all

              Comment

              Working...
              X