Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Rolling Regression with Unbalanced Panel Data

    Hello, I am very new to Stata and this forum so please excuse and feel free to correct any mistakes that I make (trying my best to post correctly)!

    I have 48 years of monthly panel data for stock returns as follows (only an illustration):

    gvkey date Peer_Excess Market_Excess sic Return
    1001 jan1982 .063452 .034827 508 .032871
    1001 feb1982 -.039245 .043827 508 -.067398
    .
    .
    .
    1002 jan1982 .032718 .031563 372 .021783
    1002 feb1982 -.023142 .024892 372 -.042892
    .
    .
    1003 march1983 .023813 .043291 210 .049382
    1003 may1983 .032432 .032871 210 .054938

    where gvkey is the identifier for each stock and Peer_Excess is the return on an equally weighted portfolio of peer firms in firm i's three-digit industry group (sic). I need to perform rolling regressions for each firm by regressing each firm's returns on the Peer_Excess and Market_Excess variables. The window begins with 24 months and increase up to 60 months. I need to save the intercept and betas for each regression. I have xtset my data prior to attempting any regressions.

    The main problem is that the data is unbalanced. For instance, one firm has 200 data points over time and another has 120. The majority of the firm data also have gaps between months. I have looked at all of the help files and the Stata manual as well as all of the forum postings regarding rolling regressions and I cannot seem to find a workable solution. I have made attempts with rolling and rollreg (user written command).

    I have also tried the following from another thread:
    Code:
    gen alpha=.
    compress 
    save "intermediate.dta"
    forvalues i=1/_N{
        use intermediate.dta
        keep if gvkey==`i'
        sum caldt if gvkey==`i'
        local start_time=r(min)+24
        local end_time=r(max)
        
        forvalues t=`start_time'/`end_time'{
            quitely reg Return Excess_Return Peer_Excess if gvkey==`i' & caldt<=`t' & caldt>=`t'-59
            replace alpha=_b[_cons] if gvkey==`i' & caldt==`t' & e(N)>=24
            }save intermediate`i'.dta
        }    
        
        use intermediate.dta
        forvalues i=1/_N{
        append using intermediate`i'.dta
        erase intermediate`i'.dta
        }
    Any help or hints are greatly appreciated!!

  • #2
    There have been several Statalist threads recently on how to do this. The simplest and most efficient way is to use rangestat (from SSC). I can't use your data example but it's pretty easy to concoct fake data that mimics it. Note that I use monthly dates. If you have daily dates, use the mofd() function to generate a monthly date. See help datetime for more information.

    The first part of the code below creates the data, then there's code that defines a basic regression in Mata. You do not need to understand this code and you should just include it as it. Then the rangestat call performs all the regressions, one for each observation in the data. rangestat creates the variable myreg1 with the number of observation and myreg2 myreg3 and myreg4 contain the regression coefficients. The next line renames these generic names to something more descriptive.

    A regression is performed for each observation in the original data. The number of observation in the sample can be used to reject results if they contain fewer observations than you want.

    Finally, you should always spot check that you set up the rangestat command correctly by manually performing some regressions. I show how to do this at the end.

    To run this example, copy the whole thing to a do-file and run it as a whole:

    Code:
    clear
    set seed 312123
    set obs 2
    gen long gvkey = _n
    gen long mdate = runiformint(ym(1968,1), ym(2016,12))
    format %tm mdate
    expand ym(2016,12) - mdate + 1
    bysort gvkey: replace mdate = mdate + _n - 1
    gen Peer_Excess = runiform()
    gen Excess_Return = runiform()
    gen Return = runiform()
    drop if runiform() < .2
    isid gvkey mdate, sort
    
    
    * --------- basic regression mata code: DO NOT CHANGE CODE BELOW ---------------
    * linear regression in Mata using quadcross() - help mata cross(), example 2
    mata:
    mata clear
    mata set matastrict on
    real rowvector myreg(real matrix Xall)
    {
        real colvector y, b, Xy
        real matrix X, XX
    
        y = Xall[.,1]                // dependent var is first column of Xall
        X = Xall[.,2::cols(Xall)]    // the remaining cols are the independent variables
        X = X,J(rows(X),1,1)         // add a constant
        
        XX = quadcross(X, X)        // linear regression, see help mata cross(), example 2
        Xy = quadcross(X, y)
        b  = invsym(XX) * Xy
        
        return(rows(X), b')
    }
    end
    * --------- end of basic regression mata code: DO NOT CHANGE CODE ABOVE --------
    
    
    rangestat (myreg) Return Excess_Return Peer_Excess, interval(mdate -59 0) by(gvkey) casewise
    rename myreg* (obs b_Excess_Return b_Peer_Excess b_cons)
    
    
    * ------------------- spot check the coefficients for a few cases --------------
    
    local i 50
    reg Return Excess_Return Peer_Excess if inrange(mdate, mdate[`i']-59, mdate[`i']) & gvkey == gvkey[`i']
    list obs b_* in `i'
    
    local i 150
    reg Return Excess_Return Peer_Excess if inrange(mdate, mdate[`i']-59, mdate[`i']) & gvkey == gvkey[`i']
    list obs b_* in `i'

    Comment


    • #3
      Hi Robert, this code worked perfectly!!
      However I did have to change the Mata code a bit to match my specifications. Here is what I did for anyone who may be interested:

      Code:
      /*******basic regression mata code*********************************************/
      * linear regression in Mata using quadcross()
      mata:
      mata clear
      mata set matastrict on
      real rowvector myreg(real matrix Xall)
      {
      real colvector y, b, Xy
      real matrix X, XX
      real scalar ymean, tss, mss, r2, r2a
      
      y = Xall[.,1] // dependent var is first column of Xall
      X = Xall[.,2::cols(Xall)] // the remaining cols are the independent variables
      X = X,J(rows(X),1,1) // add a constant
      
      XX = quadcross(X, X) // linear regression
      Xy = quadcross(X, y)
      b = invsym(XX) * Xy
      
      ymean = mean(y)
      tss = sum((y :- ymean) :^ 2) // total sum of squares
      mss = sum( (X * b :- ymean) :^ 2) // model sum of squares
      r2 = mss / tss
      r2a = 1 - (1 - r2) * (rows(X) - 1) / (rows(X) - cols(X))
      
      return(rows(X), r2a, b')
      }
      end
      /*************Mata end*********************************************************/
      
      rangestat (myreg) return mkt_excess peer_excess, interval(edate -59 0) by(permno) casewise
      rename myreg* (obs AdjR2 b_mkt b_peer b_cons)

      Comment


      • #4
        Hi Robert Picard and Melina Murren.

        I am having the same problem where I have a very large dataset.
        I have 20 years of daily panel data, 2000 companies. I would like to run a monthly-time series regression for each company. I have used the rolling technique, but it is very time-consuming as the data is very large.
        This is the sample of data:

        ----------------------- copy starting from the next line -----------------------
        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input float idc str6 company float(caldate holidummy MonDum TuesDum WedDum ThurDum bcaldate return monthly yearly residualR aggregate_market_residualR)
        2 "982912" 12786 0 0 1 0 0  2  -3.774033 420 1995  -2.589102 -3.0176375
        2 "982912" 12787 0 0 0 1 0  3  -3.922071 420 1995 -3.6458254  -3.429811
        2 "982912" 12788 0 0 0 0 1  4  1.9802628 420 1995 -.17556953  -.4146185
        2 "982912" 12789 0 0 0 0 0  5 -1.9802628 420 1995 -3.0008125  -2.668158
        2 "982912" 12792 0 1 0 0 0  6  -6.187541 420 1995  -.7600012 -.14440288
        2 "982912" 12793 0 0 1 0 0  7 -11.247798 420 1995 -10.062867  -7.401803
        2 "982912" 12794 0 0 0 1 0  8  4.6520014 420 1995   4.928247   3.439408
        2 "982912" 12795 0 0 0 0 1  9   8.701138 420 1995   6.545305   7.295943
        2 "982912" 12796 0 0 0 0 0 10   6.062462 420 1995   5.041913  4.6351266
        2 "982912" 12799 0 1 0 0 0 11 -1.9802628 420 1995  3.4472766  2.4033484
        2 "982912" 12800 0 0 1 0 0 12          0 420 1995  1.1849307  1.9046142
        2 "982912" 12801 0 0 0 1 0 13 -4.0821996 420 1995 -3.8059535 -1.3642504
        2 "982912" 12802 0 0 0 0 1 14 -4.2559614 420 1995  -6.411794  -5.885695
        2 "982912" 12803 0 0 0 0 0 15          0 420 1995 -1.0205499 -.42145535
        2 "982912" 12806 0 1 0 0 0 16  -4.445176 420 1995   .9823632 .019192345
        2 "982912" 12807 0 0 1 0 0 17          0 420 1995  1.1849307  2.2053733
        2 "982912" 12808 0 0 0 1 0 18  2.2472856 420 1995  2.5235314    1.06253
        2 "982912" 12809 0 0 0 0 1 19  2.1978908 420 1995  .04205847 -1.0214025
        2 "982912" 12810 0 0 0 0 0 20          0 420 1995 -1.0205499 -1.5473646
        2 "982912" 12813 0 1 0 0 0 21 -9.0971775 420 1995  -3.669638 -2.2325153
        2 "982912" 12814 0 0 1 0 0 22  9.0971775 420 1995  10.282108   6.477159
        2 "982912" 12815 0 0 0 1 0 23          0 421 1995   4.048622     3.9379
        2 "982912" 12816 0 0 0 0 1 24          0 421 1995  1.2170115   1.237449
        2 "982912" 12817 0 0 0 0 0 25          0 421 1995  -3.509646 -1.1499851
        2 "982912" 12820 0 1 0 0 0 26  -4.445176 421 1995  -1.313517   .6534857
        end
        format %td caldate
        format %tbmybcal bcaldate
        format %tm monthly
        format %ty yearly
        I want to run two-step regression:
        For the first regression, I regress day-of-week-dummy, holiday dummy on return. I want to store the residual value.
        For the second regression, I regress aggregate market residual on the residual. I would like to save the beta coefficient and R-square for each regression.

        This is the command I used for both regressions:

        Code:
        Code:
        *equation 1*
        egen group = group(company monthly)
        gen fittedR = .
        su group, meanonly
        
        forval g = 1/`r(max)' {
            regress return MonDum TuesDum WedDum ThurDum FriDum holidummy if group == `g', noconstant
            predict workR
            replace fittedR = workR if group == `g'  
            drop workR
        }
        gen residualR = return - fittedR
        
        
        *Equation 2*
        **store r2 using forval
        xtset
        gen r2R = .
        su group, meanonly
        
        forval g = 1/`r(max)' {
            capture reg residualR aggregate_market_residualR L.aggregate_market_residualR F.aggregate_market_residualR if group == `g'
            replace r2R = e(r2_a) if group == `g'  
        }

        I had run the regression more than 24hours, but it is still running. Can you help me to solve this problem.
        Your help is highly appreciated.



        Comment

        Working...
        X