Rolling Regression with Unbalanced Panel Data

Melina Murren

Join Date: Mar 2017

Posts: 8
#1

Rolling Regression with Unbalanced Panel Data

16 Mar 2017, 06:47

Hello, I am very new to Stata and this forum so please excuse and feel free to correct any mistakes that I make (trying my best to post correctly)!

I have 48 years of monthly panel data for stock returns as follows (only an illustration):

gvkey date Peer_Excess Market_Excess sic Return
1001 jan1982 .063452 .034827 508 .032871
1001 feb1982 -.039245 .043827 508 -.067398
.
.
.
1002 jan1982 .032718 .031563 372 .021783
1002 feb1982 -.023142 .024892 372 -.042892
.
.
1003 march1983 .023813 .043291 210 .049382
1003 may1983 .032432 .032871 210 .054938

where gvkey is the identifier for each stock and Peer_Excess is the return on an equally weighted portfolio of peer firms in firm i's three-digit industry group (sic). I need to perform rolling regressions for each firm by regressing each firm's returns on the Peer_Excess and Market_Excess variables. The window begins with 24 months and increase up to 60 months. I need to save the intercept and betas for each regression. I have xtset my data prior to attempting any regressions.

The main problem is that the data is unbalanced. For instance, one firm has 200 data points over time and another has 120. The majority of the firm data also have gaps between months. I have looked at all of the help files and the Stata manual as well as all of the forum postings regarding rolling regressions and I cannot seem to find a workable solution. I have made attempts with rolling and rollreg (user written command).

I have also tried the following from another thread:

Code:

gen alpha=. compress save "intermediate.dta" forvalues i=1/_N{ use intermediate.dta keep if gvkey==`i' sum caldt if gvkey==`i' local start_time=r(min)+24 local end_time=r(max) forvalues t=`start_time'/`end_time'{ quitely reg Return Excess_Return Peer_Excess if gvkey==`i' & caldt<=`t' & caldt>=`t'-59 replace alpha=_b[_cons] if gvkey==`i' & caldt==`t' & e(N)>=24 }save intermediate`i'.dta } use intermediate.dta forvalues i=1/_N{ append using intermediate`i'.dta erase intermediate`i'.dta }

Any help or hints are greatly appreciated!!
Tags: gaps, panel data, rolling regression

Robert Picard

Join Date: Mar 2014
Posts: 1536

16 Mar 2017, 09:35

There have been several Statalist threads recently on how to do this. The simplest and most efficient way is to use rangestat (from SSC). I can't use your data example but it's pretty easy to concoct fake data that mimics it. Note that I use monthly dates. If you have daily dates, use the mofd() function to generate a monthly date. See help datetime for more information.

The first part of the code below creates the data, then there's code that defines a basic regression in Mata. You do not need to understand this code and you should just include it as it. Then the rangestat call performs all the regressions, one for each observation in the data. rangestat creates the variable myreg1 with the number of observation and myreg2 myreg3 and myreg4 contain the regression coefficients. The next line renames these generic names to something more descriptive.

A regression is performed for each observation in the original data. The number of observation in the sample can be used to reject results if they contain fewer observations than you want.

Finally, you should always spot check that you set up the rangestat command correctly by manually performing some regressions. I show how to do this at the end.

To run this example, copy the whole thing to a do-file and run it as a whole:

Code:

clear
set seed 312123
set obs 2
gen long gvkey = _n
gen long mdate = runiformint(ym(1968,1), ym(2016,12))
format %tm mdate
expand ym(2016,12) - mdate + 1
bysort gvkey: replace mdate = mdate + _n - 1
gen Peer_Excess = runiform()
gen Excess_Return = runiform()
gen Return = runiform()
drop if runiform() < .2
isid gvkey mdate, sort


* --------- basic regression mata code: DO NOT CHANGE CODE BELOW ---------------
* linear regression in Mata using quadcross() - help mata cross(), example 2
mata:
mata clear
mata set matastrict on
real rowvector myreg(real matrix Xall)
{
    real colvector y, b, Xy
    real matrix X, XX

    y = Xall[.,1]                // dependent var is first column of Xall
    X = Xall[.,2::cols(Xall)]    // the remaining cols are the independent variables
    X = X,J(rows(X),1,1)         // add a constant
    
    XX = quadcross(X, X)        // linear regression, see help mata cross(), example 2
    Xy = quadcross(X, y)
    b  = invsym(XX) * Xy
    
    return(rows(X), b')
}
end
* --------- end of basic regression mata code: DO NOT CHANGE CODE ABOVE --------


rangestat (myreg) Return Excess_Return Peer_Excess, interval(mdate -59 0) by(gvkey) casewise
rename myreg* (obs b_Excess_Return b_Peer_Excess b_cons)


* ------------------- spot check the coefficients for a few cases --------------

local i 50
reg Return Excess_Return Peer_Excess if inrange(mdate, mdate[`i']-59, mdate[`i']) & gvkey == gvkey[`i']
list obs b_* in `i'

local i 150
reg Return Excess_Return Peer_Excess if inrange(mdate, mdate[`i']-59, mdate[`i']) & gvkey == gvkey[`i']
list obs b_* in `i'

Comment

Melina Murren

Join Date: Mar 2017
Posts: 8

12 Apr 2017, 08:10

Hi Robert, this code worked perfectly!!
However I did have to change the Mata code a bit to match my specifications. Here is what I did for anyone who may be interested:

Code:

/*******basic regression mata code*********************************************/
* linear regression in Mata using quadcross()
mata:
mata clear
mata set matastrict on
real rowvector myreg(real matrix Xall)
{
real colvector y, b, Xy
real matrix X, XX
real scalar ymean, tss, mss, r2, r2a

y = Xall[.,1] // dependent var is first column of Xall
X = Xall[.,2::cols(Xall)] // the remaining cols are the independent variables
X = X,J(rows(X),1,1) // add a constant

XX = quadcross(X, X) // linear regression
Xy = quadcross(X, y)
b = invsym(XX) * Xy

ymean = mean(y)
tss = sum((y :- ymean) :^ 2) // total sum of squares
mss = sum( (X * b :- ymean) :^ 2) // model sum of squares
r2 = mss / tss
r2a = 1 - (1 - r2) * (rows(X) - 1) / (rows(X) - cols(X))

return(rows(X), r2a, b')
}
end
/*************Mata end*********************************************************/

rangestat (myreg) return mkt_excess peer_excess, interval(edate -59 0) by(permno) casewise
rename myreg* (obs AdjR2 b_mkt b_peer b_cons)

Comment

rozita saadan

Join Date: Apr 2016
Posts: 60

18 Apr 2017, 05:24

Hi Robert Picard and Melina Murren.

I am having the same problem where I have a very large dataset.
I have 20 years of daily panel data, 2000 companies. I would like to run a monthly-time series regression for each company. I have used the rolling technique, but it is very time-consuming as the data is very large.
This is the sample of data:

----------------------- copy starting from the next line -----------------------

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float idc str6 company float(caldate holidummy MonDum TuesDum WedDum ThurDum bcaldate return monthly yearly residualR aggregate_market_residualR)
2 "982912" 12786 0 0 1 0 0  2  -3.774033 420 1995  -2.589102 -3.0176375
2 "982912" 12787 0 0 0 1 0  3  -3.922071 420 1995 -3.6458254  -3.429811
2 "982912" 12788 0 0 0 0 1  4  1.9802628 420 1995 -.17556953  -.4146185
2 "982912" 12789 0 0 0 0 0  5 -1.9802628 420 1995 -3.0008125  -2.668158
2 "982912" 12792 0 1 0 0 0  6  -6.187541 420 1995  -.7600012 -.14440288
2 "982912" 12793 0 0 1 0 0  7 -11.247798 420 1995 -10.062867  -7.401803
2 "982912" 12794 0 0 0 1 0  8  4.6520014 420 1995   4.928247   3.439408
2 "982912" 12795 0 0 0 0 1  9   8.701138 420 1995   6.545305   7.295943
2 "982912" 12796 0 0 0 0 0 10   6.062462 420 1995   5.041913  4.6351266
2 "982912" 12799 0 1 0 0 0 11 -1.9802628 420 1995  3.4472766  2.4033484
2 "982912" 12800 0 0 1 0 0 12          0 420 1995  1.1849307  1.9046142
2 "982912" 12801 0 0 0 1 0 13 -4.0821996 420 1995 -3.8059535 -1.3642504
2 "982912" 12802 0 0 0 0 1 14 -4.2559614 420 1995  -6.411794  -5.885695
2 "982912" 12803 0 0 0 0 0 15          0 420 1995 -1.0205499 -.42145535
2 "982912" 12806 0 1 0 0 0 16  -4.445176 420 1995   .9823632 .019192345
2 "982912" 12807 0 0 1 0 0 17          0 420 1995  1.1849307  2.2053733
2 "982912" 12808 0 0 0 1 0 18  2.2472856 420 1995  2.5235314    1.06253
2 "982912" 12809 0 0 0 0 1 19  2.1978908 420 1995  .04205847 -1.0214025
2 "982912" 12810 0 0 0 0 0 20          0 420 1995 -1.0205499 -1.5473646
2 "982912" 12813 0 1 0 0 0 21 -9.0971775 420 1995  -3.669638 -2.2325153
2 "982912" 12814 0 0 1 0 0 22  9.0971775 420 1995  10.282108   6.477159
2 "982912" 12815 0 0 0 1 0 23          0 421 1995   4.048622     3.9379
2 "982912" 12816 0 0 0 0 1 24          0 421 1995  1.2170115   1.237449
2 "982912" 12817 0 0 0 0 0 25          0 421 1995  -3.509646 -1.1499851
2 "982912" 12820 0 1 0 0 0 26  -4.445176 421 1995  -1.313517   .6534857
end
format %td caldate
format %tbmybcal bcaldate
format %tm monthly
format %ty yearly

I want to run two-step regression:
For the first regression, I regress day-of-week-dummy, holiday dummy on return. I want to store the residual value.
For the second regression, I regress aggregate market residual on the residual. I would like to save the beta coefficient and R-square for each regression.

This is the command I used for both regressions:

Code:

Code:
*equation 1*
egen group = group(company monthly)
gen fittedR = .
su group, meanonly

forval g = 1/`r(max)' {
    regress return MonDum TuesDum WedDum ThurDum FriDum holidummy if group == `g', noconstant
    predict workR
    replace fittedR = workR if group == `g'  
    drop workR
}
gen residualR = return - fittedR


*Equation 2*
**store r2 using forval
xtset
gen r2R = .
su group, meanonly

forval g = 1/`r(max)' {
    capture reg residualR aggregate_market_residualR L.aggregate_market_residualR F.aggregate_market_residualR if group == `g'
    replace r2R = e(r2_a) if group == `g'  
}

I had run the regression more than 24hours, but it is still running. Can you help me to solve this problem.
Your help is highly appreciated.

Announcement