Panel data - Residual Standard Deviation of Regressions

Nils Holstein

Join Date: Oct 2017

Posts: 14
#1

Panel data - Residual Standard Deviation of Regressions

07 Nov 2017, 16:10

Hello everybody!

I have a panel dataset consisting of daily data with around 10000 companies and I want to perform regressions while obtaining the standard deviation of the residuals for each regression performed.

My regression formula looks like this:

Code:

regress ret MktRF SMB HML

I have declared a panel dataset and want to restrict my regression sample to one of the companies (permno) and one month for every single regression (month). In a last step I want to store the residual standard deviation in a seperate variable in my panel for every regression sample performed.

First:

Reg Y X
Calculate residuals
Compute Standard Deviation of residuals
Store result in "new variable"

Second:

Reg Y X
Calculate residuals
Compute Standard Deviation of residuals
Store result in "new variable" 2nd cell
...

A similar thread considered the following code:
https://www.statalist.org/forums/for...ing-regression

Code:

capture program drop my_regress program define my_regress, rclass syntax varlist [if] regress `varlist' `if' tempvar resid predict `resid' if e(sample), resid summ `resid' return scalar sdr = r(sd) exit end rolling sd_resid = r(sdr) reg_id, w(15): my_regress ret MktRF SMB HML

With the code stated above I cannot run the regressions. Do you have any considerations on my code or on a faster way for computing the standard deviation of regression residuals?

Thank you in advance!

Nils
Tags: None

Carlo Lazzaro

Join Date: Apr 2014
Posts: 17680

08 Nov 2017, 03:31

Nils:
just one more general comment: if you have panel data and you are sure that -regress- outperforms -xtreg-, you should -cluster()- your standard errors on -panelid-, as your observations are not independent (but Stata cannot know that):

Code:

. use "http://www.stata-press.com/data/r14/nlswork.dta", clear
(National Longitudinal Survey.  Young Women 14-26 years of age in 1968)

. reg ln_wage race age

      Source |       SS           df       MS      Number of obs   =    28,510
-------------+----------------------------------   F(2, 28507)     =   1409.09
       Model |  586.276287         2  293.138144   Prob > F        =    0.0000
    Residual |  5930.41386    28,507  .208033601   R-squared       =    0.0900
-------------+----------------------------------   Adj R-squared   =    0.0899
       Total |  6516.69015    28,509  .228583611   Root MSE        =    .45611

------------------------------------------------------------------------------
     ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        race |  -.1112173   .0056016   -19.85   0.000    -.1221967   -.1002379
         age |   .0197575   .0004032    49.01   0.000     .0189673    .0205477
       _cons |   1.246038   .0141305    88.18   0.000     1.218341    1.273734
------------------------------------------------------------------------------

. predict res, res
(24 missing values generated)

. reg ln_wage race age, vce(cluster idcode)

Linear regression                               Number of obs     =     28,510
                                                F(2, 4709)        =     528.08
                                                Prob > F          =     0.0000
                                                R-squared         =     0.0900
                                                Root MSE          =     .45611

                             (Std. Err. adjusted for 4,710 clusters in idcode)
------------------------------------------------------------------------------
             |               Robust
     ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        race |  -.1112173   .0131033    -8.49   0.000     -.136906   -.0855287
         age |   .0197575   .0006371    31.01   0.000     .0185085    .0210065
       _cons |   1.246038   .0244231    51.02   0.000     1.198157    1.293919
------------------------------------------------------------------------------


. predict res2
(option xb assumed; fitted values)
(24 missing values generated)

. list res res2 in 1

     +---------------------+
     |      res       res2 |
     |---------------------|
  1. | .0719757   1.379238 |
     +---------------------+

.

Last edited by Carlo Lazzaro; 08 Nov 2017, 03:35.

Kind regards,
Carlo
(Stata 19.0)

Comment

Nils Holstein

Join Date: Oct 2017

Posts: 14
#3

08 Nov 2017, 06:37

Hello Calro,

thanks for the reply. In purpose of replicating empirical results, my aim is to regress each company individually and therefore the sample is limited to that company alone.

I want to conduct daily regressions of every company, and save the residual in a new variable. The variable should contain all the residual values of the daily company regressions. Next, I want compute the monthly residual standard deviation of the regressions by company.

Right now I am struggling with the code that stores the residual values of daily regressions in the new variable.

I hope there is an easy solution of the problem.

Kind regards,

Nils
Comment
Nils Holstein

Join Date: Oct 2017

Posts: 14
#4

08 Nov 2017, 07:27

Here is the code I am currently using, i feel that there it is still wrong though.

Obs_id identifies the company and day and permno identifies the company; is a cluster robust regression necessary although the regression sample is limited to daily company data?

Code:

levels obs_id, local(C) gen zresid = . foreach c in obs_id { bysort obs_id: regress ret MktRF SMB HML if obs_id == `c', vce(cluster permno) predict temp, res replace zresid = temp if obs_id == `c' drop temp }

Last edited by Nils Holstein; 08 Nov 2017, 07:30.
Comment
Nils Holstein

Join Date: Oct 2017

Posts: 14
#5

16 Nov 2017, 15:38

Maybe i need to further clarify the problem. For my analysis I am using a set of financial data from the CRSP database. The dataset consists of daily stock data for tradingdays (with gaps), in total I have 20,000 companies with diverging length of individual time series. The whole dataset consists of 37M observations and is declared as a panel.

To replicate a former study (Ang 2006) I want to extract regression residuals on seperate regressions for month and company (12 monthly regressions per company and year) . Companies with less than 17 monthly observations are filtered from the dataset.

Because the dataset is extremely large, i split the dataset to consider only one year of data which reduces the dataset to 1M. In total, I still run 50,000 regressions per year and it takes about 2 hours to compute residuals for a single year within the dataset.

This is the loop I am currently using:

Code:

egen long id = group(permno month) gen resid = . label variable resid "Residuen" summ id, meanonly quietly forvalues x = 1/`r(max)' { reg retRF MktRF SMB HML if id == `x' predict temp, res replace resid = temp if id == `x' drop temp }

I already cleared the splitted dataset from any unused variables and try to optimize the regression performance. I already tried to use the fastreg command by Geertsema (2014) but it is not compatible with extracting regression residuals. (http://ssrn.com/abstract=2423171)

I would be very glad for any advice on optimizing performance speed of the calcualtions.

I also posted my problem in another forum:
https://stackoverflow.com/questions/...-panel-dataset

Kind regards,

Nils Holstein

Last edited by Nils Holstein; 16 Nov 2017, 15:42.
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#6

16 Nov 2017, 16:10

Perhaps replacing

Code:

predict temp, res

with

Code:

predict temp if id == `x', res

will speed things up.
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

16 Nov 2017, 17:32

You can use runby (from SSC) to perform regressions on by-groups but because of the size of the problem, you are probably better off using rangestat (also from SSC). Here's a quick example that mimics your code.

Code:

clear all
set seed 123
set obs 200
gen long permno = _n
expand 12
bysort permno: gen month = _n
expand runiformint(17,50)
gen retRF = runiform()
gen MktRF = runiform()
gen SMB = runiform()
gen HML = runiform()
sort permno month

egen long id = group(permno month)

rangestat (reg) retRF MktRF SMB HML, interval(month 0 0) by(permno)
gen resid = retRF - (b_MktRF*MktRF + b_SMB*SMB + b_HML*HML + b_cons)

reg retRF MktRF SMB HML if id == 2
predict temp if id == 2, res
list permno month resid temp if id == 2

The last 3 lines perform a spot check using observations from id 2.

Announcement