Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Panel data - Residual Standard Deviation of Regressions

    Hello everybody!

    I have a panel dataset consisting of daily data with around 10000 companies and I want to perform regressions while obtaining the standard deviation of the residuals for each regression performed.

    My regression formula looks like this:
    Code:
    regress ret MktRF SMB HML
    I have declared a panel dataset and want to restrict my regression sample to one of the companies (permno) and one month for every single regression (month). In a last step I want to store the residual standard deviation in a seperate variable in my panel for every regression sample performed.


    First:

    Reg Y X
    Calculate residuals
    Compute Standard Deviation of residuals
    Store result in "new variable"


    Second:

    Reg Y X
    Calculate residuals
    Compute Standard Deviation of residuals
    Store result in "new variable" 2nd cell
    ...

    A similar thread considered the following code:
    https://www.statalist.org/forums/for...ing-regression

    Code:
    capture program drop my_regress
    program define my_regress, rclass
        syntax varlist [if]
        regress `varlist' `if'
        tempvar resid
        predict `resid' if e(sample), resid
        summ `resid'
        return scalar sdr = r(sd)
        exit
    end
    
    rolling sd_resid = r(sdr) reg_id, w(15): my_regress ret MktRF SMB HML
    With the code stated above I cannot run the regressions. Do you have any considerations on my code or on a faster way for computing the standard deviation of regression residuals?

    Thank you in advance!

    Nils

  • #2
    Nils:
    just one more general comment: if you have panel data and you are sure that -regress- outperforms -xtreg-, you should -cluster()- your standard errors on -panelid-, as your observations are not independent (but Stata cannot know that):
    Code:
    . use "http://www.stata-press.com/data/r14/nlswork.dta", clear
    (National Longitudinal Survey.  Young Women 14-26 years of age in 1968)
    
    . reg ln_wage race age
    
          Source |       SS           df       MS      Number of obs   =    28,510
    -------------+----------------------------------   F(2, 28507)     =   1409.09
           Model |  586.276287         2  293.138144   Prob > F        =    0.0000
        Residual |  5930.41386    28,507  .208033601   R-squared       =    0.0900
    -------------+----------------------------------   Adj R-squared   =    0.0899
           Total |  6516.69015    28,509  .228583611   Root MSE        =    .45611
    
    ------------------------------------------------------------------------------
         ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            race |  -.1112173   .0056016   -19.85   0.000    -.1221967   -.1002379
             age |   .0197575   .0004032    49.01   0.000     .0189673    .0205477
           _cons |   1.246038   .0141305    88.18   0.000     1.218341    1.273734
    ------------------------------------------------------------------------------
    
    . predict res, res
    (24 missing values generated)
    
    . reg ln_wage race age, vce(cluster idcode)
    
    Linear regression                               Number of obs     =     28,510
                                                    F(2, 4709)        =     528.08
                                                    Prob > F          =     0.0000
                                                    R-squared         =     0.0900
                                                    Root MSE          =     .45611
    
                                 (Std. Err. adjusted for 4,710 clusters in idcode)
    ------------------------------------------------------------------------------
                 |               Robust
         ln_wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            race |  -.1112173   .0131033    -8.49   0.000     -.136906   -.0855287
             age |   .0197575   .0006371    31.01   0.000     .0185085    .0210065
           _cons |   1.246038   .0244231    51.02   0.000     1.198157    1.293919
    ------------------------------------------------------------------------------
    
    
    . predict res2
    (option xb assumed; fitted values)
    (24 missing values generated)
    
    . list res res2 in 1
    
         +---------------------+
         |      res       res2 |
         |---------------------|
      1. | .0719757   1.379238 |
         +---------------------+
    
    .
    Last edited by Carlo Lazzaro; 08 Nov 2017, 03:35.
    Kind regards,
    Carlo
    (Stata 19.0)

    Comment


    • #3
      Hello Calro,

      thanks for the reply. In purpose of replicating empirical results, my aim is to regress each company individually and therefore the sample is limited to that company alone.

      I want to conduct daily regressions of every company, and save the residual in a new variable. The variable should contain all the residual values of the daily company regressions. Next, I want compute the monthly residual standard deviation of the regressions by company.


      Right now I am struggling with the code that stores the residual values of daily regressions in the new variable.

      I hope there is an easy solution of the problem.


      Kind regards,

      Nils

      Comment


      • #4
        Here is the code I am currently using, i feel that there it is still wrong though.

        Obs_id identifies the company and day and permno identifies the company; is a cluster robust regression necessary although the regression sample is limited to daily company data?

        Code:
        levels obs_id, local(C)
        gen zresid = .
        foreach c in obs_id {
            bysort obs_id: regress ret MktRF SMB HML if obs_id == `c', vce(cluster permno)
            predict temp, res
            replace zresid = temp if obs_id == `c'
            drop temp
        }
        Last edited by Nils Holstein; 08 Nov 2017, 07:30.

        Comment


        • #5
          Maybe i need to further clarify the problem. For my analysis I am using a set of financial data from the CRSP database. The dataset consists of daily stock data for tradingdays (with gaps), in total I have 20,000 companies with diverging length of individual time series. The whole dataset consists of 37M observations and is declared as a panel.

          To replicate a former study (Ang 2006) I want to extract regression residuals on seperate regressions for month and company (12 monthly regressions per company and year) . Companies with less than 17 monthly observations are filtered from the dataset.

          Because the dataset is extremely large, i split the dataset to consider only one year of data which reduces the dataset to 1M. In total, I still run 50,000 regressions per year and it takes about 2 hours to compute residuals for a single year within the dataset.

          This is the loop I am currently using:

          Code:
              egen long id = group(permno month)
              
              gen resid = .
                  label variable resid "Residuen"
                  
              summ id, meanonly
              quietly forvalues x = 1/`r(max)' {
                  reg retRF MktRF SMB HML if id == `x'
                  predict temp, res
                  replace resid = temp if id == `x'
                  drop temp
                  }
          I already cleared the splitted dataset from any unused variables and try to optimize the regression performance. I already tried to use the fastreg command by Geertsema (2014) but it is not compatible with extracting regression residuals. (http://ssrn.com/abstract=2423171)

          I would be very glad for any advice on optimizing performance speed of the calcualtions.

          I also posted my problem in another forum:
          https://stackoverflow.com/questions/...-panel-dataset

          Kind regards,

          Nils Holstein
          Last edited by Nils Holstein; 16 Nov 2017, 15:42.

          Comment


          • #6
            Perhaps replacing
            Code:
            predict temp, res
            with
            Code:
            predict temp if id == `x', res
            will speed things up.

            Comment


            • #7
              You can use runby (from SSC) to perform regressions on by-groups but because of the size of the problem, you are probably better off using rangestat (also from SSC). Here's a quick example that mimics your code.

              Code:
              clear all
              set seed 123
              set obs 200
              gen long permno = _n
              expand 12
              bysort permno: gen month = _n
              expand runiformint(17,50)
              gen retRF = runiform()
              gen MktRF = runiform()
              gen SMB = runiform()
              gen HML = runiform()
              sort permno month
              
              egen long id = group(permno month)
              
              rangestat (reg) retRF MktRF SMB HML, interval(month 0 0) by(permno)
              gen resid = retRF - (b_MktRF*MktRF + b_SMB*SMB + b_HML*HML + b_cons)
              
              reg retRF MktRF SMB HML if id == 2
              predict temp if id == 2, res
              list permno month resid temp if id == 2
              The last 3 lines perform a spot check using observations from id 2.

              Comment

              Working...
              X