Rolling window time-series correlation in panel where the correlation is to be between observations with different cross-sectional id

Guenther Mueller

Join Date: Oct 2018

Posts: 7
#1

Rolling window time-series correlation in panel where the correlation is to be between observations with different cross-sectional id

18 Oct 2018, 10:27

The structure of my data is as follows (simpliefied example)

id
date
x

id date x
1 1 3214324
1 2 3456435
1 3 3253455
1 4 345435
....
1 T 345435
.
.
.
2 1 323424
2 2 346435
2 3 6563455
2 4 564546
....
2 T 3532
.
.
.
3 1 324524
3 2 126435
3 3 363455
3 4 54546
....
3 T 32245
.
.
.
N 1 324524
N 2 126435
N 3 363455
N 4 54546
....
N T 34345

What I want is a rolling window correlation between the realizations of x across ids at each date so, for example between the first n observations of x where id==1 and the first n observations where id==2, and the the first n observations of x where id==1 and the first n observations where id==3, and so on. at each date.

Do I need to reorganize the data to have variables such as id1x and id2x with unique identifier date for one observation? if so, how do I do it ideally?

Thanks for your help!

Best,

Guenther
Tags: None
Guenther Mueller

Join Date: Oct 2018

Posts: 7
#2

18 Oct 2018, 12:53

Sorry, it shouldn't say time-series in the header.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30097
#3

18 Oct 2018, 13:02

If I understand correctly what you want to do here, you want to calculate N*(N-1)/2 rolling window correlations, of which there are order of magnitude T. So your results will be kT*N*(N-1)/2 correlation coefficients, k < 1, calculated from a corpus of N*T observations. Unless N and T are small, you are going to have a very unwieldy set of results, and I have difficulty envisioning what you could usefully do with it once you got it. And if, in particular, N is large, you are going to get an explosively large set of results from doing this.

So perhaps I am misunderstanding what you want to do. If I'm understanding it properly, my first thoughts would not be about how to do this but about finding a better plan. Maybe there really is a good rationale for doing this, but I can't figure it out. What am I missing?
Comment
Guenther Mueller

Join Date: Oct 2018

Posts: 7
#4

18 Oct 2018, 15:33

Thanks for your reply. You are absolutely right. From a matrix point of view, I would want to compute kT correlation matrices, where k<1 comes from "losing observations" due to the minimum length of the rolling window.

Believe it or not, I am replicating the approach from a paper where the exact same thing is done. For them, N=700 or so. I'd like to to do it with about 8000. I understand that I will get a very large set of results but on the other hand, it's not so odd to compute rolling correlations from a set of variables (not necessarily variables in the stata sense), is it?
I need the correlation between a large number of time series computed in a rolling window fashion. that's all.

So now that we have established that this is really what I want. How would I want to do it?

I already used reshape to get N new variables computed from the x and the id.

Now I think I should compute the correlations on a rolling window maybe with mvcorr? I think mvcorr can only handle 2 time series. So what would be an the alternative?

again thanks for your help!
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30097
#5

18 Oct 2018, 16:07

Sorry, you haven't convinced me. If anything, this sounds even less sensible than before. With N = 8,000 you are talking about 32 million cross-correlations, each done on a rolling window. Even if there were no rolling window to further multiply this, what on earth can you sensibly do with 32 million correlation coefficients?
Comment
Guenther Mueller

Join Date: Oct 2018

Posts: 7
#6

18 Oct 2018, 16:52

In the end, it won't be as many because not every time series will have observations for all dates. some only come up later in the sample, some that were there in the beginning will vanish. At each point in time it will be something like 250,000 correlation coefficients.

I can explain the paper to you, or you can read it for yourself, if you really want. I don't know how familiar you are with Asset Pricing but in the end, it will be used to compute a beta with respect to the market portfolio for each of the ~8000 stocks at each date using option-implied volatilities (i.e. rolling window regression on just one variable is not what I need!). You need the correlations between all constituents of the market to back out the correlation of each with the market (under some assumptions). I'm not sure rolling window correlations are necessarily needed but the original paper does it so I want to keep close to that.

thanks.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30097
#7

18 Oct 2018, 17:42

So here's an illustration of how you might approach this. I've used the grunfeld data set, reshaped wide, to illustrate. I also picked the width of the rolling window to be 10. Adapt the code using your own variable names, etc.

Code:

// SET UP A DEMONSTRATION DATASET webuse grunfeld, clear keep company year mvalue reshape wide mvalue, i(year) j(company) des summ year // DEFINE WINDOW WIDTH local window 10 // SETUP A POSTFILE TO RECEIVE THE RESULTS tempfile correlations capture postutil clear postfile handle str32 firm1 str32 firm2 int start_year int n_obs float correlation /// using `correlations' // CREATE A LOCAL MACRO LISTING THE ASSETS TO BE CORRELATED ds mvalue* local assets `r(varlist)' local n_assets: word count `assets' // LOOP OVER PAIRS OF ASSETS // CALCULATING THE CORRELATIONS AND STORING THEM // IN THE POSTFILE forvalues i = 1/`n_assets' { local firm1: word `i' of `assets' forvalues j = `=`i'+1'/`n_assets' { local firm2: word `j' of `assets' rangestat (corr) `firm1' `firm2', interval(year 0 `=`window'-1') forvalues k = 1/`=_N' { if !missing(corr_x[`k']) { post handle ("`firm1'") ("`firm2'") (year[`k']) (corr_nobs[`k']) (corr_x[`k']) } } drop corr_nobs corr_x } } postclose handle use `correlations', clear

Notes:

1. You will need the -rangestat- program to run this. It was written by Robert Picard, Nick Cox, and Roberto Ferrer, and it is available from SSC.

2. The results file indicates the year in which the window begins, and the window covers that year through that year + 9 years (or as many as are available). I know that sometimes other conventions are used, and you can adapt the code by modifying the numbers in the -interval()- option of the -rangestat- command.

3. The correlations are included in the results file whenever a correlation could be computed. If you only want correlations for which the full window is available, modify the -if- condition to add a restriction based on the value of corr_nobs[`k'].

4. Although I have tried to do everything in the most efficient way that came to my mind, with the size data set you are talking about, this may take a very long time to run. You might want to leave instructions with your grandchildren what to do with the results :-)

Good luck.
Comment
Guenther Mueller

Join Date: Oct 2018

Posts: 7
#8

19 Oct 2018, 11:35

Thank you very much!
Comment

Guenther Mueller

Join Date: Oct 2018
Posts: 7

19 Oct 2018, 16:31

Your code was really helpful, I only had to change a few bits and pieces for my purposes. It works well if there are enough overlapping observations but I am still running into the problem that if there are pairs of variables that have no overlapping observations I get the error

Code:

no result for all obs: corr ret10012 ret10158
corr_x not found

that will result in the code not going through (even with a -capture-) because eventually, at

Code:

if !missing(corr_x[`k'])

,

Stata will want to look for corr_x if I understand correctly.

Is there away to have an if-clause for the non-existence of a variable such as corr_x in this case?

I tried putting

Code:

if (`stock1'_count>2 & `stock2'_count>2)

into the `k' loop but that gave me

Code:

ret10001_count not found

. I assume because it is not stored inside another loop?

This is thecode:

Code:

use crspwide, clear

rename date actdate
gen date = _n

*des
*summ date

// shorten sample to test code
keep  date ret10001-ret10158

//    DEFINE WINDOW WIDTH
local window 60

//    SETUP A POSTFILE TO RECEIVE THE RESULTS
tempfile correlations
capture postutil clear
postfile handle str32 stock1 str32 stock2 int start_year int n_obs float correlation ///
    using `correlations'
    
//    CREATE A LOCAL MACRO LISTING THE ASSETS TO BE CORRELATED
ds ret*
local assets `r(varlist)'
local n_assets: word count `assets'

//    LOOP OVER PAIRS OF ASSETS
//    CALCULATING THE CORRELATIONS AND STORING THEM
//    IN THE POSTFILE
forvalues i = 1/`n_assets' {
    local stock1: word `i' of `assets'
    forvalues j = `=`i'+1'/`n_assets' {
        local stock2: word `j' of `assets'

        rangestat (corr) `stock1' `stock2' (count) `stock1' `stock2',  interval(date `=-`window'' -1)
        forvalues k = 1/`=_N' {
        if !missing(corr_x[`k']) {
                post handle ("`stock1'") ("`stock2'") (date[`k']) (corr_nobs[`k']) (corr_x[`k'])
                                 }
                             }
        capture drop `stock1'_count `stock2'_count
        capture drop corr_nobs corr_x
                                    }
                            }
        
postclose handle

use `correlations', clear

Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30097

#10

19 Oct 2018, 17:17

I believe this does it:

Code:

use crspwide, clear

rename date actdate
gen date = _n

*des
*summ date

// shorten sample to test code
keep  date ret10001-ret10158

//    DEFINE WINDOW WIDTH
local window 60

//    SETUP A POSTFILE TO RECEIVE THE RESULTS
tempfile correlations
capture postutil clear
postfile handle str32 stock1 str32 stock2 int start_year int n_obs float correlation ///
    using `correlations'
    
//    CREATE A LOCAL MACRO LISTING THE ASSETS TO BE CORRELATED
ds ret*
local assets `r(varlist)'
local n_assets: word count `assets'

//    LOOP OVER PAIRS OF ASSETS
//    CALCULATING THE CORRELATIONS AND STORING THEM
//    IN THE POSTFILE
forvalues i = 1/`n_assets' {
    local stock1: word `i' of `assets'
    forvalues j = `=`i'+1'/`n_assets' {
        local stock2: word `j' of `assets'

        capture rangestat (corr) `stock1' `stock2',  interval(date `=-`window'' -1)
        if c(rc) == 0 {
            forvalues k = 1/`=_N' {
                if !missing(corr_x[`k']) {
                        post handle ("`stock1'") ("`stock2'") (date[`k']) (corr_nobs[`k']) (corr_x[`k'])
                }
            }
            drop corr_nobs corr_x
        }
    }
}
        
postclose handle

use `correlations', clear

As you did not provide any example data, this code is not tested. It may contain errors. Beware, in particular, the possibility that some of the { and } do not pair up correctly. I made several copy/paste's and moves while editing this, and things may have gotten disrupted.

Comment

Guenther Mueller

Join Date: Oct 2018

Posts: 7
#11

21 Oct 2018, 10:02

Thank you. I attached the sample data. Unfortunately, I still run into the same problem at

Code:

if !missing(corr_x[`k'])

The reason seems to be that c(rc) for the error

no result for all obs: corr ret10012 ret10158

is zero:

Code:

. capture rangestat (corr) ret10012 ret10158, interval(date -11 -1) . display _rc 0

hence the
if c(rc) == 0 does not do what it was intended to do.

What could be an alternative? I
Attached Files

sample_data.dta (77.0 KB, 1 view)
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30097

#12

21 Oct 2018, 10:33

Hmm! It is odd that -rangestat- returns a 0 value for c(rc) in that circumstance, but nevertheless causes a break. But we can work around it in a different way. Before going ahead with the loop doing the posting, we can check to see if the variable corr_x has actually been created. Here's the code, modified, and demonstrated with the grunfeld data set, modified to create a no-observations condition.

Code:

//    SET UP A DEMONSTRATION DATASET
webuse grunfeld, clear

keep company year mvalue

reshape wide mvalue, i(year) j(company)


// CREATE A NO OBSERVATIONS CONDITION
// BETWEEN mvalue4 AND mvalue5
replace mvalue4 = . in 1/10
replace mvalue5 = . in 11/20

//    DEFINE WINDOW WIDTH
local window 10

//    SETUP A POSTFILE TO RECEIVE THE RESULTS
tempfile correlations
capture postutil clear
postfile handle str32 firm1 str32 firm2 int start_year int n_obs float correlation ///
    using `correlations'
    
//    CREATE A LOCAL MACRO LISTING THE ASSETS TO BE CORRELATED
ds mvalue*
local assets `r(varlist)'
local n_assets: word count `assets'

//    LOOP OVER PAIRS OF ASSETS
//    CALCULATING THE CORRELATIONS AND STORING THEM
//    IN THE POSTFILE
forvalues i = 1/`n_assets' {
    local firm1: word `i' of `assets'
    forvalues j = `=`i'+1'/`n_assets' {
        local firm2: word `j' of `assets'
        capture rangestat (corr) `firm1' `firm2', interval(year 0 `=`window'-1')
        capture confirm var corr_x, exact
        if c(rc) == 0 {
            forvalues k = 1/`=_N' {
                if !missing(corr_x[`k']) {
                    post handle ("`firm1'") ("`firm2'") (year[`k']) (corr_nobs[`k']) (corr_x[`k'])
                }
            }
            drop corr_nobs corr_x
        }
    }
}

postclose handle

use `correlations', clear

Note that both -rangestat- and -confirm- have to be -capture-d here. -rangestat- will break on the no-observations condition if not captured. But, as it does not return a non-zero c(rc), you must also -capture confirm- the existence of corr_x.

The authors of -rangestat- are all active on this Forum, and they might want to consider if this is a bug in -rangestat-, or, if it is not, to consider changing the design of -rangestat- so that a non-zero c(rc) is returned whenever -rangestat- breaks.

Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

#13

21 Oct 2018, 14:47

I haven't reviewed the whole thread but I'm inclined to consider this a quirk but not a bug. Consider the following example:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long date double(ret10001 ret10012)
11353   .01315789483487606   .0416666679084301
11381  .012987012974917889 -.03999999910593033
11409 -.011538460850715637  -.0833333358168602
11442   .03947368264198303  -.1818181872367859
11473                    0   .8333333134651184
end
format %d date

rangestat (mean) ret10001,  interval(date -11 -1)
list, ab(15)

replace date = .
rangestat (mean) ret10012,  interval(date -11 -1)

and the results:

Code:

. rangestat (mean) ret10001,  interval(date -11 -1)

. list, ab(15)

     +-----------------------------------------------------+
     |      date     ret10001     ret10012   ret10001_mean |
     |-----------------------------------------------------|
  1. | 31jan1991    .01315789    .04166667               . |
  2. | 28feb1991    .01298701         -.04               . |
  3. | 28mar1991   -.01153846   -.08333334               . |
  4. | 30apr1991    .03947368   -.18181819               . |
  5. | 31may1991            0    .83333331               . |
     +-----------------------------------------------------+

. 
. replace date = .
(5 real changes made, 5 to missing)

. rangestat (mean) ret10012,  interval(date -11 -1)
no observations
r(2000);

In the first case, the date variable is never missing so all observations are included in the sample and rangestat will calculate the requested statistic(s) using observations that fall within the desired range. As it turns out, for each observation, there are no observations within the requested range so results are missing for all observations.

In the second case, since the keyvar is missing for all observations, you get a no observations error 2000.

A flex_stat like corr, reg or a user-supplied Mata function may generate multiple variables. When faced with situations like this, rather that creating multiple variables with all missing observations, rangestat will simply show a warning message that there are no results for all observations:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long date double(ret10001 ret10012)
11353   .01315789483487606   .0416666679084301
11381  .012987012974917889 -.03999999910593033
11409 -.011538460850715637  -.0833333358168602
11442   .03947368264198303  -.1818181872367859
11473                    0   .8333333134651184
end
format %d date

rangestat (mean) ret10001 (reg) ret10001 ret10012,  interval(date -11 -1)
list, ab(15)

and the results:

Code:

. rangestat (mean) ret10001 (reg) ret10001 ret10012,  interval(date -11 -1)
no result for all obs: reg ret10001 ret10012

. list, ab(15)

     +-----------------------------------------------------+
     |      date     ret10001     ret10012   ret10001_mean |
     |-----------------------------------------------------|
  1. | 31jan1991    .01315789    .04166667               . |
  2. | 28feb1991    .01298701         -.04               . |
  3. | 28mar1991   -.01153846   -.08333334               . |
  4. | 30apr1991    .03947368   -.18181819               . |
  5. | 31may1991            0    .83333331               . |
     +-----------------------------------------------------+

.

Note that if statistics are calculated on variables that contain missing values, you may also get results with missing values. I can understand the desire to identify cases where results for all observations are missing. That's easy to check, and Clyde's workaround for flex_stat seems easy to implement. Perhaps the help file could be tweaked to make clear that no variables are created if all results are missing when using a flex_stat.

Announcement