Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Rolling window time-series correlation in panel where the correlation is to be between observations with different cross-sectional id

    The structure of my data is as follows (simpliefied example)

    id
    date
    x


    id date x
    1 1 3214324
    1 2 3456435
    1 3 3253455
    1 4 345435
    ....
    1 T 345435
    .
    .
    .
    2 1 323424
    2 2 346435
    2 3 6563455
    2 4 564546
    ....
    2 T 3532
    .
    .
    .
    3 1 324524
    3 2 126435
    3 3 363455
    3 4 54546
    ....
    3 T 32245
    .
    .
    .
    N 1 324524
    N 2 126435
    N 3 363455
    N 4 54546
    ....
    N T 34345

    What I want is a rolling window correlation between the realizations of x across ids at each date so, for example between the first n observations of x where id==1 and the first n observations where id==2, and the the first n observations of x where id==1 and the first n observations where id==3, and so on. at each date.

    Do I need to reorganize the data to have variables such as id1x and id2x with unique identifier date for one observation? if so, how do I do it ideally?

    Thanks for your help!

    Best,

    Guenther

  • #2
    Sorry, it shouldn't say time-series in the header.

    Comment


    • #3
      If I understand correctly what you want to do here, you want to calculate N*(N-1)/2 rolling window correlations, of which there are order of magnitude T. So your results will be kT*N*(N-1)/2 correlation coefficients, k < 1, calculated from a corpus of N*T observations. Unless N and T are small, you are going to have a very unwieldy set of results, and I have difficulty envisioning what you could usefully do with it once you got it. And if, in particular, N is large, you are going to get an explosively large set of results from doing this.

      So perhaps I am misunderstanding what you want to do. If I'm understanding it properly, my first thoughts would not be about how to do this but about finding a better plan. Maybe there really is a good rationale for doing this, but I can't figure it out. What am I missing?

      Comment


      • #4
        Thanks for your reply. You are absolutely right. From a matrix point of view, I would want to compute kT correlation matrices, where k<1 comes from "losing observations" due to the minimum length of the rolling window.

        Believe it or not, I am replicating the approach from a paper where the exact same thing is done. For them, N=700 or so. I'd like to to do it with about 8000. I understand that I will get a very large set of results but on the other hand, it's not so odd to compute rolling correlations from a set of variables (not necessarily variables in the stata sense), is it?
        I need the correlation between a large number of time series computed in a rolling window fashion. that's all.

        So now that we have established that this is really what I want. How would I want to do it?

        I already used reshape to get N new variables computed from the x and the id.

        Now I think I should compute the correlations on a rolling window maybe with mvcorr? I think mvcorr can only handle 2 time series. So what would be an the alternative?

        again thanks for your help!

        Comment


        • #5
          Sorry, you haven't convinced me. If anything, this sounds even less sensible than before. With N = 8,000 you are talking about 32 million cross-correlations, each done on a rolling window. Even if there were no rolling window to further multiply this, what on earth can you sensibly do with 32 million correlation coefficients?

          Comment


          • #6
            In the end, it won't be as many because not every time series will have observations for all dates. some only come up later in the sample, some that were there in the beginning will vanish. At each point in time it will be something like 250,000 correlation coefficients.

            I can explain the paper to you, or you can read it for yourself, if you really want. I don't know how familiar you are with Asset Pricing but in the end, it will be used to compute a beta with respect to the market portfolio for each of the ~8000 stocks at each date using option-implied volatilities (i.e. rolling window regression on just one variable is not what I need!). You need the correlations between all constituents of the market to back out the correlation of each with the market (under some assumptions). I'm not sure rolling window correlations are necessarily needed but the original paper does it so I want to keep close to that.

            thanks.

            Comment


            • #7
              So here's an illustration of how you might approach this. I've used the grunfeld data set, reshaped wide, to illustrate. I also picked the width of the rolling window to be 10. Adapt the code using your own variable names, etc.

              Code:
              //    SET UP A DEMONSTRATION DATASET
              webuse grunfeld, clear
              
              keep company year mvalue
              
              reshape wide mvalue, i(year) j(company)
              
              des
              summ year
              
              //    DEFINE WINDOW WIDTH
              local window 10
              
              //    SETUP A POSTFILE TO RECEIVE THE RESULTS
              tempfile correlations
              capture postutil clear
              postfile handle str32 firm1 str32 firm2 int start_year int n_obs float correlation ///
                  using `correlations'
                  
              //    CREATE A LOCAL MACRO LISTING THE ASSETS TO BE CORRELATED
              ds mvalue*
              local assets `r(varlist)'
              local n_assets: word count `assets'
              
              //    LOOP OVER PAIRS OF ASSETS
              //    CALCULATING THE CORRELATIONS AND STORING THEM
              //    IN THE POSTFILE
              forvalues i = 1/`n_assets' {
                  local firm1: word `i' of `assets'
                  forvalues j = `=`i'+1'/`n_assets' {
                      local firm2: word `j' of `assets'
                      rangestat (corr) `firm1' `firm2', interval(year 0 `=`window'-1')
                      forvalues k = 1/`=_N' {
                          if !missing(corr_x[`k']) {
                              post handle ("`firm1'") ("`firm2'") (year[`k']) (corr_nobs[`k']) (corr_x[`k'])
                          }
                      }
                      drop corr_nobs corr_x
                  }
              }
              
              postclose handle
              
              use `correlations', clear
              Notes:

              1. You will need the -rangestat- program to run this. It was written by Robert Picard, Nick Cox, and Roberto Ferrer, and it is available from SSC.

              2. The results file indicates the year in which the window begins, and the window covers that year through that year + 9 years (or as many as are available). I know that sometimes other conventions are used, and you can adapt the code by modifying the numbers in the -interval()- option of the -rangestat- command.

              3. The correlations are included in the results file whenever a correlation could be computed. If you only want correlations for which the full window is available, modify the -if- condition to add a restriction based on the value of corr_nobs[`k'].

              4. Although I have tried to do everything in the most efficient way that came to my mind, with the size data set you are talking about, this may take a very long time to run. You might want to leave instructions with your grandchildren what to do with the results :-)

              Good luck.

              Comment


              • #8
                Thank you very much!

                Comment


                • #9
                  Your code was really helpful, I only had to change a few bits and pieces for my purposes. It works well if there are enough overlapping observations but I am still running into the problem that if there are pairs of variables that have no overlapping observations I get the error

                  Code:
                  no result for all obs: corr ret10012 ret10158
                  corr_x not found
                  that will result in the code not going through (even with a -capture-) because eventually, at

                  Code:
                  if !missing(corr_x[`k'])
                  ,

                  Stata will want to look for corr_x if I understand correctly.


                  Is there away to have an if-clause for the non-existence of a variable such as corr_x in this case?

                  I tried putting
                  Code:
                  if (`stock1'_count>2 & `stock2'_count>2)
                  into the `k' loop but that gave me
                  Code:
                  ret10001_count not found
                  . I assume because it is not stored inside another loop?

                  This is thecode:

                  Code:
                  use crspwide, clear
                  
                  rename date actdate
                  gen date = _n
                  
                  *des
                  *summ date
                  
                  // shorten sample to test code
                  keep  date ret10001-ret10158
                  
                  //    DEFINE WINDOW WIDTH
                  local window 60
                  
                  //    SETUP A POSTFILE TO RECEIVE THE RESULTS
                  tempfile correlations
                  capture postutil clear
                  postfile handle str32 stock1 str32 stock2 int start_year int n_obs float correlation ///
                      using `correlations'
                      
                  //    CREATE A LOCAL MACRO LISTING THE ASSETS TO BE CORRELATED
                  ds ret*
                  local assets `r(varlist)'
                  local n_assets: word count `assets'
                  
                  //    LOOP OVER PAIRS OF ASSETS
                  //    CALCULATING THE CORRELATIONS AND STORING THEM
                  //    IN THE POSTFILE
                  forvalues i = 1/`n_assets' {
                      local stock1: word `i' of `assets'
                      forvalues j = `=`i'+1'/`n_assets' {
                          local stock2: word `j' of `assets'
                  
                          rangestat (corr) `stock1' `stock2' (count) `stock1' `stock2',  interval(date `=-`window'' -1)
                          forvalues k = 1/`=_N' {
                          if !missing(corr_x[`k']) {
                                  post handle ("`stock1'") ("`stock2'") (date[`k']) (corr_nobs[`k']) (corr_x[`k'])
                                                   }
                                               }
                          capture drop `stock1'_count `stock2'_count
                          capture drop corr_nobs corr_x
                                                      }
                                              }
                          
                  postclose handle
                  
                  use `correlations', clear

                  Comment


                  • #10
                    I believe this does it:

                    Code:
                    use crspwide, clear
                    
                    rename date actdate
                    gen date = _n
                    
                    *des
                    *summ date
                    
                    // shorten sample to test code
                    keep  date ret10001-ret10158
                    
                    //    DEFINE WINDOW WIDTH
                    local window 60
                    
                    //    SETUP A POSTFILE TO RECEIVE THE RESULTS
                    tempfile correlations
                    capture postutil clear
                    postfile handle str32 stock1 str32 stock2 int start_year int n_obs float correlation ///
                        using `correlations'
                        
                    //    CREATE A LOCAL MACRO LISTING THE ASSETS TO BE CORRELATED
                    ds ret*
                    local assets `r(varlist)'
                    local n_assets: word count `assets'
                    
                    //    LOOP OVER PAIRS OF ASSETS
                    //    CALCULATING THE CORRELATIONS AND STORING THEM
                    //    IN THE POSTFILE
                    forvalues i = 1/`n_assets' {
                        local stock1: word `i' of `assets'
                        forvalues j = `=`i'+1'/`n_assets' {
                            local stock2: word `j' of `assets'
                    
                            capture rangestat (corr) `stock1' `stock2',  interval(date `=-`window'' -1)
                            if c(rc) == 0 {
                                forvalues k = 1/`=_N' {
                                    if !missing(corr_x[`k']) {
                                            post handle ("`stock1'") ("`stock2'") (date[`k']) (corr_nobs[`k']) (corr_x[`k'])
                                    }
                                }
                                drop corr_nobs corr_x
                            }
                        }
                    }
                            
                    postclose handle
                    
                    use `correlations', clear
                    As you did not provide any example data, this code is not tested. It may contain errors. Beware, in particular, the possibility that some of the { and } do not pair up correctly. I made several copy/paste's and moves while editing this, and things may have gotten disrupted.

                    Comment


                    • #11
                      Thank you. I attached the sample data. Unfortunately, I still run into the same problem at

                      Code:
                       if !missing(corr_x[`k'])
                      The reason seems to be that c(rc) for the error
                      no result for all obs: corr ret10012 ret10158
                      is zero:

                      Code:
                      . capture rangestat (corr) ret10012 ret10158,  interval(date -11 -1)
                      
                      . display _rc
                      0
                      hence the
                      if c(rc) == 0 does not do what it was intended to do.

                      What could be an alternative? I
                      Attached Files

                      Comment


                      • #12
                        Hmm! It is odd that -rangestat- returns a 0 value for c(rc) in that circumstance, but nevertheless causes a break. But we can work around it in a different way. Before going ahead with the loop doing the posting, we can check to see if the variable corr_x has actually been created. Here's the code, modified, and demonstrated with the grunfeld data set, modified to create a no-observations condition.

                        Code:
                        //    SET UP A DEMONSTRATION DATASET
                        webuse grunfeld, clear
                        
                        keep company year mvalue
                        
                        reshape wide mvalue, i(year) j(company)
                        
                        
                        // CREATE A NO OBSERVATIONS CONDITION
                        // BETWEEN mvalue4 AND mvalue5
                        replace mvalue4 = . in 1/10
                        replace mvalue5 = . in 11/20
                        
                        //    DEFINE WINDOW WIDTH
                        local window 10
                        
                        //    SETUP A POSTFILE TO RECEIVE THE RESULTS
                        tempfile correlations
                        capture postutil clear
                        postfile handle str32 firm1 str32 firm2 int start_year int n_obs float correlation ///
                            using `correlations'
                            
                        //    CREATE A LOCAL MACRO LISTING THE ASSETS TO BE CORRELATED
                        ds mvalue*
                        local assets `r(varlist)'
                        local n_assets: word count `assets'
                        
                        //    LOOP OVER PAIRS OF ASSETS
                        //    CALCULATING THE CORRELATIONS AND STORING THEM
                        //    IN THE POSTFILE
                        forvalues i = 1/`n_assets' {
                            local firm1: word `i' of `assets'
                            forvalues j = `=`i'+1'/`n_assets' {
                                local firm2: word `j' of `assets'
                                capture rangestat (corr) `firm1' `firm2', interval(year 0 `=`window'-1')
                                capture confirm var corr_x, exact
                                if c(rc) == 0 {
                                    forvalues k = 1/`=_N' {
                                        if !missing(corr_x[`k']) {
                                            post handle ("`firm1'") ("`firm2'") (year[`k']) (corr_nobs[`k']) (corr_x[`k'])
                                        }
                                    }
                                    drop corr_nobs corr_x
                                }
                            }
                        }
                        
                        postclose handle
                        
                        use `correlations', clear
                        Note that both -rangestat- and -confirm- have to be -capture-d here. -rangestat- will break on the no-observations condition if not captured. But, as it does not return a non-zero c(rc), you must also -capture confirm- the existence of corr_x.

                        The authors of -rangestat- are all active on this Forum, and they might want to consider if this is a bug in -rangestat-, or, if it is not, to consider changing the design of -rangestat- so that a non-zero c(rc) is returned whenever -rangestat- breaks.

                        Comment


                        • #13
                          I haven't reviewed the whole thread but I'm inclined to consider this a quirk but not a bug. Consider the following example:
                          Code:
                          * Example generated by -dataex-. To install: ssc install dataex
                          clear
                          input long date double(ret10001 ret10012)
                          11353   .01315789483487606   .0416666679084301
                          11381  .012987012974917889 -.03999999910593033
                          11409 -.011538460850715637  -.0833333358168602
                          11442   .03947368264198303  -.1818181872367859
                          11473                    0   .8333333134651184
                          end
                          format %d date
                          
                          rangestat (mean) ret10001,  interval(date -11 -1)
                          list, ab(15)
                          
                          replace date = .
                          rangestat (mean) ret10012,  interval(date -11 -1)
                          and the results:
                          Code:
                          . rangestat (mean) ret10001,  interval(date -11 -1)
                          
                          . list, ab(15)
                          
                               +-----------------------------------------------------+
                               |      date     ret10001     ret10012   ret10001_mean |
                               |-----------------------------------------------------|
                            1. | 31jan1991    .01315789    .04166667               . |
                            2. | 28feb1991    .01298701         -.04               . |
                            3. | 28mar1991   -.01153846   -.08333334               . |
                            4. | 30apr1991    .03947368   -.18181819               . |
                            5. | 31may1991            0    .83333331               . |
                               +-----------------------------------------------------+
                          
                          . 
                          . replace date = .
                          (5 real changes made, 5 to missing)
                          
                          . rangestat (mean) ret10012,  interval(date -11 -1)
                          no observations
                          r(2000);
                          In the first case, the date variable is never missing so all observations are included in the sample and rangestat will calculate the requested statistic(s) using observations that fall within the desired range. As it turns out, for each observation, there are no observations within the requested range so results are missing for all observations.

                          In the second case, since the keyvar is missing for all observations, you get a no observations error 2000.

                          A flex_stat like corr, reg or a user-supplied Mata function may generate multiple variables. When faced with situations like this, rather that creating multiple variables with all missing observations, rangestat will simply show a warning message that there are no results for all observations:

                          Code:
                          * Example generated by -dataex-. To install: ssc install dataex
                          clear
                          input long date double(ret10001 ret10012)
                          11353   .01315789483487606   .0416666679084301
                          11381  .012987012974917889 -.03999999910593033
                          11409 -.011538460850715637  -.0833333358168602
                          11442   .03947368264198303  -.1818181872367859
                          11473                    0   .8333333134651184
                          end
                          format %d date
                          
                          rangestat (mean) ret10001 (reg) ret10001 ret10012,  interval(date -11 -1)
                          list, ab(15)
                          and the results:
                          Code:
                          . rangestat (mean) ret10001 (reg) ret10001 ret10012,  interval(date -11 -1)
                          no result for all obs: reg ret10001 ret10012
                          
                          . list, ab(15)
                          
                               +-----------------------------------------------------+
                               |      date     ret10001     ret10012   ret10001_mean |
                               |-----------------------------------------------------|
                            1. | 31jan1991    .01315789    .04166667               . |
                            2. | 28feb1991    .01298701         -.04               . |
                            3. | 28mar1991   -.01153846   -.08333334               . |
                            4. | 30apr1991    .03947368   -.18181819               . |
                            5. | 31may1991            0    .83333331               . |
                               +-----------------------------------------------------+
                          
                          .
                          Note that if statistics are calculated on variables that contain missing values, you may also get results with missing values. I can understand the desire to identify cases where results for all observations are missing. That's easy to check, and Clyde's workaround for flex_stat seems easy to implement. Perhaps the help file could be tweaked to make clear that no variables are created if all results are missing when using a flex_stat.

                          Comment

                          Working...
                          X