Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Rolling Correlation for Stock return and Clustering

    Dear Statalisters,

    I have a problem and desperately need your help. My data is organised as below: a variable Date, other variables (x1, x2, x3) that contain returns of stocks 1, 2 and 3 for the specific month.
    I need to compute the 24-months correlation between each stock (x1 & x2 as well as x1 & x3 as well as x2 & x3) every quarter IF the last 24 months contains at least 12 returns (as you can see below, x3 doesn't have any data after June 1995). Precisely I need: Correlation between x1 and x2 from 1/31/94 to 12/31/95 if both variables have at least 12 data points in those months.
    I also need the correlation between those two variables for the period 4/30/94 to 3/31/96 etc....

    I need to do that between all my variables, which means that I will end up with a correlation matrix every 3-month. I need to cluster these correlations in order to see if the "buckets" will change through time...
    I have spent 3 days on this, and I can't seem to find the right solution!
    I need your help!!

    Thanks a lot,
    Q.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input int Date double(x1 x2 x3)
    1/31/94   -.0041000000000000    .00390000000000055    .0140056022408962
    2/28/94    -.0140572703607288    -.0347000000000049    .0359116022099448
    3/31/94  -.013578756789378501    -.0462000000000001                -.024
    4/30/94    -.0127714897522177   -.00229999999999944    -.075591985428051
    5/31/94    .00650708807808508     .0604000000000079    -.084729064039409
    6/30/94    -.0188563072423613     .0362999999999932     .023681377825619
    7/31/94       .01757138374647  -.021399999999997598     .028391167192429
    8/31/94     .0262102991057663    .00819999999999676   -.0214723926380367
    9/30/94     .0204326923076923    .00540000000000324 -.039707419017763895
    10/31/94  -.011925795053003601  -.019900000000000598  .026115342763873797
    11/03/94  -.032856504246759004   .032700000000001304    .0381760339342523
    12/31/94    -.0368230490717202                 .0217    -.069458631256384
    1/31/95 -.0007998080460688981    -.0654000000000033   -.0779363336992315
    2/28/95  -.027455375010005702    -.0036999999999978    .0345238095238094
    3/31/95    .0554732510288067     .0463000000000038    -.054085155350978
    4/30/95     .0168434185901435   -.00200000000000461    .0231143552311435
    5/31/95    .0197852760736195   .015299999999999201     .124851367419738
    6/30/95     .0117310873815612  -.049799999999999095                    .
    7/31/95  .004979931618849531      .060699999999999                    .
    8/31/95  .030914873160269002      .145400000000003                    .
    9/30/95     .0149221608436761 -.0025000000000030696                    .
    10/31/95     .0351311232063335     .0222000000000074                    .
    11/30/95   -.0127014476918874    .11299999999999899                    .
    12/31/95  -.006363259095310661    .07459999999999939                    .
    1/31/96     .0603508283447029     .0552000000000001                    .
    2/28/96   .010175277358366598    -.0904000000000026                    .
    3/31/96     -.008253184299454     .0679999999999978                    .
    4/30/96  -.000720791560186184     .0318000000000051                    .
    end
    format %tdnn/dd/CCYY Date

  • #2
    It would help if you gave us some suggestion of what you'd been able to do. It is also not clear exactly what you want the output to look like. Are you looking for a separate matrix to be created every quarter or are you putting correlations into variables, or what?

    You can generate all permutations of the x's. There is a routine that does this easily or you can simply do it with a couple of loops writing out pairs.

    First you set up a quarter variable, then for all permutations of x's,you can do a count of the observations in the previous 8 quarters using egen. Then you can do the correlations for the previous eight quarters using egenmore. You can replace the corr with missing based on the count. .

    Comment


    • #3
      Something messed up your -dataex- output because the first line identifies Date as an int variable, but the subsequent lines show human-readable dates that contain slashes. So this data cannot be read in.

      Please re-load your data and run -dataex- again so that you get the actual numerical Stata internal format dates.

      Comment


      • #4
        Thank you both for your quick replies.

        @Clyde: please find below my output. Indeed, something went wrong. Hope it's better now.

        @Phil: What I aim to do is cluster the stocks based on their correlation every quarter. This will give me stocks that actually perform similarly. The goal is to see how those stocks behave as I go on (I have data up to 2013), meaning: is a stock likely to "change" cluster? I know it sounds a bit complicated and I really don't know how to do it.
        To answer your question, I don't know the simplest way to do it, i.e. if I need to generate new variables for each correlation or if I just create a correlation matrix each quarter.
        Note that I have a lot of stocks (4100) so the correlation matrix are likely to be rather large (even though I don't have the data for all of them through the whole time series).

        Thank you so much for helping me!

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input int Date double(x1 x2 x3)
        12449   -.00414784090067404    .00390000000000055    .0140056022408962
        12477    -.0140572703607288    -.0347000000000049    .0359116022099448
        12508  -.013578756789378501    -.0462000000000001                -.024
        12538    -.0127714897522177   -.00229999999999944    -.075591985428051
        12569    .00650708807808508     .0604000000000079    -.084729064039409
        12599    -.0188563072423613     .0362999999999932     .023681377825619
        12630       .01757138374647  -.021399999999997598     .028391167192429
        12661     .0262102991057663    .00819999999999676   -.0214723926380367
        12691     .0204326923076923    .00540000000000324 -.039707419017763895
        12722  -.011925795053003601  -.019900000000000598  .026115342763873797
        12752  -.032856504246759004   .032700000000001304    .0381760339342523
        12783    -.0368230490717202                 .0217    -.069458631256384
        12814 -.0007998080460688981    -.0654000000000033   -.0779363336992315
        12842  -.027455375010005702    -.0036999999999978    .0345238095238094
        12873     .0554732510288067     .0463000000000038    -.054085155350978
        12903     .0168434185901435   -.00200000000000461    .0231143552311435
        12934     .0197852760736195   .015299999999999201     .124851367419738
        12964     .0117310873815612  -.049799999999999095                    .
        12995   .004979931618849531      .060699999999999                    .
        13026   .030914873160269002      .145400000000003                    .
        13056     .0149221608436761 -.0025000000000030696                    .
        13087     .0351311232063335     .0222000000000074                    .
        13117    -.0127014476918874    .11299999999999899                    .
        13148  -.006363259095310661    .07459999999999939                    .
        13179     .0603508283447029     .0552000000000001                    .
        13208   .010175277358366598    -.0904000000000026                    .
        13239     -.008253184299454     .0679999999999978                    .
        13269  -.000720791560186184     .0318000000000051                    .
        end
        format %tdnn/dd/CCYY Date

        Comment


        • #5
          Thanks for clearing up the -dataex- issues.

          I could show you some code that would, in theory, give you a correlation matrix for each two year period. But hearing that you want to correlate 4100 variables gives me great pause. You are proposing to calculate 8,402,950 correlations in each such period. I think Stata would leave you no choice but to store these in matrices, because making them variables in a data set would exceed the maximum number of allowed variables in a data set by more than an order of magnitude. You are also proposing to generate a large number of these matrices (one for each quarter from 1994 through 2013, so about 80). I think trying to create and maintain that many matrices will challenge the memory limits of your system.

          Even assuming you get past those obstacles, you next want to feed these into a cluster analysis. Now "cluster analysis" is a fairly broad term that encompasses several different procedures. One of the things those procedures have in common is that I know rather little about them--it's just not something that is much used in my line of work. But a quick look at the ones available in official Stata suggests that they are not written to take correlation matrices as inputs. So you would actually be better off skipping the creation of correlation matrices and just set up some code that executes your cluster analysis commands rolling through the data. But there is another problem there: the grouping variable that -cluster- programs create (at least those I have a passing familiarity with) will not necessarily be consistent from one rolling-window to another. That is, there could be an exact correspondence between cluster 1 from one window and cluster 7 from another, but they will be labeled 1 and 7 respectively. The task of somehow matching up which cluster corresponds to which across analyses strikes me, at least at first glance, as intractable (especially since there will very likely be some cluster patterns that don't even show extensive overlap with each other.)

          In short, my intuition is that you have bitten off a great deal more than can be chewed here. Perhaps others who understand cluster analysis better than I do will feel more optimistic about this project, and if so, I hope they will speak up here. But from where I sit, it looks like this idea needs some re-thinking and simplification.

          Comment


          • #6
            g quarter=quarter(Date)
            g year=year(Date)
            g quarternum=year*4+(quarter-1)

            forvalues q=7984/7985 {
            foreach var1 in x1 x2 {
            foreach var2 in x2 x3 {
            if `var1'==`var2' {
            continue
            }
            egen corr`var1'`var2'`q'=corr(`var1' `var2') if ///
            (quarternum<=`q' & quarternum>=`q'-7 )
            egen num1=count(`var1') if (quarternum<=`q' & quarternum>=`q'-7 )
            egen num2=count(`var2') if (quarternum<=`q' & quarternum>=`q'-7 )
            replace corr`var1'`var2'`q'=. if num1<12 | num2<12
            drop num1 num2
            }
            }
            }



            This gives you correlations, but note that I have not got rid of the extra lines.
            I think cluster generally clusters variables so I gave you variables.
            Phil

            Comment

            Working...
            X