Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Pairwise correlation over several variables

    Hi,

    I need to correlate returns of funds within and between fund families. The data set is wide structured (return_family1_fund1, return_family1_fund2,...).
    How can I compute the pairwise correlation of every fund within a fund family and then between fund families?

    Florian

  • #2
    So, first you need to reshape your data. It's a bit complicated because of the way your variables are named. What you need to do is go half-long and half-wide (wide on fund, long on family). Given the way your variables are named,it's actually easier to go double-long in one step and then go back to half-long/half-wide

    Code:
    // GO DOUBLE LONG WITH ONE RESHAPE
    gen long obs_no = _n
    reshape long return_, i(obs_no) j(ff) string
    split ff, gen(fff) parse("_")
    rename fff1 family
    rename fff2 fund
    destring family, ignore("family") replace
    destring fund, ignore("fund") replace
    drop ff
    
    // NOW GO HALF LONG/HALF WIDE ON FUND
    reshape wide return_, i(obs_no family) j(fund)
    
    // NOW DO PAIRWISE CORRELATIONS WITHIN FAMILIES
    by family, sort: pwcorr return_*
    This is somewhat roundabout, but it is actually simpler than juggling the variable names so as to do a single reshape to half long half wide.

    At this point I can't go any farther. Apparently you have in mind some sort of aggregate return for a family that is calculated in some way from its constituent fund returns. But you don't say what that calculation is. So, not knowing how to calculate a family return, I can't calculate correlations among them. I will venture the guess that whatever that calculation is, it will best be done by returning to double-long layout first. You can do that by simply running -reshape long- after the code shown above.

    Comment


    • #3
      Thank you very much for the quick answer Clyde, the calculation works.

      The thing is, that we have a very large dataset, so it computes all the data in the console for about half an hour and I have no chance of reading any information out of it or building the average for the correlations of the funds with the same objective, what was the actual goal of this calculation.

      We actually converted our inital data to a wide structure, because we didn't see any chance to compute a pairwise correlation otherwise.

      This is how the inital data looks like:
      investor_name fund_id date fund_classification return
      A Investments 1 2001 A 0,005
      A Investments 1 2002 A 0,003
      A Investments 2 2001 A -0,0025
      B Investments 3 2001 A 0,004
      B Investments 3 2002 A 0,002
      B Investments 3 2003 A -0,005
      C Investments 4 2001 B 0,002
      C Investments 4 2002 B 0,003
      C Investments 5 2001 A 0,009
      C Investments 5 2002 A -0,03
      So, maybe it is clearer this way. Firstly, we want to compute average correlation over all families for the funds within a fund family, and then the average correlation over all families for the funds between fund families. The fund objective is actually negligible, because we narrowed those down and can calculate for every objective manually.

      I hope I could clarify the problem a little bit.

      Thanks a lot in advance for any help.

      Regards,
      Florian

      Comment


      • #4
        I'm afraid this version is even less clear to me than the original. I don't work in finance and I don't know the jargon. The example data you show in #3 doesn't indicate which funds are in which family, unless one of those other variable names is a synonym for family. And I don't even get the concept of "the average correlation over all families for the funds between fund families." I don't know what that means.

        That said, returning to #1 and #2, it is not that hard to convert the code in #2 to calculate the average correlation among funds in each family:

        Code:
        capture program drop avg_corr
        program define avg_corr, rclass
            syntax varlist [if] [in]
            marksample touse
            pwcorr `varlist' if `touse'
            tempname M  J X
            matrix `M' = r(C)
            local r = rowsof(`M')
            matrix `J' = J(`r', `r', 1)
            forvalues i = 1/`r' {
                forvalues j = `i'/`r' {
                    matrix `J'[`i', `j'] = 0
                }
            }
            matrix `X' = (`M'*`J')
            return scalar average_r = trace(`X')/(`r'*(`r'-1)/2)
            exit
        end
        
        
        // GO DOUBLE LONG WITH ONE RESHAPE
        gen long obs_no = _n
        reshape long return_, i(obs_no) j(ff) string
        split ff, gen(fff) parse("_")
        rename fff1 family
        rename fff2 fund
        destring family, ignore("family") replace
        destring fund, ignore("fund") replace
        drop ff
        
        // NOW GO HALF LONG/HALF WIDE ON FUND
        reshape wide return_, i(obs_no family) j(fund)
        
        gen family_avg_corr = .
        levelsof family, local(families)
        foreach fam of local families {
            avg_corr return_* if family == `f'
            replace family_avg_corr = r(average_r) if family == `f'
        }
        This code assumes that you do not want to include the correlation of each fund with itself (which is always 1) in the average. The average correlation among the funds in a family will appear as the value of the new variable family_avg_corr in each observation for that family.

        Comment


        • #5
          Sorry for the unclear descriptions and thank you again for the answer.

          Unfortunately the code didn't work the way we intendet it to.

          But with a little bit of work we figured out a way that gave us our wanted results. The code is the following:

          The data is wide shaped as described in #1

          Code:
          gen id = _n
          
          gen dummy = .
          
          
          summarize id
          scalar idmin = r(min)
          scalar idmax = r(max)
          
          forvalues i= `=idmin'/`=num_id' {
              
              forvalues n = 1/1 {
                  
                  pwcorr *vw`i'_*
                  
                  mata: C = st_matrix("r(C)")
                  mata: st_numscalar("avg_cor", mean(abs(select(vech(C), vech(C) :< 1))))
                  
                  display avg_cor
                  replace dummy = avg_cor if id == `i'
                  }
                  }
              
          mean(dummy)
          
          qui summarize dummy
          scalar avg_cor_M = r(mean)

          Thank you very much for your support Clyde.

          Florian
          ​​​​​​​

          Comment

          Working...
          X