
  • Cross-Sectional Correlations by date

    Dear Stata users,

    I have a daily panel dataset with 5,000 firms from 4 January 2006 to December 2020. I need to estimate cross-sectional correlations among 40 variables each day (by date) and then take the time-series means of those estimates. Looping over the dates and correlating 40 variables at each step is very inefficient and takes a long time. Do you have any advice on a faster way to get the daily cross-sectional correlations and their time-series means?

    Thanks.

    Regards,
    Olena

  • #2
    If I'm not mistaken, you should have about 27,225,000 observations (5,000 firms × 5,445 days), correct? I've generated some data that should look roughly like what you have:

    Code:
    clear
    local num_units = 5000
    local num_times = 5445
    local total_obs = `num_times' * `num_units'
    set obs `total_obs'
    egen unit = seq(), from(1) to(`num_units') block(`num_times')
    egen time = seq(), from(1) to(`num_times')

    * variable1 is pure noise; each subsequent variable is correlated with its
    * predecessor. Looping over integers (rather than forvalues 0.01(0.01)0.40)
    * avoids floating-point surprises and guarantees exactly 40 variables,
    * matching the 40 columns the Mata code below expects.
    gen variable1 = rnormal()
    forvalues current = 2/40 {
        local previous = `current' - 1
        local rho = (`current' - 1) / 100
        * sqrt(1 - rho^2) * rnormal() makes corr(current, previous) approximately rho
        gen variable`current' = `rho' * variable`previous' + sqrt(1 - `rho'^2) * rnormal()
    }
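
    To sanity-check the simulated data, a quick spot check like the following should roughly recover one of the built-in correlations (variable40 was generated with rho = 0.39 against variable39):

    Code:
    quietly correlate variable39 variable40 if time == 1
    display "corr(variable39, variable40) on day 1 = " r(rho)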
    Here is my Stata solution:

    Code:
    quietly correlate variable* if time == 1
    matrix sum = r(C)
    * note: this reuses `num_times' from the data-generation block above
    forvalues t = 2/`num_times' {
        quietly correlate variable* if time == `t'
        matrix sum = sum + r(C)
    }
    matrix av = sum / `num_times'
    matrix list av
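
    If you want to put a number on the runtime, Stata's timer commands can wrap the loop; a minimal sketch:

    Code:
    timer clear 1
    timer on 1
    * ... run the correlation loop above ...
    timer off 1
    timer list 1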
    But it is indeed very slow; everything I try with data this large is. I don't see an obvious way to speed this up within Stata itself, so I think Mata is your best bet. Here is some example code to get you started.

    Code:
    sort time unit // essential: the indexing below relies on this ordering
    mata
        num_units = 5000
        num_times = 5445
        // running sum of the daily 40 x 40 correlation matrices;
        // named total to avoid shadowing Mata's built-in sum() function
        total = J(40, 40, 0)
        for (i = 0; i < num_times; i++) {
            // index arithmetic rather than searching for each day's rows in O(N) time
            start_row_index = (i * num_units) + 1
            end_row_index = (i + 1) * num_units
            // columns (3..42) are just my 40 test variables; you may need to index your own way
            data = st_data((start_row_index, end_row_index), (3..42))
            total = total + corr(variance(data)) // corr() turns the covariance matrix into a correlation matrix
        }
        av = total / num_times
        av
    end
    It's essentially the same algorithm, but better optimized: it runs in about 36 seconds on my machine with all 40 variables and 27,225,000 observations. Not great, but much better than the Stata solution. Maybe someone who knows Mata better will step in with a tighter implementation. Theoretically speaking, I don't think you can do better than linear time here, since every observation has to be read at least once. I bet you could gain another two or three orders of magnitude in SQL or with a C or C++ plugin, but I wouldn't open that can of worms unless you plan on running this code every day for the next decade and expect your data to keep growing over that time.
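
    One last practical note: when the Mata block finishes, the averaged matrix av still lives only in Mata. If you want it back in Stata, e.g. to save or export it, st_matrix() copies it over; a minimal sketch, assuming the block above has already run:

    Code:
    mata: st_matrix("av", av)
    matrix list av, format(%6.3f)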
