
  • Cross-Sectional Correlations by date

    Dear Stata users,

    I have a daily panel dataset with 5,000 firms from 4 January 2006 to December 2020. I need to estimate cross-sectional correlations among 40 variables each day (by date) and then take the time-series means of those estimates. Looping over the dates and correlating 40 variables at each step is very inefficient and takes a long time. Do you have any advice on a faster way to get the daily cross-sectional correlations and their time-series means?

    Thanks.

    Regards,
    Olena

  • #2
    If I'm not mistaken, you should have about 27,225,000 observations (5,000 firms × 5,445 days), correct? I've generated some data that should look roughly like what you have:

    Code:
    clear
    local num_units = 5000
    local num_times = 5445
    local total_obs = `num_times' * `num_units'
    set obs `total_obs'
    egen unit = seq(), from(1) to(`num_units') block(`num_times')
    egen time = seq(), from(1) to(`num_times')

    * variable1 is pure noise; each subsequent variable is correlated with its
    * predecessor. Looping over integers (rather than forvalues 0.01(0.01)0.40)
    * avoids floating-point surprises and guarantees exactly 40 variables,
    * matching the 40 columns the Mata code below expects.
    gen variable1 = rnormal()
    forvalues current = 2/40 {
        local previous = `current' - 1
        local rho = (`current' - 1) / 100
        * sqrt(1 - rho^2) * rnormal() makes corr(current, previous) approximately rho
        gen variable`current' = `rho' * variable`previous' + sqrt(1 - `rho'^2) * rnormal()
    }
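
    To sanity-check the simulated data, a quick spot check like the following should roughly recover one of the built-in correlations (variable40 was generated with rho = 0.39 against variable39):

    Code:
    quietly correlate variable39 variable40 if time == 1
    display "corr(variable39, variable40) on day 1 = " r(rho)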
    Here is my Stata solution:

    Code:
    quietly correlate variable* if time == 1
    matrix sum = r(C)
    * note: this reuses `num_times' from the data-generation block above
    forvalues t = 2/`num_times' {
        quietly correlate variable* if time == `t'
        matrix sum = sum + r(C)
    }
    matrix av = sum / `num_times'
    matrix list av
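
    If you want to put a number on the runtime, Stata's timer commands can wrap the loop; a minimal sketch:

    Code:
    timer clear 1
    timer on 1
    * ... run the correlation loop above ...
    timer off 1
    timer list 1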
    But it is indeed very slow; everything I try with data this large is. I don't see an obvious way to speed this up within Stata itself, so I think Mata is your best bet. Here is some example code to get you started.

    Code:
    sort time unit // essential: the indexing below relies on this ordering
    mata
        num_units = 5000
        num_times = 5445
        // running sum of the daily 40 x 40 correlation matrices;
        // named total to avoid shadowing Mata's built-in sum() function
        total = J(40, 40, 0)
        for (i = 0; i < num_times; i++) {
            // index arithmetic rather than searching for each day's rows in O(N) time
            start_row_index = (i * num_units) + 1
            end_row_index = (i + 1) * num_units
            // columns (3..42) are just my 40 test variables; you may need to index your own way
            data = st_data((start_row_index, end_row_index), (3..42))
            total = total + corr(variance(data)) // corr() turns the covariance matrix into a correlation matrix
        }
        av = total / num_times
        av
    end
    It's essentially the same algorithm, but better optimized: it runs in about 36 seconds on my machine with all 40 variables and 27,225,000 observations. Not great, but much better than the Stata solution. Maybe someone who knows Mata better will step in with a tighter implementation. Theoretically speaking, I don't think you can do better than linear time here, since every observation has to be read at least once. I bet you could gain another two or three orders of magnitude in SQL or with a C or C++ plugin, but I wouldn't open that can of worms unless you plan on running this code every day for the next decade and expect your data to keep growing over that time.
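
    One last practical note: when the Mata block finishes, the averaged matrix av still lives only in Mata. If you want it back in Stata, e.g. to save or export it, st_matrix() copies it over; a minimal sketch, assuming the block above has already run:

    Code:
    mata: st_matrix("av", av)
    matrix list av, format(%6.3f)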
