
  • Calculating monthly covariances based on daily data

    Dear all,

    I have panel data comprising firms and daily stock returns. The firms' returns are in a variable called "pch", and I assume some arbitrary weights in a variable that I call "weight".

    For every 21 days of my daily data, I want to compute the monthly covariance between these two variables. The variable "id_firm_month" is just a group(firm month) variable. But first I want to exclude firm-month groups that have fewer than 10 non-zero, non-missing returns. I type the following commands:

    Code:
    bys id_firm_month: egen int counta = count(id_firm_month) if !missing(pch) & pch!=0
    bys id_firm_month: egen cova = corr(pch weight) if counta>9, covariance
    The first command runs fine, but I let the second one run for 3 hours and got no result. The variable id_firm_month takes 50,000 distinct values in total, each containing 21 days.
    I have Stata MP 14.1 and a desktop with 16GB of RAM and an AMD A4 PRO-7300B APU, so my machine is quite powerful. Even when I run the same command on just 20 id_firm_month groups, I still have to wait 7-8 seconds.

    I am following exactly the methodology that other researchers have used, so in theory it should work (and it needs to scale, because later I have to do the same thing on a dataset that is 10 times larger).

    If you think I am doing something inefficiently, or that monthly covariances can be calculated more easily in some other way, please let me know.

    An example of my dataset is below. For the two id_firm_month groups shown, I should get two covariances, one for each month.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str16 firm float(id_firm_month pch weight) int counta
    "CY00052802151D_w" 17009   .02 .0014250644 19
    "CY00052802151D_w" 17009  5.28  .005935295 19
    "CY00052802151D_w" 17009 -5.02    .1029575 19
    "CY00052802151D_w" 17009     0  .002908295  .
    "CY00052802151D_w" 17009   3.3  .012112848 19
    "CY00052802151D_w" 17009  9.24  .004154707 19
    "CY00052802151D_w" 17009  9.24  .072070256 19
    "CY00052802151D_w" 17009 -2.79  .017304068 19
    "CY00052802151D_w" 17009   1.3  .000342158 19
    "CY00052802151D_w" 17009   1.6 .0009975451 19
    "CY00052802151D_w" 17009 -7.82   .30016765 19
    "CY00052802151D_w" 17009  2.67   .21011737 19
    "CY00052802151D_w" 17009  -.98  .024720097 19
    "CY00052802151D_w" 17009 -1.29  .008478994 19
    "CY00052802151D_w" 17009 -1.55   .05044918 19
    "CY00052802151D_w" 17009   1.8 .0004887971 19
    "CY00052802151D_w" 17009 -1.87   .14708215 19
    "CY00052802151D_w" 17009  1.45 .0006982816 19
    "CY00052802151D_w" 17009  -.64  .035314426 19
    "CY00052802151D_w" 17009  3.97 .0020358064 19
    "CY00052802151D_w" 17009     0 .0002395106  .
    "CY00052802151D_w" 17010     0 .0002395106  .
    "CY00052802151D_w" 17010  5.17  .012112848 15
    "CY00052802151D_w" 17010 -5.69    .1029575 15
    "CY00052802151D_w" 17010     0  .035314426  .
    "CY00052802151D_w" 17010   2.9  .017304068 15
    "CY00052802151D_w" 17010 -1.46 .0004887971 15
    "CY00052802151D_w" 17010   .76 .0006982816 15
    "CY00052802151D_w" 17010   4.4   .14708215 15
    "CY00052802151D_w" 17010 -3.13   .21011737 15
    "CY00052802151D_w" 17010     0  .072070256  .
    "CY00052802151D_w" 17010  3.23 .0014250644 15
    "CY00052802151D_w" 17010     0  .005935295  .
    "CY00052802151D_w" 17010 -3.13  .004154707 15
    "CY00052802151D_w" 17010 -3.23   .30016765 15
    "CY00052802151D_w" 17010 -2.13 .0020358064 15
    "CY00052802151D_w" 17010  3.61  .002908295 15
    "CY00052802151D_w" 17010     0  .008478994  .
    "CY00052802151D_w" 17010     0   .05044918  .
    "CY00052802151D_w" 17010  3.81 .0009975451 15
    "CY00052802151D_w" 17010  -.38  .024720097 15
    "CY00052802151D_w" 17010    .7  .000342158 15
    end

    Thank you in advance,
    Dimitris Chlorokostas

  • #2
    -corr()- is not an official Stata -egen- function; I'm guessing it's part of the -egenmore- package. In any case, I suspect that the slow execution is due to a combination of standard -egen- overhead and multiple evaluations of the -if- qualifier: the one given explicitly in your command, and possibly others within -egen, corr()- itself. -if- qualifiers are slow because the entire dataset must be processed and each observation tested against the condition. In a very large dataset, that can use a lot of time.

    My first suggestion would be: instead of flagging the data you want to analyze with the counta variable, can you just drop the data you want to exclude altogether? So

    Code:
    by firm_month, sort: gen valid_obs = sum(pch!= 0 & !missing(pch))
    by firm_month: drop if valid_obs < 10
    Then try running
    Code:
    bys id_firm_month: egen cova = corr(pch weight), covariance  // NOTE: NO -if- QUALIFIER!!!
    This alone may speed things up appreciably. If you ultimately need your original data back, you can -merge- the results back in after you're done.
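
    A rough sketch of that workflow (untested; the names nvalid and covas are illustrative, and it recomputes the valid-return count with -egen, total()- so that whole firm-months are kept or dropped together):

    Code:
    preserve
    * count valid (non-zero, non-missing) returns per group; total() gives
    * the group total, constant within each id_firm_month
    bys id_firm_month: egen nvalid = total(pch != 0 & !missing(pch))
    keep if nvalid >= 10
    bys id_firm_month: egen cova = corr(pch weight), covariance
    * reduce to one row per group and stash the results in a tempfile
    bys id_firm_month: keep if _n == 1
    keep id_firm_month cova
    tempfile covas
    save `covas'
    restore
    * attach each group's covariance back to the full daily data
    merge m:1 id_firm_month using `covas', nogenerate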

    Added: If you need to keep the original data intact, and -merge-ing back would be too slow, another possibility is to run -egen cova- on all the id_firm_month groups, whether they have 10 valid observations or not, and then zap the values for those firm-months that have too few:

    Code:
    by id_firm_month, sort: replace cova = . if valid_obs < 10
    If none of this produces an adequate speedup (which would mean that -if-s internal to -egen, corr()- are the likely source of the bottleneck), post back for a different approach.





    • #3
      The fastest way to do this is to use rangestat from SSC. The following is a variation on the help file's example of calculating correlations using a Mata function.

      With rangestat, you calculate results for each observation in the dataset. This is overkill in your case since every observation within the same id_firm_month will have the same result. You can however designate a single observation per id_firm_month group and only perform calculations for that observation.

      The first step is to work out, using standard Stata commands, how you would perform the calculation. The following example uses the first and last observations in the dataset and calculates the covariance using the correlate command with the covariance option.

      As explained in rangestat's help file, there is no built-in covariance function, but you can supply your own Mata function to perform the calculation. It's a pretty simple one: you just need Mata's variance() function. The function I supply returns the number of observations and the covariance.

      To restrict the computations to one observation per id_firm_month group, you tag one observation per group and generate the desired bounds only for those tagged observations. For a tagged observation, you want to pick up all observations that share its id_firm_month value, so the lower and upper bounds of the interval are simply the value of id_firm_month for that observation. All other observations get bounds of -1, a value that is not found in the data.

      rangestat is very fast: on my computer, the whole thing takes less than 5 seconds to run.

      Code:
      * fake data on 50K id_firm_month groups with 21 returns each
      clear
      set seed 123123
      set obs 50000
      gen long id_firm_month = _n
      expand 21
      bysort id_firm_month: gen ret_id = _n
      gen pch = runiform() if runiform() < .99
      gen weight = runiform()
      
      * calculate results using -correlate- for first and last observation
      corr pch weight if id_firm_month == id_firm_month[1], covariance
      dis as res r(cov_12)
      corr pch weight if id_firm_month == id_firm_month[_N], covariance
      dis as res r(cov_12)
      
      * define Mata function to return number of obs and covariance
      mata:
        mata clear
        real rowvector mycov(real matrix X)
        {
          real matrix R
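          // variance(X) returns the variance-covariance matrix of X's
          // columns; with two columns (pch, weight), R[2,1] is their
          // covariance, and rows(X) is the number of observations used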
          R = variance(X)
          return(rows(X), R[2,1])
        }
      end
      
      * tag one observation per id_firm_month group
      bysort id_firm_month (ret_id): gen first = _n == 1
      
      * generate low and high bounds, use out-of-sample value for non-tagged obs
      gen low = cond(first, id_firm_month, -1)
      gen high = low
      
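      * -casewise- specifies casewise deletion: rows with a missing value in
      * pch or weight are excluded from each group's calculation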
      rangestat (mycov) pch weight, interval(id_firm_month low high) casewise
      
      * rename results using more descriptive names and compare results
      rename mycov1 nobs
      rename mycov2 cova
      
      * carry over results in first obs to other obs
      bysort id_firm_month (ret_id): replace nobs = nobs[1]
      bysort id_firm_month (ret_id): replace cova = cova[1]
      
      list id_firm_month pch weight nobs cova if inlist(_n,1,_N)



      • #4
        Dear Clyde,
        indeed, the corr() function is from the egenmore package; sorry that I didn't mention this. Your first two lines of code don't work properly, though: sum() is a running sum, so it starts counting at 1 and accumulates down the group, and the second line then also drops days within a month that are useful and that I don't want to lose. The code that I posted does the right thing. But thank you very much for the idea of dropping the missing data and then merging the covariances back into the original dataset; this alone reduced the run time to about 15 minutes.
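
        To illustrate what I mean about the running sum (a quick sketch on the dataex example above):
        Code:
        * sum() accumulates down each group instead of giving one count per group
        by id_firm_month, sort: gen running = sum(pch != 0 & !missing(pch))
        list id_firm_month pch running in 1/5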

        Dear Robert,
        I am astonished by your answer. Your code runs only when the first observation within a month is not missing, though; otherwise this line results in an error:
        Code:
        corr pch weight if id_firm_month == id_firm_month[1], covariance
        But when I cleaned my dataset and ran it again, the result was impressive: indeed, only 5 seconds. I had just assumed that Stata wouldn't do the same calculation again and again on each row, but would be clever enough to do it once and then fill in all the rows with the same value.
        Do you recommend learning Mata to make Stata more efficient, given that I have a very large dataset and only two months until I finish my master's thesis? To be honest, I didn't understand what the Mata function did, except that one line somehow calculates a variance.



        • #5
          Do you recommend learning Mata to make Stata more efficient, given that I have a very large dataset and only two months until I finish my master's thesis? ... To be honest, I didn't understand what the Mata function did, except that one line somehow calculates a variance.
          Interesting questions. You were motivated to post your original question because your original solution was taking too long to run: after 3 hours it was still cranking away with no results. My proposed solution (sorry, I miscoded the -drop- command: it should have been -by firm_month: drop if valid_obs[_N] < 10-), once you corrected it, reduced the run time to 15 minutes. Not bad, especially in the context of a master's thesis, where the time scale for the project is months.
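
          Spelled out with the group variable from your posted data (a sketch; since sum() is a running sum, valid_obs[_N] is its last value within the group, i.e. the group's total count of valid returns):

          Code:
          by id_firm_month, sort: gen valid_obs = sum(pch != 0 & !missing(pch))
          * drop whole groups whose total count of valid returns is below 10
          by id_firm_month: drop if valid_obs[_N] < 10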

          Now, Robert's solution is plainly faster than that by a large margin: 5 seconds, more than two orders of magnitude faster. The drawback is that it uses code that you don't understand. If you are challenged to defend it, you can't. If you need at some point to modify it, you can't. If you come back to look at it in 6 months, you may not remember what it does. My philosophy is that in learning situations such as master's theses or doctoral dissertations, the whole point is to gain an in-depth understanding of a problem. So I would argue that you need to learn enough Mata to be able to explain this code, reproduce it from your own head without copying, and make modifications to it. Whether you can do that in two months really depends on you. Mata is a bit more difficult for most people to learn than Stata, but it is by no means insurmountable, and you don't have to delve particularly deep into it to get to this point.

          Now, the philosophy has to be bounded by some rule of reason. After all, I use the official functions of Stata all the time, and I have not even attempted to look at their code and understand them, except in a few select cases where I had a particular need to do so. Of course, I'm relying there on the reputation of StataCorp, and at some point you have to trust code that others have written for you. In the real world nobody writes every line of code personally; we all use pre-existing modules that we have reason to believe were developed and tested by capable people. And in terms of capable and reputable people, it really doesn't get much better than Robert Picard!

          The other side of the coin is that 15 minutes of run time is not unreasonably long in this context (unless this is just one instance of something that will have to be done thousands of times, in which case the difference between 15 minutes each and 5 seconds each is quite substantial). So my advice would be to try to learn enough Mata to understand Robert Picard's code. If that proves too difficult or time consuming, then substitute the 15-minute version based on #2 in your "production runs" for your thesis.


