Running loop of VAR with gigantic daily data

Wen-Hung Hsu

Join Date: Mar 2021
Posts: 37


21 Oct 2022, 07:30

Dear Stata users,

I am trying to replicate a paper in which a vector autoregression (VAR) is estimated separately for each stock in the US market and each year from 1960 to 2015. To average across stocks and years, I have to save the VAR results, such as the coefficients and their significance levels.

However, I am running into the limits of my computer. I ran the code for a whole day and only about 1% of the groups were processed. At that rate, the full computation would take months.
I would therefore appreciate suggestions on how to speed this up, whether by changing the code or by any other means.
Below is my VAR loop, where id is a group index for each stock-year combination. For example, stock 1 in year 1 has id 1, and stock 1 in year 2 has id 2. There are around 160,000 values of id. The code fills in variables for each VAR variable and each lag. The local macro count is used to skip the constant terms, which are not needed in my project.
Code:

 su id, meanonly
set trace on
forvalue i = 1/`r(max)'{
    display `i'
    var sprtrn sign_trading ret if id == `i', lags(1/5)
    mat t = r(table)
    predict rid_rm if id == `i', residuals equation (sprtrn) 
    predict rid_x if id == `i', residuals equation (sign_trading) 
    predict rid_r if id == `i', residuals equation (ret) 
    replace resid_rm = rid_rm if !missing(rid_rm)
    replace resid_x = rid_x if !missing(rid_x)
    replace resid_r = rid_r if !missing(rid_r)
    drop rid_rm rid_x rid_r
    
    local count = 0
    foreach k in rm x r{
        forvalues m = 1/5{        
            replace rm_`k'_l`m'_coef = t[1,`= `m' + `count''] if id == `i'
            replace rm_`k'_l`m'_t = t[3,`= `m' + `count''] if id == `i'            
                    
        }
            local count = `count' +5
    }
    local count = `count'+1
    foreach k in rm x r{
        forvalues m = 1/5{        
            replace x_`k'_l`m'_coef = t[1,`= `m' + `count''] if id == `i'
            replace x_`k'_l`m'_t = t[3,`= `m' + `count''] if id == `i'            
                    
        }
            local count = `count' +5
    }
    local count = `count'+1
    foreach k in rm x r{
        forvalues m = 1/5{        
            replace r_`k'_l`m'_coef = t[1,`= `m' + `count''] if id == `i'
            replace r_`k'_l`m'_t = t[3,`= `m' + `count''] if id == `i'            
                    
        }
            local count = `count' +5
    }
}
set trace off
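
The column arithmetic in the loop above can be checked in isolation. This is a minimal sketch, assuming that r(table) after `var sprtrn sign_trading ret, lags(1/5)` holds 16 columns per equation (3 variables times 5 lags, then the constant), which is the layout the loop relies on. The macro names e, k, m and col are illustrative only.

```stata
* Sketch only: locate one coefficient's column in r(table).
* Assumption: 16 columns per equation = 3 variables x 5 lags + 1 constant.
local e = 2    // equation:  1 = sprtrn, 2 = sign_trading, 3 = ret
local k = 3    // regressor: 1 = sprtrn, 2 = sign_trading, 3 = ret
local m = 4    // lag, 1 to 5

* Skip the (e-1) earlier equations, then the (k-1) earlier regressors
local col = 16 * (`e' - 1) + 5 * (`k' - 1) + `m'
display "r(table) column: `col'"
```

For the sign_trading equation, regressor ret, lag 4, this gives column 30, the same index the loop reaches through its running count.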


Wen-Hung Hsu

Join Date: Mar 2021

Posts: 37
#2

21 Oct 2022, 07:35

FYI, this is panel data:

Code:

xtset id mydate

mydate is simply _n, used in place of the date variable in case there are gaps.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30078
#3

21 Oct 2022, 09:00

All those -if id == `i'-'s are killing you. This is a job for -runby-, written by Robert Picard and me, available from SSC. Basically you will have to rewrite the contents of your loop as a program. In that program, all of the -if id == `i'- clauses must be removed. You apply -runby- to that program, and the whole thing moves much, much faster.

The helpfile for -runby- is well written and has good examples.
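
As a rough illustration of the pattern described here, the following sketch applies -runby- to Stata's bundled auto dataset rather than to the stock data in this thread. It assumes -runby- has been installed from SSC (`ssc install runby`). The program name one_reg and the variables b_weight and t_weight are made up for the example.

```stata
* Sketch only: the -runby- pattern on the bundled auto data.
* Assumes: ssc install runby. one_reg, b_weight, t_weight are illustrative names.
sysuse auto, clear

capture program drop one_reg
program define one_reg
    * Only the current by-group is in memory here, so no -if- qualifier is needed
    regress price weight
    generate b_weight = _b[weight]
    generate t_weight = _b[weight] / _se[weight]
end

* Run one_reg once per value of foreign and stack what each run leaves in memory
runby one_reg, by(foreign)
```

Afterwards the data in memory hold the original observations plus the two new variables, constant within each level of foreign.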
Comment
Wen-Hung Hsu

Join Date: Mar 2021

Posts: 37
#4

21 Oct 2022, 12:26

Originally posted by Clyde Schechter

All those -if id == `i'-'s are killing you. This is a job for -runby-, written by Robert Picard and me, available from SSC. Basically you will have to rewrite the contents of your loop as a program. In that program, all of the -if id == `i'- clauses must be removed. You apply -runby- to that program, and the whole thing moves much, much faster.

The helpfile for -runby- is well written and has good examples.

Thank you, Clyde. However, when I tried to move the contents of the loop into a program, like this:

Code:

program var_1
    var sprtrn sign_trading ret, lags(1/5)
end

runby var_1, by(id)

the program erases the whole dataset and returns this error:

Code:

system limit exceeded - see manual

But I cannot find a relevant explanation in the help file (help runby). Please help me with this memory issue.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30078
#5

21 Oct 2022, 12:44

Oops! Sorry. -runby- is not going to work for this. The problem is that your iterator id takes on a different value in each year, but then the command you want to iterate requires lagged data. The way -runby- works, only the observations with the current value of id are in memory at any given time, and since id incorporates a specific time, the lagged data will not be there.

I'm sorry I didn't think about how -var- works before making that recommendation.

You might be able to do this using Robert Picard's -rangerun- command (from SSC). Basically it seems like you are trying to do this calculation in a moving 6 year window, separately on each stock. And the speedup from -rangerun- (again, no -if id == `i'- clauses needed) should be similar to what I expected -runby- to give you.
Comment
Wen-Hung Hsu

Join Date: Mar 2021

Posts: 37
#6

21 Oct 2022, 14:34

Originally posted by Clyde Schechter

Oops! Sorry. -runby- is not going to work for this. The problem is that your iterator id takes on a different value in each year, but then the command you want to iterate requires lagged data. The way -runby- works, only the observations with the current value of id are in memory at any given time, and since id incorporates a specific time, the lagged data will not be there.

I'm sorry I didn't think about how -var- works before making that recommendation.

You might be able to do this using Robert Picard's -rangerun- command (from SSC). Basically it seems like you are trying to do this calculation in a moving 6 year window, separately on each stock. And the speedup from -rangerun- (again, no -if id == `i'- clauses needed) should be similar to what I expected -runby- to give you.

I am actually using a moving 5-day window within each year for the VAR model. I have looked into how -rangerun- is used, and my code is:

Code:

bysort permno year: gen mydate = _n

program var_1
    xtset id mydate
    var sprtrn sign_trading ret, lags(1/5)
end

rangerun var, interval(mydate -5 0) by(id) verbose

and I get this error:

Code:

last estimates not found

I now define mydate as the observation counter within each stock and year, that is, within each id. My idea in setting the interval is to have Stata use the data from mydate - 5 to mydate + 0, which are also the inputs to the VAR for each date.
I am not sure whether I set the interval correctly here. Could this be what leads to the error?

Last edited by Wen-Hung Hsu; 21 Oct 2022, 15:31.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30078
#7

21 Oct 2022, 20:23

The way you set the interval in the -rangerun- command looks correct to me. However, it seems that in -rangerun- you are calling -var- itself rather than your wrapper program var_1. That, indeed, can lead to the "last estimates not found" error message.

By the way, not to be argumentative, but -interval(mydate -5 0)- is a 6 day moving window, not 5. Count from -5 to 0 and you count 6 numbers.
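
The window size can be checked directly on toy data. This is a minimal sketch, assuming -rangerun- is installed from SSC (`ssc install rangerun`) and that it returns the value of a newly generated variable for each observation's window. The data, the program name n_in_window, and the variable n_obs are made up for the example.

```stata
* Sketch only: count the observations that fall in the window mydate-5 .. mydate.
* Assumes: ssc install rangerun. Toy data and names are illustrative.
clear
set obs 10
generate id = 1
generate mydate = _n

capture program drop n_in_window
program define n_in_window
    * _N is the number of observations -rangerun- placed in memory for this window
    generate n_obs = _N
end

rangerun n_in_window, interval(mydate -5 0) by(id)
list mydate n_obs
```

Under these assumptions, n_obs grows by one per date until six dates are available and stays at 6 afterwards, matching the count from -5 to 0.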
Comment
Wen-Hung Hsu

Join Date: Mar 2021

Posts: 37
#8

22 Oct 2022, 02:18

Originally posted by Clyde Schechter

The way you set the interval in the -rangerun- command looks correct to me. However, it seems that in -rangerun- you are calling -var- itself rather than your wrapper program var_1. That, indeed, can lead to the "last estimates not found" error message.

I am not quite sure I understand. Do you mean that if I name the program "var", -rangerun- may run the -var- command itself rather than my program? In that case, I have renamed the program.

Sorry for the confusion. I think it is better to paste the whole code here.
The program for the VAR model is

Code:

program my_var
    xtset id mydate
    var sprtrn sign_trading ret, lags(1/5)
    mat t = r(table)
    predict rid_rm, residuals equation (sprtrn)
    predict rid_x, residuals equation (sign_trading)
    predict rid_r, residuals equation (ret)
    replace resid_rm = rid_rm if !missing(rid_rm)
    replace resid_x = rid_x if !missing(rid_x)
    replace resid_r = rid_r if !missing(rid_r)
    drop rid_rm rid_x rid_r

    local count = 0
    foreach k in rm x r{
        forvalues m = 1/5{
            replace rm_`k'_l`m'_coef = t[1,`= `m' + `count'']
            replace rm_`k'_l`m'_t = t[3,`= `m' + `count'']
        }
        local count = `count' +5
    }
    local count = `count'+1
    foreach k in rm x r{
        forvalues m = 1/5{
            replace x_`k'_l`m'_coef = t[1,`= `m' + `count'']
            replace x_`k'_l`m'_t = t[3,`= `m' + `count'']
        }
        local count = `count' +5
    }
    local count = `count'+1
    foreach k in rm x r{
        forvalues m = 1/5{
            replace r_`k'_l`m'_coef = t[1,`= `m' + `count'']
            replace r_`k'_l`m'_t = t[3,`= `m' + `count'']
        }
        local count = `count' +5
    }
end

rangerun my_var, interval(mydate -5 0) by(id) verbose

where

Code:

egen id = group(permno year)    // id is a unique stock-year index
bysort id: gen mydate = _n

I understand your point about the interval (-5 0), but the VAR model estimates a coefficient for each variable at each lag, per stock and per year, with the dependent variable at t = 0.
For example, at t = 6, the equation for sprtrn in the VAR model is
sprtrn_6 = sprtrn_1 +... +sprtrn_5 + x_1 +... + x_5 +r_1 +...+ r_5 + residual
In that case, I am not sure whether I should set the interval to (-5 0) or (-5 -1).
But either seems odd for a VAR model: since the VAR should estimate one set of coefficients per year, it should not run over a rolling window and return only the coefficients for that window.
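
Written out with explicit coefficients, the sprtrn equation sketched above would look roughly as follows. The symbols c, a_j, b_j, d_j and the error term are notation assumed here for illustration, with x standing for sign_trading and r for ret as in the example.

```latex
\[
\mathrm{sprtrn}_{t} = c
  + \sum_{j=1}^{5} a_{j}\,\mathrm{sprtrn}_{t-j}
  + \sum_{j=1}^{5} b_{j}\,x_{t-j}
  + \sum_{j=1}^{5} d_{j}\,r_{t-j}
  + \varepsilon_{t}
\]
```

One observation of this equation uses the dependent variable at date t and regressors from dates t-1 to t-5, that is, six consecutive dates, while the coefficients themselves are estimated from all usable dates within the id.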

Therefore I also try

Code:

rangerun var, interval(mydate . .) by(id) verbose

The time range displayed while the code runs seems to match how the VAR model should work: it covers the full mydate range for each id.

Yet with this updated interval, I ran into some new problems.
1. -rangerun- runs the VAR model for every interval, which is unnecessary. The coefficients are constant within each id, so the model needs to run only once per id. However, it runs for each value of mydate within the id.
2. The error message shows up again

Code:

system limit exceeded - see manual

and the program stops right after the -var- command (the second line of the program).

Last edited by Wen-Hung Hsu; 22 Oct 2022, 03:04.
Comment
Wen-Hung Hsu

Join Date: Mar 2021

Posts: 37
#9

22 Oct 2022, 03:40

A little update:
I created a new variable holding the number of observations in each id:

Code:

bysort id: gen count_id = count(permno)

and then ran the code

Code:

rangerun my_var, interval(mydate . count_id) by(id) verbose

This time, the time range passed to the VAR model for each id seems correct.
However, problem 1 remains: -rangerun- runs many times within each id, although once is sufficient.
Problem 2 is half solved. The error message is gone, but -rangerun- still stops at the -var- command and does not go on to fill the VAR results into the loop-generated variables.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30078
#10

22 Oct 2022, 04:13

Code:

bysort id: gen count_id = count(permno)

is not legal syntax, because -count()- is an -egen- function, not a -gen- function. The code should actually break at that point and proceed no farther.

The problem of -rangerun- running repeatedly when once is enough is dealt with in the -rangerun- help file. There is an explanation of how to avoid unnecessary repetitions. Click on the "Controlling the sample - Median salary of non teammates" link near the top of the file and read that explanation and example.

As for -rangerun- not filling the results of the model into the loop-generated variables, the problem is that the filling in must be done with -gen- statements, not -replace- statements. (And you must not -gen- variables with those names before you get to the -rangerun- command.) In other words, at the point where you call -rangerun-, those variables to be "filled in" must not yet exist, and inside your -program my_var- you must -generate-, not -replace-, them with the appropriate values.
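
A minimal sketch of that -generate- pattern follows, using simulated data in place of the real stock data and storing only one coefficient and its t-statistic. It assumes -rangerun- is installed from SSC, that column 1 of r(table) is the first own lag in the sprtrn equation, and that mydate has no gaps within id. The simulated values, and the choice to keep a single coefficient, are illustrative only.

```stata
* Sketch only: -generate- (not -replace-) inside the program run by -rangerun-.
* Assumes: ssc install rangerun. Simulated data stand in for the real panel.
clear
set seed 12345
set obs 120
generate id = ceil(_n / 60)              // two hypothetical stock-year groups
bysort id: generate mydate = _n
foreach v in sprtrn sign_trading ret {
    generate `v' = rnormal()
}

capture program drop my_var
program define my_var
    tsset mydate
    quietly var sprtrn sign_trading ret, lags(1/5)
    matrix t = r(table)
    * These variables must not exist before -rangerun- is called
    generate rm_rm_l1_coef = t[1, 1]     // coefficient on L1.sprtrn, sprtrn equation
    generate rm_rm_l1_t    = t[3, 1]     // its test statistic
end

* The open interval ( . . ) passes every observation of the id to my_var
rangerun my_var, interval(mydate . .) by(id)
```

With the open interval, the model is re-estimated for every observation, so this sketch shows only how results are returned, not how to avoid the repeated runs.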
Comment
Wen-Hung Hsu

Join Date: Mar 2021

Posts: 37
#11

22 Oct 2022, 07:29

Originally posted by Clyde Schechter

Code:

bysort id: gen count_id = count(permno)

is not legal syntax, because -count()- is an -egen- function, not a -gen- function. The code should actually break at that point and proceed no farther.

Yes, I sorted it out. It is now -egen-. Thanks for the reminder.

The problem of -rangerun- running repeatedly when once is enough is dealt with in the -rangerun- help file. There is an explanation of how to avoid unnecessary repetitions. Click on the "Controlling the sample - Median salary of non teammates" link near the top of the file and read that explanation and example.

I followed the instructions, and I can indeed assign the coefficients and t-values from the VAR model to their respective variables. What the help file suggests is to generate a variable so that only the first observation in each id is processed. This works perfectly for the yearly coefficients of each independent variable across lags.

However, I would also like to save the results from -predict- for each window (dependent variable at t = 0, independent variables from t-1 to t-5).

Code:

mat t = r(table)
predict rid_rm, residuals equation (sprtrn)
predict rid_x, residuals equation (sign_trading)
predict rid_r, residuals equation (ret)
gen resid_rm = rid_rm
gen resid_x = rid_x
gen resid_r = rid_r
drop rid_rm rid_x rid_r

Following the help file's instructions, the VAR model will also generate predicted values only for the first observation in each id, whereas I want to save the predicted values for each window, since predicted values vary within id, unlike the coefficients. Is it possible to adjust -rangerun- to do this?

I have also tried the original -rangerun- syntax

Code:

rangerun my_var, interval(mydate . count_id) by(id) verbose

But I found that the predicted values from the VAR model then become constant within each year, which is not what I want. I also cannot verify what these predicted values are, since in a VAR model the first observation should not have a predicted residual.

Last edited by Wen-Hung Hsu; 22 Oct 2022, 07:42.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30078
#12

22 Oct 2022, 10:37

Following the help file's instructions, the VAR model will also generate predicted values only for the first observation in each id, whereas I want to save the predicted values for each window, since predicted values vary within id, unlike the coefficients. Is it possible to adjust -rangerun- to do this?

There is nothing you can do with -rangerun- to modify this behavior of -var-. I, myself, am not a user of -var- and only vaguely understand what it is supposed to do. There may be options applicable to the -var- command that will accomplish what you want. Or perhaps the results you are looking for are found somewhere in -e()- or -r()- and can be accessed from there. I wouldn't know. I can only say that if the original code in the loop you showed in #1 did it, then -rangerun- can do it, too. Otherwise, not. All you really need to do is make the -program my_var- look like the code in the loop of #1, with -gen- instead of -replace- and the -if id == `i'- clauses eliminated (plus the -xtset- command at the top) and you should get the same results that the loop would have given you.

I'm sorry I can't be more specific, but we are now into the details of a Stata command that I know very little about.
Comment
