Speeding up Stata - Statalist

Nick Cox

Join Date: Mar 2014

Posts: 35726
#31

29 May 2018, 02:18

Code:

local todrop

empties (or equivalently deletes) the local macro todrop
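A minimal sketch of that behaviour, meant to be run as one do-file. The macro contents are made up for illustration; only the name todrop comes from the thread.

```stata
* Fill the macro with some illustrative variable names
local todrop DT__0 DS__0 DT__1 DS__1
display "before: [`todrop']"

* Defining the macro with nothing after its name empties it
local todrop
display "after:  [`todrop']"
```

The second display shows empty brackets, because a reference to an undefined local macro expands to nothing.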
Kristoffer Bjarkefur

Join Date: Feb 2016

Posts: 53
#32

29 May 2018, 02:35

Have you used compress on your data? See https://www.stata.com/help.cgi?compress

This mainly reduces the size of the data set on disk, but it also reduces the amount of working memory Stata needs to hold the data. Stata's tools for importing data usually store the data in the most efficient form, so you might not gain much, depending on how your .dta file was created. I would expect optimized variable types to be faster to process, but I do not know Stata's low-level implementation well enough to say how much of a difference that makes.

You do not want to run compress every time you run your file, as compress can be slow, but you can run the code below once in a separate do-file and never have to do it again (unless the original data is updated). When it finishes, compress reports whether it reduced the size of the data. You can also save the result under a different name, but you might not want two copies of this large data set. You never lose any information with compress, so it should always be fine to do this:

Code:

use TF_Short_10Y_5BP_US_end.dta
compress
save TF_Short_10Y_5BP_US_end.dta, replace

You can also run compress after you have generated the new variables. When you generate a new variable without specifying a variable type, Stata uses its default storage type (float, unless you have changed it with set type), which is often larger than the values need. If your code later requires a variable to have a larger storage type so as not to lose information, Stata will change the type for you. If this is required often, you might waste run time doing that.
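A small sketch of these points, using the auto data set that ships with Stata. The data set and the variable name is_foreign are my choices for illustration, not from the thread, and the comments assume the default type has not been changed with set type.

```stata
sysuse auto, clear

* No type given, so the new variable gets the default storage type (float)
generate is_foreign = (foreign == 1)
describe is_foreign

* compress demotes it to the smallest type that holds every value (byte here)
compress is_foreign
describe is_foreign

* A byte cannot hold 1000, so Stata promotes the variable to int on replace
replace is_foreign = 1000 in 1
describe is_foreign
```

Specifying the type up front (generate byte is_foreign = ...) avoids the oversized float in the first place.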

Finally, compress the data set before saving it. This is always a good practice before saving big data sets.

I can't guarantee that this will speed up things a lot, but you should probably at least get some disk space gains by using this.

Last edited by Kristoffer Bjarkefur; 29 May 2018, 02:44.

Robert Picard

Join Date: Mar 2014
Posts: 1536

#33

29 May 2018, 14:38

With the additional guidance from #30, I created the following demonstration dataset with 2 trades using the following code:

Code:

clear all
set seed 431
set maxvar 10000

input float(Tradenumb start_date duration)
 44 14444  444
187 14714 1621
end
format %tdDayDDmonCCYY start_date

expand duration
bysort Tradenumb : gen Date = start_date + _n - 1
format %tdDayDDmonCCYY Date
drop if inlist(dow(Date), 0, 6)  // Sunday and Saturday    
drop start_date duration

isid Tradenumb Date, sort

foreach v in Treasury Swap LIBOR LIBOR_discount Repo {
    gen `v' = runiform()
}

forvalues i = 0/3600 {
    gen DT__`i' = 1 / (1 + runiform(.055,.065)/360)^`i'
    gen DS__`i' = 1 / (1 + runiform(.065,.075)/360)^`i'
}
save "test_data.dta", replace

In principle, in order to get present values, all that is needed are the amounts to be discounted and the discount factors. The latter seem to have been pre-computed, so I'll use these DT_* and DS_* variables, but all this would be a piece of cake if the formula for creating these discount factors were known. I ran the code from #1 with the following modifications (the offset used in #1 is 1960 when it should be 1980):

Code:

replace Treasury_coupon_11_date = trade_start_date + 1980 if Tradenumb == `i'
replace Swap_coupon_11_date = trade_start_date + 1980 if Tradenumb == `i'

and saved the results in "slow_code.dta".

Here's how I would replicate these results using much simpler and faster code. Again, this would go much faster if I knew how to compute the desired discount factors directly.

Code:

use "test_data.dta", clear

* verify assumptions about the data
isid Tradenumb Date, sort

* the day of each observation relative to the first trade date
by Tradenumb: gen tday = Date - Date[1]

by Tradenumb: gen LIBOR_coupon_date = Date[1] + 90 * (ceil(tday/90))
by Tradenumb: gen REPO_reset_date = Date[_n-1]
format %td LIBOR_coupon_date REPO_reset_date

* copy relevant DT_* DS_* values at half-year coupon dates
qui forvalues i = 1/20 {
    local periods = `i' * 180
    by Tradenumb: gen target_`i' = `periods' - tday
    gen Treasury_disc_`i' = 1
    gen Swap_disc_`i'     = 1
    forvalues j = 1/`periods' {
        replace Treasury_disc_`i' = DT__`j' if target_`i' == `j'
        replace Swap_disc_`i'     = DS__`j' if target_`i' == `j'
    }
}

drop DT_* DS_*

save "faster.dta", replace

ds target_* tday, not
 
cf `r(varlist)' using slow_code.dta, all

and the results:

Code:

. cf `r(varlist)' using slow_code.dta, all
       Tradenumb:  match
            Date:  match
        Treasury:  match
            Swap:  match
           LIBOR:  match
  LIBOR_discount:  match
            Repo:  match
LIBOR_coupon_d~e:  2 mismatches
 REPO_reset_date:  match
 Treasury_disc_1:  match
     Swap_disc_1:  match
 Treasury_disc_2:  match
     Swap_disc_2:  match
 Treasury_disc_3:  match
     Swap_disc_3:  match
 Treasury_disc_4:  match
     Swap_disc_4:  match
 Treasury_disc_5:  match
     Swap_disc_5:  match
 Treasury_disc_6:  match
     Swap_disc_6:  match
 Treasury_disc_7:  match
     Swap_disc_7:  match
 Treasury_disc_8:  match
     Swap_disc_8:  match
 Treasury_disc_9:  match
     Swap_disc_9:  match
Treasury_disc_10:  match
    Swap_disc_10:  match
Treasury_disc_11:  match
    Swap_disc_11:  match
Treasury_disc_12:  match
    Swap_disc_12:  match
Treasury_disc_13:  match
    Swap_disc_13:  match
Treasury_disc_14:  match
    Swap_disc_14:  match
Treasury_disc_15:  match
    Swap_disc_15:  match
Treasury_disc_16:  match
    Swap_disc_16:  match
Treasury_disc_17:  match
    Swap_disc_17:  match
Treasury_disc_18:  match
    Swap_disc_18:  match
Treasury_disc_19:  match
    Swap_disc_19:  match
Treasury_disc_20:  match
    Swap_disc_20:  match

The difference for LIBOR_coupon_date is due to the non-regular binning of the first date for each trade in the #1 code. Note that I'm simply replicating the results generated from the #1 code using better coding strategies and I express no opinion as to whether or not any of this makes sense.
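Purely as an illustration of what computing the discount factor directly could look like: the thread never states the formula, so the sketch below assumes a closed form 1/(1 + r/360)^n with the annual rate r held in a hypothetical variable Treasury_rate. The toy data and the rate of 0.06 are made up. Under that assumption, the inner loop over DT__* variables collapses to a single generate per coupon.

```stata
clear
set obs 5

* Toy data: days since the first trade date (made-up values)
gen tday = (_n - 1) * 100

* Hypothetical annual rate; not a variable from the thread
gen double Treasury_rate = 0.06

forvalues i = 1/2 {
    local periods = `i' * 180
    gen target_`i' = `periods' - tday
    * Discount only while the coupon date is still ahead; otherwise keep 1
    gen double Treasury_disc_`i' = ///
        cond(target_`i' > 0, 1 / (1 + Treasury_rate/360)^target_`i', 1)
}

list tday target_* Treasury_disc_*
```

The cond() call mirrors the copying loop above: observations whose coupon date has already passed keep the initial value of 1.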


Raoul Eireiner

Join Date: May 2018

Posts: 17
#34

30 May 2018, 01:52

Dear Nick!

Thanks for the information about how to discard local macros. As I saw, this was already included in your initial code anyway, so I don't really know why the list was continued...

Thank you Kristoffer too, for the helpful tip with compress. I did use it on my data files and although it reduced their size just slightly, I think it contributed to the general outcome described further down below.

And thank you also, Robert, for the very detailed last post. Unfortunately, I had already found a working solution to the problems just a few hours earlier and was running it on some of the files so that I could tell you about it later. I really appreciate all of your effort and will try to use some parts of it to make my approach even a bit faster.

What I ultimately did is pretty much identical to #14; I moved only one line, and it had astonishing implications. I didn't figure out how to solve the problem that arises when several blocks of code are run in sequence from the same do-file. Therefore, I now start each block of code individually rather than as a sequence. Although this takes more time and needs constant supervision, it yields what I was aiming for.

Code:

// US 10Y 10BP SHORT
clear all
quietly {
    use TF_Short_10Y_10BP_US_end.dta

    forval j = 1/19 {
        gen Treasury_coupon_`j' = Treasury
    }
    gen Treasury_coupon_20 = Treasury + 100

    forval j = 1/19 {
        gen Swap_coupon_`j' = Swap
    }
    gen Swap_coupon_20 = Swap + 100

    gen LIBOR_coupon = LIBOR + 100

    forval j = 1/20 {
        gen Treasury_disc_`j' = 1
        gen Swap_disc_`j' = 1
        gen Treasury_coupon_`j'_date =.
        gen Swap_coupon_`j'_date =.
    }

    gen LIBOR_coupon_date = .
    gen REPO_reset_date =.

    quietly bysort Tradenumb (Date) : gen trade_start_date = Date[1]
    quietly bysort Tradenumb (Date) : gen trade_end_date = Date[_N]

    forval j = 1/20 {
        local J = 180 * `j'
        replace Treasury_coupon_`j'_date = trade_start_date + `J'
        replace Swap_coupon_`j'_date = trade_start_date + `J'
    }

    gen y = Date - trade_start_date
    gen z= 90*ceil(y / 90)
    replace z = 90 if z == 0
    replace LIBOR_coupon_date = trade_start_date + z
    drop y z

    replace REPO_reset_date = Date[_n-1] if Date <= trade_end_date
    replace REPO_reset_date =. if Date == trade_start_date

    set tracedepth 1
    set trace on

    forvalues j= 0/3600 {
        forval k = 1/20 {
            replace Treasury_disc_`k' = DT__`j' if Treasury_coupon_`k'_date - Date == `j'
            replace Swap_disc_`k' = DS__`j' if Swap_coupon_`k'_date - Date == `j'
        }
        local todrop `todrop' DT__`j' DS__`j'
    }
    drop `todrop'
}
save TF_Short_10Y_10BP_US_last.dta, replace
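The thread does not establish what went wrong when several such blocks ran in sequence, so the following is only a sketch of one way to process several files from a single do-file, applying the reset from #31 at the top of each pass so that the list in todrop cannot carry over from one file to the next. The toy files, their contents, and the labels A and B are hypothetical stand-ins for the real data sets.

```stata
* Build two small toy files standing in for the real data (hypothetical)
tempfile toy_A toy_B
foreach f in A B {
    clear
    set obs 3
    gen Date = _n
    forvalues j = 0/2 {
        gen DT__`j' = 1
        gen DS__`j' = 1
    }
    save "`toy_`f''"
}

* Process each file in turn within the same do-file
foreach f in A B {
    use "`toy_`f''", clear
    local todrop                          // start every pass with an empty list
    forvalues j = 0/2 {
        local todrop `todrop' DT__`j' DS__`j'
    }
    drop `todrop'
    describe, short
}
```

After each pass only Date remains, and the macro holds just the six names collected for the current file.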

Thank you once again for all the help during the last week and all the helpful tips & solutions you proposed. I definitely couldn't have done it without all of you guys. Thanks so much