A faster way to create several thousand time series variables

Tyson Banks

Join Date: Jul 2020

Posts: 14
#1

A faster way to create several thousand time series variables

17 Jul 2020, 13:24

I have a couple hundred thousand lines, each of which represents a contract with a certain dollar amount, start date, and end date. I've successfully created a single daily time series of the data, dividing the dollar amounts into equal parts for each day each contract lasts (so if a specific contract lasts for an eight day period, each of these eight days gets 1/8th the total dollar amount in the daily time series). The code I wrote for this is:

gen day = _n
gen double daily_amount = 0
forvalues i = 1/`=_N' {
replace daily_amount = daily_amount + (total_dollar_amount[`i']/(end_date[`i'] - start_date[`i'] + 1)) if day >= start_date[`i'] & end_date[`i'] >= day
}

However, I also need to redo this process by creating separate time series using only data from each city the contracts can place in. There are over 3,000 cities. The following code works:

forvalues j = 1/3489 {
gen double daily_amount_`j' = 0
gen temp_`j' = 0
replace temp_`j' = federal_action_obligation if city_num == `j'
forvalues i = 1/`=_N' {
replace daily_amount_`j' = daily_amount_`j' + (temp_`j'[`i']/(end_date[`i'] - start_date[`i'] + 1)) if day >= start_date[`i'] & end_date[`i'] >= day
}
drop temp_`j'
}

But, unsurprisingly, the double loop ensures the process takes an incredibly long amount of time. Letting it run for a few hours and extrapolating, it seems like it would take at least three months to finish. This obviously isn't viable.

Is there a faster way of doing this? Perhaps by incorporating by or bysort into the variable generation process?

Thank you.
Tags: None
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#2

17 Jul 2020, 14:09

Like many of us here, I don't happen to know anything about how time series of financial contract values work, so I can't say that I quite understand what you *want* to do here. My suspicion, though, is that you use explicit indexing [`i'] when it's not necessary. If I'm right, that might be radically increasing your run time, as explicit subscripting is slow in Stata and is always to be avoided when possible. And, it's usually possible to avoid it. Another and smaller point is that the inrange() function might be somewhat faster than your two inequality comparisons.

Hope I'm not off-base here.
Comment
Tyson Banks

Join Date: Jul 2020

Posts: 14
#3

17 Jul 2020, 15:07

What are some examples of things I can use in place of explicit indexing?

(Also worth noting that in the second block of code, "federal_action_obligation" should be total_dollar_amount. I changed it in the first example to be clearer, but forgot to change it in the second)
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#4

17 Jul 2020, 16:28

I agree with Mike Lacy - what you are doing is using Stata as if it were some other programming language. It will be easier to show you an example of how to use Stata effectively for your problem than to write an essay about how to do it.

From post #1

I have a couple hundred thousand lines, each of which represents a contract with a certain dollar amount, start date, and end date

Even the best descriptions of data are no substitute for an actual example of the data. There are many ways your data might be organized that are consistent with your description, and each would require a somewhat different approach. In order to get a helpful response, you need to show some example data, especially since you later tell us you also have a city_num variable.

Be sure to use the dataex command to do this. If you are running version 15.1 or later, or a fully updated version 14.2, dataex is already part of your official Stata installation. If not, run ssc install dataex to get it. Either way, run help dataex and read the simple instructions for using it. dataex will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

When asking for help with code, always show example data. When showing example data, always use dataex.

I'm guessing you're a new user of Stata, and if so, you might find the following advice helpful in setting yourself on the right path.

I'm sympathetic to you as a new user of Stata - it's a lot to absorb. And even worse if perhaps you are under pressure to produce some output quickly. Nevertheless, I'd like to encourage you to take a step back from your immediate tasks.

When I began using Stata in a serious way, I started, as have others here, by reading my way through the Getting Started with Stata manual relevant to my setup. Chapter 18 then gives suggested further reading, much of which is in the Stata User's Guide, and I worked my way through much of that reading as well. There are a lot of examples to copy and paste into Stata's do-file editor to run yourself, and better yet, to experiment with changing the options to see how the results change.

All of these manuals are included as PDFs in the Stata installation and are accessible from within Stata - for example, through the PDF Documentation section of Stata's Help menu. The objective in doing the reading was not so much to master Stata as to be sure I'd become familiar with a wide variety of important basic techniques, so that when the time came that I needed them, I might recall their existence, if not the full syntax, and know how to find out more about them in the help files and PDF manuals.

Stata supplies exceptionally good documentation that amply repays the time spent studying it - there's just a lot of it. The path I followed surfaces the things you need to know to get started in a hurry and to work effectively.
1 like
Comment

Tyson Banks

Join Date: Jul 2020
Posts: 14

18 Jul 2020, 07:19

William Lisowski Thank you for telling me about dataex! I hope I've used it correctly:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(total_dollar_amount start_date end_date city_num day) double daily_amount
     15000 1870 1977  402  1 0
     10000 1875 1948  402  2 0
      5000 1960 1962  681  3 0
      1500 1917 1947 3166  4 0
       -30 1947 3496 3166  5 0
     12500 1912 1947 3166  6 0
     40000 1798 2163 3260  7 0
    125000 1798 2163 3287  8 0
    350000 1982 2163 3260  9 0
     18000 1883 2008 1804 10 0
     810.6 1960 2008  681 11 0
      1400 1958 1962  681 12 0
    145000 2096 2460 3287 13 0
      5000 1975 2100  166 14 0
  14594.33 2029 2100  402 15 0
       550 2046 2100  166 16 0
     15000 1910 2008  166 17 0
    500000 2344 2708 3287 18 0
-284633.63 2344 2708 3287 19 0
    321030 2344 2708 3287 20 0
      6963 1960 2038  681 21 0
     24630 1967 2008  681 22 0
 -13487.96 2830 2844  681 23 0
     25000 1960 2008  681 24 0
    6764.4 1648 1673  166 25 0
end

The variables start_date and end_date are denominated in number of days since 31 December, 2002. The variable day is just = _n. The last variable, daily_amount, is initially all zeroes, until it is populated by the code above.

What I need to do is create a different version of daily_amount for every value of city_num, in a way that doesn't take months to calculate like my double loop does.

Originally posted by William Lisowski View Post

I'm guessing you're a new user of Stata

Partially yes—I've used it in classes for my master's course and casually for personal projects for a couple years. This is my first time using it in a professional context. Thank you for hopefully being patient with me.

Comment

William Lisowski

Join Date: Dec 2014
Posts: 10150

18 Jul 2020, 12:29

Perhaps this example will start you in a useful, faster direction in the absence of loops. The main limitation is that it generates lots of observations, but if the exceed the capacity of your copy of Stata, you can divide your data into small groups of cities, or can even process it one city at a time, appending the results to an output dataset after each batch.

Code:

// don't need these
drop day daily_amount
// convert to Stata Internal Format dates that we can read
replace start_date = start_date+td(31dec2002)
replace end_date = end_date+td(31dec2002)
format %td start_date end_date
assert end_date>=start_date
// distinct identifier for each observation
sort city_num start_date end_date
generate seq = 0
replace seq = cond(city_num!=city_num[_n-1],1,seq[_n-1]+1)
order city_num seq start_date end_date total_dollar_amount
// daily dollar amount
generate daily_dol = total_dollar_amount/(end_date-start_date+1)
format %12.2fc total_dollar_amount daily_dol
list if city_num==166, abbreviate(20)
// one observation for each day of range
expand end_date-start_date+1
bysort city_num seq: generate date = start_date - 1 + _n
format %td date
collapse (sum) daily_dol, by(city_num date)
// list observations where a jump occurs
bysort city_num (date): generate change = date!=date[_n-1]+1 | date!=date[_n+1]-1 | daily_dol!=daily_dol[_n-1]
list city_num date daily_dol if change & city_num==166, abbreviate(12) separator(0)

Code:

. list if city_num==166, abbreviate(20)

     +---------------------------------------------------------------------------+
     | city_num   seq   start_date    end_date   total_dollar_amount   daily_dol |
     |---------------------------------------------------------------------------|
  1. |      166     1    06jul2007   31jul2007              6,764.40      260.17 |
  2. |      166     2    24mar2008   30jun2008             15,000.00      151.52 |
  3. |      166     3    28may2008   30sep2008              5,000.00       39.68 |
  4. |      166     4    07aug2008   30sep2008                550.00       10.00 |
     +---------------------------------------------------------------------------+

Code:

. list city_num date daily_dol if change & city_num==166, abbreviate(12) separator(0)

      +----------------------------------+
      | city_num        date   daily_dol |
      |----------------------------------|
   1. |      166   06jul2007      260.17 |
  26. |      166   31jul2007      260.17 |
  27. |      166   24mar2008      151.52 |
  92. |      166   28may2008      191.20 |
 126. |      166   01jul2008       39.68 |
 163. |      166   07aug2008       49.68 |
 217. |      166   30sep2008       49.68 |
      +----------------------------------+

Comment

Tyson Banks

Join Date: Jul 2020

Posts: 14
#7

20 Jul 2020, 08:03

William Lisowski

First off, thank you very much for the detailed answer!

Like you predicted might be the case, hardware limitations mean the example code you supplied is not faster for me than what I wrote, particularly from the expand command which increases my number of entries from 122,000 to 26.4 million.

I'll look into ways of possibly breaking up the data set into pieces to still accomplish what I need. If I succeed, I'll post my resulting code in this thread for anyone who might be looking at it years from now with a similar issue.

Last edited by Tyson Banks; 20 Jul 2020, 08:32.
Comment
Bjarte Aagnes

Join Date: Apr 2014

Posts: 783
#8

20 Jul 2020, 09:08

What are some examples of things I can use in place of explicit indexing?

gen double daily_amount = 0
forvalues i = 1/`=_N' {
replace daily_amount = daily_amount + (total_dollar_amount[`i']/(end_date[`i'] - start_date[`i'] + 1)) if day >= start_date[`i'] & end_date[`i'] >= day
}
may be changed to (one -generate- not _N generate/replace)

Code:

keep if ( day >= start_date & end_date >= day ) generate daily_amount = total_dollar_amount / ( end_date - start_date + 1 )
Comment

Announcement