Rolling Averages with dynamic window size (rangestat)

Tim Hermes

Join Date: Apr 2020

Posts: 17
#1

Rolling Averages with dynamic window size (rangestat)

22 Apr 2020, 14:30

Hi everyone!

I have panel data: For each game_id (e.g. 10), I have daily prices over a period. The length of the period varies strongly: 5 years to about 6 month.

I am trying to calculate a moving average for the daily prices for each game_id (e.g. 10).

I tried the following:

Code:

rangestat (mean) price, interval(date -180 180) by(game_id)

This works fine, but obviously uses the same rolling interval (window size) for all game_ids.
However, for game_ids where I have smaller observed periods (meaning less days), a window size of [-180,180] is too large.
Is there any way to make the interval boundaries dependent on the length of the period (number of days of that game_id), so that games with a smaller period length have a smaller window sizes?

Maybe a loop that iterates through each group (game_id), first checks the number of days present and then uses the number of days to adapt the interval for that game_id?

Here are some exemplary rows:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input long game_id float(date price) 10 20054 3.74 10 20055 4.99 10 20056 4.99 10 20057 4.99 10 20058 4.99 10 20059 9.99 10 20060 9.99 10 20061 9.99 10 20062 9.99 10 20063 9.99 10 20064 9.99 10 20065 9.99 10 20066 9.99 10 20067 9.99 10 20068 9.99 10 20069 9.99 10 20070 9.99 10 20071 9.99 10 20072 9.99 10 20073 9.99 ... 22 20054 2.20 22 20055 5.30 22 20056 5.30 22 20057 6.20 22 20058 6.20 22 20059 2.20 22 20060 1.30 end format %td date

I am also open for other approaches if range stat is not the way to go here - although it takes care of a lot of cases that would be a pain to solve manually.

Thank you very much in advance!

Last edited by Tim Hermes; 22 Apr 2020, 15:07.
Tags: #movingaverage, #panel, #rangestat
Clyde Schechter

Join Date: Apr 2014

Posts: 30050
#2

22 Apr 2020, 19:31

So, you want the length of the window for the rolling average to be some function of the number of days that the game appears in the data set -- but you don't say what function. Just to illustrate the approach, I'll assume you want the window to be 1/4th of the total number of days the game appears in the data.

Code:

by game_id (date), sort: gen window_radius = (date[_N] - date[1]) / 8 gen lb = date - window_radius gen ub = date + window_radius rangestat (mean) price, by(game_id) interval(date lb ub)

Note that no loop is required for this. The -by- prefix does all the necessary iteration without any explicit looping.
1 like
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35585
#3

23 Apr 2020, 02:37

#2 would be my answer too. rangestat (from SSC, as you are asked to explain) lets you specify variables for lower and upper bounds, which gives you flexibility.

That said, a smoothing recipe that varies is often harder to explain and even harder to defend against puzzled or sceptical readers.
1 like
Comment
Tim Hermes

Join Date: Apr 2020

Posts: 17
#4

23 Apr 2020, 05:04

Dear Clyde, dear Nick,

thank you very much for both your answers. I have to admit I am only a few days old with Stata (coming from Python). Therefore I took my time to digest the proposed solution.

Here is how I understand it:

Code:

by game_id (date), sort: gen window_radius = (date[_N] - date[1]) / 8

--> we sort the data by game_id & date. Then, for each group, we generate a new variable "window_radius" which is computed from the (maximum date - the first date) divided by 8. I understand that this value is "static" (the same) within each group.

Code:

gen lb = date - window_radius

--> we subtract two variables/columns here. The date variable that obviously always increases by one in each group and the "window_radius" that stays the same in each group. Thus, "gen lb" gives us a new variable/column that for each row has a value that is smaller by window_radius than the corresponding date value.

Code:

gen ub = date + window_radius

--> same as above

Code:

rangestat (mean) price, by(game_id) interval(date lb ub)

--> This is where I get confused. I appreciate the fact that we now have two variables defining lower and upper bound. However, I am surprised that they are not variables holding a single value for ub and lb, but two columns. I do understand that these two columns in each row consistently provide the desired interval by subtracting lb from ub. But how does rangestat handle an input for low and high that's each a column? Does it build a new interval with each row and since each row spans the same interval (from different values), it has a consistent interval for each game?

3 last general question:
(1) The -by- construct only applies to the first line of code, right?
(2) rangestat (mean) price, by(game_id) interval(date lb ub) is the same as rangestat (mean) price, interval(date lb ub) by(game_id) ?
(3) I am assuming that a loop would have caused more trouble since it would have tried to iteratively use rangestat on each group, creating a new variable "price_mean" each time, which throws an error. I guess rangestat wasn't constructed in a way that it can (or rather has to be) used in a loop easily. So thank you very much for this elegant solution (using rangestat only once but still with different intervals), @Clyde.

Lastly, thank you @Nick for pointing out the potential troubles with explain a dynamic time window. I will try it with a game-specific time-window and then with a fixed time window (where I kick out games with periods that are too small).

Best
Tim
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35585
#5

23 Apr 2020, 05:22

I think you're mostly there.

The interval() option of rangestat is flexible -- some might say confusing -- but its behaviour is documented. Given a variable and two numeric values say

Code:

interval(time -5 5)

then the window is from time -5 to time + 5 inclusive. But given three variables say

Code:

interval(time lower upper)

then the window is not from time + lower to time + upper but rather time within [lower, upper].

The rationale is that the latter syntax is more flexible -- and indeed what you want so far as I can see.

In general, or at least to hard-boiled experienced Stata programmers,

* variable in Stata programming always means "variable meaning column in dataset" and never anything else. Stata lets you do things that are like what in many languages could be as simple as

x = 42

but -- a little confusingly perhaps -- you can never say that in Stata! (You can say that in Mata.)

* observation always means "observation meaning row in dataset" and never anything else

Also, most Stata commands automatically mean a loop over the observations of the dataset.

The order in which you specify options is never material.
1 like
Comment
Tim Hermes

Join Date: Apr 2020

Posts: 17
#6

23 Apr 2020, 09:36

Thanks for the additional explanation, Nick!

In most other languages variables are more loosely defined - i.e. single values, lists, matrices etc.
Also, thanks for pointing out that most commands automatically tackle the whole variable - makes sense.
Comment

Announcement

Rolling Averages with dynamic window size (rangestat)

Comment

Comment

Comment

Comment

Comment