moving average by unit id

Naz Av

Join Date: Sep 2014

Posts: 20
#1

moving average by unit id

06 Dec 2015, 22:13

Hi all, I'm trying to create a moving average by unit id. My data is panel (country year). I'm trying to create a variable that records the average count of a variable by country (unit id). So for year 1993, if I had data from 1990, the average would be the mean count for 1990, 1991, 1992, excluding the current year.

so the data would look like this:
unitid count movave

1990 1 10

1991 1 5 10

1992 1 10 7.5

1993 1 20 8.333333

1994 1 0 8.75

1990 2 5

1991 2 5 5

1992 2 10 5

1993 2 10 7.5

1994 2 10 10

I've tried two things: egen movave=ma(count) but this gives the error for missing values in the count variable. I read the help and this should treat missings as zero in calculating the count (so year 1994 could be missing but treated as zero). I'm fine with this. However to get the command to work I had to recode the count (.=0). The major concern I have though is that the above is not by unit id--that is the moving average continues across time.

I also tried tssetting the data by tsset unitid year and then running the ma command

tssmooth ma movave = count , window(2 1 2)

However again this does not record the moving average by unitid. My second issue with this is that I do not have a theoretical reasoning of which polynomial I'd use to smooth out the average given as what I want is pretty straightforward. Tssmooth seems to me to be more suitable for say time series stock data that needs a smoothing function.

thus, I think the egen command is a better fit but I don't know how to finetune it to run despite missings and perhaps more importantly, create a moving average by unitid (by country).
I am guessing I'd have to write a loop and run foreach on unitid...but I'm not sure.

Any help would be appreciated! I should also mention that the panel is unbalanced b/c it's country data and there are new countries entering the data of course post-Cold War for instance. I'm mentioning this b/c there are shortcuts with lag specifiers if each panel had the same number of time points.

thanks much in advance!

apologies everyone the above is not what I want. I left it in there as that's a different problem: that is the past 3 years moving average. No what I'd like is a running average. Meaning that it'd calculate the average count from 1990 or whichever is the first date recorded per unitid up till the past year.

so it'd be like this

unitid count movave

1990 1 10

1991 1 5 10

1992 1 10 7.5

1993 1 20 8.333333

1994 1 0 9

1990 2 5

1991 2 5 5

1992 2 10 5

1993 2 10 6.666667

1994 2 10 7.5

Yes I think it does require a loop but I can't wrap my mind around it.

Last edited by Naz Av; 06 Dec 2015, 22:23.
Tags: None
Olena Onishchenko

Join Date: Oct 2015

Posts: 165
#2

07 Dec 2015, 02:13

Moving average perfectly works for time-series data. There is user written command designed for this purpose.
For your case it would be like that: mfd_ta_MA your variable, period (3) ma_type (sma). But, if you have missing values results are bised.

Maybe it's not the best way, but that's how I calculated MA for panel data after spending weeks on it:

bysort unitid (year): gen MA3=(count[_n-3]+count[_n-2]+count[_n-1])/3
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35717

07 Dec 2015, 03:11

Naz Av: Please note our requests that people use full real names and to use CODE delimiters to show code. See FAQ Advice http://www.statalist.org/forums/help

tsegen (SSC) is sufficiently general for your purposes. There are now many examples in threads here.

Unbalanced panels are not a problem. If your missings really mean zero (I can't follow what you are saying on this point) then you would indeed need to recode them first.

Code:

. clear

. input year unitid       count   wanted

          year     unitid      count     wanted
  1. 1990    1       10      .
  2. 1991    1       5       10
  3. 1992    1       10      7.5
  4. 1993    1       20      8.333333
  5. 1994    1       0       9
  6. 1990    2       5       .
  7. 1991    2       5       5
  8. 1992    2       10      5
  9. 1993    2       10      6.666667
 10. 1994    2       10      7.5
 11. end

. tsset unitid year
       panel variable:  unitid (strongly balanced)
        time variable:  year, 1990 to 1994
                delta:  1 unit

. tsegen calc = rowmean(L(1/3).count)
(2 missing values generated)

. l, sepby(unitid)

     +---------------------------------------------+
     | year   unitid   count     wanted       calc |
     |---------------------------------------------|
  1. | 1990        1      10          .          . |
  2. | 1991        1       5         10         10 |
  3. | 1992        1      10        7.5        7.5 |
  4. | 1993        1      20   8.333333   8.333333 |
  5. | 1994        1       0          9   11.66667 |
     |---------------------------------------------|
  6. | 1990        2       5          .          . |
  7. | 1991        2       5          5          5 |
  8. | 1992        2      10          5          5 |
  9. | 1993        2      10   6.666667   6.666667 |
 10. | 1994        2      10        7.5   8.333333 |
     +---------------------------------------------+

The differences here from what you say you want both appear to be basic arithmetic errors on your part as a hand check of (5 + 10 + 20)/3 and (5 + 10 + 10)/3 will confirm.

Olena: Also please note (FAQ Advice) that you are asked to document user-written commands you refer to.

In looking for mfd_ta_MA I got to http://stat.researchforprofit.com/20..._dm-for-stata/ but some kind of registration seems to be required so I went no further. My bias is that tsegen should be more than adequate here. Full disclosure: I am the second author.

The code

Code:

bysort unitid (year): gen MA3=(count[_n-3]+count[_n-2]+count[_n-1])/3

is not smart about missing values. From what you say that is true of mfd_ta_MA but if so why are you recommending it?

Last edited by Nick Cox; 07 Dec 2015, 03:14.

Comment

Naz Av

Join Date: Sep 2014

Posts: 20
#4

07 Dec 2015, 12:10

Many thanks both of you. I had used a similar formulation to Olena's. I wanted the missings to be treated as zero in the calculation. In Olena's code, I think the missing would be skipped altogether instead of being treated as zero.

Re:Nick Cox, thanks! I had previously requested the name to be changed; I will do so again.

Apologies about the calculation mistake-I did it in excel instead of calculating it myself. I think the confusion arises from another fact though: I was trying to calculate the average from time 1 instead of the past 3 years. so 9 there derives from 45/5 (10+5+10+20) and including current year divided by 5. In my first example I gave the moving average example of past 3 years.

On that note:
One further question: how would I modify this code if I wanted a running average-that is to do an average not on the past 3 lags but rather from time 1 for all units? If it were a balanced dataset and all countries in this data had 30 years, I could modify the code above to L(1/10) but that's not the case. That is why I mentioned unbalanced panel.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35717
#5

07 Dec 2015, 12:30

On Oleana's code: if you mean that

Code:

bysort unitid (year): gen MA3=(count[_n-3]+count[_n-2]+count[_n-1])/3

will "skip missings", that is quite wrong, as experiment will confirm. Missings in any previous value included in the calculation will result in missing moving averages.

If your missings really mean zero, you should just recode them as such before calculation. All analyses henceforth will be affected.

The way to get a moving average to date is just

Code:

bysort unitid (year) : gen ave_to_date = sum(count) / sum(count < .)

if missings mean what they say, or

Code:

bysort unitid (year : gen ave_to_date = sum(count) / _n

if missings mean zero. sum() ignores missings and its results are affected by zeros, but the results are not the same when observations are counted.

The definition used here is the same for balanced and unbalanced panels. If you want a different definition, you may have to spell out how the rules would differ.
Comment
Olena Onishchenko

Join Date: Oct 2015

Posts: 165
#6

07 Dec 2015, 12:31

Nick, before I used the code I suggested (bysort unitid (year): gen MA3=(count[_n-3]+count[_n-2]+count[_n-1])/3) I dropped all missing values as it would bias my calculations.

drop if var=.

After that it worked perfectly for me on unbalanced data.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35717
#7

07 Dec 2015, 12:35

You didn't explain that in #2. It is valid if and only if you really do want the mean of the previous three known values. It's not a general solution, as many people want to be sure that the moving average refers to a fixed and predictable time window.
Comment
Naz Av

Join Date: Sep 2014

Posts: 20
#8

07 Dec 2015, 13:44

Hi both, many thanks. I had used a similar code to Olena's.

Nick Cox; much appreciated. I think I've got this now. You are right, in general I would not want missings to be treated as zero. In this particular case, it's event data where if the event has not occurred, it will not be recorded in source data and hence we can "zero impute" -assume zero. In general though, I would use the first code. I understand it to mean, calculate it until the first missing value by unit id.
I will experiment with my hypothetical data to ensure it's doing what I want it to do! Thanks!
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35717
#9

07 Dec 2015, 13:49

Sorry, what do you mean by "the first code"? Best to quote it exactly.
Comment
Olena Onishchenko

Join Date: Oct 2015

Posts: 165
#10

07 Dec 2015, 14:04

Sorry Nick

I am learning from all of your how to be more specific. Will consider it for the next time.
Comment

Announcement