Generate median value of past 5 years observations

Alberto Alvarez

Join Date: Jul 2016

Posts: 195
#1

Generate median value of past 5 years observations

07 May 2018, 08:03

I need to calculate the following. Generate variable “High” = 1 if median “inflation” over the past 5 years is higher than the median over the whole sample. For example, in the example below the median inflation for the sample is 15.5. For the year 2005 “High”=0 since median for the past 5 years (2000-2004) is less than the sample median (15.5). Basically, I don’t know how to calculate median of past 5 years inflation level for each year. Please, help me with this issue.

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input float(gvkey fyear infl) 1 2000 10 1 2001 11 1 2002 12 1 2003 13 1 2004 14 1 2005 15 1 2006 16 1 2007 17 1 2008 18 1 2009 19 1 2010 20 1 2011 21 end
Tags: None
Jean-Claude Arbaut

Join Date: Jul 2017

Posts: 209
#2

07 May 2018, 08:31

Look for the -mvsumm- package on SSC (written by Nick Cox and Chris Baum).

Here is how you could do that:

Code:

ssc install mvsumm * Some random stuff clear set obs 30 gen yr=1979+_n tsset yr, yearly gen x=rexponential(1) * Compute the moving median mvsumm x, gen(y) s(med) w(5) gen z=L3.y * Compare with overall median sum x, detail gen high=z>r(p50) replace high=. in 1/5

Notice the lag in -gen z=L3.y-, so that the median is aligned with the observation following the window.

HTH

Jean-Claude Arbaut

Last edited by Jean-Claude Arbaut; 07 May 2018, 08:44.
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

07 May 2018, 14:48

You can do this with rangestat (from SSC). I'll assume that you have panel data so you want high defined using observations within the same gvkey group.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(gvkey fyear infl)
1 2000 10
1 2001 11
1 2002 12
1 2003 13
1 2004 14
1 2005 15
1 2006 16
1 2007 17
1 2008 18
1 2009 19
1 2010 20
1 2011 21
end

egen whole_median = median(infl)
rangestat (median) infl (count) infl, interval(fyear -4 0) by(gvkey)
gen high = infl_median > whole_median if infl_count == 5

and the results:

Code:

. list, sepby(gvkey)

     +--------------------------------------------------------------+
     | gvkey   fyear   infl   whole_~n   infl_m~n   infl_c~t   high |
     |--------------------------------------------------------------|
  1. |     1    2000     10       15.5         10          1      . |
  2. |     1    2001     11       15.5       10.5          2      . |
  3. |     1    2002     12       15.5         11          3      . |
  4. |     1    2003     13       15.5       11.5          4      . |
  5. |     1    2004     14       15.5         12          5      0 |
  6. |     1    2005     15       15.5         13          5      0 |
  7. |     1    2006     16       15.5         14          5      0 |
  8. |     1    2007     17       15.5         15          5      0 |
  9. |     1    2008     18       15.5         16          5      1 |
 10. |     1    2009     19       15.5         17          5      1 |
 11. |     1    2010     20       15.5         18          5      1 |
 12. |     1    2011     21       15.5         19          5      1 |
     +--------------------------------------------------------------+

.

Comment

Attaullah Shah

Join Date: Aug 2014
Posts: 1669

08 May 2018, 01:54

For rolling window statistics, you can also use asrol (can be downloaded from SSC)

Code:

ssc install asrol
bys gvkey : asrol infl, stat(median) window(fyear 5)
. list gvkey fyear infl median5_infl

     +---------------------------------+
     | gvkey   fyear   infl   median~l |
     |---------------------------------|
  1. |     1    2000     10         10 |
  2. |     1    2001     11       10.5 |
  3. |     1    2002     12         11 |
  4. |     1    2003     13       11.5 |
  5. |     1    2004     14         12 |
     |---------------------------------|
  6. |     1    2005     15         13 |
  7. |     1    2006     16         14 |
  8. |     1    2007     17         15 |
  9. |     1    2008     18         16 |
 10. |     1    2009     19         17 |
     |---------------------------------|
 11. |     1    2010     20         18 |
 12. |     1    2011     21         19 |
     +---------------------------------+

Regards
--------------------------------------------------
Attaullah Shah, PhD.
Professor of Finance, Institute of Management Sciences Peshawar, Pakistan
FinTechProfessor.com
https://asdocx.com
Check out my asdoc program, which sends outputs to MS Word.
For more flexibility, consider using asdocx which can send Stata outputs to MS Word, Excel, LaTeX, or HTML.

Comment

Robert Picard

Join Date: Mar 2014

Posts: 1536
#5

08 May 2018, 12:00

Well if we are going for alternative solutions, you can also use tsegen (from SSC) to do the same:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input float(gvkey fyear infl) 1 2000 10 1 2001 11 1 2002 12 1 2003 13 1 2004 14 1 2005 15 1 2006 16 1 2007 17 1 2008 18 1 2009 19 1 2010 20 1 2011 21 end tsset gvkey fyear tsegen median5 = rowmedian(L(0/4).infl) list

and the results

Code:

. list +--------------------------------+ | gvkey fyear infl median5 | |--------------------------------| 1. | 1 2000 10 10 | 2. | 1 2001 11 10.5 | 3. | 1 2002 12 11 | 4. | 1 2003 13 11.5 | 5. | 1 2004 14 12 | |--------------------------------| 6. | 1 2005 15 13 | 7. | 1 2006 16 14 | 8. | 1 2007 17 15 | 9. | 1 2008 18 16 | 10. | 1 2009 19 17 | |--------------------------------| 11. | 1 2010 20 18 | 12. | 1 2011 21 19 | +--------------------------------+ .

As to which one to use, tsegen is going to be faster for shorter windows but I generally recommend rangestat (from SSC) as it provides a general solution to all problems of statistics over any data interval (of which a fixed window of time is a subset thereof).

With respect to asrol as an alternative, let me point out that the complete source code for both tsegen and rangestat is fully viewable by any user, including Attaullah Shah. For a while, asrol was simply an adaptation without attribution of tsegen's code under a different interface and I called him on that here. About a year after rangestat came out, fully coded in Mata, asrol moved to Mata code. Since Attaullah Shah has decided not to make the source code for asrol available, no one can know for sure how much of it is original but, as far as I know, asrol still does nothing more than copy some of the functionality of rangestat.

Attaullah Shah pulled the same stunt with asreg (from SSC), released after I showed many times on this list how to perform an OLS on a rolling window using and external Mata function and after I made it a built-in function in rangestat.

Please use rangestat or tsegen for these types of problems.
1 like
Comment

Attaullah Shah

Join Date: Aug 2014
Posts: 1669

08 May 2018, 14:20

First of all, the tone you have used and the blatant allegations you have levelled against me are not the netiquettes I have observed in this forum. My reason for not showing the code is the same as that of the Stata Corporation's, hiding the trade secret. But that never means that I have plagiarized your code. There are indirect ways to see how both the programs work and how they are different. asrol uses several routines for rolling window calculations. Each routine is intelligently picked by asrol according to the data structure, while rangestat uses a fixed method. Therefore, rangestat falls behind in speed when compared on the basis of different data structures. Consider the following example when we use a balanced panel data:

Code:

clear
set obs 1000
gen industry=_n
gen year=_n+1917
expand 5
bys industry: gen country=_n
expand 1000
bys ind: gen company=_n
gen profit=uniform()
timer clear

. timer on 1

. bys company: asrol profit, s(mean) w( year 100)

. timer off 1

.
. timer on 2

. rangestat (mean) profit, interval(year -99 0) by(company)

.
. timer off 2

. timer list
   1:     37.31 /        1 =      37.3110 // asrol time
   2:     60.45 /        1 =      60.4530 // rangestat time

This issue was also highlighted in this post, and then you people quickly took action to improve the rangestat speed for duplicate observations in its next update. Did I copy the idea before you could materialize it?

Even though you worked to improve the speed of rangestat for handling duplicate observations, it still falls behind asrol for duplicate data in the majority of cases. I present some tests and their results below.

Code:

 mat S = J(100,3,.)
loc r = 1
forv i = 100 (100) 10000 {
clear
set obs 2000
gen company = _n
expand `i'
bys company: gen year = _n+1900
gen profit = uniform()

timer clear
timer on 1
bys company: asrol profit, s(mean) w( year 100)
timer off 1

timer on 2
rangestat (mean) profit, interval(year -99 0) by(company)

timer off 2
timer list
mat S[`r',1] = `i'
mat S[`r',2] = `r(t1)'
mat S[`r',3] = `r(t2)'
loc `++r'
}

After 19, runs, I stopped the loop. Following are the results

                       Time in Seconds            time_duplicated
           asrol
      rangestat
  diff_percentage

100
1.208
2.023
-40.29%

200
2.741
4.387
-37.52%

300
4.247
6.818
-37.71%

400
5.87
9.254
-36.57%

500
7.332
11.693
-37.30%

600
8.892
15.53
-42.74%

700
10.664
17.033
-37.39%

800
12.395
19.63
-36.86%

900
14.012
21.898
-36.01%

1000
15.034
23.967
-37.27%

1100
16.764
27.073
-38.08%

1200
18.899
29.464
-35.86%

1300
20.151
31.938
-36.91%

1400
21.782
34.654
-37.14%

1500
23.232
36.856
-36.97%

1600
24.404
38.995
-37.42%

1700
26.065
41.685
-37.47%

1800
28.354
44.27
-35.95%

1900
29.468
46.428
-36.53%

Off course, more variations are possible, which I have not tested. However, the point is that asrol is not a copy.
Now coming to your allegation against asreg, there is nothing special that you have for me to copy from. You adopted the mata code from this blog post (you have mentioned yourself this several times) and so did I. There is no rocket science in making a mata code for OLS components. What is special about asreg is again its speed of rolling window, which is based on the asrol algorithm. Yes, asreg performs Fama and MacBeth procedure, newey standard errors, fitted values and residuals, which off course are not part of rangestat.

Please note that my programs and I will stay here, no matter how much bullying you have in mind for me. I would humbly request to stop all blatant allegations and let the users choose for themselves from the given alternatives. Let us maintain the pleasant atmosphere here. Live and let live. Kind regards

Last edited by Attaullah Shah; 08 May 2018, 14:39.

Comment

Robert Picard

Join Date: Mar 2014

Posts: 1536
#7

08 May 2018, 15:35

I said my piece and I have no interest in debating you on this.
1 like
Comment
Jean-Claude Arbaut

Join Date: Jul 2017

Posts: 209
#8

09 May 2018, 06:06

For completeness: you can get the moving median using egen mmed, found at net from http://www.stata.com/users/kcrow.

Using the same dataset as in my answer above, do

Code:

egen y2=mmed(x), lag(4) gen z2=L.y2 gen high2=z2>r(p50)

Note: egen mmed had a little bug until yesterday, making it fail on Stata 15. You will need to reinstall the package if you have an older version.

Hope this helps

Jean-Claude Arbaut
Comment

Announcement