Creating a median over a number of years for each ID

John Vourdas

Join Date: May 2016

Posts: 10
#1

Creating a median over a number of years for each ID

12 May 2016, 02:17

Hi,
I have a sample of several thousands IDs, over the period 2004 - 2015 inclusive. I have a variable which is constant over the years 2004-2012 inclusive and would like to calculate the median for each ID over this period, and report this number in each of the entries corresponding to 2004-2012 inclusive.

Based on the comparison of this median variable by ID with the median across all ID, 2004-2012 period I would like to create a dummy which is 1 if the median of the ID across the years 2004-2012 was lower than the median across all IDs and years 2004-2012 inclusive. Then I would like the dummy to apply to all periods of the dataset.

So in the end the dummy would be either 0 or 1 for all years, but differ by ID. The idea is to dichotomise all variables into no/yes for a particular variable before regulatory changes came into effect in 2013.

Any suggestions on how I can do this would be greatly appreciated.
Thanks,
John
Tags: None

Nick Cox

Join Date: Mar 2014
Posts: 35675

12 May 2016, 02:28

You can use egen to get medians for groups and then compare with the overall median calculated by summarize, detail

See the help for both commands.

Here's a dopey example, absent any data example in your post (see http://www.statalist.org/forums/help#stata)

Code:

webuse grunfeld , clear
egen g_median = median(invest), by(company)
egen tag = tag(company)
su g_median if tag, detail
gen wanted = g_median > r(p50)
tabdisp company, c(g_median wanted)

----------------------------------
  company |   g_median      wanted
----------+-----------------------
        1 |     538.35           1
        2 |     419.55           1
        3 |      93.55           1
        4 |   71.08501           1
        5 |     60.385           1
        6 |      43.11           0
        7 |       44.2           0
        8 |      38.54           0
        9 |      38.11           0
       10 |      2.215           0
----------------------------------

For the business of calculating using some years, but spreading to all, see http://www.stata-journal.com/sjpdf.h...iclenum=dm0055 especially Sections 9 and 10.

Last edited by Nick Cox; 12 May 2016, 03:03.

Comment

John Vourdas

Join Date: May 2016

Posts: 10
#3

12 May 2016, 02:57

Thanks for the speedy response. Below you will find an example of the data I am using.

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input int(ID year) long spec float t1req double ea 1 2009 6 4 4.932 1 2010 6 4 4.725 1 2011 6 4 5.814 1 2012 6 4 5.054 1 2013 6 4.5 6.499 1 2014 6 5.5 6.665 2 2004 1 4 7.75 2 2005 1 4 7.161 2 2006 1 4 6.176 2 2007 1 4 5.752 2 2008 1 4 3.966 2 2009 1 4 5.738 2 2010 1 4 6.311 2 2011 1 4 6.801 2 2012 1 4 6.499 2 2013 1 4.5 6.996 2 2014 1 5.5 7.592 2 2015 1 6 8.197 3 2005 2 4 3.656 3 2006 2 4 3.806 3 2007 2 4 3.505 3 2008 2 4 2.841 3 2009 2 4 3.905 3 2010 2 4 4.285 3 2011 2 4 4.93 3 2012 2 4 4.357 3 2013 2 4.5 5.024 3 2014 2 5.5 4.509 3 2015 2 6 5.018 4 2004 3 4 4.662 4 2005 3 4 4.459 4 2006 3 4 4.53 4 2007 3 4 4.543 4 2008 3 4 3.868 4 2009 3 4 4.41 4 2010 3 4 4.474 4 2011 3 4 4.301 4 2012 3 4 4.084 4 2013 3 4.5 4.837 4 2014 3 5.5 5.202 4 2015 3 6 5.734 end label values spec spec label def spec 1 "Bank holdings & Holding companies", modify label def spec 2 "Commercial banks", modify label def spec 3 "Cooperative banks", modify label def spec 6 "Savings banks", modify

Actually what I also need to do is create a dummy equal to 1 if the observation of ea is below the 30th percentile of the distribution of ea by spec, averaged over the years 2004-2012 (for both the individual ID number, and the spec peer group over which we calculate the 30th percentile). Then for the subsequent years (2013,2014,2015) I simply use the percentile over the group.

Thus far I was specifying the median of each ID for 2004-2012, 2013,2014,2015 using the following commands:
egen mea= median(ea / (year >= 2004 & year <= 2012)), by(ID)
replace mea=ea if year>2012

I was then calculating the bottom 30th percentile with the following loop:

*Generate lower 30th percentile of ea/t1ratio over period 2004-2012, by specialisation
*================================================= =================
*Generate empty lowthreshspecea/t1ratio
g lowthreshspect1ratio=.
g lowthreshspecea=.

*Loop over specialisations
quietly foreach j in ea t1ratio {

quietly forvalues i=1/6 {
egen lowthreshspec`j'`i'= pctile(`j') if spec==`i'&year>=2004&year<=2012, p(30)
replace lowthreshspec`j'=lowthreshspec`j'`i' if lowthreshspec`j'==.
}
}
*

But I am not sure on how to do it for the remaining years as I was using egen and replace doesn't seem to work.

Thanks again for your help.
Cheers,
John
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35675
#4

12 May 2016, 03:09

You changed the question -- and you have lost me on why the problem appears so complicated.

I don't see why you need a loop for the 30th percentile or that starting with the median is necessary at all.

As I understand it you want:

1. Reference variable: 30th percentile for 2004-2012.

2. Indicator variable: is actual value above or below?

Code:

egen pc30 = pctile(ea / (year >= 2004 & year <= 2012)), by(ID) p(30) gen wanted = ea > pc30
Comment
John Vourdas

Join Date: May 2016

Posts: 10
#5

12 May 2016, 03:27

Apologies for not expressing myself clearly enough. I try to be clearer in the following.

I want to create the 30th percentile for 2004-2012 for each value of spec, which is a categorical variable taking values from 1 to 6. Then for the years 2013,2014,2015 I want the 30th percentile for each value of spec for each of these years separately. So for each ID with e.g. spec=1 there would be a sequence of equal numbers for 2004-2012, then other numbers for 2013,2014,2015. The loop is looping over spec values. This is the bit I am struggling with.

Then I would have the median for each ID number over the period 2004 in 2012 recorded in the observations corresponding to the years 2004-2012, then the actual value for 2013,2014,2015. This was the bit of code relating to the median. This part is OK.

Last edited by John Vourdas; 12 May 2016, 03:36.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35675
#6

12 May 2016, 03:43

Code:

egen pc30 = pctile(ea / (year >= 2004 & year <= 2012)), by(ID spec) p(30)

may be relevant.
1 like
Comment
John Vourdas

Join Date: May 2016

Posts: 10
#7

12 May 2016, 03:52

OK I wasn't aware I could put multiple groups in the by(). This works. Thanks very much!
1 like
Comment

Announcement

Creating a median over a number of years for each ID

Comment

Comment

Comment

Comment

Comment

Comment