Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Creating a median over a number of years for each ID

    Hi,
    I have a sample of several thousands IDs, over the period 2004 - 2015 inclusive. I have a variable which is constant over the years 2004-2012 inclusive and would like to calculate the median for each ID over this period, and report this number in each of the entries corresponding to 2004-2012 inclusive.

    Based on the comparison of this median variable by ID with the median across all ID, 2004-2012 period I would like to create a dummy which is 1 if the median of the ID across the years 2004-2012 was lower than the median across all IDs and years 2004-2012 inclusive. Then I would like the dummy to apply to all periods of the dataset.

    So in the end the dummy would be either 0 or 1 for all years, but differ by ID. The idea is to dichotomise all variables into no/yes for a particular variable before regulatory changes came into effect in 2013.

    Any suggestions on how I can do this would be greatly appreciated.
    Thanks,
    John

  • #2
    You can use egen to get medians for groups and then compare with the overall median calculated by summarize, detail

    See the help for both commands.

    Here's a dopey example, absent any data example in your post (see
    http://www.statalist.org/forums/help#stata)

    Code:
    webuse grunfeld , clear
    egen g_median = median(invest), by(company)
    egen tag = tag(company)
    su g_median if tag, detail
    gen wanted = g_median > r(p50)
    tabdisp company, c(g_median wanted)
    
    ----------------------------------
      company |   g_median      wanted
    ----------+-----------------------
            1 |     538.35           1
            2 |     419.55           1
            3 |      93.55           1
            4 |   71.08501           1
            5 |     60.385           1
            6 |      43.11           0
            7 |       44.2           0
            8 |      38.54           0
            9 |      38.11           0
           10 |      2.215           0
    ----------------------------------
    For the business of calculating using some years, but spreading to all, see http://www.stata-journal.com/sjpdf.h...iclenum=dm0055 especially Sections 9 and 10.
    Last edited by Nick Cox; 12 May 2016, 03:03.

    Comment


    • #3
      Thanks for the speedy response. Below you will find an example of the data I am using.


      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input int(ID year) long spec float t1req double ea
      1 2009 6   4 4.932
      1 2010 6   4 4.725
      1 2011 6   4 5.814
      1 2012 6   4 5.054
      1 2013 6 4.5 6.499
      1 2014 6 5.5 6.665
      2 2004 1   4  7.75
      2 2005 1   4 7.161
      2 2006 1   4 6.176
      2 2007 1   4 5.752
      2 2008 1   4 3.966
      2 2009 1   4 5.738
      2 2010 1   4 6.311
      2 2011 1   4 6.801
      2 2012 1   4 6.499
      2 2013 1 4.5 6.996
      2 2014 1 5.5 7.592
      2 2015 1   6 8.197
      3 2005 2   4 3.656
      3 2006 2   4 3.806
      3 2007 2   4 3.505
      3 2008 2   4 2.841
      3 2009 2   4 3.905
      3 2010 2   4 4.285
      3 2011 2   4  4.93
      3 2012 2   4 4.357
      3 2013 2 4.5 5.024
      3 2014 2 5.5 4.509
      3 2015 2   6 5.018
      4 2004 3   4 4.662
      4 2005 3   4 4.459
      4 2006 3   4  4.53
      4 2007 3   4 4.543
      4 2008 3   4 3.868
      4 2009 3   4  4.41
      4 2010 3   4 4.474
      4 2011 3   4 4.301
      4 2012 3   4 4.084
      4 2013 3 4.5 4.837
      4 2014 3 5.5 5.202
      4 2015 3   6 5.734
      end
      label values spec spec
      label def spec 1 "Bank holdings & Holding companies", modify
      label def spec 2 "Commercial banks", modify
      label def spec 3 "Cooperative banks", modify
      label def spec 6 "Savings banks", modify

      Actually what I also need to do is create a dummy equal to 1 if the observation of ea is below the 30th percentile of the distribution of ea by spec, averaged over the years 2004-2012 (for both the individual ID number, and the spec peer group over which we calculate the 30th percentile). Then for the subsequent years (2013,2014,2015) I simply use the percentile over the group.

      Thus far I was specifying the median of each ID for 2004-2012, 2013,2014,2015 using the following commands:
      egen mea= median(ea / (year >= 2004 & year <= 2012)), by(ID)
      replace mea=ea if year>2012

      I was then calculating the bottom 30th percentile with the following loop:

      *Generate lower 30th percentile of ea/t1ratio over period 2004-2012, by specialisation
      *================================================= =================
      *Generate empty lowthreshspecea/t1ratio
      g lowthreshspect1ratio=.
      g lowthreshspecea=.

      *Loop over specialisations
      quietly foreach j in ea t1ratio {

      quietly forvalues i=1/6 {
      egen lowthreshspec`j'`i'= pctile(`j') if spec==`i'&year>=2004&year<=2012, p(30)
      replace lowthreshspec`j'=lowthreshspec`j'`i' if lowthreshspec`j'==.
      }
      }
      *

      But I am not sure on how to do it for the remaining years as I was using egen and replace doesn't seem to work.

      Thanks again for your help.
      Cheers,
      John


      Comment


      • #4
        You changed the question -- and you have lost me on why the problem appears so complicated.

        I don't see why you need a loop for the 30th percentile or that starting with the median is necessary at all.

        As I understand it you want:

        1. Reference variable: 30th percentile for 2004-2012.

        2. Indicator variable: is actual value above or below?

        Code:
        egen pc30 = pctile(ea / (year >= 2004 & year <= 2012)), by(ID) p(30) 
        gen wanted = ea > pc30

        Comment


        • #5
          Apologies for not expressing myself clearly enough. I try to be clearer in the following.

          I want to create the 30th percentile for 2004-2012 for each value of spec, which is a categorical variable taking values from 1 to 6. Then for the years 2013,2014,2015 I want the 30th percentile for each value of spec for each of these years separately. So for each ID with e.g. spec=1 there would be a sequence of equal numbers for 2004-2012, then other numbers for 2013,2014,2015. The loop is looping over spec values. This is the bit I am struggling with.

          Then I would have the median for each ID number over the period 2004 in 2012 recorded in the observations corresponding to the years 2004-2012, then the actual value for 2013,2014,2015. This was the bit of code relating to the median. This part is OK.
          Last edited by John Vourdas; 12 May 2016, 03:36.

          Comment


          • #6
            Code:
             
             egen pc30 = pctile(ea / (year >= 2004 & year <= 2012)), by(ID spec) p(30)
            may be relevant.

            Comment


            • #7
              OK I wasn't aware I could put multiple groups in the by(). This works. Thanks very much!

              Comment

              Working...
              X