Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Generating percentiles for panel data

    Hello everyone,

    I am using panel data, with data per province per year/month for terrorist attacks. I would like to create variables to be able to discriminate province/year/month combinations according to the number of attacks. For example, being able to use only the top1% provinces in terms of number of attacks in a given month a given year. I would like to get the top 1%, top 5%, top 10%, top 25% and top 50% of the number of attacks every months.

    My database looks like this

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str5 ID_GADM_fine float(year month AttacksPPPM)
    "AFG4"  2008 12  0
    "AFG12" 2015 11  9
    "AFG28" 2012  4  2
    "AFG8"  2012  4  2
    "AFG1"  2009  6  0
    "AFG14" 2010  4  0
    "AFG14" 2009  6  3
    "AFG9"  2009  6  2
    "AFG34" 2009 10  1
    "AFG3"  2008 12  0
    "AFG14" 2012  4  2
    "AFG15" 2009  6  4
    "AFG12" 2010  4  0
    "AFG14" 2010  4  0
    "AFG14" 2009  6  3
    "AFG2"  2011  5  0
    "AFG14" 2009 10  2
    "AFG15" 2008 12  6
    "AFG2"  2015 11  2
    "AFG10" 2013  5  0
    "AFG3"  2010  4  0
    "AFG8"  2011  5  0
    "AFG24" 2009 10  0
    "AFG3"  2012  4  2
    "AFG15" 2011  5  1
    "AFG31" 2009  6  0
    "AFG22" 2015 11 17
    "AFG15" 2009  6  4
    "AFG12" 2011  5  0
    "AFG12" 2016  5 11
    "AFG14" 2011  5  2
    "AFG2"  2011  5  0
    "AFG12" 2009  6  1
    "AFG3"  2014  9  3
    "AFG18" 2014  9  5
    "AFG1"  2012  8  0
    "AFG14" 2008 12  5
    "AFG9"  2009 10  2
    "AFG14" 2009  6  3
    "AFG32" 2011  5  0
    "AFG18" 2016  5  6
    "AFG14" 2009 10  2
    "AFG22" 2014  9 16
    "AFG14" 2008 12  5
    "AFG12" 2009 10  6
    "AFG14" 2016  5  6
    "AFG4"  2009 10  0
    "AFG11" 2008 12  0
    "AFG26" 2009 10  0
    "AFG14" 2010  4  0
    "AFG22" 2009  6  2
    "AFG3"  2014  9  3
    "AFG14" 2010  4  0
    "AFG19" 2016  5  8
    "AFG2"  2009  6  0
    "AFG12" 2008 12  0
    "AFG19" 2012  4  1
    "AFG28" 2014  9  0
    "AFG14" 2008 12  5
    "AFG15" 2016  5  8
    "AFG14" 2013  5  1
    "AFG16" 2015 11  8
    "AFG14" 2013  5  1
    "AFG19" 2012  4  1
    "AFG24" 2009  6  1
    "AFG3"  2008 12  0
    "AFG13" 2013  5  6
    "AFG4"  2008 12  0
    "AFG14" 2015 11  8
    "AFG15" 2011  5  1
    "AFG14" 2009  6  3
    "AFG11" 2012  4 15
    "AFG12" 2015 11  9
    "AFG15" 2015 11  6
    "AFG9"  2010  4  1
    "AFG11" 2010  4  7
    "AFG19" 2014  9  9
    "AFG12" 2008 12  0
    "AFG9"  2009  6  2
    "AFG14" 2009  6  3
    "AFG33" 2010  4  1
    "AFG1"  2015 11  5
    "AFG15" 2015 11  6
    "AFG14" 2010  4  0
    "AFG14" 2009 10  2
    "AFG15" 2009 10  5
    "AFG33" 2014  9  5
    "AFG31" 2011  5  0
    "AFG13" 2014  9  9
    "AFG14" 2009 10  2
    "AFG9"  2010  4  1
    "AFG20" 2015 11  1
    "AFG14" 2014  9 13
    "AFG14" 2008 12  5
    "AFG19" 2015 11  2
    "AFG31" 2009  6  0
    "AFG12" 2009 10  6
    "AFG12" 2009  6  1
    "AFG7"  2014  9  8
    "AFG28" 2016  5  1
    end
    Also, I already looked on statalist and tried several things on my own, nothing gave the variables I am expecting. Please note that I have data for 100 countries, not only afghanistan, for a total of more than 700k observations. I hope my message is clear enough for you, and thank you in advance for your help.

    Best,
    Killian
    Last edited by Killian Foubert; 12 Mar 2019, 13:54.

  • #2

    Code:
    bysort  year month :  egen median = pctile(Attacks)
    gives you the median over Provinces, although you and/or your readers may prefer to make calculating the 50% point explicit with

    Code:
    bysort  year month :  egen median = pctile(Attacks)  , p(50)
    and so

    Code:
    gen top50 = Attacks >= median if Attacks < . 
    gives you an indicator for being greater than or equal to the median. -- where I have excluded missing values.

    But watch out: ties could mean that many more than 50% of provinces are greater than or equal to the median, just as

    Code:
    gen top50 = Attacks > median 
    could give you many less than 50%. for the same reason.

    I wouldn't even rule out, in advance of knowing more about the data here, zero being the median in some months in which case, without contradiction of anything fundamental, all of the provinces (or areas in your wider dataset) would so qualify.

    The other percentiles could be calculated by other applications of the same pctile() function with egen: see its help.

    This kind of binning is popular in many subfields of many disciplines -- and unpopular or deprecated in many others, for the reason I've given and several others.

    Naturally we can't comment on what you tried, and whether it is correct that, or why it didn't, give you what you were expecting.

    Comment


    • #3
      Thank you for your help, your codes are perfect for what I wanted

      Best,
      Killian

      Comment

      Working...
      X