Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Create for each time period the median of the 350 biggest values

    Hello everyone,

    I am currently sitting on an issue regarding taking the median of the 350 highest values per period. Basically I have the market equity (variable "me") for each year (variable "monthly_date") for multiple companies (variable "companies"), I want Stata to group all companies within that year, take the 350 companies with the highest market value and from the 350 companies with the highest market value I want to calculate the median.

    My data looks like this:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input int company float(monthly_date me)
     2 677   73280
     2 689  109610
     2 701  244600
     2 713  563190
     3 641   36570
     3 653   44750
     3 665  143170
     3 677   54250
     3 689   18690
     5 605   76750
     5 617   81250
     5 629   81000
     5 641  112480
     6 425   18190
     6 473   27920
     6 485   28800
     6 497   22210
     6 509   24310
     6 521   16050
     6 533   17640
     6 545   11290
     6 557    5120
     6 569    5460
     6 581    3220
     6 593    2930
     6 605    4780
     6 617    1110
     7 581  157430
     7 593  240560
     7 605  440400
     7 617  757340
     7 629  827040
     7 641  904180
     7 653  761700
     7 665 1041450
     7 677 1560010
     7 689 1990410
     7 701 2735150
     7 713 3030950
     8 425   77470
     8 437   92210
     8 449   80130
     8 461  174640
     8 473  191140
     8 485  825560
     8 497  892010
     8 509  340390
     8 521   95670
     8 533  191290
     8 545  287880
     8 557  953860
     8 569 1232120
     8 581  949150
     8 593  986490
     8 605 1426260
     8 617 2553610
     8 629 2979150
     8 641 4589000
     8 653 5965620
     8 665 5379720
     8 677 3682250
     8 689 3980100
     9 689  102530
     9 701   19490
    10 641  393680
    10 653  843550
    10 665 1238050
    10 677 1849900
    10 689 1221640
    10 701  514660
    10 713  727080
    11 581    3160
    11 593    6980
    11 605   11150
    11 617    7070
    11 629    7680
    11 641    7640
    11 653    7930
    11 665   11710
    11 677   15030
    11 689   11860
    11 701   23300
    11 713   35440
    12 545     470
    12 557    1440
    12 569    5680
    12 581    3830
    12 593   10080
    12 605   18070
    12 617   31590
    12 629   52380
    12 641  111210
    12 653  104650
    12 665  121710
    12 677  253560
    12 689  392530
    12 701  680960
    12 713  189260
    13 569   95150
    13 581   48910
    end
    format %tm monthly_date
    I was thinking to do the command rowsort but I am not sure how to use it when I have a time variable that needs to be considered

    After collecting the data I would simply run this command:
    Code:
    bys monthly_date: egen size = pctile(me), p(50)

    Happy New Year to everyone and stay healthy

  • #2
    The median of any subset with an even number of values is just the midpoint of the two central values (the "co-medians").

    Code:
    gen year = year(dofm(monthly_date)) 
    
    gen OK  = !missing(me) 
    
    bysort OK  year (me) : gen wanted = (me[_N - 174] + me[_N - 175])/2
    If there are no missings, you don't have to ignore them explicitly.

    Comment


    • #3
      Hello Nick,

      thanks for the help. I just want to make sure that I did understand the code correctly.

      Code:
      gen OK = !missing(me)
      generates a 1 if me is not missing and a 0 if me is missing correct?

      Code:
      bys OK year (me): gen wanted = (me[_N - 174] + me[_N-175)/2
      gives me the median of the top 350 companies with the highest me?

      I am confused about how the _N - 17X works. Does it count from the top to the 174 and 175 position or from the bottom? Because my whole dataset contains like 1700 companies

      Besides this understanding issue I do have a follow up question. After collecting the median for each year I do need to calculate other breakpoints based on the 350 companies with the highest "me"

      For that I would use a similar method like you did above. However this time I do need to divide this variable in 30% and 70%

      Code:
      bys OK year (me) : egen bm30 = pctile(bm), p(30)
      bys OK year (me) : egen bm70 = pctile(bm), p(70)
      I guess I do have an issue on my second question because it does not consider the 350 highest "me" but takes all "me" per year, am I correct?


      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input int company float(monthly_date me year bm OK wanted)
       400 425  1140 1995   3.421875 1 73115
       870 425  1460 1995  1.2028985 1 73115
       714 425  1910 1995     .54363 1 73115
       594 425  2190 1995   11.70266 1 73115
      1555 425  2240 1995   1.781771 1 73115
      1699 425  2400 1995  1.2918367 1 73115
      1869 425  2520 1995   .7688797 1 73115
      1834 425  2540 1995  1.3891753 1 73115
       800 425  2590 1995  1.5831933 1 73115
      1308 425  2660 1995    .829638 1 73115
       854 425  2730 1995    8.94359 1 73115
      1319 425  2800 1995  1.0613793 1 73115
       853 425  2870 1995  1.1307229 1 73115
       647 425  2900 1995   .9397531 1 73115
       882 425  2930 1995   .3992537 1 73115
       257 425  3030 1995  1.3973404 1 73115
       234 425  3060 1995    .167028 1 73115
      1739 425  3460 1995  2.0029178 1 73115
      1792 425  3510 1995   .6257991 1 73115
       557 425  3600 1995   2.041154 1 73115
      1593 425  3610 1995   .4975052 1 73115
       317 425  3760 1995   .5827225 1 73115
       361 425  4040 1995   .7733634 1 73115
       387 425  4050 1995    .595914 1 73115
      1124 425  4050 1995  1.0543897 1 73115
       502 425  4190 1995   5.217081 1 73115
       691 425  4240 1995   .8116505 1 73115
      1324 425  4500 1995   .6756141 1 73115
       351 425  4520 1995   .9727129 1 73115
       253 425  4540 1995  .56383365 1 73115
      1788 425  4850 1995   .3164188 1 73115
       411 425  4930 1995  .16226415 1 73115
      1602 425  5170 1995  1.0855863 1 73115
       955 425  5400 1995   .7681521 1 73115
      1112 425  5620 1995   .7160237 1 73115
       548 425  6190 1995   .6262238 1 73115
      1032 425  6200 1995  1.1042227 1 73115
       763 425  6320 1995  1.8790613 1 73115
        50 425  6670 1995   .4432099 1 73115
      1625 425  7040 1995   .8115227 1 73115
      1957 425  7070 1995   1.486578 1 73115
      1266 425  7130 1995   1.069496 1 73115
      1532 425  7350 1995   .7292767 1 73115
       498 425  7400 1995   .6091393 1 73115
      1816 425  7490 1995   1.452857 1 73115
       418 425  7540 1995   .5669753 1 73115
       501 425  7860 1995    .742842 1 73115
       327 425  8230 1995  1.0772694 1 73115
       626 425  8320 1995   .8155639 1 73115
       395 425  8580 1995   .2493808 1 73115
       975 425  8890 1995  1.4188157 1 73115
      1796 425  9170 1995   .8614035 1 73115
      1083 425  9270 1995  1.5142704 1 73115
      1015 425  9350 1995  1.7922865 1 73115
      1139 425  9380 1995   .9163461 1 73115
      1918 425  9420 1995     .58775 1 73115
       813 425 10150 1995  .03742954 1 73115
       260 425 10180 1995   .8891249 1 73115
       391 425 10480 1995   .7968128 1 73115
      1671 425 10810 1995   .7357929 1 73115
       890 425 11400 1995   .9899588 1 73115
       196 425 11420 1995   .4636129 1 73115
       258 425 12130 1995  1.7336928 1 73115
      1448 425 12140 1995  1.3576636 1 73115
       194 425 12340 1995   .1129618 1 73115
       433 425 12410 1995   .6019747 1 73115
      1483 425 12440 1995   .3404339 1 73115
      1611 425 12500 1995   .4067064 1 73115
      1355 425 13250 1995  .52514005 1 73115
       754 425 13290 1995   .6709251 1 73115
      1873 425 13310 1995  .03820471 1 73115
       644 425 13950 1995   .3500248 1 73115
      1360 425 14270 1995  .16612904 1 73115
        61 425 14350 1995   .4520993 1 73115
      1540 425 15090 1995  1.7540174 1 73115
       905 425 15180 1995   .3617155 1 73115
      1859 425 15900 1995  .28906968 1 73115
      1575 425 16290 1995   .2946556 1 73115
       711 425 16380 1995   .7724118 1 73115
       360 425 16960 1995  1.1812977 1 73115
      1077 425 17310 1995   .7480596 1 73115
       488 425 17950 1995  1.0760279 1 73115
      1248 425 18140 1995  1.5931567 1 73115
         6 425 18190 1995   .9283818 1 73115
      1458 425 18620 1995    .986859 1 73115
       457 425 18730 1995   .6261889 1 73115
       292 425 19080 1995   .5548207 1 73115
      1640 425 19340 1995   .3314667 1 73115
      1098 425 20010 1995   .2486769 1 73115
       507 425 20050 1995    .840311 1 73115
      1012 425 20280 1995   .4948205 1 73115
       757 425 20390 1995   .1787037 1 73115
       877 425 22300 1995 .012960974 1 73115
        86 425 22480 1995  .08883929 1 73115
       830 425 24220 1995   .4337032 1 73115
       445 425 24570 1995   .3627119 1 73115
        37 425 24630 1995   .4689034 1 73115
       425 425 24820 1995   .5664821 1 73115
       588 425 24930 1995  .19008605 1 73115
       496 425 25030 1995    .405898 1 73115
      end
      format %tm monthly_date
      thanks

      Comment


      • #4
        Code:
        _N
        indexes the last observation in a group; hence data are sorted on variable y

        Code:
        y[_N]
        is the highest value and

        Code:
        y[_N-1]
        is the second highest value and

        Code:
        y[_N-174
        is the 175th highest value and so on.

        You can extend this easily to get approximations for other percentiles without firing up egen.

        Or you can identify a subset with


        Code:
         
          gen OK  = !missing(me)   bysort OK  year (me) : gen touse = (me < .) & _n > (_N - 350)   
         by OK year: egen bm30 = pctile(bm) if touse, p(30)

        Comment


        • #5
          Thank you that worked perfectly!

          I do have one more question about a similar problem. I want to remove the lowest 5% of "me". I wanted to do it by using the drop if command

          Code:
          drop if me <= pcentile(me), p(5)
          but this is not working than I tried following
          Code:
          _pctile me, p(95)
          local w2=r(r2)
          but this seems not to be the full code and I miss something.

          Comment


          • #6
            pcentile() is not a general Stata function or even an egen function. What led you to try that? Otherwise you need syntax like that you used in #3.

            Comment


            • #7
              Hello Nick,

              I found that pcentile here in the forum but it was used with keep if inrange (me, w1, w2).

              I am not sure if I do this right because what I try to do is to delete the lowest 5% of ME for a specific year +month

              I tried following code
              Code:
              keep if month == 6
              gen OK = !missing(me)
              bys OK monthly_date (me): egen wanted = pctile(me), p(95)
              I think this is not completly correct as it provides me with the 95% "me" per year month

              monthly_date = year+month








              Comment


              • #8
                As in #6 there is no pcentile() function in Stata. But it seems that you meant pctile().

                Indeed; the 95th percentile is not the 5th percentile. So it seems that all you need to do next is change 95 to 5.
                Last edited by Nick Cox; 02 Jan 2021, 05:11.

                Comment


                • #9
                  Sorry Nick my bad.

                  After applying this code I only need to one more step and the final code would look something like this:

                  Code:
                  keep if month == 6
                  gen OK = !missing(me)
                  bys OK monthly_date (me): egen wanted = pctile(me), p(5)
                  rep me = . if me <= wanted
                  drop if mi(me)
                  based on my intention to delete the lowest 5 percent per year month, is this correct?
                  Last edited by Marius Bauer; 02 Jan 2021, 06:33.

                  Comment


                  • #10
                    For what you want, this seems more direct:

                    Code:
                     
                     drop if me <= wanted | mi(me)

                    Comment

                    Working...
                    X