Create for each time period the median of the 350 biggest values

Marius Bauer

Join Date: Nov 2020
Posts: 55

Create for each time period the median of the 350 biggest values

31 Dec 2020, 09:51

Hello everyone,

I am currently sitting on an issue regarding taking the median of the 350 highest values per period. Basically I have the market equity (variable "me") for each year (variable "monthly_date") for multiple companies (variable "companies"), I want Stata to group all companies within that year, take the 350 companies with the highest market value and from the 350 companies with the highest market value I want to calculate the median.

My data looks like this:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int company float(monthly_date me)
 2 677   73280
 2 689  109610
 2 701  244600
 2 713  563190
 3 641   36570
 3 653   44750
 3 665  143170
 3 677   54250
 3 689   18690
 5 605   76750
 5 617   81250
 5 629   81000
 5 641  112480
 6 425   18190
 6 473   27920
 6 485   28800
 6 497   22210
 6 509   24310
 6 521   16050
 6 533   17640
 6 545   11290
 6 557    5120
 6 569    5460
 6 581    3220
 6 593    2930
 6 605    4780
 6 617    1110
 7 581  157430
 7 593  240560
 7 605  440400
 7 617  757340
 7 629  827040
 7 641  904180
 7 653  761700
 7 665 1041450
 7 677 1560010
 7 689 1990410
 7 701 2735150
 7 713 3030950
 8 425   77470
 8 437   92210
 8 449   80130
 8 461  174640
 8 473  191140
 8 485  825560
 8 497  892010
 8 509  340390
 8 521   95670
 8 533  191290
 8 545  287880
 8 557  953860
 8 569 1232120
 8 581  949150
 8 593  986490
 8 605 1426260
 8 617 2553610
 8 629 2979150
 8 641 4589000
 8 653 5965620
 8 665 5379720
 8 677 3682250
 8 689 3980100
 9 689  102530
 9 701   19490
10 641  393680
10 653  843550
10 665 1238050
10 677 1849900
10 689 1221640
10 701  514660
10 713  727080
11 581    3160
11 593    6980
11 605   11150
11 617    7070
11 629    7680
11 641    7640
11 653    7930
11 665   11710
11 677   15030
11 689   11860
11 701   23300
11 713   35440
12 545     470
12 557    1440
12 569    5680
12 581    3830
12 593   10080
12 605   18070
12 617   31590
12 629   52380
12 641  111210
12 653  104650
12 665  121710
12 677  253560
12 689  392530
12 701  680960
12 713  189260
13 569   95150
13 581   48910
end
format %tm monthly_date

I was thinking to do the command rowsort but I am not sure how to use it when I have a time variable that needs to be considered

After collecting the data I would simply run this command:

Code:

bys monthly_date: egen size = pctile(me), p(50)

Happy New Year to everyone and stay healthy

Tags: None

Nick Cox

Join Date: Mar 2014

Posts: 35693
#2

31 Dec 2020, 10:23

The median of any subset with an even number of values is just the midpoint of the two central values (the "co-medians").

Code:

gen year = year(dofm(monthly_date)) gen OK = !missing(me) bysort OK year (me) : gen wanted = (me[_N - 174] + me[_N - 175])/2

If there are no missings, you don't have to ignore them explicitly.
Comment

Marius Bauer

Join Date: Nov 2020
Posts: 55

01 Jan 2021, 07:10

Hello Nick,

thanks for the help. I just want to make sure that I did understand the code correctly.

Code:

gen OK = !missing(me)

generates a 1 if me is not missing and a 0 if me is missing correct?

Code:

bys OK year (me): gen wanted = (me[_N - 174] + me[_N-175)/2

gives me the median of the top 350 companies with the highest me?

I am confused about how the _N - 17X works. Does it count from the top to the 174 and 175 position or from the bottom? Because my whole dataset contains like 1700 companies

Besides this understanding issue I do have a follow up question. After collecting the median for each year I do need to calculate other breakpoints based on the 350 companies with the highest "me"

For that I would use a similar method like you did above. However this time I do need to divide this variable in 30% and 70%

Code:

bys OK year (me) : egen bm30 = pctile(bm), p(30)
bys OK year (me) : egen bm70 = pctile(bm), p(70)

I guess I do have an issue on my second question because it does not consider the 350 highest "me" but takes all "me" per year, am I correct?

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int company float(monthly_date me year bm OK wanted)
 400 425  1140 1995   3.421875 1 73115
 870 425  1460 1995  1.2028985 1 73115
 714 425  1910 1995     .54363 1 73115
 594 425  2190 1995   11.70266 1 73115
1555 425  2240 1995   1.781771 1 73115
1699 425  2400 1995  1.2918367 1 73115
1869 425  2520 1995   .7688797 1 73115
1834 425  2540 1995  1.3891753 1 73115
 800 425  2590 1995  1.5831933 1 73115
1308 425  2660 1995    .829638 1 73115
 854 425  2730 1995    8.94359 1 73115
1319 425  2800 1995  1.0613793 1 73115
 853 425  2870 1995  1.1307229 1 73115
 647 425  2900 1995   .9397531 1 73115
 882 425  2930 1995   .3992537 1 73115
 257 425  3030 1995  1.3973404 1 73115
 234 425  3060 1995    .167028 1 73115
1739 425  3460 1995  2.0029178 1 73115
1792 425  3510 1995   .6257991 1 73115
 557 425  3600 1995   2.041154 1 73115
1593 425  3610 1995   .4975052 1 73115
 317 425  3760 1995   .5827225 1 73115
 361 425  4040 1995   .7733634 1 73115
 387 425  4050 1995    .595914 1 73115
1124 425  4050 1995  1.0543897 1 73115
 502 425  4190 1995   5.217081 1 73115
 691 425  4240 1995   .8116505 1 73115
1324 425  4500 1995   .6756141 1 73115
 351 425  4520 1995   .9727129 1 73115
 253 425  4540 1995  .56383365 1 73115
1788 425  4850 1995   .3164188 1 73115
 411 425  4930 1995  .16226415 1 73115
1602 425  5170 1995  1.0855863 1 73115
 955 425  5400 1995   .7681521 1 73115
1112 425  5620 1995   .7160237 1 73115
 548 425  6190 1995   .6262238 1 73115
1032 425  6200 1995  1.1042227 1 73115
 763 425  6320 1995  1.8790613 1 73115
  50 425  6670 1995   .4432099 1 73115
1625 425  7040 1995   .8115227 1 73115
1957 425  7070 1995   1.486578 1 73115
1266 425  7130 1995   1.069496 1 73115
1532 425  7350 1995   .7292767 1 73115
 498 425  7400 1995   .6091393 1 73115
1816 425  7490 1995   1.452857 1 73115
 418 425  7540 1995   .5669753 1 73115
 501 425  7860 1995    .742842 1 73115
 327 425  8230 1995  1.0772694 1 73115
 626 425  8320 1995   .8155639 1 73115
 395 425  8580 1995   .2493808 1 73115
 975 425  8890 1995  1.4188157 1 73115
1796 425  9170 1995   .8614035 1 73115
1083 425  9270 1995  1.5142704 1 73115
1015 425  9350 1995  1.7922865 1 73115
1139 425  9380 1995   .9163461 1 73115
1918 425  9420 1995     .58775 1 73115
 813 425 10150 1995  .03742954 1 73115
 260 425 10180 1995   .8891249 1 73115
 391 425 10480 1995   .7968128 1 73115
1671 425 10810 1995   .7357929 1 73115
 890 425 11400 1995   .9899588 1 73115
 196 425 11420 1995   .4636129 1 73115
 258 425 12130 1995  1.7336928 1 73115
1448 425 12140 1995  1.3576636 1 73115
 194 425 12340 1995   .1129618 1 73115
 433 425 12410 1995   .6019747 1 73115
1483 425 12440 1995   .3404339 1 73115
1611 425 12500 1995   .4067064 1 73115
1355 425 13250 1995  .52514005 1 73115
 754 425 13290 1995   .6709251 1 73115
1873 425 13310 1995  .03820471 1 73115
 644 425 13950 1995   .3500248 1 73115
1360 425 14270 1995  .16612904 1 73115
  61 425 14350 1995   .4520993 1 73115
1540 425 15090 1995  1.7540174 1 73115
 905 425 15180 1995   .3617155 1 73115
1859 425 15900 1995  .28906968 1 73115
1575 425 16290 1995   .2946556 1 73115
 711 425 16380 1995   .7724118 1 73115
 360 425 16960 1995  1.1812977 1 73115
1077 425 17310 1995   .7480596 1 73115
 488 425 17950 1995  1.0760279 1 73115
1248 425 18140 1995  1.5931567 1 73115
   6 425 18190 1995   .9283818 1 73115
1458 425 18620 1995    .986859 1 73115
 457 425 18730 1995   .6261889 1 73115
 292 425 19080 1995   .5548207 1 73115
1640 425 19340 1995   .3314667 1 73115
1098 425 20010 1995   .2486769 1 73115
 507 425 20050 1995    .840311 1 73115
1012 425 20280 1995   .4948205 1 73115
 757 425 20390 1995   .1787037 1 73115
 877 425 22300 1995 .012960974 1 73115
  86 425 22480 1995  .08883929 1 73115
 830 425 24220 1995   .4337032 1 73115
 445 425 24570 1995   .3627119 1 73115
  37 425 24630 1995   .4689034 1 73115
 425 425 24820 1995   .5664821 1 73115
 588 425 24930 1995  .19008605 1 73115
 496 425 25030 1995    .405898 1 73115
end
format %tm monthly_date

thanks

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35693
#4

01 Jan 2021, 07:38

Code:

_N

indexes the last observation in a group; hence data are sorted on variable y

Code:

y[_N]

is the highest value and

Code:

y[_N-1]

is the second highest value and

Code:

y[_N-174

is the 175th highest value and so on.

You can extend this easily to get approximations for other percentiles without firing up egen.

Or you can identify a subset with

Code:

gen OK = !missing(me) bysort OK year (me) : gen touse = (me < .) & _n > (_N - 350) by OK year: egen bm30 = pctile(bm) if touse, p(30)
Comment
Marius Bauer

Join Date: Nov 2020

Posts: 55
#5

01 Jan 2021, 10:16

Thank you that worked perfectly!

I do have one more question about a similar problem. I want to remove the lowest 5% of "me". I wanted to do it by using the drop if command

Code:

drop if me <= pcentile(me), p(5)

but this is not working than I tried following

Code:

_pctile me, p(95) local w2=r(r2)

but this seems not to be the full code and I miss something.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35693
#6

01 Jan 2021, 10:31

pcentile() is not a general Stata function or even an egen function. What led you to try that? Otherwise you need syntax like that you used in #3.
Comment
Marius Bauer

Join Date: Nov 2020

Posts: 55
#7

02 Jan 2021, 04:01

Hello Nick,

I found that pcentile here in the forum but it was used with keep if inrange (me, w1, w2).

I am not sure if I do this right because what I try to do is to delete the lowest 5% of ME for a specific year +month

I tried following code

Code:

keep if month == 6 gen OK = !missing(me) bys OK monthly_date (me): egen wanted = pctile(me), p(95)

I think this is not completly correct as it provides me with the 95% "me" per year month

monthly_date = year+month
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35693
#8

02 Jan 2021, 04:49

As in #6 there is no pcentile() function in Stata. But it seems that you meant pctile().

Indeed; the 95th percentile is not the 5th percentile. So it seems that all you need to do next is change 95 to 5.

Last edited by Nick Cox; 02 Jan 2021, 05:11.
1 like
Comment
Marius Bauer

Join Date: Nov 2020

Posts: 55
#9

02 Jan 2021, 05:37

Sorry Nick my bad.

After applying this code I only need to one more step and the final code would look something like this:

Code:

keep if month == 6 gen OK = !missing(me) bys OK monthly_date (me): egen wanted = pctile(me), p(5) rep me = . if me <= wanted drop if mi(me)

based on my intention to delete the lowest 5 percent per year month, is this correct?

Last edited by Marius Bauer; 02 Jan 2021, 06:33.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35693
#10

02 Jan 2021, 06:51

For what you want, this seems more direct:

Code:

drop if me <= wanted | mi(me)
Comment

Announcement

Create for each time period the median of the 350 biggest values

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment