Top/Bottom 5% firms

Marco Errico

Join Date: Apr 2020
Posts: 187

Top/Bottom 5% firms

22 Apr 2023, 13:12

Dear All,

I have a panel dataset of firms.
Firms are either categorized "R1000" or "R2000" (Index variable)

I need to create an indicator that tells me what are the top 5% of "R2000" firms and what are the bottom 5%of group "R1000" firms based on a size variable (market_c). This categorization has to be done only in a specific year (2017)

The identifier of my firm is cusip_id
Size variable is "market_c"

Thank you (below a dataex)

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float cusip_id str5 Index float market_c
  2 "R2000"   214.848
  5 "R2000" 1923.9242
  6 "R2000"  1551.458
 12 "R2000" 1301.5448
 39 "R2000"  2749.161
 57 "R1000"  5682.542
 70 "R2000"   535.949
 71 "R1000"  7900.347
 75 "R1000"  7152.002
 81 "R2000"  1782.249
 85 "R2000" 128.16531
129 "R2000"  451.5998
131 "R1000"  3303.477
132 "R2000"  1927.379
143 "R2000" 2338.4392
154 "R2000"  750.9069
164 "R2000"  115.4283
165 "R2000"  3351.809
181 "R1000" 238700.84
199 "R2000"  553.4622
213 "R2000"  885.5967
224 "R2000"  372.3606
228 "R2000"  2787.583
232 "R2000"  1060.425
239 "R2000"  1602.933
243 "R1000"  99507.37
246 "R1000" 153974.98
248 "R2000" 1412.3184
285 "R1000"  12912.68
293 "R2000"  408.0894
297 "R2000"   205.092
317 "R2000" 1434.9253
319 "R1000"  2840.768
320 "R1000" 3746.0154
321 "R2000"  2290.251
328 "R2000" 1458.6588
344 "R2000"  1920.877
355 "R2000"   397.765
363 "R2000"  464.9523
368 "R2000" 103.07047
369 "R2000"  412.7687
371 "R2000"  397.1347
372 "R2000"  2654.589
380 "R2000"  760.9336
389 "R2000"  995.8163
392 "R2000"  938.1848
397 "R2000" 294.49908
419 "R1000"  47965.21
422 "R1000"  7402.551
423 "R2000"  497.3124
424 "R2000" 1570.0172
438 "R2000"  790.3486
448 "R2000"   183.483
458 "R2000"  404.7936
461 "R2000" 1198.1132
480 "R1000"  89149.31
482 "R2000" 2366.9795
492 "R1000"   7370.68
537 "R1000"  5798.643
590 "R2000"  825.5341
593 "R2000"  2207.643
597 "R1000" 1282.4198
598 "R2000"  171.5525
616 "R2000"  315.8169
618 "R2000"   2296.32
621 "R2000"   9940.76
623 "R2000" 1462.7284
625 "R2000" 2118.4746
635 "R2000"  2672.478
645 "R2000" 1303.0405
669 "R2000"   71.1984
678 "R1000"  21903.96
683 "R2000" 331.56155
684 "R2000"   278.034
686 "R2000"  91.96034
688 "R1000"  2791.382
692 "R2000"  1594.897
698 "R2000"  372.5234
723 "R2000" 1932.2616
732 "R1000"  4983.182
733 "R1000"  33018.28
736 "R2000"  1366.579
760 "R1000"  11049.84
761 "R1000"  4031.683
762 "R2000"  708.0053
785 "R2000" 1301.8427
792 "R2000" 1781.8755
793 "R1000"  9046.214
800 "R2000" 1978.6287
802 "R1000" 14137.855
816 "R1000"  9976.778
819 "R2000"  776.8024
828 "R2000"  1367.582
830 "R2000"  2021.606
831 "R1000" 13030.792
832 "R1000"  26584.86
841 "R2000" 281.36188
842 "R1000" 17784.088
873 "R2000"  3038.188
880 "R2000" 2486.2134
end

Tags: None

Clyde Schechter

Join Date: Apr 2014

Posts: 30174
#2

22 Apr 2023, 13:21

Code:

assert !missing(market_c) by Index (market_c), sort: gen byte top_five_pct = _n >= 0.95*_N by Index (market_c): gen byte bottom_five_pct = _n <= 0.05*_N list if top_five_pct & Index == "R2000" | bottom_five_pct & Index == "R1000"

Added: For the code to work correctly, there cannot be any observations with missing values of market_c. This assumption, which holds true in the example, is verified in the code.

Also added: It is possible that there will be observations that have the same value for market_c and fall at the 5th or 95th percentile. In this situation, the code above selects as many of them, at random, as needed so that a total of 5% (rounded down) of the total observations are selected. This random selection is not reproducible on repeated runs of the code. If this is not how you want ties to be handled, post back describing what you would like to do, and the code can be modified accordingly.

Last edited by Clyde Schechter; 22 Apr 2023, 13:26.
Comment
Marco Errico

Join Date: Apr 2020

Posts: 187
#3

22 Apr 2023, 13:29

Thanks, Clyde Schechter

Assert is telling me the following
"21,673 contradictions in 155,064 observations
assertion is false"

because I do have some missing values based on market_c that needs to be excluded.

How can I tackle the issue?
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30174
#4

22 Apr 2023, 14:49

The following code will tolerate missing values for market_c, of which you apparently have many. It excludes them from the calculation altogether (which is to say that they do not count towards the N which determines just how many are in the top or bottom 5%). It is as if they were dropped from the data set altogether, although they actually do remain.

I have also changed the code so that the number of observations counted for the top or bottom 5% is based .05*_N or .95*_N rounded to the nearest integer, rather than rounding down as before.

You have said nothing about ties, so I'm inferring that your happy with the method of tie-breaking described in #2.

Code:

gen byte populated = !missing(market_c) by Index populated (market_c), sort: gen byte top_five_pct = _n >= round(0.95*_N) /// if populated by Index populated (market_c): gen byte bottom_five_pct = _n <= round(0.05*_N) /// if populated list if top_five_pct == 1 & Index == "R2000" | bottom_five_pct == 1 & Index == "R1000"
Comment
Marco Errico

Join Date: Apr 2020

Posts: 187
#5

22 Apr 2023, 15:08

Hi Clyde,

Thanks for your message and kindly helping me.

Just an additional question. These indicators (i.e. top 5 R2000 and bottom 5 R1000 percent firms ) needs to be done as I am looking to identify those in year 2017.
To do this distinction in that specific year is it correct to add to you codes the following adjustmnet?

Code:

gen byte populated = !missing(market_c) by Index populated (market_c), sort: gen byte top_five_pct = _n >= round(0.95*_N) /// if populated & year ==2017 by Index populated (market_c): gen byte bottom_five_pct = _n <= round(0.05*_N) /// if populated & year ==2017 list if top_five_pct == 1 & Index == "R2000" | bottom_five_pct == 1 & Index == "R1000"
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30174
#6

22 Apr 2023, 15:12

Yes, I believe that will correctly restrict the calculations to observations with year == 2017. I didn't pay attention to that part of the problem because the example data didn't even include a year variable, so it would not have been possible to code for it.

I appreciate that you used -dataex- to show your example data--it makes life so much easier! In the future, when you choose your example, including all variables that are relevant to your problem, and data that is fully representative of your complete data set will enhance your chances of getting a working, correct solution the first time.
Comment
Marco Errico

Join Date: Apr 2020

Posts: 187
#7

22 Apr 2023, 15:23

Thank you, Clyde Schechter .
My fault in not including all the variables.

You have been truly helpful!
Comment
Marco Errico

Join Date: Apr 2020

Posts: 187
#8

23 Apr 2023, 10:12

Hi Clyde Schechter

Just a quick follow up regarding this thread.

The final goal is to create an indicator which is equal to if firms are in the top 5% of R2000 and 0 if the firms are in the bottom 5% of R1000 (I am looking to adopt an RDD design for my research).

Therefore, after the code that you suggested in #4, I create the following indicator

Code:

gen rdd_treat=1 if top_five_pct==1 & Index=="R2000" replace rdd_treat=0 if bottom_five_pct==1 & Index=="R1000"

Then, when I tab how many "treated" and "control" firms I have, the following statistics surprised me:

Code:

tab rdd_treat rdd_treat=1 ; 114 obs rdd_treat=0 ; 21 obs

May I ask yo uwhat could be the driver of the following statistics tilted toward treated firm (i.e. top R2000 firms)

Thank you!
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30174
#9

23 Apr 2023, 10:24

Well, in your example data, 72% of all the observations are R2000. If your example is representative of your entire data set, then I would expect the kind of results you got. 5% of a larger number is going to be larger than 5% of a smaller number.
Comment
Marco Errico

Join Date: Apr 2020

Posts: 187
#10

23 Apr 2023, 10:28

Mine was silly point. It is indeed the case that my whole sample is larger for R2000 : )
Comment

Announcement

Top/Bottom 5% firms

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment