Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Top/Bottom 5% firms

    Dear All,

    I have a panel dataset of firms.
    Firms are either categorized "R1000" or "R2000" (Index variable)

    I need to create an indicator that tells me what are the top 5% of "R2000" firms and what are the bottom 5%of group "R1000" firms based on a size variable (market_c). This categorization has to be done only in a specific year (2017)

    The identifier of my firm is cusip_id
    Size variable is "market_c"

    Thank you (below a dataex)


    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float cusip_id str5 Index float market_c
      2 "R2000"   214.848
      5 "R2000" 1923.9242
      6 "R2000"  1551.458
     12 "R2000" 1301.5448
     39 "R2000"  2749.161
     57 "R1000"  5682.542
     70 "R2000"   535.949
     71 "R1000"  7900.347
     75 "R1000"  7152.002
     81 "R2000"  1782.249
     85 "R2000" 128.16531
    129 "R2000"  451.5998
    131 "R1000"  3303.477
    132 "R2000"  1927.379
    143 "R2000" 2338.4392
    154 "R2000"  750.9069
    164 "R2000"  115.4283
    165 "R2000"  3351.809
    181 "R1000" 238700.84
    199 "R2000"  553.4622
    213 "R2000"  885.5967
    224 "R2000"  372.3606
    228 "R2000"  2787.583
    232 "R2000"  1060.425
    239 "R2000"  1602.933
    243 "R1000"  99507.37
    246 "R1000" 153974.98
    248 "R2000" 1412.3184
    285 "R1000"  12912.68
    293 "R2000"  408.0894
    297 "R2000"   205.092
    317 "R2000" 1434.9253
    319 "R1000"  2840.768
    320 "R1000" 3746.0154
    321 "R2000"  2290.251
    328 "R2000" 1458.6588
    344 "R2000"  1920.877
    355 "R2000"   397.765
    363 "R2000"  464.9523
    368 "R2000" 103.07047
    369 "R2000"  412.7687
    371 "R2000"  397.1347
    372 "R2000"  2654.589
    380 "R2000"  760.9336
    389 "R2000"  995.8163
    392 "R2000"  938.1848
    397 "R2000" 294.49908
    419 "R1000"  47965.21
    422 "R1000"  7402.551
    423 "R2000"  497.3124
    424 "R2000" 1570.0172
    438 "R2000"  790.3486
    448 "R2000"   183.483
    458 "R2000"  404.7936
    461 "R2000" 1198.1132
    480 "R1000"  89149.31
    482 "R2000" 2366.9795
    492 "R1000"   7370.68
    537 "R1000"  5798.643
    590 "R2000"  825.5341
    593 "R2000"  2207.643
    597 "R1000" 1282.4198
    598 "R2000"  171.5525
    616 "R2000"  315.8169
    618 "R2000"   2296.32
    621 "R2000"   9940.76
    623 "R2000" 1462.7284
    625 "R2000" 2118.4746
    635 "R2000"  2672.478
    645 "R2000" 1303.0405
    669 "R2000"   71.1984
    678 "R1000"  21903.96
    683 "R2000" 331.56155
    684 "R2000"   278.034
    686 "R2000"  91.96034
    688 "R1000"  2791.382
    692 "R2000"  1594.897
    698 "R2000"  372.5234
    723 "R2000" 1932.2616
    732 "R1000"  4983.182
    733 "R1000"  33018.28
    736 "R2000"  1366.579
    760 "R1000"  11049.84
    761 "R1000"  4031.683
    762 "R2000"  708.0053
    785 "R2000" 1301.8427
    792 "R2000" 1781.8755
    793 "R1000"  9046.214
    800 "R2000" 1978.6287
    802 "R1000" 14137.855
    816 "R1000"  9976.778
    819 "R2000"  776.8024
    828 "R2000"  1367.582
    830 "R2000"  2021.606
    831 "R1000" 13030.792
    832 "R1000"  26584.86
    841 "R2000" 281.36188
    842 "R1000" 17784.088
    873 "R2000"  3038.188
    880 "R2000" 2486.2134
    end

  • #2
    Code:
    assert !missing(market_c)
    by Index (market_c), sort: gen byte top_five_pct = _n >= 0.95*_N
    by Index (market_c): gen byte bottom_five_pct = _n <= 0.05*_N
    
    list if top_five_pct & Index == "R2000" | bottom_five_pct & Index == "R1000"
    Added: For the code to work correctly, there cannot be any observations with missing values of market_c. This assumption, which holds true in the example, is verified in the code.

    Also added: It is possible that there will be observations that have the same value for market_c and fall at the 5th or 95th percentile. In this situation, the code above selects as many of them, at random, as needed so that a total of 5% (rounded down) of the total observations are selected. This random selection is not reproducible on repeated runs of the code. If this is not how you want ties to be handled, post back describing what you would like to do, and the code can be modified accordingly.
    Last edited by Clyde Schechter; 22 Apr 2023, 13:26.

    Comment


    • #3
      Thanks, Clyde Schechter

      Assert is telling me the following
      "21,673 contradictions in 155,064 observations
      assertion is false"

      because I do have some missing values based on market_c that needs to be excluded.

      How can I tackle the issue?

      Comment


      • #4
        The following code will tolerate missing values for market_c, of which you apparently have many. It excludes them from the calculation altogether (which is to say that they do not count towards the N which determines just how many are in the top or bottom 5%). It is as if they were dropped from the data set altogether, although they actually do remain.

        I have also changed the code so that the number of observations counted for the top or bottom 5% is based .05*_N or .95*_N rounded to the nearest integer, rather than rounding down as before.

        You have said nothing about ties, so I'm inferring that your happy with the method of tie-breaking described in #2.

        Code:
        gen byte populated = !missing(market_c)
        by Index populated (market_c), sort: gen byte top_five_pct = _n >= round(0.95*_N) ///
            if populated
        by Index populated (market_c): gen byte bottom_five_pct = _n <= round(0.05*_N) ///
            if populated
        
        list if top_five_pct == 1 & Index == "R2000" | bottom_five_pct == 1 & Index == "R1000"

        Comment


        • #5
          Hi Clyde,

          Thanks for your message and kindly helping me.

          Just an additional question. These indicators (i.e. top 5 R2000 and bottom 5 R1000 percent firms ) needs to be done as I am looking to identify those in year 2017.
          To do this distinction in that specific year is it correct to add to you codes the following adjustmnet?
          Code:
           
           gen byte populated = !missing(market_c) by Index populated (market_c), sort: gen byte top_five_pct = _n >= round(0.95*_N) ///     if populated & year ==2017 by Index populated (market_c): gen byte bottom_five_pct = _n <= round(0.05*_N) ///     if populated & year ==2017  list if top_five_pct == 1 & Index == "R2000" | bottom_five_pct == 1 & Index == "R1000"

          Comment


          • #6
            Yes, I believe that will correctly restrict the calculations to observations with year == 2017. I didn't pay attention to that part of the problem because the example data didn't even include a year variable, so it would not have been possible to code for it.

            I appreciate that you used -dataex- to show your example data--it makes life so much easier! In the future, when you choose your example, including all variables that are relevant to your problem, and data that is fully representative of your complete data set will enhance your chances of getting a working, correct solution the first time.

            Comment


            • #7
              Thank you, Clyde Schechter .
              My fault in not including all the variables.

              You have been truly helpful!

              Comment


              • #8
                Hi Clyde Schechter

                Just a quick follow up regarding this thread.

                The final goal is to create an indicator which is equal to if firms are in the top 5% of R2000 and 0 if the firms are in the bottom 5% of R1000 (I am looking to adopt an RDD design for my research).

                Therefore, after the code that you suggested in #4, I create the following indicator

                Code:
                gen rdd_treat=1 if top_five_pct==1 & Index=="R2000"
                replace rdd_treat=0 if bottom_five_pct==1 & Index=="R1000"
                Then, when I tab how many "treated" and "control" firms I have, the following statistics surprised me:

                Code:
                tab rdd_treat
                rdd_treat=1 ; 114 obs
                rdd_treat=0 ; 21 obs
                May I ask yo uwhat could be the driver of the following statistics tilted toward treated firm (i.e. top R2000 firms)


                Thank you!

                Comment


                • #9
                  Well, in your example data, 72% of all the observations are R2000. If your example is representative of your entire data set, then I would expect the kind of results you got. 5% of a larger number is going to be larger than 5% of a smaller number.

                  Comment


                  • #10
                    Mine was silly point. It is indeed the case that my whole sample is larger for R2000 : )

                    Comment

                    Working...
                    X