Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Problem with ranking observations in a uniform distribution

    Dear all,

    I have a dataset where i have monthly observations of firms during multiple years. The following data is a part of it (also doesn't represent al the data of that month):

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long ISIN float(date AG)
    2735 439   -.4516252
    1234 439    -.436528
     815 439  -.42150295
     930 439   -.3979278
    2437 439   -.3829988
    1706 439   -.3513278
    2914 439   .17786545
    1959 439   .17994183
    2589 439   .18265775
    1362 439   .18334074
     136 439           .
    1134 439           .
    1524 439           .
     730 439           .
    1057 439           .
    2457 439           .
     790 439           .
    1668 439           .
    2735 440   -.4516252
     815 440  -.42150295
     930 440   -.3979278
    2437 440   -.3829988
     779 440   .05364185
    2063 440   .05568719
    2030 440   .05700605
    1258 440   .05852029
    1849 440   .56396616
    1281 440    .6302893
     695 440    .6348755
    2839 440    .7798654
    1680 440    .7868747
     729 440   1.2353185
    1319 440   1.2591548
    2318 440   1.3253646
    1980 440   1.4670322
    1190 440     1.76784
     605 440           .
     363 440           .
     460 440           .
    2794 440           .
    end
    format %tm date
    I want to create a variable that gives a uniformly distributed rank over the interval of (0,1] based on the nonmissing values in each month of "AG" where the highest value of AG has the highest rank. And is there also a way to be able to turn the ranking around? Where the highest value has the lowest ranking.
    I have looked at using the function of rank() but that uses integer values and with further searching on these forums I have not been able to find a solution.

    I'd be grateful for some suggestions to solve this issue.

    Kind regards,

    Max

  • #2
    I'm not entirely sure I understand what you want, but I think this does it:

    Code:
    by date (AG), sort: gen rank = _n if !missing(AG)
    by date (AG): egen max_rank = max(rank)
    gen rev_rank = max_rank + 1 - rank
    replace rank = rank/max_rank
    replace rev_rank = rev_rank/max_rank
    Note: You said nothing about how you want to resolve ties. The above code breaks ties randomly and irreproducibly.

    Comment


    • #3
      Code:
      bysort date: egen rank = rank(-AG)
      by date: egen max = max(rank)
      by date : replace rank = rank/max
      would seem to satisfy the criteria and assigns the same rank to the same values when ties are present.

      NB that the manual entry for egen mentions the negation trick.

      Comment


      • #4
        Thank you Clyde for your help.
        Your code indeed solves my issue. What do you mean exactly with ties?

        Comment


        • #5
          Suppose that two different ISINs have the same value of AG in a given month, say both tied for 5th highest. The code in #2 will arbitrarily select one of them as 5th highest and call the other one 6th highest (and then those integer ranks will be scaled to the (0,1] interval). The choice of which is 5th and which is 6th is made at random, and, if you re-run the code on the same data, you will not get the same choice made each time. The code offered by Nick in #3 is different: it will assign the same rank to all ISINs having the same value of AG.

          Comment


          • #6
            I understand what u mean now, thank you.

            I've checked my dataset and I indeed have some ties in the data. I will look more towards the code of Nick.

            Again thank you both for your help.

            Comment

            Working...
            X