Problem with ranking observations in a uniform distribution

Max den Reijer

Join Date: Jul 2019
Posts: 4

Problem with ranking observations in a uniform distribution

18 Jul 2019, 10:38

Dear all,

I have a dataset where i have monthly observations of firms during multiple years. The following data is a part of it (also doesn't represent al the data of that month):

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long ISIN float(date AG)
2735 439   -.4516252
1234 439    -.436528
 815 439  -.42150295
 930 439   -.3979278
2437 439   -.3829988
1706 439   -.3513278
2914 439   .17786545
1959 439   .17994183
2589 439   .18265775
1362 439   .18334074
 136 439           .
1134 439           .
1524 439           .
 730 439           .
1057 439           .
2457 439           .
 790 439           .
1668 439           .
2735 440   -.4516252
 815 440  -.42150295
 930 440   -.3979278
2437 440   -.3829988
 779 440   .05364185
2063 440   .05568719
2030 440   .05700605
1258 440   .05852029
1849 440   .56396616
1281 440    .6302893
 695 440    .6348755
2839 440    .7798654
1680 440    .7868747
 729 440   1.2353185
1319 440   1.2591548
2318 440   1.3253646
1980 440   1.4670322
1190 440     1.76784
 605 440           .
 363 440           .
 460 440           .
2794 440           .
end
format %tm date

I want to create a variable that gives a uniformly distributed rank over the interval of (0,1] based on the nonmissing values in each month of "AG" where the highest value of AG has the highest rank. And is there also a way to be able to turn the ranking around? Where the highest value has the lowest ranking.
I have looked at using the function of rank() but that uses integer values and with further searching on these forums I have not been able to find a solution.

I'd be grateful for some suggestions to solve this issue.

Kind regards,

Max

Tags: None

Clyde Schechter

Join Date: Apr 2014

Posts: 30065
#2

18 Jul 2019, 10:49

I'm not entirely sure I understand what you want, but I think this does it:

Code:

by date (AG), sort: gen rank = _n if !missing(AG) by date (AG): egen max_rank = max(rank) gen rev_rank = max_rank + 1 - rank replace rank = rank/max_rank replace rev_rank = rev_rank/max_rank

Note: You said nothing about how you want to resolve ties. The above code breaks ties randomly and irreproducibly.
1 like
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35637
#3

18 Jul 2019, 10:57

Code:

bysort date: egen rank = rank(-AG) by date: egen max = max(rank) by date : replace rank = rank/max

would seem to satisfy the criteria and assigns the same rank to the same values when ties are present.

NB that the manual entry for egen mentions the negation trick.
Comment
Max den Reijer

Join Date: Jul 2019

Posts: 4
#4

18 Jul 2019, 11:01

Thank you Clyde for your help.
Your code indeed solves my issue. What do you mean exactly with ties?
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30065
#5

18 Jul 2019, 11:05

Suppose that two different ISINs have the same value of AG in a given month, say both tied for 5th highest. The code in #2 will arbitrarily select one of them as 5th highest and call the other one 6th highest (and then those integer ranks will be scaled to the (0,1] interval). The choice of which is 5th and which is 6th is made at random, and, if you re-run the code on the same data, you will not get the same choice made each time. The code offered by Nick in #3 is different: it will assign the same rank to all ISINs having the same value of AG.
Comment
Max den Reijer

Join Date: Jul 2019

Posts: 4
#6

18 Jul 2019, 11:15

I understand what u mean now, thank you.

I've checked my dataset and I indeed have some ties in the data. I will look more towards the code of Nick.

Again thank you both for your help.
Comment

Announcement

Problem with ranking observations in a uniform distribution

Comment

Comment

Comment

Comment

Comment