Creating a rank for string variables with several equal observations

Thiago Soares

Join Date: Jun 2017
Posts: 6


28 Jun 2017, 02:48

I have a dataset with a string variable "x" that takes the same value in several observations. I want to create two new variables:

A count variable "count_x": the number of times each value of "x" appears
A ranking variable "ranking_x": the rank of each value of "x" by that frequency, with the most frequent value ranked 1

So far I have managed to create count_x. However, I don't know how to create ranking_x so that it gives the output I want.
This is a simple example of the logic I used:

use dataset
keep x
bysort x: egen count_x=count(x)
egen ranking_x=rank(count_x), field
list x count_x ranking_x

Below is an example of the values I obtained and the values I want:

x	count_x	ranking_x	desired ranking_x
CESAR	3	11	3
CESAR	3	11	3
CESAR	3	11	3
JOHN	6	1	1
JOHN	6	1	1
JOHN	6	1	1
JOHN	6	1	1
JOHN	6	1	1
JOHN	6	1	1
MAX	1	14	4
PAUL	4	7	2
PAUL	4	7	2
PAUL	4	7	2
PAUL	4	7	2

I understand that it would work if I used the following logic:

use dataset
keep x
bysort x: egen count_x=count(x)
bysort x: keep if _n == 1
egen ranking_x=rank(count_x), field
list x count_x ranking_x

However, since my analysis uses other variables besides x, I'm looking for a solution that lets me keep all the observations.

Thanks for your help!

Last edited by Thiago Soares; 28 Jun 2017, 03:00.


Nick Cox

Join Date: Mar 2014
Posts: 35699

28 Jun 2017, 03:04

Thanks for the data example, although using dataex (SSC) as requested in FAQ Advice #12 would have been even better.

You need to rank just one observation of each distinct category and then spread that rank to other observations. Here's one way to do it:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str5 x
"CESAR"
"CESAR"
"CESAR"
"JOHN" 
"JOHN" 
"JOHN" 
"JOHN" 
"JOHN" 
"JOHN" 
"MAX"  
"PAUL" 
"PAUL" 
"PAUL" 
"PAUL" 
end

bysort x : gen count = _N
egen tag = tag(x)
egen rank = rank(count) if tag, field 
bysort x (rank) : replace rank = rank[1]
list, sepby(x)

     +----------------------------+
     |     x   count   tag   rank |
     |----------------------------|
  1. | CESAR       3     1      3 |
  2. | CESAR       3     0      3 |
  3. | CESAR       3     0      3 |
     |----------------------------|
  4. |  JOHN       6     1      1 |
  5. |  JOHN       6     0      1 |
  6. |  JOHN       6     0      1 |
  7. |  JOHN       6     0      1 |
  8. |  JOHN       6     0      1 |
  9. |  JOHN       6     0      1 |
     |----------------------------|
 10. |   MAX       1     1      4 |
     |----------------------------|
 11. |  PAUL       4     1      2 |
 12. |  PAUL       4     0      2 |
 13. |  PAUL       4     0      2 |
 14. |  PAUL       4     0      2 |
     +----------------------------+

See the help for egen for its tag() function (first so written by me in 1999, but the idea was long since part of Stata folklore).
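
For readers following along: the original attempt ranked all 14 observations at once, and a field rank is 1 plus the number of observations with a higher value. CESAR therefore got 1 + (6 + 4) = 11, PAUL 1 + 6 = 7, and MAX 1 + 13 = 14. Ranking only the tagged observations compares one count per name instead. The sketch below shows a possible alternative for the final step, spreading the rank with egen's max() function rather than by sorting. It relies on max() ignoring missing values. The variable names rank_tagged and rank_spread are made up for illustration and are not part of the answer above.

```stata
* A minimal sketch, not the only way to do this.
* rank_tagged and rank_spread are hypothetical variable names.
clear
input str5 x
"CESAR"
"CESAR"
"CESAR"
"JOHN"
"JOHN"
"JOHN"
"JOHN"
"JOHN"
"JOHN"
"MAX"
"PAUL"
"PAUL"
"PAUL"
"PAUL"
end

* Frequency of each distinct value of x
bysort x : gen count = _N

* tag is 1 for exactly one observation per distinct value of x, 0 otherwise
egen tag = tag(x)

* Rank one observation per name; untagged observations get missing
egen rank_tagged = rank(count) if tag, field

* Spread the single nonmissing rank within each name;
* egen's max() ignores missing values
bysort x : egen rank_spread = max(rank_tagged)

list, sepby(x)
```

On this example rank_spread should match the rank column in the listing above. The sort-based replace in the answer avoids creating an extra variable, while the max() version leaves the intermediate ranks visible, which may help when checking results.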


Thiago Soares

Join Date: Jun 2017

Posts: 6

28 Jun 2017, 03:14

Thank you for your quick answer! It solved my problem :-)
Next time I'll use dataex (SSC), sorry about that.