  • Creating a rank for string variables with several equal observations

    I have a dataset with a string variable "x" that takes the same value in several observations. I want to create two new variables:
    1. A count variable "count_x": the number of times a given value of "x" appears
    2. A ranking variable "ranking_x": the rank of a given value of "x" by its number of appearances
    So far I have managed to create count_x. However, I don't know how to create ranking_x so that it gives me the output I want.
    Here is a simple example of the logic I used:

    use dataset
    keep x
    bysort x: egen count_x=count(x)
    egen ranking_x=rank(count_x), field
    list x count_x ranking_x


    Below is an example of the values I obtained and the values I want:
    x      count_x  ranking_x (obtained)  ranking_x (desired)
    CESAR  3        11                    3
    CESAR  3        11                    3
    CESAR  3        11                    3
    JOHN   6        1                     1
    JOHN   6        1                     1
    JOHN   6        1                     1
    JOHN   6        1                     1
    JOHN   6        1                     1
    JOHN   6        1                     1
    MAX    1        14                    4
    PAUL   4        7                     2
    PAUL   4        7                     2
    PAUL   4        7                     2
    PAUL   4        7                     2
    I understand that it would work if I used the following logic:

    use dataset
    keep x
    bysort x: egen count_x=count(x)
    bysort x: keep if _n == 1
    egen ranking_x=rank(count_x), field
    list x count_x ranking_x

    However, since I'm using further variables besides x in my analysis, I'm looking for a solution that allows me to keep all the observations.

    Thanks for your help!
    Last edited by Thiago Soares; 28 Jun 2017, 03:00.

  • #2
    Thanks for the data example, although using dataex (SSC) as requested in FAQ Advice #12 would have been even better.

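    Ranking all observations counts every duplicate separately, which is why the ranks you obtained jump from 1 to 7 to 11 to 14. You need to rank just one observation of each distinct category and then spread that rank to the other observations. Here's one way to do it: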

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str5 x
    "CESAR"
    "CESAR"
    "CESAR"
    "JOHN" 
    "JOHN" 
    "JOHN" 
    "JOHN" 
    "JOHN" 
    "JOHN" 
    "MAX"  
    "PAUL" 
    "PAUL" 
    "PAUL" 
    "PAUL" 
    end
    
    bysort x : gen count = _N
    egen tag = tag(x)
    egen rank = rank(count) if tag, field 
    bysort x (rank) : replace rank = rank[1]
    list, sepby(x)
    
         +----------------------------+
         |     x   count   tag   rank |
         |----------------------------|
      1. | CESAR       3     1      3 |
      2. | CESAR       3     0      3 |
      3. | CESAR       3     0      3 |
         |----------------------------|
      4. |  JOHN       6     1      1 |
      5. |  JOHN       6     0      1 |
      6. |  JOHN       6     0      1 |
      7. |  JOHN       6     0      1 |
      8. |  JOHN       6     0      1 |
      9. |  JOHN       6     0      1 |
         |----------------------------|
     10. |   MAX       1     1      4 |
         |----------------------------|
     11. |  PAUL       4     1      2 |
     12. |  PAUL       4     0      2 |
     13. |  PAUL       4     0      2 |
     14. |  PAUL       4     0      2 |
         +----------------------------+
    See the help for egen for its tag() function (first so written by me in 1999, but the idea was long since part of Stata folklore).
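
    The spreading step above relies on numeric missing values sorting last in Stata, so that after bysort x (rank) the single nonmissing rank is in rank[1]. As a hedged variation, not part of the original reply, the same spreading can be sketched with the egen function max(), which ignores missing values. The names count_x and ranking_x follow the question; rank_tagged is a hypothetical helper name introduced only for this illustration.

    ```stata
    * Sketch only: same example data as above; rank_tagged is a hypothetical helper
    clear
    input str5 x
    "CESAR"
    "CESAR"
    "CESAR"
    "JOHN"
    "JOHN"
    "JOHN"
    "JOHN"
    "JOHN"
    "JOHN"
    "MAX"
    "PAUL"
    "PAUL"
    "PAUL"
    "PAUL"
    end

    * frequency of each distinct value of x
    bysort x : gen count_x = _N

    * tag exactly one observation per distinct value of x
    egen tag = tag(x)

    * rank only the tagged observations; all others get missing
    egen rank_tagged = rank(count_x) if tag, field

    * max() ignores missings, so each group receives its one nonmissing rank
    bysort x : egen ranking_x = max(rank_tagged)

    * check that the rank is constant within each value of x
    bysort x (ranking_x) : assert ranking_x[1] == ranking_x[_N]

    drop tag rank_tagged
    list, sepby(x)
    ```

    Either way, all observations are kept, so any further variables in the dataset remain available alongside count_x and ranking_x.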



    • #3
      Thank you for your quick answer! It solved my problem :-)
      Next time I'll use dataex (SSC), sorry about that.
