Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • summarizing multiple variables for each observation -- ties broken randomly

    I have 16 variables, each is continuous with values ranging from 0 to 10. I want to reduce it to a single variable that indicates which of the 16 variables has the largest value within that observation. I have 10,600 observations.

    When ties are present, pick between the tying variables randomly. Values of 0 are common, but if all of the 16 variables have a value of 0, return a value of 0.

    For example, here is a sample of what ten observations might look like, the same example in code, and the desired output:
    ID cat1 cat2 cat3 cat4 cat5 cat6 cat7 cat8 cat9 cat10 cat11 cat12 cat13 cat14 cat15 cat16
    1001 0 0 0 3 0 0 0 0 0 0 0 0 0 0 0 0
    1002 0 0 0 0 0 3.5 0 0 0 5 0 0 0 0 0 0
    1003 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    1004 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    1005 0 0 2 0 0 0 0 3 0 0 0 0 0 0 0 0
    1006 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    1007 0 0 4 0 0 0 0 0 0 0 0 4 0 0 0 0
    1008 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    1009 0 3.75 0 0 0 0 0 0 0 0 0 0 3.75 8 0 0
    1010 0 0 0 0 0 0 0 5 0 0 0 0 0 0 0 5

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input int id byte cat1 float cat2 byte(cat3 cat4 cat5) float cat6 byte(cat7 cat8 cat9 cat10 cat11 cat12) float cat13 byte(cat14 cat15 cat16)
    1001 0    0 0 3 0   0 0 0 0 0 0 0    0 0 0 0
    1002 0    0 0 0 0 3.5 0 0 0 5 0 0    0 0 0 0
    1003 0    0 0 0 0   0 0 0 0 0 0 0    0 0 0 0
    1004 0    0 0 0 0   0 0 0 0 0 0 0    0 0 0 0
    1005 0    0 2 0 0   0 0 3 0 0 0 0    0 0 0 0
    1006 0    0 0 0 0   0 0 0 0 0 0 0    0 0 0 0
    1007 0    0 4 0 0   0 0 0 0 0 0 4    0 0 0 0
    1008 0    0 0 0 0   0 0 0 0 0 0 0    0 0 0 0
    1009 0 3.75 0 0 0   0 0 0 0 0 0 0 3.75 8 0 0
    1010 0    0 0 0 0   0 0 5 0 0 0 0    0 0 0 5
    end
    Desired Output
    ID catSummary
    1001 4
    1002 10
    1003 0
    1004 0
    1005 8
    1006 0
    1007 12
    1008 0
    1009 14
    1010 8

  • #2
    If I understood correcltly, to start, I think you need to deal with - egen - and some functions, such as rowmax().
    Best regards,

    Marcos

    Comment


    • #3
      Thanks, Marcos! I don't actually need to capture the maximum value of the row, but to capture the name of the variable that held the max value. In my example above I tried to do this by capturing the number in the name of the variable (i.e., "cat8" is recorded in the summary variable as "8").

      After doing more searching based on your suggestion, I found this thread which shows that rowmax might hold a lot of promise: https://www.statalist.org/forums/for...ariables-names
      Last edited by Tyler Ames; 28 May 2019, 12:26.

      Comment


      • #4
        I have adapted the code from the link that I pasted into my prior response that uses rowmax, as you suggested, Marcos. The code is below. But it misses on one essential element: ties are not decided randomly. It gives preference to variables listed at the end of the local, and which are therefore run through the loop after the variables listed at the beginning of the local. Here is my adapted code:

        Code:
        *Generating a last category to mark observations with all 0's
        gen cat17 = .
        
        local vars " cat1 cat2 cat3 cat4 cat5 cat6 cat7 cat8 cat9 cat10 cat11 cat12 cat13 cat14 cat15 cat16 cat17 "
        
        foreach var of local vars {
        recode `var' 0=.
        }
        
        egen m2= rowmax( `vars' )
        gen catSummary=""
        
        foreach var of local vars {
        replace catSummary = "`var'" if m2==`var'
        }
        drop m2
        Any thoughts on how to break the ties randomly?

        Comment


        • #5
          Solved. It isn't efficient, but just in case anyone else in the future finds this thread, here was my solution:

          Code:
          set seed 132518
          
          clear
          input int id byte cat1 float cat2 byte(cat3 cat4 cat5) float cat6 byte(cat7 cat8 cat9 cat10 cat11 cat12) float cat13 byte(cat14 cat15 cat16)
          1001 0    0 0 3 0   0 0 0 0 0 0 0    0 0 0 0
          1002 0    0 0 0 0 3.5 0 0 0 5 0 0    0 0 0 0
          1003 0    0 0 0 0   0 0 0 0 0 0 0    0 0 0 0
          1004 0    0 0 0 0   0 0 0 0 0 0 0    0 0 0 0
          1005 0    0 2 0 0   0 0 3 0 0 0 0    0 0 0 0
          1006 0    0 0 0 0   0 0 0 0 0 0 0    0 0 0 0
          1007 0    0 4 0 0   0 0 0 0 0 0 4    0 0 0 0
          1008 0    0 0 0 0   0 0 0 0 0 0 0    0 0 0 0
          1009 0 3.75 0 0 0   0 0 0 0 0 0 0 3.75 8 0 0
          1010 0    0 0 0 0   0 0 5 0 0 0 0    0 0 0 5
          end
          
          
          gen randomnum = runiform(1, 16)
          
          local vars " cat1 cat2 cat3 cat4 cat5 cat6 cat7 cat8 cat9 cat10 cat11 cat12 cat13 cat14 cat15 cat16 "
          
          foreach var of local vars {
          recode `var' 0=.
          }
          
          egen m2= rowmax( `vars' )
          
          
          *gen most=""
          *foreach var of local vars {
          *replace most = "`var'" if m2==`var'
          *}
          
          
          reshape long cat, i(id) string
          
          gen highest = _j if cat == m2 & m2!=.
          destring highest, replace
          
          *random elements
          gen diff1 = highest - randomnum
          replace diff1 = . if diff1 < 0
          
          by id (diff1 highest), sort: replace highest = highest[_n-1] if _n>1
          
          drop m2 diff1 randomnum cat _j
          duplicates drop
          
          recode highest .=0
          rename highest catSummary

          Comment

          Working...
          X