  • Count Number of Unique Values Across Variables

    Hello,
    I am trying to figure out how I can count the number of unique values across several variables. Specifically, in my data I have a question that asks "Who helps you with [activity]" for several activities. The response is given as a numeric person identifier. I want to count across all of these questions to see how many unique individuals help take care of a person, across all of the activities. I've been trying to figure out if there's a way to do this with egen and one of the row functions, but can't seem to figure it out.

    For example, for the respondent below, I would like the variable to indicate that there are 4 unique helpers (e.g., I want to count 36 only once, because that person helps with both bathing and paying bills).
    id        year  bathing  dressing  eating  toileting  transfers  walking  medication management  cooking  paying bills
    23348919  1998  36       401       101     .          .          .        .                      104      36
    Any help is appreciated! Thank you!

    Emily
    Last edited by Emily Ellis; 03 Feb 2023, 14:37.

  • #2
    Since you mentioned using -egen-'s row functions, I take it you prefer to keep the data in wide layout.

    Code:
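    //  METHOD 1 (uses the community-contributed -rowsort- command)
    rowsort bathing-payingbills, gen(v1-v9)    // sort each row's values into v1-v9
    forvalues i = 2/9 {
        // flag a non-missing value that differs from its left-hand neighbor
        gen byte d`i' = v`i' != v`=`i'-1' & !missing(v`i')
    }
    egen wanted = rowtotal(d2-d9)
    replace wanted = wanted + !missing(v1)    // count the first value unless the whole row is missing
    drop d2-d9 v1-v9
    list, noobs clean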
    Myself, I probably would want to switch the data to long layout in any case, as most Stata data analysis and management is easier that way. So that would look like this:
    Code:
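    //  METHOD 2
    rename (bathing-payingbills) person=    // prefix each activity variable with "person"
    reshape long person, i(id year) j(activity) string
    // running count of distinct non-missing persons within each id-year
    by id year (person), sort: gen wanted = sum(person != person[_n-1] & !missing(person))
    by id year (person): replace wanted = wanted[_N]    // carry the final count to every row
    list, noobs clean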
    And, after that, if you needed to go back to the wide layout, you could just run -reshape wide- (no variables or options needed) to get there, and the new variable would go along for the ride. Then -rename person* *- strips the prefix and gets you back to where you started, plus the wanted count of distinct persons.
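A minimal sketch of that round trip, assuming the single example observation from #1 (entered by hand here, with the missing helpers coded as .). It runs Method 2, returns to wide layout, and strips the person prefix again:

```stata
* Sketch only: Method 2 followed by the trip back to wide layout.
* The data are the one example row from #1, typed in by hand.
clear
input long id year bathing dressing eating toileting transfers walking medicationmanagement cooking payingbills
23348919 1998 36 401 101 . . . . 104 36
end

rename (bathing-payingbills) person=
reshape long person, i(id year) j(activity) string

* running count of distinct non-missing persons, then spread the total
by id year (person), sort: gen wanted = sum(person != person[_n-1] & !missing(person))
by id year (person): replace wanted = wanted[_N]

* back to wide: -reshape- remembers the earlier specification
reshape wide
rename person* *    // remove the "person" prefix added above

list id year wanted, noobs clean
```

Because wanted is constant within each id-year pair, -reshape wide- carries it along as an ordinary variable.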

    Given the nature of these variables, my instinct is that you are better off just going to long layout, using Method 2, and keeping it that way. But it does depend on what else you will want to do with this data, and given what they mean, I can envision the possibility, though I think it isn't likely, that keeping them wide would be more effective.

    Added comment: One advantage of the second method is that you do not need to know, or compute, how many activity variables there are. When I wrote the code, I originally overlooked the bathing variable, and then when I discovered that, for method 1, I not only had to put that into the initial -rowsort- command, but I also had to change several other lines in the code that were predicated on there being 8 activities, not 9. With the long layout approach in method 2, all I had to do was correct the -rename- command to include bathing. The rest of the code didn't care how many activities there were.
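For anyone who wants to stay in wide layout without hard-coding the number of activity variables, here is a minimal sketch using only built-in commands. It is a different technique from Method 1 (a pairwise comparison loop rather than -rowsort-), and the names acts, seen, and isnew are made up for illustration:

```stata
* Sketch only: count distinct non-missing values across a varlist
* without knowing in advance how many variables it contains.
clear
input long id year bathing dressing eating toileting transfers walking medicationmanagement cooking payingbills
23348919 1998 36 401 101 . . . . 104 36
end

unab acts : bathing-payingbills    // expand the range into a list of names
gen wanted = 0
local seen                         // variables already examined (empty at first)
foreach v of local acts {
    * a value is new if it is non-missing and matches no earlier variable
    gen byte isnew = !missing(`v')
    foreach w of local seen {
        replace isnew = 0 if `v' == `w'
    }
    replace wanted = wanted + isnew
    drop isnew
    local seen `seen' `v'
}
list id year wanted, noobs clean
```

The loop makes on the order of k^2 comparisons for k variables, which is fine for a handful of activities but is one more reason to prefer the long layout when there are many.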
    Last edited by Clyde Schechter; 03 Feb 2023, 15:08.

    • #3
      Clyde,
      Thanks so much for your fast and thoughtful reply! I'll give these a try. And yes, I have had to do lots of reshaping for this analysis, merging in different kinds of variables. I haven't quite figured out what's best, so I'll think more on this given your suggestion. Thank you!

      Emily

      • #4
        See Section 7 of my 2009 Stata Journal paper "Rowwise", which details dedicated egen functions in the egenmore package on SSC.

        No need to reshape.

        In #2, Clyde Schechter mentioned rowsort, which is explained in the same paper.
        Last edited by Nick Cox; 03 Feb 2023, 16:58.

        • #5
          Here is #4 expanded. You need to install egenmore first.

          Code:
          ssc install egenmore
          Here is an extract from
          help egenmore

          rownvals(numvarlist) [ , missing ] returns the number of distinct values in each observation for a set of numeric variables numvarlist. Thus
          if the values in one observation for five numeric variables are 1, 1, 2, 2, 3 the function returns 3 for that observation. Missing values,
          i.e. any of . .a ... .z, are ignored unless the missing option is specified. (Stata 9 required.)

          rowsvals(strvarlist) [ , missing ] returns the number of distinct values in each observation for a set of string variables strvarlist. Thus if
          the values in one observation for five string variables are "frog", "frog", "toad", "toad", "newt" the function returns 3 for that
          observation. Missing values, i.e. empty strings "", are ignored unless the missing option is specified. (Stata 9 required.)
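The two examples in the help extract can be checked directly. A minimal sketch, assuming egenmore is already installed; the variable names x1-x5, s1-s5, nnum, and nstr are made up for illustration:

```stata
* Sketch only: reproduces the two examples from the help extract.
* Assumes egenmore is installed (ssc install egenmore).
clear
input x1 x2 x3 x4 x5 str4(s1 s2 s3 s4 s5)
1 1 2 2 3 "frog" "frog" "toad" "toad" "newt"
end

egen nnum = rownvals(x1-x5)    // distinct numeric values in the row
egen nstr = rowsvals(s1-s5)    // distinct string values in the row
list nnum nstr, noobs
```

Going by the help text, both new variables should be 3 for this observation.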





          Code:
          * Example generated by -dataex-. For more info, type help dataex
          clear
          input long id int year byte bathing int(dressing eating) byte(toileting transfers walking medicationmanagement) int cooking byte payingbills
          23348919 1998 36 401 101 . . . . 104 36
          end
          
          . egen ndistinct = rownvals(bathing-payingbills)
          
          . list 
          
               +-----------------------------------------------------------------------------------------+
            1. |      id | year | bathing | dressing | eating | toilet~g | transf~s | walking | medica~t |
               | 2.3e+07 | 1998 |      36 |      401 |    101 |        . |        . |       . |        . |
               |-----------------------------------------------------------------------------------------|
               |          cooking           |          paying~s           |           ndisti~t           |
               |              104           |                36           |                  4           |
               +-----------------------------------------------------------------------------------------+
          The 2009 paper is at https://journals.sagepub.com/doi/pdf...867X0900900107
