Count Number of Unique Values Across Variables

Emily Ellis

Join Date: May 2016

Posts: 8
#1

03 Feb 2023, 14:34

Hello,
I am trying to figure out how I can count the number of unique values across several variables. Specifically, in my data I have a question that asks "Who helps you with [activity]" for several activities. The response is given as a numeric person identifier. I want to count across all of these questions to see how many unique individuals help take care of a person, across all of the activities. I've been trying to figure out if there's a way to do this with egen and one of the row functions, but can't seem to figure it out.

For example, for the respondent below, I would like the variable to indicate that there are 4 unique helpers (i.e., I only want to count 36 once, because that person helps with both bathing and paying bills).

id        year  bathing  dressing  eating  toileting  transfers  walking  medication management  cooking  paying bills
23348919  1998  36       401       101     .          .          .        .                      104      36

Any help is appreciated! Thank you!

Emily

Last edited by Emily Ellis; 03 Feb 2023, 14:37.
Clyde Schechter

Join Date: Apr 2014

Posts: 30119
#2

03 Feb 2023, 14:56

Since you mentioned using -egen-'s row functions, I take it you prefer to keep the data in wide layout.

Code:

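// METHOD 1
// v1-v9 hold each row's values in ascending order (missing values sort last)
rowsort bathing-payingbills, gen(v1-v9)
forvalues i = 2/9 {
    // flag a nonmissing value that differs from its left-hand neighbor
    gen byte d`i' = v`i' != v`=`i'-1' & !missing(v`i')
}
egen wanted = rowtotal(d2-d9)
// count the first (smallest) value too, unless the whole row is missing
replace wanted = wanted + !missing(v1)
drop d2-d9 v1-v9
list, noobs clean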

Myself, I probably would want to switch the data to long layout in any case, as most Stata data analysis and management is easier that way. So that would look like this:

Code:

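// METHOD 2
rename (bathing-payingbills) person=
reshape long person, i(id year) j(activity) string
// running count of nonmissing values that differ from the previous one
by id year (person), sort: gen wanted = sum(person != person[_n-1] & !missing(person))
// spread the final count to every observation in the group
by id year (person): replace wanted = wanted[_N]
list, noobs clean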

And, after that, if you needed to go back to the wide layout, you could just run -reshape wide- (no variables or options needed) to get there, and the new variable would go along for the ride. Then -rename person* *- strips the prefix and gets you back to where you started, plus the wanted count of distinct persons.
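As a rough sketch of that round trip, here is the long-layout count followed by the return to wide, run on the single example respondent. The variable names medicationmanagement and payingbills are assumptions about how the spaced column headings are actually stored, the running count is written as sum(person != person[_n-1] & !missing(person)), and the prefix is stripped with -rename person* *-.

```stata
clear
input long id int year byte bathing int(dressing eating) byte(toileting transfers walking medicationmanagement) int cooking byte payingbills
23348919 1998 36 401 101 . . . . 104 36
end

* go long and count distinct nonmissing persons within id-year
rename (bathing-payingbills) person=
reshape long person, i(id year) j(activity) string
by id year (person), sort: gen wanted = sum(person != person[_n-1] & !missing(person))
by id year (person): replace wanted = wanted[_N]

* return to wide; wanted is constant within id-year, so it is carried along
reshape wide
rename person* *
list id year wanted, noobs clean
```

For this respondent, wanted should come out as 4, matching the count asked for in #1.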

Given the nature of these variables, my instinct is that you are better off just going to long layout, using Method 2, and keeping it that way. But it does depend on what else you will want to do with this data, and given what they mean, I can envision the possibility, though I think it isn't likely, that keeping them wide would be more effective.

Added comment: One advantage of the second method is that you do not need to know, or compute, how many activity variables there are. When I wrote the code, I originally overlooked the bathing variable, and then when I discovered that, for method 1, I not only had to put that into the initial -rowsort- command, but I also had to change several other lines in the code that were predicated on there being 8 activities, not 9. With the long layout approach in method 2, all I had to do was correct the -rename- command to include bathing. The rest of the code didn't care how many activities there were.
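If you do stay wide, one way to soften that drawback is to let Stata count the activity variables for you. The following is only a sketch: it assumes the community-contributed -rowsort- command is already installed, that the activity variables sit next to each other in the dataset from bathing to payingbills, and that the names acts, n, and prev are free to use as local macros.

```stata
clear
input long id int year byte bathing int(dressing eating) byte(toileting transfers walking medicationmanagement) int cooking byte payingbills
23348919 1998 36 401 101 . . . . 104 36
end

* expand the variable range and count its members instead of hard-coding 9
unab acts : bathing-payingbills
local n : word count `acts'

* assumes -rowsort- is installed; sorted values go into v1-v`n'
rowsort `acts', gen(v1-v`n')

* the first sorted value counts if it is nonmissing
gen wanted = !missing(v1)
forvalues i = 2/`n' {
    local prev = `i' - 1
    * add 1 for each nonmissing value that differs from its predecessor
    replace wanted = wanted + (v`i' != v`prev' & !missing(v`i'))
}
drop v1-v`n'
list id year wanted, noobs clean
```

Adding or removing an activity then only requires changing the -unab- line.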

Last edited by Clyde Schechter; 03 Feb 2023, 15:08.
Emily Ellis

Join Date: May 2016

Posts: 8
#3

03 Feb 2023, 15:14

Clyde,
Thanks so much for your fast and thoughtful reply! I'll give these a try. And yes, I have had to do lots of reshaping for this analysis, merging in different kinds of variables. I haven't quite figured out what's best, so I'll think more on this given your suggestion. Thank you!

Emily
Nick Cox

Join Date: Mar 2014

Posts: 35726
#4

03 Feb 2023, 16:48

See Section 7 of my 2009 Stata Journal paper "Rowwise", which details dedicated egen functions in the egenmore package on SSC.

No need to reshape.

In #2, Clyde Schechter mentioned rowsort, which is explained in the same paper.

Last edited by Nick Cox; 03 Feb 2023, 16:58.
Nick Cox

Join Date: Mar 2014

Posts: 35726
#5

04 Feb 2023, 02:03

Here is #4 expanded. You need to install egenmore first.

Code:

ssc install egenmore

Here is an extract from help egenmore:

rownvals(numvarlist) [ , missing ] returns the number of distinct values in each observation for a set of numeric variables numvarlist. Thus
if the values in one observation for five numeric variables are 1, 1, 2, 2, 3 the function returns 3 for that observation. Missing values,
i.e. any of . .a ... .z, are ignored unless the missing option is specified. (Stata 9 required.)

rowsvals(strvarlist) [ , missing ] returns the number of distinct values in each observation for a set of string variables strvarlist. Thus if
the values in one observation for five string variables are "frog", "frog", "toad", "toad", "newt" the function returns 3 for that
observation. Missing values, i.e. empty strings "", are ignored unless the missing option is specified. (Stata 9 required.)

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input long id int year byte bathing int(dressing eating) byte(toileting transfers walking medicationmanagement) int cooking byte payingbills
23348919 1998 36 401 101 . . . . 104 36
end

. egen ndistinct = rownvals(bathing-payingbills)

. list

     +-----------------------------------------------------------------------------------------+
  1. |      id | year | bathing | dressing | eating | toilet~g | transf~s | walking | medica~t |
     | 2.3e+07 | 1998 |      36 |      401 |    101 |        . |        . |        . |        . |
     |-----------------------------------------------------------------------------------------|
     |              cooking         |         paying~s         |         ndisti~t              |
     |                  104         |               36         |                4              |
     +-----------------------------------------------------------------------------------------+

The 2009 paper is at https://journals.sagepub.com/doi/pdf...867X0900900107
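For anyone who cannot install community-contributed commands, the same count can be sketched with official Stata only. This is not how rownvals() works internally; it is just a pairwise comparison loop, and the names acts, seen, isnew, and nhelpers are made up for the illustration.

```stata
clear
input long id int year byte bathing int(dressing eating) byte(toileting transfers walking medicationmanagement) int cooking byte payingbills
23348919 1998 36 401 101 . . . . 104 36
end

unab acts : bathing-payingbills
gen byte nhelpers = 0
local seen
foreach v of local acts {
    * a value is new if it is nonmissing and matches no earlier variable
    gen byte isnew = !missing(`v')
    foreach u of local seen {
        replace isnew = 0 if `v' == `u'
    }
    replace nhelpers = nhelpers + isnew
    drop isnew
    local seen `seen' `v'
}
list id year nhelpers, noobs clean
```

The loop makes on the order of k^2 passes for k variables, so it is fine for a handful of activities but the dedicated egen function is the tidier choice.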