Count unique appearance of values across multiple variables in wide format

Leigh Forgues

Join Date: Mar 2023

Posts: 7
#1

Count unique appearance of values across multiple variables in wide format

26 Mar 2023, 20:38

Hello!

It has been a while since I have used Stata and I need a bit of help.

I have a wide dataset with variables activ1-active27. Across these variables, the values I am interested in are 3001-3008 (these values represent specific categories of activities a participant has engaged in).

I would like to count:
1) the unique appearance of 3001-3008 across activ1-active27 for each idpriv
2) the total count of any value within the range 3001-3008 across activ1-activ27 for each idpriv

I have tried a number of loops without success. While I would love to try -egenmore-, I cannot add packages because I am working on a disconnected server.
Tags: None
Chen Samulsion

Join Date: Jan 2018

Posts: 914
#2

26 Mar 2023, 20:51

Hope this is what you want:

Code:

/*For the question 1)*/ foreach n of numlist 3001/3008 { by idpriv: egen count`n'=anycount(activ1-active27), values(`n') } /*For the question 2)*/ by idpriv: egen total_count=anycount(activ1-active27), values(3001/3008)

edit: sorry I can't delete the answer, but the egen anycount function may not be combined with by, you can use Clyde Schechter's code instead.

Last edited by Chen Samulsion; 26 Mar 2023, 20:59. Reason: wrong
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30095
#3

26 Mar 2023, 20:53

2) the total count of any value within the range 3001-3008 across activ1-activ27 for each idpriv

The -egen, anycount()- function is part of official Stata and does not require -egenmore-.

Code:

egen wanted1 = anycount(activ1-activ27), values(3001(1)3008)

I'm not sure what you mean when you say you want this "for each idpriv." The above will count the number of occurrences of any value 3001 through 3008 in all of the variables activ1-activ27 in each observation. If you have multiple observations per idpriv, then it isn't clear what you want. Perhaps you want to then take the total of each of those:

Code:

by idpriv: egen wanted2 = total(wanted1)

I have not responded to your first request because I do not understand what it means.

Added: Crossed with #2. I don't think the code proposed there will work. According to the -egen- help file, -egen, anycount()- cannot be combined with -by-.

Last edited by Clyde Schechter; 26 Mar 2023, 20:56.
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35696

27 Mar 2023, 01:02

I suspect that Chen Samulsion is pointing in the right direction but we lack a data example -- and as explained in FAQ Advice #12 a fake example with the right flavour can be just as helpful as real data. See https://www.statalist.org/forums/help#stata

The word unique can mean something more or less, erm, unique to each problem. See e.g. Section 2 of https://journals.sagepub.com/doi/pdf...867X0800800408

Please consider this and then tell us if your real problem is qualitatively different. For example, are there multiple observations for each identifier?

Code:

clear 
set obs 3
gen idpriv = _n 

forval j = 1/3 {
    gen activ`j' = runiformint(1, 9)
}

forval a = 4/6  {
    egen any`a' = anycount(activ*), value(`a')
}

egen all = rowtotal(any*)

list 
     +--------------------------------------------------------------+
     | idpriv   activ1   activ2   activ3   any4   any5   any6   all |
     |--------------------------------------------------------------|
  1. |      1        1        6        7      0      0      1     1 |
  2. |      2        5        8        8      0      1      0     1 |
  3. |      3        5        5        8      0      2      0     2 |
     +--------------------------------------------------------------+

Comment

Leigh Forgues

Join Date: Mar 2023

Posts: 7
#5

29 Mar 2023, 11:42

Thank you all for your help. It is much appreciated!

For the initial suggestion, I received an "invalid name" error. That's my fault for not including a data example.

Re: the data, it looks exactly like what Nick shared above (without value labels attached).

idpriv activ1 activ2 activ3 Unique Music Grps Total Music Grps

1 3001 5031 3001 1 2

2 1011 3008 3002 2 1

3 2031 3005 . 1 1

activ1 is the first club activity a high schooler participated in, activ2 is the second, and so on for activ3-activ27

Labels for 3001-3008 are unique music-related groups (band, chorus, etc)

I want to count unique music groups for each high schooler (unique appearances of 3001-3008). I also want to check if the unique count matches the total count for those groups.

Thank you again!
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35696
#6

29 Mar 2023, 12:01

Identifier 2 looks like 2 and 2 to me.

If you check out the anymatch() function of egen as well as anycount() and rowtotal() I think you have all the tools you need.

Code:

help egen
Comment
Leigh Forgues

Join Date: Mar 2023

Posts: 7
#7

29 Mar 2023, 14:07

Thank you!
Comment

Duncan Mphande

Join Date: Jul 2022
Posts: 22

30 Apr 2023, 10:04

Please help me count unique appearance of ID across variable. If the ID appears multiple times, i want to count only once for any combination. I want to later aggregate the created column and find a count of how many unique Investor_ID are therein my data set

I tried to use #6, my code was as below and i got values 1 in the new colum totalIDs that is created. So i get a huge aggregate for different combinations. My aim is to count once the var investor_ID for any combination of the other variable.

Code:

 egen totaIDs = anycount(investor_id), values (1(1)113)

My sample data as given by dataex is as below

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input int investor_id byte investor_type str3 ticker float qdate
  2 2 "PUG" 206
  4 3 "BGK" 234
  4 3 "PUG" 191
  4 3 "BGK" 234
  4 3 "BGK" 234
  5 1 "BGK" 237
  5 1 "BGK" 236
  5 1 "BGK" 235
  5 1 "BGK" 235
  5 1 "BGK" 236
  5 1 "BGK" 238
  5 1 "BGK" 238
  5 1 "BGK" 238
  5 1 "BGK" 236
  5 1 "BGK" 235
  5 1 "BGK" 235
  7 2 "PUG" 209
  7 2 "BGK" 208
 15 1 "BGK" 236
 15 1 "BGK" 236
 15 1 "BGK" 214
 15 1 "BGK" 231
 15 1 "BGK" 238
 15 1 "BGK" 231
 15 1 "BGK" 207
 15 1 "BGK" 237
 15 1 "BGK" 239
 15 1 "BGK" 240
 15 1 "BGK" 239
 34 5 "PUG" 208
 35 1 "BGK" 228
 35 1 "BGK" 224
 35 1 "BGK" 221
 35 1 "BGK" 225
 35 1 "BGK" 225
 40 5 "BGK" 203
 40 5 "BGK" 207
 40 5 "BGK" 207
 40 5 "BGK" 207
 40 5 "BGK" 208
 41 2 "PUG" 191
 41 2 "BGK" 208
 41 2 "PUG" 191
 50 6 "PUG" 206
 56 2 "PUG" 206
 56 2 "BGK" 208
 70 2 "BGK" 238
 71 2 "BGK" 239
 71 2 "BGK" 233
 71 2 "BGK" 239
 71 2 "BGK" 233
 71 2 "BGK" 233
 71 2 "BGK" 233
 71 2 "BGK" 239
 72 5 "PUG" 206
 72 5 "PUG" 206
 72 5 "PUG" 206
 72 5 "PUG" 206
 73 2 "PUG" 206
 74 6 "PUG" 206
 81 3 "PUG" 206
 82 4 "BGK" 207
 83 4 "PUG" 206
 83 4 "BGK" 200
 86 4 "PUG" 206
 93 7 "PUG" 206
 96 2 "BGK" 237
 97 2 "PUG" 206
 97 2 "BGK" 230
 97 2 "BGK" 208
107 2 "BGK" 208
107 2 "PUG" 206
109 6 "PUG" 193
109 6 "BGK" 190
110 2 "BGK" 208
113 7 "PUG" 206
end
format %tq qdate

I will be grateful for your kind feedback

Comment

Nick Cox

Join Date: Mar 2014
Posts: 35696

30 Apr 2023, 10:22

#8 is nothing to do with the thread focus on data in wide format (layout). You have long layout, and solutions are quite different.

The number of distinct investors is obtainable from

Code:

tab investor_id 

di r(r)

provided it's not too large. Otherwise consult this 2008 paper and its updates to see if it offers a solution. dm0042 is thus a search term to find related threads here.

Code:

. search dm0042, entry

Search of official help files, FAQs, Examples, and Stata Journals

SJ-20-4 dm0042_3  . . . . . . . . . . . . . . . . Software update for distinct
        (help distinct if installed)  . . . . . .  N. J. Cox and G. M. Longton
        Q4/20   SJ 20(4):1028--1030
        sort() option has been added

SJ-15-3 dm0042_2  . . . . . . . . . . . . . . . . Software update for distinct
        (help distinct if installed)  . . . . . .  N. J. Cox and G. M. Longton
        Q3/15   SJ 15(3):899
        improved table format and display of large numbers of
        observations

SJ-12-2 dm0042_1  . . . . . . . . . . . . . . . . Software update for distinct
        (help distinct if installed)  . . . . . .  N. J. Cox and G. M. Longton
        Q2/12   SJ 12(2):352
        options added to restrict output to variables with a minimum
        or maximum of distinct values

SJ-8-4  dm0042  . . . . . . . . . . . .  Speaking Stata: Distinct observations
        (help distinct if installed)  . . . . . .  N. J. Cox and G. M. Longton
        Q4/08   SJ 8(4):557--568
        shows how to answer questions about distinct observations
        from first principles; provides a convenience command

idpriv	activ1	activ2	activ3	Unique Music Grps	Total Music Grps
1	3001	5031	3001	1	2
2	1011	3008	3002	2	1
3	2031	3005	.	1	1

Announcement