Selecting variables with a target number of levels (categories)

Sonnen Blume

Join Date: Aug 2018

Posts: 342
#1

30 Dec 2022, 12:33

Hi,

This might be a rather unhappy question for the happy new year...

To handle a large number of variables more easily, I'd like to know which variables are dichotomous, trichotomous and so on. For instance, I want to tabulate only the variables with 5 levels (very happy, happy, neutral, unhappy, very unhappy), so it would be easier if they could all be identified at once.

Thank you in advance!

Here is an example dataset:

clear
input float(v1 v2 v4 v5 v6 v7)
4 45 2 1 12 1
3 24 4 2 21 1
2 29 3 2 21 1
3 43 4 1 21 1
1 24 1 2 21 1
2 44 1 2 21 1
4 22 4 2 97 1
5 18 8 2 21 1
4 21 1 2 21 1
5 37 8 1 21 1
5 24 6 1 97 1
1 33 7 1 21 1
3 30 6 1 21 1
3 27 4 2 21 2
2 46 5 2 21 1
5 31 2 1 11 1
5 22 7 2 21 1
3 23 2 1 97 1
1 18 1 2 21 1
2 38 7 2 21 1
2 30 5 2 21 1
4 39 4 1 21 1
3 47 2 2 21 1
2 21 2 2 21 1
4 38 4 2 21 1
3 24 7 1 21 1
2 25 3 1 21 1
2 18 8 2 21 1
2 48 6 2 21 1
1 32 2 1 21 1
3 32 4 2 21 1
5 19 2 1 21 1
5 21 4 2 21 2
3 24 7 2 21 1
3 31 1 2 21 1
3 26 7 2 97 1
3 34 4 2 51 2
2 24 4 2 21 1
4 17 3 2 21 1
4 24 8 1 97 1
1 43 1 2 21 1
5 24 7 1 21 1
3 34 1 2 21 1
3 19 6 2 21 1
5 28 2 1 71 2
1 34 2 2 21 1
4 32 2 2 21 1
2 15 8 2 97 1
3 46 7 1 21 1
4 47 3 2 21 2
3 49 6 2 21 1
3 45 5 2 21 1
5 38 1 1 21 1
3 16 6 2 21 1
3 24 2 2 21 1
3 17 7 1 21 1
2 38 4 2 21 1
5 49 8 1 14 1
4 36 2 1 21 1
5 33 7 1 21 1
5 34 4 2 21 1
4 30 6 1 21 2
4 30 6 1 21 1
3 18 7 2 21 1
5 24 6 1 21 1
3 23 3 2 21 1
1 29 5 2 21 1
3 22 6 1 21 2
2 31 5 2 21 1
3 40 4 2 21 1
5 36 3 1 11 1
2 32 7 2 21 1
3 35 2 1 21 1
2 22 6 1 21 1
5 22 3 1 21 1
2 26 1 2 21 1
3 31 5 2 21 1
1 24 7 2 21 2
3 43 3 2 21 1
3 28 2 2 21 1
3 30 3 2 21 1
4 48 1 2 21 1
3 42 7 2 21 1
3 40 2 2 21 1
4 35 2 2 21 1
2 40 6 2 21 1
2 26 2 1 21 1
3 34 6 1 21 1
3 33 6 1 21 1
3 34 7 1 21 1
3 32 6 1 21 1
1 19 5 2 97 1
2 34 8 2 43 1
4 43 7 1 21 1
1 25 5 2 21 1
2 25 4 2 21 1
3 38 5 2 21 1
3 21 2 2 21 1
2 37 5 2 21 1
4 21 6 1 21 1
end

Clyde Schechter

Join Date: Apr 2014
Posts: 30078

30 Dec 2022, 14:34

Code:

local desired_levels 2 // OR HOWEVER MANY LEVELS YOU WANT

unab vbles : _all      // LIST THE ORIGINAL VARIABLES BEFORE ADDING obs_no
gen long obs_no = _n   // TO NOTE ORIGINAL DATA ORDER

local vbles_with_desired_levels
foreach v of local vbles {
    sort `v'
    gen long n_levels = sum(`v' != `v'[_n-1]) // RUNNING COUNT OF DISTINCT VALUES
    if n_levels[_N] == `desired_levels' {
        local vbles_with_desired_levels `vbles_with_desired_levels' `v'
    }
    drop n_levels
}

sort obs_no // TO RESTORE ORIGINAL DATA ORDER
drop obs_no

display "`vbles_with_desired_levels'"
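
A variant of the same idea that does not sort the data, offered as a minimal sketch rather than as part of the original answer: tabulate leaves the number of distinct nonmissing values in r(r). The toy variables a, b and c and the macro name matches are made up for illustration; with the example data from #1 in memory, skip the input step. Unlike the code above, tabulate ignores missing values unless the missing option is added, and capture is there only because tabulate refuses variables with very many distinct values.

```stata
* toy data so the sketch runs on its own (hypothetical variables)
clear
input byte(a b c)
1 1 1
2 2 1
3 1 2
4 2 2
5 1 3
1 2 3
end

local desired_levels 5
local matches
foreach v of varlist _all {
    * r(r) holds the number of rows of the one-way table,
    * i.e. the number of distinct nonmissing values of `v'
    capture quietly tabulate `v'
    if _rc == 0 & r(r) == `desired_levels' {
        local matches `matches' `v'
    }
}

display "`matches'"
```

Here only a takes 5 distinct values, so it is the only name collected in matches.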

Nick Cox

Join Date: Mar 2014

Posts: 35664
#3

30 Dec 2022, 14:47

See also distinct from the Stata Journal, which can give you a single sorted table.
Nick Cox

Join Date: Mar 2014

Posts: 35664
#4

31 Dec 2022, 05:08

To the point of #3 here is a run on the data example in #1.

Code:

. distinct, min(5) max(5)

---------------------------
    |     total   distinct
----+----------------------
 v1 |       100          5
---------------------------

distinct is profoundly empirical: it counts only the values that actually occur in the data, so it can't identify variables where 5 distinct values are possible but do not all occur.

But if you have 5 distinct value labels defined, see findname from the Stata Journal as updated in 2020.

dm0048_4: Finding variables. N. J. Cox. Stata Journal 15: 605; 12: 167; 10: 691; 10: 281–296.

New options include columns() to find variables according to column position in the dataset, so, for example, columns(1 -1) finds the first and last variables in the dataset; and four new options to find variables with specified text in value labels, or specified numbers of value labels, either defined as value labels or used within the data. Options to do with value labels used within the data can be combined with if or in, or both.
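
To illustrate the caveat about counting only what occurs, here is a minimal sketch using official commands only: tabulate reports the number of values observed in r(r), while label list reports the number of value labels defined in r(k). The variable mood and the label name moodlbl are made up for illustration.

```stata
* hypothetical 5-point scale in which the value 3 ("neutral") never occurs
clear
input byte mood
1
2
2
4
5
end

label define moodlbl 1 "very unhappy" 2 "unhappy" 3 "neutral" 4 "happy" 5 "very happy"
label values mood moodlbl

quietly tabulate mood
local n_observed = r(r)     // distinct values actually present in the data

quietly label list moodlbl
local n_defined = r(k)      // value labels defined for moodlbl

display "observed: `n_observed'   defined: `n_defined'"
```

An empirical count would place mood among the 4-level variables, while the value label shows that it was designed with 5.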

Sonnen Blume

Join Date: Aug 2018
Posts: 342

31 Dec 2022, 08:17

In reply to Clyde Schechter:

Thanks so much, Clyde! Mighty solutions to mighty problems, as always! Happy new year!

Sonnen Blume

Join Date: Aug 2018

Posts: 342
#6

31 Dec 2022, 08:25

In reply to Nick Cox (#4):

Thanks so much, professor, for this wonderful package! It gives a perfect view of what I was looking for.