
  • Selecting variables with a target number of levels (categories)

    Hi,

    This might be a rather unhappy question for the happy new year...

    To handle a large number of variables more easily, I'd like to identify which variables are dichotomous, trichotomous, and so on. For instance, I want to tabulate only the variables with 5 levels (very happy, happy, neutral, unhappy, very unhappy), so it would be easier if they could all be identified at once.

    Thank you in advance!


    Here is an example dataset:

    clear
    input float(v1 v2 v4 v5 v6 v7)
    4 45 2 1 12 1
    3 24 4 2 21 1
    2 29 3 2 21 1
    3 43 4 1 21 1
    1 24 1 2 21 1
    2 44 1 2 21 1
    4 22 4 2 97 1
    5 18 8 2 21 1
    4 21 1 2 21 1
    5 37 8 1 21 1
    5 24 6 1 97 1
    1 33 7 1 21 1
    3 30 6 1 21 1
    3 27 4 2 21 2
    2 46 5 2 21 1
    5 31 2 1 11 1
    5 22 7 2 21 1
    3 23 2 1 97 1
    1 18 1 2 21 1
    2 38 7 2 21 1
    2 30 5 2 21 1
    4 39 4 1 21 1
    3 47 2 2 21 1
    2 21 2 2 21 1
    4 38 4 2 21 1
    3 24 7 1 21 1
    2 25 3 1 21 1
    2 18 8 2 21 1
    2 48 6 2 21 1
    1 32 2 1 21 1
    3 32 4 2 21 1
    5 19 2 1 21 1
    5 21 4 2 21 2
    3 24 7 2 21 1
    3 31 1 2 21 1
    3 26 7 2 97 1
    3 34 4 2 51 2
    2 24 4 2 21 1
    4 17 3 2 21 1
    4 24 8 1 97 1
    1 43 1 2 21 1
    5 24 7 1 21 1
    3 34 1 2 21 1
    3 19 6 2 21 1
    5 28 2 1 71 2
    1 34 2 2 21 1
    4 32 2 2 21 1
    2 15 8 2 97 1
    3 46 7 1 21 1
    4 47 3 2 21 2
    3 49 6 2 21 1
    3 45 5 2 21 1
    5 38 1 1 21 1
    3 16 6 2 21 1
    3 24 2 2 21 1
    3 17 7 1 21 1
    2 38 4 2 21 1
    5 49 8 1 14 1
    4 36 2 1 21 1
    5 33 7 1 21 1
    5 34 4 2 21 1
    4 30 6 1 21 2
    4 30 6 1 21 1
    3 18 7 2 21 1
    5 24 6 1 21 1
    3 23 3 2 21 1
    1 29 5 2 21 1
    3 22 6 1 21 2
    2 31 5 2 21 1
    3 40 4 2 21 1
    5 36 3 1 11 1
    2 32 7 2 21 1
    3 35 2 1 21 1
    2 22 6 1 21 1
    5 22 3 1 21 1
    2 26 1 2 21 1
    3 31 5 2 21 1
    1 24 7 2 21 2
    3 43 3 2 21 1
    3 28 2 2 21 1
    3 30 3 2 21 1
    4 48 1 2 21 1
    3 42 7 2 21 1
    3 40 2 2 21 1
    4 35 2 2 21 1
    2 40 6 2 21 1
    2 26 2 1 21 1
    3 34 6 1 21 1
    3 33 6 1 21 1
    3 34 7 1 21 1
    3 32 6 1 21 1
    1 19 5 2 97 1
    2 34 8 2 43 1
    4 43 7 1 21 1
    1 25 5 2 21 1
    2 25 4 2 21 1
    3 38 5 2 21 1
    3 21 2 2 21 1
    2 37 5 2 21 1
    4 21 6 1 21 1
    end

  • #2
    Code:
    local desired_levels 2 // OR HOWEVER MANY LEVELS YOU WANT
    
    unab original_vars : _all // TO LIST THE VARIABLES BEFORE obs_no IS ADDED
    gen long obs_no = _n  // TO NOTE ORIGINAL DATA ORDER
    
    local vbles_with_desired_levels
    foreach v of local original_vars {
        sort `v'
        gen long n_levels = sum(`v' != `v'[_n-1])
        if n_levels[_N] == `desired_levels' {
            local vbles_with_desired_levels `vbles_with_desired_levels' `v'
        }
        drop n_levels
    }
    
    sort obs_no // TO RESTORE ORIGINAL DATA ORDER
    
    display "`vbles_with_desired_levels'"
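
    As a possible alternative to the loop above, here is a minimal sketch that leaves the sort order untouched by reading the number of distinct values from r(r) after tabulate. It is not the code posted above, only a variation under stated assumptions: the toy variables a, b and c and the local name matches are hypothetical, and the missing option is used so that missing values count as a level, as they do in the sort-based loop.

```stata
* Toy data (hypothetical) so that the sketch runs on its own
clear
input byte(a b c)
1 1 1
2 2 1
3 1 2
4 2 2
5 1 .
end

local desired_levels 5   // hypothetical target; change as needed

local matches            // start with an empty list
foreach v of varlist _all {
    * tabulate fails on variables with too many distinct values,
    * so capture lets the loop skip those instead of stopping
    capture quietly tabulate `v', missing
    if _rc == 0 {
        * r(r) holds the number of rows (distinct values) tabulated
        if r(r) == `desired_levels' {
            local matches `matches' `v'
        }
    }
}

display "`matches'"
```

    To try it on the example data in #1, drop the toy input block and run the loop on that dataset instead; with desired_levels set to 5 it should list only v1.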



    • #3
      See also distinct from the Stata Journal, which can give you a single sorted table.



      • #4
        To the point of #3 here is a run on the data example in #1.


        Code:
        . distinct, min(5) max(5)
        
        ---------------------------
            |     total   distinct
        ----+----------------------
         v1 |       100          5
        ---------------------------
        distinct is profoundly empirical. It can't identify variables where 5 distinct values are possible but do not all occur.


        But if you have 5 distinct value labels defined, see findname from the Stata Journal as updated in 2020.

        dm0048_4: Finding variables. N. J. Cox. Stata Journal 15: 605; 12: 167; 10: 691; 10: 281–296.

        New options include columns() to find variables according to column position in the dataset, so, for example, columns(1 -1) finds the first and last variables in the dataset; and four new options to find variables with specified text in value labels, or specified numbers of value labels, either defined as value labels or used within the data. Options to do with value labels used within the data can be combined with if or in, or both.
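
        The difference between values that occur and value labels that are defined can be sketched with official commands only. This is not findname and does not use its options; it assumes that counting the mappings in each attached value label is what is wanted. The variables mood and sex, the labels mood5 and sex2, and the local names are hypothetical. The sketch relies on the extended macro function value label and on r(k), which label list returns as the number of mapped values.

```stata
* Toy data (hypothetical): mood has 5 defined labels but only 4 values occur
clear
input byte(mood sex)
1 1
2 2
4 1
5 2
end

label define mood5 1 "very unhappy" 2 "unhappy" 3 "neutral" 4 "happy" 5 "very happy"
label define sex2 1 "female" 2 "male"
label values mood mood5
label values sex sex2

local desired_labels 5   // hypothetical target; change as needed

local matches            // start with an empty list
foreach v of varlist _all {
    local lbl : value label `v'   // name of the attached value label, if any
    if "`lbl'" != "" {
        * capture guards against a label that is attached but not defined
        capture quietly label list `lbl'
        if _rc == 0 {
            * r(k) is the number of values mapped by this label
            if r(k) == `desired_labels' {
                local matches `matches' `v'
            }
        }
    }
}

display "`matches'"
```

        Here mood is picked up even though the value 3 never occurs in the data, which a count of observed distinct values would miss.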



        • #5
          Originally posted by Clyde Schechter
          Thanks so much, Clyde! Mighty solutions to mighty problems, as always! Happy new year!



          • #6
            Originally posted by Nick Cox
            Thanks so much, professor, for this wonderful package! It gives a perfect view of what I was looking for.
