  • Create lists of each unique subset of a list

    I have several lists containing variables that vary in length. For each of these lists, I want to create a new list for every possible unique (nonempty) subset.

    For example, for a list "a b c" I want to create the lists "a", "b", "c", "ab", "ac", "bc", "abc". This is easy for a small list, but I want this to be as general as possible, for lists at least up to length 6. I know that I could write a complicated nested loop in which I would "atomize" the original list and rebuild new lists for each size i up to the number of elements in the original list (and then this would not be a Stata question, but a general programming exercise). But maybe there is a better, more direct way to proceed, using a function in Stata that I am not aware of?

  • #2
    I don't know what you mean by length here as in Stata all variables have the same number of observations, although they can easily differ in the number of non-missing values.

    I will take a guess here and interpret this as meaning that you have up to k = 6 indicator variables and you're interested in looking systematically at the up to 2^k possible distinct (not unique) subsets of their combination, so that given variables a b c d e f then A is shorthand for a = 1 and all others 0, BC is shorthand for b = 1, c = 1 and all others 0, and so on.

    If so, then groups has been available for this purpose since 2003, although there is no point in seeking the original paper, and an otherwise unpredictable better search term is st0496:


    Code:
    . search st0496, entry
    
    Search of official help files, FAQs, Examples, SJs, and STBs
    
    SJ-18-1 st0496_1  . . . . . . . . . . . . . . . . . Software update for groups
            (help groups if installed)  . . . . . . . . . . . . . . . .  N. J. Cox
            Q1/18   SJ 18(1):291
            groups exited with an error message if weights were specified;
            this has been corrected
    
    SJ-17-3 st0496  . . . . .  Speaking Stata: Tables as lists: The groups command
            (help groups if installed)  . . . . . . . . . . . . . . . .  N. J. Cox
            Q3/17   SJ 17(3):760--773
            presents command for listing group frequencies and percents and
            cumulations thereof; for various subsetting and ordering by
            frequencies, percents, and so on; for reordering of columns;
            and for saving tabulated data to new datasets
    See also https://www.statalist.org/forums/for...updated-on-ssc

    Here is a demonstration:

    Code:
    clear
    set obs 1000 
    set seed 2803 
    
    tokenize a b c d 
    
    local prob 0.4 0.2 0.1 0.05
    
    forval j = 1/4 { 
        local p : word `j' of `prob' 
        gen ``j'' = runiform() < `p' 
    } 
    
    groups a b c d, order(high) sep(0) 
    
    groups a b c d, order(high) fillin sep(0)

    Code:
    . groups a b c d, order(high) sep(0) 
    
      +---------------------------------+
      | a   b   c   d   Freq.   Percent |
      |---------------------------------|
      | 0   0   0   0     407     40.70 |
      | 1   0   0   0     274     27.40 |
      | 0   1   0   0     114     11.40 |
      | 1   1   0   0      62      6.20 |
      | 0   0   1   0      47      4.70 |
      | 1   0   1   0      39      3.90 |
      | 0   0   0   1      23      2.30 |
      | 0   1   1   0      12      1.20 |
      | 1   1   1   0       7      0.70 |
      | 1   0   0   1       6      0.60 |
      | 0   1   0   1       4      0.40 |
      | 1   0   1   1       2      0.20 |
      | 1   1   0   1       2      0.20 |
      | 0   0   1   1       1      0.10 |
      +---------------------------------+
    
    . 
    . groups a b c d, order(high) fillin sep(0) 
    
      +---------------------------------+
      | a   b   c   d   Freq.   Percent |
      |---------------------------------|
      | 0   0   0   0     407     40.70 |
      | 1   0   0   0     274     27.40 |
      | 0   1   0   0     114     11.40 |
      | 1   1   0   0      62      6.20 |
      | 0   0   1   0      47      4.70 |
      | 1   0   1   0      39      3.90 |
      | 0   0   0   1      23      2.30 |
      | 0   1   1   0      12      1.20 |
      | 1   1   1   0       7      0.70 |
      | 1   0   0   1       6      0.60 |
      | 0   1   0   1       4      0.40 |
      | 1   0   1   1       2      0.20 |
      | 1   1   0   1       2      0.20 |
      | 0   0   1   1       1      0.10 |
      | 0   1   1   1       0      0.00 |
      | 1   1   1   1       0      0.00 |
      +---------------------------------+
    Tim Morris and I have a project on visualization of such data and are close to posting code on SSC.



    • #3
      Thank you very much for your reply! I believe I might not have stated my problem clearly, so I will try it again with some more context:

      When I wrote of length, I meant the length of the list / macro and not the specific variable referenced, e.g. the macros

      Code:
      local varlist a b c d
      local varlist a b c de
      both have the length "4".

      Now, I want to generate one new macro / variable list for each possible (distinct) subset. For the list above, that means 15 new macros (2^4 = 16 subsets minus the empty set). If done by hand, this would look something like this:

      Code:
      local varlist1_a a
      local varlist1_b b
      ...
      local varlist3_abc a b c
      local varlist3_abd a b d
      ...
      local varlist4_abcd a b c d

      However, the more I think about this, the more I become convinced that I have to write that nested loop.
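
      For reference, the loop described above can be sketched without deep nesting by counting from 1 to 2^k - 1 and reading each number as a bit pattern: bit j set means the j-th element is in the subset. This is only a minimal sketch under my own assumptions; the macro names subset1, subset2, ... are hypothetical and differ from the varlist3_abc style shown above.

      ```stata
      * Sketch: one local macro per nonempty subset of a macro list.
      * The names subset1, subset2, ... are illustrative, not from the thread.
      local varlist a b c d
      local k : word count `varlist'
      local nsub = 2^`k' - 1

      forvalues i = 1/`nsub' {
          local subset`i'                      // start from an empty macro
          forvalues j = 1/`k' {
              * bit j of i is set when element j belongs to subset i
              if mod(floor(`i' / 2^(`j' - 1)), 2) {
                  local w : word `j' of `varlist'
                  local subset`i' `subset`i'' `w'
              }
          }
          display "subset`i': `subset`i''"
      }
      ```

      With four elements this defines 15 macros. The subsets come out in binary-counter order rather than grouped by size, so sorting by size would need an extra step.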



      • #4
        No need to write new code here. The package -tuples-, a community-contributed package of which Nick Cox is a contributing author, can produce variable lists as desired here. See -ssc describe tuples-.
        Code:
        sysuse auto
        tuples price-weight, display min(1) max(6)
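
        To use the results rather than just display them, a loop along the following lines should work. It is a sketch that assumes, as I read the tuples help file, that the command leaves the lists in local macros tuple1, tuple2, ... and their number in local macro ntuples; please check the help for the version you install.

        ```stata
        * Sketch: loop over the lists produced by tuples (from SSC).
        * Assumes -ssc install tuples- has been run beforehand.
        sysuse auto, clear
        tuples price mpg weight, min(1)

        * tuple1, tuple2, ... hold the lists; ntuples holds how many there are
        forvalues i = 1/`ntuples' {
            display "tuple `i': `tuple`i''"
        }
        ```

        Inside the loop, `tuple`i'' can be passed to any command that takes a variable list, which is usually the point of generating the subsets in the first place.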



        • #5
          Oh, fine: it is the lists that vary in length, not the variables. That wasn't unclear on your part; I just mapped it mentally to a current project of mine. Your real problem too is a problem I've looked at, but the baton has long since passed to younger, faster runners.

          Code:
          ssc desc tuples



          • #6
            I have just finished writing the code for n_max = 4, thank you for saving me maybe hours of work! On the bright side, I've learned a lot about loops in Stata...
