Create lists of each unique subset of a list

Alexander Busch

Join Date: Oct 2022

Posts: 17
#1

26 Oct 2022, 07:10

I have several lists containing variables that vary in length. For each of these lists, I want to create a new list for every possible unique (nonempty) subset.

For example, for a list "a b c" I want to create the lists "a", "b", "c", "ab", "ac", "bc", "abc". This is easy for a small list, but I want this to be as general as possible, for lists at least up to length 6. I know that I could write a complicated nested loop in which I would "atomize" the original list and rebuild new lists for each i <= "number of elements in original list" (and then this would not be a Stata question, but a general programming exercise), but maybe there is a better, more direct way to proceed here with a function in Stata that I am not aware of?

Nick Cox

Join Date: Mar 2014
Posts: 35724

26 Oct 2022, 07:35

I don't know what you mean by length here as in Stata all variables have the same number of observations, although they can easily differ in the number of non-missing values.

I will take a guess here and interpret this as meaning that you have up to k = 6 indicator variables and you're interested in looking systematically at the up to 2^k possible distinct (not unique) subsets of their combination, so that given variables a b c d e f then A is shorthand for a = 1 and all others 0, and BC is shorthand for b = 1, c = 1 and all others 0, and so on.

If so, then groups has been available for this purpose since 2003, although there is no point in seeking the original paper, and an otherwise unpredictable better search term is

Code:

. search st0496, entry

Search of official help files, FAQs, Examples, SJs, and STBs

SJ-18-1 st0496_1  . . . . . . . . . . . . . . . . . Software update for groups
        (help groups if installed)  . . . . . . . . . . . . . . . .  N. J. Cox
        Q1/18   SJ 18(1):291
        groups exited with an error message if weights were specified;
        this has been corrected

SJ-17-3 st0496  . . . . .  Speaking Stata: Tables as lists: The groups command
        (help groups if installed)  . . . . . . . . . . . . . . . .  N. J. Cox
        Q3/17   SJ 17(3):760--773
        presents command for listing group frequencies and percents and
        cumulations thereof; for various subsetting and ordering by
        frequencies, percents, and so on; for reordering of columns;
        and for saving tabulated data to new datasets

See also https://www.statalist.org/forums/for...updated-on-ssc

Here is a demonstration:

Code:

clear
set obs 1000 
set seed 2803 

tokenize a b c d 

local prob 0.4 0.2 0.1 0.05

forval j = 1/4 { 
    local p : word `j' of `prob' 
    gen ``j'' = runiform() < `p' 
} 

groups a b c d, order(high) sep(0) 

groups a b c d, order(high) fillin sep(0)

Code:

. groups a b c d, order(high) sep(0) 

  +---------------------------------+
  | a   b   c   d   Freq.   Percent |
  |---------------------------------|
  | 0   0   0   0     407     40.70 |
  | 1   0   0   0     274     27.40 |
  | 0   1   0   0     114     11.40 |
  | 1   1   0   0      62      6.20 |
  | 0   0   1   0      47      4.70 |
  | 1   0   1   0      39      3.90 |
  | 0   0   0   1      23      2.30 |
  | 0   1   1   0      12      1.20 |
  | 1   1   1   0       7      0.70 |
  | 1   0   0   1       6      0.60 |
  | 0   1   0   1       4      0.40 |
  | 1   0   1   1       2      0.20 |
  | 1   1   0   1       2      0.20 |
  | 0   0   1   1       1      0.10 |
  +---------------------------------+

. 
. groups a b c d, order(high) fillin sep(0) 

  +---------------------------------+
  | a   b   c   d   Freq.   Percent |
  |---------------------------------|
  | 0   0   0   0     407     40.70 |
  | 1   0   0   0     274     27.40 |
  | 0   1   0   0     114     11.40 |
  | 1   1   0   0      62      6.20 |
  | 0   0   1   0      47      4.70 |
  | 1   0   1   0      39      3.90 |
  | 0   0   0   1      23      2.30 |
  | 0   1   1   0      12      1.20 |
  | 1   1   1   0       7      0.70 |
  | 1   0   0   1       6      0.60 |
  | 0   1   0   1       4      0.40 |
  | 1   0   1   1       2      0.20 |
  | 1   1   0   1       2      0.20 |
  | 0   0   1   1       1      0.10 |
  | 0   1   1   1       0      0.00 |
  | 1   1   1   1       0      0.00 |
  +---------------------------------+

Tim Morris and I have a project on visualization of such data and are close to posting code on SSC.

Alexander Busch

Join Date: Oct 2022

Posts: 17
#3

26 Oct 2022, 08:17

Thank you very much for your reply! I believe I might not have stated my problem clearly, so I will try it again with some more context:

When I wrote of length, I meant the length of the list / macro and not the specific variable referenced, e.g. the macros

Code:

local varlist a b c d
local varlist a b c de

both have the length "4".

Now, I want to generate one new macro / variable list for each possible (distinct) subset. As an outcome, I want to have - in the case of the list above - 15 new macros (16 - empty set). If done by hand, this would look something like this:

Code:

local varlist1_a a
local varlist1_b b
...
local varlist3_abc a b c
local varlist3_abd a b d
...
local varlist4_abcd a b c d

However, the more I think about this, the more I become convinced that I have to write that nested loop.
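
For reference, the nesting depth does not have to grow with the list length. A minimal sketch in base Stata, assuming it is acceptable to number the subsets rather than name them by their contents: each integer from 1 to 2^n - 1 is read as a bit pattern, and bit j decides whether word j of the list is included. The macro names n, nsub and subset1, subset2, ... are made up for this illustration.

```stata
* Illustrative list; any list of distinct words works
local varlist a b c d

* Number of words in the list and number of nonempty subsets
local n : word count `varlist'
local nsub = 2^`n' - 1

forvalues m = 1/`nsub' {
    * Start each subset macro empty
    local subset`m'
    forvalues j = 1/`n' {
        * Bit j of m is set when floor(m / 2^(j-1)) is odd
        if mod(floor(`m' / 2^(`j' - 1)), 2) {
            local w : word `j' of `varlist'
            local subset`m' `subset`m'' `w'
        }
    }
    display "subset `m': `subset`m''"
}
```

With the list above this leaves 15 local macros, for example subset5 holding "a c" and subset15 holding "a b c d". The same two loops work unchanged for a list of length 6, which gives 63 subsets.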
Mike Lacy

Join Date: Apr 2014

Posts: 2417
#4

26 Oct 2022, 08:27

No need to write new code here. The package -tuples-, a community-contributed package of which Nick Cox is a contributing author, can produce variable lists as desired here. See -ssc describe tuples-.

Code:

sysuse auto
tuples price-weight, display min(1) max(6)
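
A short sketch of how the result might then be used, assuming -tuples- has been installed (ssc install tuples) and that, as described in its help file, it leaves the subsets in local macros tuple1, tuple2, ... with their number in ntuples. The three variables chosen here are only for illustration.

```stata
sysuse auto, clear

* All nonempty subsets of three variables: 2^3 - 1 = 7 tuples
tuples price mpg rep78

* Loop over the local macros that tuples leaves behind
forvalues i = 1/`ntuples' {
    display "tuple `i': `tuple`i''"
}
```

Inside the loop, `tuple`i'' can be passed to any command that takes a variable list. The local macros only exist in the do-file or program that called -tuples-, so the loop has to run in the same place.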
Nick Cox

Join Date: Mar 2014

Posts: 35724
#5

26 Oct 2022, 08:29

Oh, fine: it is the lists that vary in length, not the variables. That wasn't unclear on your part; I just mapped it mentally to a current project of mine. Your real problem too is a problem I've looked at, but the baton has long since passed to younger, faster runners.

Code:

ssc desc tuples
Alexander Busch

Join Date: Oct 2022

Posts: 17
#6

26 Oct 2022, 09:00

I have just finished writing the code for n_max = 4, thank you for saving me maybe hours of work! On the bright side, I've learned a lot about loops in Stata...