groups updated on SSC

Nick Cox

Join Date: Mar 2014
Posts: 35651

groups updated on SSC

23 Jul 2017, 05:21

Thanks as always to Kit Baum, an update of groups is available on SSC.

groups is a basic command for listing group frequencies and percents. An early version was
described briefly in Stata Journal 3: 420--439 (2003)

The main underlying idea is that many tables are easily and helpfully presented as lists.
Specifically, the list command is a convenient engine for producing tables in list form.
That command underlies groups. The idea also deserves a good push for other applications
you may have.

More than that: the summary in the opening sentence understates what groups can do. You
can also suppress any display of frequencies and percents. That being so, groups can be
used for almost any table. You may need to calculate beforehand what you want to tabulate,
but that is generally true. All two-way and higher tabulations are reduced to one-way
listings. The argument is that this constraint on layout is not so much of a loss as at
first may appear.

Some simple examples will give the flavour.

Code:

. sysuse auto, clear

First we look at one-way tables. The resemblance between the results of groups foreign and
of tabulate foreign will be clear:

Code:

. groups foreign

  +-------------------------------------+
  |  foreign   Freq.   Percent     % <= |
  |-------------------------------------|
  | Domestic      52     70.27    70.27 |
  |  Foreign      22     29.73   100.00 |
  +-------------------------------------+

. tabulate foreign

   Car type |      Freq.     Percent        Cum.
------------+-----------------------------------
   Domestic |         52       70.27       70.27
    Foreign |         22       29.73      100.00
------------+-----------------------------------
      Total |         74      100.00

When we turn to two-way tables, some key differences appear.

Code:

. groups foreign rep78

  +------------------------------------+
  |  foreign   rep78   Freq.   Percent |
  |------------------------------------|
  | Domestic       1       2      2.90 |
  | Domestic       2       8     11.59 |
  | Domestic       3      27     39.13 |
  | Domestic       4       9     13.04 |
  | Domestic       5       2      2.90 |
  |------------------------------------|
  |  Foreign       3       3      4.35 |
  |  Foreign       4       9     13.04 |
  |  Foreign       5       9     13.04 |
  +------------------------------------+

. tabulate foreign rep78

           |             Repair Record 1978
  Car type |         1          2          3          4 |     Total
-----------+--------------------------------------------+----------
  Domestic |         2          8         27          9 |        48
   Foreign |         0          0          3          9 |        21
-----------+--------------------------------------------+----------
     Total |         2          8         30         18 |        69


           |   Repair
           |   Record
           |    1978
  Car type |         5 |     Total
-----------+-----------+----------
  Domestic |         2 |        48
   Foreign |         9 |        21
-----------+-----------+----------
     Total |        11 |        69

The leading idea for two-way tables is simple. Instead of a layout of rows by columns
defining cells inside the table, groups gives you a listing in which each item is

row identifier, column identifier, cell variable

The example just given was chosen to make a point. Even with a simple table of 2 rows and
5 columns, tabulate is wrapping output awkwardly. Trivially in this case, you could just
swap variables and have 5 rows and 2 columns. But it is easy to get tables that are
awkward both ways. The major problem is too many columns for comfort. (Too many rows for
comfort is awkward with any command.)

The problem of space is usually compounded with three-way and higher tables. Even if
there is enough space, the sparsity (many zeroes) of some tables often makes other kinds
of tabulation attractive. Broadly, it is with such tables with three or more variables
that groups does especially well. A list structure is clearly general enough to extend to

one or more identifiers, one or more cell variables

By default, groups will not list cross-combinations of two or more identifiers that are
not present in the data. Note how in the last groups output, there are no lines for cars
that are foreign with repair record 1 or 2: as is explicit in the tabulate output, there
are no such cars. Conversely, a fillin option is available:

.

Code:

 groups foreign rep78, fillin

  +------------------------------------+
  |  foreign   rep78   Freq.   Percent |
  |------------------------------------|
  | Domestic       1       2      2.90 |
  | Domestic       2       8     11.59 |
  | Domestic       3      27     39.13 |
  | Domestic       4       9     13.04 |
  | Domestic       5       2      2.90 |
  |------------------------------------|
  |  Foreign       1       0      0.00 |
  |  Foreign       2       0      0.00 |
  |  Foreign       3       3      4.35 |
  |  Foreign       4       9     13.04 |
  |  Foreign       5       9     13.04 |
  +------------------------------------+

The default for groups is that percents are calculated with reference to all observations
analysed. Thus the 2.90% of observations that are Domestic and have repair record 1 is
2/69 or 2.90% of the observations that have non-missing values on foreign and rep78. So
the percents have as base or denominator the total frequency of all observations reported
in the last table.

In turn you can override the default, and there are two ways to do that.

First: groups can be issued under the aegis of by:, which implies output as a series of
separate tables, and also that percents are calculated separately for each table.

Code:

. bysort foreign: groups rep78

------------------------------------------------------------------------------
-> foreign = Domestic

  +----------------------------------+
  | rep78   Freq.   Percent      %<= |
  |----------------------------------|
  |     1       2      4.17     4.17 |
  |     2       8     16.67    20.83 |
  |     3      27     56.25    77.08 |
  |     4       9     18.75    95.83 |
  |     5       2      4.17   100.00 |
  +----------------------------------+

------------------------------------------------------------------------------
-> foreign = Foreign

  +----------------------------------+
  | rep78   Freq.   Percent      %<= |
  |----------------------------------|
  |     3       3     14.29    14.29 |
  |     4       9     42.86    57.14 |
  |     5       9     42.86   100.00 |
  +----------------------------------+

Using by: gives a clear separation of subtables. For some tastes or purposes, the
downside is taking up more space. That leads to the next way to get separate calculation
of percents.

Second: use the percent() option.

Code:

. groups foreign rep78, percent(foreign)

  +------------------------------------+
  |  foreign   rep78   Freq.   Percent |
  |------------------------------------|
  | Domestic       1       2      4.17 |
  | Domestic       2       8     16.67 |
  | Domestic       3      27     56.25 |
  | Domestic       4       9     18.75 |
  | Domestic       5       2      4.17 |
  |------------------------------------|
  |  Foreign       3       3     14.29 |
  |  Foreign       4       9     42.86 |
  |  Foreign       5       9     42.86 |
  +------------------------------------+

The percent() option lets you specify the variable or variables that determine groups for
calculation of percents.

The frequencies and percents shown by default are

1. frequencies and percents for one or more variables in varlist

2. cumulative percents for one variable in varlist.

The surmise is that cumulatives are rather more arbitrary with two or more variables,
being necessarily dependent on the order of variables. That is not the law, however, and
a show() option allows you to have none or one or two or three of those.

.

Code:

 groups foreign rep78, percent(foreign) show(f p P)

  +---------------------------------------------+
  |  foreign   rep78   Freq.   Percent      %<= |
  |---------------------------------------------|
  | Domestic       1       2      4.17     4.17 |
  | Domestic       2       8     16.67    20.83 |
  | Domestic       3      27     56.25    77.08 |
  | Domestic       4       9     18.75    95.83 |
  | Domestic       5       2      4.17   100.00 |
  |---------------------------------------------|
  |  Foreign       3       3     14.29    14.29 |
  |  Foreign       4       9     42.86    57.14 |
  |  Foreign       5       9     42.86   100.00 |
  +---------------------------------------------+

Indeed, cumulative frequencies are also available on request.

.

Code:

 groups mpg, show(f F)

  +-------------------+
  | mpg   Freq.   #<= |
  |-------------------|
  |  12       2     2 |
  |  14       6     8 |
  |  15       2    10 |
  |  16       4    14 |
  |  17       4    18 |
  |-------------------|
  |  18       9    27 |
  |  19       8    35 |
  |  20       3    38 |
  |  21       5    43 |
  |  22       5    48 |
  |-------------------|
  |  23       3    51 |
  |  24       4    55 |
  |  25       5    60 |
  |  26       3    63 |
  |  28       3    66 |
  |-------------------|
  |  29       1    67 |
  |  30       2    69 |
  |  31       1    70 |
  |  34       1    71 |
  |  35       2    73 |
  |-------------------|
  |  41       1    74 |
  +-------------------+

Here f stands for frequency and F stands for cumulative frequency. In addition, reverse
cumulatives (# or % > value rather than # or % <= value) are also available. This is not a
complete list, but I should mention show(none), which is one of the most useful choices:
more on that in a moment.

A further option select() lets you select which groups are to be listed, for example by a
condition on the frequencies. select(f == 1) selects those groups that occur precisely
once, in which case there is no need to see a frequency of column of 1s, and the percents
and cumulative percents are possibly of no use or interest:

Code:

. groups mpg, select(f == 1) show(none)

  +-----+
  | mpg |
  |-----|
  |  29 |
  |  31 |
  |  34 |
  |  41 |
  +-----+

The select() option can be used in another way. select(5)
says: list just the first five of the groups which would otherwise have
been listed. By default, with just one variable specified, that is just
the five lowest groups of values of the variable. Each group,
naturally, could occur more than once:

Code:

. groups mpg, select(5)

  +-------------------------------+
  | mpg   Freq.   Percent     %<= |
  |-------------------------------|
  |  12       2      2.70    2.70 |
  |  14       6      8.11   10.81 |
  |  15       2      2.70   13.51 |
  |  16       4      5.41   18.92 |
  |  17       4      5.41   24.32 |
  +-------------------------------+

You can guess that select(-5) starts at the other end:

Code:

. groups mpg, select(-5)

  +--------------------------------+
  | mpg   Freq.   Percent      %<= |
  |--------------------------------|
  |  30       2      2.70    93.24 |
  |  31       1      1.35    94.59 |
  |  34       1      1.35    95.95 |
  |  35       2      2.70    98.65 |
  |  41       1      1.35   100.00 |
  +--------------------------------+

So these commands give you pictures of the tails of a distribution. (For single variables,
extremes on SSC offers another way to do something similar.)

You can specify order(high) or order(low). In either case, that option specifies listing
in order of the frequencies, not the values of the variables in each group. In the first
case, select(5) gives you the 5 groups which are most frequent. That gives you a stab at
showing especially common values, otherwise often called modes.

Code:

. groups mpg, select(5) order(h)

  +-------------------------------+
  | mpg   Freq.   Percent     %<= |
  |-------------------------------|
  |  18       9     12.16   12.16 |
  |  19       8     10.81   22.97 |
  |  14       6      8.11   31.08 |
  |  21       5      6.76   37.84 |
  |  22       5      6.76   44.59 |
  +-------------------------------+

If you specify fillin with two or more variables, cross-combinations with zero
frequencies are shown explicitly. These are in cells that would be shown by 0s by
tabulate or by blanks by table. select()ing zeroes gives you a listing of the cells not
present in your dataset. As such values do not exist in the dataset, we do not need to be
told that their frequencies are all zero, so show(none) can be specified.

Code:

. groups foreign rep78, fillin select(f == 0) show(none)

  +-----------------+
  | foreign   rep78 |
  |-----------------|
  | Foreign       1 |
  | Foreign       2 |
  +-----------------+

groups is just sitting on the shoulders of the giant list, so there are several ways to
tweak appearances. Here is one:

Code:

. groups foreign rep78, sepby(foreign)

  +------------------------------------+
  |  foreign   rep78   Freq.   Percent |
  |------------------------------------|
  | Domestic       1       2      2.90 |
  | Domestic       2       8     11.59 |
  | Domestic       3      27     39.13 |
  | Domestic       4       9     13.04 |
  | Domestic       5       2      2.90 |
  |------------------------------------|
  |  Foreign       3       3      4.35 |
  |  Foreign       4       9     13.04 |
  |  Foreign       5       9     13.04 |
  +------------------------------------+

We did get the same appearance earlier, but that was just fortuitous. The default of
list separating every 5 lines happened to give a sensible answer.

groups has further unusual options for table commands. colorder() reorders columns in the
table from what it would have been. Typically this is an option for the second or later
pass at tabulation. Suppose we want to highlight which combinations of categories are
most common.

To impart a little variety, we will look at a different dataset.

Code:

. webuse nlswork
(National Longitudinal Survey.  Young Women 14-26 years of age in 1968)

. groups collgrad not_smsa c_city south, order(high) sep(0)

  +--------------------------------------------------------+
  | collgrad   not_smsa   c_city   south   Freq.   Percent |
  |--------------------------------------------------------|
  |        0          0        0       0    5742     20.13 |
  |        0          0        1       0    4941     17.32 |
  |        0          1        0       1    3982     13.96 |
  |        0          0        1       1    3455     12.11 |
  |        0          1        0       0    3086     10.82 |
  |        0          0        0       1    2527      8.86 |
  |        1          0        0       0    1412      4.95 |
  |        1          0        1       0    1096      3.84 |
  |        1          0        1       1     698      2.45 |
  |        1          0        0       1     598      2.10 |
  |        1          1        0       0     566      1.98 |
  |        1          1        0       1     423      1.48 |
  +--------------------------------------------------------+

We might want the frequencies and percents to be more prominent. Say that existing columns
5 and 6 should be moved to the left. It is sufficient to say colorder(5 6). Specifying 5
6 is equivalent to specifying 5 6 1 2 3 4.

Code:

. groups collgrad not_smsa c_city south, order(high) sep(0) colorder(5 6)

  +--------------------------------------------------------+
  | Freq.   Percent   collgrad   not_smsa   c_city   south |
  |--------------------------------------------------------|
  |  5742     20.13          0          0        0       0 |
  |  4941     17.32          0          0        1       0 |
  |  3982     13.96          0          1        0       1 |
  |  3455     12.11          0          0        1       1 |
  |  3086     10.82          0          1        0       0 |
  |  2527      8.86          0          0        0       1 |
  |  1412      4.95          1          0        0       0 |
  |  1096      3.84          1          0        1       0 |
  |   698      2.45          1          0        1       1 |
  |   598      2.10          1          0        0       1 |
  |   566      1.98          1          1        0       0 |
  |   423      1.48          1          1        0       1 |
  +--------------------------------------------------------+

You may have noticed the list option sep(0) being used just now to switch off separator
lines.

Another option that is unusual is saving(), which saves the data being tabulated to a new
dataset. In essence (and minor details aside), they are data, at least temporarily, as
otherwise list would not be able to show them. That should help in tabulation in forms
not supported by groups, for graphics or other analyses, or in export to other software.

There's more, but this post is already long, so I'll just point to the help. A longer version of this
will appear in the Stata Journal.

Tags: None

Nick Cox

Join Date: Mar 2014

Posts: 35651
#2

17 Jan 2018, 17:21

The Stata Journal version appeared in

SJ-17-3 st0496 . . . . . Speaking Stata: Tables as lists: The groups command
(help groups if installed) . . . . . . . . . . . . . . . . N. J. Cox
Q3/17 SJ 17(3):760--773
presents command for listing group frequencies and percents and
cumulations thereof; for various subsetting and ordering by
frequencies, percents, and so on; for reordering of columns;
and for saving tabulated data to new datasets
3 likes
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35651
#3

24 Feb 2018, 10:53

Bug fix announcement:

Thanks as always to Kit Baum, a fixed version is now up on SSC.

Code:

ssc type groups.ado *! NJC 1.4.1 20 February 2018 * NJC 1.4.0 30 June 2017

The revision will also be announced via Stata Journal 18(1), the next issue.

This arose only when weights were specified. Users would have seen an error message, so no one would have got incorrect results.

Thanks to Fahim Ahmad Yousufzai for a bug report.
Comment

Announcement

groups updated on SSC

Comment

Comment