Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • groups updated on SSC

    Thanks as always to Kit Baum, an update of groups is available on SSC.

    groups is a basic command for listing group frequencies and percents. An early version was
    described briefly in Stata Journal 3: 420--439 (2003)

    The main underlying idea is that many tables are easily and helpfully presented as lists.
    Specifically, the list command is a convenient engine for producing tables in list form.
    That command underlies groups. The idea also deserves a good push for other applications
    you may have.

    More than that: the summary in the opening sentence understates what groups can do. You
    can also suppress any display of frequencies and percents. That being so, groups can be
    used for almost any table. You may need to calculate beforehand what you want to tabulate,
    but that is generally true. All two-way and higher tabulations are reduced to one-way
    listings. The argument is that this constraint on layout is not so much of a loss as at
    first may appear.

    Some simple examples will give the flavour.

    Code:
    . sysuse auto, clear
    First we look at one-way tables. The resemblance between the results of groups foreign and
    of tabulate foreign will be clear:

    Code:
    . groups foreign
    
      +-------------------------------------+
      |  foreign   Freq.   Percent     % <= |
      |-------------------------------------|
      | Domestic      52     70.27    70.27 |
      |  Foreign      22     29.73   100.00 |
      +-------------------------------------+
    
    . tabulate foreign
    
       Car type |      Freq.     Percent        Cum.
    ------------+-----------------------------------
       Domestic |         52       70.27       70.27
        Foreign |         22       29.73      100.00
    ------------+-----------------------------------
          Total |         74      100.00
    When we turn to two-way tables, some key differences appear.

    Code:
    . groups foreign rep78
    
      +------------------------------------+
      |  foreign   rep78   Freq.   Percent |
      |------------------------------------|
      | Domestic       1       2      2.90 |
      | Domestic       2       8     11.59 |
      | Domestic       3      27     39.13 |
      | Domestic       4       9     13.04 |
      | Domestic       5       2      2.90 |
      |------------------------------------|
      |  Foreign       3       3      4.35 |
      |  Foreign       4       9     13.04 |
      |  Foreign       5       9     13.04 |
      +------------------------------------+
    
    . tabulate foreign rep78
    
               |             Repair Record 1978
      Car type |         1          2          3          4 |     Total
    -----------+--------------------------------------------+----------
      Domestic |         2          8         27          9 |        48
       Foreign |         0          0          3          9 |        21
    -----------+--------------------------------------------+----------
         Total |         2          8         30         18 |        69
    
    
               |   Repair
               |   Record
               |    1978
      Car type |         5 |     Total
    -----------+-----------+----------
      Domestic |         2 |        48
       Foreign |         9 |        21
    -----------+-----------+----------
         Total |        11 |        69
    The leading idea for two-way tables is simple. Instead of a layout of rows by columns
    defining cells inside the table, groups gives you a listing in which each item is

    row identifier, column identifier, cell variable

    The example just given was chosen to make a point. Even with a simple table of 2 rows and
    5 columns, tabulate is wrapping output awkwardly. Trivially in this case, you could just
    swap variables and have 5 rows and 2 columns. But it is easy to get tables that are
    awkward both ways. The major problem is too many columns for comfort. (Too many rows for
    comfort is awkward with any command.)

    The problem of space is usually compounded with three-way and higher tables. Even if
    there is enough space, the sparsity (many zeroes) of some tables often makes other kinds
    of tabulation attractive. Broadly, it is with such tables with three or more variables
    that groups does especially well. A list structure is clearly general enough to extend to

    one or more identifiers, one or more cell variables

    By default, groups will not list cross-combinations of two or more identifiers that are
    not present in the data. Note how in the last groups output, there are no lines for cars
    that are foreign with repair record 1 or 2: as is explicit in the tabulate output, there
    are no such cars. Conversely, a fillin option is available:

    .
    Code:
     groups foreign rep78, fillin
    
      +------------------------------------+
      |  foreign   rep78   Freq.   Percent |
      |------------------------------------|
      | Domestic       1       2      2.90 |
      | Domestic       2       8     11.59 |
      | Domestic       3      27     39.13 |
      | Domestic       4       9     13.04 |
      | Domestic       5       2      2.90 |
      |------------------------------------|
      |  Foreign       1       0      0.00 |
      |  Foreign       2       0      0.00 |
      |  Foreign       3       3      4.35 |
      |  Foreign       4       9     13.04 |
      |  Foreign       5       9     13.04 |
      +------------------------------------+
    The default for groups is that percents are calculated with reference to all observations
    analysed. Thus the 2.90% of observations that are Domestic and have repair record 1 is
    2/69 or 2.90% of the observations that have non-missing values on foreign and rep78. So
    the percents have as base or denominator the total frequency of all observations reported
    in the last table.

    In turn you can override the default, and there are two ways to do that.

    First: groups can be issued under the aegis of by:, which implies output as a series of
    separate tables, and also that percents are calculated separately for each table.

    Code:
    . bysort foreign: groups rep78
    
    ------------------------------------------------------------------------------
    -> foreign = Domestic
    
      +----------------------------------+
      | rep78   Freq.   Percent      %<= |
      |----------------------------------|
      |     1       2      4.17     4.17 |
      |     2       8     16.67    20.83 |
      |     3      27     56.25    77.08 |
      |     4       9     18.75    95.83 |
      |     5       2      4.17   100.00 |
      +----------------------------------+
    
    ------------------------------------------------------------------------------
    -> foreign = Foreign
    
      +----------------------------------+
      | rep78   Freq.   Percent      %<= |
      |----------------------------------|
      |     3       3     14.29    14.29 |
      |     4       9     42.86    57.14 |
      |     5       9     42.86   100.00 |
      +----------------------------------+
    Using by: gives a clear separation of subtables. For some tastes or purposes, the
    downside is taking up more space. That leads to the next way to get separate calculation
    of percents.

    Second: use the percent() option.

    Code:
    . groups foreign rep78, percent(foreign)
    
      +------------------------------------+
      |  foreign   rep78   Freq.   Percent |
      |------------------------------------|
      | Domestic       1       2      4.17 |
      | Domestic       2       8     16.67 |
      | Domestic       3      27     56.25 |
      | Domestic       4       9     18.75 |
      | Domestic       5       2      4.17 |
      |------------------------------------|
      |  Foreign       3       3     14.29 |
      |  Foreign       4       9     42.86 |
      |  Foreign       5       9     42.86 |
      +------------------------------------+
    The percent() option lets you specify the variable or variables that determine groups for
    calculation of percents.

    The frequencies and percents shown by default are

    1. frequencies and percents for one or more variables in varlist

    2. cumulative percents for one variable in varlist.

    The surmise is that cumulatives are rather more arbitrary with two or more variables,
    being necessarily dependent on the order of variables. That is not the law, however, and
    a show() option allows you to have none or one or two or three of those.

    .
    Code:
     groups foreign rep78, percent(foreign) show(f p P)
    
      +---------------------------------------------+
      |  foreign   rep78   Freq.   Percent      %<= |
      |---------------------------------------------|
      | Domestic       1       2      4.17     4.17 |
      | Domestic       2       8     16.67    20.83 |
      | Domestic       3      27     56.25    77.08 |
      | Domestic       4       9     18.75    95.83 |
      | Domestic       5       2      4.17   100.00 |
      |---------------------------------------------|
      |  Foreign       3       3     14.29    14.29 |
      |  Foreign       4       9     42.86    57.14 |
      |  Foreign       5       9     42.86   100.00 |
      +---------------------------------------------+
    Indeed, cumulative frequencies are also available on request.

    .
    Code:
     groups mpg, show(f F)
    
      +-------------------+
      | mpg   Freq.   #<= |
      |-------------------|
      |  12       2     2 |
      |  14       6     8 |
      |  15       2    10 |
      |  16       4    14 |
      |  17       4    18 |
      |-------------------|
      |  18       9    27 |
      |  19       8    35 |
      |  20       3    38 |
      |  21       5    43 |
      |  22       5    48 |
      |-------------------|
      |  23       3    51 |
      |  24       4    55 |
      |  25       5    60 |
      |  26       3    63 |
      |  28       3    66 |
      |-------------------|
      |  29       1    67 |
      |  30       2    69 |
      |  31       1    70 |
      |  34       1    71 |
      |  35       2    73 |
      |-------------------|
      |  41       1    74 |
      +-------------------+
    Here f stands for frequency and F stands for cumulative frequency. In addition, reverse
    cumulatives (# or % > value rather than # or % <= value) are also available. This is not a
    complete list, but I should mention show(none), which is one of the most useful choices:
    more on that in a moment.

    A further option select() lets you select which groups are to be listed, for example by a
    condition on the frequencies. select(f == 1) selects those groups that occur precisely
    once, in which case there is no need to see a frequency of column of 1s, and the percents
    and cumulative percents are possibly of no use or interest:

    Code:
    . groups mpg, select(f == 1) show(none)
    
      +-----+
      | mpg |
      |-----|
      |  29 |
      |  31 |
      |  34 |
      |  41 |
      +-----+
    The select() option can be used in another way. select(5)
    says: list just the first five of the groups which would otherwise have
    been listed. By default, with just one variable specified, that is just
    the five lowest groups of values of the variable. Each group,
    naturally, could occur more than once:

    Code:
    . groups mpg, select(5)
    
      +-------------------------------+
      | mpg   Freq.   Percent     %<= |
      |-------------------------------|
      |  12       2      2.70    2.70 |
      |  14       6      8.11   10.81 |
      |  15       2      2.70   13.51 |
      |  16       4      5.41   18.92 |
      |  17       4      5.41   24.32 |
      +-------------------------------+
    You can guess that select(-5) starts at the other end:
    Code:
    . groups mpg, select(-5)
    
      +--------------------------------+
      | mpg   Freq.   Percent      %<= |
      |--------------------------------|
      |  30       2      2.70    93.24 |
      |  31       1      1.35    94.59 |
      |  34       1      1.35    95.95 |
      |  35       2      2.70    98.65 |
      |  41       1      1.35   100.00 |
      +--------------------------------+
    So these commands give you pictures of the tails of a distribution. (For single variables,
    extremes on SSC offers another way to do something similar.)

    You can specify order(high) or order(low). In either case, that option specifies listing
    in order of the frequencies, not the values of the variables in each group. In the first
    case, select(5) gives you the 5 groups which are most frequent. That gives you a stab at
    showing especially common values, otherwise often called modes.

    Code:
    . groups mpg, select(5) order(h)
    
      +-------------------------------+
      | mpg   Freq.   Percent     %<= |
      |-------------------------------|
      |  18       9     12.16   12.16 |
      |  19       8     10.81   22.97 |
      |  14       6      8.11   31.08 |
      |  21       5      6.76   37.84 |
      |  22       5      6.76   44.59 |
      +-------------------------------+
    If you specify fillin with two or more variables, cross-combinations with zero
    frequencies are shown explicitly. These are in cells that would be shown by 0s by
    tabulate or by blanks by table. select()ing zeroes gives you a listing of the cells not
    present in your dataset. As such values do not exist in the dataset, we do not need to be
    told that their frequencies are all zero, so show(none) can be specified.

    Code:
    . groups foreign rep78, fillin select(f == 0) show(none)
    
      +-----------------+
      | foreign   rep78 |
      |-----------------|
      | Foreign       1 |
      | Foreign       2 |
      +-----------------+
    groups is just sitting on the shoulders of the giant list, so there are several ways to
    tweak appearances. Here is one:

    Code:
    . groups foreign rep78, sepby(foreign)
    
      +------------------------------------+
      |  foreign   rep78   Freq.   Percent |
      |------------------------------------|
      | Domestic       1       2      2.90 |
      | Domestic       2       8     11.59 |
      | Domestic       3      27     39.13 |
      | Domestic       4       9     13.04 |
      | Domestic       5       2      2.90 |
      |------------------------------------|
      |  Foreign       3       3      4.35 |
      |  Foreign       4       9     13.04 |
      |  Foreign       5       9     13.04 |
      +------------------------------------+
    We did get the same appearance earlier, but that was just fortuitous. The default of
    list separating every 5 lines happened to give a sensible answer.

    groups has further unusual options for table commands. colorder() reorders columns in the
    table from what it would have been. Typically this is an option for the second or later
    pass at tabulation. Suppose we want to highlight which combinations of categories are
    most common.

    To impart a little variety, we will look at a different dataset.

    Code:
    . webuse nlswork
    (National Longitudinal Survey.  Young Women 14-26 years of age in 1968)
    
    . groups collgrad not_smsa c_city south, order(high) sep(0)
    
      +--------------------------------------------------------+
      | collgrad   not_smsa   c_city   south   Freq.   Percent |
      |--------------------------------------------------------|
      |        0          0        0       0    5742     20.13 |
      |        0          0        1       0    4941     17.32 |
      |        0          1        0       1    3982     13.96 |
      |        0          0        1       1    3455     12.11 |
      |        0          1        0       0    3086     10.82 |
      |        0          0        0       1    2527      8.86 |
      |        1          0        0       0    1412      4.95 |
      |        1          0        1       0    1096      3.84 |
      |        1          0        1       1     698      2.45 |
      |        1          0        0       1     598      2.10 |
      |        1          1        0       0     566      1.98 |
      |        1          1        0       1     423      1.48 |
      +--------------------------------------------------------+
    We might want the frequencies and percents to be more prominent. Say that existing columns
    5 and 6 should be moved to the left. It is sufficient to say colorder(5 6). Specifying 5
    6 is equivalent to specifying 5 6 1 2 3 4.

    Code:
    . groups collgrad not_smsa c_city south, order(high) sep(0) colorder(5 6)
    
      +--------------------------------------------------------+
      | Freq.   Percent   collgrad   not_smsa   c_city   south |
      |--------------------------------------------------------|
      |  5742     20.13          0          0        0       0 |
      |  4941     17.32          0          0        1       0 |
      |  3982     13.96          0          1        0       1 |
      |  3455     12.11          0          0        1       1 |
      |  3086     10.82          0          1        0       0 |
      |  2527      8.86          0          0        0       1 |
      |  1412      4.95          1          0        0       0 |
      |  1096      3.84          1          0        1       0 |
      |   698      2.45          1          0        1       1 |
      |   598      2.10          1          0        0       1 |
      |   566      1.98          1          1        0       0 |
      |   423      1.48          1          1        0       1 |
      +--------------------------------------------------------+
    You may have noticed the list option sep(0) being used just now to switch off separator
    lines.

    Another option that is unusual is saving(), which saves the data being tabulated to a new
    dataset. In essence (and minor details aside), they are data, at least temporarily, as
    otherwise list would not be able to show them. That should help in tabulation in forms
    not supported by groups, for graphics or other analyses, or in export to other software.

    There's more, but this post is already long, so I'll just point to the help. A longer version of this
    will appear in the Stata Journal.



  • #2
    The Stata Journal version appeared in

    SJ-17-3 st0496 . . . . . Speaking Stata: Tables as lists: The groups command
    (help groups if installed) . . . . . . . . . . . . . . . . N. J. Cox
    Q3/17 SJ 17(3):760--773
    presents command for listing group frequencies and percents and
    cumulations thereof; for various subsetting and ordering by
    frequencies, percents, and so on; for reordering of columns;
    and for saving tabulated data to new datasets

    Comment


    • #3
      Bug fix announcement:

      Thanks as always to Kit Baum, a fixed version is now up on SSC.

      Code:
      ssc type groups.ado
      *! NJC 1.4.1 20 February 2018
      * NJC 1.4.0 30 June 2017
      The revision will also be announced via Stata Journal 18(1), the next issue.

      This arose only when weights were specified. Users would have seen an error message, so no one would have got incorrect results.

      Thanks to Fahim Ahmad Yousufzai for a bug report.

      Comment

      Working...
      X