what is the base for -collapse- percent statistic with by()?

Daniel Feenberg

Join Date: Oct 2014

Posts: 334
#1

23 Aug 2022, 16:07

The -collapse- command (https://www.stata.com/manuals/dcollapse.pdf) has a -percent- statistic that, according to the documentation, shows the "percentage of nonmissing observations".

Here is an example with a by() option:

Code:

. input a b

             a          b
  1. 1 1
  2. 1 1
  3. 1 2
  4. . 2
  5. end

. collapse (percent) a, by(b)

. list

     +--------------+
     | b          a |
     |--------------|
  1. | 1   66.66667 |
  2. | 2   33.33333 |
     +--------------+

I am surprised by this result. I expected the collapsed values of a to be 100 and 50. Are not all the values of a non-missing for b==1 and half the values non-missing for b==2? Where does the 3 in the denominator of the -collapse- calculation come from? Is there a way to get the result I was expecting?

Nick Cox

Join Date: Mar 2014
Posts: 36054

23 Aug 2022, 16:34

I guess non-missing observations refers, as said, to observations with non-missing values on all variables mentioned. That is the only way, I think, to make sense of the result. An observation for Stata, you will recall, is in other terminology an entire case, record or row in the dataset. An observation is not here, or anywhere else in Stata, an individual value of a variable.

Compare these results using groups from the Stata Journal, although there are plenty of other ways to do it.

Code:

. sysuse auto, clear
(1978 automobile data)

. groups foreign rep78, missing

  +------------------------------------+
  |  foreign   rep78   Freq.   Percent |
  |------------------------------------|
  | Domestic       1       2      2.70 |
  | Domestic       2       8     10.81 |
  | Domestic       3      27     36.49 |
  | Domestic       4       9     12.16 |
  | Domestic       5       2      2.70 |
  |------------------------------------|
  | Domestic       .       4      5.41 |
  |  Foreign       3       3      4.05 |
  |  Foreign       4       9     12.16 |
  |  Foreign       5       9     12.16 |
  |  Foreign       .       1      1.35 |
  +------------------------------------+

. groups foreign rep78

  +------------------------------------+
  |  foreign   rep78   Freq.   Percent |
  |------------------------------------|
  | Domestic       1       2      2.90 |
  | Domestic       2       8     11.59 |
  | Domestic       3      27     39.13 |
  | Domestic       4       9     13.04 |
  | Domestic       5       2      2.90 |
  |------------------------------------|
  |  Foreign       3       3      4.35 |
  |  Foreign       4       9     13.04 |
  |  Foreign       5       9     13.04 |
  +------------------------------------+

. collapse (percent) rep78, by(foreign)

. l

     +--------------------+
     |  foreign     rep78 |
     |--------------------|
  1. | Domestic   69.5652 |
  2. |  Foreign   30.4348 |
     +--------------------+

I can't recall ever wanting this summary, even though I resort to collapse quite often.

Otherwise put, the last percent is based on 21/69 (ignoring the 5 observations with missing values), not on 21/74.
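The arithmetic behind those figures can be sketched by hand. This is only an illustration of the denominator implied by the output above, not a statement of how -collapse- is implemented internally; the names n, n_all and pct are made up for the example.

```stata
sysuse auto, clear

* Count the nonmissing values of rep78 within each group
collapse (count) n = rep78, by(foreign)

* Total nonmissing count across all groups (the shared denominator)
egen n_all = total(n)

* Each group's share of all nonmissing values, in percent
generate pct = 100 * n / n_all

list
```

With the auto data this gives 48/69 for Domestic and 21/69 for Foreign, matching the 69.5652 and 30.4348 reported by (percent).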

Last edited by Nick Cox; 23 Aug 2022, 16:37.


Daniel Feenberg

Join Date: Oct 2014

Posts: 334
#3

23 Aug 2022, 17:38

Eventually I realized that

Code:

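replace a = 100*(a<.)
collapse (mean) a, by(b)

gives the expected result. The parentheses matter: without them Stata evaluates 100*a<. as (100*a)<., which yields 0 or 1 rather than 0 or 100. As it is, -percent- is the only statistic in -collapse- that is normalized by the full dataset (here, by the count of nonmissing values across all groups); all the others are normalized by the by-group alone.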
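A non-destructive variant of the same idea, sketched on the toy data from #1, leaves a untouched and averages a separate indicator instead. The variable names nonmiss and pct_nonmiss are illustrative only and do not come from the thread.

```stata
clear
input a b
1 1
1 1
1 2
. 2
end

* 100 where a is nonmissing, 0 where it is missing
generate byte nonmiss = 100 * !missing(a)

* The mean of the indicator within each group is the percent nonmissing
collapse (mean) pct_nonmiss = nonmiss, by(b)

list
```

This gives 100 for b==1 and 50 for b==2, the within-group percentages asked about in #1.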