
  • what is the base for -collapse- percent statistic with by()?

    The -collapse- command (https://www.stata.com/manuals/dcollapse.pdf) has a -percent- statistic that, according to the documentation, shows the percentage of nonmissing observations.
    Here is an example with a by() option:

    Code:
    . input a b
    
                 a          b
      1. 1 1
      2. 1 1
      3. 1 2
      4. . 2
      5. end
    
    . collapse (percent) a,by(b)
    
    . list
    
         +--------------+
         | b          a |
         |--------------|
      1. | 1   66.66667 |
      2. | 2   33.33333 |
         +--------------+
    I am surprised by this result. I expected the collapsed values of a to be 100 and 50. Are not all the values of a non-missing for b==1 and half the values non-missing for b==2? Where does the 3 in the denominator of the -collapse- calculation come from? Is there a way to get the result I was expecting?

  • #2
    I guess non-missing observations refers, as said, to observations with non-missing values on all variables mentioned. That is the only way, I think, to make sense of the result. An observation for Stata, you will recall, is in other terminology an entire case, record or row in the dataset. An observation is not here, or anywhere else in Stata, an individual value of a variable.

    Compare these results using groups from the Stata Journal, although there are plenty of other ways to do it.


    Code:
    . sysuse auto, clear
    (1978 automobile data)
    
    . groups foreign rep78, missing
    
      +------------------------------------+
      |  foreign   rep78   Freq.   Percent |
      |------------------------------------|
      | Domestic       1       2      2.70 |
      | Domestic       2       8     10.81 |
      | Domestic       3      27     36.49 |
      | Domestic       4       9     12.16 |
      | Domestic       5       2      2.70 |
      |------------------------------------|
      | Domestic       .       4      5.41 |
      |  Foreign       3       3      4.05 |
      |  Foreign       4       9     12.16 |
      |  Foreign       5       9     12.16 |
      |  Foreign       .       1      1.35 |
      +------------------------------------+
    
    . groups foreign rep78
    
      +------------------------------------+
      |  foreign   rep78   Freq.   Percent |
      |------------------------------------|
      | Domestic       1       2      2.90 |
      | Domestic       2       8     11.59 |
      | Domestic       3      27     39.13 |
      | Domestic       4       9     13.04 |
      | Domestic       5       2      2.90 |
      |------------------------------------|
      |  Foreign       3       3      4.35 |
      |  Foreign       4       9     13.04 |
      |  Foreign       5       9     13.04 |
      +------------------------------------+
    
    . collapse (percent) rep78, by(foreign)
    
    . l
    
         +--------------------+
         |  foreign     rep78 |
         |--------------------|
      1. | Domestic   69.5652 |
      2. |  Foreign   30.4348 |
         +--------------------+
    I can't recall ever wanting this summary, even though I resort to collapse quite often.

    Otherwise put, the last percent is based on 21/69 (ignoring the 5 observations with missing values), not on 21/74.
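As a check on that arithmetic, here is a minimal sketch that recomputes the Foreign figure by hand from a fresh copy of the auto data. The local macro names n_all and n_foreign are made up for illustration, and this only mimics what -collapse- appears to do; it is not taken from its source code.

```stata
sysuse auto, clear

* denominator -collapse- appears to use: nonmissing rep78 over the whole dataset
quietly count if !missing(rep78)
local n_all = r(N)

* numerator: nonmissing rep78 among foreign cars only
quietly count if foreign == 1 & !missing(rep78)
local n_foreign = r(N)

* percent of all nonmissing observations that fall in the Foreign group
display 100 * `n_foreign' / `n_all'
```

The displayed value should agree with the 30.4348 listed for Foreign above, since the two counts are 21 and 69.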
    Last edited by Nick Cox; 23 Aug 2022, 16:37.



    • #3
      Eventually I realized that

      Code:
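      * parentheses are needed: * binds more tightly than <, so 100*a<. would be (100*a)<.
      replace a = 100*(a<.)
      * the by-group mean of a 0/100 variable is the percent nonmissing in that group
      collapse (mean) a, by(b)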
      gives the expected result. As it is, -percent- is the only statistic in -collapse- that is normalized by the full dataset; all the others are normalized by the by-group alone.
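A variant of the same idea that leaves a untouched, in case the original values are still needed afterwards. This is only a sketch on the toy data from #1; the variable names a_ok and pct_ok are made up for illustration.

```stata
clear
input a b
1 1
1 1
1 2
. 2
end

* indicator: 1 if a is nonmissing, 0 otherwise
generate byte a_ok = !missing(a)

* the by-group mean of a 0/1 indicator is the by-group share of nonmissing values
collapse (mean) pct_ok = a_ok, by(b)

* rescale the share to a percent
replace pct_ok = 100 * pct_ok
list
```

On these four observations this should list 100 for b==1 and 50 for b==2, the values expected in #1.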
