Bug in collapse (percent) command?

Aidan Wang

Join Date: Apr 2021

Posts: 2
#1

20 Apr 2021, 19:36

The collapse (percent) command is supposed to yield the percentage of nonmissing observations. However, it doesn't seem to:

Code:

sysuse auto
count if missing(rep78)
count
gen constant = 1
collapse (percent) rep78, by(constant)
list rep78

Is this a bug, or am I misunderstanding the documentation?
Joro Kolev

Join Date: Aug 2018

Posts: 3050
#2

20 Apr 2021, 20:52

I think this is a bug.

And the section of the code where the bug occurs reads

Code:

quietly count if `x'<.
scalar `cnt' = r(N)
`by' gen double `y' = sum(`x'<.)
}
`by' replace `y' = (`y'[_N] / `cnt')*100

Both the numerator and the denominator of this expression count the number of non-missings, and hence the result is identically 1, and when you multiply by 100, identically 100.
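A quick way to check this without reading the ado-file is to mimic the excerpt by hand with no by-group. This is only a sketch: the plain names cnt and y stand in for the temporary names in the excerpt, and rep78 from the auto data stands in for `x'; both substitutions are my assumptions.

```stata
sysuse auto, clear

* Denominator: number of nonmissing rep78 values in the whole dataset
quietly count if rep78 < .
scalar cnt = r(N)

* Numerator: running count of nonmissing values; the last row holds the total
gen double y = sum(rep78 < .)

* Without a by-group, numerator and denominator are the same count
replace y = (y[_N] / cnt) * 100
display y[_N]
```

With no by-group the displayed value should be 100, since the same count appears on both sides of the division.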

Write to Stata Corp to let them know they have an error in their code.
Joro Kolev

Join Date: Aug 2018

Posts: 3050
#3

20 Apr 2021, 21:02

What I said is correct if you are not by-ing your -collapse-.

If you are by-ing your -collapse-, then the code above calculates (non missings for group i)/(total number of non-missings for the whole sample).

So another possibility is that they just did not explain in the help file what they are calculating with this percent statistic.
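The by-group calculation described above can be compared by hand with the "percent nonmissing within the group" reading. The following is a minimal sketch, assuming the auto data grouped by foreign; all new variable names (nonmiss, n_group, n_all, n_obs, share_of_all, pct_within) are hypothetical.

```stata
sysuse auto, clear

* Indicator: 1 if rep78 is nonmissing, 0 otherwise
gen byte nonmiss = !missing(rep78)

* Nonmissing count within each group and over the whole dataset
bysort foreign: egen n_group = total(nonmiss)
egen n_all = total(nonmiss)

* Group size, counting missing and nonmissing rows alike
bysort foreign: gen n_obs = _N

* Reading 1: the group's share of all nonmissing values (what the code excerpt computes)
gen double share_of_all = 100 * n_group / n_all

* Reading 2: percentage of the group's own observations that are nonmissing
gen double pct_within = 100 * n_group / n_obs

tabstat share_of_all pct_within, by(foreign) statistics(mean)
```

The two columns differ whenever the groups differ in size, which is why the two readings of "percentage of nonmissing observations" cannot both be right.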
Aidan Wang

Join Date: Apr 2021

Posts: 2
#4

21 Apr 2021, 19:33

Thank you for looking into this. I think you are right in your second post: the collapse (percent) command computes the percentage of non-missing observations that are in group i out of all non-missing observations in the dataset. It is not the percentage of observations in group i that are non-missing.
Joro Kolev

Join Date: Aug 2018

Posts: 3050
#5

22 Apr 2021, 00:20

It might be. But this interpretation becomes apparent only when we look at their code. The help file describes percent as "percentage of nonmissing observations," which is ambiguous: a percentage of what?

Originally posted by Aidan Wang:

Thank you for looking into this. I think you are right in your second post: the collapse (percent) command computes the percentage of non-missing observations that are in group i out of all non-missing observations in the dataset. It is not the percentage of observations in group i that are non-missing.
Mead Over

Join Date: Sep 2014

Posts: 110
#6

02 Jun 2021, 13:14

Aidan Wang and Joro Kolev: I think this is a bug. Today, before searching here on Statalist, I was puzzling over the results of -(percent)- after a -collapse- of several million individual observations into around 100,000 groups. The (percent) statistic is giving me percentages smaller than 0.001, when I expected that, within a group, the percentage non-missing would be 75% to 100%.

Now I find that you two have already figured this out. Perhaps it's just piling on, but the attached code and results show clearly what you two discovered.

Code:

. version
version 16.1

. sysuse auto, clear
(1978 Automobile Data)

. * Collapse to two observations, one for foreign and one for domestic
. collapse (count) N_price=price N_rep78=rep78 (percent) P_price=price P_rep78=rep78 , by(foreign)

.
. * Manually construct the percentage of times -rep78- is missing in each of the two groups
. * by dividing the number of nonmissing -rep78- by the number of observations Nprice:
. gen Pct_rep78 = 100*N_rep78/N_price

. list

     +-------------------------------------------------------------+
     |  foreign   N_price   N_rep78   P_price   P_rep78   Pct_r~78 |
     |-------------------------------------------------------------|
  1. | Domestic        52        48   70.2703   69.5652   92.30769 |
  2. |  Foreign        22        21   29.7297   30.4348   95.45454 |
     +-------------------------------------------------------------+

.
. * We expect that the -P_rep78- variable computed by -collapse-
. * would be identical to the Pct_rep78 variable computed directly.
. * But instead -P_rep78- gives the number of non-missing observations in each group
. * as a percentage of all the non-missing observations of that variable
. di 100*21/(48 + 21)
30.434783

So, as the two of you found, -(percent)- gives each group's count of non-missing observations as a percentage of the count of all non-missing observations of that variable across all groups.

Knowing that a large percent of all of a variable's missing observations are located in a subset of the groups might be useful for some purposes. For example, one could choose to rank the groups by this percentage and exclude the groups that contain most of the missing observations. (It would be good to have a selection model, rather than just omitting the groups without explanation.) So perhaps Stata should retain a statistic like this as an option.

However, as it currently functions, the -(percent)- stat is unique among all the -(stat)- options for -collapse-. That seems to me to be undesirable.

If -collapse- is going to have statistics that refer to the entire data set rather than only the individual groups, one might want to propose additional such statistics. For example, -collapse- could compute, for each group, the difference between that group's -mean- and the overall -mean-. For -mean- one could substitute any other statistic. Or -collapse- could compute the percentile that each group's mean represents within the distribution of means across all groups. In the interest of parsimony and also -collapse-'s efficiency, I think adding these kinds of global statistics to -collapse- is not a good idea. If such statistics seem desirable, they should probably be produced by a separate command, or by -egen- commands.
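As a rough illustration of the -egen- route for one such global statistic, here is a minimal sketch of each group's mean minus the overall mean. It assumes the auto data grouped by foreign; the variable names mean_grp, mean_all, and diff_mean are hypothetical.

```stata
sysuse auto, clear

* Group mean and overall mean of price, stored as doubles
bysort foreign: egen double mean_grp = mean(price)
egen double mean_all = mean(price)

* Difference between each group's mean and the overall mean
gen double diff_mean = mean_grp - mean_all

* Reduce to one row per group; the values are constant within a group
collapse (first) mean_grp mean_all diff_mean, by(foreign)
list
```

The overall mean has to be computed before the -collapse-, because afterwards only one row per group remains.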

I would like to see a -(stat)- choice on -collapse- that behaves as Aidan and I thought the current -(percent)- should behave. But as in the above example, it's easy enough to compute the desired percentages based on -collapse- results.
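Another way to get the within-group percentage in a single -collapse- is to take the mean of a 0/1 nonmissing indicator. This is only a sketch, assuming the auto data grouped by foreign; the names nonmiss_rep78 and pct_rep78 are hypothetical.

```stata
sysuse auto, clear

* Indicator: 1 if rep78 is nonmissing, 0 otherwise
gen byte nonmiss_rep78 = !missing(rep78)

* The group mean of the indicator is the fraction nonmissing within the group
collapse (mean) pct_rep78 = nonmiss_rep78, by(foreign)

* Convert the fraction to a percentage
replace pct_rep78 = 100 * pct_rep78
list
```

The result should agree with the Pct_rep78 column in the listing above, because the mean of the indicator is the nonmissing count divided by the group size.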

Also, FYI: as of today, the version 16 PDF documentation does not mention the (percent) statistic, so checking there for how it works doesn't help.