  • Bug in collapse (percent) command?

    The collapse (percent) command is supposed to yield the percentage of nonmissing observations. However, it doesn't seem to:
    Code:
    sysuse auto
    count if missing(rep78)
    count
    gen constant = 1
    collapse (percent) rep78, by(constant)
    list rep78
    Is this a bug, or am I misunderstanding the documentation?

  • #2
    I think this is a bug.

    And the section of the code where the bug occurs reads

    Code:
    quietly count if `x'<.
                            scalar `cnt' = r(N)
                            `by' gen double `y' = sum(`x'<.)
                    }
                    `by' replace `y' = (`y'[_N] / `cnt')*100
    Both the numerator and the denominator of this expression count the number of non-missings, and hence the result is identically 1, and when you multiply by 100, identically 100.
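
    A minimal sketch of that arithmetic without any by-groups, assuming the auto data from the original post. The names cnt and y are hypothetical stand-ins for the temporary names in the excerpt, and this is not the full -collapse- source:

    ```stata
    sysuse auto, clear
    * denominator: number of nonmissing rep78 in the whole dataset
    quietly count if rep78 < .
    scalar cnt = r(N)
    * numerator: running count of nonmissing rep78; its last value is the total
    gen double y = sum(rep78 < .)
    replace y = (y[_N] / scalar(cnt)) * 100
    * numerator and denominator are the same count, so this displays 100
    display y[1]
    ```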

    Write to Stata Corp to let them know they have an error in their code.



    • #3
      What I said is correct if you are not by-ing your -collapse-.

      If you are by-ing your -collapse-, then the code above calculates (non missings for group i)/(total number of non-missings for the whole sample).

      So another possibility is that they just did not explain in the help file what they are calculating with this percent statistic.
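
      The by-group calculation described above can be sketched as follows, assuming the auto data and grouping on foreign. The names cnt and y are hypothetical stand-ins for the temporary names in the excerpt:

      ```stata
      sysuse auto, clear
      * denominator: nonmissing rep78 across the whole sample, not per group
      quietly count if rep78 < .
      scalar cnt = r(N)
      * numerator: nonmissing rep78 within each group
      bysort foreign: gen double y = sum(rep78 < .)
      by foreign: replace y = (y[_N] / scalar(cnt)) * 100
      * one value per group; the group values add up to 100
      tabstat y, by(foreign) statistics(mean)
      ```

      Because the denominator is computed once for the whole sample, each group's value is its share of all nonmissing observations rather than a within-group rate.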



      • #4
        Thank you for looking into this. I think you are right in your second post: the collapse (percent) command computes the percentage of non-missing observations that are in group i out of all non-missing observations in the dataset. It is not the percentage of observations in group i that are non-missing.



        • #5
          It might be. But this interpretation becomes apparent only when we look at their code. In the help file, the entry reads "percent  percentage of nonmissing observations", which is ambiguous: a percentage of what?

          Originally posted by Aidan Wang:
          Thank you for looking into this. I think you are right in your second post: the collapse (percent) command computes the percentage of non-missing observations that are in group i out of all non-missing observations in the dataset. It is not the percentage of observations in group i that are non-missing.



          • #6
            Aidan Wang and Joro Kolev, I think this is a bug. Today, before searching here on Statalist, I was puzzling over the results of -(percent)- after a -collapse- of several million individual observations into around 100,000 groups. The (percent) statistic is giving me percentages smaller than 0.001, when I expected that, within a group, the percentage non-missing would be 75% to 100%.

            Now I find that you two have already figured this out. Perhaps it's just piling on, but the attached code and results show clearly what you two discovered.


            Code:
            . version
            version 16.1
            
            . sysuse auto, clear
            (1978 Automobile Data)
            
            . *       Collapse to two observations, one for foreign and one for domestic
            . collapse (count) N_price=price N_rep78=rep78 (percent) P_price=price P_rep78=rep78 , by(foreign)
            
            . 
            . *       Manually construct the percentage of times -rep78- is nonmissing in each of the two groups
            . *       by dividing the number of nonmissing -rep78- by the number of observations N_price:
            . gen Pct_rep78 = 100*N_rep78/N_price
            
            . list
            
                 +-------------------------------------------------------------+
                 |  foreign   N_price   N_rep78   P_price   P_rep78   Pct_r~78 |
                 |-------------------------------------------------------------|
              1. | Domestic        52        48   70.2703   69.5652   92.30769 |
              2. |  Foreign        22        21   29.7297   30.4348   95.45454 |
                 +-------------------------------------------------------------+
            
            . 
            . *       We expect that the -P_rep78- variable computed by -collapse- 
            . *       would be identical to the Pct_rep78 variable computed directly.
            . *       But instead -P_rep78- gives the number of non-missing observations in each group  
            . *       as a percentage of all the non-missing observations of that variable
            . di 100*21/(48 + 21)
            30.434783
            So, as the two of you found, -(percent)- gives each group's count of non-missing observations as a percentage of the count of all non-missing observations for that variable across all groups.

            Knowing that a large percent of all of a variable's missing observations are located in a subset of the groups might be useful for some purposes. For example, one could choose to rank the groups by this percentage and exclude the groups that contain most of the missing observations. (It would be good to have a selection model, rather than just omitting the groups without explanation.) So perhaps Stata should retain a statistic like this as an option.

            However, as it currently functions, the -(percent)- stat is unique among all the -(stat)- options for -collapse-. That seems to me to be undesirable.

            If -collapse- is going to have statistics that refer to the entire data set rather than only the individual groups, one might want to propose additional such statistics. For example, -collapse- could compute, for each group, the difference between that group's -mean- and the overall -mean-. For -mean- one could substitute any other statistic. Or -collapse- could compute the percentile that each group's mean represents within the distribution of means across all groups. In the interest of parsimony and also -collapse-'s efficiency, I think adding these kinds of global statistics to -collapse- is not a good idea. If such statistics seem desirable, they should probably be produced by a separate command, or by -egen- commands.
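
            As a rough sketch of the -egen- route, assuming the auto data and price as the variable of interest, the difference between each group's mean and the overall mean could be built before collapsing. The names grp_mean, all_mean, and diff_mean are hypothetical:

            ```stata
            sysuse auto, clear
            * group-level mean of price, repeated on every observation in the group
            bysort foreign: egen double grp_mean = mean(price)
            * overall mean of price across all observations
            egen double all_mean = mean(price)
            gen double diff_mean = grp_mean - all_mean
            * all three are constant within group, so (first) keeps them unchanged
            collapse (first) grp_mean all_mean diff_mean, by(foreign)
            list
            ```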

            I would like to see a -(stat)- choice for -collapse- that behaves as Aidan and I thought the current -(percent)- should behave. But as in the above example, it's easy enough to compute the desired percentages based on -collapse- results.
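
            Another way to get the within-group percentage, sketched here under the assumption that an extra indicator variable is acceptable, is to take the group mean of a 0/1 nonmissing flag. The name nm_rep78 is hypothetical:

            ```stata
            sysuse auto, clear
            * 1 if rep78 is nonmissing, 0 otherwise
            gen byte nm_rep78 = !missing(rep78)
            * the group mean of a 0/1 flag is the within-group share of nonmissing values
            collapse (mean) Pct_rep78 = nm_rep78, by(foreign)
            replace Pct_rep78 = 100 * Pct_rep78
            list
            ```

            This should reproduce the Pct_rep78 values computed manually in the log above (about 92.3 for Domestic and 95.5 for Foreign).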

            Also, FYI: as of today, the version 16 PDF documentation does not mention the (percent) statistic, so checking there for how it works doesn't help.
