  • Collapsing data by both weighted and unweighted means

    I have a dataset of ~100K, and I'd like to collapse the data for variable Alpha into its mean, and variable Beta into a mean weighted by variable Weight. As far as I can tell, all means have to be entirely unweighted or entirely weighted by Weight. Is there any way to do both with one command?

    My instinct is to collapse it unweighted, reload the data, and collapse it weighted, and then merge the two. That seems a little bulky, though. Thanks.

  • #2
    As far as -collapse- is concerned, you will have to do what you are describing. Collapse it once weighted, collapse it once unweighted, and then merge the two results.


    • #3
      Thanks. That's what I'll do.


      • #4
        Hello Matt Mulligan and Joro Kolev,

        I am in the same situation, and was looking for an answer precisely to avoid the solution you settled on. I wanted a single-pass collapse to deliver both weighted and unweighted results.
        I think my code is working, but if you have looked at this task before and can point out where it may fail, please shout!

        So far the code has passed everything I could throw at it (assuming, of course, that the weights are never missing).
        The highlighted line (-generate y=x*w-) makes it all feasible at the cost of an extra variable, but I can live with that.

        Thank you, Sergiy Radyakin

        Code:
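        version 16.0
        clear all

        * toy data: x is the value, w is its weight
        input x w
        1 1
        2 2
        3 2
        7 1
        999 321
        end

        * reference results from -summarize-
        summarize x
        local ms=r(mean)
        summarize x [aw=w]
        local mw=r(mean)

        * extra variable: its sum is the numerator of the weighted mean
        generate y=x*w

        * single pass: unweighted mean of x, plus the sums of y and w
        collapse (mean) x (sum) y w
        display x
        display y/w

        * both results must agree with -summarize-
        local epsilon=0.0000001
        assert reldif(x,`ms')   < `epsilon'
        assert reldif(y/w,`mw') < `epsilon'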
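
        The same idea can be sketched for the by-group case. This is only an illustration, not tested beyond the toy data: the names g, wx, w_obs, mean_x and wmean_x are made up here, and it assumes that weights of observations with missing x should be left out of the denominator, as -summarize- with aweights does.

        ```stata
        clear
        input g x w
        1 1 1
        1 2 2
        1 3 2
        2 7 1
        2 999 321
        end

        * numerator term; missing whenever x is missing
        generate double wx = w * x
        * weight counted only where x is observed (assumption, see above)
        generate double w_obs = w if !missing(x)

        * one pass: unweighted mean plus the two sums, by group
        collapse (mean) mean_x = x (sum) wx w_obs, by(g)
        generate double wmean_x = wx / w_obs
        list
        ```

        With no missing x this reduces to the code above; the extra w_obs variable only matters when some x are missing.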


        • #5
          I don't see any difficulty with your solution. If the sum of w is very close to zero, then there could be instability in the calculation of y/w, but I don't know that -collapse- with weights would necessarily handle that any better. Similarly, if you had some large weights and then a number of much, much smaller weights, adding up those weights could give a false sum if all the digits of the small weights got shifted into oblivion during the addition. But weight variables are rarely scaled that way. And it's likely that Stata's algorithms for calculating sums and means are aware of these potential problems and are robust to them.

          I tried your code with a few examples that might stress the code in that way, but none of them broke it. I think you can say that for all but the most seriously pathological situations your code is robust, and maybe even robust to them.
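
          A minimal illustration of digits being "shifted into oblivion", using plain double-precision arithmetic rather than -collapse- itself. The value 1e16 is chosen only for illustration; neighbouring doubles at that magnitude are 2 apart, so an addend of 1 can be lost.

          ```stata
          * in exact arithmetic the answer would be 1
          display %20.0f (1e16 + 1) - 1e16

          * the same addend survives next to a smaller number
          display %20.0f (1e15 + 1) - 1e15
          ```

          Whether -collapse- or -summarize- ever hit this depends on how they accumulate sums internally, which this sketch does not test.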
          Last edited by Clyde Schechter; 27 Jan 2023, 16:19.


          • #6
            Hi Sergiy,

            I do not see anything wrong with your solution. The weighted arithmetic mean is defined as $\sum_i \frac{w_i}{\sum_j w_j} x_i$, and this is what you are computing. I would just have implemented your idea in the reverse order, like this:

            Code:
            summ x, meanonly
            local MeanX = r(mean)
            summ x [aw = w], meanonly
            local WmeanX = r(mean)

            * Difference starts here
            summ w, meanonly
            gen y = r(N)*w*x/r(sum)
            collapse (mean) x y

            local epsilon=0.0000001
            assert reldif(x,`MeanX')   < `epsilon'
            assert reldif(y,`WmeanX')   < `epsilon'


            • #7
              And what I did would fail if there are missing x while w is not missing, which is not a defect of your code. My revised code would be

              Code:
              summ w if !missing(x), meanonly
              gen y = r(N)*w*x/r(sum)
              collapse (mean) x y

              Overall, your solution is better if you are willing to think: about the formula for the weighted mean, about what to do with the missings... Then you produce more efficient code.

              The initial solution with the two collapses, one unweighted and one weighted, and then merging, is more verbose but requires much less thinking.




              • #8
                Pushing the result into a local macro and pulling it out again is sometimes a needless complication.

                More importantly here, it will lose you some precision in some instances, so using a scalar is preferable. The point is that a local macro is really a string that here just happens to hold numeric characters and there is often a little loss of detail that could be troubling. That said, it is hard to find examples where this really bites.

                Also, use doubles to maximize precision.
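
                A sketch of the check in #4 with scalars and doubles instead of local macros. The scalar names ms and mw are just illustrative, and the tolerance is arbitrary.

                ```stata
                clear
                input x w
                1 1
                2 2
                3 2
                7 1
                999 321
                end

                * keep the reference means as scalars, which hold full double precision
                summarize x, meanonly
                scalar ms = r(mean)
                summarize x [aw = w], meanonly
                scalar mw = r(mean)

                * accumulate the weighted numerator in double precision
                generate double y = x * w
                collapse (mean) x (sum) y w

                * scalar() makes sure the scalar is meant, not a variable of the same name
                assert reldif(x, scalar(ms)) < 1e-10
                assert reldif(y/w, scalar(mw)) < 1e-10
                ```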


                • #9
                  Originally posted by Nick Cox
                  Pushing the result into a local macro and pulling it out again is sometimes a needless complication.

                  More importantly here, it will lose you some precision in some instances, so using a scalar is preferable. The point is that a local macro is really a string that here just happens to hold numeric characters and there is often a little loss of detail that could be troubling. That said, it is hard to find examples where this really bites.

                  Also, use doubles to maximize precision.
                  This advice on scalars vs locals is of course very relevant; I personally think that scalars are much underused in Stata.

                  Also doing the calculation in double precision is always a good idea.

                  But here the locals were irrelevant to the calculation at hand. We used them only to verify that what we had done agrees with the result that -summarize-, weighted or unweighted, gives.


                  • #10
                    #9 is puzzling.

                    In #4 and #6 results are being compared using local macros. As said, that is unlikely to bite, but the fact remains that comparison of scalar results -- or comparison of variables using the same storage types -- would be the most direct test of whether methods produce the same, or practically the same, answers.


                    • #11
                      I can't believe we have arrived at Stata 18 without a collapse (rawmean) xxx, by(). 🥲
                      DR


                      • #12
                        While that would be a nice convenience, we do have -(rawsum)- in -collapse-, which together with -(count)- enables calculation of the raw mean afterward.
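
                        A rough sketch of that route, on toy data. The variable names are invented for illustration, and it relies on my reading of the -collapse- documentation that with aweights -(count)- is the number of nonmissing observations and that -(rawsum)- ignores the weights (apart from dropping observations with zero weight), so treat it as an assumption to check against your own data.

                        ```stata
                        clear
                        input g x w
                        1 1 1
                        1 2 2
                        1 3 2
                        2 7 1
                        2 999 321
                        end

                        * weighted mean, unweighted sum and count of x in a single -collapse-
                        collapse (mean) wmean_x = x (rawsum) rawsum_x = x (count) n_x = x [aw = w], by(g)

                        * raw (unweighted) mean computed afterward
                        generate double rawmean_x = rawsum_x / n_x
                        list
                        ```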
