Collapsing data by both weighted and unweighted means

Matt Mulligan

Join Date: May 2019

Posts: 9
#1

10 Oct 2020, 21:02

I have a dataset of ~100K, and I'd like to collapse the data for variable Alpha into its mean, and variable Beta into a mean weighted by variable Weight. As far as I can tell, all means have to be entirely unweighted or entirely weighted by Weight. Is there any way to do both with one command?

My instinct is to collapse it unweighted, reload the data, and collapse it weighted, and then merge the two. That seems a little bulky, though. Thanks.
Joro Kolev

Join Date: Aug 2018

Posts: 3050
#2

11 Oct 2020, 00:18

As far as -collapse- is concerned, you will have to do what you are describing: collapse it once weighted, collapse it once unweighted, and then merge the two.
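
A minimal sketch of that two-pass approach, for illustration only. The variable names Alpha, Beta and Weight come from the question; the grouping variable g and the data values are made up, and the weights are assumed to be analytic weights.

```stata
* Hypothetical example data; g is an assumed grouping variable
clear
input g Alpha Beta Weight
1 1 10 1
1 2 20 3
2 3 30 2
2 5 50 2
end

* Pass 1: weighted mean of Beta, saved to a temporary file
preserve
collapse (mean) Beta [aw=Weight], by(g)
tempfile wtd
save `wtd'
restore

* Pass 2: unweighted mean of Alpha, then bring in the weighted result
collapse (mean) Alpha, by(g)
merge 1:1 g using `wtd', assert(match) nogenerate
list
```

Using -preserve- and -restore- with a tempfile avoids reloading the data from disk by hand between the two collapses.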
Matt Mulligan

Join Date: May 2019

Posts: 9
#3

13 Oct 2020, 00:23

Thanks. That's what I'll do.
Sergiy Radyakin

Join Date: Apr 2014

Posts: 1867
#4

27 Jan 2023, 15:40

Hello Matt Mulligan and Joro Kolev,

I am in the same situation, and was looking for the answer exactly to avoid the solution you gravitated to. I wanted a single-pass collapse to deliver both weighted and unweighted results.
I think my code is working, but if you have looked at this task before and can point out where it may fail, please shout!

So far the code has passed everything I could throw at it (assuming, of course, that the weights are never missing).
The -generate- line (highlighted in the original post) makes it all feasible at the cost of an extra variable, but I can live with that.

Thank you, Sergiy Radyakin

Code:

version 16.0
clear all
input x w
1 1
2 2
3 2
7 1
999 321
end
summarize x
local ms=r(mean)
summarize x [aw=w]
local mw=r(mean)
generate y=x*w
collapse x (sum)y (sum)w
display x
display y/w
local epsilon=0.0000001
assert reldif(x,`ms') < `epsilon'
assert reldif(y/w,`mw') < `epsilon'
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#5

27 Jan 2023, 16:16

I don't see any difficulty with your solution. If the sum of w is very close to zero, then there could be instability in the calculation of y/w, but I don't know that -collapse- with weights would necessarily handle that any better. Similarly, if you had some large weights and then a number of much, much smaller weights, adding up those weights could give a false sum if all the digits of the small weights got shifted into oblivion during the addition. But weight variables are rarely scaled that way. And it's likely that Stata's algorithms for calculating sums and means are aware of these potential problems and are robust to them.

I tried your code with a few examples that might stress the code in that way, but none of them broke it. I think you can say that for all but the most seriously pathological situations your code is robust, and maybe even robust to them.

Last edited by Clyde Schechter; 27 Jan 2023, 16:19.
Joro Kolev

Join Date: Aug 2018

Posts: 3050
#6

28 Jan 2023, 02:04

Hi Sergiy,

I do not see anything wrong with your solution. The definition of a weighted arithmetic mean is SUM_i [W_i / (SUM_j W_j)] * X_i, and this is what you are doing. I would have just implemented your idea in the reverse order, like this:

Code:

. summ x, meanonly
. local MeanX = r(mean)
. summ x [aw = w], meanonly
. local WmeanX = r(mean)
. * Difference starts here
. summ w, meanonly
. gen y = r(N)*w*x/r(sum)
. collapse (mean) x y
. local epsilon=0.0000001
. assert reldif(x,`MeanX') < `epsilon'
. assert reldif(y,`WmeanX') < `epsilon'
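
A short derivation of why the plain mean of y reproduces the weighted mean of x. It assumes that N, the value of r(N), is also the number of observations entering the mean of y, that is, that neither x nor w is missing.

```latex
\bar{y}
  = \frac{1}{N}\sum_{i=1}^{N} \frac{N\, w_i x_i}{\sum_{j=1}^{N} w_j}
  = \frac{\sum_{i=1}^{N} w_i x_i}{\sum_{j=1}^{N} w_j}
```

The factor N in the definition of y cancels the 1/N of the unweighted mean, leaving the usual weighted-mean ratio.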
Joro Kolev

Join Date: Aug 2018

Posts: 3050
#7

28 Jan 2023, 02:20

And what I did would fail if there are missing x while w is not missing, which is not a defect of your code. My revised code would be

Code:

. summ w if !missing(x), meanonly
. gen y = r(N)*w*x/r(sum)
. collapse (mean) x y

Overall, your solution is better if you are willing to think: think about what the formula for the weighted mean is, think about what you do with the missings... Then you produce more efficient code.

The initial solution with the two collapses, one unweighted and one weighted, and then merging, is more verbose but requires much less thinking.
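
A hedged sketch of how the sum-based, single-pass idea might be combined with a by() group and missing values of x. The grouping variable g, the helper names wx and wok, and the data are made up for illustration. It assumes -collapse- without the cw option uses all available observations for each statistic, so the sum of the weights has to be restricted by hand to observations where x is not missing.

```stata
* Hypothetical data: g is an assumed grouping variable; one x is missing
clear
input g x w
1 1 1
1 2 3
1 . 5
2 3 2
2 5 2
end

* Weighted product, and a copy of the weight that is missing wherever x is
generate double wx  = w*x
generate double wok = w if !missing(x)

* One pass: unweighted mean of x plus the two sums for the weighted mean
collapse (mean) x (sum) wx wok, by(g)
generate double wmean = wx/wok
list
```

Summing wok rather than w keeps the weight of the observation with missing x out of the denominator, which mirrors what -summarize x [aw=w]- does within each group.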
Nick Cox

Join Date: Mar 2014

Posts: 35698
#8

28 Jan 2023, 03:36

Pushing the result into a local macro and pulling it out again is sometimes a needless complication.

More importantly here, it will lose you some precision in some instances, so using a scalar is preferable. The point is that a local macro is really a string that here just happens to hold numeric characters and there is often a little loss of detail that could be troubling. That said, it is hard to find examples where this really bites.

Also, use doubles to maximize precision.
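
A minimal sketch of the check from #4 rewritten along these lines, with the reference means held in scalars and the product stored as a double. The scalar names ms and mw follow the macro names in #4; the tolerance is written as a literal.

```stata
clear
input x w
1 1
2 2
3 2
7 1
999 321
end

* Keep the reference results in scalars rather than local macros
summarize x, meanonly
scalar ms = r(mean)
summarize x [aw=w], meanonly
scalar mw = r(mean)

* Store the product in double precision
generate double y = x*w
collapse (mean) x (sum) y w

assert reldif(x, scalar(ms)) < 1e-7
assert reldif(y/w, scalar(mw)) < 1e-7
```

The scalar() pseudofunction is used so that a scalar name can never be confused with a variable of the same name.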
Joro Kolev

Join Date: Aug 2018

Posts: 3050
#9

28 Jan 2023, 07:35

Originally posted by Nick Cox

Pushing the result into a local macro and pulling it out again is sometimes a needless complication.

More importantly here, it will lose you some precision in some instances, so using a scalar is preferable. The point is that a local macro is really a string that here just happens to hold numeric characters and there is often a little loss of detail that could be troubling. That said, it is hard to find examples where this really bites.

Also, use doubles to maximize precision.

This advice on scalars vs locals is of course very relevant; I personally think that scalars are much underused in Stata.

Also doing the calculation in double precision is always a good idea.

But here the locals were irrelevant to the calculation at hand; we used them just to verify that what we have done agrees with the result that -summarize-, weighted or unweighted, gives.
Nick Cox

Join Date: Mar 2014

Posts: 35698
#10

28 Jan 2023, 11:04

#9 is puzzling.

In #4 and #6 results are being compared using local macros. As said, that is unlikely to bite, but the fact remains that comparison of scalar results -- or comparison of variables using the same storage types -- would be the most direct test of whether methods produce the same, or practically the same, answers.
Daniel Ruiz

Join Date: Aug 2017

Posts: 4
#11

02 May 2023, 21:14

I can't believe we have arrived at Stata 18 without a collapse (rawmean) xxx, by(). 🥲

DR
Clyde Schechter

Join Date: Apr 2014

Posts: 30101
#12

03 May 2023, 09:33

While that would be a nice convenience, we do have -(rawsum)- in -collapse-, which together with -(count)- enables calculation of the raw mean afterward.
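
A hedged sketch of that idea. The grouping variable g, the result names wmean, sumx, n and rawmean, and the data are made up for illustration. It assumes analytic weights, under which (count) is taken to report the number of nonmissing observations rather than a sum of weights; with other weight types the count may behave differently, so the result is worth checking against -summarize-.

```stata
* Hypothetical data; g is an assumed grouping variable
clear
input g x w
1 1 1
1 2 3
2 3 2
2 5 2
end

* One weighted collapse: weighted mean, unweighted sum, and observation count
collapse (mean) wmean=x (rawsum) sumx=x (count) n=x [aw=w], by(g)

* Raw (unweighted) mean recovered afterward
generate double rawmean = sumx/n
list
```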