The weights in -summarize- behave not as advertised in the manual, and can somebody explain frequency and analytic weights in this context

Joro Kolev

Join Date: Aug 2018
Posts: 3050

26 Mar 2021, 11:40

Working through a simple example, I found that Stata does not do what the -summarize- manual entry says. I would kindly ask an expert on weights, in particular on the difference between frequency and analytic weights, to give an opinion on what should happen here.

The simple example dataset is

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(weight myvar)
6 1
4 2
0 3
end

The first problem is that the Methods and Formulas section of the -summarize- manual entry provides formulas for only one type of weight, although -summarize- accepts three types: analytic, frequency, and importance. Because the manual gives its formulas without differentiating between the three types, I expected all of them to give the same result. But they do not:

Code:

. summ myvar [aw=weight]

    Variable |     Obs      Weight        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------------------
       myvar |       2          10         1.4   .6928203          1          2

. return list

scalars:
                  r(N) =  2
              r(sum_w) =  10
               r(mean) =  1.4
                r(Var) =  .48
                 r(sd) =  .6928203230275509
                r(min) =  1
                r(max) =  2
                r(sum) =  14

. summ myvar [iw=weight]

    Variable |     Obs      Weight        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------------------
       myvar |       2          10         1.4   .5163978          1          2

. return list

scalars:
                  r(N) =  2
              r(sum_w) =  10
               r(mean) =  1.4
                r(Var) =  .2666666666666667
                 r(sd) =  .5163977794943222
                r(min) =  1
                r(max) =  2
                r(sum) =  14

. summ myvar [fw=weight]

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
       myvar |         10         1.4    .5163978          1          2

. return list

scalars:
                  r(N) =  10
              r(sum_w) =  10
               r(mean) =  1.4
                r(Var) =  .2666666666666667
                 r(sd) =  .5163977794943222
                r(min) =  1
                r(max) =  2
                r(sum) =  14

.

The mean is

Code:

. dis .6*1 + .4*2
1.4

so the mean is clear. The maximum is 2 because the zero weight on the value 3 drops that observation, and the number of observations and the sum of weights also seem clear.

However, why are the standard deviations and variances different when the manual displays only one formula for all weights?

And why do importance weights coincide with frequency weights rather than with analytic weights?


daniel klein

Join Date: Mar 2014

Posts: 3872
#2

26 Mar 2021, 12:22

Disclaimer: I am anything but an expert on weights.

You will find this FAQ interesting.

More generally, weights are explained in their own help-file

Code:

help weight

You will find a more detailed discussion in [U] 20.24 Weighted estimation, which is linked from the help file above.

The Methods and Formulas section of summarize is correct because all weights that summarize accepts are treated in the same way; it is just that aweights are rescaled to sum to N before they are plugged into the formula.
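To make the rescaling concrete, here is a worked version for the three-observation example from the first post. It is only a sketch: it assumes the variance formula has the form s^2 = 1/(n-1) * sum of w_i (x_i - xbar)^2, it writes v_i for the weights as supplied by the user (notation chosen here for illustration, which may differ from the manual's), and it takes n to count the observations with nonzero weight, as the reported r(N) = 2 suggests.

```latex
\begin{align*}
\text{aweight: } & n = 2,\quad w_i = v_i \,\frac{n}{\sum_j v_j} = (1.2,\ 0.8), \\
& s^2 = \frac{1.2\,(1-1.4)^2 + 0.8\,(2-1.4)^2}{2-1}
      = \frac{0.192 + 0.288}{1} = 0.48 \\[4pt]
\text{fweight: } & n = \sum_j v_j = 10,\quad w_i = v_i = (6,\ 4), \\
& s^2 = \frac{6\,(1-1.4)^2 + 4\,(2-1.4)^2}{10-1}
      = \frac{0.96 + 1.44}{9} \approx 0.2667
\end{align*}
```

Both values agree with the r(Var) results shown in the first post, so the same formula appears to cover both cases once the weights and n are defined accordingly.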
William Lisowski

Join Date: Dec 2014

Posts: 10150
#3

26 Mar 2021, 12:42

Added in edit: I got distracted composing this while doing the laundry and daniel klein provided a better answer than this, so he is more of an expert than I am.

Apparently the manual entry for summarize describes only analytic weights: its calculations use weights w_i that are derived from the given weights v_i by normalizing them to sum to the number of observations n. Frequency weights treat the data as it would appear if the command were run unweighted after first running expand weight to replicate observations, so w_i = v_i and n is replaced by the sum of the w_i. And as help weight tells us, importance weights are treated arbitrarily by each command, and it appears that summarize treats them identically to frequency weights.
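A minimal sketch to check this reading against the example data from the first post. The variable name weight10 is made up for illustration, and the zero-weight observation is dropped before expand on the assumption that expand would otherwise keep it as a single copy rather than remove it.

```stata
* Rebuild the example data from the first post
clear
input float(weight myvar)
6 1
4 2
0 3
end

* Hypothetical helper variable: the same weights multiplied by 10
generate weight10 = 10*weight

* Analytic weights are rescaled internally, so multiplying them
* by a constant should leave the mean and Std. Dev. unchanged
summarize myvar [aw=weight]
summarize myvar [aw=weight10]

* Frequency weights are taken literally as replication counts,
* so the same multiplication changes N and hence the Std. Dev.
summarize myvar [fw=weight10]

* Replicate the [fw=weight] result by physically expanding the data;
* the zero-weight row is dropped first (see the assumption above)
drop if weight == 0
expand weight
summarize myvar
```

If the explanation above is right, the two aweight runs should report the same standard deviation, and the final unweighted summarize on the expanded data should match the [fw=weight] output in the first post.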

[U] 20.24 Weighted Estimation has a lot to say about the different types of weights. It is unfortunate that the output of help weight links only to [U] 11.1.6 weight which is not much more than is in the help file, but it does in turn link to [U] 20.24 Weighted Estimation.

Last edited by William Lisowski; 26 Mar 2021, 12:46.