
  • Weights in -summarize- do not behave as described in the manual; can somebody explain frequency and analytic weights in this context?

    Using a simple example, I discovered that Stata does not do what the manual entry for -summarize- says. I would kindly ask an expert on weights, in particular on the difference between frequency and analytic weights, to give an opinion on what should happen here.

    The simple example dataset is

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float(weight myvar)
    6 1
    4 2
    0 3
    end
    The first problem is that the Methods and Formulas section of the -summarize- manual entry provides formulas for only one type of weight, although -summarize- accepts three types: analytic, frequency, and importance. Because the manual gives its formulas for weights without differentiating between the three types, I expected them all to give the same result. But they do not:

    Code:
    . summ myvar [aw=weight]
    
        Variable |     Obs      Weight        Mean   Std. Dev.       Min        Max
    -------------+-----------------------------------------------------------------
           myvar |       2          10         1.4   .6928203          1          2
    
    . return list
    
    scalars:
                      r(N) =  2
                  r(sum_w) =  10
                   r(mean) =  1.4
                    r(Var) =  .48
                     r(sd) =  .6928203230275509
                    r(min) =  1
                    r(max) =  2
                    r(sum) =  14
    
    . summ myvar [iw=weight]
    
        Variable |     Obs      Weight        Mean   Std. Dev.       Min        Max
    -------------+-----------------------------------------------------------------
           myvar |       2          10         1.4   .5163978          1          2
    
    . return list
    
    scalars:
                      r(N) =  2
                  r(sum_w) =  10
                   r(mean) =  1.4
                    r(Var) =  .2666666666666667
                     r(sd) =  .5163977794943222
                    r(min) =  1
                    r(max) =  2
                    r(sum) =  14
    
    . summ myvar [fw=weight]
    
        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
           myvar |         10         1.4    .5163978          1          2
    
    . return list
    
    scalars:
                      r(N) =  10
                  r(sum_w) =  10
                   r(mean) =  1.4
                    r(Var) =  .2666666666666667
                     r(sd) =  .5163977794943222
                    r(min) =  1
                    r(max) =  2
                    r(sum) =  14
    
    .
    The mean is
    Code:
    . dis .6*1 + .4*2
    1.4
    so it is clear where that comes from. The maximum is 2 because the zero weight on 3 drops that observation, and the number of observations and the sum of weights also seem clear.

    However, why do the standard deviations and variances differ when the manual displays only one formula for all weights?

    And why do the importance weights coincide with the frequency weights rather than with the analytic weights?

  • #2
    Disclaimer: I am anything but an expert on weights.

    You will find this FAQ interesting.

    More generally, weights are explained in their own help-file

    Code:
    help weight
    You will find a more detailed discussion in [U] 20.24 Weighted estimation, which is linked from the help file above.

    The Methods and Formulas section of summarize is correct because all weights that summarize accepts are treated in the same way; it is just that aweights are rescaled to sum to N before they are plugged into the formula.
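
    The rescaling claim can be checked on the example data from #1. The following is only a sketch: it rescales the weights by hand so that they sum to the number of observations with positive weight, then passes them as iweights. The variable name w_scaled and the scalar names are made up for this illustration. If the description above is right, both calls should report the same variance, .48, which is the r(Var) shown for aweights in #1.

    ```stata
    clear
    input float(weight myvar)
    6 1
    4 2
    0 3
    end

    * N and the sum of the weights, counting only observations with positive weight
    quietly summarize weight if weight > 0

    * Rescale by hand so that the weights sum to N (here 1.2, .8 and 0)
    generate double w_scaled = weight * r(N) / r(sum)

    * aweights: Stata does the rescaling internally
    summarize myvar [aw=weight]
    scalar var_aw = r(Var)

    * iweights: the hand-rescaled weights are used as given
    summarize myvar [iw=w_scaled]
    scalar var_iw = r(Var)

    display "aweight variance = " var_aw "   rescaled iweight variance = " var_iw
    ```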



    • #3
      Added in edit: I got distracted composing this while doing the laundry and daniel klein provided a better answer than this, so he is more of an expert than I am.

      Apparently the manual entry for summarize describes only analytic weights, because its calculations use weights w_i that have been derived from the given weights v_i by normalizing the w_i to sum to the number of observations n. Frequency weights treat the data as it would appear if the command were run unweighted after first running expand weight to replicate observations, so w_i = v_i and n is replaced by the sum of the w_i. And as help weight tells us, importance weights are treated arbitrarily by each command, and it appears that summarize treats them identically to frequency weights.
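
      Written out for the example in #1, the arithmetic looks as follows. This is a reconstruction that reproduces the r(Var) values shown above, not a quotation from the manual; W is a symbol introduced here for the sum of the w_i, and n = 2 is the number of observations with nonzero weight.

      ```latex
      \bar{x} = \frac{\sum_i v_i x_i}{\sum_i v_i} = \frac{6 \cdot 1 + 4 \cdot 2}{10} = 1.4,
      \qquad
      s^2 = \frac{1}{W - 1} \sum_i w_i (x_i - \bar{x})^2,
      \qquad
      W = \sum_i w_i

      % aweights: w_i = v_i \, n / \sum_j v_j, so w = (1.2, 0.8) and W = n = 2
      s^2_{\text{aw}} = \frac{1.2 \cdot 0.16 + 0.8 \cdot 0.36}{2 - 1} = 0.48

      % fweights (and, in summarize, iweights): w_i = v_i, so w = (6, 4) and W = 10
      s^2_{\text{fw}} = \frac{6 \cdot 0.16 + 4 \cdot 0.36}{10 - 1} = \frac{2.4}{9} \approx 0.2667
      ```

      The sum of squares differs by exactly the rescaling factor n / 10, and the denominator changes from 10 - 1 to 2 - 1, which is where the two variances part ways.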

      [U] 20.24 Weighted Estimation has a lot to say about the different types of weights. It is unfortunate that the output of help weight links only to [U] 11.1.6 weight which is not much more than is in the help file, but it does in turn link to [U] 20.24 Weighted Estimation.
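
      The expand equivalence for frequency weights mentioned above can be tried directly on the data from #1. This is a minimal sketch, not something taken from the manual. One detail is easy to miss: expand keeps an observation whose count is less than 1 rather than dropping it, so the zero-weight observation has to be removed first.

      ```stata
      clear
      input float(weight myvar)
      6 1
      4 2
      0 3
      end

      * expand does not delete observations with weight 0, so drop them explicitly
      drop if weight == 0

      * Replicate each observation weight times: six copies of 1, four copies of 2
      expand weight

      * Unweighted summary of the expanded data
      summarize myvar
      ```

      The unweighted summary of the expanded data should match the [fw=weight] output in #1: 10 observations, mean 1.4, standard deviation .5163978.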
      Last edited by William Lisowski; 26 Mar 2021, 12:46.

