  • rmean but for weighted average

    Hello everyone,
    Is there something analogous to
    Code:
    egen average = rmean( netherlands belgium portugal france italy)
    for a weighted average?

    (Also, I'm looking for something that would handle missing values in a reasonable manner. For example, if italy is missing, I would want to compute the weighted average of the other 4 countries.)
    Last edited by Stan Peterburgsky; 27 Nov 2023, 15:49.

  • #2
    egen doesn't support weights cleanly -- although they could be passed in an option -- and in this case there could presumably be as many weight variables as main variables. If weights were constant across observations, you wouldn't need this anyway. So, here is some token code.

    If you wanted to wrap this in a program, you'd make the working variables temporary and (possibly) make the count variable optional. You would also support if and in, and you'd think about implementing it in Mata. A rough sketch in that direction follows the listing below.

    Code:
    clear
    set obs 10
    set seed 2803 
    * simulate three weight and three value variables, each with about 10% missing values
    foreach v in w1 w2 w3 x1 x2 x3 { 
        gen `v'= cond(runiform() < 0.1, ., runiformint(1, 10)) 
    }
    
    gen double numer = 0 
    gen double denom = 0 
    gen count = 0
     
    * pair up the weight and value variables and check that the counts match
    unab w : w* 
    unab x : x*
    local nx = wordcount("`x'")  
    assert `nx' == wordcount("`w'")
    
    * accumulate sums only over pairs where both weight and value are nonmissing
    forval j = 1/`nx' { 
        local X : word `j' of `x'
        local W : word `j' of `w'
        replace numer = numer + (`W' * `X') if !missing(`W', `X')
        replace denom = denom + (`W') if !missing(`W', `X')
        replace count = count + !missing(`W', `X')
    }
    
    gen double wanted = numer/denom 
    
    list 
    
         +-----------------------------------------------------------------+
         | w1   w2   w3   x1   x2   x3   numer   denom   count      wanted |
         |-----------------------------------------------------------------|
      1. |  6    7    5    3    9    8     121      18       3   6.7222222 |
      2. |  2    3   10    9    4    2      50      15       3   3.3333333 |
      3. |  1    2    .   10    8    8      26       3       2   8.6666667 |
      4. |  7    2    8   10    5    9     152      17       3   8.9411765 |
      5. |  .    1   10   10   10    3      40      11       2   3.6363636 |
         |-----------------------------------------------------------------|
      6. |  2    3    1    6    6    2      32       6       3   5.3333333 |
      7. |  2    4    6   10    4    8      84      12       3           7 |
      8. | 10    4    4   10    5    2     128      18       3   7.1111111 |
      9. |  3    4    5    1   10    6      73      12       3   6.0833333 |
     10. |  7    5    4    3    1    2      34      16       3       2.125 |
         +-----------------------------------------------------------------+
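
    Here is that rough sketch, purely illustrative and untested: the program name wtrowmean and its xvars()/wvars() options are my own inventions, not an established command. It keeps the working variables temporary and supports if and in; the optional count variable and a Mata implementation are left out.

    Code:
    capture program drop wtrowmean
    program define wtrowmean
        version 16
        // new variable to create, plus value and weight variables of equal number
        syntax newvarname [if] [in], Xvars(varlist numeric) Wvars(varlist numeric)
    
        local nx : word count `xvars'
        local nw : word count `wvars'
        if `nx' != `nw' {
            display as error "xvars() and wvars() must have the same number of variables"
            exit 198
        }
    
        // honour if/in; keep observations even if some inputs are missing
        marksample touse, novarlist
    
        tempvar numer denom
        quietly {
            gen double `numer' = 0 if `touse'
            gen double `denom' = 0 if `touse'
            forvalues j = 1/`nx' {
                local X : word `j' of `xvars'
                local W : word `j' of `wvars'
                replace `numer' = `numer' + `W' * `X' if `touse' & !missing(`W', `X')
                replace `denom' = `denom' + `W'       if `touse' & !missing(`W', `X')
            }
            gen double `varlist' = `numer' / `denom' if `touse'
        }
    end
    
    * example call on the simulated data above; wanted2 is just a new variable name
    wtrowmean wanted2, xvars(x1 x2 x3) wvars(w1 w2 w3)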



    • #3
      Thank you, Nick. I will try this code tomorrow.

      The weights are actually mostly constant. The exception is that if there is a missing value, I need to rescale the weights so that they add up to 100%. With 15 countries, there are potentially 2^15 combinations of missing/nonmissing values, so addressing each case separately is not really practical.
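
      As a quick check of my own (the numbers here are invented for illustration): dividing the weighted sum by the sum of the nonmissing weights, as the code in #2 does, is the same as explicitly rescaling the nonmissing weights so they add up to 1.

      Code:
      * raw weights .5 .3 .2 on three values, with the third value missing
      display (.5*4 + .3*10) / (.5 + .3)    // weighted sum over sum of nonmissing weights
      display (.5/.8)*4 + (.3/.8)*10        // same result with weights rescaled to sum to 1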



      • #4
        Everything works great, thank you for your help. I just have a follow-up question: what's the advantage of making numer and denom doubles as opposed to floats here? The following code makes it appear as if denom doesn't equal 1 in any of the 400 observations, whereas in reality denom is usually equal to 1. I believe this "contradiction" wouldn't occur if denom were made a float.

        Code:
        . count if denom!=1
          400
        
        . table denom
        
        ----------------------
            denom |      Freq.
        ----------+-----------
        .74567972 |          1
        .76414793 |          1
         .8083977 |          2
        .83720278 |          6
        .85359532 |          1
        .86436813 |          2
        .89319275 |          2
        .94661069 |          1
        .95272672 |          1
        .95575023 |          2
        .95589119 |          3
        .97082649 |          1
         .9732861 |          2
        .97429188 |          2
        .97944062 |          5
         .9850647 |          1
                1 |        367
        ----------------------
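
        A quick check of my own, assuming the denom variable built as in #2: table shows values rounded to the display format, whereas != compares the stored doubles exactly, so it helps to look at a stored value at full precision and to count with a tolerance or after rounding to float precision.

        Code:
        display %21.18f denom[1]            // a stored value at full precision
        count if abs(denom - 1) < 1e-12     // tolerance test instead of exact equality
        count if float(denom) != 1          // or compare after rounding to float precision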



        • #5
          The motive for using a double is just to make full use of the information, mindful, for example, of the instability of quotients on division.

          You can test your guess on your own data, but naturally I can't. You didn't give a data example in #1 or explain the storage types you're using, and indeed I wouldn't usually expect that to be important.

          Two extreme scenarios both seem plausible.

          No one really cares about later decimal places, as they cannot be interpreted in practice.

          There is a challenge around exact repeatability of results, in which whether different software uses float or double, their equivalents, or something else might turn out to be an issue.
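
          A small self-contained illustration of my own, not from the thread, of what is at stake (note that it starts by clearing the data in memory): accumulate the same quantity into a double and into a float, then test exact equality with 1.

          Code:
          * add 1/15 fifteen times into a double and into a float accumulator
          clear
          set obs 1
          gen double d = 0
          gen float  f = 0
          forvalues i = 1/15 {
              quietly replace d = d + 1/15
              quietly replace f = f + 1/15
          }
          display %21.18f d[1]    // the double sum at full precision
          display %21.18f f[1]    // the float sum, rounded at each step
          display d[1] == 1       // exact comparison for the double
          display f[1] == 1       // exact comparison for the float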
