  • rmean but for weighted average

    Hello everyone,
    Is there something analogous to
    Code:
    egen average = rmean( netherlands belgium portugal france italy)
    for a weighted average?

    (Also, I'm looking for something that would handle missing values in a reasonable manner. For example, if italy is missing, I would want to compute the weighted average of the other 4 countries.)
    Last edited by Stan Peterburgsky; 27 Nov 2023, 15:49.

  • #2
    egen doesn't support weights cleanly -- although they could be passed in an option -- and in this case there could presumably be as many weight variables as main variables. If weights were constant across observations, you wouldn't need this anyway. So, here is some token code.

    If you wanted to wrap this in a program, you'd make the working variables temporary and (possibly) make the count variable optional. You would also support if and in, and you'd think about implementing it in Mata. A rough sketch in that direction follows the listing below.

    Code:
    clear
    set obs 10
    set seed 2803 
    * simulate three weight and three value variables, each with about 10% missing values
    foreach v in w1 w2 w3 x1 x2 x3 { 
        gen `v'= cond(runiform() < 0.1, ., runiformint(1, 10)) 
    }
    
    gen double numer = 0 
    gen double denom = 0 
    gen count = 0
     
    * pair up the weight and value variables and check that the counts match
    unab w : w* 
    unab x : x*
    local nx = wordcount("`x'")  
    assert `nx' == wordcount("`w'")
    
    * accumulate sums only over pairs where both weight and value are nonmissing
    forval j = 1/`nx' { 
        local X : word `j' of `x'
        local W : word `j' of `w'
        replace numer = numer + (`W' * `X') if !missing(`W', `X')
        replace denom = denom + (`W') if !missing(`W', `X')
        replace count = count + !missing(`W', `X')
    }
    
    gen double wanted = numer/denom 
    
    list 
    
         +-----------------------------------------------------------------+
         | w1   w2   w3   x1   x2   x3   numer   denom   count      wanted |
         |-----------------------------------------------------------------|
      1. |  6    7    5    3    9    8     121      18       3   6.7222222 |
      2. |  2    3   10    9    4    2      50      15       3   3.3333333 |
      3. |  1    2    .   10    8    8      26       3       2   8.6666667 |
      4. |  7    2    8   10    5    9     152      17       3   8.9411765 |
      5. |  .    1   10   10   10    3      40      11       2   3.6363636 |
         |-----------------------------------------------------------------|
      6. |  2    3    1    6    6    2      32       6       3   5.3333333 |
      7. |  2    4    6   10    4    8      84      12       3           7 |
      8. | 10    4    4   10    5    2     128      18       3   7.1111111 |
      9. |  3    4    5    1   10    6      73      12       3   6.0833333 |
     10. |  7    5    4    3    1    2      34      16       3       2.125 |
         +-----------------------------------------------------------------+
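
    Here is that rough sketch, purely illustrative and untested: the program name wtrowmean and its xvars()/wvars() options are my own inventions, not an established command. It keeps the working variables temporary and supports if and in; the optional count variable and a Mata implementation are left out.

    Code:
    capture program drop wtrowmean
    program define wtrowmean
        version 16
        // new variable to create, plus value and weight variables of equal number
        syntax newvarname [if] [in], Xvars(varlist numeric) Wvars(varlist numeric)
    
        local nx : word count `xvars'
        local nw : word count `wvars'
        if `nx' != `nw' {
            display as error "xvars() and wvars() must have the same number of variables"
            exit 198
        }
    
        // honour if/in; keep observations even if some inputs are missing
        marksample touse, novarlist
    
        tempvar numer denom
        quietly {
            gen double `numer' = 0 if `touse'
            gen double `denom' = 0 if `touse'
            forvalues j = 1/`nx' {
                local X : word `j' of `xvars'
                local W : word `j' of `wvars'
                replace `numer' = `numer' + `W' * `X' if `touse' & !missing(`W', `X')
                replace `denom' = `denom' + `W'       if `touse' & !missing(`W', `X')
            }
            gen double `varlist' = `numer' / `denom' if `touse'
        }
    end
    
    * example call on the simulated data above; wanted2 is just a new variable name
    wtrowmean wanted2, xvars(x1 x2 x3) wvars(w1 w2 w3)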



    • #3
      Thank you, Nick. I will try this code tomorrow.

      The weights are actually mostly constant. The exception is that if there is a missing value, I need to rescale the weights so that they add up to 100%. With 15 countries, there are potentially 2^15 combinations of missing/nonmissing values, so addressing each case separately is not really practical.
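
      As a quick check of my own (the numbers here are invented for illustration): dividing the weighted sum by the sum of the nonmissing weights, as the code in #2 does, is the same as explicitly rescaling the nonmissing weights so they add up to 1.

      Code:
      * raw weights .5 .3 .2 on three values, with the third value missing
      display (.5*4 + .3*10) / (.5 + .3)    // weighted sum over sum of nonmissing weights
      display (.5/.8)*4 + (.3/.8)*10        // same result with weights rescaled to sum to 1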



      • #4
        Everything works great, thank you for your help. I just have a follow-up question: what's the advantage of making numer and denom doubles as opposed to floats here? The following code makes it appear as if denom doesn't equal 1 in any of the 400 observations, whereas in reality denom is usually equal to 1. I believe this "contradiction" wouldn't occur if denom were made a float.

        Code:
        . count if denom!=1
          400
        
        . table denom
        
        ----------------------
            denom |      Freq.
        ----------+-----------
        .74567972 |          1
        .76414793 |          1
         .8083977 |          2
        .83720278 |          6
        .85359532 |          1
        .86436813 |          2
        .89319275 |          2
        .94661069 |          1
        .95272672 |          1
        .95575023 |          2
        .95589119 |          3
        .97082649 |          1
         .9732861 |          2
        .97429188 |          2
        .97944062 |          5
         .9850647 |          1
                1 |        367
        ----------------------
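
        A quick check of my own, assuming the denom variable built as in #2: table shows values rounded to the display format, whereas != compares the stored doubles exactly, so it helps to look at a stored value at full precision and to count with a tolerance or after rounding to float precision.

        Code:
        display %21.18f denom[1]            // a stored value at full precision
        count if abs(denom - 1) < 1e-12     // tolerance test instead of exact equality
        count if float(denom) != 1          // or compare after rounding to float precision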



        • #5
          The motive for using a double is just to make full use of the information, mindful, for example, of the instability of quotients on division.

          You can test your guess on your own data, but naturally I can't. You didn't give a data example in #1 or explain the storage types you're using, and indeed I wouldn't usually expect that to be important.

          Two extreme scenarios both seem plausible.

          No one really cares about later decimal places, as they cannot be interpreted in practice.

          There is a challenge around exact repeatability of results, in which whether different software uses float or double, their equivalents, or something else might turn out to be an issue.
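
          A small self-contained illustration of my own, not from the thread, of what is at stake (note that it starts by clearing the data in memory): accumulate the same quantity into a double and into a float, then test exact equality with 1.

          Code:
          * add 1/15 fifteen times into a double and into a float accumulator
          clear
          set obs 1
          gen double d = 0
          gen float  f = 0
          forvalues i = 1/15 {
              quietly replace d = d + 1/15
              quietly replace f = f + 1/15
          }
          display %21.18f d[1]    // the double sum at full precision
          display %21.18f f[1]    // the float sum, rounded at each step
          display d[1] == 1       // exact comparison for the double
          display f[1] == 1       // exact comparison for the float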
