  • Different results for weighted median using same Stata Manual Methodology [Stata/SE 15.0]

    Hello everyone,

    I found something about weighted medians in Stata/SE. To my knowledge and what I have found at the moment, this is not reported or explained in the forum.

    Consider a dataset with 176 observations and two variables: a monetary variable (y) and a weight (w), normalized so that the weights sum to the number of observations. The data are sorted by "y". Example:

    Code:
    . list in 1/5
    
         +-----------------------+
         |         y           w |
         |-----------------------|
      1. |     153.3   1.2242037 |
      2. |   75753.3   1.2242037 |
      3. | 92089.306   1.2255392 |
      4. |  113553.3   .80866169 |
      5. | 119325.52   1.2849769 |
         +-----------------------+
    
    . 
    . list in 171/176
    
         +-----------------------+
         |         y           w |
         |-----------------------|
    171. | 1008153.3   1.9178511 |
    172. | 1050153.3    .8489436 |
    173. | 1191875.5   1.2725638 |
    174. | 1428153.3    1.039806 |
    175. |   1671717   1.3656695 |
         |-----------------------|
    176. | 1932153.3   .74033083 |
         +-----------------------+
    Obtaining the weighted median with -summarize, detail- gives:

    Code:
    . sum y [aw=w], d
    
                                  y
    -------------------------------------------------------------
          Percentiles      Smallest
     1%      75753.3          153.3
     5%     133878.4        75753.3
    10%     188204.2       92089.31       Obs                 176
    25%     221473.1       113553.3       Sum of Wgt.         176
    
    50%     338399.3                      Mean           405967.1
                            Largest       Std. Dev.      271224.9
    75%     504507.2        1191876
    90%     714153.3        1428153       Variance       7.36e+10
    95%     840153.3        1671717       Skewness       2.291167
    99%      1671717        1932153       Kurtosis       10.72572
    
    . di r(p50)
    338399.33
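
    As a cross-check, the same percentile can also be requested from -_pctile-, which accepts the same weights. This is only a minimal sketch, assuming y and w as above; whether its result matches -summarize- for this dataset is not shown here.

    ```stata
    * Sketch: request the weighted 50th percentile from _pctile
    * (assumes y and w are in memory, as in the example above)
    _pctile y [aw=w], p(50)
    display %12.0g r(r1)    // r(r1) holds the first requested percentile
    ```
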
    Now, if I calculate the weighted median manually, following the methodology for percentiles described in the Stata Base Reference Manual: Release 15, pages 2673-2674, I get this:

    Code:
    .* Following reference manual
    
    . preserve
    
    .         gen P = (0.5*_N)    // defining the cutting point for the 50th percentile
    
    .         gen W = w if _n == 1  // Defining the cumulative sum of weights
    (175 missing values generated)
    
    .         replace W = w[_n] + W[_n-1] if _n > 1
    (175 real changes made)
    
    .         gen index = ( W > P )  // Index for finding "center" of weighted distribution
    
    .         replace index = index[_n] + index[_n-1] if _n > 1
    (88 real changes made)
    
    . * Calculating median 
    
    .         gen aux_median = ( y[_n-1] + y[_n] )/2 if index == 1 & W[_n-1] == P
    (176 missing values generated)
    
    .         replace aux_median = y if index == 1 & W[_n-1] != P
    (1 real change made)
    
    .         replace aux_median = 0 if aux_median == .
    (175 real changes made)
    
    .         egen median = max(aux_median)
    
    .         di median
    336153.3
    
    . restore
    This is a different result. What could be happening here?

    One hypothesis (which I can't confirm, as -summarize- is a built-in command) is that this is related to the number of decimals that -summarize- considers when using weights. In fact, arbitrarily rounding the cumulative weights to three decimals reproduces the -summarize- result:


    Code:
    . * Following reference manual
    . preserve
    
    .         gen P = (0.5*_N)    // defining the cutting point for the 50th percentile
    
    .         gen W = w if _n == 1  // Defining the cumulative sum of weights
    (175 missing values generated)
    
    .         replace W = w[_n] + W[_n-1] if _n > 1
    (175 real changes made)
    
    .         replace W = round(W,0.001)  // Cutting decimals to 3
    (176 real changes made)
    
    .         gen index = ( W > P )  // Index for finding "center" of weighted distribution
    
    .         replace index = index[_n] + index[_n-1] if _n > 1
    (87 real changes made)
    
    . * Calculating median 
    
    .         gen aux_median = ( y[_n-1] + y[_n] )/2 if index == 1 & W[_n-1] == P
    (175 missing values generated)
    
    .         replace aux_median = y if index == 1 & W[_n-1] != P
    (0 real changes made)
    
    .         replace aux_median = 0 if aux_median == .
    (175 real changes made)
    
    .         egen median = max(aux_median)
    
    .         di median
    338399.33
    
    . restore
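
    A possible way to test the decimals hypothesis without rounding is to repeat the manual calculation with the cumulative weights stored as double instead of the default float. The sketch below assumes the same variables y and w and the same rule as above (first observation whose cumulative weight exceeds P; average with the previous y only if the previous cumulative weight equals P). The names first and med are introduced only for illustration, and the sketch does not claim to reproduce what -summarize- does internally.

    ```stata
    * Sketch: manual weighted median with double-precision storage
    * (assumes y and w are in memory; first and med are illustrative names)
    preserve
        sort y
        scalar P = 0.5*_N                  // cutoff for the 50th percentile
        generate double W = sum(w)         // running sum of weights, stored as double
        format W %18.0g                    // show W at full precision

        * flag the first observation whose cumulative weight exceeds P
        generate byte first = (W > P) & (_n == 1 | W[_n-1] <= P)

        * inspect the cumulative weights on either side of the cutoff
        list y w W if first | first[_n+1] == 1, noobs

        * average the two neighbouring y values only if W[_n-1] equals P exactly
        generate double med = cond(W[_n-1] == P, (y[_n-1] + y)/2, y) if first
        summarize med, meanonly
        display %12.0g r(mean)
    restore
    ```

    If the listed value of W just below the cutoff differs from P only in the last few digits, that would suggest the two results differ because of how the equality W[_n-1] == P is evaluated, not because of the percentile rule itself.
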
    Thanks in advance for any help.

    Kind regards,
    David