Finding decile of a group's median income in the whole distribution

Usha Adelina

Join Date: Apr 2016

Posts: 15
#1

04 Apr 2016, 05:30

Hello, I have a question. Suppose I have a dataset as follows,

----------------------- copy starting from the next line -----------------------

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int(villageid expenditure) float medianexp byte decileexpend
3101 500   585  3
3101 600   585  5
3101 570   585  4
3101 960   585 10
3102 850 542.5  8
3102 400 542.5  2
3102 685 542.5  6
3102 375 542.5  1
3103 900   695  9
3103 100   695  1
3103 620   695  6
3103 770   695  7
end

------------------ copy up to and including the previous line ------------------

Here medianexp is the median expenditure for each village (identified by villageid) and decileexpend is the decile of each expenditure value within the whole distribution (villageid 3101, 3102 and 3103 together). My question: is there a way to find out which decile each village's median expenditure belongs to, based on the whole expenditure distribution? Thank you very much.

Last edited by Usha Adelina; 04 Apr 2016, 05:36.
Tags: decile, median, percentile
Nick Cox

Join Date: Mar 2014

Posts: 35724
#2

04 Apr 2016, 05:42

Depending on how one was raised, decile here is either a natural or at least conventional term or an abuse of terminology.

Classically, the deciles are the 9 points in a distribution corresponding to cumulative probabilities 0.1(0.1)0.9 (10(10)90% if you prefer).

But it seems clear that for you decile means decile-based bin or interval, although the more exact terminology is awkward.
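To make the distinction concrete, here is a minimal sketch on the example data from #1. It assumes the official commands _pctile and xtile with nq(10); the variable name decilebin is made up for illustration. _pctile leaves the 9 deciles proper in r(r1) to r(r9), while xtile assigns each observation to a decile-based bin 1 to 10.

```stata
* Sketch: deciles as 9 cut points versus decile-based bins 1..10
clear
input int(villageid expenditure) float medianexp byte decileexpend
3101 500   585  3
3101 600   585  5
3101 570   585  4
3101 960   585 10
3102 850 542.5  8
3102 400 542.5  2
3102 685 542.5  6
3102 375 542.5  1
3103 900   695  9
3103 100   695  1
3103 620   695  6
3103 770   695  7
end

* The 9 deciles proper: cut points returned as r(r1), ..., r(r9)
_pctile expenditure, nq(10)
return list

* Decile-based bins: one label 1..10 per observation
* (decilebin is a hypothetical name chosen for this sketch)
xtile decilebin = expenditure, nq(10)
list villageid expenditure decileexpend decilebin, sepby(villageid)
```

On these data the bins from xtile should agree with the decileexpend variable already in the example.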

That said, if your unit is villages, not individuals, then you can apply http://www.stata.com/support/faqs/st...ing-positions/
roughly as follows:

Code:

egen villagetag = tag(villageid)
count if villagetag
local N = r(N)
egen rank = rank(medianexp) if villagetag
gen prank = (rank - 0.5)/`N'
bysort villageid (prank) : replace prank = prank[1]

This produces a percentile rank as a fraction. Bins 1(1)10 would be produced from that by rounding (and thus discarding information). The calculation is different if you want to weight each village by the number of individuals.
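The rounding step can be sketched as follows, on the example data from #1. This is an illustration only: it assumes ceil(10 * prank) is the rounding wanted, stores prank as a double to reduce precision surprises at bin boundaries, and uses a made-up variable name medianbin. Note that it ranks the village medians among themselves, with each village counted once.

```stata
* Sketch: percentile rank of each village median, then bins 1..10
clear
input int(villageid expenditure) float medianexp byte decileexpend
3101 500   585  3
3101 600   585  5
3101 570   585  4
3101 960   585 10
3102 850 542.5  8
3102 400 542.5  2
3102 685 542.5  6
3102 375 542.5  1
3103 900   695  9
3103 100   695  1
3103 620   695  6
3103 770   695  7
end

* One tagged observation per village, so each village counts once
egen villagetag = tag(villageid)
count if villagetag
local N = r(N)

* Plotting position (rank - 0.5)/N among the village medians
egen rank = rank(medianexp) if villagetag
gen double prank = (rank - 0.5)/`N'

* Spread the value to every observation in the village
* (missing values sort last, so prank[1] is the nonmissing one)
bysort villageid (prank) : replace prank = prank[1]

* Round up to bins 1..10 (medianbin is a hypothetical name)
gen byte medianbin = ceil(10 * prank)
list villageid medianexp prank medianbin, sepby(villageid)
```

With only three villages the fractions are 1/6, 1/2 and 5/6, so the bins come out as 2, 5 and 9 for villages 3102, 3101 and 3103.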
Usha Adelina

Join Date: Apr 2016

Posts: 15
#3

06 Apr 2016, 09:19

Dear Mr. Cox, thank you very much for your answer and suggestion. However, I think I haven't asked precisely and clearly and therefore I haven't found the answer yet.

My question is: is there a way to find out which decile the median expenditure of each village (medianexp) belongs to, based on the distribution of expenditure (expenditure)? Where does the median expenditure of each village stand in the whole distribution of individual expenditure, not just in the distribution of median expenditure?

Once again, thank you very much and I look forward to your reply.
Nick Cox

Join Date: Mar 2014

Posts: 35724
#4

06 Apr 2016, 09:24

Look at cquantile (SSC), or look inside its code. You may need to restructure your data. Watch out for double counting.

Nick Cox

Join Date: Mar 2014
Posts: 35724

06 Apr 2016, 10:42

I remember now that some work I did gives one of several direct routes to what I think you want.

See http://www.stata-journal.com/sjpdf.h...iclenum=pr0054 -- especially Section 4, which has more general comments on binning.

Here I use your example data (thanks), but because it's small, I use quintiles. The principles for deciles are naturally similar. In fact the Stata Journal paper has a decile example.

Code:

 
clear
input int(villageid expenditure) float medianexp byte decileexpend
3101 500   585  3
3101 600   585  5
3101 570   585  4
3101 960   585 10
3102 850 542.5  8
3102 400 542.5  2
3102 685 542.5  6
3102 375 542.5  1
3103 900   695  9
3103 100   695  1
3103 620   695  6
3103 770   695  7
end

_pctile expenditure, nq(5) 
matrix quintile = r(r1), r(r2), r(r3), r(r4) 
generate q_village = 5 if medianexp < .
quietly forvalues i = 4(-1)1 {
    replace q_village = `i' if medianexp <= quintile[1, `i']
}

list, sepby(villageid) 

     +------------------------------------------------------+
     | villag~d   expend~e   median~p   decile~d   q_vill~e |
     |------------------------------------------------------|
  1. |     3101        500        585          3          3 |
  2. |     3101        600        585          5          3 |
  3. |     3101        570        585          4          3 |
  4. |     3101        960        585         10          3 |
     |------------------------------------------------------|
  5. |     3102        850      542.5          8          2 |
  6. |     3102        400      542.5          2          2 |
  7. |     3102        685      542.5          6          2 |
  8. |     3102        375      542.5          1          2 |
     |------------------------------------------------------|
  9. |     3103        900        695          9          4 |
 10. |     3103        100        695          1          4 |
 11. |     3103        620        695          6          4 |
 12. |     3103        770        695          7          4 |
     +------------------------------------------------------+

Note that the village medians won't typically fall into 5 equal groups, even approximately, but will tend to bunch around the middle groups as determined by the individuals' distribution.
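The same steps carry over to deciles, which is what the original question asked about. A minimal sketch on the example data, assuming a row vector named decile and a result variable named d_village (both names chosen here for illustration), with the vector filled in a loop rather than typed out:

```stata
* Sketch: decile-based bin of each village median within the
* distribution of individual expenditure
clear
input int(villageid expenditure) float medianexp byte decileexpend
3101 500   585  3
3101 600   585  5
3101 570   585  4
3101 960   585 10
3102 850 542.5  8
3102 400 542.5  2
3102 685 542.5  6
3102 375 542.5  1
3103 900   695  9
3103 100   695  1
3103 620   695  6
3103 770   695  7
end

* 9 cut points from the individual-level distribution
_pctile expenditure, nq(10)
matrix decile = J(1, 9, .)
forvalues i = 1/9 {
    matrix decile[1, `i'] = r(r`i')
}

* Start in the top bin, then work downwards through the cut points
generate d_village = 10 if medianexp < .
quietly forvalues i = 9(-1)1 {
    replace d_village = `i' if medianexp <= decile[1, `i']
}

list, sepby(villageid)
```

On the example data this puts the medians of villages 3101, 3102 and 3103 in bins 5, 4 and 7 respectively.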


Usha Adelina

Join Date: Apr 2016
Posts: 15

06 Apr 2016, 22:22

Dear Mr. Cox, the last suggestion gave results which are exactly how I expected it to be. Thank you very, very much!
However, my actual dataset is bigger than the example and I aim to find the percentile (whereas the question asked about deciles) of each village's median expenditure. Is it okay to tweak the commands a little and enter them as:

Code:

_pctile expenditure, nq(100) 
matrix quintile = r(r1), r(r2), r(r3), r(r4), r(r5), r(r6), r(r7), r(r8), r(r9), r(r10), r(r11), r(r12), r(r13), r(r14), r(r15), r(r16), r(r17), r(r18), r(r19), r(r20), r(r21), r(r22), r(r23), r(r24), r(r25), r(r26), r(r27), r(r28), r(r29), r(r30), r(r31), r(r32), r(r33), r(r34), r(r35), r(r36), r(r37), r(r38), r(r39), r(r40), r(r41), r(r42), r(r43), r(r44), r(r45), r(r46), r(r47), r(r48), r(r49), r(r50), r(r51), r(r52), r(r53), r(r54), r(r55), r(r56), r(r57), r(r58), r(r59), r(r60), r(r61), r(r62), r(r63), r(r64), r(r65), r(r66), r(r67), r(r68), r(r69), r(r70), r(r71), r(r72), r(r73), r(r74), r(r75), r(r76), r(r77), r(r78), r(r79), r(r80), r(r81), r(r82), r(r83), r(r84), r(r85), r(r86), r(r87), r(r88), r(r89), r(r90), r(r91), r(r92), r(r93), r(r94), r(r95), r(r96), r(r97), r(r98), r(r99)
generate q_village = 100 if medianexp < .
quietly forvalues i = 99(-1)1 {
     replace q_village = `i' if medianexp <= quintile[1, `i']
 
}

I find the results to be correct, I just want to make sure that I am doing the right steps. Once again, thank you very much for your help.
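A cheap safeguard when a vector of cut points is typed out by hand is to check its dimension, and that the cut points never decrease, before using it. A minimal sketch on the example data with quintiles; for percentiles the same check would compare colsof() with 99 and loop to 99:

```stata
* Sketch: sanity checks on a hand-typed row vector of cut points
clear
input int(villageid expenditure) float medianexp byte decileexpend
3101 500   585  3
3101 600   585  5
3101 570   585  4
3101 960   585 10
3102 850 542.5  8
3102 400 542.5  2
3102 685 542.5  6
3102 375 542.5  1
3103 900   695  9
3103 100   695  1
3103 620   695  6
3103 770   695  7
end

_pctile expenditure, nq(5)
matrix quintile = r(r1), r(r2), r(r3), r(r4)

* nq(5) should give exactly 4 cut points
assert colsof(quintile) == 4

* Cut points should be in nondecreasing order
forvalues i = 2/4 {
    assert quintile[1, `i'] >= quintile[1, `i' - 1]
}
```

The dimension check catches a repeated or omitted entry; the ordering check alone would not catch a repeat.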


Nick Cox

Join Date: Mar 2014

Posts: 35724
#7

07 Apr 2016, 02:25

Looks OK.

By the way, the matrix is not essential, but the paper cited was about uses of matrices. A matrix will stick around in a session, but results like r(r1) are likely to be overwritten.
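A small sketch of that difference on the example data, assuming nothing beyond the official _pctile, matrix and summarize commands: once another r-class command runs, r(r1) is gone, while the matrix copy is still there.

```stata
* Sketch: r() results are transient, a matrix copy persists
clear
input int(villageid expenditure) float medianexp byte decileexpend
3101 500   585  3
3101 600   585  5
3101 570   585  4
3101 960   585 10
3102 850 542.5  8
3102 400 542.5  2
3102 685 542.5  6
3102 375 542.5  1
3103 900   695  9
3103 100   695  1
3103 620   695  6
3103 770   695  7
end

_pctile expenditure, nq(5)
matrix quintile = r(r1), r(r2), r(r3), r(r4)

* Immediately after _pctile the first quintile is available in r()
display r(r1)

* Any other r-class command replaces the contents of r()
quietly summarize expenditure

* r(r1) no longer exists and evaluates to missing
display r(r1)

* The matrix copy is untouched
display quintile[1, 1]
```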

In practice, I would make sure that you use informative variable names. Also, the long statement can be made into a loop. I would use a column vector for that many percentiles.

Code not tested!

Code:

_pctile expenditure, nq(100)
matrix pctile = J(99, 1, .)
forval i = 1/99 {
    mat pctile[`i', 1] = r(r`i')
}
mat li pctile
generate pct_village = 100 if medianexp < .
quietly forvalues i = 99(-1)1 {
    replace pct_village = `i' if medianexp <= pctile[`i', 1]
}
Usha Adelina

Join Date: Apr 2016

Posts: 15
#8

08 Apr 2016, 10:12

Dear Mr. Cox, thank you very much for the suggestion. I tested the last code and the results were better than those from the code I asked about (judged by comparing the results to the centiles of the expenditure distribution). Thank you very much for your help. Best regards from Indonesia.