Mean of variable based on percentiles of another

Kristina Mingos

Join Date: Feb 2018
Posts: 8

Mean of variable based on percentiles of another

10 Feb 2018, 10:43

Hello, I am trying to create the mean of a specific variable based on the for each 10% percentile of another. The complication is that the data are frequencies so that each observation refers to one or more observations. Below is an example of the data.

Specifically, I would like to first order "success_rate" from smallest to largest and then for each decile of the "frequency" data, create an average of "success_rate".

Thank you!!

frequency	success_rate
50	0.3523399
50	0.0508585
50	0.2308863
58	0.2329031
62	0.0651608
64	0.3329133
64	0.0353564
65	0.0671273
65	0.1377679
66	0.1189656
70	0.0730684
71	0.0672357
73	0.095927
77	0.115903
77	0.2428701
81	0.332109
83	0.2538869
84	0.0639709
85	0.1152154
85	0.1305838
87	0.0424351
90	0.2949313
91	0.3132716
91	0.3804671
91	0.1309527
92	0.3211307
93	0.059292
93	0.062638

Tags: None

eric_a_booth

Join Date: Apr 2014
Posts: 292

10 Feb 2018, 11:14

Code:

clear
input frequency    success_rate
50    0.3523399
50    0.0508585
50    0.2308863
58    0.2329031
62    0.0651608
64    0.3329133
64    0.0353564
65    0.0671273
65    0.1377679
66    0.1189656
70    0.0730684
71    0.0672357
73    0.095927
77    0.115903
77    0.2428701
81    0.332109
83    0.2538869
84    0.0639709
85    0.1152154
85    0.1305838
87    0.0424351
90    0.2949313
91    0.3132716
91    0.3804671
91    0.1309527
92    0.3211307
93    0.059292
93    0.062638
end

sort frequency
xtile deciles = frequency, n(10) 
tabstat success_rate , by(deciles)

bys decile: egen avgsuccess = mean(success_rate)

Eric A. Booth | Senior Director of Research | Far Harbor | Austin TX

Comment

Kristina Mingos

Join Date: Feb 2018

Posts: 8
#3

10 Feb 2018, 11:25

Thank you Eric!!

I think that this code divides the dataset evenly into deciles based on the number of observations. Instead, I am trying to find the average of "success_rate" over deciles of the sum of the frequency (which might not correspond to evenly-sized decile bins).

I have been playing around with the following code, which I think is getting a bit closer:

sumdist success_rate [aw=frequency], n(10)
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35724
#4

10 Feb 2018, 12:16

If I understand this correctly you think that cumulative frequency is a correct basis for defining decile bins. That's an unusual basis without a third variable.
Comment

Kristina Mingos

Join Date: Feb 2018
Posts: 8

10 Feb 2018, 12:27

Hi Nick, thanks for your message! I am trying to divide the observations above (sample of the data) into 10 evenly sized bins according to the "success_rate". So each bin should represent a decile and contain ten percent of the cumulative data in "frequency". And I am trying to do so by sorting it from lowest to highest success_rate. Below is an excerpt of the data, sorted by the success_rate. And per decile of the cumulative frequency data, I am trying to find an average of both "success_rate" and "availability". Thank you!!

frequency	success_rate	availability
257	0.0101429	0.0324626
414	0.0132045	0.6706765
637	0.0135301	0.4140095
576	0.0137689	0.5567042
1319	0.013963	0.3013454
878	0.0154065	0.3464003
1601	0.0155413	0.4790811
369	0.0156932	0.55193
422	0.0157005	0.4254054
725	0.017094	0.2754081
467	0.0173873	0.3159016
427	0.0174372	0.5539747
816	0.0178512	0.4543116
1931	0.0184269	0.5617299
696	0.018487	0.2455464
187	0.0192474	0.3038814
855	0.019413	0.6584018
466	0.019919	0.5623539
1460	0.0203009	0.4913832
1012	0.0203631	0.6424753
3066	0.0208644	0.4795209

Comment

Nick Cox

Join Date: Mar 2014
Posts: 35724

11 Feb 2018, 03:38

Thanks. I think I've got a better idea of what you want. Here's some code. As decile bins seem an arbitrary way to throw away some of the information in your data, here is another way to think about them too.

On the arbitrariness of bins, see also Section 4 of http://www.stata-journal.com/sjpdf.h...iclenum=pr0054 which emphasises that binning in different directions would usually yield different choices.

Spanning bars aren't documented except in

SJ-15-1 gr0063 . . . . . Stata tip 122: Variable bar widths in two-way graphs
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B. Jann
Q1/15 SJ 15(1):316--318 (no commands)
highlights the bartype(spanning) option of the twoway bar
command, an undocumented feature that can be used to produce
bars of different widths

and in the material Ben references. But they aren't difficult so long as you remember that k bins require k + 1 bin boundaries to be specified.

Code:

clear 
input frequency    success_rate    availability
257    0.0101429    0.0324626
414    0.0132045    0.6706765
637    0.0135301    0.4140095
576    0.0137689    0.5567042
1319    0.013963    0.3013454
878    0.0154065    0.3464003
1601    0.0155413    0.4790811
369    0.0156932    0.55193
422    0.0157005    0.4254054
725    0.017094    0.2754081
467    0.0173873    0.3159016
427    0.0174372    0.5539747
816    0.0178512    0.4543116
1931    0.0184269    0.5617299
696    0.018487    0.2455464
187    0.0192474    0.3038814
855    0.019413    0.6584018
466    0.019919    0.5623539
1460    0.0203009    0.4913832
1012    0.0203631    0.6424753
3066    0.0208644    0.4795209
end 

su frequency, meanonly 
gen cumprob = sum(frequency) / r(sum) 
gen decile_bin = ceil(10 * cumprob/cumprob[_N]) 
gen pp = cond(_n == 1, cumprob/2, (cumprob + cumprob[_n-1])/2)  
egen wanted = mean(availability), by(decile_bin) 
list, sepby(decile_bin) 

     +----------------------------------------------------------------------------+
     | freque~y   succes~e   availa~y    cumprob   decile~n         pp     wanted |
     |----------------------------------------------------------------------------|
  1. |      257   .0101429   .0324626   .0138313          1   .0069157   .3723829 |
  2. |      414   .0132045   .6706765   .0361122          1   .0249717   .3723829 |
  3. |      637   .0135301   .4140095   .0703945          1   .0532533   .3723829 |
     |----------------------------------------------------------------------------|
  4. |      576   .0137689   .5567042   .1013939          2   .0858942   .4290248 |
  5. |     1319    .013963   .3013454   .1723804          2   .1368871   .4290248 |
     |----------------------------------------------------------------------------|
  6. |      878   .0154065   .3464003    .219633          3   .1960067   .3464003 |
     |----------------------------------------------------------------------------|
  7. |     1601   .0155413   .4790811   .3057962          4   .2627146   .4329562 |
  8. |      369   .0156932     .55193   .3256553          4   .3157257   .4329562 |
  9. |      422   .0157005   .4254054   .3483666          4   .3370109   .4329562 |
 10. |      725    .017094   .2754081    .387385          4   .3678758   .4329562 |
     |----------------------------------------------------------------------------|
 11. |      467   .0173873   .3159016   .4125182          5   .3999516    .441396 |
 12. |      427   .0174372   .5539747   .4354986          5   .4240084    .441396 |
 13. |      816   .0178512   .4543116   .4794145          5   .4574565    .441396 |
     |----------------------------------------------------------------------------|
 14. |     1931   .0184269   .5617299   .5833378          6   .5313761   .5617299 |
     |----------------------------------------------------------------------------|
 15. |      696    .018487   .2455464   .6207954          7   .6020666   .4026099 |
 16. |      187   .0192474   .3038814   .6308595          7   .6258274   .4026099 |
 17. |      855    .019413   .6584018   .6768742          7   .6538669   .4026099 |
     |----------------------------------------------------------------------------|
 18. |      466    .019919   .5623539   .7019536          8   .6894139   .5268686 |
 19. |     1460   .0203009   .4913832   .7805285          8    .741241   .5268686 |
     |----------------------------------------------------------------------------|
 20. |     1012   .0203631   .6424753   .8349927          9   .8077606   .6424753 |
     |----------------------------------------------------------------------------|
 21. |     3066   .0208644   .4795209          1         10   .9174963   .4795209 |
     +----------------------------------------------------------------------------+



set obs `= _N + 1'
replace cumprob = 0 in L 
sort cumprob 
twoway bar availability cumprob, bartype(spanning) xtitle(cumulative probability) scheme(s1color) bfcolor(none)

Click image for larger version

Name: spanning.png
Views: 1
Size: 27.6 KB
ID: 1429594

Comment

Kristina Mingos

Join Date: Feb 2018

Posts: 8
#7

12 Feb 2018, 07:57

Thank you so much, Nick!! This is really helpful!!
Comment

Announcement

Mean of variable based on percentiles of another

Comment

Comment

Comment

Comment

Comment

Comment