Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Mean of variable based on percentiles of another

    Hello, I am trying to create the mean of a specific variable based on the for each 10% percentile of another. The complication is that the data are frequencies so that each observation refers to one or more observations. Below is an example of the data.

    Specifically, I would like to first order "success_rate" from smallest to largest and then for each decile of the "frequency" data, create an average of "success_rate".

    Thank you!!
    frequency success_rate
    50 0.3523399
    50 0.0508585
    50 0.2308863
    58 0.2329031
    62 0.0651608
    64 0.3329133
    64 0.0353564
    65 0.0671273
    65 0.1377679
    66 0.1189656
    70 0.0730684
    71 0.0672357
    73 0.095927
    77 0.115903
    77 0.2428701
    81 0.332109
    83 0.2538869
    84 0.0639709
    85 0.1152154
    85 0.1305838
    87 0.0424351
    90 0.2949313
    91 0.3132716
    91 0.3804671
    91 0.1309527
    92 0.3211307
    93 0.059292
    93 0.062638


  • #2
    Code:
    clear
    input frequency    success_rate
    50    0.3523399
    50    0.0508585
    50    0.2308863
    58    0.2329031
    62    0.0651608
    64    0.3329133
    64    0.0353564
    65    0.0671273
    65    0.1377679
    66    0.1189656
    70    0.0730684
    71    0.0672357
    73    0.095927
    77    0.115903
    77    0.2428701
    81    0.332109
    83    0.2538869
    84    0.0639709
    85    0.1152154
    85    0.1305838
    87    0.0424351
    90    0.2949313
    91    0.3132716
    91    0.3804671
    91    0.1309527
    92    0.3211307
    93    0.059292
    93    0.062638
    end
    
    sort frequency
    xtile deciles = frequency, n(10) 
    tabstat success_rate , by(deciles)
    
    bys decile: egen avgsuccess = mean(success_rate)
    Eric A. Booth | Senior Director of Research | Far Harbor | Austin TX

    Comment


    • #3
      Thank you Eric!!

      I think that this code divides the dataset evenly into deciles based on the number of observations. Instead, I am trying to find the average of "success_rate" over deciles of the sum of the frequency (which might not correspond to evenly-sized decile bins).

      I have been playing around with the following code, which I think is getting a bit closer:

      sumdist success_rate [aw=frequency], n(10)

      Comment


      • #4
        If I understand this correctly you think that cumulative frequency is a correct basis for defining decile bins. That's an unusual basis without a third variable.

        Comment


        • #5
          Hi Nick, thanks for your message! I am trying to divide the observations above (sample of the data) into 10 evenly sized bins according to the "success_rate". So each bin should represent a decile and contain ten percent of the cumulative data in "frequency". And I am trying to do so by sorting it from lowest to highest success_rate. Below is an excerpt of the data, sorted by the success_rate. And per decile of the cumulative frequency data, I am trying to find an average of both "success_rate" and "availability". Thank you!!
          frequency success_rate availability
          257 0.0101429 0.0324626
          414 0.0132045 0.6706765
          637 0.0135301 0.4140095
          576 0.0137689 0.5567042
          1319 0.013963 0.3013454
          878 0.0154065 0.3464003
          1601 0.0155413 0.4790811
          369 0.0156932 0.55193
          422 0.0157005 0.4254054
          725 0.017094 0.2754081
          467 0.0173873 0.3159016
          427 0.0174372 0.5539747
          816 0.0178512 0.4543116
          1931 0.0184269 0.5617299
          696 0.018487 0.2455464
          187 0.0192474 0.3038814
          855 0.019413 0.6584018
          466 0.019919 0.5623539
          1460 0.0203009 0.4913832
          1012 0.0203631 0.6424753
          3066 0.0208644 0.4795209

          Comment


          • #6
            Thanks. I think I've got a better idea of what you want. Here's some code. As decile bins seem an arbitrary way to throw away some of the information in your data, here is another way to think about them too.

            On the arbitrariness of bins, see also Section 4 of http://www.stata-journal.com/sjpdf.h...iclenum=pr0054 which emphasises that binning in different directions would usually yield different choices.

            Spanning bars aren't documented except in

            SJ-15-1 gr0063 . . . . . Stata tip 122: Variable bar widths in two-way graphs
            . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . B. Jann
            Q1/15 SJ 15(1):316--318 (no commands)
            highlights the bartype(spanning) option of the twoway bar
            command, an undocumented feature that can be used to produce
            bars of different widths

            and in the material Ben references. But they aren't difficult so long as you remember that k bins require k + 1 bin boundaries to be specified.

            Code:
            clear 
            input frequency    success_rate    availability
            257    0.0101429    0.0324626
            414    0.0132045    0.6706765
            637    0.0135301    0.4140095
            576    0.0137689    0.5567042
            1319    0.013963    0.3013454
            878    0.0154065    0.3464003
            1601    0.0155413    0.4790811
            369    0.0156932    0.55193
            422    0.0157005    0.4254054
            725    0.017094    0.2754081
            467    0.0173873    0.3159016
            427    0.0174372    0.5539747
            816    0.0178512    0.4543116
            1931    0.0184269    0.5617299
            696    0.018487    0.2455464
            187    0.0192474    0.3038814
            855    0.019413    0.6584018
            466    0.019919    0.5623539
            1460    0.0203009    0.4913832
            1012    0.0203631    0.6424753
            3066    0.0208644    0.4795209
            end 
            
            su frequency, meanonly 
            gen cumprob = sum(frequency) / r(sum) 
            gen decile_bin = ceil(10 * cumprob/cumprob[_N]) 
            gen pp = cond(_n == 1, cumprob/2, (cumprob + cumprob[_n-1])/2)  
            egen wanted = mean(availability), by(decile_bin) 
            list, sepby(decile_bin) 
            
                 +----------------------------------------------------------------------------+
                 | freque~y   succes~e   availa~y    cumprob   decile~n         pp     wanted |
                 |----------------------------------------------------------------------------|
              1. |      257   .0101429   .0324626   .0138313          1   .0069157   .3723829 |
              2. |      414   .0132045   .6706765   .0361122          1   .0249717   .3723829 |
              3. |      637   .0135301   .4140095   .0703945          1   .0532533   .3723829 |
                 |----------------------------------------------------------------------------|
              4. |      576   .0137689   .5567042   .1013939          2   .0858942   .4290248 |
              5. |     1319    .013963   .3013454   .1723804          2   .1368871   .4290248 |
                 |----------------------------------------------------------------------------|
              6. |      878   .0154065   .3464003    .219633          3   .1960067   .3464003 |
                 |----------------------------------------------------------------------------|
              7. |     1601   .0155413   .4790811   .3057962          4   .2627146   .4329562 |
              8. |      369   .0156932     .55193   .3256553          4   .3157257   .4329562 |
              9. |      422   .0157005   .4254054   .3483666          4   .3370109   .4329562 |
             10. |      725    .017094   .2754081    .387385          4   .3678758   .4329562 |
                 |----------------------------------------------------------------------------|
             11. |      467   .0173873   .3159016   .4125182          5   .3999516    .441396 |
             12. |      427   .0174372   .5539747   .4354986          5   .4240084    .441396 |
             13. |      816   .0178512   .4543116   .4794145          5   .4574565    .441396 |
                 |----------------------------------------------------------------------------|
             14. |     1931   .0184269   .5617299   .5833378          6   .5313761   .5617299 |
                 |----------------------------------------------------------------------------|
             15. |      696    .018487   .2455464   .6207954          7   .6020666   .4026099 |
             16. |      187   .0192474   .3038814   .6308595          7   .6258274   .4026099 |
             17. |      855    .019413   .6584018   .6768742          7   .6538669   .4026099 |
                 |----------------------------------------------------------------------------|
             18. |      466    .019919   .5623539   .7019536          8   .6894139   .5268686 |
             19. |     1460   .0203009   .4913832   .7805285          8    .741241   .5268686 |
                 |----------------------------------------------------------------------------|
             20. |     1012   .0203631   .6424753   .8349927          9   .8077606   .6424753 |
                 |----------------------------------------------------------------------------|
             21. |     3066   .0208644   .4795209          1         10   .9174963   .4795209 |
                 +----------------------------------------------------------------------------+
            
            
            
            set obs `= _N + 1'
            replace cumprob = 0 in L 
            sort cumprob 
            twoway bar availability cumprob, bartype(spanning) xtitle(cumulative probability) scheme(s1color) bfcolor(none)

            Click image for larger version

Name:	spanning.png
Views:	1
Size:	27.6 KB
ID:	1429594

            Comment


            • #7
              Thank you so much, Nick!! This is really helpful!!

              Comment

              Working...
              X