Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Splitting histogram by quintiles

    Hi all,

    I would like to plot a mental health score distribution in a histogram, and show markers dividing the distribution into 5 quintiles. I tried to do this using the code below, but the vertical lines appear behind the histogram bars (as shown in the image). Is there anyway I can bring them forward, so that the division is clearly visible across the distribution?

    Code:
    histogram sf1_avg, xline(-.97453 .0019515 .6555958 1.196212) xtitle("Mental health score")

    Click image for larger version

Name:	histogram.jpg
Views:	1
Size:	41.4 KB
ID:	1760383



  • #2
    Here is some technique. You need variables holding quintiles, the vertical value at the base of each line (easy: it's zero), and the vertical value at the top of each line (which you can get from a first pass at the histogram: in your case 0.4 might work well).

    There is a review of added line technique in press at the Stata Journal, but that won't be out until around December. There are other ways to do it, but you only need one.

    Code:
    sysuse auto, clear
    histogram mpg, width(1)
    pctile quintile=mpg, nq(5)
    gen zero = 0
    gen top = 0.15
    histogram mpg, width(1) lcolor(stc1*2) fcolor(stc1*0.2) addplot(pcspike zero quintile top quintile) legend(off) note(quintiles added)

    Attached Files

    Comment


    • #3
      Thanks a lot Nick, this worked.

      Comment


      • #4
        Note that a quantile plot is an alternative choice.

        Code:
         
         quantile af1_avg, xla(0 "0" 1 "1", 0.2(0.2)0.8, format(%02.1f))

        Comment


        • #5
          Thanks, I tried this too but got an error msg saying "invalid format"

          Comment


          • #6
            Also, is there a way to name the different quintile regions? For eg, to name the area denoting the 1st quintile as "Poor mental health".

            Comment


            • #7
              #5 There was a typo in my code. Sorry about that. But you too should have noticed that two commas made no sense!

              #6 You can do this. But you will have a severe problem with space with labelling any bin but bin 1.

              As before, you need to work from the maximum density to know good places to put the text.

              Code:
              sysuse auto, clear 
              
              quantile price, xla(0 "0" 1 "1" 0.2(0.2)0.8, format(%02.1f))
              
              pctile quintiles=price, nq(5)
              
              su price, meanonly 
              local x1 = (r(min) + quintiles[1]) / 2 
              local x5 = (r(max) + quintiles[4]) / 2 
              
              forval j = 2/4 { 
                  local x`j' = (quintiles[`j'-1] + quintiles[`j']) / 2 
              }
              
              twoway histogram price, width(500) start(2000) fcolor(stc1*0.2)
              
              local y = 4.8e-4 
              gen y = `y'
              gen zero = 0 
              
              local opts ms(none) mlabc(black) mlabpos(0) mlabsize(large)
              
              twoway histogram price, width(500) start(2000) fcolor(stc1*0.2) ///
              || pcspike y quintiles zero quintiles     /// 
              || scatteri `y' `x1' "1", `opts' ///
              || scatteri `y' `x2' "2", `opts' ///
              || scatteri `y' `x3' "3", `opts' ///
              || scatteri `y' `x4' "4", `opts' ///
              || scatteri `y' `x5' "5", `opts' legend(off) xtitle(Price (USD)) ytitle(Density) yla(0(1e-4)4e-4)
              Click image for larger version

Name:	histogram_quintiles2.png
Views:	1
Size:	28.7 KB
ID:	1760630

              Comment


              • #8
                Conversely, you have more space on a quantile plot.

                Code:
                sysuse auto, clear
                
                gen zero = 0
                su price, meanonly
                gen top = r(max) * 1.1
                gen spike = 0.2 * _n in 1/4
                gen what = _n in 1/5
                gen text = (_n - 0.5) / 5 in 1/5
                
                quantile price, xla(0 "0" 1 "1" 0.2(0.2)0.8, format(%02.1f)) rlopts(lc(none)) aspect(1) ///
                addplot(pcspike top spike zero spike || scatter zero text, ms(none) mlabc(black) mlabsize(large) mla(what) mlabpos(12)) legend(off) yla(0(4000)16000)


                Your code would be more like

                Code:
                gen zero = -6
                gen top = 3  
                gen spike = 0.2 * _n in 1/4
                gen what = _n in 1/5
                gen text = (_n - 0.5) / 5 in 1/5
                
                quantile afl_avg, xla(0 "0" 1 "1" 0.2(0.2)0.8, format(%02.1f)) rlopts(lc(none)) aspect(1) ///
                addplot(pcspike top spike zero spike || scatter zero text, ms(none) mlabc(black) mlabsize(large) mla(what) mlabpos(12)) legend(off) yla(-6/3)
                except that you would want to edit the what variable to the text you want.

                Comment


                • #9
                  Hi Nick,

                  Thanks very much for this very helpful code. I am yet to try it and will get to it soon.

                  I have another related question with regard to plotting lines on an area graph as opposed to a histogram. As before when I just specify the xline values, the lines are hidden behind the graph. I tried adapting your code for the histogram in this case, generating bottom and top values and the specific years where the line appears, but I get an error message saying the option 'addplot' is not allowed. Would you know how this can be done?

                  This is the code I've used so far and the graph it produces:

                  Code:
                  twoway (area production year, fcolor(bluishgray) lcolor(bluishgray) lwidth(none)), ytitle(`"Vehicle sales"') ylabel(, labsize(small) nogrid) ymtick(, labsize(vsmall)) xtitle(, size(zero)) xscale(range(2009 2021)) xline(2013 2016, lcolor(navy)) xlabel(2009(1)2021, labsize(small) nogrid)
                  Click image for larger version

Name:	Graph.jpg
Views:	1
Size:	29.8 KB
ID:	1760981


                  What I would like to produce is something like this:

                  Click image for larger version

Name:	graph1.jpg
Views:	1
Size:	20.7 KB
ID:	1760982

                  Comment


                  • #10
                    Here is the data:
                    Code:
                    * Example generated by -dataex-. For more info, type help dataex
                    clear
                    input int year long production
                    2009 227283
                    2010 244007
                    2011 224193
                    2012 226502
                    2013 215926
                    2014 180311
                    2015 173009
                    2016 161294
                    2018   6371
                    2019   5606
                    2020   4730
                    2021   5391
                    end

                    Comment


                    • #11
                      Code:
                      * Example generated by -dataex-. For more info, type help dataex
                      clear
                      input int year long production
                      2009 227283
                      2010 244007
                      2011 224193
                      2012 226502
                      2013 215926
                      2014 180311
                      2015 173009
                      2016 161294
                      2018   6371
                      2019   5606
                      2020   4730
                      2021   5391
                      end
                      
                      twoway area production year, fcolor(bluishgray) lcolor(bluishgray) lwidth(none) ///
                      , ytitle(`"Vehicle sales"') ylabel(, labsize(small) nogrid) ymtick(, labsize(vsmall)) ///
                      xtitle("") xlabel(2009(1)2021, labsize(small) nogrid) ///
                      || spike production year if inlist(year, 2013,2016), lcolor(navy) legend(off)
                      See the help for addplot option

                      Commands that allow the addplot() option

                      graph commands never allow the addplot() option. The addplot() option is allowed by
                      commands outside graph that are implemented in terms of graph twoway.

                      For instance, the histogram command -- see [R] histogram -- allows addplot(). graph
                      twoway histogram -- see [G-2] graph twoway histogram -- does not.

                      Comment


                      • #12
                        Thank you so much

                        Comment


                        • #13
                          Hi Nick,

                          I have another follow-up question relating to the histogram produced in post #7. Is there anyway to restrict the markers differentiating between quintiles to the height of the histogram at each point, as done for the bar graph, without all of the red lines extending all the way up to the top?

                          Comment


                          • #14
                            That can be done. It is not especially easy unless I am missing something. See

                            SJ-5-2 gr0014 . . . . . . . Stata tip 20: Generating histogram bin variables
                            . . . . . . . . . . . . . . . . . . . . . . . . . . . . D. A. Harrison
                            Q2/05 SJ 5(2):280--281 (no commands)
                            tip illustrating the use of twoway__histogram_gen for
                            creation of complex histograms and other graphs or tables

                            for how to get the density in a variable. Then you need to interpolate. Because your densities are defined for midpoints, that is a little fiddly.

                            Code:
                            * change in another problem:
                            * dataset
                            * variable from price
                            * barwidth from 500
                            * half barwidth from 250
                            * place for labels from 4.5e-4
                            * ylabel from 0(1e-4)4e-4
                            
                            sysuse auto, clear
                            
                            pctile quintiles=price, nq(5)
                            
                            su price, meanonly
                            local x1 = (r(min) + quintiles[1]) / 2
                            local x5 = (r(max) + quintiles[4]) / 2
                            
                            forval j = 2/4 {
                                local x`j' = (quintiles[`j'-1] + quintiles[`j']) / 2
                            }
                            
                            twoway__histogram_gen price, width(500) start(2000) gen(h x)
                            
                            count if x < .
                            
                            local where = r(N)
                            
                            forval j = 1/4 {
                                local ++where
                                replace x = quintiles[`j'] in `where'
                            }
                            
                            sort x
                            
                            list x h if x < .
                            
                            gen H = h
                            replace H = cond(x - x[_n-1] <= 250, h[_n-1], h[_n+1]) if missing(H) & !missing(x[_n-1], x[_n+1])
                            
                            list x h H if x < .
                            
                            gen zero = 0
                            
                            local opts ms(none) mlabc(black) mlabpos(0) mlabsize(large)
                            
                            local y = 4.5e-4
                            
                            twoway bar h x, barw(500) fcolor(stc1*0.1) ///
                            || pcspike H x zero x if H < . & h == ., lw(medthick)   ///
                            || scatteri `y' `x1' "1", `opts' ///
                            || scatteri `y' `x2' "2", `opts' ///
                            || scatteri `y' `x3' "3", `opts' ///
                            || scatteri `y' `x4' "4", `opts' ///
                            || scatteri `y' `x5' "5", `opts' legend(off) xtitle(Price (USD)) ytitle(Density) yla(0(1e-4)4e-4)
                            Click image for larger version

Name:	morequintiles.png
Views:	1
Size:	27.0 KB
ID:	1761808

                            Last edited by Nick Cox; 17 Aug 2024, 03:13.

                            Comment


                            • #15
                              Again, I think this kind of thing works better on a quantile plot.

                              Clearly you need to use your dataset and your variable and customize yla() ytitle(). Otherwise the code is fairly generic for quintiles.

                              Alll that said, while quantile plots remain a great idea, quantile binning is in my view vastly over-rated, and such plots show some of the reasons why.

                              The quantiles as levels are arbitrary and not ever (or hardly ever) substantive or practical thresholds at which something changes otherwise.

                              Clinically, for example, what to think about a patient shouldn't depend on how many other patients are higher or lower. and the same kind of comment carries over to most other contexts.

                              It's also somewhere between likely and inevitable that minutely different individuals may end up in different bins and that at least one bin is highly heterogeneous in a way that obscures important variation.

                              More discussion (and references) at https://journals.sagepub.com/doi/pdf...867X1201200413 (Section 4)

                              https://journals.sagepub.com/doi/pdf...867X1201200413 (Sections 1 and 6)



                              Code:
                              sysuse auto, clear 
                              
                              egen rank = rank(price), unique 
                              egen count = count(price)
                              gen pp = (rank - 0.5) / count 
                              
                              su price, meanonly 
                              local base = r(min)
                              gen base = r(min) - 0.05 * (r(max) - r(min))
                              
                              pctile quintiles = price, nq(5)
                              gen where = 0.2 * _n in 1/4
                              gen where2 = 0.2 * _n - 0.1 in 1/5
                              gen text = _n in 1/5 
                              gen zero = 0 
                              
                              scatter price pp, xla(0 "0" 1 "1" 0.2(0.2)0.8, format(%02.1f)) yla(4000(2000)16000) aspect(0.9) ytitle(Price (USD)) ///
                              || spike quintiles where, base(`base') lc(stc2) ///
                              || spike where quintiles, horizontal lc(stc2) /// 
                              || scatter quintiles zero, ms(none) mla(quintiles) mlabpos(9) mlabsize(medium) mlabc(stc2) xsc(r(-0.13 .)) /// 
                              || scatter base where2, ms(none) mlabc(black) mlabsize(large) mla(text) mlabpos(12) legend(off) ///
                              xtitle(Fraction of data)
                              Click image for larger version

Name:	morequintiles2.png
Views:	1
Size:	38.1 KB
ID:	1761810

                              Comment

                              Working...
                              X