  • creating a box whiskers plot with a dot representing the mean but without outliers

    Dear Statalisters, thanks in advance for your help. I am aware that this topic has already been discussed in the past (and UCLA has a page about it), but I am trying to modify the code to exclude the outliers, so far without success.

    The code provided by UCLA is as follows:

    use http://www.ats.ucla.edu/stat/data/hsb2, clear
    sort prog

    by prog: egen med = median(read)
    by prog: egen lqt = pctile(read), p(25)
    by prog: egen uqt = pctile(read), p(75)
    by prog: egen iqr = iqr(read)
    by prog: egen mean = mean(read)


    twoway rbar lqt med prog, fcolor(gs12) lcolor(black) barw(.5) || ///
    rbar med uqt prog, fcolor(gs12) lcolor(black) barw(.5) || ///
    rspike lqt ls prog, lcolor(black) || ///
    rspike uqt us prog, lcolor(black) || ///
    rcap ls ls prog, msize(*6) lcolor(black) || ///
    rcap us us prog, msize(*6) pstyle(p1) || ///
    scatter outliers prog, mcolor(black) || ///
    scatter mean prog, msymbol(Oh) msize(*2) fcolor(gs12) mcolor(black) ///
    legend(off) xlabel( 1 "general" 2 "academic" 3 "vocational") ///
    ytitle(reading score) graphregion(fcolor(gs15))





    If I substitute my own variables it works just fine, but all my attempts to exclude the outliers have failed. Could you please help me with this? Many thanks in advance, Raffaele
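
One possible approach, sketched under assumptions: in the code above the outliers are drawn only by the `scatter outliers prog` layer, so leaving that layer out removes them from the graph while the whiskers still stop at the most extreme values within 1.5 IQR of the quartiles. The quoted lines do not show how `ls`, `us` and `outliers` are created, so the definitions of `ls` and `us` below are an assumption. The sketch uses the `auto` data shipped with Stata (`price` by `foreign`) rather than the UCLA file so that it is self-contained.

```stata
sysuse auto, clear

* quartiles, IQR and mean of price within each group
egen med  = median(price), by(foreign)
egen lqt  = pctile(price), p(25) by(foreign)
egen uqt  = pctile(price), p(75) by(foreign)
egen iqr  = iqr(price), by(foreign)
egen mean = mean(price), by(foreign)

* whisker ends (assumed definition): most extreme observed values
* that lie within 1.5 IQR of the nearer quartile
egen ls = min(cond(price >= lqt - 1.5*iqr, price, .)), by(foreign)
egen us = max(cond(price <= uqt + 1.5*iqr, price, .)), by(foreign)

* same layers as the quoted graph, minus the "scatter outliers" layer
twoway rbar lqt med foreign, fcolor(gs12) lcolor(black) barw(.5) || ///
       rbar med uqt foreign, fcolor(gs12) lcolor(black) barw(.5) || ///
       rspike lqt ls foreign, lcolor(black) || ///
       rspike uqt us foreign, lcolor(black) || ///
       rcap ls ls foreign, msize(*6) lcolor(black) || ///
       rcap us us foreign, msize(*6) lcolor(black) || ///
       scatter mean foreign, msymbol(Oh) msize(*2) mcolor(black) ///
       legend(off) xlabel(0 "Domestic" 1 "Foreign") ///
       xscale(range(-.5 1.5)) ytitle("Price (USD)")
```

With other data, replace `price` and `foreign` with the outcome and grouping variables and adjust `xlabel()` to the group codes. Points beyond the whiskers are then simply not shown, which is worth stating in any caption.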

  • #2
    Hello Raffaele,

    Maybe I didn't understand your query, and perhaps you will get further help from others.

    That said, a box plot is supposed to show the median (not the mean).

    Furthermore, you didn't present the Stata command for a box plot, just scatter plots and range plots.

    With regard to the outliers, they are part and parcel of your data. I don't see the point of deleting them. They provide "food for thought" for the rationale of your statistical analysis.

    Best,

    Marcos


    • #3
      Hi Marcos, thanks for your reply. I am aware that outliers should not be omitted from a box plot, but I want to plot income distribution by region and, as often happens, the data are considerably skewed.
      I know that my command is not the usual one for a box plot, but it provides a different way to draw one when you would like to include the mean as well.

      I can include the source:
      http://www.ats.ucla.edu/stat/stata/code/twboxplot.htm

      As I said before, I tried to modify the code to get rid of the outliers but was not successful, while (of course) the original version of the code works perfectly with my data too.

      Thanks,

      Raffaele



      • #4
        You are asking for comments on why code you don't show us doesn't do what you want.

        I am with Marcos on feeling queasy at leaving out outliers from box plots, especially if you also show whiskers.

        But if you plot incomes, or indeed any other positive variable, on log scale then you will pull in your outliers. Further, in principle log(median(varname)) = median(log(varname)) and similarly for quartiles, setting aside the small print that the last step in each case is often an average of two values. Hence using a box meshes well with using log scale. You then have a choice between showing arithmetic means on log scale and showing geometric means, or both.
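
The way logs and quantiles commute can be checked directly. This is only an illustrative sketch using the `auto` data; the variable and scalar names (`logprice`, `med_raw`, `med_log` and so on) are made up for the example. With 73 observations the median is a single observed value, so the two results should agree. With all 74 the median averages two values and the results differ slightly, which is the small print mentioned above.

```stata
sysuse auto, clear
gen double logprice = ln(price)

* odd number of observations: the median is one observed value
quietly summarize price in 1/73, detail
scalar med_raw = r(p50)
quietly summarize logprice in 1/73, detail
scalar med_log = r(p50)
display "log of median:  " ln(med_raw)
display "median of logs: " med_log

* even number of observations: the median averages two values
quietly summarize price, detail
scalar med_raw_all = r(p50)
quietly summarize logprice, detail
scalar med_log_all = r(p50)
display "log of median:  " ln(med_raw_all)
display "median of logs: " med_log_all
```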

        See also http://www.stata.com/support/faqs/gr...les/index.html for a warning about whiskers extending up to 1.5 IQR from each quartile and log scale.
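
As I read that warning, the point is that the 1.5 IQR rule gives different whisker limits depending on whether it is applied to the raw values or to their logarithms. A small sketch with the `auto` data (the scalar names `fence_raw` and `fence_log` are made up for the example) shows the size of the difference for the upper limit:

```stata
sysuse auto, clear
gen double logprice = ln(price)

* upper limit from the 1.5 IQR rule applied to the raw values
quietly summarize price, detail
scalar fence_raw = r(p75) + 1.5*(r(p75) - r(p25))

* the same rule applied to the logs, then back-transformed
quietly summarize logprice, detail
scalar fence_log = exp(r(p75) + 1.5*(r(p75) - r(p25)))

display "upper limit, rule applied on raw scale: " fence_raw
display "upper limit, rule applied on log scale: " fence_log
```

Values that count as outside the whiskers on one scale may therefore fall inside them on the other.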

        More generally, see

        http://www.stata-journal.com/article...article=gr0039

        http://www.stata-journal.com/article...ticle=gr0039_1

        http://www.stata-journal.com/article...article=gr0045

        All that said, see also stripplot from SSC. Here I use logarithmic scale and a quantile-box plot as suggested by Parzen and add geometric means too. References for the quantile-box plot are in the help for stripplot.

        Code:
        sysuse auto, clear
        set scheme s1color
        egen gmean = gmean(price), by(foreign)
        stripplot price , box center vertical cumul cumprob over(foreign) ysc(log) ///
        addplot(scatter gmean foreign, ms(Dh) msize(*2)) xla(, noticks) yla(, ang(h)) ytitle(Price (USD))
        [Image: qboxplot.png]
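
A note for anyone running the code above: as far as I know, `gmean()` is not an official `egen` function but comes from the `egenmore` package on SSC, and `stripplot` is also from SSC. The sketch below shows the installation commands as comments, plus an alternative that builds the group geometric means as the exponential of the mean log using only official commands. The names `meanlog` and `gmean2` are made up for the example.

```stata
* community-contributed packages used in the example above
* (uncomment and run once if they are not yet installed)
* ssc install stripplot
* ssc install egenmore

sysuse auto, clear

* geometric mean = exp(mean of logs), computed within each group
egen double meanlog = mean(ln(price)), by(foreign)
gen double gmean2 = exp(meanlog)

* one geometric mean per group
tabulate foreign, summarize(gmean2) means
```

The variable `gmean2` can then be used in place of `gmean` in the `addplot()` option.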


        • #5
          Thank you, that was extremely helpful!



          • #6
            You can also show the geometric mean using a reference line, without needing to create a new variable. In the example given that works better in my view.

            Code:
            sysuse auto, clear
            set scheme s1color
            stripplot price , box center vertical cumul cumprob over(foreign) ysc(log) ///
            refline reflevel(gmean) xla(, noticks) yla(, ang(h)) ytitle(Price (USD))
            [Image: qboxplot2.png]
