
  • Reduced Y-axis scale on boxplot

    Outliers in my data are sometimes so large that the actual box and whiskers get compressed to nearly a flat line (see attachment).
    It seems Stata allows scaling of the axis only to expand it to a range outside the data range. For instance, the code below has no effect:

    Code:
    graph box Wind05 if Wind05 != 1, yscale(range(0 5))
    Is there any other way to clip the y axis in such a way that the box plot would be more nicely visible?


  • #2
    That's correct. Invocation of yscale() or indeed xscale() that implies omission of data points will be ignored. With data like this you would probably benefit from working on a different scale anyway. If you posted the data, you would get positive suggestions.

    If you insist on graph box, watch out for the problem flagged at http://www.stata.com/support/faqs/gr...ithmic-scales/
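    One way of working on a different scale can be sketched as follows, taking a logarithmic scale as the example. This is a minimal sketch, not a tested recipe for these data: it assumes Wind05 is strictly positive, and lnWind05 is a hypothetical variable name that does not appear in the thread. Because the box plot is computed from the logged variable, quartiles, whiskers and outside values are all determined on the scale that is displayed.

    ```stata
    * Sketch only: lnWind05 is a hypothetical variable name.
    * ln() returns missing for zero or negative values, so any such
    * observations drop out of the plot.
    generate double lnWind05 = ln(Wind05) if Wind05 != 1

    * Box plot of the logged variable; the axis is in log units.
    graph box lnWind05, ytitle("log of year-on-year growth factor")
    ```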



    Last edited by Nick Cox; 07 May 2015, 10:09.



    • #3
      The variable is growth in wind power production, as a year-on-year growth factor, i.e., Y_t / Y_{t-1}.
      I didn't see any particular reason to use a log scale for this variable. When browsing the data, it appears only 20 out of 3700 observations take a value higher than 5.
      Still, I am reluctant to clip the data at this point (as with the syntax below), because that sort of defeats the purpose of making a box plot. I could use any value above the top whisker to censor data, then get a new value for the top whisker, and repeat. I could just as well censor data at 4, or at 3, or at π.


      Code:
      graph box Wind05 if Wind05 != 1 & Wind05 < 5

      (I am censoring values of 1, by the way, because no growth, including for countries with no wind power production at all, has been set at a value of one for another exercise)



      • #4
        I'd still recommend a transformation to see what is going on. With stripplot from SSC

        Code:
        stripplot Wind05 if Wind05 != 1, ysc(log) cumul box vertical pctile(5) centre yla(1 2 5 10 20 50 100, ang(h)) xsize(3)
        [Figure: wind.png]



        This flavour of the box plot draws whiskers out to the 5th and 95th percentiles, and a particular virtue of that is the commutative property log(quantile) = quantile(log) (saving some very fine print about linear interpolation), so the problem highlighted in the FAQ cited in #2 does not bite. Showing box plots with quantiles as well as quartiles, by the way, is far from novel and goes back at least to the 1930s.

        In addition, a quantile plot is superimposed.

        This combination was called a quantile-box plot by Emanuel Parzen within

        Parzen, E. 1979a. Nonparametric statistical data modeling. Journal, American Statistical Association 74: 105-121.

        Parzen, E. 1979b. A density-quantile function perspective on robust estimation. In Launer, R.L. and G.N. Wilkinson (Eds) Robustness in statistics. New York: Academic Press, 237-258.

        Parzen, E. 1982. Data modeling using quantile and density-quantile functions. In Tiago de Oliveira, J. and Epstein, B. (Eds) Some recent advances in statistics. London: Academic Press, 23-52.

        and quite possibly also in other papers. (I'd be very grateful for any other references.)

        Note, however, that JMP uses the term quantile-box plot in a different sense: if I recall correctly, for the definition of whiskers in terms of percentiles that is used above. The public version of the help file for stripplot (SSC) is some months behind the version on my machines.

        The crucial points to me are that

        1. There is no need for, and much loss from, truncating the data. You don't have outrageous outliers so much as a very skewed distribution.

        2. A transformed scale does help to see what is going on. As said, this is a very skewed distribution, even on a log scale.

        3. The conventional box plot doesn't do justice to the interesting detail in the distribution. It's a fairly uninformative graph.
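
        The property log(quantile) = quantile(log) mentioned earlier in this post can be checked directly. The following is a minimal sketch, assuming Wind05 is in memory as in the thread; logWind is a hypothetical variable name introduced only for illustration. The two displayed pairs should agree, apart from small differences when a percentile is averaged between two adjacent order statistics.

        ```stata
        * Sketch only: logWind is a hypothetical variable name.
        generate double logWind = ln(Wind05) if Wind05 != 1

        * 5th and 95th percentiles on the original scale, then logged.
        * The !missing() condition keeps both calculations on the same sample.
        _pctile Wind05 if Wind05 != 1 & !missing(logWind), percentiles(5 95)
        display "log(quantile): " ln(r(r1)) "  " ln(r(r2))

        * 5th and 95th percentiles of the logged values.
        _pctile logWind, percentiles(5 95)
        display "quantile(log): " r(r1) "  " r(r2)
        ```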



        • #5
          Incidental note: values of 1 have special meaning to you and you want to exclude them. That's your choice, but as some values are below 1, I would not describe that as censoring.



          • #6
            Thanks for the suggestions and notes. This gives me some more options for representing the distribution without truncating the data.
            The point of the original exercise, by the way, was to graphically represent a pattern of outliers in the variable Wind05. The likely extent (range of values) of year-on-year growth decreases with increasing wind power output in the previous year. I originally planned to create a row of side by side box plots of Wind05 over categories of some other variable.
            Although the higher values may not seem problematic outliers here, they were affecting regression results. They are also purely the result of one big wind park being built where there previously were very few turbines installed, and don't reflect much about longer term growth patterns (e.g., a 150-fold multiplication of output is perfectly unlikely to be repeated).

            What would be a proper term for excluding specific values within the extreme ranges of the data, by the way? Just excluding, or filtering?



            • #7
              Working backwards, it's my understanding that observations with value 1 don't belong in this exercise. I don't think you need a special term for excluding them. The term "filtering" to me usually implies some operation on a time or similar series.

              I'd suggest that an outlier is an outlier whichever way you look at it and that most outliers that are not obvious mistakes or alien visitors are apparent rather than real. I find that flipping to logarithmic scale is the best single way to think about outliers and in fact to make them seem just like values that happen to be much bigger (or on occasion much smaller).

              Only you have the full data but if the response distribution is like this it's hard for me to imagine that linear regression without working on a transformed scale makes much sense. You don't need to transform as you can use a generalized linear model with appropriate link. With new/old as a response it's immediate that log(new/old) = log new - log old and is symmetric around new = old.

              Alternatively you need a different response variable.
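
              The modelling choices discussed in this post can be sketched as follows. This is only an illustration under stated assumptions: prev_output (wind power output in the previous year) and lngrowth are hypothetical names not found in the thread, and the choice of predictor and of the default Gaussian family is an assumption, not a recommendation for these data.

              ```stata
              * Sketch only: prev_output and lngrowth are hypothetical names.
              * Generalized linear model with log link; the response stays
              * untransformed. With no family() option, glm uses Gaussian.
              glm Wind05 prev_output if Wind05 != 1, link(log)

              * For comparison, the transformed response: ln(new/old) equals
              * ln(new) - ln(old), which is zero when new = old.
              generate double lngrowth = ln(Wind05) if Wind05 != 1
              regress lngrowth prev_output
              ```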

