
  • Box plot with extreme values

    Hi,

    I would like some advice on how to make my boxplot (below) better. I have extremely different values for each category and am not sure what the best options are for making the graph more presentable. Any help would be really appreciated! Thank you!
    [Image: Screen Shot 2018-08-20 at 9.38.06 PM.png]



  • #2
    Hi,

    Is there any way to split the box plot from above into two with different y-axes so that the food can be grouped? Thank you for any advice!



    • #3
      Plotting on the log scale should improve things without requiring a second graph. See this FAQ written by Nick Cox.
      Last edited by Steve Samuels; 20 Aug 2018, 16:53.
      Steve Samuels
      Statistical Consulting

      Stata 14.2



      • #4
        Steve's suggestion of logarithmic scales is the most crucial. However, if there are zeros you need something a little different such as log(amount + 1) or cube roots.

        In addition, alphabetical order is markedly inferior to ordering by, say, median, and horizontal alignment would allow longer, informative names. You can't prefer "Sorgh" to "Sorghum" or "Banan" to "Bananas" (guessing here).
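
        A minimal sketch of these suggestions, not a definitive recipe. It assumes the variable names kg_Ae and Crop_name from the data example posted later in #8; log_kg, log1p_kg and curt_kg are made-up names. The sort(1) suboption asks graph hbox to put the boxes in median order.

        Code:
        * Variable names are assumed from the data example in #8
        * log10() returns missing for zero or negative amounts
        gen log_kg = log10(kg_Ae)

        * If zeros are present, one of these might be used instead
        gen log1p_kg = log10(kg_Ae + 1)
        gen curt_kg = kg_Ae^(1/3)

        * Horizontal boxes, ordered by median rather than alphabetically
        graph hbox log_kg, over(Crop_name, sort(1))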



        • #5
          Thank you both for the helpful advice!



          • #6
            Hi,

            I have another question: what should I do with some of the outliers on the graph for Banana? There is an option to exclude all the outliers, but is this a good idea? Also, I am a bit confused about how I should label my y-axis now that the values are logged. Thank you for any comments and suggestions!

            [Image: Screen Shot 2018-08-21 at 9.01.17 AM.png]
            Last edited by Mangji Zo; 21 Aug 2018, 02:08.



            • #7
              These questions are driven by the use of a box plot default. graph hbox here follows the Tukey convention that points are shown individually if they lie outside the interval [lower quartile MINUS 1.5 iqr, upper quartile PLUS 1.5 iqr], but that's just a rule of thumb for identifying points that deserve scrutiny, and certainly not an objective criterion for identifying points that should be ignored (even graphically).

              How many observations do you have? Can you give a data example? I think you can get much better plots with other commands, although I did write some of them. (As before, alphabetical order is fairly dopey here.)
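
              To see which points the Tukey rule singles out, the fences can be computed by hand. This is only a sketch: it assumes the names kg_Ae and Crop_name and the truncated label "Banan" from the data example in #8, and the local macro names are arbitrary. If the plotted variable is logged, the same steps would apply to the logged variable, and the flagged points can differ.

              Code:
              * Quartiles for one crop (names assumed from the data example in #8)
              summarize kg_Ae if Crop_name == "Banan", detail
              local iqr = r(p75) - r(p25)
              local lo = r(p25) - 1.5 * `iqr'
              local hi = r(p75) + 1.5 * `iqr'

              * Points beyond the fences are the ones drawn individually
              count if Crop_name == "Banan" & !missing(kg_Ae) & (kg_Ae < `lo' | kg_Ae > `hi')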



              • #8
                Hi Nick,

                Thank you for your suggestions! I would definitely like to know how to make better plots than the one I made. I have around 5,600 observations. I made some changes to the dataset before graphing the box plot above, but here is an example from my original dataset:

                Thank you!

                Code:
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input str5 Crop_name float kg_Ae
                "Avoca"  417.1424
                "Avoca" 1679.4794
                "Avoca"  505.3988
                "Avoca" 1451.6776
                "Avoca"  713.9709
                "Avoca"  5198.389
                "Avoca"  517.6205
                "Avoca"  517.6205
                "Avoca"  644.0481
                "Avoca"  816.5007
                "Avoca" 1262.0365
                "Avoca"  923.5715
                "Avoca"  1959.895
                "Avoca"  972.9604
                "Avoca"  844.9393
                "Avoca"  1866.088
                "Avoca" 1537.5515
                "Avoca"   1381.85
                "Avoca"  718.1984
                "Avoca"  945.1615
                "Avoca"  713.9709
                "Avoca" 1692.4987
                "Avoca" 1212.9572
                "Avoca"  1461.394
                "Avoca"  417.1424
                "Avoca"  682.2885
                "Avoca"  977.3156
                "Avoca" 516.15204
                "Avoca"  1719.152
                "Avoca"  713.9709
                "Avoca" 1679.4794
                "Banan"  35361.12
                "Banan"   1492.85
                "Banan"  49807.89
                "Banan"  34671.15
                "Banan"  67691.29
                "Banan"  83618.65
                "Banan"  82168.63
                "Banan" 37408.348
                "Banan"  54884.83
                "Banan" 13283.044
                "Banan" 2228.5352
                "Banan" 36355.938
                "Banan"  51133.71
                "Banan"   57551.3
                "Banan"  77294.67
                "Banan"  48849.38
                "Banan"  46485.19
                "Banan"  83618.65
                "Banan"  41686.72
                "Banan"  152851.3
                "Banan"  8845.711
                "Banan"  56769.86
                "Banan"  54380.91
                "Banan" 2325.5735
                "Banan"  42256.75
                "Banan"  74424.98
                "Banan"  64909.46
                "Banan"  81979.07
                "Banan"  48849.38
                "Banan"  42056.72
                "Banan"  48849.38
                "Banan" 108512.75
                "Banan" 35449.305
                "Banan" 69887.766
                "Banan"  63347.46
                "Banan"  7361.397
                "Banan" 21642.506
                "Banan"  41637.88
                "Banan"  59229.88
                "Banan" 25165.707
                "Banan" 123610.18
                "Banan"  115570.5
                "Banan"  45070.29
                "Banan"  54884.83
                "Banan" 34253.426
                "Banan" 129228.83
                "Banan"  58163.55
                "Banan"   50587.8
                "Banan"  56544.04
                "Banan"  7560.261
                "Banan"  98716.47
                "Banan"  40042.74
                "Banan"  47478.86
                "Banan"  57785.25
                "Banan"  44984.72
                "Banan"  346711.5
                "Banan"  1609.176
                "Banan"  41323.17
                "Banan" 159721.03
                "Banan"  51994.04
                "Banan" 12259.728
                "Banan" 76672.984
                "Banan"  63917.14
                "Banan"  55441.39
                "Banan"  88623.26
                "Banan" 24972.123
                "Banan" 23023.943
                "Banan"   54800.2
                "Banan"  65087.78
                end
                ------------------ copy up to and including the previous line ------------------

                Listed 100 out of 5572 observations
                Use the count() option to list more




                • #9
                  Thanks for the example data. This is an example of what can be done using stripplot (SSC).

                  Code:
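                  * This only works for the data example
                  replace Crop_name = cond(Crop_name == "Avoca", "Avocados", "Bananas")

                  * Work on the log10 scale; the axis labels below show the original units
                  gen whatever = log10(kg_Ae)

                  * stripplot must be installed first: ssc install stripplot
                  * Stacked points with an offset box; whiskers run to the 5 and 95% points
                  stripplot whatever, over(Crop_name) vertical box(barw(0.15)) boffset(-0.2) pctile(5) ///
                  stack width(0.05) height(0.3) ms(sh) xtitle("") scheme(s1color) ///
                  yla(3 "1000" 4 "10000" 5 "100000", ang(h)) note(whiskers to 5 and 95% points) xla(, noticks)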
                  [Image: avocados.png]


                  Points to think about:

                  1. Show the raw data to the extent possible. (In practice, some binning will be needed here.)

                  2. I favour drawing box whiskers to particular percentiles (in essence, a practice long preceding box plot literature). One advantage of this is that (small print aside) transform of percentile = percentile of transform, so the problem raised in the FAQ cited in #3 does not arise. But always explain any such choice.

                  3. As before, don't use alphabetical order.

                  4. Do use better names.

                  5. The full dataset may oblige horizontal alignment and/or tinkering with width() and height() options.

                  6. Better title needed on vertical axis.
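
                  Point 2 can be checked directly. This sketch assumes the variables created in the code above (whatever is log10 of kg_Ae, and Crop_name has been recoded). The two displayed pairs should agree, apart from the small print: tiny differences can arise when a percentile is averaged between two observations or from storage precision.

                  Code:
                  * log10 of the 5 and 95% points of the raw values
                  _pctile kg_Ae if Crop_name == "Bananas", percentiles(5 95)
                  display log10(r(r1)) "  " log10(r(r2))

                  * 5 and 95% points of the log10 values
                  _pctile whatever if Crop_name == "Bananas", percentiles(5 95)
                  display r(r1) "  " r(r2)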



                  • #10
                    Thank you, Nick! Your suggestions are very helpful and I learned a lot!



                    • #11
                      Hi Nick,

                      Thank you for your help earlier. I have another question regarding box plots and stripplots (posts #6 and #9). Can you explain a bit more about the default values for whiskers for boxplots and stripplots? For stripplots, is the default always at 5 to 95%? If I want to change that, what would be the command to do so? Thank you!



                      • #12
                        The help for stripplot explains. You don't get a box plot and so you don't get any whiskers by default. All documented.
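
                        As a sketch only, and assuming pctile() behaves as it did in #9 (whiskers to the stated percentile and its complement), changing the whisker positions is a matter of changing that number. Here the call from #9 is repeated with 10 in place of 5; the help remains the place to confirm the details.

                        Code:
                        * Same call as in #9, but with whiskers to the 10 and 90% points
                        stripplot whatever, over(Crop_name) vertical box(barw(0.15)) boffset(-0.2) pctile(10) ///
                        stack width(0.05) height(0.3) ms(sh) xtitle("") scheme(s1color) ///
                        yla(3 "1000" 4 "10000" 5 "100000", ang(h)) note(whiskers to 10 and 90% points) xla(, noticks)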

