  • Box plots

    Hi all,

    I am stuck trying to draw a proper box plot in Stata. The command
    Code:
    graph box avsales_new_no_outliers lagged_tot_sales
    produces this plot: boxplot_avales_new_lagged_totsales.pdf

    Of course it is not informative. The summary statistics of the two variables look as follows:

    Code:
    . sum avsales_new_no_outliers lagged_tot_sales
    
        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
    avsales_ne~s |      2890     9118740    3.55e+07   11.22149   7.46e+08
    lagged_tot~s |      9680    4.19e+08    2.40e+09    1.99628   5.03e+10
    and the detailed summaries:
    Code:
                       avsales_new_no_outliers
    -------------------------------------------------------------
          Percentiles      Smallest
     1%     44.37548       11.22149
     5%     292.2828       12.29292
    10%     1510.966       13.19586       Obs                2890
    25%        23579       16.25492       Sum of Wgt.        2890
    
    50%     451135.3                      Mean            9118740
                            Largest       Std. Dev.      3.55e+07
    75%      3227006       4.25e+08
    90%     1.88e+07       4.34e+08       Variance       1.26e+15
    95%     4.50e+07       5.36e+08       Skewness       9.474786
    99%     1.49e+08       7.46e+08       Kurtosis       128.6959
    
                          lagged_tot_sales
    -------------------------------------------------------------
          Percentiles      Smallest
     1%      41.2445        1.99628
     5%     639.6671       2.220227
    10%     4347.424       2.519224       Obs                9680
    25%     72393.32       3.017936       Sum of Wgt.        9680
    
    50%      1134182                      Mean           4.19e+08
                            Largest       Std. Dev.      2.40e+09
    75%     1.33e+07       3.88e+10
    90%     1.89e+08       4.30e+10       Variance       5.76e+18
    95%     1.08e+09       4.50e+10       Skewness       8.670407
    99%     1.48e+10       5.03e+10       Kurtosis       98.53456
    I am now trying logarithms, but I do not know whether a box plot of the logged values is the right thing to do. The problem seems to be the high variability of lagged_tot_sales, together with the fact that it has many more observations than avsales_new_no_outliers.
    Do you have any idea what is happening and what I should do?

    Many thanks,

    Federico

  • #2
    You are asked to post example graphs using files with .png extensions. Please do read and act on 12.4 in https://www.statalist.org/forums/help#stata
    Then everybody can just see the graphs without needing to click on the links.

    I can't tell you anything about those variables. If you are comparing variables that aren't fairly compared, then don't do that.

    Otherwise, box plots on logarithmic scales are possible but there is a pitfall documented in an FAQ

    FAQ . . . . . . . . . . . . . . . . . . . Box plots and logarithmic scales
    . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . N. J. Cox
    9/05 How can I best get box plots on logarithmic scales?

    https://www.stata.com/support/faqs/g...ithmic-scales/
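
    As a minimal sketch of one way around that pitfall, assuming both variables are strictly positive (their minima in the summaries above are), you could take logarithms first, draw the box plot of the logged values, and label the axis in the original units. The new variable names and the choice of labelled values are only illustrative. The quartiles and the 1.5 IQR rule are then worked out on the log scale rather than on the raw scale.

    Code:
    * illustrative names; base 10 logs so that labels fall at powers of 10
    generate double log10_avsales = log10(avsales_new_no_outliers)
    generate double log10_lagged = log10(lagged_tot_sales)
    * axis positions are log10 values; the text shown is in original units
    graph box log10_avsales log10_lagged, ylabel(0 "1" 3 "1000" 6 "1e+06" 9 "1e+09", angle(horizontal)) ytitle("Sales (logarithmic scale)")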

    I find the rule of thumb of plotting values individually if and only if they fall outside [lower quartile - 1.5 IQR, upper quartile + 1.5 IQR] awkward to use and to explain. I would rather just see the data. I tend to use stripplot (SSC), with which you can combine different displays. Here, for example, I show a box but no whiskers -- and all the data too. You have immensely more data points, but the graph doesn't become unreadable as the sample size increases. I also use niceloglabels (Stata Journal) to get some better-looking labels than the default. (Members here have pointed out that using a small x can be improved upon if you can fire up the multiplication symbol.)

    Code:
    sysuse census
    niceloglabels pop, style(125) local(myla) powers
    stripplot pop, ysc(log) box centre cumul vertical yla(`myla', ang(h))
    [Figure: pop_boxplot.png]
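
    The same recipe could be tried on the two variables in question. This is only a sketch: it assumes stripplot (SSC) and niceloglabels (Stata Journal) are installed, and it builds the labels from lagged_tot_sales alone because, on the summaries above, its range contains that of avsales_new_no_outliers. style(1), meaning labels at powers of 10 only, is a guess at what suits a range this wide.

    Code:
    * labels from the variable with the wider range; style(1) is an assumption
    niceloglabels lagged_tot_sales, style(1) local(myla) powers
    * one strip per variable, box without whiskers, all the data shown
    stripplot avsales_new_no_outliers lagged_tot_sales, ysc(log) box centre cumul vertical yla(`myla', ang(h))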


    That said, it's entirely possible that other displays such as histograms will work well for your data. The niceloglabels paper has some detailed examples of how best to get histograms on logarithmic scales.

    SJ-18-1 gr0072 . . . . . . . Speaking Stata: Logarithmic binning and labeling
    (help niceloglabels) . . . . . . . . . . . . . . . . . . . N. J. Cox
    Q1/18 SJ 18(1):262--286
    introduces the niceloglabels command for helping (even automating)
    label choice
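
    For the histogram route, a minimal sketch, again assuming strictly positive values: bin the base 10 logarithm and label the axis in the original units. The variable name and the labelled values are illustrative, and the paper's own examples should be preferred for detail.

    Code:
    * illustrative name; bins are equal width on the log scale
    generate double l10_lagged = log10(lagged_tot_sales)
    * axis positions are log10 values; the text shown is in original units
    histogram l10_lagged, xlabel(0 "1" 3 "1000" 6 "1e+06" 9 "1e+09") xtitle("lagged_tot_sales (logarithmic scale)")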
    Last edited by Nick Cox; 27 Nov 2018, 04:18.


    • #3
      First of all, sorry for my negligence in not having read the rules about the .png format.
      Unfortunately, I don't think that a data example would be of much help either, so I will try your suggestion and let you know.
      In the meantime, many thanks again, Professor.


      • #4
        Note that your two variables have maxima 746,000,000 and 50,300,000,000 and the ratio of those maxima is

        . di 7.46e8/5.03e10
        .01483101

        Hence without a logarithmic scale the distribution of one variable is squeezed into 1.5% of the space of the other.


        • #5
          So I need a logarithmic scale to get more visually appealing results, right?


          • #6
            Indeed; that's my inference and my implication.


            • #7
              Perfect!
              Many thanks again!

              Federico
