  • Why does twoway bar not always include 0 by default?

    The documentation for -twoway bar- reads, "base(#) specifies the value from which the bar should extend. The default is base(0) when 0 falls between the minimum and maximum of yvar. Otherwise, the default base is the value of yvar closest to 0."

    But this does not seem to be true:

    Code:
    // works the way anyone outside of StataCorp would expect, but contrary to the documentation: the min of mpg is >0, yet Stata still used base(0):
    sysuse auto, clear
    collapse (sum) mpg, by(rep78)
    twoway bar mpg rep78
    
    // min of mpg is still >0, but now Stata excludes 0, using a default base of the value of yvar closest to 0:
    sysuse auto, clear
    collapse (mean) mpg, by(rep78)
    twoway bar mpg rep78
    
    // now the only way (I know of) to get 0 back is to add base(0) as an option (forcing me to now manually fix the y-axis):
    twoway bar mpg rep78, base(0)
    So, two questions:
    1. why does -twoway bar- change behavior depending on whether data is aggregated by mean or sum?
    2. why would Stata EVER choose to make the default of a bar graph NOT include 0? Am I missing something here? No other software I've used does this and it seems that cases when you don't want to show 0 on a bar graph are exceptionally rare (not to mention it is much easier to alter the base to exclude 0 than it is to add it back and have to mess with labels again).

  • #2
    There is another way you can get the 0 back in the third graph:
    Code:
    twoway bar mpg rep78, ylabel(0(4)28)
    Stata will always extend the axis to include the values specified in the corresponding axis label option. (It will not, however, truncate the axis if there are data points outside the range of the labeled values.)
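
    As a minimal sketch of how this combines with the base() option from #1, the following sets the bar base and the axis labels in one command. The label spacing 0(5)30 is an arbitrary choice for illustration, picked only because the group means of mpg stay below 30.
    Code:
    sysuse auto, clear
    collapse (mean) mpg, by(rep78)
    // base(0) draws each bar from 0; ylabel() extends and labels the axis from 0 to 30
    twoway bar mpg rep78, base(0) ylabel(0(5)30)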

    Why is zero not always the base? Because if you had data where the values for the tops of the bars were, say, in the range of 100 to 105, using a zero base would make the variation in the bar heights practically invisible because that part of the vertical range would be such a small portion of the entire axis. Since it is usually the variation among the bars that is of interest, not their absolute heights, using a base value that represents the lowest data point (or a round number just below that) makes a graph that is better fit for purpose, and, so, is the default.

    Added: Forgot to answer your question about mean vs sum. The range of values is very different. When big numbers are involved (sum) using a zero base produces a graph that squashes the variation to near invisibility, whereas when smaller numbers are involved (mean) it does not. It's nothing to do with mean and sum per se. It's the difference between large and small numbers.
    Last edited by Clyde Schechter; 22 Sep 2022, 15:26.



    • #3
      re question 1 in the original post:
      You (helpfully) point out that Stata's choice to include or exclude 0 has "nothing to do with mean and sum" but is "the difference between large and small numbers." This explanation makes intuitive sense, but unfortunately it is backwards in the example above: collapsing by mean (smaller values) excludes 0 while collapsing by sum (bigger values) includes 0. Another potential explanation for Stata's behavior that checks out in this particular case is that -twoway bar- is relying on some combination of yvar's range and min value (e.g., a high range-to-min ratio means it includes 0 while a low range-to-min ratio means it excludes 0).
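
      One rough way to probe this is to compute the minimum of yvar, pass it to base() explicitly, and compare the result with the default graph. This is only a sketch: the graph names g_default and g_explicit are placeholders made up for this illustration. If the two graphs look the same, the default base is the minimum of yvar, which is what the documentation quoted in #1 describes for data that do not span 0.
      Code:
      sysuse auto, clear
      collapse (mean) mpg, by(rep78)
      // store the smallest group mean; for all-positive data this is the value closest to 0
      summarize mpg, meanonly
      local ymin = r(min)
      // default base chosen by Stata
      twoway bar mpg rep78, name(g_default, replace)
      // base set explicitly to the minimum, for side-by-side comparison
      twoway bar mpg rep78, base(`ymin') name(g_explicit, replace)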

      Regardless, even if we do understand why Stata treats these examples differently, the documentation under -twoway bar- (see #1) still seems misleading. You add a nice caveat that the documentation omits, suggesting that Stata may choose "a round number just below [the min]." Unfortunately, even that caveat doesn't explain Stata's behavior in the following case:

      Code:
      clear
      input byte x double y
       1       .5
       2 .4210526
       3 .5714286
       4 .7222222
       5 .6591337
       6 .6084123
       7 .6459931
       8 .5892181
       9 .6137941
      10 .6062673
      11 .5749965
      12 .5902259
      13 .5871892
      14 .5662012
      15 .5857022
      16 .5770655
      17 .5624521
      18 .5181159
      19 .5013959
      20       .5
      21 .5075922
      22  .556338
      23       .3
      24 .4666667
      end
      
      twoway bar y x
      twoway bar y x, base(0)
      Here, as you can see by running the code or referencing the attached images, Stata misses the mark and excludes the min value of 0.3 entirely from the first graph, suggesting that it's not just rounding down from the min to a round number, but that it might even round up. Am I missing something?


      re question 2 in the original post (this is mostly grumbling, so feel free to address only question 1 if question 2 is off-topic):
      I understand your point that using a zero base can make variation in the bar heights practically invisible and agree there are cases when you don't want to include 0. However, I still don't understand why the default is to exclude 0 even though

      1. nearly all mainstream visualization tools* include 0 by default even if it squashes the variation to near invisibility,
      2. this goes against the prevailing** philosophy that bar graphs excluding 0 can be misleading, and
      3. it requires more thinking to add zero back and correct the axis than to manually increase the base when you actually do have a justification to exclude 0.

      *I can't speak for every tool out there, but this is the case in R (base R, ggplot, + any other library I've ever used), Python (matplotlib, plotly, + every package I've ever used), highcharts (JS), Tableau, Power BI, Qlik, etc. Some BI tools (like Domo or Periscope if I'm not mistaken) will even add a scale break on the y-axis to show that blank space between zero and the min has been omitted if the absolute value of the data is large relative to the variance.

      **See here or here for the first couple of random blogs you'll get when Googling "should graphs include 0", or even this Vox video that supports your argument by emphasizing there are some cases when excluding 0 is acceptable.
      Attached Files



      • #4
        I would want to add https://stats.stackexchange.com/ques...-start-at-zero to the discussion, for reasons that will be evident if you look at it.

        There isn't a full answer to the question unless the developers explain why this was the choice, and I can't speak for them, nor is the link just given emanating from StataCorp. But I will try a guess.

        It's important that twoway bar is just one part of twoway. In general, twoway graphs are based on showing the range of the data, plus some extra space for cosmetic reasons tied up with axis label defaults, unless the user specifies otherwise. You're wanting twoway bar to be an exception because bars should usually start at zero.

        I think StataCorp's line would be that that is a user's choice based on their data and their goal, on which Stata must remain ignorant.

        As it happens, I agree that bars with arbitrary zero are usually wrong, or at least potentially misleading. But bars with meaningful non-zero bases are quite common. I've seen bars based at other reference levels defined in principle (100 for values with such a base, 1 if parity is a reference level, 32 F if temperatures above and below freezing are of concern) or in practice (bars encode deviation from a mean or smooth series, although the latter requires twoway rbar).
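
        As a small sketch of a bar chart with a meaningful non-zero base, using the auto data from #1: here the reference level is the mean of the group means, chosen purely for illustration, and the local macro name ref is made up for this example. Bars above the reference extend upward and bars below it extend downward.
        Code:
        sysuse auto, clear
        collapse (mean) mpg, by(rep78)
        // use the mean of the group means as an illustrative reference level
        summarize mpg, meanonly
        local ref = r(mean)
        // bars start at the reference level; yline() marks it on the graph
        twoway bar mpg rep78, base(`ref') yline(`ref')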

        The point is serious and vexed: to what extent should Stata tell the user what to do, like a parent or a supervisor, whenever there are choices, and in graphics there usually are many choices? My personal advice is that a bar chart is a poor choice for the data in #3. Not knowing anything about what they are (even whether they are real) or what your interest is puts me in exactly the same position as Stata, and Stata is not going to tell you what is a good idea, although as another researcher I might have advice. You'd be annoyed or surprised if Stata defaulted to showing a line chart instead on the grounds that this is a much better idea. Stata's choice on base of bars is of the same kind, a choice to back off from telling you what you should be doing.
        Last edited by Nick Cox; 23 Sep 2022, 01:52.
