Stacked area plot

Masoumeh Sanagou

Join Date: May 2017

Posts: 107
#1


14 Mar 2018, 17:22

Hi Statalist,

I'm wondering if I could produce the attached graph (stacked area plot) using Stata?

I used

twoway area

and played with different options but I could not produce exactly the same graph.

The groups were overlaid on each other: each category did not start where the previous category ended (see my graph).

Last edited by Masoumeh Sanagou; 14 Mar 2018, 17:26.
Masoumeh Sanagou

Join Date: May 2017

Posts: 107
#2

14 Mar 2018, 17:28

The graphs are attached here.

Andrew Musau

Join Date: Oct 2014
Posts: 10190

15 Mar 2018, 06:31

The stack option is part of the graph command (graph bar), but you can recreate a stacked area plot with twoway. However, you should tell your readers that the areas are stacked (and not overlaid); otherwise the y-axis makes no sense. The trick is to create cumulative categories and plot the last category first. Here is an example:

Code:

clear
set obs 16
set seed 1234
gen year = 1999 + _n

* fake data: four series with mean 10000 and SDs 1500, 2500, 3500, 4500
forvalues i = 1/4 {
    gen CNTRY`i' = rnormal(10000, `i'500)
}

* GENERATE CUMULATIVE VALUES
gen _CNTRY1 = CNTRY1

forvalues i = 2/4 {
    local j = `i' - 1
    gen _CNTRY`i' = _CNTRY`j' + CNTRY`i'
}

* REVERSE ORDER: PLOT LAST CUMULATIVE CATEGORY FIRST
* (note the blank before ///, and gs16 written without a space)
twoway (area _CNTRY4 _CNTRY3 _CNTRY2 _CNTRY1 year, graphregion(color(white)) ///
    yla(0(10000)50000) xlab(2000(5)2015) color(gs4 gs8 gs12 gs16))

[Image: stacked_area.png]


Nick Cox

Join Date: Mar 2014

Posts: 35694
#4

20 Aug 2020, 04:52

Quite different issues arise here, ranging from how can I do this in Stata to why StataCorp doesn't support this kind of graph directly. On the last, I think the best answer is that Stata focuses on common primitive graph types and very common derived graph types. By the last distinction I mean that a bar chart is a standard primitive graphic, but no user of a statistics language or environment expects to be told that a histogram is a special kind of bar chart, so please go away, work out some bin limits and frequencies, and then come back asking for a bar chart. If anyone thinks that a stacked area chart is worthy of direct support, then StataCorp drew the line in the wrong place for them.

I want to focus on yet another issue, easy to ask and harder to answer: how well does this design really work? I have drawn stacked bar charts (quite often) and stacked area charts (occasionally) but I have come to distrust them. Usually you can do better.

This is a long post, so here is the executive summary. Stacked plots are quite popular, but do they really work well? Consider whether a design based on small multiples would work as well or better.

If any graph is what you want, that's good. If a graph is what your readers can understand and find helpful, that's very good. Yet some graphs are better than others. We can argue about details, but graphs aren't equally easy to understand in principle or equally effective in practice at conveying a message.

Nothing that follows is original or unusual. But just like say pie charts, much criticised but still often used, the stacked design refuses to die quietly.

Stacked charts are seductive. My guess is that if people spelled out why they chose one, the argument would go like this:

1. Components of a total (with zero or positive values) can be stacked because they are additive.

2. The graph shows a total directly, which is usually of interest and importance.

3. The graph shows the components too, so you can drill down to see the detail.

4. So, one graph gives a summary and shows details too. Good design, good use of space (and reader time).

#1 is right and #2 usually works fine. The problem is whether #3 really works as implied, given that only the series stacked at the bottom is easy to compare with its horizontal baseline and so the others are harder to compare with each other or indeed with anything else. In practice, when some values are small they can be harder to compare, or even to see clearly -- beyond the obvious fact that one quantity is much smaller than another.

Also, most designs imply the use of legends, which are at best a necessary evil. Faced with a legend, how often do you think "I could study this in detail because I understand the principle, but life is short, so not now"? Not now usually morphs into never.

If #3 doesn't work well, #4 isn't convincing.

The first post in this thread showed what were recognisably some Australian data but didn't give a source. So Andrew Musau just faked some data to show a solution. No criticism of that but Australian data for states and territories are a really good example for thinking about the strengths -- and limitations -- of this design as

* 8 (or so) states and territories are few enough to make this design tempting to many, but also many enough to show its limitations. (Compare, for example, 50 states of the US, not to mention DC, Puerto Rico, Guam, and so forth, as more likely to produce a ridiculous mess.)

* Any Australian (researcher or lay) -- and many a non-Australian too -- really isn't surprised to see New South Wales and Victoria looming large for many kinds of data, and so on, and territories being harder to spot. So what else is new? What can we really learn from the graph?

Another easy but also hard question is what are the goals here, and naturally I can't speak for anyone but myself.

So I downloaded some Australian data to play with. Here they are:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int year long(pop1 pop2 pop3 pop4 pop5 pop6 pop7 pop8) int pop9
2001 6530349 4763615 3571469 1906274 1503461 473668 321538 201743 2584
2002 6580807 4817774 3653123 1928512 1511567 474152 324627 202251 2397
2003 6620715 4873809 3743121 1952741 1520399 478534 327357 201725 2336
2004 6650735 4927149 3829970 1979542 1528189 483178 328940 202663 2356
2005 6693206 4989246 3918494 2011207 1538804 486202 331399 205905 2381
2006 6742690 5061266 4007992 2050581 1552529 489302 335170 209057 2379
2007 6834156 5153522 4111018 2106139 1570619 493262 342644 213748 2514
2008 6943461 5256375 4219505 2171700 1588665 498568 348368 219874 2683
2009 7053755 5371934 4328771 2240250 1608902 504353 354785 226027 2876
2010 7144292 5461101 4404744 2290845 1627322 508847 361766 229778 3055
2011 7218529 5537817 4476778 2353409 1639614 511483 367985 231292 3117
2012 7304244 5651091 4568687 2425507 1656725 511724 376539 235915 3033
2013 7404032 5772669 4652824 2486944 1671488 512231 383257 241722 2962
2014 7508353 5894917 4719653 2517608 1686945 513621 388799 242894 2896
2015 7616168 6022322 4777692 2540672 1700668 515117 395813 244692 2851
2016 7732858 6173172 4845152 2555978 1712843 517514 403104 245678 4608
2017 7867936 6321606 4927629 2574193 1723923 522410 412025 247517 4621
2018 7980168 6462019 5009424 2594181 1736527 528298 420379 247058 4634
2019 8089817 6596039 5094510 2621509 1751963 534457 426704 245929 4643
end
label var pop1 "New South Wales"
label var pop2 "Victoria"
label var pop3 "Queensland"
label var pop4 "Western Australia"
label var pop5 "South Australia"
label var pop6 "Tasmania"
label var pop7 "Australian Capital Territory"
label var pop8 "Northern Territory"
label var pop9 "Other Territories"

Incidental detail: This isn't in general the best structure for these data. It's a fair starting point for present purposes, however. The comments below flag where I used community-contributed commands, which must be installed before you can use them.

Let's give the stacked design a chance. I use some of Andrew Musau's trickery (another route would make more use of twoway rarea).

Code:

* ssc inst mycolours
* https://www.statalist.org/forums/forum/general-stata-discussion/general/1568168-mycolours-package-available-on-ssc
mycolours

clonevar show1 = pop1
forval j = 2/9 {
    local k = `j' - 1
    gen show`j' = show`k' + pop`j'
    _crcslbl show`j' pop`j'
}

local toshow
forval j = 9(-1)1 {
    local toshow `toshow' show`j'
}

twoway area `toshow' year, xla(2001 2019 2005(5)2015) xtitle("") ///
    col("`OK9'" "`OK8'" "`OK7'" "`OK6'" "`OK5'" "`OK4'" "`OK3'" "`OK2'" "`OK1'") ///
    yla(0 5e6 "5" 1e7 "10" 15e6 "15" 2e7 "20" 25e6 "25", ang(h)) legend(col(1) pos(3)) ///
    ytitle(Population (millions)) name(G0, replace)

I worked a bit to make this presentable, but there is still a long way to go. Leaving the legend large is a slightly mischievous way to underline that legend management is a big deal for such graphs, even with just 9 items. (I don't know enough about Australia, or was too lazy, to work out whether there are official or traditional colours for each state or territory.)

The bigger deal yet is how well this really works. I can read off, for example, that the population of Victoria was increasing faster than that of New South Wales over this period. But much detail that might be interesting has been suppressed. If that is, in a sense, the goal, then so be it.
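For comparison, the twoway rarea route mentioned earlier could look roughly like this. It is a minimal sketch, not the code behind the graph above: it uses only three years and three of the series from the data listing, and the variable names lo* and hi* are made up for illustration. Each band is drawn between its own lower and upper cumulative limits, so the plotting order no longer matters.

```stata
* excerpt of the data listing above (three years, three series only)
clear
input int year long(pop1 pop2 pop3)
2001 6530349 4763615 3571469
2010 7144292 5461101 4404744
2019 8089817 6596039 5094510
end

* lower and upper limits of each band: band j sits on top of band j-1
* (lo* and hi* are hypothetical names)
gen double lo1 = 0
gen double hi1 = pop1
forval j = 2/3 {
    local k = `j' - 1
    gen double lo`j' = hi`k'
    gen double hi`j' = lo`j' + pop`j'
}

* one rarea per band; legend order reversed to match the stacking
twoway (rarea lo1 hi1 year) (rarea lo2 hi2 year) (rarea lo3 hi3 year), ///
    yla(, ang(h)) xtitle("") ytitle("Population") ///
    legend(order(3 "Queensland" 2 "Victoria" 1 "New South Wales") col(1) pos(3))
```

The price of this route is two variables per series rather than one, but nothing is hidden behind anything else, so area transparency or outlines can be used freely.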

A natural alternative is to switch to small multiples, several panels in one graph. 9 variables fall easily into a 3 x 3 display. Here are some line plots.

Code:

* ssc install combineplot
* https://www.statalist.org/forums/forum/general-stata-discussion/general/3055-combineplot-available-on-ssc
combineplot (pop*) year, combine(name(G1, replace)): line @y @x , yla(, ang(h)) xtitle("")

Still lots to do here, e.g. work on the axis labels, but we can see much extra detail. "Other Territories" requires a story, and that would mean drilling down to a smaller scale. Here I just leave them out.

Here, and very often, the trade-off between (1) choosing different scales to honour the detail for each series and (2) choosing the same scale to make series comparable is difficult and delicate. Logarithmic scale often helps, but not much here.

Let's look at relative growth. For 2001-2019 data, 2010 is one possible origin. Here 2010 data are in the 10th observation -- at least if we sort on year. How to use an index year with panel data is a little trickier, but covered at https://www.stata-journal.com/articl...article=dm0055

Code:

sort year
forval j = 1/9 {
    gen pop`j'_s = pop`j'/pop`j'[10]
    _crcslbl pop`j'_s pop`j'
}

combineplot (pop1_s-pop8_s) year, combine(name(G2, replace) note(2010=100)): ///
    line @y @x , yla(0.8 "80" 0.9 "90" 1 "100" 1.1 "110" 1.2 "120", ang(h)) xtitle("") ysc(r(0.8 1.2)) yli(1, lc(gs12) lw(thin))

Here the next step might be ordering the areas differently, say from fastest growing to slowest. Or some might wonder whether superimposing the graphs would work better.
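On the index-year point: with the data in a long layout, one row per area and year, the base value has to be picked out within each area rather than by observation number. What follows is a rough sketch only, on a three-year, two-series excerpt of the data listing; the names area, base and index are hypothetical, and this is just one of several ways to do it.

```stata
* excerpt of the data listing above (three years, two series only)
clear
input int year long(pop1 pop2)
2001 6530349 4763615
2010 7144292 5461101
2019 8089817 6596039
end

* long layout: one row per area and year
reshape long pop, i(year) j(area)

* spread each area's 2010 value to all of that area's rows;
* cond() returns missing for other years, which total() ignores
egen double base = total(cond(year == 2010, pop, .)), by(area)
gen double index = 100 * pop / base

sort area year
list area year pop index, sepby(area)
```

Unlike the subscript [10] used in the wide layout, this does not depend on the sort order or on every area having the same number of years.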

Naturally the total can be calculated easily as a row sum and as another variable. (That is one reason why I said this is a fair structure for present purposes.)
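As a small sketch of that row sum, here it is on three years of the listing, with pop4 to pop9 left out to keep it short, so these totals are not national totals. The variable name total is an arbitrary choice.

```stata
* excerpt of the data listing above (three years, three series only)
clear
input int year long(pop1 pop2 pop3)
2001 6530349 4763615 3571469
2010 7144292 5461101 4404744
2019 8089817 6596039 5094510
end

* row sum as a new variable; double because the default float type
* cannot hold all integers above about 16.7 million exactly
egen double total = rowtotal(pop1-pop3)
label var total "Total"
list year total
```

With the full data set, rowtotal(pop1-pop9) would give the overall total, which could then be plotted as a panel of its own alongside the components.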

A final point, at least for now, is that these data are smoothly varying, so easier to work with than many other variables.