Bar chart when one category is extremely frequent

Valentina Rutigliano

Join Date: Jan 2021

Posts: 17
#1

22 Oct 2021, 21:16

This is a question on data visualization in Stata, but also relevant for other software.
I have a dataset of frequencies by year, for 3 categories, as shown in the following example:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float year double Cat_1 float(Cat_3 Cat_2)
2004  454  5  19
2005  791  6  34
2006  968 11  32
2007 1020 15  18
2008 1131 12  20
2009 1169  7  59
2010 1879 18  39
2011 2020 22  36
2012 2302 16  53
2013 2610 18  59
2014 3035 25  52
2015 3530 17  68
2016 4248 29  80
2017 5054 24  81
2018 6297 37 101
2019 8383 37 120
end

I would like to show the prevalence of the 3 different categories over time. The first problem is that the absolute number of observations increases greatly by year, which led me to consider a graph of relative frequencies (perhaps adding absolute numbers) as follows:

Code:

graph hbar Cat_1 Cat_2 Cat_3, over(year) stack percent

But this is still not very satisfactory because an even bigger problem is that the first category (Cat_1) is overwhelmingly frequent relative to the other two.
I know in spreadsheets sometimes people artificially resize the height of the bar for the most prevalent category, but Stata does not have an option for that and I do not believe that this would be a good thing to do.
I am wondering if someone has suggestions on how to effectively visualize my data.
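To make the imbalance concrete, the yearly shares can also be tabulated directly. This is only a minimal sketch, assuming the example data above are in memory; the names total and pct_* are made up for illustration.

```stata
* Row total across the three categories for each year
egen double total = rowtotal(Cat_1 Cat_2 Cat_3)

* Percent share of each category within its year
foreach v of varlist Cat_1 Cat_2 Cat_3 {
    generate double pct_`v' = 100 * `v' / total
}

format pct_* %5.1f
list year total pct_*, noobs
```

A listing like this shows the same information as the stacked percent bars, with the absolute totals alongside.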

Last edited by Valentina Rutigliano; 22 Oct 2021, 21:19.
Nick Cox

Join Date: Mar 2014

Posts: 35724
#2

23 Oct 2021, 01:52

Code:

line Cat_? year, ysc(log) yla(3 10 30 100 300 1000 3000 10000, ang(h))
Nick Cox

Join Date: Mar 2014

Posts: 35724
#3

23 Oct 2021, 04:10

Here are some more bells and whistles.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float year double Cat_1 float(Cat_3 Cat_2)
2004  454  5  19
2005  791  6  34
2006  968 11  32
2007 1020 15  18
2008 1131 12  20
2009 1169  7  59
2010 1879 18  39
2011 2020 22  36
2012 2302 16  53
2013 2610 18  59
2014 3035 25  52
2015 3530 17  68
2016 4248 29  80
2017 5054 24  81
2018 6297 37 101
2019 8383 37 120
end

set scheme s1color

gen text = cond(_n == 1, "Cat_1", cond(_n == 2, "Cat_2", cond(_n == 3, "Cat_3", "")))
gen y = cond(_n == 1, Cat_1[_N], cond(_n == 2, Cat_2[_N], cond(_n == 3, Cat_3[_N], .)))
gen x = 2019

line Cat_? year, ysc(log) yla(3 10 30 100 300 1000 3000 10000, ang(h)) xla(2004(5)2019) ///
|| scatter y x, ms(none) mla(text) legend(off) xsc(r(. 2020.5)) ytitle(whatever)

Details:

1. With a logarithmic scale you often need to spell out nicer labels. Much more discussion at https://www.stata-journal.com/articl...article=gr0072

2. One of my slogans is Lose the legend! Kill the key! (if you can). Direct labelling is better if there is the space. Naturally, I don't know what your variables are really called (or should be better called if your variables really are called Cat_1 and so on). If your real names (or text labels) are much longer than shown then you'll need to bump up 2020.5 (or fall back on a legend).

3. Only you know the readership here -- fellow researchers, senior executives, lay people? My daily newspaper and favourite weekly both use logarithmic scales when a good idea and are optimistic that people know about them or will see what is going on. Oddly I don't remember really learning about log scales except perhaps that something arose in secondary school when a teacher said you'll need logarithmic scale graph paper for this and here it is (circa 1965).
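On point 1, a quick way to see why labels such as 3 10 30 100 ... suit a log scale: log10(3) is about 0.477, so each step is close to half a power of 10 and the labels land almost evenly spaced. A small sketch that just prints the positions (nothing here depends on the data):

```stata
* Position of each axis label on a log10 scale
foreach v of numlist 3 10 30 100 300 1000 3000 10000 {
    display %6.0f `v' "   log10 = " %5.3f log10(`v')
}
```

Any other set of labels could be checked the same way before putting it inside yla().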

In a talk to a not very technical audience I might show this

and then say something like this

On the left you see that Cat_1 has been by far the dominant category and has been shooting up and in comparison the other Cats are tiny. But it's hard to see the detail for those other Cats. On the right I use a logarithmic scale, which stretches the low end of the scale and squeezes the top end, as you can see from the numbers on the left. In fact all Cats are growing at roughly the same percent rate.

Code:

line Cat_? year, ysc(log) yla(3 10 30 100 300 1000 3000 10000, ang(h)) xla(2004(5)2019) ///
|| scatter y x, ms(none) mla(text) legend(off) xsc(r(. 2021.2)) ytitle(whatever) name(G1, replace)

line Cat_? year, yla(0(1000)9000, ang(h)) xla(2004(5)2019) ///
|| scatter y x, ms(none) mla(text) legend(off) xsc(r(. 2021.2)) ytitle(whatever) name(G2, replace)

graph combine G2 G1
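The statement about similar percent growth can be checked informally by regressing the log of each series on year: on a log scale a constant percent growth rate is a straight line, and the slope gives the rate. This is only a rough sketch, assuming the example data are in memory; the ln_* variable names are invented for illustration, and a simple linear fit ignores any year-to-year irregularity.

```stata
* Approximate average annual percent growth for each category
foreach v of varlist Cat_1 Cat_2 Cat_3 {
    generate double ln_`v' = ln(`v')
    quietly regress ln_`v' year
    * exp(slope) - 1 converts the log-scale slope to a proportional change
    display "`v': about " %4.1f 100 * (exp(_b[year]) - 1) "% per year"
}
```

If the three rates come out noticeably different, that is worth saying to the audience rather than claiming they are the same.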