
  • Bar chart when one category is extremely frequent

    This is a question on data visualization in Stata, but also relevant for other software.
    I have a dataset of frequencies by year, for 3 categories, as shown in the following example:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input float year double Cat_1 float(Cat_3 Cat_2)
    2004  454  5  19
    2005  791  6  34
    2006  968 11  32
    2007 1020 15  18
    2008 1131 12  20
    2009 1169  7  59
    2010 1879 18  39
    2011 2020 22  36
    2012 2302 16  53
    2013 2610 18  59
    2014 3035 25  52
    2015 3530 17  68
    2016 4248 29  80
    2017 5054 24  81
    2018 6297 37 101
    2019 8383 37 120
    end
    I would like to show the prevalence of the 3 different categories over time. The first problem is that the absolute number of observations increases greatly by year, which led me to consider a graph of relative frequencies (perhaps adding absolute numbers) as follows:
    Code:
    graph hbar Cat_1 Cat_2 Cat_3, over(year) stack percent
    But this is still not very satisfactory because an even bigger problem is that the first category (Cat_1) is overwhelmingly frequent relative to the other two.
    I know in spreadsheets sometimes people artificially resize the height of the bar for the most prevalent category, but Stata does not have an option for that and I do not believe that this would be a good thing to do.
    I am wondering if someone has suggestions on how to effectively visualize my data.
    Last edited by Valentina Rutigliano; 22 Oct 2021, 21:19.
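To put numbers on how dominant Cat_1 is, here is a minimal sketch that computes each category's percent share of the yearly total from the example data. The names total and share_* are made up for illustration and are not part of the original question.

```stata
* Example data from the question
clear
input float year double Cat_1 float(Cat_3 Cat_2)
2004  454  5  19
2005  791  6  34
2006  968 11  32
2007 1020 15  18
2008 1131 12  20
2009 1169  7  59
2010 1879 18  39
2011 2020 22  36
2012 2302 16  53
2013 2610 18  59
2014 3035 25  52
2015 3530 17  68
2016 4248 29  80
2017 5054 24  81
2018 6297 37 101
2019 8383 37 120
end

* Yearly total across the three categories (hypothetical variable name)
egen total = rowtotal(Cat_1 Cat_2 Cat_3)

* Percent share of each category within each year
foreach v of varlist Cat_1 Cat_2 Cat_3 {
    gen share_`v' = 100 * `v' / total
}

format share_* %5.2f
list year total share_*, noobs
```

In this example Cat_1 makes up roughly 95 to 98 percent of every year's total, which is why the other two segments of a stacked percent bar are barely visible.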

  • #2
    Code:
    line Cat_? year, ysc(log) yla(10 100 1000 10000 3 30 300 3000, ang(h))



    • #3
      Here are some more bells and whistles.

      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input float year double Cat_1 float(Cat_3 Cat_2)
      2004  454  5  19
      2005  791  6  34
      2006  968 11  32
      2007 1020 15  18
      2008 1131 12  20
      2009 1169  7  59
      2010 1879 18  39
      2011 2020 22  36
      2012 2302 16  53
      2013 2610 18  59
      2014 3035 25  52
      2015 3530 17  68
      2016 4248 29  80
      2017 5054 24  81
      2018 6297 37 101
      2019 8383 37 120
      end
      
      set scheme s1color 
      
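      * Direct labels: the first three observations hold one label each,
      * placed at each series' last value (y) and at the final year (x)
      gen text = cond(_n == 1, "Cat_1", cond(_n == 2, "Cat_2", cond(_n == 3, "Cat_3", "")))
      gen y = cond(_n == 1, Cat_1[_N], cond(_n == 2, Cat_2[_N], cond(_n == 3, Cat_3[_N], .)))
      gen x = 2019 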
      line Cat_? year, ysc(log) yla(3 10 30 100 300 1000 3000 10000, ang(h)) xla(2004(5)2019) /// 
      || scatter y x, ms(none) mla(text) legend(off) xsc(r(. 2020.5)) ytitle(whatever)

      [Image: notabarchart.png]


      Details:

      1. With a logarithmic scale you often need to spell out nicer labels. Much more discussion at https://www.stata-journal.com/articl...article=gr0072

      2. One of my slogans is Lose the legend! Kill the key! (if you can). Direct labelling is better if there is the space. Naturally, I don't know what your variables are really called (or should be better called if your variables really are called Cat_1 and so on). If your real names (or text labels) are much longer than shown then you'll need to bump up 2020.5 (or fall back on a legend).

      3. Only you know the readership here -- fellow researchers, senior executives, lay people? My daily newspaper and favourite weekly both use logarithmic scales when a good idea and are optimistic that people know about them or will see what is going on. Oddly I don't remember really learning about log scales except perhaps that something arose in secondary school when a teacher said you'll need logarithmic scale graph paper for this and here it is (circa 1965).
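On point 1, the 3 10 30 ... labels used above can also be built in a loop rather than typed out. This is only a sketch with Stata's built-in macro tools; the local macro name labels is an arbitrary choice, and the range of powers is an assumption that happens to suit these data.

```stata
* Build the label list 3 10 30 ... 10000 for a log-scale axis
local labels
forvalues p = 0/3 {
    foreach m in 3 10 {
        * each label is 3 or 10 times a power of 10
        local labels `labels' `= `m' * 10^`p''
    }
}
display "`labels'"
```

The resulting list could then be passed to a graph command in the same do-file as yla(`labels', ang(h)).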

      In a talk to a not very technical audience I might show this

      [Image: notabarchart2.png]


      and then say something like this

      On the left you see that Cat_1 has been by far the dominant category and has been shooting up, and in comparison the other Cats are tiny. But it's hard to see the detail for those other Cats. On the right I use a logarithmic scale, which stretches the low end of the scale and squeezes the top end, as you can see from the numbers on the left. In fact all Cats are growing at roughly the same percent rate.


      Code:
      line Cat_? year, ysc(log) yla(3 10 30 100 300 1000 3000 10000, ang(h)) xla(2004(5)2019) /// 
      || scatter y x, ms(none) mla(text) legend(off) xsc(r(. 2021.2)) ytitle(whatever) name(G1, replace)
      
      line Cat_? year, yla(0(1000)9000, ang(h)) xla(2004(5)2019) /// 
      || scatter y x, ms(none) mla(text) legend(off) xsc(r(. 2021.2)) ytitle(whatever) name(G2, replace)
      
      graph combine G2 G1
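The remark about growth at roughly the same percent rate can be checked crudely with a compound annual growth rate between the first and last years. This is a rough sketch, not the method used for the graphs: the scalar names g_* are invented here, and an endpoint-based rate is sensitive to the particular values in 2004 and 2019, so a regression of log counts on year would be a reasonable alternative.

```stata
* Example data from the thread
clear
input float year double Cat_1 float(Cat_3 Cat_2)
2004  454  5  19
2005  791  6  34
2006  968 11  32
2007 1020 15  18
2008 1131 12  20
2009 1169  7  59
2010 1879 18  39
2011 2020 22  36
2012 2302 16  53
2013 2610 18  59
2014 3035 25  52
2015 3530 17  68
2016 4248 29  80
2017 5054 24  81
2018 6297 37 101
2019 8383 37 120
end

sort year

* Compound annual growth rate in percent, first year to last year
foreach v of varlist Cat_1 Cat_2 Cat_3 {
    scalar g_`v' = 100 * ((`v'[_N] / `v'[1])^(1 / (year[_N] - year[1])) - 1)
    display "`v': " %4.1f g_`v' " % per year"
}
```

The rates will not be identical across categories, but on a logarithmic scale similar slopes correspond to similar percent growth, which is what the right-hand panel lets the audience see.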
