Graphing percentages as line graph for categorical variable

Roberto Vidri

Join Date: Mar 2019

Posts: 36
#1

02 Oct 2020, 13:41

Hi all,

I have a variable, "tx_order", which contains 7 categories of a treatment. A second variable, "YEAR_OF_DIAGNOSIS", indicates the year administered.

I'm trying to plot the percentage of each category per year in a line graph, with one line per category showing its percentages (see below). For example, in 2004 "Surgery" would be 34.55%, "Radiation" 0%, etc.

Tabulation:

I created these graphs with binary variables, using the code below. However, I don't know how to do it for a non-binary variable, like the one described above.

tab chemo YEAR_OF_DIAGNOSIS, col
by YEAR_OF_DIAGNOSIS, sort: egen pc_chemo = mean(100*chemo)
label variable pc_chemo "Chemotherapy"

*Graph
twoway (connected pc_chemo YEAR_OF_DIAGNOSIS), ///
    xtitle(Year) xlabel(#14) ///
    ytitle("Received Chemotherapy (%)") ylabel(0(10)100)

I would really appreciate your help!
Last edited by Roberto Vidri; 02 Oct 2020, 13:49.
Javier Jaramillo Morales

Join Date: Oct 2020

Posts: 2
#2

06 Oct 2020, 14:37

I have exactly the same question. I do not want to create a bar graph or a stacked bar graph because it looks too "crowded".
I have 2 categorical variables:
a. Year: 2014, 2015, 2016, 2017, 2018
b. Types of Malformation: cardiac, orofacial, etc. (11 in total).

Attached is my table.
I would appreciate the help.
Thanks
Nick Cox

Join Date: Mar 2014

Posts: 35810
#3

06 Oct 2020, 17:22

These questions interest many people, but I guess the main reason #1 didn't get an answer was the absence of a data example we can use easily, and #2 is no different. Please see FAQ Advice #12 for an explanation of why screenshots (images) are not as helpful as you hope, and what to do instead.

#1 has 7 rather queasy medical categories -- in my view as never more than a patient -- over 4 years in the table, and #2 has 11 categories over 5 years. So, let's fake some data of the second size to make things as realistic as possible.

A glance at both tables shows a very common pattern.

There are some frequent categories and rather more infrequent categories, which are going to be hard to tease apart on a standard line graph.

A popular but not always effective remedy is to supply a legend, which takes up a large fraction of the total display and doesn't usually help much.

The code comes first, and then some commentary.

Code:

clear
set obs 55
set seed 2803
egen year = seq(), from(2014) to(2018)
egen cat = seq(), to(11) block(5)
label def cat 1 algebra 2 biology 3 chemistry 4 drama 5 English 6 forestry 7 geography 8 history 9 idiosyncrasy 10 judo 11 kudos
label val cat cat
gen freq = (cat - 4)^2 * runiformint(1, 10)
egen pc = pc(freq), by(year)

* install from SSC
* graph 1
fabplot line pc year, by(cat)

egen median = median(-pc), by(cat)
egen group = group(median cat)

* install from Stata Journal
labmask group, val(cat) decode

* graph 2
fabplot line pc year, by(group, l1title(%, orient(horizontal))) frontopts(lw(thick)) front(connected) xtitle("")

The first graph tried is a fabplot (front and back plot) with (a) a line graph for each series, in front, and (b) all the other line graphs, in back.

See https://www.statalist.org/forums/for...ailable-on-ssc for the fuller story, except that even people who find this interesting will want to skim and skip through a slow and repetitive story.

That's a start, but we can do a lot better.

1. Alphabetical order is a natural default for Stata graphs, but dopey for showing patterns in the data. We should sort the series by magnitude.

2. Each individual series needs more emphasis.

3. A small peeve of mine is that titles like "year" should be cut as obvious. (Other way round, a reader who needs to be told what 2014 to 2018 mean needs even more help than that!)

4. pc is just a short name I thought up, and there the reader does deserve better.

We could reorder by hand, but that is not much fun. I chose to order by median and note that negating the median means that the highest median gets rank 1 from egen, group(). It is possible that two categories have the same median; if so, ties are broken by the corresponding categories. Then we have a little deal to get the value labels of the original categorical variable copied over to be the value labels of the new ordered categorical variable. That is what labmask does. The slightly whimsical name comes from the idea that we give a variable new value labels to be worn like a mask; the mask is what you see.
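
The ordering step can be checked on a tiny example. This is only a sketch with made-up numbers (the three categories and their pc values are invented for illustration): category 2 has the highest median, and categories 1 and 3 tie on the median, so the tie should be broken by the category code.

```stata
* made-up data for illustration only: 3 categories, 2 years each
clear
input cat pc
1 10
1 20
2 50
2 60
3 10
3 20
end

* negate so that the largest median sorts first
egen median = median(-pc), by(cat)

* group() numbers the distinct (median, cat) pairs in sorted order
egen group = group(median cat)

list cat median group, sepby(cat)
```

Here category 2 should come out as group 1, with categories 1 and 3 following as groups 2 and 3 in that order.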

You see that idiosyncrasy scores higher than kudos, but there you go.
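
The fake data above start from one row per year and category. For individual-level data as described in #1, one possible route, offered as a minimal sketch, is to reduce to frequencies with contract and then take percentages within year. The variable names tx_order and YEAR_OF_DIAGNOSIS follow #1; the data themselves, including the years 2004 to 2007 and the category codes 1 to 7, are invented here because no data example was posted.

```stata
* fake individual-level data: one row per patient (invented for illustration)
clear
set seed 2803
set obs 400
gen YEAR_OF_DIAGNOSIS = runiformint(2004, 2007)
gen tx_order = runiformint(1, 7)

* one row per year and category; zero keeps empty combinations with freq 0
contract YEAR_OF_DIAGNOSIS tx_order, zero freq(freq)

* percentage of each category within each year
egen pc = pc(freq), by(YEAR_OF_DIAGNOSIS)

* plain line graph: separate creates pc1 ... pc7, one variable per category
separate pc, by(tx_order) veryshortlabel
line pc1-pc7 YEAR_OF_DIAGNOSIS, sort ytitle("%") xtitle("")
```

The zero option matters when a category is absent in some year, as with "Radiation" at 0% in 2004 in #1: without it that year would simply be missing from the line rather than plotted at zero. Once pc exists, fabplot line pc YEAR_OF_DIAGNOSIS, by(tx_order) should work along the lines of the code above.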
Javier Jaramillo Morales

Join Date: Oct 2020

Posts: 2
#4

07 Oct 2020, 06:48

Thank you so much for the answer, Dr. Cox. I will try this code. Also, sorry about the picture; I will be more careful next time.
Roberto Vidri

Join Date: Mar 2019

Posts: 36
#5

09 Nov 2020, 16:37

Thank you for your post and help, Dr. Cox
Sarah Mj

Join Date: Oct 2023

Posts: 4
#6

23 Nov 2023, 02:12

I want to create a graph like this one.

I want to see the percentage of treatment 1 and 2 over time.

I have the time variables D14, D30, D45, ... stored separately,

and the treatment variable coded as 1 and 2.