Overlaying normal distribution graphs to compare outcomes between genders

Stat Olivia

Join Date: Mar 2017

Posts: 4
#1

04 Apr 2017, 03:18

Hello,

In order to compare the distribution of math scores between female and male students, we aim to plot bell curves of the results for each gender and overlay them in a single graph with a single set of axes. However, the closest we have come to accomplishing this is to get two separate graphs beside each other in one file. The "twoway" command seems unable to provide normal distribution graphs and smooth lines. How do we achieve our aim?

This is our current code:

Code:

clear
input float(mathScoreGirl mathScoreBoy)
-.16130325 .
. -1.8561664
. -.3401746
-1.4119294 .
-.06797387 .
1.1607436 .
. .01306791
.3450717 .
-3.188828 .
-1.4759908 .
end

histogram mathScoreGirl, percent normal /*
*/ color(none) lwidth(none) saving(DistrGirls, replace)
histogram mathScoreBoy, percent normal /*
*/ color(none) lwidth(none) saving(DistrBoys, replace)
graph combine DistrGirls.gph DistrBoys.gph, ycommon /*
*/ saving(DistrComp, asis replace)

Thank you in advance!

Last edited by Stat Olivia; 04 Apr 2017, 03:31.
Tags: graph, line, normal distribution, overlay, twoway
Nick Cox

Join Date: Mar 2014

Posts: 35698
#2

04 Apr 2017, 05:11

Stat Olivia: The request to re-register using a full real name still stands. http://www.statalist.org/forums/foru...t-observations #4

You can get what I think you are asking for. It's just very hard to optimise:

1. Superimposing histograms means that one set of bars may often occlude the other. Making the bars transparent solves this problem only to create others.

2. Although still very popular, binning is arbitrary (which width? which start?) and obscures any fine structure of interest and importance as well as noise.

Here is some sample code. I won't show the graph, as it isn't interesting.

Code:

clear
input float(mathScoreGirl mathScoreBoy)
-.16130325 .
. -1.8561664
. -.3401746
-1.4119294 .
-.06797387 .
1.1607436 .
. .01306791
.3450717 .
-3.188828 .
-1.4759908 .
end

twoway histogram mathScoreGirl, ///
bfc(none) blc(orange) width(0.5) start(-3.5) ///
|| function normalden(x), ra(mathScoreGirl) lc(orange) ///
|| histogram mathScoreBoy, bfc(none) blc(blue) width(0.5) start(-3.5) ///
|| function normalden(x + 0.5), lcolor(blue) ra(mathScoreGirl) ///
legend(order(1 "Girls" 3 "Boys")) xtitle(Mathematics scores) ytitle(Probability density)

Many people will have different favourites for this problem. Kernel density estimation might be one. I was once a fan, now am not so keen. You need not care about that, and in any case others are free to push it. My favourite would be a quantile plot with a transformed probability scale such that a normal distribution shows as a straight line. qnorm will do separate graphs, but superimposition is likely to work better for a problem like yours and for that you could use qplot from the Stata Journal. See also http://stats.stackexchange.com/quest...-sample-t-test and/or search qplot in Stata for references.

Code:

gen female = mathScoreGirl < .
label def female 0 Boys 1 Girls
label val female female
gen mathScore = max(mathScoreGirl, mathScoreBoy)
qplot mathScore, trscale(invnormal(@)) over(female) aspect(1)

sysuse auto, clear
qplot mpg, trscale(invnormal(@)) over(foreign) aspect(1) ///
xtitle(standard normal deviate) legend(col(1) order(2 1) pos(11) ring(0))

As before, I won't show the graph for your data: your example served its purpose of showing us your variables and data structure, but the sample is too small to be interesting. I do show the last example for mpg from the auto dataset.

The key principles are

1. A (perfect) sample from a normal distribution falls exactly on a straight line.

2. Samples with different means have different intercepts.

3. Samples with different SDs have different slopes.

4. Curvature may be suggestive of the need for transformation.

5. Fine structure may be discernible as each data point is plotted separately. (Here, it's the granularity whereby mpg is reported as integers.)
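
Principles 1 to 3 can be checked on simulated data. What follows is a minimal sketch, not part of the original example: the seed, the sample size, the two sets of means and SDs, and the variable names group, y, p and z are all arbitrary choices made for illustration. It uses only official Stata commands and the plotting position (i - 0.5)/n, so it approximates what qplot draws rather than reproducing it.

```stata
* Illustration only: two simulated normal samples (all values are assumptions)
clear
set seed 2017
set obs 200
gen group = mod(_n, 2)

* group 1: mean 0, SD 1; group 0: mean 1, SD 2
gen y = cond(group, rnormal(0, 1), rnormal(1, 2))

* within each group, sort by y and convert ranks to standard normal deviates
bysort group (y): gen p = (_n - 0.5) / _N
gen z = invnormal(p)

* each sample should lie near a straight line: intercept ~ mean, slope ~ SD
twoway (scatter y z if group == 0) (scatter y z if group == 1), ///
    legend(order(1 "mean 1, SD 2" 2 "mean 0, SD 1")) ///
    xtitle(standard normal deviate) ytitle(simulated score) aspect(1)
```

The steeper cloud of points is the sample with the larger SD, and the vertical offset at a deviate of zero reflects the difference in means.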
Maarten Buis

Join Date: Mar 2014

Posts: 3456
#3

04 Apr 2017, 05:17

Plotting normal distributions as if they were empirical results (which is what you seem to want to do) is often a bad idea. It suggests a structure that is not actually observed. I suspect that this is why it is not implemented as a standard graph. That does not mean that it is impossible. Below is an example. But this does not mean that I recommend doing this (I don't).

Code:

// prepare example data
sysuse nlsw88, clear
gen lnwage = ln(wage)

// collect means and standard deviations
sum lnwage if race == 1
local m_w = r(mean)
local sd_w = r(sd)

sum lnwage if race == 2
local m_b = r(mean)
local sd_b = r(sd)

// plot various smooth lines
twoway function white = normalden(x,`m_w',`sd_w') , range(0 4) || ///
function black = normalden(x,`m_b',`sd_b') , range(0 4)
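
The same steps could be applied to the variables posted in #1. This is only a sketch under assumptions: the local macro names, the range(-4 2) chosen to cover the posted values, the colours and the legend text are illustrative and not taken from the thread. With only 7 and 3 observations the fitted curves say very little, and the caveat above applies in full.

```stata
* Data as posted in #1
clear
input float(mathScoreGirl mathScoreBoy)
-.16130325 .
. -1.8561664
. -.3401746
-1.4119294 .
-.06797387 .
1.1607436 .
. .01306791
.3450717 .
-3.188828 .
-1.4759908 .
end

* collect the mean and SD for each gender
summarize mathScoreGirl
local m_g = r(mean)
local sd_g = r(sd)

summarize mathScoreBoy
local m_b = r(mean)
local sd_b = r(sd)

* overlay the two fitted normal densities on one set of axes
* (range and colours are assumptions for illustration)
twoway (function normalden(x, `m_g', `sd_g'), range(-4 2) lcolor(orange)) ///
    (function normalden(x, `m_b', `sd_b'), range(-4 2) lcolor(blue)), ///
    legend(order(1 "Girls" 2 "Boys")) ///
    xtitle(Mathematics scores) ytitle(Probability density)
```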

---------------------------------
Maarten L. Buis
University of Konstanz
Department of history and sociology
box 40
78457 Konstanz
Germany
http://www.maartenbuis.nl
---------------------------------
Nick Cox

Join Date: Mar 2014

Posts: 35698
#4

04 Apr 2017, 05:27

Maarten's post seems very consistent with mine. We have the advantage of knowing each other's views on many issues over several years. He picked up what I did not, that you are suppressing the actual histograms and just want the normal curves. I have to sign up to not recommending that at all.
Nick Cox

Join Date: Mar 2014

Posts: 35698
#5

17 Sep 2019, 04:51

Here's an example revisiting #2 for paired data. I've used one of Stata's own datasets so that anyone interested can reproduce this easily after installing qplot from the Stata Journal website.

The example is trivial but the ideas here of plotting the differences so that the data can be checked and the mean compared with zero are more general.

Code:

webuse fuel, clear
list
gen diff = mpg2 - mpg1
su diff
qplot diff, trscale(invnormal(@)) aspect(1) ///
xtitle(standard normal deviate) yla(, ang(h)) ///
yli(`r(mean)') ytitle("difference, mpg2 - mpg1") ///
ymtick(0, grid glc(gs12)) text(`r(mean)' -1.5 "mean", place(12))
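
For anyone without qplot installed, a rough equivalent using official commands only might look like the sketch below. It is an assumption about what a reader may want as a complement, not a replacement for the graph above: qnorm puts the data on the horizontal axis and draws its own reference line, and ttest provides the formal comparison of the mean difference with zero.

```stata
* Official commands only; same dataset and difference as above
webuse fuel, clear
gen diff = mpg2 - mpg1

* one-sample t test of the mean difference against zero
ttest diff == 0

* normal quantile plot of the differences
qnorm diff
```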