How to add the symmetry axis of the normal distribution when graphing the histograms by groups

Fred Lee

Join Date: Nov 2017
Posts: 473

01 Apr 2019, 00:25

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input double(tenaciousGoalPursuit group)
2.9 2
2.1 2
3.5 2
2.1 1
3.9 2
4 2
3.4 2
2.4 1
3.3 2
3.2 2
3.3 1
3.2 1
2.2 1
3.7 1
4 2
2.8 1
2.9 2
2.8 1
3.2 2
3.9 1
2.8 1
2.7 1
3.5 2
2.9 2
3.5 2
2.5 1
2.6 1
3.5 1
3.4 1
2.8 1
3.2 2
3.4 1
3.3 2
2.7 1
2.9 2
3.1 2
3.3 1
2.5 1
4.5 1
3.2 1
4.5 2
3.5 1
2 1
2.1 1
3.4 2
2.6 2
2.6 2
3.3 2
2.8 1
3.5 2
3.6 1
3.3 2
3 2
3.5 2
2.5 1
4.1 2
2.8 2
3.1 2
2.5 1
3.1 1
2.9 2
3.1 2
3.4 1
2.6 1
1.7 1
3.8 2
3.3 1
2.3 1
3.5 1
2.3 1
2.6 1
2.2 1
2.7 1
3.7 1
2.3 1
3.2 2
3.6 2
3.8 1
3.8 1
2.7 1
3.2 1
3.5 2
3.3 1
4 1
2.9 2
3.4 2
2.4 1
2.5 2
2.8 1
3 1
3.6 2
2.2 1
3.1 1
2.4 1
2.9 2
3.3 1
3.2 2
3.4 1
3.3 1
3 1
end
hist tenaciousGoalPursuit, by(group) normal


Nick Cox

Join Date: Mar 2014

Posts: 35485
#2

01 Apr 2019, 03:53

I use scheme s1color as a default. This is what your histogram looks like:

It has the strengths and the weaknesses of histograms. The main strength is perhaps that anyone who got through a first course in statistics should have a rough idea of what is being shown. The main weakness is that it's hard to know what is a feature of the data and what is an artefact of arbitrary bin widths and bin starts. There could also be different views on whether showing density is a good choice, even though it's just the default and the researcher can choose something else.

In particular, my eye is drawn to the shorter bar near the mean for group 1. Is that a hint of bimodality or just a quirk in the display to be put down to bin alignment?
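One rough way to check whether that shorter bar is a feature of the data or of the binning is to redraw the histograms with different bin choices. The sketch below uses only official histogram options; the particular widths, starts and graph names (h1, h2, h3) are arbitrary assumptions for illustration, not recommendations.

```stata
* Sketch: same data, different (arbitrary) bin widths and bin starts.
* start() must be no greater than the minimum of the data (1.7 here).
histogram tenaciousGoalPursuit, by(group) normal width(0.2) start(1.65) name(h1, replace)
histogram tenaciousGoalPursuit, by(group) normal width(0.5) start(1.5) name(h2, replace)

* Frequency scale instead of the default density scale
histogram tenaciousGoalPursuit, by(group) normal frequency width(0.5) start(1.5) name(h3, replace)
```

If the dip near the mean of group 1 appears with some bin choices and vanishes with others, that points to bin alignment rather than bimodality.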

Side-by-side histograms are not the only possibility, and this layout makes comparison of groups (presumably an interesting and important part of the exercise!) quite hard. One has to look closely to see that the mean of group 2 is a bit higher. Otherwise, the impression is that the distributions are broadly similar.

A key point here is that you have, by modern standards, a small dataset, so there's scope to show all the data, and not just a reduction of it.

What's the role of the normal distribution here? Sometimes there are grounds for thinking that data should be approximately normal. More commonly, it is just a reference distribution.

I would like to persuade you to try something else. Here is one possibility, a normal quantile plot. So, we don't discard the idea of a normal distribution as reference, but we see all the individual values.

For this, you need qplot from the Stata Journal.

Your dataex output in #1 is a little mangled, so I'll repeat it here.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input double(tenaciousGoalPursuit group)
2.9 2
2.1 2
3.5 2
2.1 1
3.9 2
4 2
3.4 2
2.4 1
3.3 2
3.2 2
3.3 1
3.2 1
2.2 1
3.7 1
4 2
2.8 1
2.9 2
2.8 1
3.2 2
3.9 1
2.8 1
2.7 1
3.5 2
2.9 2
3.5 2
2.5 1
2.6 1
3.5 1
3.4 1
2.8 1
3.2 2
3.4 1
3.3 2
2.7 1
2.9 2
3.1 2
3.3 1
2.5 1
4.5 1
3.2 1
4.5 2
3.5 1
2 1
2.1 1
3.4 2
2.6 2
2.6 2
3.3 2
2.8 1
3.5 2
3.6 1
3.3 2
3 2
3.5 2
2.5 1
4.1 2
2.8 2
3.1 2
2.5 1
3.1 1
2.9 2
3.1 2
3.4 1
2.6 1
1.7 1
3.8 2
3.3 1
2.3 1
3.5 1
2.3 1
2.6 1
2.2 1
2.7 1
3.7 1
2.3 1
3.2 2
3.6 2
3.8 1
3.8 1
2.7 1
3.2 1
3.5 2
3.3 1
4 1
2.9 2
3.4 2
2.4 1
2.5 2
2.8 1
3 1
3.6 2
2.2 1
3.1 1
2.4 1
2.9 2
3.3 1
3.2 2
3.4 1
3.3 1
3 1
end

qplot tenacious, over(group) trscale(invnormal(@)) aspect(1) ///
xtitle(standard normal quantile) ytitle(tenacious goal pursuit) ///
yla(1.5(0.5)4.5, grid ang(h)) legend(ring(0) order(2 1) pos(11) col(1))

What do I see here?

Granularity. All measurements are multiples of 0.1.

Small spikes and gaps. A spike at 2.9 for group 2 and a gap at the same value for group 1. If there's a story, you should be able to tell it. Otherwise, it seems much less alarming than on the histograms. If you have more data than you showed, the gap may well be filled in.

Approximate normality. A feature of using a standard normal quantile scale on the horizontal axis is that normal distributions will plot as straight lines.

Group 2 has higher mean but lower SD.

Code:

. tabstat tenacious , by(group) s(n mean sd)

Summary for variables: tenaciousGoalPursuit
     by categories of: group

   group |         N      mean        sd
---------+------------------------------
       1 |        58  2.943103  .5837303
       2 |        42  3.264286  .4621379
---------+------------------------------
   Total |       100     3.078  .5567909
----------------------------------------
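For anyone without qplot installed, a similar normal quantile plot can be sketched with official commands only. This is an illustration under assumptions, not a reproduction of what qplot does internally: the variable names pp and z are made up here, and the plotting position (i - 0.5)/n is one common choice among several. The reference lines use the means and SDs from the tabstat output above, since a normal distribution with mean m and SD s plots as the straight line m + s * z.

```stata
* Sketch: normal quantile plot by hand, using official commands only.
* pp and z are hypothetical names; (i - 0.5)/n is an assumed plotting position.
* The data here have no missing values, so _N is the group sample size.
bysort group (tenaciousGoalPursuit): gen double pp = (_n - 0.5) / _N
gen double z = invnormal(pp)

* Points for each group, plus reference lines mean + SD * z (values from tabstat)
twoway (scatter tenaciousGoalPursuit z if group == 1)              ///
       (scatter tenaciousGoalPursuit z if group == 2)              ///
       (function y = 2.943103 + 0.5837303 * x, range(-2.5 2.5))    ///
       (function y = 3.264286 + 0.4621379 * x, range(-2.5 2.5)),   ///
       legend(order(1 "group 1" 2 "group 2"))                      ///
       xtitle(standard normal quantile) ytitle(tenacious goal pursuit)
```

The line for group 2 sits higher at z = 0 (higher mean) and is less steep (lower SD), which is the same comparison as in the table.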

I didn't answer the question. If you want a vertical line on each histogram at the position of each mean, you will need to add it explicitly. That is a little hard with your chosen command.
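One possible way to add the lines explicitly, offered as a sketch rather than the only route: draw a separate histogram for each group with xline() at that group's mean, then put the graphs together with graph combine. The normal option draws a normal density with the same mean and SD as the data being plotted, so a vertical line at the group mean is the symmetry axis of each curve. The bin settings and the graph names hist1 and hist2 are assumptions for illustration.

```stata
* Sketch: one histogram per group, each with a vertical line at the group mean.
* start(), width() and the graph names are arbitrary choices for illustration.
levelsof group, local(groups)
local graphs
foreach g of local groups {
    quietly summarize tenaciousGoalPursuit if group == `g'
    histogram tenaciousGoalPursuit if group == `g', normal       ///
        xline(`r(mean)') start(1.5) width(0.25)                  ///
        xscale(range(1.5 4.75)) title("group `g'")               ///
        name(hist`g', replace)
    local graphs `graphs' hist`g'
}

* Common axes keep the two panels comparable
graph combine `graphs', ycommon xcommon
```

Fixing start() and width() across groups means both panels use the same bins, which by() would otherwise take care of.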

Last edited by Nick Cox; 01 Apr 2019, 04:00.
Fred Lee

Join Date: Nov 2017

Posts: 473
#3

01 Apr 2019, 04:07

Originally posted by Nick Cox
[#2 quoted in full]

Thanks a lot, Nick! Your response gave me a lot to think about. Maybe I will give up on drawing a vertical line at each mean, since the values can be estimated approximately from the graph.
Nick Cox

Join Date: Mar 2014

Posts: 35485
#4

01 Apr 2019, 04:41

Thanks for the thanks; glad to think it was helpful.

PS: Note that there really is no need to copy all of #2 in replying to it in #3. The point of quotation is to be selective in structuring a reply.
Fred Lee

Join Date: Nov 2017

Posts: 473
#5

01 Apr 2019, 05:46

Aha, thanks for the tip.