
  • Formal tests of normality: are they important?

    Hello!

    I'm conducting some formal tests of normality to see whether I can use ANOVA: I used the Shapiro-Francia test and a skewness test. My data set has N = 2100.

    The Shapiro-Francia test rejects the null hypothesis, and the skewness test suggests that my variables are moderately skewed. I was wondering: are these tests important to run? I've read some people suggesting that no data set is truly normal, especially a large one. Could I therefore just run the ANOVA anyway?

    Thank you in advance!

    Best,

    Cassie

  • #2
    See e.g. https://stats.stackexchange.com/ques...tially-useless for many of the arguments.

    If allowed only "Yes" or "No" as an answer to "Are they important?", I would say "No". I would always fire up a normal quantile plot using -qnorm- (or -qplot- from the Stata Journal), even if, as a matter of curiosity, I also look at the results of a test for normality (and I would mention Doornik-Hansen as a good candidate, if you run any test at all).

    A point missed again and again is that even when normality is an ideal condition for some procedure -- often misleadingly stated as an assumption -- the normality being talked about is (a) about conditional distributions, not marginal distributions, and (b) usually the least important ideal condition.
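
    For concreteness, a minimal sketch of that workflow using the auto data shipped with Stata (price and mpg are just stand-ins for your variables):

    Code:
    sysuse auto, clear
    qnorm price                                  // normal quantile plot: judge departures by eye
    sfrancia price                               // Shapiro-Francia test, as in #1
    mvtest normality price mpg, stats(dhansen)   // Doornik-Hansen omnibus test (here for two variables jointly)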

    Comment


    • #3
      Cassie:
      as an aside to Nick's guidance, why use -anova- when -regress- can do it better? (BTW: normality of the residual distribution is only a weak requirement of OLS.)
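
      To see the equivalence, a quick sketch with the auto data (stand-in variables, not yours):

      Code:
      sysuse auto, clear
      anova price rep78        // classical ANOVA table
      regress price i.rep78    // the same model, fit and reported as a regression
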
      Kind regards,
      Carlo
      (Stata 19.0)

      Comment


      • #4
        Originally posted by Carlo Lazzaro View Post
        Cassie:
        as an aside to Nick's guidance, why use -anova- when -regress- can do it better? (BTW: normality of the residual distribution is only a weak requirement of OLS.)
        Ah, I see. I have been using -regress-, but someone suggested I try ANOVA, so I wanted to give it a go. I didn't know that normality was only a weak requirement of OLS, though. I was wondering: is homoscedasticity an important requirement of OLS?

        Comment


        • #5
          Originally posted by Nick Cox View Post
          See e.g. https://stats.stackexchange.com/ques...tially-useless for many of the arguments.

          If allowed only "Yes" or "No" as an answer to "Are they important?", I would say "No". I would always fire up a normal quantile plot using -qnorm- (or -qplot- from the Stata Journal), even if, as a matter of curiosity, I also look at the results of a test for normality (and I would mention Doornik-Hansen as a good candidate, if you run any test at all).

          A point missed again and again is that even when normality is an ideal condition for some procedure -- often misleadingly stated as an assumption -- the normality being talked about is (a) about conditional distributions, not marginal distributions, and (b) usually the least important ideal condition.
          I see. I apologise if it sounds like I'm asking the same question again and again with different wordings. I'm currently building a regression model and trying to figure out how to check my model's assumptions and fit. Are there any useful tests you would recommend for that? I have seen some things about Cook's distance and testing for heteroscedasticity, but I feel a bit out of my depth, as I'm not sure about its importance.

          Thank you for your time.

          Comment


          • #6
            I apologise if it sounds like I'm asking the same question again and again but with different wordings.
            At the risk of putting words in his mouth, I think Nick Cox's reference to "again and again" was meant at a population scale. That is, this question, or some mild variant thereof, is repeatedly asked by many people on this Forum.

            Concerning normality as a condition for OLS regression, the smaller the sample, the more it matters for getting good standard errors. But the smaller the sample, the less useful any test is for answering the normality question. With even moderate-size samples, regression results will be pretty robust to violations of normality unless the distribution of residuals is highly skewed. And in large samples, normality really is not an issue, because the central limit theorem assures that the sampling distributions of the coefficient estimates will be (asymptotically) normal.
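
            A toy simulation sketch of that point (mine, not a canned routine): draw heavily skewed errors at roughly the sample size in #1 and watch the sampling distribution of the slope come out looking normal anyway.

            Code:
            clear all
            set seed 1234
            program define onedraw, rclass
                drop _all
                set obs 2100                       // roughly the sample size in #1
                gen x = rnormal()
                gen y = 1 + 2*x + (rchi2(1) - 1)   // heavily skewed, mean-zero errors
                regress y x
                return scalar b = _b[x]
            end
            simulate b=r(b), reps(500) nodots: onedraw
            qnorm b    // close to a straight line despite the skewed errors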

            As for heteroscedasticity, the statistical tests for it that I am aware of are all targeted towards some specific form of heteroscedasticity. While I might use them, I generally prefer looking at a residuals vs fitted scatterplot, and perhaps some residuals vs predictor scatterplots. The good news is that if it is present, you can resolve the problem by just using robust standard errors. And also remember that heteroscedasticity does not introduce bias into the estimated coefficients--the effect is, again, on the standard errors. So if you are in a situation where only the coefficient estimates are wanted and the standard errors (and consequently the t-statistic, p-value, and confidence intervals) are irrelevant, then you don't even need to consider the question at all.
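
            In Stata terms, a minimal sketch of those checks (DV and IV as placeholder names, as later in this thread):

            Code:
            regress DV IV
            rvfplot, yline(0)            // residuals vs fitted values
            predict double resid, residuals
            scatter resid IV             // residuals vs a predictor
            regress DV IV, vce(robust)   // robust standard errors, if needed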

            Comment


            • #7
              [Attachment: residuals vs fitted scatterplot (Screenshot 2021-12-23 at 16.59.44.png)]
              Originally posted by Clyde Schechter View Post
              At the risk of putting words in his mouth, I think Nick Cox's reference to "again and again" was meant at a population scale. That is, this question, or some mild variant thereof, is repeatedly asked by many people on this Forum.

              Concerning normality as a condition for OLS regression, the smaller the sample, the more it matters for getting good standard errors. But the smaller the sample, the less useful any test is for answering the normality question. With even moderate-size samples, regression results will be pretty robust to violations of normality unless the distribution of residuals is highly skewed. And in large samples, normality really is not an issue, because the central limit theorem assures that the sampling distributions of the coefficient estimates will be (asymptotically) normal.

              As for heteroscedasticity, the statistical tests for it that I am aware of are all targeted towards some specific form of heteroscedasticity. While I might use them, I generally prefer looking at a residuals vs fitted scatterplot, and perhaps some residuals vs predictor scatterplots. The good news is that if it is present, you can resolve the problem by just using robust standard errors. And also remember that heteroscedasticity does not introduce bias into the estimated coefficients--the effect is, again, on the standard errors. So if you are in a situation where only the coefficient estimates are wanted and the standard errors (and consequently the t-statistic, p-value, and confidence intervals) are irrelevant, then you don't even need to consider the question at all.
              Thank you so much for letting me know! My sample is pretty big (N = 1895), so am I right in saying that this limits the concern about violating the normality-of-residuals assumption? I think I went down this rabbit hole because when I tried a residuals vs fitted scatterplot, I got the result attached above, and it confused me. Is this the kind of graph I should expect?

              Last edited by Cassie Wright; 23 Dec 2021, 10:00.

              Comment


              • #8
                Well, you cannot assess normality from this scatterplot. For that you would use -qnorm- with the residuals. But given a sample size of 1895, any assessment of normality would be a waste of time. The central limit theorem will bring you safely home.
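
                That check would look something like this (DV and IV as placeholders):

                Code:
                regress DV IV
                predict double resid, residuals
                qnorm resid    // points near the reference line suggest approximately normal residuals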

                The lining up of the points on the graph in vertical lines is always seen in models where all of the predictor variables are discrete: there are then only finitely many possible predicted values. So, assuming that all your predictors are discrete, this is not a problem either.

                As for heteroscedasticity, the "thickness" of the band of points appears to be pretty much the same across the range of predicted values--so nothing to worry about on that score.

                What I do perceive in the plot you show is a suggestion that your model is not quite right. Notice that the cloud of points appears to slope somewhat downward from left to right: your model gives higher residuals at low predicted values and lower residuals at high predicted values. Otherwise put, your model is underpredicting when it predicts low values of the outcome and overpredicting when it predicts high values. This kind of pattern is sometimes seen when the model is misspecified, either by omitting some variable that would correct it, or by needing some variable transformed in some way to achieve a better fit. That said, in this case the extent of the problem seems fairly small, and if these results look good enough for practical purposes, I would not invest an enormous amount of effort in trying to fix it. Also, transforming discrete variables doesn't fix this problem anyway. If I were going to pursue this at all, I would look at including more predictors in the model.
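
                If you do want to probe the specification further, a quick sketch (my suggestion, not something discussed above) using Stata's built-in postestimation checks:

                Code:
                regress DV IV
                estat ovtest   // Ramsey RESET test for omitted higher-order terms
                linktest       // checks whether the linear prediction is well specified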

                Comment


                • #9
                  At the risk of putting words in his mouth, I think Nick Cox's reference to "again and again" was meant at a population scale. That is, this question, or some mild variant thereof, is repeatedly asked by many people on this Forum.
                  Absolutely. In fact, the reference set is even wider. I lurk on Cross Validated (and in some other places), and misplaced anxiety (or even paranoia) about not having normal distributions is widespread there.

                  Comment


                  • #10
                    Cassie:
                    as far as I can see, -anova-'s popularity has declined in recent years (except for some repeated-measures studies).
                    Once you have a decent command of the OLS machinery, you almost forget about -anova-.
                    Heteroskedasticity is not intimidating at all; as Clyde wisely pointed out, once detected, -robust- standard errors can manage it.
                    Kind regards,
                    Carlo
                    (Stata 19.0)

                    Comment


                    • #11
                      a bit off the point of the thread (but it has moved anyway): just to note that, at least in some senses, anova remains important as a way of looking at the world; for lots of detail about this, see

                      Gelman, A. (2005), "Analysis of variance: why it is more important than ever", The Annals of Statistics, 33(1): 1-53 (with discussion)

                      A very much shorter version with a much more limited aim:

                      Gelman, A. (2005), "Comment: Anova as a tool for structuring and understanding hierarchical models", Chance, 18(3): 33


                      Comment


                      • #12
                        Originally posted by Carlo Lazzaro View Post
                        Cassie:
                        as far as I can see, -anova-'s popularity has declined in recent years (except for some repeated-measures studies).
                        Once you have a decent command of the OLS machinery, you almost forget about -anova-.
                        Heteroskedasticity is not intimidating at all; as Clyde wisely pointed out, once detected, -robust- standard errors can manage it.
                        Thank you for letting me know. Just to double-check: does that mean I should change my code from:
                        Code:
                        regress DV IV
                        to
                        Code:
                        regress DV IV, robust

                        Comment


                        • #13
                          Originally posted by Rich Goldstein View Post
                          a bit off the point of the thread (but it has moved anyway): just to note that, at least in some senses, anova remains important as a way of looking at the world; for lots of detail about this, see

                          Gelman, A. (2005), "Analysis of variance: why it is more important than ever", The Annals of Statistics, 33(1): 1-53 (with discussion)

                          A very much shorter version with a much more limited aim:

                          Gelman, A. (2005), "Comment: Anova as a tool for structuring and understanding hierarchical models", Chance, 18(3): 33

                          Fascinating, thank you for your input. I shall give it a read!

                          Comment


                          • #14
                            Cassie:
                            correct.
                            Kind regards,
                            Carlo
                            (Stata 19.0)

                            Comment


                            • #15
                              Originally posted by Clyde Schechter View Post
                              At the risk of putting words in his mouth, I think Nick Cox's reference to "again and again" was meant at a population scale. That is, this question, or some mild variant thereof, is repeatedly asked by many people on this Forum.

                              Concerning normality as a condition for OLS regression, the smaller the sample, the more it matters for getting good standard errors. But the smaller the sample, the less useful any test is for answering the normality question. With even moderate-size samples, regression results will be pretty robust to violations of normality unless the distribution of residuals is highly skewed. And in large samples, normality really is not an issue, because the central limit theorem assures that the sampling distributions of the coefficient estimates will be (asymptotically) normal.

                              As for heteroscedasticity, the statistical tests for it that I am aware of are all targeted towards some specific form of heteroscedasticity. While I might use them, I generally prefer looking at a residuals vs fitted scatterplot, and perhaps some residuals vs predictor scatterplots. The good news is that if it is present, you can resolve the problem by just using robust standard errors. And also remember that heteroscedasticity does not introduce bias into the estimated coefficients--the effect is, again, on the standard errors. So if you are in a situation where only the coefficient estimates are wanted and the standard errors (and consequently the t-statistic, p-value, and confidence intervals) are irrelevant, then you don't even need to consider the question at all.
                              Hello! So I've rerun my regression with robust standard errors, and for two of my indicator variables, although they have significant p-values, the CI includes zero. Does this mean that I have not rejected my null hypothesis?

                              Comment
