
  • Formal tests of normality: are they important?

    Hello!

    I'm conducting some formal tests of normality to see whether I can use ANOVA: I used the Shapiro-Francia test and a skewness test. My data set has N = 2100.

    The Shapiro-Francia test rejects the null hypothesis, and the skewness test suggests that my variables are moderately skewed. I was wondering: are these tests important to run? I've read some people suggesting that no data set is truly normal, especially a large one. Could I therefore just run the ANOVA anyway?

    Thank you in advance!

    Best,

    Cassie

  • #2
    See e.g. https://stats.stackexchange.com/ques...tially-useless for many of the arguments.

    If allowed only "Yes" or "No" as an answer to "Are they important?", I would say "No". I would always fire up a normal quantile plot using -qnorm- (or -qplot- from the Stata Journal), even if, as a matter of curiosity, I also look at the results of a test for normality (and I would mention Doornik-Hansen as a good candidate, if you run any test at all).

    A point missed again and again is that even when normality is an ideal condition for some procedure -- often misleadingly stated as an assumption -- the normality being talked about is (a) about conditional distributions, not marginal distributions, and (b) usually the least important ideal condition.
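
    For concreteness, a minimal sketch of that workflow using the auto data shipped with Stata (price and mpg are just stand-ins for your variables):

    Code:
    sysuse auto, clear
    qnorm price                                  // normal quantile plot: judge departures by eye
    sfrancia price                               // Shapiro-Francia test, as in #1
    mvtest normality price mpg, stats(dhansen)   // Doornik-Hansen omnibus test (here for two variables jointly)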

    Comment


    • #3
      Cassie:
      as an aside to Nick's guidance, why use -anova- when -regress- can do it better? (BTW: normality of the residual distribution is only a weak requirement of OLS.)
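
      To see the equivalence, a quick sketch with the auto data (stand-in variables, not yours):

      Code:
      sysuse auto, clear
      anova price rep78        // classical ANOVA table
      regress price i.rep78    // the same model, fit and reported as a regression
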
      Kind regards,
      Carlo
      (Stata 19.0)

      Comment


      • #4
        Originally posted by Carlo Lazzaro View Post
        Cassie:
        as an aside to Nick's guidance, why use -anova- when -regress- can do it better? (BTW: normality of the residual distribution is only a weak requirement of OLS.)
        Ah, I see. I have been using -regress-, but someone suggested I try ANOVA, so I wanted to give it a go. I didn't know that normality was only a weak requirement of OLS, though. I was wondering: is homoscedasticity an important requirement of OLS?

        Comment


        • #5
          Originally posted by Nick Cox View Post
          See e.g. https://stats.stackexchange.com/ques...tially-useless for many of the arguments.

          If allowed only "Yes" or "No" as an answer to "Are they important?", I would say "No". I would always fire up a normal quantile plot using -qnorm- (or -qplot- from the Stata Journal), even if, as a matter of curiosity, I also look at the results of a test for normality (and I would mention Doornik-Hansen as a good candidate, if you run any test at all).

          A point missed again and again is that even when normality is an ideal condition for some procedure -- often misleadingly stated as an assumption -- the normality being talked about is (a) about conditional distributions, not marginal distributions, and (b) usually the least important ideal condition.
          I see. I apologise if it sounds like I'm asking the same question again and again with different wordings. I'm currently building a regression model and trying to figure out how to check my model's assumptions and fit. Are there any useful tests you would recommend for that? I have seen some things about Cook's distance and testing for heteroscedasticity, but I feel a bit out of my depth, as I'm not sure about its importance.

          Thank you for your time.

          Comment


          • #6
            I apologise if it sounds like I'm asking the same question again and again but with different wordings.
            At the risk of putting words in his mouth, I think Nick Cox's reference to "again and again" was meant at a population scale. That is, this question, or some mild variant thereof, is repeatedly asked by many people on this Forum.

            Concerning normality as a condition for OLS regression, the smaller the sample, the more it matters for getting good standard errors. But the smaller the sample, the less useful any test is for answering the normality question. With even moderate-size samples, regression results will be pretty robust to violations of normality unless the distribution of residuals is highly skewed. And in large samples, normality really is not an issue, because the central limit theorem assures that the sampling distributions of the coefficient estimates will be (asymptotically) normal.
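
            A toy simulation sketch of that point (mine, not a canned routine): draw heavily skewed errors at roughly the sample size in #1 and watch the sampling distribution of the slope come out looking normal anyway.

            Code:
            clear all
            set seed 1234
            program define onedraw, rclass
                drop _all
                set obs 2100                       // roughly the sample size in #1
                gen x = rnormal()
                gen y = 1 + 2*x + (rchi2(1) - 1)   // heavily skewed, mean-zero errors
                regress y x
                return scalar b = _b[x]
            end
            simulate b=r(b), reps(500) nodots: onedraw
            qnorm b    // close to a straight line despite the skewed errors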

            As for heteroscedasticity, the statistical tests for it that I am aware of are all targeted towards some specific form of heteroscedasticity. While I might use them, I generally prefer looking at a residuals vs fitted scatterplot, and perhaps some residuals vs predictor scatterplots. The good news is that if it is present, you can resolve the problem by just using robust standard errors. And also remember that heteroscedasticity does not introduce bias into the estimated coefficients--the effect is, again, on the standard errors. So if you are in a situation where only the coefficient estimates are wanted and the standard errors (and consequently the t-statistic, p-value, and confidence intervals) are irrelevant, then you don't even need to consider the question at all.
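
            In Stata terms, a minimal sketch of those checks (DV and IV as placeholder names, as later in this thread):

            Code:
            regress DV IV
            rvfplot, yline(0)            // residuals vs fitted values
            predict double resid, residuals
            scatter resid IV             // residuals vs a predictor
            regress DV IV, vce(robust)   // robust standard errors, if needed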

            Comment


            • #7
              [Attachment: residuals vs fitted scatterplot (Screenshot 2021-12-23 at 16.59.44.png)]
              Originally posted by Clyde Schechter View Post
              At the risk of putting words in his mouth, I think Nick Cox's reference to "again and again" was meant at a population scale. That is, this question, or some mild variant thereof, is repeatedly asked by many people on this Forum.

              Concerning normality as a condition for OLS regression, the smaller the sample, the more it matters for getting good standard errors. But the smaller the sample, the less useful any test is for answering the normality question. With even moderate-size samples, regression results will be pretty robust to violations of normality unless the distribution of residuals is highly skewed. And in large samples, normality really is not an issue, because the central limit theorem assures that the sampling distributions of the coefficient estimates will be (asymptotically) normal.

              As for heteroscedasticity, the statistical tests for it that I am aware of are all targeted towards some specific form of heteroscedasticity. While I might use them, I generally prefer looking at a residuals vs fitted scatterplot, and perhaps some residuals vs predictor scatterplots. The good news is that if it is present, you can resolve the problem by just using robust standard errors. And also remember that heteroscedasticity does not introduce bias into the estimated coefficients--the effect is, again, on the standard errors. So if you are in a situation where only the coefficient estimates are wanted and the standard errors (and consequently the t-statistic, p-value, and confidence intervals) are irrelevant, then you don't even need to consider the question at all.
              Thank you so much for letting me know! My sample is pretty big (N = 1895), so am I right in saying that this limits the concern about violating the normality-of-residuals assumption? I think I went down this rabbit hole because when I tried a residuals vs fitted scatterplot, I got the result attached above, and it confused me. Is this the kind of graph I should expect?

              Last edited by Cassie Wright; 23 Dec 2021, 10:00.

              Comment


              • #8
                Well, you cannot assess normality from this scatterplot. For that you would use -qnorm- with the residuals. But given a sample size of 1895, any assessment of normality would be a waste of time. The central limit theorem will bring you safely home.
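
                That check would look something like this (DV and IV as placeholders):

                Code:
                regress DV IV
                predict double resid, residuals
                qnorm resid    // points near the reference line suggest approximately normal residuals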

                The lining up of the points on the graph in vertical lines is always seen in models where all of the predictor variables are discrete: there are then only finitely many possible predicted values. So, assuming that all your predictors are discrete, this is not a problem either.

                As for heteroscedasticity, the "thickness" of the band of points appears to be pretty much the same across the range of predicted values--so nothing to worry about on that score.

                What I do perceive in the plot you show is a suggestion that your model is not quite right. Notice that the cloud of points appears to slope somewhat downward from left to right: your model gives higher residuals at low predicted values and lower residuals at high predicted values. Otherwise put, your model is underpredicting when it predicts low values of the outcome and overpredicting when it predicts high values. This kind of pattern is sometimes seen when the model is misspecified, either by omitting some variable that would correct it, or by needing some variable transformed in some way to achieve a better fit. That said, in this case the extent of the problem seems fairly small, and if these results look good enough for practical purposes, I would not invest an enormous amount of effort in trying to fix it. Also, transforming discrete variables doesn't fix this problem anyway. If I were going to pursue this at all, I would look at including more predictors in the model.
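
                If you do want to probe the specification further, a quick sketch (my suggestion, not something discussed above) using Stata's built-in postestimation checks:

                Code:
                regress DV IV
                estat ovtest   // Ramsey RESET test for omitted higher-order terms
                linktest       // checks whether the linear prediction is well specified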

                Comment


                • #9
                  At the risk of putting words in his mouth, I think Nick Cox's reference to "again and again" was meant at a population scale. That is, this question, or some mild variant thereof, is repeatedly asked by many people on this Forum.
                  Absolutely. In fact, the reference set is even wider. I lurk on Cross Validated (and in some other places), and misplaced anxiety (or even paranoia) about not having normal distributions is widespread there.

                  Comment


                  • #10
                    Cassie:
                    as far as I can see, -anova-'s popularity has declined in recent years (except for some repeated-measures studies).
                    Once you have a decent command of the OLS machinery, you almost forget about -anova-.
                    Heteroskedasticity is not intimidating at all; as Clyde wisely pointed out, once detected, -robust- standard errors can manage it.
                    Kind regards,
                    Carlo
                    (Stata 19.0)

                    Comment


                    • #11
                      a bit off the point of the thread (but it has moved anyway): just to note that, at least in some senses, anova remains important as a way of looking at the world; for lots of detail about this, see

                      Gelman, A. (2005), "Analysis of variance: why it is more important than ever", The Annals of Statistics, 33(1): 1-53 (with discussion)

                      A very much shorter version with a much more limited aim:

                      Gelman, A. (2005), "Comment: Anova as a tool for structuring and understanding hierarchical models", Chance, 18(3): 33


                      Comment


                      • #12
                        Originally posted by Carlo Lazzaro View Post
                        Cassie:
                        as far as I can see, -anova-'s popularity has declined in recent years (except for some repeated-measures studies).
                        Once you have a decent command of the OLS machinery, you almost forget about -anova-.
                        Heteroskedasticity is not intimidating at all; as Clyde wisely pointed out, once detected, -robust- standard errors can manage it.
                        Thank you for letting me know. Just to double-check: does that mean I should change my code from:
                        Code:
                        regress DV IV
                        to
                        Code:
                        regress DV IV, robust

                        Comment


                        • #13
                          Originally posted by Rich Goldstein View Post
                          a bit off the point of the thread (but it has moved anyway): just to note that, at least in some senses, anova remains important as a way of looking at the world; for lots of detail about this, see

                          Gelman, A. (2005), "Analysis of variance: why it is more important than ever", The Annals of Statistics, 33(1): 1-53 (with discussion)

                          A very much shorter version with a much more limited aim:

                          Gelman, A. (2005), "Comment: Anova as a tool for structuring and understanding hierarchical models", Chance, 18(3): 33

                          Fascinating, thank you for your input. I shall give it a read!

                          Comment


                          • #14
                            Cassie:
                            correct.
                            Kind regards,
                            Carlo
                            (Stata 19.0)

                            Comment


                            • #15
                              Originally posted by Clyde Schechter View Post
                              At the risk of putting words in his mouth, I think Nick Cox's reference to "again and again" was meant at a population scale. That is, this question, or some mild variant thereof, is repeatedly asked by many people on this Forum.

                              Concerning normality as a condition for OLS regression, the smaller the sample, the more it matters for getting good standard errors. But the smaller the sample, the less useful any test is for answering the normality question. With even moderate-size samples, regression results will be pretty robust to violations of normality unless the distribution of residuals is highly skewed. And in large samples, normality really is not an issue, because the central limit theorem assures that the sampling distributions of the coefficient estimates will be (asymptotically) normal.

                              As for heteroscedasticity, the statistical tests for it that I am aware of are all targeted towards some specific form of heteroscedasticity. While I might use them, I generally prefer looking at a residuals vs fitted scatterplot, and perhaps some residuals vs predictor scatterplots. The good news is that if it is present, you can resolve the problem by just using robust standard errors. And also remember that heteroscedasticity does not introduce bias into the estimated coefficients--the effect is, again, on the standard errors. So if you are in a situation where only the coefficient estimates are wanted and the standard errors (and consequently the t-statistic, p-value, and confidence intervals) are irrelevant, then you don't even need to consider the question at all.
                              Hello! So I've rerun my regression with robust standard errors, and for two of my indicator variables, although they have significant p-values, the CI includes zero. Does this mean that I have not rejected my null hypothesis?

                              Comment
