
  • Is it acceptable to log transform an independent variable even if the residuals of the non-transformed model are normal?

    I have a model with one dependent variable and 7 independent variables. When the model is run without transformations, the Q-Q plot of the residuals appears normal, and the Shapiro-Wilk test does not reject normality.

    Our main independent variable of interest, however, has a p-value of 0.056. The histogram of this variable is highly right-skewed. When the predictor is log-transformed, its p-value is 0.000.

    Is it acceptable to log transform an independent variable when the Q-Q plot of the residuals of the non-transformed model does not indicate problems with normality?

  • #2
    Added variable plots can help you decide. In my view, the main aim of regression is not to produce normal residuals or to achieve low P-values but to model the systematic structure in the data.



    • #3
      The appropriateness of log-transforming a variable (whether dependent or independent) in a regression model has nothing to do with normality of residuals. It has to do with whether the model is properly specified. Ideally, the use of such transformation is dictated by prior theory about the relationships among the variables. If there is no theory to guide you, graphical exploration of the relationship between the variables, or looking at fitted vs observed plots both ways will tell you which model is appropriate. Note that if the range of values of the variable is somewhat limited, then both the untransformed and log-transformed models may appear to fit the data equally well, and in that case either could be used for that range of data.

      All of that said, why are you wasting your time and pixels looking at normality of residuals? Unless you are trying to do normal-theory inference from a regression carried out on a small data set, this is pointless. The central limit theorem implies that for large samples, the regression coefficients will asymptotically have the mean and standard error that OLS estimation gives you, and will be approximately normally distributed around the population coefficient even if the residual distribution is nowhere near normal. If you are dealing with a small sample, then normality matters, but you should also be aware that the Shapiro-Wilk test will not be nearly powerful enough to give you any useful information about the validity of inferences from OLS regression results. So basically there are two possible situations: you have a large sample and normality is irrelevant, or you have a small sample and there is no useful way to identify normality or the lack thereof in your sample.
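
To make the central limit theorem point concrete, here is a small simulation sketch in Python (standard library only; not Stata). Everything in it is an assumption chosen for illustration: a single regressor, n = 133 to match the sample size mentioned later in the thread, a true slope of 0.5, and strongly right-skewed errors drawn from a centered exponential distribution. It suggests, rather than proves, that the OLS slope estimates stay close to symmetric around the true value even though the errors are far from normal.

```python
import random
import statistics


def skewness(values):
    """Sample skewness: third central moment over the cubed standard deviation."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return sum((v - m) ** 3 for v in values) / (len(values) * s ** 3)


def ols_slope(x, y):
    """Slope of the least-squares line of y on x (with an intercept)."""
    xbar = statistics.fmean(x)
    ybar = statistics.fmean(y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return sxy / sxx


def simulate_slopes(n=133, reps=2000, true_slope=0.5, seed=12345):
    """Refit the regression on `reps` samples that share one fixed design."""
    rng = random.Random(seed)
    x = [rng.uniform(0, 10) for _ in range(n)]  # hypothetical regressor values
    slopes, errors = [], []
    for _ in range(reps):
        # Exponential(1) minus 1: mean zero, heavily right-skewed errors.
        e = [rng.expovariate(1.0) - 1.0 for _ in range(n)]
        y = [1.0 + true_slope * xi + ei for xi, ei in zip(x, e)]
        slopes.append(ols_slope(x, y))
        errors.extend(e)
    return slopes, errors


slopes, errors = simulate_slopes()
print("skewness of the errors:         ", round(skewness(errors), 2))
print("skewness of the slope estimates:", round(skewness(slopes), 2))
print("mean of the slope estimates:    ", round(statistics.fmean(slopes), 3))
```

Changing `n` to something much smaller, or swapping in a more extreme error distribution, is a quick way to see how far this approximation can be pushed.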



      • #4
        Dear Kristen,

        Just to add to the excellent advice provided above, I would say that if you care about y you should not estimate a model for ln(y). Indeed, knowing about features of the conditional distribution of ln(y) in general gives you no information about the conditional distribution of y, and therefore models for ln(y) do not give you information about what you want. That is, the fact that the regressor of interest is important in a regression where ln(y) is the dependent variable does not in any way imply that this regressor is important to explain y.

        Best wishes,

        Joao



        • #5
          Dear Clyde Schechter, just one clarification: since most reviewers demand to see normality test results for the residuals when OLS is used, how many observations constitute a large enough sample?
          Thanks,
          Anat



          • #6
            Hi all,

            Thank you for your responses. I should add a few clarifications. Our sample size is 133. We are considering a transformation of one of our predictors, not the outcome variable. The Q-Q plot that I looked at had the residuals on the y-axis and the inverse normal on the x-axis. I will also review the predicted and observed plots for the untransformed and transformed models and let you know what I find.

            Thanks,

            Kristen



            • #7
              The specific question is whether using log x is a better idea than using x for that one predictor. Quantile plots of residuals and plots of predicted versus observed values are blunt tools for addressing that; added variable plots, as already urged, are more nearly dedicated to the problem.
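
For readers unsure what an added variable plot shows, the sketch below (Python, standard library only; not Stata's avplot itself) builds the plotted coordinates by hand for a toy model with two regressors. The data, coefficients, and function names are made up for illustration. The y-coordinates are residuals from regressing the outcome on the other regressor, the x-coordinates are residuals from regressing the predictor of interest on the other regressor, and the least-squares line through those points has the same slope as the predictor's coefficient in the full regression.

```python
import random


def mean(values):
    return sum(values) / len(values)


def simple_ols(x, y):
    """Intercept and slope of the least-squares line of y on x."""
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


def residuals(x, y):
    """Residuals of y after regressing it on x."""
    intercept, slope = simple_ols(x, y)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def added_variable_points(y, x1, x2):
    """Coordinates of the added variable plot for x1, adjusting for x2."""
    return residuals(x2, x1), residuals(x2, y)


def coef_x1_full_model(y, x1, x2):
    """Coefficient on x1 in the regression of y on x1 and x2 (closed form)."""
    m1, m2, my = mean(x1), mean(x2), mean(y)
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    return (s22 * s1y - s12 * s2y) / (s11 * s22 - s12 ** 2)


# Made-up data: x1 is correlated with x2, and y depends on both.
rng = random.Random(2017)
n = 133
x2 = [rng.uniform(0, 10) for _ in range(n)]
x1 = [0.5 * b + rng.gauss(0, 1) for b in x2]
y = [2.0 + 1.5 * a - 0.7 * b + rng.gauss(0, 1) for a, b in zip(x1, x2)]

e_x1, e_y = added_variable_points(y, x1, x2)
av_slope = simple_ols(e_x1, e_y)[1]
full_coef = coef_x1_full_model(y, x1, x2)
print("slope in the added variable plot:", av_slope)
print("coefficient in the full model:   ", full_coef)
```

The two printed numbers should agree up to floating-point error, which is why the plot isolates the contribution of one predictor after the others have been accounted for.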



              • #8
                Re #5: maybe in your discipline users ask for that, not in mine. And the specific answer would depend on just how non-normal the residuals are. For the distributions encountered in my line of work, as a rule of thumb it only takes several dozen observations to provide adequate normality for the regression coefficients. If you work with extremely skew distributions, you might need more.



                • #9
                  First of all, my apologies to Kristen Ojo for misreading your post. In your case I would just estimate the model with ln(x) and look for signs of functional form misspecification; if you cannot find any, it is OK.

                  Regarding normality and sample size, the basic rule is that the square of the number of parameters divided by the sample size should be "small". You have 7 regressors and presumably an intercept, so 8 parameters; 8^2/133 = 64/133 ≈ 0.48, so I would say that normality of the errors is not a major issue, unless you are using an estimator that explicitly requires normality (e.g., if you have censoring or truncation).

                  Joao



                  • #10
                    Joao, can you provide any references for that rule of thumb in #9? I've not heard that one before. Thanks.

                    Bruce
                    --
                    Bruce Weaver
                    Email: [email protected]
                    Web: http://sites.google.com/a/lakeheadu.ca/bweaver/
                    Version: Stata/MP 18.0 (Windows)



                    • #11
                      Hi Bruce Weaver,

                      It is not a rule-of-thumb; the reference is:

                      Portnoy, S. (1988). "Asymptotic Behavior of Likelihood Methods for Exponential Families when the Number of Parameters Tends to Infinity," Annals of Statistics, 16, 356-366.

                      All the best,

                      Joao



                      • #12
                        Re #11, thank you, Joao.



                        • #13
                          Less formal than the suggestions offered above, but still useful to bear in mind, is the wise—and presumably tongue-in-cheek—advice offered by Art Goldberger in A Course in Econometrics: "Most of the time in econometric analysis, when n is close to zero, it is also far from infinity."

                          As Goldberger's observation provides no practical guidance, however, I would defer to Joao's recommendation that k^2/n be "small."



                          • #14
                            I am reminded very obliquely of a serious comment in a pedology textbook that soil is anything so defined by a competent authority.



                            • #15
                              Thank you for your input on sample size.

                              I ran "avplot" after running the "regress" command for both the untransformed and log-transformed models. By "log-transformed model" I mean the model with 7 independent variables, one of which was transformed as ln(x+1). I had to add 1 because some of the values were 0.
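
As a sketch of the comparison described above (Python, standard library only; not the actual Stata session or data), the snippet below builds a made-up right-skewed count predictor that includes zeros, applies ln(x+1) through `math.log1p`, and fits the outcome on the raw and on the transformed predictor. Because the outcome is the same in both fits, their R-squared values can be compared directly. The data-generating model, which is linear in ln(x+1), is an assumption, so the ranking it produces says nothing about the real data.

```python
import math
import random


def r_squared(x, y):
    """R-squared from the least-squares regression of y on x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    return sxy ** 2 / (sxx * syy)


rng = random.Random(133)
n = 133
# Hypothetical right-skewed predictor; truncating to integers creates zeros.
x = [int(rng.lognormvariate(0.5, 1.2)) for _ in range(n)]
# ln(x + 1) is defined at x = 0, where ln(x) would not be.
ln_x1 = [math.log1p(xi) for xi in x]
# Assumed data-generating model: y is linear in ln(x + 1) plus noise.
y = [1.0 + 2.0 * v + rng.gauss(0, 0.5) for v in ln_x1]

print("zeros in x:", x.count(0))
print("R-squared, raw x:    ", round(r_squared(x, y), 3))
print("R-squared, ln(x + 1):", round(r_squared(ln_x1, y), 3))
```

With several predictors, as in the model discussed here, the same comparison would be made on the full regressions rather than on these one-predictor fits.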

                              I have attached the avplots and think you'll agree that the points in the avplot of the transformed model are more evenly scattered around the trend line, that is, closer to homoscedastic. Let me know if you agree and would be inclined to use the transformed model.

                              The first graph (left) is the avplot using the untransformed independent variable. The second graph (right) is the avplot using the transformed independent variable.

                              [Image: AVplot for untransformed model edited.png]
                              [Image: AVplot transformed model edited.png]

                              Last edited by Kristen Ojo; 13 Oct 2017, 14:17. Reason: Graphs did not appear as I intended.
