
  • ln transformation

    Hello Statalist.
    I am currently taking my exam, and I'm having trouble deciding whether it's worth log-transforming variables.
    If I log-transform a lot of variables, I get a higher adjusted R-squared and better BIC statistics.
    But should I be looking at the histogram to see whether a variable is really skewed, or am I just log-transforming because I'm getting a better fit?

    Best regards

    Theis Wildgaard

  • #2
    Do I understand you to be asking for help on an exam question? Does your instructor/institution permit external consulting on exams?

    • #3
      Yes. It's a 72-hour exam with all aids allowed, so it's perfectly fine to ask questions, as long as we cite your potential answer.

      Theis Wildgaard

      • #4
        Well, it depends a great deal on the specifics of the variables involved and the particular question you are trying to answer. But here are a few general principles you might want to apply in thinking about this.

        1. Throwing transformations and other tweaks at models to boost the adjusted R square ultimately results in over-fitting the noise in the data; it is poor statistical practice. Transformations should be applied when there is a good scientific reason for doing so. (Of course, if the question posed is to torture the data as hard as possible to get the best fitting model you can find, then transform as much as you like.)

        2. Applying log-transformation to a variable that takes on zero or negative values is almost always a bad idea (and trying to "fix it" by using log(x+C) for some chosen C is just as bad, often worse.)

        3. When you have a model y = a + bx, you are asserting that changes in y are proportional to changes in x throughout the range of values. If you model y = a + b ln(x), then you are saying that changes in y are subject to (possibly severe) diminishing returns as x gets large. These are very different models, and in most settings the choice between them is best made on scientific knowledge about the relationship between these variables. If there is no prior scientific knowledge about this, graphical exploration can be very helpful.

        4. Since you are talking about R square here, I suppose that you are working with an OLS regression. In that case, if you are concerned about the validity of inferences based on the regression, the distributional characteristics of the variables in the model are irrelevant. It is the distribution of the residuals that matters.
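        Points 3 and 4 can be made concrete with a small simulation. This is a sketch in Python (not Stata, which Statalist readers would normally use); the data are simulated and the variable names are mine, purely for illustration. With the *same* dependent variable, the two specifications are directly comparable via their residuals:

        ```python
        import numpy as np

        rng = np.random.default_rng(0)

        # Simulated data where y genuinely follows a diminishing-returns
        # (logarithmic) relationship with x, plus noise.
        x = rng.uniform(1, 100, size=200)
        y = 2.0 + 1.5 * np.log(x) + rng.normal(0, 0.3, size=200)

        def ols_fit(design, y):
            """Least-squares fit; returns coefficients and residuals."""
            coef, *_ = np.linalg.lstsq(design, y, rcond=None)
            resid = y - design @ coef
            return coef, resid

        X_lin = np.column_stack([np.ones_like(x), x])          # y = a + b*x
        X_log = np.column_stack([np.ones_like(x), np.log(x)])  # y = a + b*ln(x)

        coef_lin, resid_lin = ols_fit(X_lin, y)
        coef_log, resid_log = ols_fit(X_log, y)

        # Same dependent variable, so residual sums of squares ARE comparable,
        # and it is these residuals (not the marginal distribution of x or y)
        # whose behavior matters for OLS inference.
        rss_lin = resid_lin @ resid_lin
        rss_log = resid_log @ resid_log
        print(rss_log < rss_lin)  # the log specification fits this data better
        ```

        Here the log model wins because the data were generated that way; with real data, that is exactly the kind of substantive question graphical exploration and subject-matter knowledge should settle.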

        Hope this helps.

        • #5
          Thank you Clyde.
          That helped a great deal.
          It was exactly what I was worried about: torturing the data. I could get a higher adjusted R-squared, but I might lose a lot of interpretability.
          I did look at some of the graphical displays, and it seemed that a log transformation might be needed. But then I might lose adjusted R-squared. So I guess I'll have to think about doing a proper statistical analysis or settling for a "decent" answer.
          But thanks, you gave me something to look into!

          Theis Wildgaard.

          • #6
            Dear Theis,

            Keep in mind that for models with different dependent variables (e.g., logged and not logged) the R2, AIC, and BIC are not comparable.
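            This can be demonstrated numerically. The sketch below is in Python (not Stata) with simulated data, purely as an illustration: merely rescaling the dependent variable leaves the fit unchanged but shifts a likelihood-based BIC by a constant, so BIC values computed on different dependent variables (e.g., y versus ln y) are not on a common scale.

            ```python
            import numpy as np

            rng = np.random.default_rng(1)
            n = 100
            x = rng.uniform(1, 10, n)
            y = 5 + 2 * x + rng.normal(0, 1, n)

            def bic(y, design):
                """Gaussian-likelihood BIC for an OLS fit (up to a constant)."""
                coef, *_ = np.linalg.lstsq(design, y, rcond=None)
                resid = y - design @ coef
                k = design.shape[1]
                m = len(y)
                return m * np.log(resid @ resid / m) + k * np.log(m)

            X = np.column_stack([np.ones(n), x])

            # Rescaling y leaves the model fit identical (same R-squared),
            # yet shifts BIC by exactly n * ln(1/1000^2) -- a pure units effect.
            b1 = bic(y, X)
            b2 = bic(y / 1000, X)
            print(b2 - b1)  # = -2 * n * ln(1000), from the change of units alone
            ```

            A log transformation of the dependent variable is a nonlinear change of units, so the same caveat applies with even more force: the likelihoods live on different scales unless a Jacobian adjustment is made.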

            Good luck with the exam.

            Joao

            • #7
              Thanks a lot for the help.
              I decided to go both ways: I did one version where I tweaked the data with ln to get the highest adjusted R-squared and lowest BIC, and one where I went with what I thought was the best statistical analysis. That way I tried to show that both approaches are viable, but each comes with costs.

              I decided not to ln-transform the dependent variable because of that, Joao; our professor told us about the dangers of doing exactly that.

              Theis Wildgaard.
