
  • Changing P-value after log transforming IVs in logistic regression

    Hello,

    My question is in regard to a logistic regression with the outcome being purchasing of a food group (0=no, 1=yes). There are binary, categorical and continuous independent variables. Two of the continuous IVs are measures of age (AGE_REF) and income (FWAGEXM). To improve interpretability of the odds ratios for these measures, I decided to log transform them using log base 2 for age (variable name is now log2AGE_REF) and log base 10 for income (now log10FWAGEXM). Upon creating log10FWAGEXM, 912 missing values were generated.

    I have read that log transforming continuous variables in a logistic regression model should not drastically change the p-values for those IVs. This seems to be the case for the age variable (log2AGE_REF); however, the p-value for log10FWAGEXM changed from .001 to .526. Additionally, the odds ratio for the age IV is now more interpretable, but the odds ratio for the income variable has not changed meaningfully.

    I'm attaching a log file that shows two outputs: the first includes the original age and income variables, and the second includes the transformed variables. I'm also including the commands I used to transform.

    Am I missing something here? Or, is it normal to have p-values change so dramatically? Is this happening because of the 912 missing values that were generated?

    Commands:
    generate log2AGE_REF = log(AGE_REF)/log(2)
    generate log10FWAGEXM = log10(FWAGEXM)

    Log Transformations.smcl

    Thanks,
    Ryan

  • #2
    Your basic problem is that your data appear to have 912 observations where FWAGEXM is non-missing but is either zero or negative. Since the log function is only defined for positive arguments, this results in 912 missing values being created, and thus in 912 observations being excluded from your model (N is reduced from 3798 to 2886).

    While log transforming a variable may or may not drastically change the p-value, since you consequently removed from your model the ~25% of the observations with the smallest values for FWAGEXM, I am not surprised by the change in the estimated effect for that variable.
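    A minimal Python sketch of that mechanism (illustrative only; `safe_log10` is a hypothetical helper mimicking Stata's behavior, and the incomes are made up):

```python
import math

def safe_log10(x):
    # Mimic Stata's log10(): a non-positive argument yields missing (here, None).
    return math.log10(x) if x > 0 else None

# Hypothetical incomes: two are non-positive, so their transforms are missing.
incomes = [0.0, -250.0, 1.0, 45000.0, 120000.0]
transformed = [safe_log10(x) for x in incomes]

# Logistic regression drops observations with any missing value (listwise
# deletion), so the estimation sample shrinks by the non-positive count.
n_dropped = sum(t is None for t in transformed)
```

    The same listwise deletion is what reduced your N from 3798 to 2886.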



    • #3
      Thanks for your prompt reply, William. What you noted makes sense. I'm wondering what you think of this workaround: changing all the zero values to 1 and then log transforming as I did previously. This would solve the problem of creating a lot of missing values, and since there is no substantial difference between an income of $0 and $1, the interpretation would be the same.

      Thanks again for your assistance,
      Ryan



      • #4
        Generally speaking, this use of so-called "started logs" (adding 1 or some other small positive constant to x before logging) has a bad reputation, one reason being that the associated slope with respect to x can change a fair amount depending on which small constant is used (e.g., 0.1, 0.5, 1, ...). I once encountered precisely your situation: a binary logistic model with an x that I wanted to log, but which had a lot of 0s. After an unfavorable comment from a journal referee on my use of log(x + 1), I tried different constants (e.g., log(x + 0.5)) and discovered that the estimated slope did indeed change a fair amount. I then tried sqrt(x) instead of log, since it is another simple way to model a relation to x that is concave downward. Besides solving the missing-value problem, I actually got a substantially better fit with sqrt(x) than with log(x + 1). So you might give that approach a try.
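        The sensitivity to the choice of constant is easy to see numerically. This Python sketch (assumed values, not anyone's data) compares the implied change in log(x + c) as x moves from 0 to 1 under several "start" constants, and notes that sqrt needs no constant at all:

```python
import math

# Change in log(x + c) when x moves from 0 to 1, for several "start" constants.
# The local slope at x = 0 is 1/c, so a small c inflates the implied effect.
jumps = {c: math.log(1 + c) - math.log(c) for c in (0.1, 0.5, 1.0)}
# c = 0.1 -> ~2.40, c = 0.5 -> ~1.10, c = 1.0 -> ~0.69

# sqrt(x) is also concave downward but is defined at 0, with no constant to tune.
sqrt_jump = math.sqrt(1) - math.sqrt(0)  # always 1.0
```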



        • #5
          Ryan:
          I second Mike's good advice about avoiding started logs (Tukey JW. Exploratory data analysis. Reading, MA: Addison-Wesley, 1977), which are hard to justify in today's era of powerful computing, which offers sounder transformations of the original data metrics.
          That said, I find it difficult to see why you did not use the natural log transformation for your IVs, and, even more important, I wonder whether those transformations really make your ORs more interpretable.
          Kind regards,
          Carlo
          (Stata 19.0)



          • #6
            Thanks Mike and Carlo, for your responses.

            Carlo, here is my thinking behind using log base 2 and log base 10 for age and income, respectively. The odds ratios for the untransformed IVs are difficult to interpret because the unit change is so small: one year of age and one dollar of income. After transforming, a two-fold change in age and a ten-fold change in income become much easier to understand. For example, after transforming age using log base 2, the odds ratio changed from 1.008 to 1.46. To me, a 46% increase in odds for every two-fold change in age is more understandable than an increase per e-fold (≈2.72×) change under the natural log, but perhaps my interpretation is wrong. My intention with using log base 10 for income was to provide a larger unit change: a ten-fold change in income.
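            For what it's worth, the bases are interconvertible after the fact. This Python sketch (with a hypothetical odds ratio, not one from my output) shows the relation log2(x) = ln(x)/ln(2) working on the OR scale:

```python
import math

# A hypothetical odds ratio per 1-unit change in ln(age), chosen for illustration.
OR_ln = 1.70

# Because log2(x) = ln(x)/ln(2), the OR per doubling of age is OR_ln ** ln(2).
OR_doubling = OR_ln ** math.log(2)

# Round-trip: recover the per-e-fold OR from the per-doubling OR.
OR_back = OR_doubling ** (1 / math.log(2))
```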

            Mike and Carlo, taking the square root of the income variable seems like a good option for resolving the missing-value issue. However, what would be the interpretation of the resulting odds ratio? Would I need to back-transform to obtain an interpretable change in units?

            Thanks again for your assistance,
            Ryan



            • #7
              Ryan:
              thanks for providing more details and clarifications.
              As far as square root transformation is concerned, you may be interested in http://stats.stackexchange.com/quest...-a-logit-model
              Kind regards,
              Carlo
              (Stata 19.0)
