EPI STUDIES - log transform data

Clyde Schechter

Join Date: Apr 2014

Posts: 30119
#16

01 Aug 2017, 19:26

Table 1: the glm using family gaussian link identity, however these could now be considered "incorrect" as the log transform did not normalise the data.

The fact that the log transform did not normalize the data says nothing either way about the correctness of the results. I've been making the point that normalizing the variables is pointless and unnecessary.

The use of log-transformation or log-link is correct or not to the extent that it reflects the actual data generating model. If the effect of an increase in the exposure variable is to multiply the outcome by a fixed factor, then a log-linked or log-transformed model is correct. If the effect of an increase in the exposure variable is to add a fixed increment to the outcome, then a log-linked or log-transformed model is incorrect. One might get a sense of that by looking at the expected outcomes when the exposure = 0, 1, and 2. If the expected outcomes go up like a geometric sequence (each step multiplies by the same ratio), then logging makes sense. If they look more like an arithmetic sequence (each step adds the same amount), then an untransformed regression (or identity link) would be more appropriate.
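As a rough illustration of that check, here is a minimal Python sketch. The expected outcomes are made-up numbers, not taken from any model in this thread, and the helper name `describe_pattern` is hypothetical.

```python
def describe_pattern(y0, y1, y2):
    """Return successive differences and ratios of the expected outcomes
    at exposure = 0, 1, 2. All three values must be positive."""
    diffs = (y1 - y0, y2 - y1)
    ratios = (y1 / y0, y2 / y1)
    return diffs, ratios

# Hypothetical expected outcomes (illustrative numbers only).
additive = describe_pattern(50.0, 60.0, 70.0)        # same increment each step
multiplicative = describe_pattern(50.0, 60.0, 72.0)  # same ratio each step

print("additive:", additive)
print("multiplicative:", multiplicative)
```

Roughly constant differences point toward an untransformed (identity link) model; roughly constant ratios point toward a log-linked or log-transformed one. With real estimates neither pattern will be exact, so this is only an eyeball check.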

The two models you show, a linear regression on log-transformed data and a log-linked regression on untransformed data, are related but different models. The former estimates the expectation of log(outcome); the latter estimates log(expectation of outcome). Since logarithm is a non-linear function, these two expressions are different. So there is no guarantee that the two models will give closely matching results. Nonetheless, they might well be fairly close.
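To make the distinction concrete, here is a small Python sketch with three made-up outcome values (an assumption for illustration, not data from this thread). It compares the mean of the logs with the log of the mean.

```python
import math

# Made-up positive outcomes, chosen only to make the arithmetic easy.
y = [1.0, 10.0, 100.0]

# What a regression on log(y) targets: the expectation of log(outcome).
mean_of_logs = sum(math.log(v) for v in y) / len(y)

# What a log-linked model targets: the log of the expectation of outcome.
log_of_mean = math.log(sum(y) / len(y))

print(mean_of_logs)  # equals log(10), the log of the geometric mean
print(log_of_mean)   # equals log(37), the log of the arithmetic mean
```

Because the logarithm is concave, the mean of the logs can never exceed the log of the mean, and the gap grows as the outcome becomes more spread out.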

I have to say that, reading your outputs, to me the models are saying more or less the same thing. They show minor quantitative differences, but nothing that strikes me as substantial. In particular, in both the crude and adjusted models, the coefficient in the log-transformed model lies well within the confidence limits of the log-linked model and vice versa. Given the rather loose precision with which your data estimate these models' parameters, they are not really distinguishable from each other. The models are really quite consistent with each other. The difference in principle between these models is subtle and it would take sharp estimation to distinguish them with confidence: a model based on a noisy measure like an FFQ has little hope of doing that.
Melissa Bujtor

Join Date: Jul 2017

Posts: 29
#17

01 Aug 2017, 20:10

We are assuming a linear association between the predictor variable and the outcome, and both would be modeled as continuous; e.g., we want to see what a 1-unit increase in the exposure does to the dietary intake (decreases or increases the g/day).

Our issue is that the dependent variable, being dietary intake, is:

- positively skewed
- and, being dietary intake data, has zeros present.

So I am trying to understand and justify the use of a GLM with gamma family, and to choose the right link to use.

Then, how do we interpret the beta in the output of the gamma GLM? I.e., is it that every one-unit increase in the exposure corresponds to an x-unit increase or decrease in the outcome variable?

Thanks again, Clyde. I am enjoying trying to get to the bottom of this to bring everyone in the team into the new age!
Clyde Schechter

Join Date: Apr 2014

Posts: 30119
#18

01 Aug 2017, 20:37

If you believe that the relationship between predictor and outcome is linear, then there is no basis for either a log-link or a log-transformation. The skewness of the distribution doesn't change that. Just use ordinary linear regression.

If you were looking to model a multiplicative effect of predictor on outcome, then either a log-link or a log-transformation could be appropriate in general, but with zeroes, the log-transformation is ruled out, and only the log-link model remains.
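The point about zeroes can be seen in a short Python sketch. The intake values and the helper name `can_log_transform` are hypothetical, chosen only for illustration.

```python
import math

# Hypothetical dietary intakes in g/day, including zeroes.
intake = [0.0, 0.0, 12.5, 30.0, 80.0]

def can_log_transform(values):
    """Return True only if log() is defined for every observation."""
    try:
        for v in values:
            math.log(v)  # math.log(0.0) raises ValueError
        return True
    except ValueError:
        return False

transform_ok = can_log_transform(intake)

# A log link takes the log of the expected outcome, not of each observation,
# so zeroes are not a problem as long as the mean itself is positive.
mean_intake = sum(intake) / len(intake)
log_of_mean = math.log(mean_intake)

print(transform_ok)  # False
print(mean_intake)   # 24.5
print(log_of_mean)
```

This only shows why the log of each observation fails while the log of the mean does not; it is not a fitted model.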
Melissa Bujtor

Join Date: Jul 2017

Posts: 29
#19

01 Aug 2017, 20:55

But with linear regression, we need to have normally distributed data, do we not? Which we don't have... #confused
Clyde Schechter

Join Date: Apr 2014

Posts: 30119
#20

01 Aug 2017, 22:51

But with linear regression, we need to have normally distributed data do we not?

No, you do not. That's what I've been saying since #2, and especially clearly in #4. No, you do not need normally distributed variables. That is an urban legend. Please re-read #2 and #4 with the understanding that the main message there is that no, you do not need normally distributed variables. The reference to Greene is all about why you don't need normally distributed anything in a linear regression with a large sample.
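For anyone who wants to see this rather than take it on faith, here is a small simulation sketch in Python. Everything in it is an assumption made for illustration: the sample size, the true intercept and slope, and the choice of a re-centred exponential distribution as the skewed error. It fits ordinary least squares by the usual closed-form formulas.

```python
import random

random.seed(12345)  # fixed seed so the run is reproducible

n = 5000
true_intercept, true_slope = 1.0, 2.0  # assumed values for the simulation

x = [random.uniform(0, 10) for _ in range(n)]
# Strongly right-skewed, non-normal errors: exponential(1) shifted to mean 0.
errors = [random.expovariate(1.0) - 1.0 for _ in range(n)]
y = [true_intercept + true_slope * xi + ei for xi, ei in zip(x, errors)]

# Ordinary least squares for a single predictor, in closed form.
mean_x = sum(x) / n
mean_y = sum(y) / n
sxx = sum((xi - mean_x) ** 2 for xi in x)
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

print(round(slope, 3), round(intercept, 3))
```

In a sample this large the estimated slope and intercept land close to the assumed true values even though the errors are far from normal. This is one simulated example under those assumptions, not a proof, and it says nothing about small samples.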
Melissa Bujtor

Join Date: Jul 2017

Posts: 29
#21

01 Aug 2017, 23:05

Got it! Thank you for persisting and for the patience. I just wanted to double check everything before going back with my arguments.

Appreciate you taking the time to step me through this issue - fantastic patience.

Regards
Mel