Interpreting retransformation of dependent variable in regression analysis

Gunnar Ek

Join Date: Dec 2018

Posts: 17
#1


04 Dec 2018, 22:50

Hi statalisters,

I have a question about interpreting the retransformed coefficient in a regression analysis.

I'm analyzing a biomarker which is sampled 2-3 times for approximately 30 patients (total 80 observations). The biomarker is not normally distributed, and therefore I calculated the square root of it. Then I performed a linear regression analysis, and now I have a hard time interpreting the data. I'd like to estimate the average % increase of the biomarker per year, but I'm not sure I can trust the results.

My code looks like this:

gen biomarker2 = sqrt(biomarker)
reg biomarker2 time, vce(cluster patientid)

Taking the coefficient^2 would yield the retransformed coefficient, right? But then I'd like to see the average increase in % per year. Can I just calculate (retransformed coefficient)/intercept, or do I need to retransform the intercept too? What if I'd like to see the average per month - can I divide the increase by 12? Or does none of this make sense?

Many thanks in advance,
Gunnar
Clyde Schechter

Join Date: Apr 2014

Posts: 30084
#2

04 Dec 2018, 23:18

Taking the coefficient^2 would yield the retransformed coefficient, right?

No, no, no!

Let's start by taking the square root transformation of your biomarker, and the use of -reg- as a given. Bear in mind, first, that a simple linear regression of biomarker2 on time does not give you a constant percentage increase in biomarker2 (and still less so in biomarker itself). Rather the percentage increase in biomarker2 will vary over time. You can use the -margins- command to estimate the approximate percentage increases in biomarker itself per unit of time, at each of your three timepoints as follows:

Code:

gen biomarker2 = sqrt(biomarker)
reg biomarker2 time, vce(cluster patientid)
margins, eydx(time) at(time = (1 2 3)) expression(predict(xb)^2)

Note: What you will get is not expressed in percentage units. You might get numbers that look like 0.05, and you could interpret that as approximately 5%. The approximation involved in using this approach is very good up to about 5%, fair up to about 10 or maybe 15% depending on your needs. After that it is not a good approximation.
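To see why squaring the coefficient does not work and why the percentage change depends on time, here is a short derivation. It is only a sketch: it treats the squared linear prediction as the predicted biomarker (as the -margins- expression above does) and ignores residual variance. The symbols \beta_0 (intercept), \beta_1 (coefficient on time) and r_t are notation introduced here, not output from any command.

```latex
\begin{align*}
\sqrt{y} = \beta_0 + \beta_1 t
  \quad &\Longrightarrow \quad y(t) = (\beta_0 + \beta_1 t)^2 \\
\frac{dy}{dt} = 2\beta_1(\beta_0 + \beta_1 t),
  \qquad &\frac{d\ln y}{dt} = \frac{2\beta_1}{\beta_0 + \beta_1 t} = 2r_t,
  \qquad r_t = \frac{\beta_1}{\beta_0 + \beta_1 t} \\
\frac{y(t+1) - y(t)}{y(t)} &= (1 + r_t)^2 - 1 = 2r_t + r_t^2
\end{align*}
```

The slope of y in t is 2\beta_1(\beta_0 + \beta_1 t), not \beta_1^2, and both the slope and the relative change vary with t. The quantity 2r_t is what eydx() targets; under these assumptions the exact one-unit relative change exceeds it by r_t^2, so a reported 0.05 corresponds to about 5.06%.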

If you want an "average" percentage increase in biomarker per unit time, you could get that with

Code:

margins, eydx(time) expression(predict(xb)^2)

but in all honesty I don't think that's very meaningful.

But let's dig deeper. First, the rationale for the transformation itself is weak. Normality is highly over-rated. Normality of the outcome variable itself is always and everywhere irrelevant. It is the normality of the regression residuals that occasionally matters. Non-normality does not bias the coefficient estimates in linear regression. It can result in incorrect standard errors, t-statistics and p-values, but only in small samples. If your sample is of reasonable size, the central limit theorem will assure that those things come out OK anyway. 80 is probably large enough for this unless the non-normality is really stupendous--but if that is the case the square root transformation won't be enough to help you anyway. (And if your sample is not large, you have no business using vce(cluster patientid) anyway!) The only good rationale for doing a square root transformation is if the relationship between sqrt(biomarker) and time is linear but the relationship between biomarker itself and time is not. A scatterplot of biomarker vs time would make that clear. So you may not need the transformation at all. If you don't, then do your regression on biomarker (not biomarker2) and omit -expression(predict(xb)^2)- from the -margins- commands.
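A minimal sketch of those checks, assuming the variables biomarker, biomarker2, time and patientid from post #1 are in memory. The graph names and the residual variable res_raw are illustrative names chosen here.

```stata
* Compare linearity on the raw and the square-root scale
twoway (scatter biomarker time) (lowess biomarker time), name(g_raw, replace)
twoway (scatter biomarker2 time) (lowess biomarker2 time), name(g_sqrt, replace)

* It is the residuals, not the outcome, whose distribution matters
regress biomarker time, vce(cluster patientid)
predict double res_raw, residuals
qnorm res_raw, name(g_qq, replace)
```

If the first graph already looks roughly linear, the transformation is probably not buying anything.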

Next, the use of -regress- here is questionable. You have repeated observations on multiple patients. You are concerned enough about that to include -vce(cluster patientid)-, but you are ignoring the typically much larger and much more important fact that the patients themselves probably vary appreciably on this measure. It is probably more appropriate to include fixed or random effects at the patient level, using -xtreg-.
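As a rough sketch of what that could look like, assuming patientid identifies the patients and keeping the square-root outcome from post #1 only for continuity:

```stata
* Declare the panel (patient) identifier
xtset patientid

* Fixed effects: uses within-patient change over time only
xtreg biomarker2 time, fe vce(cluster patientid)

* Random effects: also uses between-patient variation
xtreg biomarker2 time, re vce(cluster patientid)
```

Which of the two is appropriate depends on assumptions about the patient-level effects that this thread does not settle.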

You refer to trying to change the percentage increase to a monthly rate by dividing by twelve. First, can I assume that your unit for the time variable is years? What you propose wouldn't make any sense without that, but you never say it explicitly. Dividing by 12 would make sense if you were working in a model where the percentage increase does not change with time, but since your model doesn't have that property, you can't rely on that approach here. Even in such a model, dividing by 12 would just be an approximation, and only a good one if the percentage is small.
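To illustrate the size of that approximation in a constant-rate model, the snippet below compares the compounded monthly rate with simple division by 12. The annual rates of 5% and 50% are made-up illustrative values, not estimates from any data.

```stata
* Illustrative annual rates only (5% and 50%)
foreach annual in 0.05 0.50 {
    * Monthly rate that compounds to the annual rate over 12 months
    scalar compounded = 100*((1 + `annual')^(1/12) - 1)
    * Naive monthly rate
    scalar divided = 100*`annual'/12
    display "annual " 100*`annual' "%: compounded " %6.3f compounded ///
        "% per month, divided by 12 " %6.3f divided "%"
}
```

At 5% per year the two monthly figures are about 0.407% and 0.417%; at 50% per year they are about 3.44% and 4.17%, so the shortcut degrades as the rate grows.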

Now, you could do a different model in which the percentage change per unit of time is the same at all times: that would require using log(biomarker) as the outcome variable. (This, however, would be impossible if the biomarker result can be 0; but a constant rate of change is not a possibility when the outcome can be 0 in any case since any change from a value of 0 is an infinite percentage.) Another approach, which is still feasible even if some values of biomarker are 0, is to use one of the log-linked generalized linear models, such as Poisson regression.
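A hedged sketch of both options, again assuming the variables from post #1. The name ln_biomarker is made up here, and clustering on patientid is carried over from the original code rather than a recommendation.

```stata
* Log-linear model: ln() returns missing for biomarker <= 0,
* so those observations drop out of the regression
generate double ln_biomarker = ln(biomarker)
regress ln_biomarker time, vce(cluster patientid)
display "Percent change per unit of time: " 100*(exp(_b[time]) - 1)

* Log-linked GLM: still usable when some biomarker values are 0
poisson biomarker time, vce(cluster patientid)
display "Percent change per unit of time: " 100*(exp(_b[time]) - 1)
```

In both models exp(coefficient) - 1 is the constant proportional change per unit of time, although the two need not agree: the first describes the mean of log(biomarker), the second the log of the mean of biomarker.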

Last edited by Clyde Schechter; 04 Dec 2018, 23:23.
Gunnar Ek

Join Date: Dec 2018

Posts: 17
#3

10 Dec 2018, 22:51

Hi,

Thanks for your comments. They've been very helpful!

I will have to read more about -xtreg- in order to understand the difference and what assumptions are underlying before switching from -regress-.

Kind regards
Gunnar