Multiple linear regression / Log transformed variables interpretation

Carla RAS

Join Date: Jan 2020

Posts: 10
#1

Multiple linear regression / Log transformed variables interpretation

08 May 2020, 03:39

Hi.
I am using a database with information on food preservation methods (such as "frozen" and "canned", expressed in tertiles of consumption in grams/day) and their effect on different outcomes (leukocytes, CRP, etc.; all continuous variables). I am having difficulty choosing the appropriate model for this.

1. If the dependent variables are kept continuous, should I fit a separate multiple regression for each combination of food preservation method and dependent variable?
For example:

Code:

regress leukocyte i.cannedtertile

+ other explanatory variables

Code:

regress crp i.cannedtertile

+ other explanatory variables

Code:

regress crp i.frozentertile

+ other explanatory variables
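
If one model per outcome and exposure is the route taken, the commands above can be generated with a loop. This is only a minimal sketch: it assumes the variables are named as in the examples (leukocyte, crp, cannedtertile, frozentertile), and the local macro covars is a hypothetical placeholder for the other explanatory variables, which are not listed in the post.

Code:

* covars is a hypothetical placeholder: put the other explanatory variables between the quotes
local covars ""
* one regression per combination of outcome and exposure
foreach y of varlist leukocyte crp {
    foreach x of varlist cannedtertile frozentertile {
        regress `y' i.`x' `covars'
    }
}

With two outcomes and two exposures this runs four regressions, each reported separately.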

2. Most dependent variables are not normally distributed. For example, the continuous variable "leukocytes" (measured in 10^3 / mm3) does not have a normal distribution, so I have transformed it logarithmically.

Code:

gen logleukocyte = log(leukocyte)

a) I have found that the interpretation should be done like this: exponentiate the coefficient, subtract 1 and multiply by 100 (https://kenbenoit.net/assets/courses...logmodels2.pdf and https://stats.idre.ucla.edu/other/mu...g-transformed/).

Code:

regress logleukocyte b1.cannedtertil

Code:

-------------------------------------------------------------------------------
 logleukocyte |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
 cannedtertil |
           2  |  -.0365994   .0171835    -2.13   0.033    -.0702951   -.0029037
           3  |  -.0152055   .0171048    -0.89   0.374    -.0487469    .0183359
        _cons |   1.784896   .0121318   147.13   0.000     1.761107    1.808686

So the first coefficient could be interpreted as:
- Coefficient: -.0365994
- Exponentiate: 0.9641
- Subtract 1: -0.0359
- Multiply by 100: -3.594
So: "compared to the lowest tertile, those in the second canned food consumption tertile have 3.59 10^3/mm3 fewer leukocytes"
--> Is this correct?
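
The arithmetic in (a) can be reproduced directly in Stata. The sketch below uses the coefficient and confidence limits printed in the regression output above. Under the usual reading of a regression on a log-transformed outcome, the exponentiated coefficient is a ratio of geometric means, so the -3.59 is a percent change relative to the lowest tertile rather than a difference in 10^3/mm3. The label "Ratio" passed to eform() is arbitrary.

Code:

* ratio of geometric means, tertile 2 vs tertile 1 (about 0.964)
display exp(-.0365994)
* the same ratio expressed as a percent change (about -3.59)
display 100*(exp(-.0365994) - 1)
* back-transformed 95% confidence limits (about -6.79 and -0.29)
display 100*(exp(-.0702951) - 1)
display 100*(exp(-.0029037) - 1)
* exponentiate every coefficient and its interval in one step
regress logleukocyte b1.cannedtertil, eform("Ratio")

Because the interval is transformed endpoint by endpoint, it stays a valid 95% interval for the ratio, although it is no longer symmetric around the point estimate.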

b) However, how should the confidence interval be interpreted?
I have read in this post (https://www.stata.com/stata-news/news34-2/spotlight/) that it is preferable to use either a log transformation with linear regression or a Poisson regression, followed by the "margins" command, so that the confidence interval is also on the original scale (given that: "It is tempting to simply exponentiate the predictions to convert them back to wages, but the reverse transformation results in a biased prediction (see references Abrevaya [2002]; Cameron and Trivedi [2010]; Duan [1983]; Wooldridge [2010]).")
c) If the above is correct, is it correct to use it:

Code:

gsem logleukocyte <- b1.cannedtertil

-------------------------------------------------------------------------------
                     |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
---------------------+----------------------------------------------------------------
logleukocyte         |
        cannedtertil |
                  2  |  -.0365994   .0171659    -2.13   0.033     -.070244   -.0029548
                  3  |  -.0152055   .0170873    -0.89   0.374     -.048696     .018285
               _cons |   1.784896   .0121194   147.28   0.000     1.761143     1.80865
---------------------+----------------------------------------------------------------
var(e.logleukocyte)  |   .0716775   .0020492                      .0677717    .0758085

Code:

margins, expression(exp(predict(eta))*(exp((_b[/var(e.logleukocyte)])/2)))

Code:

margins, expression(exp(predict(eta))*(exp((_b[/var(e.logleukocyte)])/2))) at(cannedtertile=(1(1)3))

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         _at |
          1  |   6.176397   .0751213    82.22   0.000     6.029162    6.323632
          2  |   5.954431   .0726437    81.97   0.000     5.812052     6.09681
------------------------------------------------------------------------------

...like this?

--> Also, how should the results (6.176397 and 5.954431) be interpreted? (They are not the same as the value obtained above: 3.59 10^3/mm3.)
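
The two margins are predicted mean leukocyte counts (in 10^3/mm3) for the first and second tertiles, so they are levels rather than effects. A small sketch, using only the numbers printed above, shows how they relate to the coefficient from the log model:

Code:

* ratio of the two predicted means: about 0.964, the same as exp(-.0365994)
display 5.954431/6.176397
* absolute difference on the original scale: about -0.22 (10^3/mm3)
display 5.954431 - 6.176397

Read this way, the two results agree: the second tertile is predicted to be about 3.6% lower than the first, which corresponds to roughly 0.22 10^3/mm3 at these predicted levels.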

d) If not, would you recommend the Poisson model + margins (second option explained here: https://www.stata.com/stata-news/news34-2/spotlight/)? (I have tried it too and similar results appear, with values around 5 and 6, and I don't know how to interpret them.)
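
For reference, the Poisson route could look like the sketch below. It assumes the untransformed outcome is named leukocyte and the exposure cannedtertil, as in the commands above; no covariates are included, and this is not necessarily the exact specification used in the linked article.

Code:

* robust standard errors, since leukocyte is continuous rather than a true count
poisson leukocyte b1.cannedtertil, vce(robust)
* predicted mean leukocyte count in each tertile, on the original scale
margins cannedtertil
* differences from the first tertile, with confidence intervals
margins r.cannedtertil

The first -margins- call returns levels (values around 5 to 6 in 10^3/mm3 would be the predicted means per tertile), while the second returns the between-tertile differences that correspond to an effect estimate.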

3. If I need to report a p-value, should I use the one from the multiple linear regression on the transformed variable?

I would really appreciate your help.

Thank you in advance.

Last edited by Carla RAS; 08 May 2020, 03:58.
Tags: log-transformed variables, multiple regression

Carlo Lazzaro

Join Date: Apr 2014
Posts: 17708

08 May 2020, 08:05

Carla:
as far as logging and subsequently exponentiating back are concerned, there's much to gain from switching to a -glm- model with -link(log)- and -family(gamma)-, as you can see from the following toy example:

Code:

. sysuse auto
. glm price i.foreign mpg, family(gamma) link(log) vce(cluster foreign)

Iteration 0:   log pseudolikelihood = -717.61676
Iteration 1:   log pseudolikelihood = -717.56027
Iteration 2:   log pseudolikelihood = -717.56024

Generalized linear models                         Number of obs   =         74
Optimization     : ML                             Residual df     =         73
                                                  Scale parameter =   .1511033
Deviance         =  8.306873387                   (1/df) Deviance =   .1137928
Pearson          =  10.72833745                   (1/df) Pearson  =   .1469635

Variance function: V(u) = u^2                     [Gamma]
Link function    : g(u) = ln(u)                   [Log]

                                                  AIC             =   19.42055
Log pseudolikelihood = -717.5602423               BIC             =  -305.8899

                                (Std. Err. adjusted for 2 clusters in foreign)
------------------------------------------------------------------------------
             |               Robust
       price |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     foreign |
    Foreign  |   .2654872   .0370845     7.16   0.000      .192803    .3381714
         mpg |  -.0432842   .0067542    -6.41   0.000    -.0565222   -.0300462
       _cons |   9.539667   .1328216    71.82   0.000     9.279341    9.799992
------------------------------------------------------------------------------

. margins

Predictive margins                              Number of obs     =         74
Model VCE    : Robust

Expression   : Predicted mean price, predict()

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _cons |   6136.244    46.7901   131.14   0.000     6044.537    6227.951
------------------------------------------------------------------------------

. tabstat price

    variable |      mean
-------------+----------
       price |  6165.257
------------------------

.

As you can see, the -margins- result is close to the raw mean of -price-, with no back-transformation needed.
As an aside, normality in -regress- is a (weak) requirement on the distribution of the residuals only, not of the dependent variable.
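
Carried over to the data in the original question, the same approach might look like the sketch below. The variable names leukocyte and cannedtertil are taken from post #1; using vce(robust) instead of clustering, and leaving out covariates, are assumptions made for illustration.

Code:

* gamma family with log link on the untransformed outcome; eform reports ratios of means
glm leukocyte b1.cannedtertil, family(gamma) link(log) vce(robust) eform
* predicted mean leukocyte count in each tertile, on the original scale
margins cannedtertil

The exponentiated coefficients are read as ratios of arithmetic means relative to the first tertile, and -margins- returns the means themselves with delta-method confidence intervals.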

Kind regards,
Carlo
(Stata 19.0)

Carla RAS

Join Date: Jan 2020

Posts: 10
#3

08 May 2020, 09:11

Thank you very much for your quick reply.

I understand what you have explained to me. But, could I ask you three more questions?
1. How should the value in the -margins- output of your example be interpreted?
2. Is it correct to use the p-value from the GLM (for example, to present a result with its significance)?
3. How do you choose the correct family (family(gamma))? (I have read this: https://www.stata.com/manuals13/rglm.pdf, but I haven't found how this is determined.)

Last edited by Carla RAS; 08 May 2020, 10:04.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17708
#4

08 May 2020, 09:59

Carla:
1) the value in the -margins- output is the predicted mean of -price- on its original scale.
2) yes, it is, as -glm- is basically a regression.
3) the idea behind -family(gamma)- is that, if you needed to log a continuous variable, in all likelihood it is positively skewed, with a long right tail (for instance, in my research field the gamma distribution fits total cost distributions pretty well).
You may find the following Stata Press textbooks useful for getting familiar with -glm-:
- https://www.stata.com/bookstore/gene...d-extensions/
- https://www.stata.com/bookstore/health-econometrics-using-stata/
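
The right-skew argument in point 3 can be checked informally before choosing a family. A minimal sketch, assuming the untransformed outcome from post #1 is named leukocyte; it is a descriptive check, not a formal test of the family:

Code:

* detailed summary; skewness well above 0 points to a long right tail
summarize leukocyte, detail
display r(skewness)
* histogram with a normal density overlaid for comparison
histogram leukocyte, normal

A clearly positive skewness together with a strictly positive outcome is the situation described above in which the gamma family is a natural candidate.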

Kind regards,
Carlo
(Stata 19.0)
Carla RAS

Join Date: Jan 2020

Posts: 10
#5

08 May 2020, 10:07

Thank you, I am very grateful for your help.