  • Multiple linear regression / Interpretation of log-transformed variables

    Hi.
    I am using a database with information on food preservation methods (such as "frozen" or "canned", expressed as tertiles of consumption in grams/day) and their effect on several continuous variables (leukocytes, CRP, ...). I am having difficulty choosing an appropriate model.

    1. If the dependent variables are kept continuous, should I fit a separate multiple regression for each combination of food preservation method and dependent variable?
    For example:
    Code:
     regress leukocyte i.cannedtertile
    + other explanatory variables
    Code:
     regress crp i.cannedtertile
    + " " "
    Code:
     regress crp i.frozentertile
    + " " "
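    If one model per outcome and exposure is the plan, the repetition can be handled with nested loops. The sketch below uses the auto data as a stand-in, since the original dataset is not available: -price- and -weight- play the role of the outcomes, -foreign- and -rep78- the role of the tertile variables, and -mpg- the role of the other explanatory variables. All of these substitutions are assumptions for illustration.

```stata
sysuse auto, clear

* Stand-ins: outcomes ~ leukocyte, crp; exposures ~ cannedtertile, frozentertile
local outcomes  price weight
local exposures foreign rep78

foreach y of local outcomes {
    foreach x of local exposures {
        display as text _newline "Outcome: `y'   Exposure: `x'"
        regress `y' i.`x' mpg
    }
}
```

    With the real data, the two local macros would simply list the actual outcome and tertile variable names.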

    2. Most dependent variables are not normally distributed. For example, the continuous variable "leukocytes" (measured in 10^3 / mm3) does not have a normal distribution, so I have transformed it logarithmically.
    Code:
    gen logleukocyte = log(leukocyte)
    a) I have found that the interpretation should be done like this: exponentiate the coefficient, subtract 1 and multiply by 100 (https://kenbenoit.net/assets/courses...logmodels2.pdf and https://stats.idre.ucla.edu/other/mu...g-transformed/).
    Code:
    regress logleukocyte b1.cannedtertil
    Code:
    -------------------------------------------------------------------------------
     logleukocyte |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    --------------+----------------------------------------------------------------
     cannedtertil |
                2 |  -.0365994   .0171835    -2.13   0.033    -.0702951   -.0029037
                3 |  -.0152055   .0171048    -0.89   0.374    -.0487469    .0183359
            _cons |   1.784896   .0121318   147.13   0.000     1.761107    1.808686
    -------------------------------------------------------------------------------
    So the first coefficient could be interpreted as:
    - Coefficient: -.0365994
    - Exponentiate: 0.9641
    - Subtract 1: -0.0359
    - Multiply by 100: -3.594
    So: "compared to the lowest tertile, those in the second canned food consumption tertile have 3.59 10^3/mm3 fewer leukocytes"
    --> Is this correct?
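    For reference, the arithmetic in a) can be reproduced with -display-. This is only a sketch: it plugs in the coefficient and confidence limits from the table above by hand (the scalar names are made up for illustration). It assumes the usual reading of a log-outcome model, in which 100*(exp(b) - 1) is a percentage difference (strictly, in the geometric mean) rather than a difference in 10^3/mm3.

```stata
* Coefficient and 95% limits for tertile 2, copied by hand from the output above
scalar b2  = -.0365994
scalar lb2 = -.0702951
scalar ub2 = -.0029037

* Percentage difference relative to tertile 1: 100 * (exp(b) - 1)
display "ratio           = " %6.4f (exp(b2))
display "pct. difference = " %6.2f (100*(exp(b2) - 1))

* The same transformation applied to each confidence limit
display "lower limit (%) = " %6.2f (100*(exp(lb2) - 1))
display "upper limit (%) = " %6.2f (100*(exp(ub2) - 1))
```

    With the fitted model still in memory, nlcom (pct: 100*(exp(_b[2.cannedtertil]) - 1)) should give the same point estimate together with a delta-method confidence interval.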

    b) However, how would the confidence interval be interpreted?
    I have read in this post (https://www.stata.com/stata-news/news34-2/spotlight/) that it is preferable to use log transform and linear regression or Poisson regression followed by the use of the "margins" command, so that the confidence interval is also on the original scale (given that: "It is tempting to simply exponentiate the predictions to convert them back to wages, but the reverse transformation results in a biased prediction (see references Abrevaya [2002]; Cameron and Trivedi [2010]; Duan [1983]; Wooldridge [2010]).")
    c) If the above is correct, is it correct to use it:
    Code:
    gsem logleukocyte <-  b1.cannedtertil
    -------------------------------------------------------------------------------
                  |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    --------------+----------------------------------------------------------------
    logleukocyte  |
     cannedtertil |
                2 |  -.0365994   .0171659    -2.13   0.033     -.070244   -.0029548
                3 |  -.0152055   .0170873    -0.89   0.374     -.048696     .018285
            _cons |   1.784896   .0121194   147.28   0.000     1.761143     1.80865
    --------------+----------------------------------------------------------------
     var(e.leulog)|   .0716775   .0020492                      .0677717    .0758085
    -------------------------------------------------------------------------------
    Code:
    margins, expression(exp(predict(eta))*(exp((_b[/var(e.logleukocyte)])/2)))
    Code:
    margins, expression(exp(predict(eta))*(exp((_b[/var(e.logleukocyte)])/2))) at(cannedtertile=(1(1)3))
    ------------------------------------------------------------------------------
                 |            Delta-method
                 |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             _at |
              1  |   6.176397   .0751213    82.22   0.000     6.029162    6.323632
              2  |   5.954431   .0726437    81.97   0.000     5.812052     6.09681
    ------------------------------------------------------------------------------
    ...like this?

    --> Also, how would the result be interpreted (6.176397 and 5.954431)? (this is not the same as obtained above: 3.59 10 ^ 3 / mm3)
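    The two sets of numbers can be reconciled with a little arithmetic, sketched below using the margins copied by hand from the output above (the scalar names are illustrative). It assumes the two margins are predicted mean leukocyte counts, in 10^3/mm3, for tertiles 1 and 2; under that assumption their ratio should match the exponentiated coefficient from a).

```stata
* Predicted means for tertiles 1 and 2, copied by hand from the margins output
scalar m1 = 6.176397
scalar m2 = 5.954431

* Absolute difference on the original scale, and the relative difference
display "difference (10^3/mm3) = " %7.4f (m2 - m1)
display "ratio                 = " %6.4f (m2/m1)
display "pct. difference       = " %6.2f (100*(m2/m1 - 1))

* Exponentiated tertile-2 coefficient from the log-scale model, for comparison
display "exp(coefficient)      = " %6.4f (exp(-.0365994))
```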

    d) If not, would you recommend the Poisson model + margins (the second option explained here: https://www.stata.com/stata-news/news34-2/spotlight/)? (I have tried it too and get similar results, with values around 5 and 6, and I don't know how to interpret them.)

    3. If I need to report a p-value, should I use the one from the multiple linear regression with the log-transformed variables?



    I would really appreciate your help.


    Thank you in advance.
    Last edited by Carla RAS; 08 May 2020, 03:58.

  • #2
    Carla:
    as far as logging and subsequently exponentiating back are concerned, there's much to gain from switching to a -glm- model with -link(log)- and -family(gamma)-, as you can see from the following toy example:
    Code:
    . use "C:\Program Files\Stata16\ado\base\a\auto.dta"
    . glm price i.foreign mpg, family(gamma) link(log) vce(cluster foreign)
    
    Iteration 0:   log pseudolikelihood = -717.61676
    Iteration 1:   log pseudolikelihood = -717.56027
    Iteration 2:   log pseudolikelihood = -717.56024
    
    Generalized linear models                         Number of obs   =         74
    Optimization     : ML                             Residual df     =         73
                                                      Scale parameter =   .1511033
    Deviance         =  8.306873387                   (1/df) Deviance =   .1137928
    Pearson          =  10.72833745                   (1/df) Pearson  =   .1469635
    
    Variance function: V(u) = u^2                     [Gamma]
    Link function    : g(u) = ln(u)                   [Log]
    
                                                      AIC             =   19.42055
    Log pseudolikelihood = -717.5602423               BIC             =  -305.8899
    
                                    (Std. Err. adjusted for 2 clusters in foreign)
    ------------------------------------------------------------------------------
                 |               Robust
           price |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         foreign |
        Foreign  |   .2654872   .0370845     7.16   0.000      .192803    .3381714
             mpg |  -.0432842   .0067542    -6.41   0.000    -.0565222   -.0300462
           _cons |   9.539667   .1328216    71.82   0.000     9.279341    9.799992
    ------------------------------------------------------------------------------
    
    . margins
    
    Predictive margins                              Number of obs     =         74
    Model VCE    : Robust
    
    Expression   : Predicted mean price, predict()
    
    ------------------------------------------------------------------------------
                 |            Delta-method
                 |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
           _cons |   6136.244    46.7901   131.14   0.000     6044.537    6227.951
    ------------------------------------------------------------------------------
    
    . tabstat price
    
        variable |      mean
    -------------+----------
           price |  6165.257
    ------------------------
    
    .
    As you can see, the -margins- result (6136.244) is close to the raw mean of -price- (6165.257), with no back-transformation needed.
    As an aside, normality in -regress- is a (weak) requirement for the distribution of the residuals only, not for the dependent variable itself.
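    To get group-specific means and their difference on the original scale, which is what a comparison between tertiles needs, the toy example can be extended as sketched below. It uses the same auto data, loaded here with -sysuse-, with -foreign- standing in for the tertile variable. The default (non-clustered) standard errors are an assumption made here for simplicity.

```stata
sysuse auto, clear

* Same toy model as above, with default standard errors
glm price i.foreign mpg, family(gamma) link(log)

* Replay with exponentiated coefficients: ratios of means between groups
glm, eform

* Predicted mean price for each level of foreign, on the original scale
margins foreign

* Difference in predicted mean price, Foreign vs. Domestic, with its CI
margins r.foreign
```

    The last command reports the contrast in the units of -price-, so no exponentiation by hand is involved.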
    Kind regards,
    Carlo
    (Stata 19.0)



    • #3
      Thank you very much for your quick reply.

      I understand what you have explained. May I ask three more questions?
      1. How should I interpret the value that -margins- returns in your example?
      2. Is it correct to use the p-value from the GLM (for example, to present a result with its significance)?
      3. How do you choose the correct family (here, family(gamma))? (I have read https://www.stata.com/manuals13/rglm.pdf, but I could not find how this is determined.)
      Last edited by Carla RAS; 08 May 2020, 10:04.



      • #4
        Carla:
        1) the value that -margins- returns is the predicted mean of -price- on its original scale.
        2) yes, it is, as -glm- is basically a regression.
        3) the idea behind -family(gamma)- is that, if you need to log a continuous variable, in all likelihood it is positively skewed, with a long right tail (for instance, in my research field the gamma distribution fits the distribution of total costs pretty well).
        You may find the following Stata Press textbooks useful for getting familiar with -glm-:
        - https://www.stata.com/bookstore/gene...d-extensions/
        - https://www.stata.com/bookstore/health-econometrics-using-stata/
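        One informal way to check the skewness argument in 3) and to compare candidate families is sketched below on the auto data. The choice of -family(gaussian)- as the comparator and the use of AIC/BIC are assumptions for illustration, not a rule taken from the manual.

```stata
sysuse auto, clear

* Skewness of the raw outcome; a positive value indicates a long right tail
summarize price, detail
display "skewness = " %5.2f (r(skewness))

* Fit the same log-link model under two candidate families
glm price i.foreign mpg, family(gamma) link(log)
estimates store gamma_fit

glm price i.foreign mpg, family(gaussian) link(log)
estimates store gauss_fit

* Compare information criteria across the two fits (lower is better)
estimates stats gamma_fit gauss_fit
```

        Information criteria are only a rough guide here; the textbooks above discuss more formal ways of choosing the family.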
        Kind regards,
        Carlo
        (Stata 19.0)



        • #5
          Thank you, I am very grateful for your help.
