  • margins and log-transformed dependent variable

    Hello,

    My dependent variable is expressed as log(y + 1). This is done because some of my y values are equal to zero. My independent variables are raw scores. It is a panel dataset. I estimate the effects using fixed-effects OLS.

    When only the dependent variable is log-transformed, we exponentiate the coefficient to obtain the multiplicative factor for every 1-unit increase in the independent variable x. This is clear.

    However, what if I want to understand the effect at a given value of x in the original units of y? Would the following procedure be correct? I am using one of Stata's example datasets since I cannot share my own data.

    My understanding is that specifying expression(exp(xb()) - 1) in margins converts the predictions back to the original units of y, taking into account that 1 was added.

    Code:
    . use https://www.stata-press.com/data/r18/nlswork.dta, clear
    (National Longitudinal Survey of Young Women, 14-24 years old in 1968)
    
    . * Converting log-transformed wage to its original units
    
    . gen wage = exp(ln_wage)
    
    . * Taking the natural log of wage plus 1
    
    . gen ln_wage_plus1 = log(wage + 1)
    
    . xtset idcode year
    
    Panel variable: idcode (unbalanced)
     Time variable: year, 68 to 88, but with gaps
             Delta: 1 unit
    
    . xtreg ln_wage_plus1 i.year tenure union wks_work, fe robust
    
    Fixed-effects (within) regression               Number of obs     =     18,637
    Group variable: idcode                          Number of groups  =      4,112
    
    R-squared:                                      Obs per group:
         Within  = 0.1477                                         min =          1
         Between = 0.2128                                         avg =        4.5
         Overall = 0.1622                                         max =         12
    
                                                    F(14, 4111)       =      99.93
    corr(u_i, Xb) = 0.1698                          Prob > F          =     0.0000
    
                                 (Std. err. adjusted for 4,112 clusters in idcode)
    ------------------------------------------------------------------------------
                 |               Robust
    ln_wage_pl~1 | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
    -------------+----------------------------------------------------------------
            year |
             71  |   .0267133   .0085263     3.13   0.002     .0099971    .0434296
             72  |   .0299803   .0098336     3.05   0.002      .010701    .0492595
             73  |   .0290363   .0107442     2.70   0.007     .0079719    .0501007
             77  |   .0623787    .011784     5.29   0.000     .0392757    .0854816
             78  |   .0832801   .0122998     6.77   0.000     .0591659    .1073943
             80  |   .0207636   .0132531     1.57   0.117    -.0052197    .0467468
             82  |   .0334551   .0133158     2.51   0.012      .007349    .0595612
             83  |   .1083472   .0133449     8.12   0.000      .082184    .1345104
             85  |   .0721576   .0142641     5.06   0.000     .0441922     .100123
             87  |    .086793   .0151642     5.72   0.000     .0570629     .116523
             88  |   .1494352   .0155955     9.58   0.000     .1188596    .1800109
                 |
          tenure |    .012772    .000996    12.82   0.000     .0108192    .0147248
           union |   .0782881   .0082139     9.53   0.000     .0621845    .0943917
        wks_work |   .0015437   .0001197    12.89   0.000      .001309    .0017784
           _cons |   1.699111   .0113788   149.32   0.000     1.676802    1.721419
    -------------+----------------------------------------------------------------
         sigma_u |  .33375129
         sigma_e |  .21308098
             rho |   .7104247   (fraction of variance due to u_i)
    ------------------------------------------------------------------------------
    
    . summarize wks_work if e(sample)
    
        Variable |        Obs        Mean    Std. dev.       Min        Max
    -------------+---------------------------------------------------------
        wks_work |     18,637     63.2597    28.42125          0        104
    
    . margins, at(wks_work = (0(10)100)) expression(exp(xb()) - 1)
    
    Predictive margins                                      Number of obs = 18,637
    Model VCE: Robust
    
    Expression: exp(xb()) - 1
    1._at:  wks_work =   0
    2._at:  wks_work =  10
    3._at:  wks_work =  20
    4._at:  wks_work =  30
    5._at:  wks_work =  40
    6._at:  wks_work =  50
    7._at:  wks_work =  60
    8._at:  wks_work =  70
    9._at:  wks_work =  80
    10._at: wks_work =  90
    11._at: wks_work = 100
    
    ------------------------------------------------------------------------------
                 |            Delta-method
                 |     Margin   std. err.      z    P>|z|     [95% conf. interval]
    -------------+----------------------------------------------------------------
             _at |
              1  |   5.282635    .047974   110.11   0.000     5.188608    5.376663
              2  |   5.380371   .0410873   130.95   0.000     5.299841    5.460901
              3  |   5.479627   .0339773   161.27   0.000     5.413033    5.546221
              4  |   5.580427    .026641   209.47   0.000     5.528212    5.632642
              5  |   5.682795   .0190793   297.85   0.000     5.645401     5.72019
              6  |   5.786756   .0113115   511.58   0.000     5.764586    5.808926
              7  |   5.892334   .0035904  1641.14   0.000     5.885297    5.899371
              8  |   5.999554   .0055614  1078.78   0.000     5.988654    6.010454
              9  |   6.108443   .0139617   437.51   0.000     6.081078    6.135807
             10  |   6.219025   .0227721   273.10   0.000     6.174392    6.263657
             11  |   6.331327   .0318807   198.59   0.000     6.268842    6.393812
    ------------------------------------------------------------------------------

  • #2
    If you estimate the specification
    Code:
    log(y+1) = xb + u
    you are perhaps implicitly assuming that E[u|x]=0.

    Retransformation will give
    Code:
    y = exp(xb)*exp(u) - 1
    so that
    Code:
    E[y|x] = exp(xb)*E[exp(u)|x] - 1
    However, E[u|x]=0 does not imply E[exp(u)|x]=1. So in general, using exp(xb)-1 as you've specified will not correctly describe the conditional mean of y.

    Edward Norton and I discuss such issues in a recent paper https://onlinelibrary.wiley.com/doi/10.1111/obes.12583 in which we also raise concerns about the use of log(y+1)-type transformations of dependent variables.
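John's point can be illustrated with a small simulation (a Python sketch, not part of the original exchange; the coefficients and error variance are made up): with homoskedastic normal errors, exp(xb) - 1 understates E[y|x], while multiplying exp(xb) by Duan's smearing factor (the sample mean of exp(residuals)) recovers it.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
sigma = 0.5
x = rng.uniform(0, 1, n)
u = rng.normal(0, sigma, n)             # homoskedastic error, E[u|x] = 0

# True model on the transformed scale: log(y + 1) = 1 + 2*x + u
y = np.exp(1 + 2 * x + u) - 1

xb = 1 + 2 * 0.5                        # linear index at x = 0.5

# Naive retransformation ignores E[exp(u)] != 1
naive = np.exp(xb) - 1

# Duan smearing: scale exp(xb) by the sample mean of exp(residuals)
smear = np.exp(u).mean()                # residuals = u here (true betas used)
corrected = np.exp(xb) * smear - 1

# Exact conditional mean: E[exp(u)] = exp(sigma^2 / 2) for normal u
truth = np.exp(xb) * np.exp(sigma**2 / 2) - 1

print(naive, corrected, truth)
```

Here the smearing factor converges to exp(sigma^2/2) ≈ 1.13, so the naive retransformation understates exp(xb)*E[exp(u)] by roughly 13 percent.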



    • #3
      Originally posted by John Mullahy View Post
      If you estimate the specification
      Code:
      log(y+1) = xb + u
      you are perhaps implicitly assuming that E[u|x]=0.

      Retransformation will give
      Code:
      y = exp(xb)*exp(u) - 1
      so that
      Code:
      E[y|x] = exp(xb)*E[exp(u)|x] - 1
      However, E[u|x]=0 does not imply E[exp(u)|x]=1. So in general, using exp(xb)-1 as you've specified will not correctly describe the conditional mean of y.

      Edward Norton and I discuss such issues in a recent paper https://onlinelibrary.wiley.com/doi/10.1111/obes.12583 in which we also raise concerns about the use of log(y+1)-type transformations of dependent variables.
      Dear John,

      Thank you for your reply and sharing your paper. I will try the alternative estimation methods your study suggests.

      Still, if one were to use the xtreg, fe robust specification with log(y + 1) as the dependent variable (as described in my initial message), would it be possible to say that expression(exp(xb()) - 1) gives the most accurate conversion to the original values of y? Following my initial example, it seems that the point estimates do not deviate much if a simple log(y) is used. This is also the case if the retransformation for the log specification uses the standard Duan homoskedastic smearing estimate (as in your paper).
      Code:
      . use https://www.stata-press.com/data/r18/nlswork.dta, clear
      (National Longitudinal Survey of Young Women, 14-24 years old in 1968)
      
      . xtset idcode year
      
      Panel variable: idcode (unbalanced)
       Time variable: year, 68 to 88, but with gaps
               Delta: 1 unit
      
      . xtreg ln_wage i.year tenure union wks_work, fe robust
      
      Fixed-effects (within) regression               Number of obs     =     18,637
      Group variable: idcode                          Number of groups  =      4,112
      
      R-squared:                                      Obs per group:
           Within  = 0.1435                                         min =          1
           Between = 0.2147                                         avg =        4.5
           Overall = 0.1633                                         max =         12
      
                                                      F(14, 4111)       =      99.61
      corr(u_i, Xb) = 0.1738                          Prob > F          =     0.0000
      
                                   (Std. err. adjusted for 4,112 clusters in idcode)
      ------------------------------------------------------------------------------
                   |               Robust
           ln_wage | Coefficient  std. err.      t    P>|t|     [95% conf. interval]
      -------------+----------------------------------------------------------------
              year |
               71  |    .032752   .0104792     3.13   0.002     .0122071    .0532969
               72  |   .0356803   .0120706     2.96   0.003     .0120153    .0593453
               73  |   .0342248   .0131951     2.59   0.010     .0083553    .0600944
               77  |   .0745899   .0144043     5.18   0.000     .0463497    .1028301
               78  |   .1014739   .0149435     6.79   0.000     .0721766    .1307712
               80  |   .0238239   .0160676     1.48   0.138    -.0076773    .0553251
               82  |   .0367923   .0161919     2.27   0.023     .0050474    .0685371
               83  |   .1264629   .0161343     7.84   0.000      .094831    .1580948
               85  |   .0790787   .0172319     4.59   0.000     .0452949    .1128625
               87  |   .0947541   .0183261     5.17   0.000      .058825    .1306832
               88  |   .1711229   .0186361     9.18   0.000     .1345859    .2076598
                   |
            tenure |   .0147531   .0011576    12.74   0.000     .0124835    .0170226
             union |   .0957137   .0096782     9.89   0.000     .0767392    .1146882
          wks_work |   .0018948   .0001449    13.07   0.000     .0016106    .0021789
             _cons |    1.48285    .013827   107.24   0.000     1.455742    1.509958
      -------------+----------------------------------------------------------------
           sigma_u |  .39694574
           sigma_e |  .25262166
               rho |  .71173251   (fraction of variance due to u_i)
      ------------------------------------------------------------------------------
      
      . margins, at(wks_work = (0(10)100)) expression(exp(xb()))
      
      Predictive margins                                      Number of obs = 18,637
      Model VCE: Robust
      
      Expression: exp(xb())
      1._at:  wks_work =   0
      2._at:  wks_work =  10
      3._at:  wks_work =  20
      4._at:  wks_work =  30
      5._at:  wks_work =  40
      6._at:  wks_work =  50
      7._at:  wks_work =  60
      8._at:  wks_work =  70
      9._at:  wks_work =  80
      10._at: wks_work =  90
      11._at: wks_work = 100
      
      ------------------------------------------------------------------------------
                   |            Delta-method
                   |     Margin   std. err.      z    P>|z|     [95% conf. interval]
      -------------+----------------------------------------------------------------
               _at |
                1  |   5.178712     .04788   108.16   0.000     5.084869    5.272555
                2  |   5.277771   .0411533   128.25   0.000     5.197112     5.35843
                3  |   5.378726    .034155   157.48   0.000     5.311783    5.445668
                4  |   5.481611   .0268803   203.93   0.000     5.428927    5.534295
                5  |   5.586464   .0193292   289.02   0.000      5.54858    5.624349
                6  |   5.693323   .0115257   493.97   0.000     5.670733    5.715913
                7  |   5.802226   .0037896  1531.08   0.000     5.794799    5.809654
                8  |   5.913212   .0057828  1022.54   0.000     5.901878    5.924546
                9  |   6.026321   .0143722   419.30   0.000     5.998152     6.05449
               10  |   6.141594   .0234836   261.53   0.000     6.095567    6.187621
               11  |   6.259072   .0329747   189.81   0.000     6.194442    6.323701
      ------------------------------------------------------------------------------



      • #4
        would it be possible to say that expression(exp(xb()) - 1) gives the most accurate conversion to the original values of y?
        I suppose I'm conservative when it comes to making such statements, Marco. So to me an assertion of "most accurate" would be hard to support.

        Instead, is there any reason you couldn't use expression(exp(xb()) - 1) and then simply report something like what you wrote:

        the point estimates do not deviate much if simple log(y) is used. This is also the case if the retransformation for the log function using the standard Duan homoskedastic smearing estimate...is used.
        without advancing a claim of "most accurate"?



        • #5
          Originally posted by John Mullahy View Post

          I suppose I'm conservative when it comes to making such statements, Marco. So to me an assertion of "most accurate" would be hard to support.

          Instead, is there any reason you couldn't use expression(exp(xb()) - 1) and then simply report something like what you wrote:

          without advancing a claim of "most accurate"?
          Thank you, John. Duly noted.



          • #6
            You could also use an exponential mean and the Poisson fixed effects estimator to directly get the semi-elasticities. This requires no assumptions of the type John mentioned. If you get similar results, it's another robustness check.

            A word of caution about your example: The variable wage never takes the value zero, and so adding one before taking the log is going to be less harmful. If you have lots of zeros in your application then it can have a huge effect -- see Mullahy and Norton!

            Also, the estimated effects you obtain are not invariant to how you measure wage. If you change from dollars to cents, say, the estimated percentage effects will change. That's a bad thing. This won't happen with the Poisson FE estimator and an exponential mean.
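Jeff's unit-invariance point can be checked numerically (an illustrative Python sketch with made-up coefficients, not from the thread): rescaling y changes the slope on log(y + 1), whereas under a plain log, rescaling only shifts the intercept.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
x = rng.uniform(0, 1, n)
# Exponential conditional mean with multiplicative error; units of y are arbitrary
y = np.exp(0.5 + 1.0 * x) * rng.lognormal(0, 0.3, n)

def slope(dep):
    """OLS slope of dep on x."""
    return np.polyfit(x, dep, 1)[0]

b_dollars = slope(np.log(y + 1))        # y measured in "dollars"
b_cents   = slope(np.log(100 * y + 1))  # same y measured in "cents"
b_log     = slope(np.log(100 * y))      # plain log: slope unaffected by rescaling

print(b_dollars, b_cents, b_log)
```

The +1 is negligible relative to y measured in cents but not in dollars, which is what moves the estimated slope; the plain-log slope stays at its true value regardless of units.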



            • #7
              Originally posted by Jeff Wooldridge View Post
              You could also use an exponential mean and the Poisson fixed effects estimator to directly get the semi-elasticities. This requires no assumptions of the type John mentioned. If you get similar results, it's another robustness check.

              A word of caution about your example: The variable wage never takes the value zero, and so adding one before taking the log is going to be less harmful. If you have lots of zeros in your application then it can have a huge effect -- see Mullahy and Norton!

              Also, the estimated effects you obtain are not invariant to how you measure wage. If you change from dollars to cents, say, the estimated percentage effects will change. That's a bad thing. This won't happen with the Poisson FE estimator and an exponential mean.
              Dear Jeff,

              Thank you for this additional clarification.

              In my raw data (again, sorry for not being able to show the results), the standard deviation of the dependent variable is 3 times higher than the mean. In this case, using the raw data with zeroes, would it be appropriate to use a conditional (xtnbreg y i.year x, fe) or unconditional (nbreg y i.id i.year x) fixed-effects negative binomial instead of the fixed-effects Poisson estimator (xtpoisson y i.year x, fe)?

              My fixed-effects OLS results with the log-plus-1 dependent variable (xtreg ln_y_plus1 i.year x, fe) are more closely aligned with the conditional and unconditional negative binomial results than with the Poisson estimates.

              Would be grateful for your help.



              • #8
                the standard deviation of the dependent variable is 3 times higher than the mean
                Here, you are examining the marginal distribution of the outcome, whereas overdispersion/underdispersion is a property of the conditional distribution. In any case, the Poisson estimator with -vce(robust)- allows any kind of variance-mean relationship.
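That robustness can be illustrated with a small simulation (a hypothetical Python sketch, not from the thread): a Poisson quasi-MLE, fit here by hand with Newton-Raphson, still recovers the slope of the exponential mean even when the counts are deliberately overdispersed negative binomial draws.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
x = rng.uniform(0, 1, n)
mu = np.exp(0.2 + 1.0 * x)              # exponential conditional mean

# Overdispersed counts: negative binomial with Var = mu + mu^2/theta > mu
theta = 1.0
y = rng.negative_binomial(theta, theta / (theta + mu))

# Poisson quasi-MLE via Newton-Raphson on beta = (intercept, slope)
X = np.column_stack([np.ones(n), x])
b = np.zeros(2)
for _ in range(40):
    m = np.exp(X @ b)
    grad = X.T @ (y - m)                # Poisson score
    hess = (X * m[:, None]).T @ X       # X' diag(m) X
    b = b + np.linalg.solve(hess, grad)

print(b)  # slope close to the true value 1.0 despite overdispersion
```

Only the conditional mean exp(xb) needs to be correctly specified for consistency; robust (sandwich) standard errors then account for the misspecified variance.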

                In this case, using the raw data with zeroes, would it be appropriate to use a conditional (xtnbreg y i.year x, fe) or unconditional (nbreg y i.id i.year x) fixed-effects negative binomial instead of the fixed-effects Poisson estimator (xtpoisson y i.year x, fe)?
                Jeff has compiled a list of reasons why you should almost never use the FE NegBin estimator. See #3 https://www.statalist.org/forums/for...-poisson-model
                Last edited by Andrew Musau; 08 Jan 2024, 11:29.

