
  • Is it ok to apply the inverse hyperbolic sine transformation to one variable and the natural log transformation to other variables?

    Hi Statalist. I have a conundrum. I have several variables to which I need to apply the log transformation, such as GDP per capita. However, one of those variables - "cumulative experience" (a count of the times a firm has manufactured a nuclear reactor prior to the current observation) - has several zero-valued observations. Therefore, I am taking the standard advice of using the inverse hyperbolic sine transformation. For all my other variables, there are no zero-valued observations, so the log transformation works fine. Moreover, the untransformed values of these non-problematic variables are large enough that the approximation asinh(x) ≈ ln(x) + ln(2) is effectively exact for my data. So, as far as I can tell, it's purely a stylistic choice which one I use.
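    For concreteness, here is a quick check of both properties in Stata (just a sketch with arbitrary values): asinh() is defined at zero, while ln() is not, and for large x asinh(x) converges to ln(x) + ln(2).

    Code:
    * Quick check: asinh(0) = 0, and asinh(x) - (ln(x) + ln(2)) shrinks as x grows
    clear
    set obs 6
    gen double x = cond(_n == 1, 0, 10^(_n - 1))   // x = 0, 10, 100, ..., 100000
    gen double asinh_x = asinh(x)
    gen double approx = ln(x) + ln(2)              // missing at x = 0, since ln(0) is undefined
    gen double diff = asinh_x - approx
    list x asinh_x approx diff, noobs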

    My question is this: should I apply the same transformation (inverse hyperbolic sine) to all the variables that need to be transformed for the sake of consistency? Or would it be okay if I apply the log transformation to the variables that don't have the issue of zero-valued observations? It would be nice if I could refer to "log GDP per capita" when writing and speaking, since that is such a common transformation.

    In case it matters, cumulative experience is a count variable, but it is not the outcome of interest. My outcome of interest is continuous, which is why I'm not using count methods here. Are there even methods for when count data are on the right-hand side of a regression? Is that an issue?


  • #2
    I take it that you're logarithmically transforming your outcome variable because that's the convention in your field of study, but why are you transforming your explanatory variables (right-hand side)? I thought that you could get estimates of elasticity from -margins-.



    • #3
      I am transforming my outcome of interest (how long it takes to build a nuclear power plant) because it has a long right tail (some builds get very badly delayed; none of them magically finish super early). When I log-transform it, it much more closely approximates a normal distribution. (In the future, I plan to use duration methods / survival analysis, but that's beyond the scope of the current work.)
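      As a rough illustration of that check (build_months is a stand-in name, not my actual variable):

      Code:
      * Hypothetical sketch: build_months stands in for my actual outcome variable
      summarize build_months, detail          // large positive skewness: long right tail
      gen ln_build_months = ln(build_months)  // build durations are strictly positive
      summarize ln_build_months, detail       // skewness should now be much closer to 0
      histogram ln_build_months, normal       // visual check against a normal density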

      I am transforming the following explanatory variables:
      • GDP per capita (for reasons that should be self-explanatory)
      • cumulative experience (because the theory of learning-by-doing requires the transformation; see the sketch after this list)
      • possibly some technical characteristics of the reactor (however, I have no theory to guide this decision...)
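      Concretely, the first two transformations would look like this (cum_exp and gdp_pc are stand-in variable names):

      Code:
      * Hypothetical sketch: cum_exp and gdp_pc are stand-in variable names
      gen ah_cum_exp = asinh(cum_exp)   // defined at zero: asinh(0) = 0, so no observations are lost
      gen ln_gdp_pc = ln(gdp_pc)        // strictly positive here, so ln() works fine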
      I was always taught to transform the data as appropriate and interpret the regression coefficients directly from the regression output (with the interpretation varying by level-level, log-level, level-log, and log-log.) Consequently, I have never used the -margins- command for the purpose you propose. So I went and tested it out. I found that estimating a level-level model and then using -margins- to calculate the elasticity does not produce the same estimate of elasticity as estimating a log-log model.

      Code:
      sysuse auto, clear
      reg price mpg
      
            Source |       SS           df       MS      Number of obs   =        74
      -------------+----------------------------------   F(1, 72)        =     20.26
             Model |   139449474         1   139449474   Prob > F        =    0.0000
          Residual |   495615923        72  6883554.48   R-squared       =    0.2196
      -------------+----------------------------------   Adj R-squared   =    0.2087
             Total |   635065396        73  8699525.97   Root MSE        =    2623.7
      
      ------------------------------------------------------------------------------
             price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
               mpg |  -238.8943   53.07669    -4.50   0.000    -344.7008   -133.0879
             _cons |   11253.06   1170.813     9.61   0.000     8919.088    13587.03
      ------------------------------------------------------------------------------
      
      margins, eyex(mpg)
      
      Average marginal effects                        Number of obs     =         74
      Model VCE    : OLS
      
      Expression   : Linear prediction, predict()
      ey/ex w.r.t. : mpg
      
      ------------------------------------------------------------------------------
                   |            Delta-method
                   |      ey/ex   Std. Err.      t    P>|t|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
               mpg |  -.9859019   .3341136    -2.95   0.004    -1.651945   -.3198587
      ------------------------------------------------------------------------------
      
      gen ln_price=log(price)
      
      gen ln_mpg=log(mpg)
      
      reg ln_price ln_mpg
      
            Source |       SS           df       MS      Number of obs   =        74
      -------------+----------------------------------   F(1, 72)        =     31.00
             Model |  3.37819527         1  3.37819527   Prob > F        =    0.0000
          Residual |  7.84533782        72  .108963025   R-squared       =    0.3010
      -------------+----------------------------------   Adj R-squared   =    0.2913
             Total |  11.2235331        73  .153747029   Root MSE        =     .3301
      
      ------------------------------------------------------------------------------
          ln_price |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
      -------------+----------------------------------------------------------------
            ln_mpg |   -.826847   .1484986    -5.57   0.000    -1.122873   -.5308204
             _cons |   11.14146   .4507755    24.72   0.000     10.24286    12.04007
      ------------------------------------------------------------------------------
      My intuition says that, if I think the underlying data-generating process is better approximated by a log-log model than a level-level model, then estimates of the elasticity from the log-log model will be closer to the truth. Is this wrong? (The sketch below is my attempt to see what -margins- is actually averaging.)
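      To see where the level-level number comes from, here is what -margins, eyex()- is averaging (a sketch on the same auto data):

      Code:
      * Sketch: unpack the average elasticity that -margins, eyex(mpg)- reports
      sysuse auto, clear
      quietly regress price mpg
      predict double yhat, xb                    // linear prediction
      gen double eyex_i = _b[mpg] * mpg / yhat   // observation-level elasticity
      summarize eyex_i                           // the mean reproduces the -margins- estimate (about -.986)

      As far as I can tell, the level-level model lets this elasticity vary across observations and -margins- simply averages it, while the log-log model forces a single constant elasticity, so the two estimates will generally differ unless constant elasticity is the right functional form.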
