  • Interpretation of natural logarithm in model


    Hello,

    I am working on two DiD models in Stata: in one the dependent variable is sales, and in the other it is the natural logarithm of sales.
    I've listed the constant term and the DiD estimator (measuring the causal effect of the treatment on sales) from each model below.

                 constant        DiD
    ln_sales     13.5147         -0.1109463
    sales        794,255.80      -132,841.10


    I don't understand why the numbers differ. I expected the results to be equivalent, i.e. that I could exponentiate the numbers in the first row and get the same results as in the second row. Instead, the models suggest two quite different effects of the treatment on sales. The constant terms do not match either, as exp(13.5147) = 740,217.98, which is lower than the constant term in the second model.


    Is there a natural explanation for this, or have I simply done something wrong in the coding?
    Last edited by Milla Hanzon; 09 Nov 2018, 01:42.

  • #2
    You've done nothing wrong in the coding. Your expectation is simply way off base. When

    Code:
    log y = a + b x
    and you exponentiate both sides of that equation, you get

    Code:
    y = e^a * (e^x)^b
    which isn't even remotely like y = exp(a) + exp(b)*x.
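    A quick numeric check of the algebra in Stata (the values a = 1, b = 2, x = 0.5 are made up purely for illustration):

    Code:
    * exponentiating a linear equation is multiplicative, not additive
    display exp(1 + 2*0.5)          // 7.3890...
    display exp(1) * exp(0.5)^2     // e^a * (e^x)^b -- the same: 7.3890...
    display exp(1) + exp(2)*0.5     // exp(a) + exp(b)*x -- different: 6.4128...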



    • #3
      Thank you, Clyde. Still a bit confused, though. I coded a simple model as you suggested: sales = constant + change*x, and ln_sales = constant + change*x.

      [Attached image: Skjermbilde 2018-11-11 kl. 09.05.27.png -- Excel screenshot of the constants and coefficients from the two models]


      So, here are the constants and the coefficients from the two different models. In the sum row I've calculated 1,713,489.10 as a + 1*x, and 1,674,184.43 as exp(13.87)*exp(0.45) (note that not all decimals show in the Excel screenshot; the ln calculations are actually more precise than they appear). Shouldn't the two calculations have given me the same number?



      • #4
        No, but I think I see what you're thinking here. In #2, I simplified matters by leaving out the error terms in the model. If you include them, then you have:

        Code:
        log y = a + b x + error_log
        and exponentiating gives
        Code:
        y = e^a * (e^x)^b * e^(error_log)
        Now, here's the thing. The a and b in the first regression are chosen so as to minimize the sum of the squares of the error_log terms. (This, by the way, does not correspond to minimizing the product of the e^(error_log) terms.)

        But when you regress
        Code:
        y = c + d x + error_nolog
        c and d are chosen to minimize the sum of the error_nolog^2 terms, which is going to be a very different matter. There is no reason that c and d, chosen to satisfy this condition, should turn out equal to exp(a) and exp(b). There is simply no way to transform these equations into each other.
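        If it helps to see this concretely, here is a minimal simulation sketch (the variable names and true parameter values are made up for illustration): the c and d from the levels regression will not equal the exponentiated a and b from the log regression.

        Code:
        * simulate data in which ln(sales) is linear in x
        clear
        set seed 12345
        set obs 500
        generate x = rnormal()
        generate sales = exp(13.5 + 0.4*x + rnormal(0, 0.5))
        generate ln_sales = ln(sales)

        regress ln_sales x                        // estimates a and b
        display exp(_b[_cons]) "   " exp(_b[x])   // exp(a), exp(b)

        regress sales x                           // estimates c and d -- different numbers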

        It may be clearer if we go to an even simpler case. Let's forget about x and just regress on the constant term. The untransformed equation becomes:

        Code:
        y = c + error_nolog
        and it is well known that the value of c that minimizes the sum of the squared error_nolog terms is c = the mean of y.

        Now let's look at the log-transformed dependent variable:
        Code:
        log y = a + error_log
        By the same algebra as for the untransformed y, the value of a that minimizes the sum of the squared error_log terms is a = the mean of log y. But when you exponentiate the mean of log y, you do not get the mean of y; you get the geometric mean of y, which is equal to the mean of y only if all the values of y are equal, and is otherwise always smaller.
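        You can see the geometric-mean point directly in Stata; a quick sketch using the built-in auto dataset (any positive variable would do):

        Code:
        sysuse auto, clear
        summarize price
        display "arithmetic mean of price: " r(mean)
        generate ln_price = ln(price)
        summarize ln_price
        display "exp(mean of ln price):    " exp(r(mean)) "   (the geometric mean -- smaller)"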




        • #5
          Thank you so much, this really helped me a lot.

          One final question:

          I'm running this model over several different samples. To be more specific, this is a DiD model where I'm looking at the change in sales and ln_sales for different shops when a competing store opens nearby.

          The reason for my worry was that the DiD estimator showed different treatment effects depending on whether I was studying sales or ln_sales. Given what you have explained, I believe this means that a negative change in ln_sales and a positive change in sales can make sense if the confidence interval is broad?

          Another thing I find confusing: one of my regressions shows negative treatment effects on ln_sales for two of my shops. However, when I code the shops to be interpreted as if they were one unit, the overall treatment effect turns out to be negative. This might be difficult to answer without seeing my data/coding, but is there any way this can make sense?



          • #6
            Given what you have explained, I believe this means that a negative change in ln_sales and a positive change in sales can make sense if the confidence interval is broad?
            I assume here you mean a negative effect on log sales and a positive effect on sales? Yes, this can happen, and it actually has nothing to do with the confidence interval being broad. Remember that an additive change in log sales (which is what you are estimating with a regression using log sales as the dependent variable) corresponds exactly to a multiplicative change in sales. If log sales goes up by 1, sales goes up by a factor of 2.718...

            So here's a simple example. Suppose that before the competing store opens, the sales in store 1 are 1000 units per month and in store 2 they are 2000 units per month. Now a competitor opens near store 1, but not near store 2. Sales in store 1 drop to 500 units per month. For extraneous reasons, suppose sales in store 2 drop to 1200 units per month. Then looking just at sales, the decrease at store 1 (500 units) is smaller than the decrease at store 2 (800 units), so the DID estimator is (500 - 1000) - (1200 - 2000) = +300: a positive effect on sales. But multiplicatively, sales in store 1 dropped by 50%, whereas in store 2 they dropped by only 40%, so if we were to regress on log sales, the DID would be log 0.5 - log 0.6 = -0.182...: a negative effect on log sales. Notice that confidence intervals never enter into the calculations here--this conclusion would hold even if there were no noise or uncertainty in the system at all.
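            A quick check of that arithmetic in Stata (same made-up numbers as above):

            Code:
            * DID on the raw scale: (after - before, store 1) - (after - before, store 2)
            display (500 - 1000) - (1200 - 2000)                    // +300
            * DID on the log scale: the same contrast in ln(sales)
            display (ln(500) - ln(1000)) - (ln(1200) - ln(2000))    // -0.182...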

            One of my regressions shows negative treatment effects on ln_sales for two of my shops. However, when I code the shops to be interpreted as if they were one unit, the overall treatment effect turns out to be negative.
            Why does that surprise you? Did you mean to say that the overall treatment effect turns out to be positive when the two shops are coded as 1, even though it is negative for each shop separately? That kind of result often surprises people. But it is, in fact, perfectly sensible. It is known as Simpson's paradox. There is a really good explanation of Simpson's paradox on Wikipedia--better than anything I could write.
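            For what it's worth, here is a minimal made-up numeric illustration of how pooling can flip a sign (Simpson's paradox in arithmetic form; this is not a claim about your data):

            Code:
            * each shop's mean sales FALL by 10 from before to after:
            *   shop 1: before n=90, mean 100;   after n=10, mean 90
            *   shop 2: before n=10, mean 200;   after n=90, mean 190
            display (90*100 + 10*200)/100    // pooled before mean: 110
            display (10*90 + 90*190)/100     // pooled after mean:  180
            * pooled sales appear to rise by 70 even though each shop fell by 10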
