  • Cost Analysis - Log Transform and Smearing

    I am completing a cost analysis using a log transformation, i.e., log(cost) = treatment + var1 + var2. I am interested in testing for the marginal difference between treatments. I can easily find the marginal log(cost) for each treatment group. Three questions:
    1. If I exponentiate the marginal log(cost) for each treatment, is that a biased estimator of cost?
    2. Can I exponentiate the marginal log(cost) for each treatment and take the difference to estimate the difference between treatments?
    3. If either 1 or 2 is biased, how do I use a smearing estimator?
    My Stata code is below. The top table is the Stata output, followed by my own work exponentiating the marginal log(cost) for each treatment and calculating the difference. I appreciate the help.

    Code:
    regress log_cost i.treatment i.var1 i.var2
    margins treatment

                            Delta-method
    Treatment      Margin    std. err.         t    P>t    [95% conf. interval]
    0              11.315      0.01456     551.1      0    11.28646    11.34354
    1            11.05746      0.03548    442.32      0    10.98792      11.127

    exp(0)    82043.1
    exp(1)   63415.27
    Diff     18627.83

  • #2
    1. yes, this is biased; see Miller, Don M. (May 1984), "Reducing transformation bias in curve fitting," The American Statistician, 38: 124-126

    3. there may be more modern programs for this but many years ago I wrote "predlog" for the STB - you can find and download using -search predlog-; for the caveats mentioned in the help file, you may want to look at the STB article (cited in the help file)
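
    To spell out why the answer to question 1 is yes, here is a short sketch. It assumes the fitted model is log(cost) = x*beta + e with E[e | x] = 0; the notation is introduced here for illustration and is not the original poster's.

    ```latex
    \begin{align*}
    E[\text{cost} \mid x] &= E\!\left[\exp(x\beta + \varepsilon) \mid x\right]
      = \exp(x\beta)\, E\!\left[\exp(\varepsilon) \mid x\right] \\
    E\!\left[\exp(\varepsilon) \mid x\right] &\ge \exp\!\left(E[\varepsilon \mid x]\right) = 1
      \quad \text{(Jensen's inequality)} \\
    \widehat{E}[\text{cost} \mid x_0] &= \exp(x_0 b) \cdot \frac{1}{n} \sum_{i=1}^{n} \exp(e_i)
      \quad \text{(Duan's smearing estimate)}
    \end{align*}
    ```

    So exponentiating a predicted log(cost) targets the geometric mean rather than the arithmetic mean of cost, and it falls below the arithmetic mean whenever the errors are not all zero. The smearing factor in the last line is only appropriate if the distribution of the errors does not depend on x.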

    • #3
      Jon:
      smearing in back transforming could let you down (and painfully so).
      My preference for -glm- with a log link and gamma family comes from several dreadful experiences with healthcare cost data that were logged and then back-transformed via Duan's smear (https://www.jstor.org/stable/2288126): the results were disappointing when compared with the costs on their raw scale.
      This issue is well covered in https://www.stata.com/bookstore/heal...cs-using-stata , pages 96-99.
      Kind regards,
      Carlo
      (Stata 19.0)
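
      For readers who want to try the approach described in this post, a minimal sketch follows. It assumes the untransformed costs are stored in a variable named cost (a hypothetical name; the thread only shows log_cost) and reuses the covariates from #1. It is an illustration, not output from this dataset.

      ```stata
      * Gamma GLM with log link, fitted to cost on its raw scale,
      * so no back-transformation is needed.
      * cost is a hypothetical name for the untransformed cost variable.
      glm cost i.treatment i.var1 i.var2, family(gamma) link(log) vce(robust)

      * Predictive margins; after -glm- these are on the cost scale by default.
      margins treatment

      * Difference between treatment 1 and treatment 0 on the cost scale,
      * with a delta-method standard error.
      margins r.treatment
      ```

      Because the model is for the mean of cost itself, the margins and their contrast can be read directly in cost units.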

      • #4

        Here is a command smear from my files. My memory isn't more than that this was discussed on Statalist; I thought my code was a smidgen more modern than Rich's command; and evidently that I never got around to posting this on SSC or publishing in the Stata Journal, perhaps because the deal boils down to one line of algebra.

        Code:
        *! NJC 1.1.0 8 January 2005 
        * NJC 1.0.0 13 September 2002 
        program smear, rclass  
            version 8.0
            syntax [if] [in] [, Generate(str) OUTofsample ]  
        
            if "`generate'" != "" { 
                capture confirm new variable `generate' 
                if _rc {
                    di as err "option syntax is generate(newvar)" 
                    exit _rc 
                }
            }     
        
            marksample touse 
            qui count if `touse' 
            if r(N) == 0 error 2000
            
            tempvar resid yhatraw
            tempname rmse cf 
        
            qui { 
                * will exit with error message if no estimates 
                scalar `rmse' = e(rmse)
                
                if "`outofsample'" != "" predict double `yhatraw' 
                else predict double `yhatraw' if e(sample) 
                
                predict double `resid', res
                replace `resid' = exp(`resid') 
                su `resid', meanonly 
                scalar `cf' = `r(mean)'
        
                if "`generate'" != "" { 
                    gen double `generate' = exp(`yhatraw') * `cf' if `touse'
                    la var `generate' "smeared retransformation"
                }     
            }     
        
            di as res scalar(`cf') 
            return scalar smearcf = `cf' 
        end
        Code:
        {smcl}
        {* 6 September 2002/4 January 2005}{...}
        {hline}
        help for {hi:smear}
        {hline}
        
        {title:Smearing retransformation after regression with logged response}
        
        {p 8 17 2}
        {cmd:smear} 
        [
        {cmd:,}
        {cmdab:g:enerate(}{it:newvar}{cmd:)} 
        {cmdab:out:ofsample} 
        ]  
        
        
        {title:Description}
        
        {p 4 4 2}{cmd:smear} produces a smearing estimate of the expected response on 
        the untransformed scale. It may be used after fitting a linear 
        regression model on the logged response. Without options just 
        the correction factor ave(exp(residual)) is displayed. 
        
        
        {title:Remarks}
        
        {p 4 4 2}Suppose you have response {it:y} and a set of covariates {it:X} and you 
        fit a regression model to ln {it:y} and {it:X}. {cmd:smear} calculates 
        the smearing estimate of the expected response proposed by Duan (1983), 
        which is, for given {it:x_0}, estimates {it:b} and residuals 
        {it:e}, ave(exp({it:x_0 b} + {it:e})) = exp({it:x_0 b}) * ave(exp({it:e})). 
        
        {p 4 4 2}{cmd:smear} is based on {cmd:predlog} (Goldstein 1996). 
        
        
        {title:Options}
        
        {p 4 8 2}{cmd:generate(}{it:newvar}{cmd:)} specifies the name of a new  
        variable to hold the smearing estimate.  
        
        {p 4 8 2}{cmd:outofsample} specifies that estimates are to be produced 
        for all values of the variables in the model. The 
        default is to use only observations in the estimation sample.  
        
        
        {title:Examples}
        
        {p 4 8 2}{cmd:. gen lny = ln(y)}
        
        {p 4 8 2}{cmd:. regress lny x1 x2 x3}
        
        {p 4 8 2}{cmd:. smear, g(smear)}
        
          
        {title:Saved results} 
        
        {p 4 8 2}{cmd:r(smearcf)}{space 8}correction factor ave(exp({it:e}))
        
        
        {title:Author} 
        
        {p 4 4 2}Nicholas J. Cox, University of Durham, U.K.{break} 
                [email protected]
        
        
        {title:References}     
        
        {p 4 8 2}Duan, N. 1983. Smearing estimate: a nonparametric retransformation method. 
        {it:Journal, American Statistical Association} 78: 605-610.
        
        {p 4 8 2}Goldstein, R. 1996. Predictions in the original metric for log-transformed 
        models. {it:Stata Technical Bulletin} 29: 27-29 ({it:STB Reprints} 5: 145-147)
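
        As a rough illustration of how the correction factor could be applied to the regression in #1, the sketch below computes ave(exp(residual)) by hand, mirroring the lines of the program above, and passes it to margins. The variable names ehat and exp_ehat are hypothetical, and the code is meant to be run as one do-file so that the local macro survives.

        ```stata
        * Sketch only: reproduces the correction factor ave(exp(residual)) by hand.
        * ehat and exp_ehat are hypothetical variable names.
        regress log_cost i.treatment i.var1 i.var2

        predict double ehat if e(sample), residuals
        generate double exp_ehat = exp(ehat)
        summarize exp_ehat, meanonly
        local cf = r(mean)
        display "smearing correction factor = " `cf'

        * Average of exp(xb) * cf with treatment set to each level in turn.
        * The factor is treated as a known constant, so the delta-method
        * standard errors ignore its sampling variability.
        margins treatment, expression(exp(predict(xb)) * `cf')

        * Difference between treatments on the cost scale.
        margins r.treatment, expression(exp(predict(xb)) * `cf')
        ```

        These margins average exp(xb) over the observations, so they will generally differ from exponentiating the marginal log(cost) and multiplying by the factor. A bootstrap of the whole sequence would be one way to account for the estimated factor.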


        • #5
          Carlo Lazzaro, I wholeheartedly agree with you. I have had a few cost projects recently and have relied heavily on a gamma GLM with log link. Unfortunately, this strategy didn't work with this particular dataset because of the outlier situation. With the log transformation, the results matched with and without outliers, while with the gamma GLM they did not.

          • #6
            Rich Goldstein and Nick Cox, thank you both for the insights

            • #7
              Dear Jon Heintz,

              With regards to #5, I suggest you try the Poisson rather than gamma family.

              Best wishes,

              Joao
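
              A minimal sketch of this suggestion, assuming the untransformed costs are in a variable named cost (a hypothetical name) and keeping the covariates from #1:

              ```stata
              * Poisson regression used as a model for the conditional mean of cost.
              * cost is a hypothetical name for the untransformed cost variable.
              * vce(robust) is requested because cost is not a count, so the
              * Poisson variance assumption is not expected to hold.
              poisson cost i.treatment i.var1 i.var2, vce(robust)

              * Predictive margins on the cost scale (the default prediction is exp(xb)).
              margins treatment

              * Difference between treatment 1 and treatment 0 on the cost scale.
              margins r.treatment
              ```

              Stata accepts a noninteger dependent variable here, typically with a note that interpreting the noncount outcome is the user's responsibility.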

              • #8
                Joao Santos Silva, interesting, and the Poisson family seems to provide the best predictions. I must admit, I am a little confused, though, because I thought Poisson was used for count data, not continuous data like cost.

                • #9
                  see the following Stata blog: https://blog.stata.com/2011/08/22/us...tell-a-friend/

                  more generally, note that poisson can be used in many other cases including survival analysis and to estimate risk ratios for binary outcomes

                  • #10
                    Interesting. On closer inspection, I clearly have overdispersion. Can quasi-Poisson models be used with continuous data, and if so, is there a Stata command for it?

                    • #11
                      Dear Jon Heintz,

                      In response to #8, please see

                      Santos Silva, J.M.C. and Tenreyro, S. (2006), The Log of Gravity, The Review of Economics and Statistics, 88(4), pp. 641-658.

                      About #10, please note that overdispersion is only defined for count data. In your case, the relation between the mean and the variance changes if you change the scale of the data, and therefore it does not make sense to talk about overdispersion. In any case, Poisson regression is robust to overdispersion.

                      Best wishes,

                      Joao
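
                      For readers looking for a command along the lines asked about in #10, the sketch below shows two options that exist in -glm-. It assumes a variable named cost (hypothetical) holding the untransformed costs, and it is offered as an illustration rather than a recommendation from the posters.

                      ```stata
                      * cost is a hypothetical name for the untransformed cost variable.
                      * Poisson family with log link; standard errors rescaled by the
                      * Pearson chi-squared dispersion (a quasi-Poisson style adjustment).
                      glm cost i.treatment i.var1 i.var2, family(poisson) link(log) scale(x2)

                      * Alternative in line with the advice above: robust standard errors.
                      glm cost i.treatment i.var1 i.var2, family(poisson) link(log) vce(robust)
                      ```

                      Both calls should give the same coefficient estimates as -poisson-; only the standard errors differ.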
