Cost Analysis - Log Transform and Smearing

Jon Heintz

Join Date: Sep 2022
Posts: 24

24 Oct 2022, 06:48

I am completing a cost analysis using a log transformation, i.e., log(cost) = treatment + var1 + var2. I am interested in testing for the marginal difference between treatments. I can easily find the marginal log(cost) for each treatment group. Three questions:

1. If I exponentiate the marginal log(cost) for each treatment, is that a biased estimator of cost?
2. Can I exponentiate the marginal log(cost) for each treatment and take the difference to estimate the difference between treatments?
3. If either 1 or 2 is biased, how do I use a smearing estimator?

I have the Stata code below. The top table is the Stata output, followed by my work exponentiating the marginal log(cost) for each treatment and calculating the difference. I appreciate the help.

Code:

 
 regress log_cost i.treatment i.var1 i.var2
 margins treatment

                       Delta-method
Treatment     Margin      std. err.         t    P>t    [95% conf. interval]
0           11.315          0.01456    551.1      0     11.28646    11.34354
1           11.05746        0.03548    442.32     0     10.98792    11.127

exp(margin), treatment 0    82043.1
exp(margin), treatment 1    63415.27

Difference                  18627.83
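
For reference, the hand calculation above can also be done inside Stata, which attaches a delta-method standard error to the difference. This is only a minimal sketch, not the poster's actual workflow: the auto dataset stands in for the cost data, with price as cost, foreign as the treatment indicator, and mpg and weight as placeholder covariates. It is still the naive retransformation, so it shows how to compute the quantity in question 2, not whether that quantity is unbiased for mean cost.

```stata
* Assumed stand-ins: price = cost, foreign = treatment (not the poster's data)
sysuse auto, clear
generate double log_cost = ln(price)
regress log_cost i.foreign mpg weight

* Marginal log(cost) by group; -post- makes the margins available to -nlcom-
margins foreign, post

* exp() of each margin and their difference, with delta-method std. errs.
nlcom (exp0: exp(_b[0bn.foreign])) (exp1: exp(_b[1.foreign])) ///
      (diff: exp(_b[0bn.foreign]) - exp(_b[1.foreign]))
```

The test reported on the diff line is a test of the difference between the two exponentiated margins, which is the quantity computed by hand above.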

Tags: cost, log transformation, smear

Rich Goldstein

Join Date: Mar 2014

Posts: 4545
#2

24 Oct 2022, 07:09

1. yes, this is biased; see Miller, Don M. (May 1984), "Reducing transformation bias in curve fitting," The American Statistician, 38: 124-126

3. there may be more modern programs for this but many years ago I wrote "predlog" for the STB - you can find and download using -search predlog-; for the caveats mentioned in the help file, you may want to look at the STB article (cited in the help file)
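
The bias in question 1 follows from one line of algebra. As a sketch, assume the model ln y = xβ + ε with E[ε | x] = 0 (this notation is introduced here for illustration and is not taken from the posts); b and e_i denote the fitted coefficients and residuals:

```latex
\begin{aligned}
E[y \mid x] &= \exp(x\beta)\, E[\exp(\varepsilon) \mid x] \\
            &\ge \exp(x\beta)\, \exp\bigl(E[\varepsilon \mid x]\bigr) = \exp(x\beta)
              && \text{(Jensen's inequality)} \\
\varepsilon \mid x \sim N(0, \sigma^2) &\;\Rightarrow\; E[y \mid x] = \exp\bigl(x\beta + \sigma^2/2\bigr) \\
\widehat{E}[y \mid x_0] &= \exp(x_0 b)\, \frac{1}{n} \sum_{i=1}^{n} \exp(e_i)
              && \text{(Duan's smearing estimate)}
\end{aligned}
```

So exponentiating the predicted log understates mean cost unless the errors are degenerate. The smearing estimate replaces the unknown factor E[exp(ε)] with the sample average of exp(residual), which presumes the error distribution does not depend on x; if it differs by treatment group, a single factor does not repair the difference between treatments.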
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17851
#3

24 Oct 2022, 07:49

Jon:
smearing when back-transforming could let you down (and painfully so).
My preference for -glm- with a log link and gamma family comes from several dreadful experiences with healthcare cost data that were logged and then back-transformed via Duan's smear (https://www.jstor.org/stable/2288126), with disappointing results when contrasted against the raw scale.
This issue is well covered in https://www.stata.com/bookstore/heal...cs-using-stata , pages 96-99.

Kind regards,
Carlo
(Stata 19.0)
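
A minimal sketch of the -glm- approach mentioned above, using the auto dataset as an assumed stand-in (price as cost, foreign as treatment, mpg and weight as placeholder covariates; none of these come from the thread). Because the model is fit on the raw cost scale, -margins- returns predicted mean cost directly and no retransformation is needed.

```stata
* Assumed stand-ins: price = cost, foreign = treatment
sysuse auto, clear

* Gamma family with log link, fit to cost on its original scale
glm price i.foreign mpg weight, family(gamma) link(log) vce(robust)

* Predicted mean cost by group (default prediction after -glm- is mu)
margins foreign

* Difference in predicted mean cost, treatment 1 vs. treatment 0
margins r.foreign
```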

Nick Cox

Join Date: Mar 2014
Posts: 36053
#4

24 Oct 2022, 08:16

Here is a command, smear, from my files. All I remember is that this was discussed on Statalist, that I thought my code was a smidgen more modern than Rich's command, and evidently that I never got around to posting it on SSC or publishing it in the Stata Journal, perhaps because the deal boils down to one line of algebra.

Code:

*! NJC 1.1.0 8 January 2005 
* NJC 1.0.0 13 September 2002 
program smear, rclass  
    version 8.0
    syntax [if] [in] [, Generate(str) OUTofsample ]  

    if "`generate'" != "" { 
        capture confirm new variable `generate' 
        if _rc {
            di as err "option syntax is generate(newvar)" 
            exit _rc 
        }
    }     

    marksample touse 
    qui count if `touse' 
    if r(N) == 0 error 2000
    
    tempvar resid yhatraw
    tempname rmse cf 

    qui { 
        * will exit with error message if no estimates 
        scalar `rmse' = e(rmse)
        
        if "`outofsample'" != "" predict double `yhatraw' 
        else predict double `yhatraw' if e(sample) 
        
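        * residuals from the estimation sample only, so that the
        * correction factor is not affected by out-of-sample observations
        predict double `resid' if e(sample), res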
        replace `resid' = exp(`resid') 
        su `resid', meanonly 
        scalar `cf' = `r(mean)'

        if "`generate'" != "" { 
            gen double `generate' = exp(`yhatraw') * `cf' if `touse'
            la var `generate' "smeared retransformation"
        }     
    }     

    di as res scalar(`cf') 
    return scalar smearcf = `cf' 
end

Code:

{smcl}
{* 6 September 2002/4 January 2005}{...}
{hline}
help for {hi:smear}
{hline}

{title:Smearing retransformation after regression with logged response}

{p 8 17 2}
{cmd:smear} 
[
{cmd:,}
{cmdab:g:enerate(}{it:newvar}{cmd:)} 
{cmdab:out:ofsample} 
]  


{title:Description}

{p 4 4 2}{cmd:smear} produces a smearing estimate of the expected response on 
the untransformed scale. It may be used after fitting a linear 
regression model on the logged response. Without options just 
the correction factor ave(exp(residual)) is displayed. 


{title:Remarks}

{p 4 4 2}Suppose you have response {it:y} and a set of covariates {it:X} and you 
fit a regression model to ln {it:y} and {it:X}. {cmd:smear} calculates 
the smearing estimate of the expected response proposed by Duan (1983), 
which is, for given {it:x_0}, estimates {it:b} and residuals 
{it:e}, ave(exp({it:x_0 b} + {it:e})) = exp({it:x_0 b}) * ave(exp({it:e})). 

{p 4 4 2}{cmd:smear} is based on {cmd:predlog} (Goldstein 1996). 


{title:Options}

{p 4 8 2}{cmd:generate(}{it:newvar}{cmd:)} specifies the name of a new  
variable to hold the smearing estimate.  

{p 4 8 2}{cmd:outofsample} specifies that estimates are to be produced 
for all values of the variables in the model. The 
default is to use only observations in the estimation sample.  


{title:Examples}

{p 4 8 2}{cmd:. gen lny = ln(y)}

{p 4 8 2}{cmd:. regress lny x1 x2 x3}

{p 4 8 2}{cmd:. smear, g(smear)}

  
{title:Saved results} 

{p 4 8 2}{cmd:r(smearcf)}{space 8}correction factor ave(exp({it:e}))


{title:Author} 

{p 4 4 2}Nicholas J. Cox, University of Durham, U.K.{break} 
        [email protected]


{title:References}     

{p 4 8 2}Duan, N. 1983. Smearing estimate: a nonparametric retransformation method. 
{it:Journal, American Statistical Association} 78: 605-610.

{p 4 8 2}Goldstein, R. 1996. Predictions in the original metric for log-transformed 
models. {it:Stata Technical Bulletin} 29: 27-29 ({it:STB Reprints} 5: 145-147)
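
For readers who want to see the arithmetic without installing anything, here is a minimal sketch of the same smearing calculation done by hand and then carried through -margins- to get group means and their difference on the cost scale. The auto dataset is an assumed stand-in (price as cost, foreign as treatment, mpg and weight as placeholder covariates). Two caveats are assumptions of this sketch rather than claims from the thread: a single correction factor presumes the error distribution is the same in both groups, and the standard errors from -margins- treat the factor as a known constant.

```stata
* Assumed stand-ins: price = cost, foreign = treatment
sysuse auto, clear
generate double log_cost = ln(price)
regress log_cost i.foreign mpg weight

* Duan's correction factor: mean of exp(residual) over the estimation sample
predict double ehat if e(sample), residuals
generate double exp_ehat = exp(ehat)
summarize exp_ehat, meanonly
local cf = r(mean)
display "smearing correction factor = " `cf'

* Naive retransformation: average of exp(xb) with everyone set to each group
margins foreign, expression(exp(predict(xb)))

* Smeared retransformation, then the between-group difference in cost
margins foreign, expression(exp(predict(xb)) * `cf')
margins r.foreign, expression(exp(predict(xb)) * `cf')
```

Note that -margins- with expression() averages exp(xb) over the sample, which is not the same number as exponentiating the averaged log(cost) margin.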


Jon Heintz

Join Date: Sep 2022

Posts: 24
#5

24 Oct 2022, 08:57

Carlo Lazzaro, I wholeheartedly agree with you. I have had a few cost projects recently and have relied heavily on a gamma GLM with log link. Unfortunately, this strategy didn't work with this particular dataset because of outliers: with the log transformation, the results matched with and without the outliers, while with the gamma GLM they did not.
Jon Heintz

Join Date: Sep 2022

Posts: 24
#6

24 Oct 2022, 08:58

Rich Goldstein and Nick Cox, thank you both for the insights
Joao Santos Silva

Join Date: Apr 2014

Posts: 3063
#7

24 Oct 2022, 09:39

Dear Jon Heintz,

With regard to #5, I suggest you try the Poisson rather than the gamma family.

Best wishes,

Joao
Jon Heintz

Join Date: Sep 2022

Posts: 24
#8

24 Oct 2022, 11:12

Joao Santos Silva, interesting: the Poisson family seems to provide the best predictions. I must admit I am a little confused, though, because I thought Poisson was used for count data, not continuous data like cost.
Rich Goldstein

Join Date: Mar 2014

Posts: 4545
#9

24 Oct 2022, 11:15

see the following Stata blog: https://blog.stata.com/2011/08/22/us...tell-a-friend/

more generally, note that poisson can be used in many other cases including survival analysis and to estimate risk ratios for binary outcomes
Jon Heintz

Join Date: Sep 2022

Posts: 24
#10

24 Oct 2022, 11:31

Interesting. On closer inspection, I clearly have overdispersion. Can quasi-Poisson models be used with continuous data, and if so, is there a Stata command for it?
Joao Santos Silva

Join Date: Apr 2014

Posts: 3063
#11

24 Oct 2022, 12:21

Dear Jon Heintz,

In response to #8, please see

Santos Silva, J.M.C. and Tenreyro, S. (2006), The Log of Gravity, The Review of Economics and Statistics, 88(4), pp. 641-658.

About #10, please note that overdispersion is only defined for count data. In your case, the relation between the mean and the variance changes if you change the scale of the data, and therefore it does not make sense to talk about overdispersion. In any case, Poisson regression is robust to overdispersion.

Best wishes,

Joao
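
A minimal sketch of the Poisson suggestion from #7, using the auto dataset as an assumed stand-in (price as cost, foreign as treatment, mpg and weight as placeholder covariates). The robust variance estimator is what allows inference without requiring the variance to equal the mean. The final -glm- line, with scale(x2), is one way to obtain quasi-Poisson-style standard errors; it is offered here as an option, not as something recommended in the thread.

```stata
* Assumed stand-ins: price = cost, foreign = treatment
sysuse auto, clear

* Poisson regression on cost itself, with robust standard errors
poisson price i.foreign mpg weight, vce(robust)

* Predicted mean cost by group and the difference between groups
margins foreign
margins r.foreign

* Same point estimates; std. errs. scaled by Pearson chi-squared / df
glm price i.foreign mpg weight, family(poisson) link(log) scale(x2)
```

With a non-integer outcome, -poisson- still fits the model but may display a note about the noncount dependent variable.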