Question about log-logistic for health care cost

Liang Wang Jack

Join Date: Dec 2016

Posts: 169
#1

05 Jan 2018, 12:07

Dear scientist,

My question is how to use a log-logistic model for health care cost data. As we know, the distribution of health care cost data is skewed to the right. I am considering three methods for modeling these cost data: (1) GLM with a gamma family, (2) OLS on log cost (lognormal), and (3) log-logistic. For example, y is the cost, and the covariates are age and group.

For the GLM, I am thinking of using

Code:

glm y age i.group, family(gamma) link(log)

For OLS with a lognormal outcome, I am thinking of using

Code:

reg ln(y) age i.group

Would you please tell me whether these two modeling methods are correct, and how I can model health care cost with a log-logistic distribution?

Thank you very much!

Jack Liang Wang
Tags: None
John Mullahy

Join Date: Dec 2016

Posts: 751
#2

05 Jan 2018, 12:35

Jack: There's no way to determine an absolutely "correct" approach. Years ago Will Manning and I published a paper that might give you some guidance on model selection given your actual data:
https://www.ncbi.nlm.nih.gov/pubmed/11469231

I would also recommend a recent Stata Press book that may be helpful in this regard:
https://www.stata.com/bookstore/heal...s-using-stata/

As for the specifications you describe, I would suggest adding the vce(robust) option to both specifications. Also, in your second specification you would need to define your LHS variable before estimation, e.g.

Code:

gen lny=ln(y)
reg lny age i.group, vce(robust)

As for log-logistic estimation, I would recommend first looking at help streg.
Liang Wang Jack

Join Date: Dec 2016

Posts: 169
#3

05 Jan 2018, 13:05

Hi Dr. Mullahy,

Thanks for your suggestion!

Yes, streg has the options distribution(lognormal) and distribution(loglogistic). I am confused, though: streg is used for survival analysis, where the interest is in the time to death of patients or laboratory animals, but ours is health care cost data. How can I connect the two?

Best,

Jack
Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#4

05 Jan 2018, 13:10

Or, instead of streg, which would (I think) require setting up the data as survival data, he could use GSEM.

Code:

use http://www.stata-press.com/data/r15/mus03sub
gen medexp = exp(lmedexp)
gsem medexp <- income c.age##c.age totchr i.sex, family(loglogistic)
margins

The output would make some noise about this being an accelerated failure time model, and it would say that everyone failed, and it would display the time at "risk". The latter two correspond to everyone having positive expenditures and the total dollars spent.

I can't say if this model makes sense or not, but if anyone wants to try it, this syntax will run. If you ran margins on it, you would see predicted spending amounts that at least look like they came from the same universe as the data. Although, interestingly enough, the grand mean is quite wrong. And for the record, you can simply change the family and link options to fit basically anything allowed by GLM. GSEM allows fewer link options.

Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

When presenting code or results, please use the code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.
Liang Wang Jack

Join Date: Dec 2016

Posts: 169
#5

05 Jan 2018, 13:30

Thank you very much for your response! I am confused about why we can use survival analysis to model our cost data.

Best,

Jack
John Mullahy

Join Date: Dec 2016

Posts: 751
#6

05 Jan 2018, 13:38

Jack: This paper may give you some intuitions regarding using survival models for cost modeling. In essence, it's just a trick to set up an estimator and doesn't have anything to do with survival times per se. https://www.ncbi.nlm.nih.gov/pubmed/15322988
Liang Wang Jack

Join Date: Dec 2016

Posts: 169
#7

05 Jan 2018, 13:48

Got it! Thank you very much and have a good weekend!
Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#8

05 Jan 2018, 15:55

My answer crossed with John's. His response is exactly correct. Log-logistic distributions are used in survival analysis. They (and many other distributions) produce skewed survival times. You're basically telling Stata that everyone in your dataset has a bunch of covariates, and their survival time is the amount they spent. Stata will think it is modeling everyone's conditional "survival time" with a log-logistic model. You will know better.
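
To make that concrete, here is a minimal sketch of the streg route, reusing the variable names y, age, and group from #1. It assumes every observation has strictly positive cost and that no cost is censored; vce(robust) simply follows the suggestion in #2. Treat it as a starting point, not a recommended specification.

Code:

* Declare cost as the "analysis time". With no failure() variable,
* stset treats every observation as an observed failure (no censoring).
stset y

* Log-logistic accelerated failure time model for cost
streg age i.group, distribution(loglogistic) vce(robust)

As far as I know, stset excludes records with a nonpositive time, so zero-cost observations would drop out; this sketch does nothing to handle them.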

I was confused why you mentioned log-logistic models, because I haven't seen them used to do anything apart from survival analysis in health services research. If you want to estimate a log-logistic model and examine its goodness of fit, I think you have the tools. To be honest, as John indicated, properly modeling healthcare spending is a formidable endeavor, and much ink has been spilled by people far smarter than I. If you're an applied analyst, you will probably be OK choosing something that is good enough. GLM or GEE with log link and gamma distribution is something I have seen a lot of people use. And, in fact, I have sometimes seen GLM with a Poisson distribution used as well.

One of my lecturers advised us that there is a formal test for which GLM family is a closer approximation to the truth, in terms of the relationship between the variance and the mean. For example, as a population's mean healthcare spending rises, should its variance rise, or remain the same? Most high spenders probably have hospital visits, which tend to be unpredictable for most people. I'll leave you with a link to this presentation, which covers that and a few other issues. It's the family test for GLM, and no, I don't understand it well enough to explain it in English.
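
For anyone who wants to try it, a rough sketch of how such a family test is often coded follows. It assumes the test meant here is what is usually called the modified Park test, which is a guess and not something confirmed by the presentation. It borrows y, age, and group from #1; yhat, res2, and lnyhat are names made up for this illustration.

Code:

* Step 1: fit a provisional GLM and get the predicted means
glm y age i.group, family(gamma) link(log) vce(robust)
predict double yhat, mu

* Step 2: regress the squared raw residuals on the log predicted mean
generate double res2 = (y - yhat)^2
generate double lnyhat = ln(yhat)
glm res2 lnyhat, family(gamma) link(log) vce(robust)

* Step 3: the coefficient on lnyhat suggests the variance power:
* about 0 Gaussian, 1 Poisson, 2 gamma, 3 inverse Gaussian
test lnyhat = 2

A coefficient near 2 that the last line does not reject would be consistent with the gamma family; a value near 1 would point toward Poisson.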

Liang Wang Jack

Join Date: Dec 2016

Posts: 169
#9

08 Jan 2018, 09:53

Thank you very much for your suggestion, Weiwen Ng. I am reading survival analysis materials now. Have a great week! Jack Wangliang
Stephen Jenkins

Join Date: Apr 2014

Posts: 1435
#10

08 Jan 2018, 10:16

An additional reference: "Applying Beta-Type Size Distributions to Healthcare Cost Regressions", by Andrew M. Jones, James Lomas, and Nigel Rice, Journal of Applied Econometrics (wileyonlinelibrary.com), DOI: 10.1002/jae.2334
SUMMARY: This paper extends the literature on modelling healthcare cost data by applying the generalised beta of the second kind (GB2) distribution to English hospital inpatient cost data. A quasi-experimental design, estimating models on a sub-population of the data and evaluating performance on another sub-population, is used to compare this distribution with its nested and limiting cases. While for these data the beta of the second kind (B2) distribution and generalised gamma (GG) distribution outperform the GB2, our results illustrate that the GB2 can be used as a device for choosing among competing parametric distributions for healthcare cost data.

Andrew Jones told me their project fitted GB2 distributions using gb2fit (on SSC); see also gb2lfit (also on SSC).
Liang Wang Jack

Join Date: Dec 2016

Posts: 169
#11

10 Jan 2018, 20:05

Thank you very much for your advice.
Liang Wang Jack

Join Date: Dec 2016

Posts: 169
#12

15 Jan 2018, 10:34

Hi Dr. Mullahy,

I read your paper, "Comparing alternative models: log vs Cox proportional hazard?", but I do not understand how to interpret the coefficients for the Cox model and for the parametric (log-logistic) model. Would you please help me figure them out?

For example,

first,

Code:

stcox $xvar,nolog

second,

Code:

streg $xvar , distribution(ll) nolog

Thank you very much!

Jack LiangWang
John Mullahy

Join Date: Dec 2016

Posts: 751
#13

15 Jan 2018, 15:51

Jack: For streg you can compute marginal effects on mean outcomes; see help streg_postestimation##margins. I'm not exactly sure how to translate the log-logistic parameters into this framework (I don't think we worked this out explicitly in the paper), but margins should be helpful. As for stcox, it looks like the options available for margins are much more limited. The formulae for some of the conditional mean computations are in the paper. I hope this is useful.
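
As a rough sketch of the margins route for streg, assuming the data are already stset on cost and that $xvar holds the covariate list from #12 (with factor-variable notation where appropriate):

Code:

streg $xvar, distribution(loglogistic) nolog

* Average marginal effects on the predicted mean "time", i.e. mean cost.
* For the log-logistic, the mean exists only when the estimated
* gamma is below 1.
margins, dydx(*) predict(mean time)

* Predicted mean cost averaged over the estimation sample
margins, predict(mean time)

As far as I know, margins after streg without a predict() option works with the predicted median, not the mean, so the option matters if mean cost is the target.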
Liang Wang Jack

Join Date: Dec 2016

Posts: 169
#14

24 Jan 2018, 12:02

Hi Dr. Mullahy,

Of the two models below, the one with distribution(ll) has the smaller AIC. What test should I perform in Stata to assess whether this distribution fits my data?

Code:

streg $xvar , distribution(ll) nolog

Code:

streg $xvar , distribution(lognormal) nolog

Thanks,

Jack
John Mullahy

Join Date: Dec 2016

Posts: 751
#15

24 Jan 2018, 14:34

Jack: Thanks for your question, but I don't have a good answer for you unfortunately. Others may have opinions, however, and hopefully they will weigh in if so.