  • Question about log-logistic for health care cost

    Dear scientist,

    My question is how to use a log-logistic model for health care cost data. As we know, the distribution of health cost data is skewed to the right. I am considering three methods for modeling these cost data: (1) GLM with a gamma family, (2) OLS with a log-normal outcome, and (3) log-logistic. For example, y is the cost, and the covariates are age and group.

    For GLM I am thinking of using
    Code:
     glm y age i.group, family(gamma) link(log)
    For OLS with log-normal I am thinking of using
    Code:
     reg ln(y) age i.group
    Would you please tell me whether these two modeling methods are correct, and how I can model health cost with log-logistic?

    Thank you very much!

    Jack Liang Wang

  • #2
    Jack: There's no way to determine an absolutely "correct" approach. Years ago Will Manning and I published a paper that might give you some guidance on model selection given your actual data:
    https://www.ncbi.nlm.nih.gov/pubmed/11469231

    I would also recommend a recent Stata Press book that may be helpful in this regard:
    https://www.stata.com/bookstore/heal...s-using-stata/

    As for the specifications you describe, I would suggest adding the vce(robust) option to both specifications. Also, in your second specification you would need to define your LHS variable before estimation, e.g.
    Code:
    gen lny=ln(y)
    reg lny age i.group, vce(robust)
    As for log-logistic estimation, I would recommend first looking at help streg.
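
    For completeness, the first specification with the suggested vce(robust) option might look like the sketch below. It assumes the variable names y, age, and group from the original question and that the data are already in memory.

    Code:
    ```stata
    * Gamma GLM with log link and robust (sandwich) standard errors
    glm y age i.group, family(gamma) link(log) vce(robust)
    * Average predicted cost by group, on the original cost scale
    margins group
    ```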

    • #3
      In reply to John Mullahy (#2):
      Hi Dr. Mullahy,

      Thanks for your suggestion!

      Yes, streg has the options distribution(lognormal) and distribution(loglogistic). I am confused because streg is used for survival analysis, where the interest is in the observed time to death of patients or laboratory animals, but ours is health cost data. How can I connect the two?

      Best,

      Jack

      • #4
        In reply to John Mullahy (#2):
        Or, instead of streg, which would (I think) require setting up the data as survival data, he could use GSEM.

        Code:
        use http://www.stata-press.com/data/r15/mus03sub
        gen medexp = exp(lmedexp)
        gsem medexp <- income c.age##c.age totchr i.sex, family(loglogistic)
        margins
        The output would make some noise about this being an accelerated failure time model, and it would say that everyone failed, and it would display the time at "risk". The latter two correspond to everyone having positive expenditures and the total dollars spent.

        I can't say if this model makes sense or not, but if anyone wants to try it, this syntax will run. If you ran margins on it, you would see predicted spending amounts that at least look like they came from the same universe as the data. Although, interestingly enough, the grand mean is quite wrong. And for the record, you can simply change the family and link options to fit basically anything allowed by GLM. GSEM allows fewer link options.
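
        As a rough sketch of that last point, the same gsem call can be switched to a gamma family with a log link, and the margins grand mean can then be compared with the raw sample mean. It assumes the dataset and variables from the code above; no particular output is implied.

        Code:
        ```stata
        use http://www.stata-press.com/data/r15/mus03sub, clear
        generate double medexp = exp(lmedexp)
        * Same covariates as above, but gamma family with log link
        gsem medexp <- income c.age##c.age totchr i.sex, family(gamma) link(log)
        * Raw mean of spending in the estimation sample, for comparison
        summarize medexp if e(sample)
        * Average predicted spending from the fitted model
        margins
        ```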
        Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

        When presenting code or results, please use the code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.

        • #5
          In reply to Weiwen Ng (#4):
          Thank you very much for your response! I am confused about why we can use survival analysis to model our cost data.

          Best,

          Jack

          • #6
            Jack: This paper may give you some intuitions regarding using survival models for cost modeling. In essence, it's just a trick to set up an estimator and doesn't have anything to do with survival times per se. https://www.ncbi.nlm.nih.gov/pubmed/15322988

            • #7
              In reply to John Mullahy (#6):
              Got it! Thank you very much and have a good weekend!

              • #8
                In reply to Liang Wang Jack (#7):
                My answer crossed with John's. His response is exactly correct. Log-logistic distributions are used in survival analysis. They (and many other distributions) produce skewed survival times. You're basically telling Stata that everyone in your dataset has a bunch of covariates, and their survival time is the amount they spent. Stata will think it is modeling everyone's conditional "survival time" with a log-logistic model. You will know better.
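
                A minimal sketch of that setup, assuming the variables y, age, and group from the original question: the cost is declared as the analysis "time" with stset, and because no failure() option is given, every observation is treated as a failure, that is, as a fully observed cost.

                Code:
                ```stata
                * Declare cost as the analysis time; with no failure() option all records are failures
                * (stset excludes observations with y <= 0)
                stset y
                * Log-logistic accelerated failure-time model with cost in the role of time
                streg age i.group, distribution(loglogistic) vce(robust) nolog
                ```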

                I was confused why you mentioned log-logistic models, because I haven't seen them used to do anything apart from survival analysis in health services research. If you want to estimate a log-logistic model and examine its goodness of fit, I think you have the tools. To be honest, as John indicated, properly modeling healthcare spending is a formidable endeavor, and much ink has been spilled by people far smarter than I. If you're an applied analyst, you will probably be OK choosing something that is good enough. GLM or GEE with log link and gamma distribution is something I have seen a lot of people use. And, in fact, I have sometimes seen GLM with a Poisson distribution used as well.

                One of my lecturers advised us that there is a formal test for which GLM family is a closer approximation to the truth, in terms of the relationship between the variance and the mean. For example, as a population's mean healthcare spending rises, should its variance rise, or remain the same? Most high spenders probably have hospital visits, which tend to be unpredictable for most people. I'll leave you with a link to this presentation, which covers that and a few other issues. It's the family test for GLM, and no, I don't understand it well enough to explain it in English.
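
                The family test referred to here is presumably the modified Park test; that identification is an assumption, since the linked presentation is not reproduced in this thread. A rough sketch, using the variables y, age, and group from the original question (yhat, res2, and lnyhat are illustrative names):

                Code:
                ```stata
                * Step 1: fit a provisional log-link GLM and obtain predicted means
                glm y age i.group, family(gamma) link(log) vce(robust)
                predict double yhat, mu
                * Step 2: model the squared raw residuals as a function of ln(predicted mean)
                generate double res2 = (y - yhat)^2
                generate double lnyhat = ln(yhat)
                glm res2 lnyhat, family(gamma) link(log) vce(robust)
                * Step 3: the coefficient on lnyhat estimates the power relating variance to mean:
                *   0 = Gaussian, 1 = Poisson, 2 = gamma, 3 = inverse Gaussian
                test lnyhat = 0
                test lnyhat = 1
                test lnyhat = 2
                test lnyhat = 3
                ```

                The family whose hypothesized value is not rejected, or is closest to the estimated coefficient, would be the candidate to prefer.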

                • #9
                  In reply to Weiwen Ng (#8):
                  Thank you very much for your suggestion, Weiwen Ng. I am reading survival analysis materials now. Have a great week! Jack Wangliang

                  • #10
                    An additional reference: APPLYING BETA-TYPE SIZE DISTRIBUTIONS TO HEALTHCARE COST REGRESSIONS, by ANDREW M. JONES, JAMES LOMAS AND NIGEL RICE, JOURNAL OF APPLIED ECONOMETRICS (wileyonlinelibrary.com) DOI: 10.1002/jae.2334
                    SUMMARY: This paper extends the literature on modelling healthcare cost data by applying the generalised beta of the second kind (GB2) distribution to English hospital inpatient cost data. A quasi-experimental design, estimating models on a sub-population of the data and evaluating performance on another sub-population, is used to compare this distribution with its nested and limiting cases. While for these data the beta of the second kind (B2) distribution and generalised gamma (GG) distribution outperform the GB2, our results illustrate that the GB2 can be used as a device for choosing among competing parametric distributions for healthcare cost data.

                    Andrew Jones told me their project fitted GB2 distributions using gb2fit (on SSC), see also gb2lfit (also SSC)
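
                    A minimal sketch of getting started with these commands, assuming a cost variable named y as in the original question. Only the simplest call (no covariates) is shown; the help files describe how to let the parameters depend on covariates.

                    Code:
                    ```stata
                    * Install the user-written commands from SSC (needed once)
                    ssc install gb2fit
                    ssc install gb2lfit
                    * Fit a GB2 distribution to the cost variable by maximum likelihood
                    gb2fit y
                    * See the help file for further options
                    help gb2fit
                    ```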

                    • #11
                      In reply to Stephen Jenkins (#10):
                      Thank you very much for your advice.

                      • #12
                        In reply to John Mullahy (#6):
                        Hi Dr. Mullahy,

                        I read your paper, "Comparing alternative models: log vs Cox proportional hazard?", but I do not understand how to interpret the coefficients for the Cox model and the parametric (log-logistic) model. Would you please help me figure them out?

                        For example,

                        first,

                        Code:
                        stcox $xvar,nolog
                        [Image: stcox.png]


                        second,
                        Code:
                        streg $xvar , distribution(ll) nolog
                        [Image: streg with loglogistic.png]




                        Thank you very much!

                        Jack LiangWang

                        • #13
                          Jack: For streg you can compute marginal effects on mean outcomes. See help streg_postestimation##margins. I'm not exactly sure how to translate the log-logistic parameters into this framework (I don't think we worked this out explicitly in the paper), but margins should be helpful. As for stcox, it looks like the options available for margins are much more limited. The formulae for some of the conditional mean computations are in the paper. I hope this is useful.
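
                          As a sketch of what that could look like, assuming the variables y, age, and group from the original question in place of $xvar: the default prediction after streg is the median time, so the mean has to be requested explicitly. Note that the mean of a log-logistic distribution is finite only when the estimated gamma is below 1.

                          Code:
                          ```stata
                          * Cost plays the role of analysis time; all observations are failures
                          stset y
                          streg age i.group, distribution(loglogistic) nolog
                          * Average marginal effects on the predicted mean cost
                          margins, dydx(*) predict(mean time)
                          * Average predicted mean cost for each group
                          margins group, predict(mean time)
                          ```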

                          • #14
                            In reply to John Mullahy (#13):
                            Hi Dr. Mullahy,

                            After running the models below, distribution(ll) has the smaller AIC. What test can I perform in Stata to assess whether this distribution fits my data?

                            Code:
                            streg $xvar , distribution(ll) nolog
                            Code:
                            streg $xvar , distribution(lognormal) nolog
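
                            The AIC comparison described above can be tabulated side by side as in the sketch below. It assumes the data are already stset and the global $xvar is defined; m_ll and m_ln are illustrative names. This only ranks the two fits against each other and is not by itself a goodness-of-fit test.

                            Code:
                            ```stata
                            * Fit and store the log-logistic model
                            streg $xvar, distribution(loglogistic) nolog
                            estimates store m_ll
                            * Fit and store the lognormal model
                            streg $xvar, distribution(lognormal) nolog
                            estimates store m_ln
                            * Log likelihood, AIC, and BIC for both models in one table
                            estimates stats m_ll m_ln
                            ```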
                            Thanks,

                            Jack

                            • #15
                              Jack: Thanks for your question, but I don't have a good answer for you unfortunately. Others may have opinions, however, and hopefully they will weigh in if so.
