  • GLM with pileup at zero

    I am trying to estimate how catastrophic illnesses such as TB and AIDS affect spending on hospitalization. The dependent variable is per-hospitalization cost, and the independent variables are individual characteristics, almost all of them dummies, such as gender, head-of-household status, poverty status and, of course, a dummy for whether the person has the illness.

    As is to be expected, there is a substantial pile-up at zero: many respondents had no expenditure on hospitalization in the 12-month reference period. What would be the best way to deal with this in the model?

    For now I have converted cost to ln(1+cost), so as to keep all observations, and then run a GLM.

    Am I on the right track?

    Also posted on http://stats.stackexchange.com/quest...led-up-at-zero
    Last edited by Fatima Alvi; 30 Jun 2014, 15:39.

  • #2
    Perhaps -tpm- (available from SSC) would be appropriate. The help says "tpm fits a two-part regression model of depvar on indepvars. The first part models the probability that depvar>0 using a binary choice model (logit or probit). The second part models the distribution of depvar | depvar>0 using linear (regress) and generalized linear models (glm)."
    -------------------------------------------
    Richard Williams, Notre Dame Dept of Sociology
    StataNow Version: 19.5 MP (2 processor)

    EMAIL: [email protected]
    WWW: https://www3.nd.edu/~rwilliam


    • #3
      Cross-posted at http://stats.stackexchange.com/quest...led-up-at-zero

      Please read the FAQ Advice to see, among other details,

      1. Our cross-posting policy, which is that you are asked to tell us about it.

      2. Our preference for full real names, meaning minimally first name and family name.

      Note that a response variable being zero is itself totally consistent with many kinds of generalized linear model. Transforming the response in many ways makes the application of a generalized linear model moot, as much of the idea is that a link function does away with a need for transformation.

      Nor is it clear why cost is considered to be a count.

      Nevertheless the point made by Richard is probably the most crucial.


      • #4
        I didn't pay attention to the part about the dependent variable being a count. I'm not sure why that would be, but you could try zero-inflated models (estimated by zip and zinb) or hurdle models (e.g. use hnblogit, available from SSC). Even plain old nbreg might be enough. These are briefly discussed on pp. 19-20 of http://www3.nd.edu/~rwilliam/stats3/CountModels.pdf .

        Adding 1 to a count and then running a count model is, I believe, almost always a bad/terrible idea.
        -------------------------------------------
        Richard Williams, Notre Dame Dept of Sociology


        • #5
          Sorry for not mentioning the cross-posting. I wasn't aware that was required.

          A small clarification: I am not treating cost as a count variable. I meant that I transform cost to ln(1+cost) so that observations with cost = 0 do not drop out. For example, cost = 0 becomes ln(1) = 0 and cost = 765.17 becomes ln(765.17+1).
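
To make the transformation concrete, here is a minimal sketch in Python (the thread itself is about Stata; Python is used only for illustration). Apart from 765.17, the cost values are hypothetical. It shows that ln(1+cost) keeps the zeros, but that undoing the transformation on the mean of the logs does not give back the mean cost, which is one reason a model for ln(1+cost) does not directly describe mean spending.

```python
import math

# Hypothetical costs: three people with no hospitalization, two with some.
# Only 765.17 comes from the thread; the rest are made up for illustration.
costs = [0.0, 0.0, 0.0, 765.17, 1200.0]

# The transformation described above: ln(1 + cost), so a zero maps to ln(1) = 0.
log_costs = [math.log1p(c) for c in costs]

mean_cost = sum(costs) / len(costs)
mean_log = sum(log_costs) / len(log_costs)

# Naively undoing the transformation on the mean of the logs.
back_transformed = math.expm1(mean_log)

print(f"mean cost:             {mean_cost:.2f}")
print(f"exp(mean ln(1+c)) - 1: {back_transformed:.2f}")
```

The second number is far below the first because the logarithm compresses the large positive costs while the zeros stay at zero. A GLM with a log link models the log of the mean instead, so it does not need the response to be transformed.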


          • #6
            It's not clear what defines an observation here. But what's to explain if an observation is one with cost zero?


            • #7
              I am new to this site but have a related question. I'm trying to use a two-part model in Stata to explain health expenditures (there are a lot of zeros), but I have an endogenous variable in my model and would like to incorporate an instrumental variable approach. Do you know if this is possible with either TPM or TWOPM in Stata?


              • #8
                I'm not sure I understood your comment correctly, but an observation with zero cost represents no hospitalization in most cases. Only very rarely does zero cost imply a cost-free hospitalization in this dataset.

                So in a sense the coefficients capture not only the increase in cost associated with illness but also the chance that you will be hospitalized (I'm not sure if that's the correct way of looking at it).


                • #9
                  Correct. My model is on health spending, not necessarily hospital costs (like yours). I am just asking if anyone knows whether the TPM or TWOPM commands in Stata allow instrumental variables to be incorporated. As for why you have zeros, that is exactly what the two-part model is for. The first part uses all of your data to estimate the probability that you will have nonzero hospital costs. The second part explains those hospital costs as a function of your explanatory variables, given that the hospital costs are nonzero.
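
The two parts described above can be sketched by hand in a few lines of Python. This is an illustration under strong assumptions, not what tpm or twopm do internally: the data are hypothetical, and there is a single dummy regressor (`ill`). In that special case the fitted probabilities from a logit and the fitted means from a log-link GLM both reduce to simple group averages, so no optimizer is needed. With more regressors you would fit the two parts with actual estimation commands.

```python
# Hypothetical data: (ill, cost) pairs; cost 0.0 means no hospitalization.
data = [
    (0, 0.0), (0, 0.0), (0, 0.0), (0, 400.0),
    (1, 0.0), (1, 900.0), (1, 1500.0), (1, 600.0),
]

def two_part(rows):
    """Return (P(cost > 0), mean cost given cost > 0, expected cost)."""
    costs = [c for _, c in rows]
    positive = [c for c in costs if c > 0]
    p_positive = len(positive) / len(costs)        # part 1: uses all rows
    mean_positive = sum(positive) / len(positive)  # part 2: positive rows only
    # Combined prediction: E[cost] = P(cost > 0) * E[cost | cost > 0]
    return p_positive, mean_positive, p_positive * mean_positive

results = {}
for ill in (0, 1):
    group = [row for row in data if row[0] == ill]
    results[ill] = two_part(group)
    p, m, e = results[ill]
    print(f"ill={ill}: P(cost>0)={p:.2f}, E[cost|cost>0]={m:.2f}, E[cost]={e:.2f}")
```

The difference in expected cost between the two groups reflects both a higher chance of any hospitalization and a higher cost once hospitalized, which is the point raised in #8 about what the coefficients capture.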


                  • #10
                    As far as I can tell, tpm and twopm are the same program with different names. Neither seems to allow for instrumental variables.
                    -------------------------------------------
                    Richard Williams, Notre Dame Dept of Sociology
