  • Model for overdispersed continuous dependent variable

    Hi,
    My dependent variable, absent days, is continuous and overdispersed, as the tabulation and summary statistics below show.
    Code:
    . tab absdays_vacationX_2
    
       combined |
            for |
      1990-2011 |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |        740       40.28       40.28
              1 |          3        0.16       40.45
              2 |         15        0.82       41.26
              3 |         16        0.87       42.13
              4 |         16        0.87       43.00
              5 |        194       10.56       53.57
              6 |          2        0.11       53.67
              7 |          5        0.27       53.95
              8 |          9        0.49       54.44
             10 |        195       10.62       65.05
             14 |          2        0.11       65.16
             15 |        105        5.72       70.88
             16 |          2        0.11       70.99
             17 |          1        0.05       71.04
             20 |         47        2.56       73.60
           21.5 |         14        0.76       74.36
             25 |         27        1.47       75.83
           26.5 |          1        0.05       75.88
             28 |          1        0.05       75.94
             30 |         61        3.32       79.26
           31.5 |          1        0.05       79.31
             32 |          1        0.05       79.37
             35 |         32        1.74       81.11
             40 |         35        1.91       83.02
           41.5 |          1        0.05       83.07
             43 |         18        0.98       84.05
           43.5 |          3        0.16       84.21
             45 |         26        1.42       85.63
             50 |         25        1.36       86.99
           53.5 |          1        0.05       87.04
             55 |         14        0.76       87.81
             60 |         21        1.14       88.95
           63.5 |          2        0.11       89.06
           64.5 |         34        1.85       90.91
             65 |         25        1.36       92.27
             70 |         17        0.93       93.20
             75 |         15        0.82       94.01
             80 |         11        0.60       94.61
             85 |         13        0.71       95.32
             86 |         11        0.60       95.92
           86.5 |          4        0.22       96.14
             90 |          5        0.27       96.41
           90.5 |          1        0.05       96.46
             95 |          3        0.16       96.62
            100 |          7        0.38       97.01
            105 |          3        0.16       97.17
          107.5 |          8        0.44       97.60
          108.5 |          1        0.05       97.66
            110 |          7        0.38       98.04
            115 |          3        0.16       98.20
            125 |          1        0.05       98.26
            129 |          7        0.38       98.64
            130 |         11        0.60       99.24
            145 |          1        0.05       99.29
            150 |          1        0.05       99.35
          150.5 |          2        0.11       99.46
            172 |          1        0.05       99.51
            185 |          1        0.05       99.56
          193.5 |          2        0.11       99.67
            195 |          2        0.11       99.78
            210 |          1        0.05       99.84
            215 |          1        0.05       99.89
            230 |          1        0.05       99.95
            255 |          1        0.05      100.00
    ------------+-----------------------------------
          Total |      1,837      100.00
    
    . 
    . sum absdays_vacationX_2
    
        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
    absdays_v~_2 |      1,837    19.17719    31.16322          0        255
    [Image: absent days vacation.png]


    Would it be appropriate to convert this into a count variable by rounding up the half days and then run a negative binomial regression? Alternatively, I have read that a gamma regression with a log link can also be used for overdispersed continuous dependent variables. Would that be better?

    Any other suggestions would be very helpful!

    Thank you!

    Best,
    Surya

  • #2
    Hello Surya,
    The model you choose must not only be data-driven but also suited to the question you're investigating, the null hypothesis you want to test, etc.
    It should also fit your data and their structure (panel, etc.), but this comes second.
    So without knowing what you want to do with these data, we cannot give you much advice on which model to use.

    Also, I'm not sure about the term "overdispersed". I would rather say that the distribution is very concentrated, as a power law would be. But this is very common (at least in my field, economics) and doesn't rule out any model per se (as far as I know). Have you compared your data to some empirical example or theoretical prediction to justify calling it "overdispersed"?
    Have you thought about log-transforming your data (again, it depends on your research question)? It would reduce the upward variability, but it would cause problems with the zero values.

    Best,
    Charlie


    • #3
      Hi,

      Charlie Joyez makes some important points about model selection. It may be useful to read up on count data models, where you will find information on overdispersion and how to deal with it. For example, see Hilbe, Modeling Count Data.


      • #4
        Hi Surya,

        You could consider a "two-part" model -- one for the zeros and one for the positive values. If you go the route of using a gamma regression, you'll have to do that because zeros are beyond the support of a gamma distribution. You can do this in Stata using "gsem". We discuss such a model and show Stata syntax in:

        Baldwin, S. A., Fellingham, G. W., & Baldwin, A. S. (2016). Statistical models for multilevel skewed physical activity data in health research and behavioral medicine. Health Psychology, 35(6), 552–562. http://doi.org/10.1037/hea0000292

        We also cite a number of other sources that discuss similar models (e.g., two-part log-normal models).
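
        As a rough sketch of the two-part idea, and not the gsem syntax from the paper, the two parts can also be fit as separate commands. The names any_absent, policy, and x1 are hypothetical; only absdays_vacationX_2 comes from the original post.
        Code:
        * Part 1: model whether any absence occurs (zero vs. positive)
        generate byte any_absent = absdays_vacationX_2 > 0 if !missing(absdays_vacationX_2)
        logit any_absent policy x1, vce(robust)
        
        * Part 2: model the amount of absence among the positive values only
        glm absdays_vacationX_2 policy x1 if absdays_vacationX_2 > 0, family(gamma) link(log) vce(robust)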

        Lastly, I don't know if it will fit your situation, but sometimes Poisson and negative-binomial models make sense for situations where the variable is positive and highly skewed:

        http://blog.stata.com/2011/08/22/use...tell-a-friend/

        Best,
        Scott


        • #5
          Dear All,

          Adding to Scott's comment, I would say that the obvious starting point for this kind of data is Poisson regression. The fact that there are some observations with half days is not a problem at all because Poisson regression may be used even if the dependent variable is not a count.
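
          A minimal sketch of such a starting point, assuming hypothetical regressors policy and x1 (neither is named in this thread). Robust standard errors are requested so that inference does not rely on the Poisson assumption that the variance equals the mean.
          Code:
          * poisson accepts a noninteger outcome; Stata prints a note about it
          poisson absdays_vacationX_2 policy x1, vce(robust)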

          As an aside, I would say that these data are not continuous; they appear to be measured in half-days and hence have a discrete distribution.

          Best wishes,

          Joao


          • #6
            Hi Joao,

            Thank you for your response! I started with the Poisson model and then tried the negative binomial, and the NB model seems to fit the data better. With NB, is it also OK if the dependent variable is not a count?


            • #7
              Hi Scott,

              Thank you for your suggestions! It seems, then, that a regular gamma regression with a log link does not work because of my excess zeros. I will try the two-part model, though.

              Best,
              Surya


              • #8
                Hi Charlie,

                Thank you for your response! My dataset comprises pooled observations over 22 years. I am investigating the impact of a policy on absenteeism, measured either in days or as rates. My hypothesis is that the policy decreases absent days. For days, I was thinking that a count model followed by marginal effects would give me what I want, or that the IRR option would give me rates.
                I have log-transformed my data, but as you said, it caused problems because of my excess zeros.
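
                A sketch of how both quantities could be obtained after a Poisson fit. It is only an illustration: policy (taken here to be a 0/1 indicator) and x1 are hypothetical variable names.
                Code:
                * Incidence-rate ratios: exp(b) is the multiplicative change in expected absent days
                poisson absdays_vacationX_2 i.policy x1, irr vce(robust)
                
                * Average marginal effect of the policy, in absent days
                margins, dydx(policy)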

                Best,
                Surya



                • #9
                  Dear Surya,

                  I would stick to the Poisson regression because it is much more robust. Also, the statistic you used to compare the fit of both models is unlikely to be valid in this context.

                  Best regards,

                  Joao
