Model for overdispersed continuous dependent variable

Surya Singh

Join Date: Sep 2014
Posts: 54

27 Oct 2016, 04:12

Hi,
My dependent variable, absent days, is continuous and overdispersed, as shown in the tabulation and summary statistics below.

Code:

tab absdays_vacationX_2

   combined |
        for |
  1990-2011 |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        740       40.28       40.28
          1 |          3        0.16       40.45
          2 |         15        0.82       41.26
          3 |         16        0.87       42.13
          4 |         16        0.87       43.00
          5 |        194       10.56       53.57
          6 |          2        0.11       53.67
          7 |          5        0.27       53.95
          8 |          9        0.49       54.44
         10 |        195       10.62       65.05
         14 |          2        0.11       65.16
         15 |        105        5.72       70.88
         16 |          2        0.11       70.99
         17 |          1        0.05       71.04
         20 |         47        2.56       73.60
       21.5 |         14        0.76       74.36
         25 |         27        1.47       75.83
       26.5 |          1        0.05       75.88
         28 |          1        0.05       75.94
         30 |         61        3.32       79.26
       31.5 |          1        0.05       79.31
         32 |          1        0.05       79.37
         35 |         32        1.74       81.11
         40 |         35        1.91       83.02
       41.5 |          1        0.05       83.07
         43 |         18        0.98       84.05
       43.5 |          3        0.16       84.21
         45 |         26        1.42       85.63
         50 |         25        1.36       86.99
       53.5 |          1        0.05       87.04
         55 |         14        0.76       87.81
         60 |         21        1.14       88.95
       63.5 |          2        0.11       89.06
       64.5 |         34        1.85       90.91
         65 |         25        1.36       92.27
         70 |         17        0.93       93.20
         75 |         15        0.82       94.01
         80 |         11        0.60       94.61
         85 |         13        0.71       95.32
         86 |         11        0.60       95.92
       86.5 |          4        0.22       96.14
         90 |          5        0.27       96.41
       90.5 |          1        0.05       96.46
         95 |          3        0.16       96.62
        100 |          7        0.38       97.01
        105 |          3        0.16       97.17
      107.5 |          8        0.44       97.60
      108.5 |          1        0.05       97.66
        110 |          7        0.38       98.04
        115 |          3        0.16       98.20
        125 |          1        0.05       98.26
        129 |          7        0.38       98.64
        130 |         11        0.60       99.24
        145 |          1        0.05       99.29
        150 |          1        0.05       99.35
      150.5 |          2        0.11       99.46
        172 |          1        0.05       99.51
        185 |          1        0.05       99.56
      193.5 |          2        0.11       99.67
        195 |          2        0.11       99.78
        210 |          1        0.05       99.84
        215 |          1        0.05       99.89
        230 |          1        0.05       99.95
        255 |          1        0.05      100.00
------------+-----------------------------------
      Total |      1,837      100.00

. 
. sum absdays_vacationX_2

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
absdays_v~_2 |      1,837    19.17719    31.16322          0        255

[Image: absent days vacation.png]

Would it be appropriate to convert this into a count variable by rounding up the half days and then run a negative binomial regression? Alternatively, I have read that a gamma regression with a log link may also be used for overdispersed continuous dependent variables. Would that be better?

Any other suggestions would be very helpful!

Thank you!

Best,
Surya

Charlie Joyez

Join Date: Dec 2014

Posts: 421
#2

27 Oct 2016, 05:00

Hello Surya,
The model you choose should not only be data-driven but also suited to the question you're investigating, the null hypothesis you are testing, etc.
It should also be suited to your data (and their structure: panel, etc.), but this comes second.
So without knowing what you want to do with these data, we cannot give you much advice on which model to use.

Also, I'm not sure about the term "overdispersed"; I would rather say that the distribution is very concentrated, as a power law would be. But this is very common (at least in my field, economics) and doesn't rule out any model per se (as far as I know). Have you compared your data to some empirical example or theoretical prediction to justify calling them "overdispersed"?
Have you thought about log-transforming your data (again, it depends on your research question)? It would reduce the upward variability (but cause you some issues with the zero values).

Best,
Charlie
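
One rough way to put a number on the dispersion question is to compare the variance with the mean. The sketch below assumes only the variable name from the first post. Note that it measures unconditional dispersion, whereas overdispersion in a count model is defined conditional on the covariates, so this is a descriptive check rather than a test.

```stata
* Unconditional variance-to-mean ratio of the outcome
quietly summarize absdays_vacationX_2
display "variance/mean ratio = " r(Var)/r(mean)
```

With the summary statistics reported in the first post (mean 19.18, SD 31.16), this ratio is roughly 50, far above the value of 1 implied by an unconditional Poisson distribution.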
naveed ahmed

Join Date: Jun 2016

Posts: 40
#3

27 Oct 2016, 06:00

Hi,

Charlie Joyez makes some important points about model selection. It may be useful to read up on count data models, where you will find information on issues related to overdispersion and how to tackle them. For example: Hilbe, Modeling Count Data.
Scott Baldwin

Join Date: Apr 2014

Posts: 15
#4

27 Oct 2016, 08:37

Hi Surya,

You could consider a "two-part" model -- one for the zeros and one for the positive values. If you go the route of using a gamma regression, you'll have to do that because zeros are beyond the support of a gamma distribution. You can do this in Stata using "gsem". We discuss such a model and show Stata syntax in:

Baldwin, S. A., Fellingham, G. W., & Baldwin, A. S. (2016). Statistical models for multilevel skewed physical activity data in health research and behavioral medicine. Health Psychology, 35(6), 552–562. http://doi.org/10.1037/hea0000292

We also cite a number of other sources that discuss similar models (e.g., two-part log-normal models).

Lastly, I don't know if it will fit your situation, but sometimes Poisson and negative-binomial models make sense for situations where the variable is positive and highly skewed:

http://blog.stata.com/2011/08/22/use...tell-a-friend/

Best,
Scott
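
A minimal sketch of the two-part idea, assuming hypothetical covariates `policy` and `x1` (neither appears in the thread). It fits the two parts as separate equations with `logit` and `glm` rather than with the `gsem` syntax from the cited paper, so treat it as an illustration of the structure, not as the authors' implementation.

```stata
* Part 1: probability of any absence (policy and x1 are hypothetical covariates)
generate byte any_abs = absdays_vacationX_2 > 0 if !missing(absdays_vacationX_2)
logit any_abs policy x1, vce(robust)

* Part 2: gamma GLM with log link, fitted to the positive values only
glm absdays_vacationX_2 policy x1 if absdays_vacationX_2 > 0, ///
    family(gamma) link(log) vce(robust)
```

The `if absdays_vacationX_2 > 0` restriction is what keeps the zeros out of the gamma part, since zero lies outside the support of the gamma distribution.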
Joao Santos Silva

Join Date: Apr 2014

Posts: 3015
#5

27 Oct 2016, 14:18

Dear All,

Adding to Scott's comment, I would say that the obvious starting point for this kind of data is Poisson regression. The fact that there are some observations with half days is not a problem at all because Poisson regression may be used even if the dependent variable is not a count.

As an aside, I would say that these data are not continuous; the variable appears to be measured in half-days and hence has a discrete distribution.

Best wishes,

Joao
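
As a sketch of this starting point, Poisson regression can be run directly on the half-day values with robust standard errors. The covariates `policy` and `x1` are hypothetical placeholders; only the dependent variable name comes from the thread.

```stata
* Poisson regression on a non-count outcome; robust standard errors because
* the Poisson variance assumption is not expected to hold for these data
poisson absdays_vacationX_2 policy x1, vce(robust)
```

No rounding of the half days is needed for this to run.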
Surya Singh

Join Date: Sep 2014

Posts: 54
#6

28 Oct 2016, 04:35

Hi Joao,

Thank you for your response! I started with the Poisson and then tried the negative binomial, and the NB model seems to fit the data better. Is it also OK to use NB when the dependent variable is not a count?
Surya Singh

Join Date: Sep 2014

Posts: 54
#7

28 Oct 2016, 04:37

Hi Scott,

Thank you for your suggestions! It seems, then, that a regular gamma regression with a log link does not work because of my excess zeros. I will try out the two-part model, though.

Best,
Surya
Surya Singh

Join Date: Sep 2014

Posts: 54
#8

28 Oct 2016, 04:41

Hi Charlie,

Thank you for your response! My dataset comprises pooled observations over 22 years. I am investigating the impact of a policy on absenteeism, measured either in days or as rates. My hypothesis is that the policy decreases absent days. For days, I was thinking that a count model followed by marginal effects would give me what I want, or that the IRR option would give me rates.
I have log-transformed my data but, as you said, it did cause problems because of my excess zeros.

Best,
Surya
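
The two quantities mentioned here can both be obtained from one Poisson fit. This is a sketch under assumptions: `policy` is taken to be a hypothetical 0/1 indicator and `x1` a hypothetical control, neither of which is named in the thread.

```stata
* Coefficients reported as incidence-rate ratios (exponentiated coefficients)
poisson absdays_vacationX_2 i.policy x1, vce(robust) irr

* Average marginal effect of the policy, in absent days
margins, dydx(policy)
```

Because `policy` enters as a factor variable, `margins` reports the average discrete change in expected absent days when the indicator switches from 0 to 1. The `irr` option reports a multiplicative effect on the expected number of days (a ratio), not a rate in levels.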
Joao Santos Silva

Join Date: Apr 2014

Posts: 3015
#9

28 Oct 2016, 11:25

Dear Surya,

I would stick to the Poisson regression because it is much more robust. Also, the statistic you used to compare the fit of both models is unlikely to be valid in this context.

Best regards,

Joao
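
One standard way to see the robustness claim (this is a textbook result, not something spelled out in the post) is through the first-order conditions of the Poisson estimator, written here with $y_i$ for absent days and $x_i$ for a generic covariate vector:

```latex
\sum_{i=1}^{n} \left[ y_i - \exp(x_i'\hat{\beta}) \right] x_i = 0
```

The data enter only through the deviations of $y_i$ from the assumed conditional mean, so $\hat{\beta}$ is consistent whenever $E[y_i \mid x_i] = \exp(x_i'\beta)$ holds, whether or not $y_i$ is a count or Poisson distributed. Inference then relies on robust standard errors.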