Hello,
I have a question about which type of regression model is appropriate for a zero-inflated dependent variable.
Some info about the data:
- The dependent variable for one of my hypotheses is ‘distvolatility’ (shown in the table below).
- Its distribution is heavily zero-inflated (1304 out of 1459 observations are 0) and positively skewed. These zeroes are real/true values (not censored/truncated).
- There are only 15 distinct non-zero ‘distvolatility’ scores, ranging from .2959995 to 32.373 (no values other than those shown below are possible).
My question is which type of regression model best suits this kind of zero-inflated distribution. I have read widely and seen many different suggestions, but none seems fully appropriate for my data:
- Zero-inflated Poisson/zero-inflated negative binomial - These both assume count data. Would it severely bias the results if I were to use one of these models (most likely the zero-inflated negative binomial, ZINB, as the variance is much higher than the mean), given that my data are discrete but not counts?
- Two-step generalised linear model - Another option is to model whether distvolatility is zero or non-zero with a binary logistic regression, and then fit a GLM to the non-zero values only.
- Tobit regression - I have also seen this mentioned as an option for zero-inflated distributions, although it assumes the zeroes are censored, which is not the case here.
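To make the two-step option concrete, here is a minimal sketch of what I have in mind. The covariates x1 and x2 and the indicator anyvol are placeholder names rather than my real variables, and the gamma family with a log link is only one assumption for the non-zero part, not a settled choice.

```stata
* Two-step (two-part) model sketch.
* x1 and x2 are hypothetical covariates; anyvol is a helper indicator.

* Indicator: 1 if the score is non-zero, 0 if it is zero
generate byte anyvol = distvolatility > 0 if !missing(distvolatility)

* Part 1: probability of a non-zero score (all 1,459 observations)
logit anyvol x1 x2

* Part 2: size of the score among the non-zero observations only
* (family and link are assumptions that would need checking)
glm distvolatility x1 x2 if distvolatility > 0, family(gamma) link(log)
```

The second part would be estimated on only the 155 non-zero observations, so it could not support many covariates. As far as I know, the user-written twopm command from SSC fits both parts in one step, but I have not relied on it here.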
The zero-inflated negative binomial seems to be the best option at the moment, but any advice would be greatly appreciated!
Code:
tab distvolatility

distvolatil |
        ity |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      1,304       89.38       89.38
   .2959995 |         10        0.69       90.06
   .8690004 |         39        2.67       92.73
      4.661 |          7        0.48       93.21
     11.673 |          7        0.48       93.69
     12.542 |         12        0.82       94.52
     14.874 |         17        1.17       95.68
      15.17 |          4        0.27       95.96
     16.334 |          5        0.34       96.30
     17.203 |         11        0.75       97.05
     19.535 |          8        0.55       97.60
     19.831 |          3        0.21       97.81
     31.208 |          9        0.62       98.42
     31.504 |          3        0.21       98.63
     32.077 |         15        1.03       99.66
     32.373 |          5        0.34      100.00
------------+-----------------------------------
      Total |      1,459      100.00
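The statement that the variance is much higher than the mean can be checked directly on the variable tabulated above. This is only a quick descriptive sketch; it uses nothing beyond the official summarize command and its stored results.

```stata
* Overall mean and variance of the dependent variable
summarize distvolatility, detail
display "variance/mean ratio = " r(Var)/r(mean)

* Same summary restricted to the non-zero scores
summarize distvolatility if distvolatility > 0, detail
```

A ratio well above 1 is what pushed me towards the negative binomial rather than the Poisson version of the zero-inflated model.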