Hello Statalisters.
I hope that you might be able to advise on a problem with many zero values that come from observations of zero response, not missing observations. This is a long post, but contains a lot of analysis thus far.
The problem concerns a developmental toxicity study, in which groups of animals were administered graded doses of a test substance. The highest doses were associated with an increase in post-implantation loss (lost embryos), but also severe body weight reduction. My hypothesis is that the former observation is related to the latter.

Post-Implantation loss can be expressed as a count value (i.e. the number of lost implants), but also, and probably more usefully, as a proportion of the total number of implantation sites. In the dataset below, these are designated post_c and post_r respectively, while bwg is bodyweight gain and is the number of implants:
The first thing to note is that post_c, but not post_r, is overly dispersed:
The data are also heavily skewed to the left (this is normal for these types of data):

Because the variance for post_r was similar to the mean, I applied a Poisson regression (with and without bwg as a covariate):
Inclusion of bwg as a covariate improved the model fit, and showed that it is an influence on the post-implantation loss outcome.
But, a statistician who is involved in this project states that Poisson regression can only be used for count (integer) data, despite me pointing out Joseph Coveney's nicely-argued article on Statalist
Since post_c is overly dispersed, a negative binomial regression with ln(is) was applied:
When I ran -nbreg- against post_r I got, unsurprisingly, the same result as with -poisson-. But the result above doesn't make sense, either in terms of the biology or the earlier Poisson regression.
Can anyone explain why the two approaches (one ratio, the other count with offset) differ so much? And which is the better model?
I did consider zero-inflated models, but since the high proportion of zeros come from observations of zero rather than missing observations, then I don't think that they are applicable. This FAQ article seems to agree.
I hope that you might be able to advise on a problem with many zero values that come from observations of zero response, not missing observations. This is a long post, but contains a lot of analysis thus far.
The problem concerns a developmental toxicity study, in which groups of animals were administered graded doses of a test substance. The highest doses were associated with an increase in post-implantation loss (lost embryos), but also severe body weight reduction. My hypothesis is that the former observation is related to the latter.
Post-Implantation loss can be expressed as a count value (i.e. the number of lost implants), but also, and probably more usefully, as a proportion of the total number of implantation sites. In the dataset below, these are designated post_c and post_r respectively, while bwg is bodyweight gain and is the number of implants:
Code:
* Example generated by -dataex-. To install: ssc install dataex clear input byte(group id is) float(bwg post_r post_c) 0 1 7 194 0 0 0 2 8 42 0 0 0 3 4 -63 1 4 0 4 8 171 0 0 0 5 8 99 0 0 0 6 7 126 0 0 0 7 6 119 0 0 0 9 7 235 0 0 0 10 8 147 0 0 0 11 7 230 0 0 0 12 8 234 .125 1 0 13 6 187 0 0 0 14 8 149 0 0 0 15 3 64 0 0 0 16 10 174 0 0 0 17 12 162 .16666667 2 0 18 7 112 0 0 0 19 11 157 .1818182 2 0 20 10 233 .1 1 0 21 11 86 0 0 1 23 5 104 0 0 1 24 7 124 0 0 1 25 7 232 0 0 1 26 10 297 0 0 1 27 8 63 0 0 1 28 8 201 0 0 1 29 6 211 0 0 1 31 9 164 0 0 1 32 12 158 .25 3 1 33 9 123 0 0 1 34 9 264 0 0 1 35 8 216 .5 4 1 36 7 182 0 0 1 37 8 269 0 0 1 38 9 241 0 0 1 39 5 75 .2 1 1 40 8 194 0 0 1 41 13 109 .15384616 2 1 42 8 77 0 0 1 43 11 67 0 0 2 45 7 198 0 0 2 46 2 -101 1 2 2 47 12 136 .16666667 2 2 48 10 6 0 0 2 49 7 157 .2857143 2 2 50 9 100 0 0 2 51 8 126 0 0 2 52 11 320 .1818182 2 2 53 6 111 .16666667 1 2 54 9 47 .7777778 7 2 55 8 51 0 0 2 56 7 231 0 0 2 57 7 154 .14285715 1 2 58 7 112 .5714286 4 2 59 4 79 .25 1 2 60 8 150 0 0 2 62 10 -14 .9 9 2 63 8 107 0 0 2 64 11 222 .0909091 1 2 65 . 208 . . 2 66 9 101 .11111111 1 3 67 6 -8 1 6 3 68 9 87 .3333333 3 3 69 7 -135 1 7 3 70 8 66 .625 5 3 71 9 -139 1 9 3 72 2 -105 1 2 3 73 6 147 .5 3 3 74 . -16 . . 3 76 6 -89 1 6 3 77 10 -15 1 10 3 78 8 -14 1 8 3 79 7 78 .5714286 4 3 80 9 141 .11111111 1 3 81 8 -331 1 8 3 82 8 95 0 0 3 83 5 107 0 0 3 84 8 105 .125 1 3 85 1 -187 1 1 3 86 9 -100 1 9 3 87 7 76 .4285714 3 end
Code:
. summarize post_c post_r, detail post-implantation loss (count) ------------------------------------------------------------- Percentiles Smallest 1% 0 0 5% 0 0 10% 0 0 Obs 79 25% 0 0 Sum of Wgt. 79 50% 0 Mean 1.759494 Largest Std. Dev. 2.690035 75% 2 9 90% 7 9 Variance 7.236287 95% 9 9 Skewness 1.646021 99% 10 10 Kurtosis 4.627665 post-implantation loss (ratio) ------------------------------------------------------------- Percentiles Smallest 1% 0 0 5% 0 0 10% 0 0 Obs 79 25% 0 0 Sum of Wgt. 79 50% 0 Mean .2533763 Largest Std. Dev. .3703084 75% .4285714 1 90% 1 1 Variance .1371283 95% 1 1 Skewness 1.237554 99% 1 1 Kurtosis 2.899826
Because the variance for post_r was similar to the mean, I applied a Poisson regression (with and without bwg as a covariate):
Code:
. poisson post_r i.group note: you are responsible for interpretation of noncount dep. variable Iteration 0: log likelihood = -37.021363 Iteration 1: log likelihood = -36.948318 Iteration 2: log likelihood = -36.948232 Iteration 3: log likelihood = -36.948232 Poisson regression Number of obs = 79 LR chi2(3) = 16.76 Prob > chi2 = 0.0008 Log likelihood = -36.948232 Pseudo R2 = 0.1849 ------------------------------------------------------------------------------ post_r | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- group | 1 | -.3544922 1.241554 -0.29 0.775 -2.787893 2.078908 2 | 1.082488 .9223988 1.17 0.241 -.7253808 2.890356 3 | 2.139165 .8451666 2.53 0.011 .4826688 3.795661 | _cons | -2.542439 .7972026 -3.19 0.001 -4.104928 -.979951 ------------------------------------------------------------------------------ . estat ic Akaike's information criterion and Bayesian information criterion ----------------------------------------------------------------------------- Model | Obs ll(null) ll(model) df AIC BIC -------------+--------------------------------------------------------------- . | 79 -45.32978 -36.94823 4 81.89646 91.37425 ----------------------------------------------------------------------------- Note: N=Obs used in calculating BIC; see [R] BIC note. . poisson post_r i.group c.bwg note: you are responsible for interpretation of noncount dep. variable Iteration 0: log likelihood = -34.704699 Iteration 1: log likelihood = -34.236766 Iteration 2: log likelihood = -34.235014 Iteration 3: log likelihood = -34.235014 Poisson regression Number of obs = 79 LR chi2(4) = 22.19 Prob > chi2 = 0.0002 Log likelihood = -34.235014 Pseudo R2 = 0.2448 ------------------------------------------------------------------------------ post_r | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- group | 1 | -.2393556 1.242709 -0.19 0.847 -2.675021 2.19631 2 | .940722 .9248465 1.02 0.309 -.8719438 2.753388 3 | 1.380574 .920795 1.50 0.134 -.4241513 3.185299 | bwg | -.004207 .0017531 -2.40 0.016 -.0076431 -.000771 _cons | -1.992704 .8232568 -2.42 0.015 -3.606257 -.37915 ------------------------------------------------------------------------------ . estat ic Akaike's information criterion and Bayesian information criterion ----------------------------------------------------------------------------- Model | Obs ll(null) ll(model) df AIC BIC -------------+--------------------------------------------------------------- . | 79 -45.32978 -34.23501 5 78.47003 90.31727 ----------------------------------------------------------------------------- Note: N=Obs used in calculating BIC; see [R] BIC note.
But, a statistician who is involved in this project states that Poisson regression can only be used for count (integer) data, despite me pointing out Joseph Coveney's nicely-argued article on Statalist
All the values, in fact, can even be less than 1!
Code:
. gen log_is=ln(is) (2 missing values generated) . nbreg post_c i.group c.bwg, offset(log_is) Fitting Poisson model: Iteration 0: log likelihood = -124.58556 Iteration 1: log likelihood = -121.30488 Iteration 2: log likelihood = -121.29454 Iteration 3: log likelihood = -121.29454 Fitting constant-only model: Iteration 0: log likelihood = -144.25986 Iteration 1: log likelihood = -138.41096 Iteration 2: log likelihood = -137.84527 Iteration 3: log likelihood = -137.8446 Iteration 4: log likelihood = -137.8446 Fitting full model: Iteration 0: log likelihood = -126.4518 Iteration 1: log likelihood = -122.80367 Iteration 2: log likelihood = -115.63584 Iteration 3: log likelihood = -115.10791 Iteration 4: log likelihood = -115.03336 Iteration 5: log likelihood = -115.03299 Iteration 6: log likelihood = -115.03299 Negative binomial regression Number of obs = 79 LR chi2(4) = 45.62 Dispersion = mean Prob > chi2 = 0.0000 Log likelihood = -115.03299 Pseudo R2 = 0.1655 ------------------------------------------------------------------------------ post_c | Coef. Std. Err. z P>|z| [95% Conf. Interval] -------------+---------------------------------------------------------------- group | 1 | .0519224 .507331 0.10 0.918 -.942428 1.046273 2 | .9845104 .431245 2.28 0.022 .1392857 1.829735 3 | 1.454757 .4485487 3.24 0.001 .5756181 2.333897 | bwg | -.0053099 .0013785 -3.85 0.000 -.0080117 -.0026081 _cons | -2.036627 .3913169 -5.20 0.000 -2.803594 -1.26966 log_is | 1 (offset) -------------+---------------------------------------------------------------- /lnalpha | -.7299079 .5161397 -1.741523 .2817073 -------------+---------------------------------------------------------------- alpha | .4819534 .2487553 .1752532 1.325391 ------------------------------------------------------------------------------ LR test of alpha=0: chibar2(01) = 12.52 Prob >= chibar2 = 0.000 . estat ic Akaike's information criterion and Bayesian information criterion ----------------------------------------------------------------------------- Model | Obs ll(null) ll(model) df AIC BIC -------------+--------------------------------------------------------------- . | 79 -137.8446 -115.033 6 242.066 256.2827 ----------------------------------------------------------------------------- Note: N=Obs used in calculating BIC; see [R] BIC note.
Can anyone explain why the two approaches (one ratio, the other count with offset) differ so much? And which is the better model?
I did consider zero-inflated models, but since the high proportion of zeros come from observations of zero rather than missing observations, then I don't think that they are applicable. This FAQ article seems to agree.