Goodness-of-fit tests and variable selection for a zero-inflated negative binomial model

Kellan Baker

Join Date: May 2017

Posts: 7
#1

12 May 2017, 15:19

Hello - I am analyzing over-dispersed count data using zero-inflated negative binomial (ZINB) regression, and I am having difficulty determining which goodness-of-fit test(s) to use and which predictors to include in the model. Here is an overview of my data:
Dependent variable: "phys_hlth," which is the number of days in the last month when the respondent's self-reported physical health was not good (0-30)

Main predictor: "DOV_LGBT," which is LGBT identity (0=not LGBT-identified, 1=LGBT-identified)

Other possible predictors/controls: age (continuous; "PPAGE" in the commands below), and recent experience of discrimination (binary; "discrim" in the commands below), as well as others such as race (5 categories), gender (binary), insurance status (binary), etc.

The histogram of phys_hlth shows a large spike at days=0 that largely tapers off, except for a much smaller bump at days=30; here's an overview of the phys_hlth variable:

    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
   phys_hlth |      1,727     4.10249     8.11829          0         30
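
The summary above implies a variance of about 8.12^2, or roughly 66, against a mean of about 4.1, so the variance is on the order of 16 times the mean. A minimal sketch of how that ratio and the share of zeros could be computed directly, assuming the dataset from this post is in memory (the local macro name is only illustrative):

```stata
* Assumes the poster's dataset is in memory
quietly summarize phys_hlth
display "mean = " r(mean) "   variance = " r(Var) "   variance/mean = " r(Var)/r(mean)

* Share of exact zeros among non-missing responses
quietly count if !missing(phys_hlth)
local n_nonmiss = r(N)
quietly count if phys_hlth == 0
display "share of zeros = " r(N)/`n_nonmiss'
```

A ratio far above 1 is what a Poisson-type model cannot accommodate, which is the motivation for the negative binomial variants.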

Because the variance is so large relative to the mean, I moved from a ZIP model to ZINB. Putting some of the predictors mentioned above into the ZINB model returns a significant overall chi2, an alpha clearly above zero, and a Vuong test that favors ZINB over the standard negative binomial, as well as a significant coefficient for my main predictor (DOV_LGBT):

. zinb phys_hlth DOV_LGBT discrim PPAGE, inflate (PPAGE DOV_LGBT) vuong zip

Zero-inflated negative binomial regression      Number of obs     =      1,554
                                                Nonzero obs       =        634
                                                Zero obs          =        920

Inflation model = logit                         LR chi2(3)        =      12.45
Log likelihood  = -3113.399                     Prob > chi2       =     0.0060

------------------------------------------------------------------------------
   phys_hlth |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
phys_hlth    |
    DOV_LGBT |   .2159001    .096089     2.25   0.025     .0275691     .404231
     discrim |   .2479829   .1694438     1.46   0.143    -.0841209    .5800867
       PPAGE |   .0087431   .0030843     2.83   0.005     .0026981    .0147882
       _cons |   1.592725   .1875341     8.49   0.000     1.225165    1.960286
-------------+----------------------------------------------------------------
inflate      |
       PPAGE |   .0126378   .0039695     3.18   0.001     .0048578    .0204179
    DOV_LGBT |  -.4053969   .1249087    -3.25   0.001    -.6502134   -.1605805
       _cons |  -.4047859   .2509455    -1.61   0.107    -.8966301    .0870582
-------------+----------------------------------------------------------------
    /lnalpha |   .2658189    .104051     2.55   0.011     .0618826    .4697553
-------------+----------------------------------------------------------------
       alpha |   1.304499   .1357345                      1.063837    1.599603
------------------------------------------------------------------------------
Likelihood-ratio test of alpha=0: chibar2(01) = 3673.90   Pr>=chibar2 = 0.0000
Vuong test of zinb vs. standard negative binomial: z =     6.47  Pr>z = 0.0000

What I am struggling with is the following:

1) Are there other goodness-of-fit tests that I should be running to ensure that the ZINB model is a good fit? I have used the margins command to estimate the predicted means after doing a robust ZINB regression, and these estimates are close to the actual means (even without the analytical weight), but I want to make sure I'm not stumbling blindly into the ZINB model because I can't think of any other approaches:

. margins DOV_LGBT

Predictive margins                              Number of obs     =      1,554
Model VCE    : Robust

Expression   : Predicted number of events, predict()

------------------------------------------------------------------------------
             |            Delta-method
             |     Margin   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    DOV_LGBT |
          0  |   3.432679   .2549347    13.46   0.000     2.933016    3.932342
       LGBT  |   5.237526     .35971    14.56   0.000     4.532507    5.942545
------------------------------------------------------------------------------

Actual means:

. mean phys_hlth [aw=weight_1], over(DOV_LGBT)

Mean estimation                   Number of obs   =      1,727

    _subpop_1: DOV_LGBT = 0
         LGBT: DOV_LGBT = LGBT

--------------------------------------------------------------
        Over |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
phys_hlth    |
   _subpop_1 |   3.555177   .2547231      3.055579    4.054776
        LGBT |   4.984303   .3125431        4.3713    5.597306
--------------------------------------------------------------
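
Beyond comparing means, one descriptive check is to compare the observed share of respondents at particular counts with the average probability the fitted model assigns to those counts. The sketch below assumes the same dataset and model as above; the counts in the loop are illustrative choices, with 0 and 30 included because the histogram described earlier has a spike at 0 and a smaller bump at 30. It is a rough diagnostic, not a formal test.

```stata
* Refit the model from this post (test options omitted for brevity)
quietly zinb phys_hlth DOV_LGBT discrim PPAGE, inflate(PPAGE DOV_LGBT)

* Observed share vs. average predicted probability at selected counts
foreach k in 0 1 2 5 10 30 {
    tempvar p
    quietly predict double `p' if e(sample), pr(`k')
    quietly summarize `p'
    local pred = r(mean)
    quietly count if phys_hlth == `k' & e(sample)
    display "days = `k'" _col(15) "observed = " %6.4f r(N)/e(N) ///
        _col(40) "predicted = " %6.4f `pred'
}
```

Large gaps between the two columns would point to parts of the distribution that the model does not reproduce well. Because the outcome is capped at 30 while the negative binomial is unbounded, the bump at 30 is a natural place to look.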

2) Is there a method for selecting which variables to inflate in the ZINB model?

3) Relatedly, what is the best method to use with ZINB regression to select which variables to include in the model at all? E.g., I have taken gender out because it was consistently showing up as nonsignificant whether or not I inflated it, but is testing each of my ~15 possible independent variables one by one like that my only option? Can I use something like <gvselect> or forwards/backwards selection with a ZINB model, and if so, how?
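
For question 2, one possible way to compare candidate inflation equations is by information criteria rather than coefficient-by-coefficient significance. The sketch below is only an illustration: the second specification, which adds discrim to the inflation equation, and the stored-estimate names are assumptions made for the example, not a recommendation.

```stata
* Candidate 1: inflation equation as in the model above
quietly zinb phys_hlth DOV_LGBT discrim PPAGE, inflate(PPAGE DOV_LGBT)
estimates store infl_base

* Candidate 2 (illustrative): discrim also enters the inflation equation
quietly zinb phys_hlth DOV_LGBT discrim PPAGE, inflate(PPAGE DOV_LGBT discrim)
estimates store infl_discrim

* AIC and BIC side by side; both models must use the same observations
estimates stats infl_base infl_discrim

* The candidates are nested, so a likelihood-ratio test is also possible.
* lrtest is not available after fitting with vce(robust).
lrtest infl_base infl_discrim
```

Lower AIC/BIC favors a specification, but the comparison is only meaningful when the estimation samples are identical, which can fail if an added variable has missing values.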

Thank you!
Clyde Schechter

Join Date: Apr 2014

Posts: 30147
#2

12 May 2017, 16:04

There is a strong preponderance of opinion among the most frequent responders of this forum that automatic variable selection methods such as forward/backward selection are simply not valid and should not be used.

As for whether there is a role for statistical significance of the variable in the model in deciding whether to include it, there is more diversity of opinion. My view is that this is almost always inappropriate. (Perhaps one of those who favor this approach will also respond.) In my view, there are three reasons for including a predictor in a model. The first, and best, is that the underlying science says it should be there. The other two reasons are if leaving it out leads to omitted variable bias or leaves a great deal of removable noise in the model. Omitted variable bias will occur when, and only when, the predictor in question is differently distributed in the groups defined by your main predictor variable (which I take to be DOV_LGBT here), and is also appreciably associated with the outcome. Removable noise can be found when the predictor in question is appreciably associated with the outcome, but is distributed similarly in the groups defined by the main predictor variable. Note that I have not used the word "significant" at all here. Statistical significance tests have no bearing on these issues because omitted variable bias and removable noise are sample-level issues that cannot be assessed through inferences about a population. So the judgment of what is "appreciably associated" and "differently distributed" is based on descriptive statistics and requires scientific judgment.
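
The two descriptive questions in the paragraph above (is the candidate distributed differently across the DOV_LGBT groups, and is it appreciably associated with the outcome) could be examined along these lines. This is a sketch using the variable names from post #1; the choice of summary statistics is an assumption, and the output is meant to be read descriptively rather than through its p-values.

```stata
* Is each candidate distributed differently across the main predictor's groups?
tabulate discrim DOV_LGBT, column
tabstat PPAGE, by(DOV_LGBT) statistics(n mean sd p25 p50 p75)

* Is each candidate appreciably associated with the outcome?
tabstat phys_hlth, by(discrim) statistics(n mean sd p50)
spearman phys_hlth PPAGE
```

A candidate that differs across groups and tracks the outcome is a potential source of omitted variable bias. One that tracks the outcome but is balanced across groups is a candidate for removing noise.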

Another approach I use to decide on removing variables from models is to just look at how much the variable can change the predicted outcome. This is particularly easy with dichotomous variables like sex. What is the coefficient of sex? When sex changes from 0 to 1 (or whatever the coding you used is), how much does that change the predicted outcome? If it's a negligible amount, then there is no need to have sex in the model. It can similarly be applied to continuous predictors: if you change the predictor from the lowest plausible value to the highest one and the corresponding change in the predicted outcome is negligible, then that predictor isn't doing any real work and can be omitted. (Of course, if you have a model with interaction terms, it is more complicated than that and all bets are off.)
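
A minimal sketch of that check using margins after the model from post #1. The ages 25 and 75 are placeholders for "lowest plausible" and "highest plausible" values, not figures taken from the data.

```stata
quietly zinb phys_hlth DOV_LGBT discrim PPAGE, inflate(PPAGE DOV_LGBT)

* Predicted number of days with a dichotomous predictor set to 0, then to 1,
* averaged over the observed values of the other covariates
margins, at(discrim=(0 1))

* Predicted number of days at a low and a high plausible age (placeholder values)
margins, at(PPAGE=(25 75))
```

The difference between the two margins in each table is the change in predicted days attributable to the predictor. Whether that difference is negligible is a substantive judgment, not a statistical one.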
Kellan Baker

Join Date: May 2017

Posts: 7
#3

12 May 2017, 19:56

Thank you for the feedback!

Unfortunately, my biostats professor is not, um, very current (to put it nicely) with actual useful methods, so I am going to be forced to use <gvselect> and/or forward/backward selection as at least one step in this process. That nonsense aside, I am curious to know whether there is a systematic way of approaching this question for my own knowledge/future use purposes. From what you said, it seems I should regress phys_hlth on each predictor separately using ZINB and also use chi2 or another appropriate test to assess the distribution of each variable across the two groups of DOV_LGBT; any predictors that address either OVB or removable noise should be included. Is that correct?

With regard to my other two questions, any thoughts on how to select which variables to inflate, and on whether there are additional goodness-of-fit tests I should run on the ZINB model once I have built it, would be much appreciated.