  • Which count model to use?

    I have a dependent variable that ranges from 0-11: the number of Covid-19 precautions the respondent took (mean: 7.4, SD: 1.5). I'm not sure which estimator to use. The obvious model (Poisson) has higher AIC and BIC values than many of the other options.

    If I rely on AIC and BIC alone, the models that are most preferred are OLS regression and
    Code:
    glm ... family(bin 11) link(logit)
    I think the generalized linear model with binomial distribution makes more sense than OLS regression, but I'm hesitant because these aren't really "independent trials" as the binomial distribution would assume. That is, a person who took "at least 10 precautions" is more likely to take the eleventh precaution.

    Any thoughts? By the way, my sample size is small (N=114) in case that matters.

    Thanks,
    Max

    _______

    P.S. Here are the AIC & BIC values:

    Poisson: AIC = 485, BIC = 510
    Ordered logit: AIC = 412, BIC = 450
    OLS: AIC = 412, BIC = 436
    GLM, binomial distribution: AIC = 413, BIC = 438
    Truncreg: AIC = 411, BIC = 439
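
    As a point of reference, a minimal sketch of how comparisons like these can be run, assuming the outcome is named precautions and the covariates are x1 and x2 (hypothetical names, not from the thread):
    Code:
    * hypothetical variable names: precautions (0-11 count), x1 x2 (covariates)
    quietly poisson precautions x1 x2
    estimates store pois
    quietly glm precautions x1 x2, family(binomial 11) link(logit)
    estimates store binom
    quietly regress precautions x1 x2
    estimates store ols
    estimates stats pois binom ols   // AIC and BIC for each stored model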



  • #2
    Have you tried nbreg? Also, if the upper bound is an issue, the user-written rcpoisson (right-censored Poisson regression) might be worth looking at. I could also see some sort of zero-inflated model being used. See

    https://statisticalhorizons.com/zero-inflated-models
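
    For concreteness, a rough sketch of two of those alternatives (variable names hypothetical; rcpoisson is user-written with its own syntax, so it is not shown here):
    Code:
    nbreg precautions x1 x2                  // negative binomial
    zip precautions x1 x2, inflate(x1 x2)    // zero-inflated Poisson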
    -------------------------------------------
    Richard Williams, Notre Dame Dept of Sociology
    StataNow Version: 19.5 MP (2 processor)

    EMAIL: [email protected]
    WWW: https://www3.nd.edu/~rwilliam

    • #3
      I didn't think nbreg was appropriate because there's no evidence of overdispersion (the variance, about 2.25, is much smaller than the mean of 7.4). I also don't see much justification for zero inflation because there's no systematic reason why someone would score 0 on the number of Covid precautions.

      • #4
        Max: Your instincts on using a linear model estimated by OLS and binomial regression are sound. The binomial regression is completely robust provided you have the mean correctly specified. In particular, the individual choices do not have to be independent Bernoulli trials. I discuss this in Chapter 18 of my 2010 MIT Press book. However, you should use the vce(robust) option for standard errors. I wouldn't use nbreg because of the upper bound on y in your case. An exponential functional form isn't ideal when there is a firm upper bound.
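
        A minimal sketch of that recommendation, using hypothetical covariate names:
        Code:
        * binomial GLM with the known maximum of 11 and robust standard errors
        glm precautions x1 x2, family(binomial 11) link(logit) vce(robust)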

        • #5
          Thank you, Dr. Wooldridge—I really appreciate your help! But just to clarify, how can I ensure the mean is "correctly specified"?

          • #6
            If all of your explanatory variables are categorical and you have included terms for all of their interactions (a saturated model), then your model will correctly specify the mean response.
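
            For illustration, a fully interacted specification of that kind might look like the following, with hypothetical categorical covariates:
            Code:
            glm precautions i.agegrp##i.female, family(binomial 11) link(logit) vce(robust)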

            By the way, does your questionnaire list exactly eleven possible precautions that the respondent ticks yes or no to, one by one? That is, can each of a universe of eleven precautions be individually identified? Or is it the case that there just happened to be an observed maximum of eleven precautions that respondents listed, say, in free-form written format? If the former, you might want to consider fitting a generalized linear mixed model, e.g., using -melogit- instead.

            • #7
              Max: Generally, you can't ever be sure the mean is correct. Joseph listed one case, but it's a special case. The idea is to choose a mean function that coheres with your outcome variable. In your case, y is logically an integer between zero and 11, and binomial regression respects that: the mean function is E(y|x) = 11*exp(xb)/[1 + exp(xb)], that is, 11 times the logistic (inverse logit) function of the index. You can put flexible functions of x into the index, such as squares and interactions. But focus on the average partial effects.
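
              A sketch of that approach with flexible terms and average partial effects, again with hypothetical covariate names:
              Code:
              * squares and interactions in the index; age and female are hypothetical covariates
              glm precautions c.age##c.age i.female i.female#c.age, family(binomial 11) link(logit) vce(robust)
              margins, dydx(*)   // average partial effects on the expected number of precautions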

              If we want to claim we're estimating something useful we usually assume something about our model is correctly specified. Binomial regression is one of those nice methods where we only have to assume the minimum: that our conditional mean function is correct. It never is, of course, but that doesn't mean it isn't useful.

              • #8
                Thank you, that's very helpful. And to answer Joseph's question: there were exactly 11 options provided, so the range really was constrained in advance (0-11).

                • #9
                  Originally posted by Max Coleman View Post
                  there were exactly 11 options provided, so the range really was constrained in advance (0-11).
                  Then you're throwing away information when you model the outcome with
                  Code:
                  glm ... family(bin 11) link(logit)
                  that you show above in #1.

                  You might want to consider something like what I mentioned above in #6, or even something along the following lines.
                  Code:
                  // wide item indicators assumed to be named rsp1-rsp11, one per precaution
                  reshape long rsp, i(pid) j(itm)
                  
                  xtgee rsp i.grp##c.bas##i.itm, i(pid) t(itm) family(binomial) link(probit) corr(unstructured)
                  
                  // or, if you insist
                  xtgee rsp i.grp##c.bas##i.itm, i(pid) family(binomial) link(logit) corr(independent) vce(robust)
                  pid = participant ID, rsp = response on an item (Y/N), itm = questionnaire item ID (1–11), grp = some grouping variable, bas = some continuous covariate

                  The advantage of the first approach is that you can accommodate how the questionnaire items individually associate. (And that you can glean something about just how they associate by inspection of the working correlation matrix.) The model might be more difficult to fit (converge) with 114 respondents' data.
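
                  For instance, after the first xtgee fit the estimated working correlation can be displayed with:
                  Code:
                  estat wcorrelation   // fitted working correlation among the 11 items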

                  • #10
                    Joseph: I agree with you, but sometimes one intentionally throws out information. Maybe Max doesn't care about the responses to the individual questions, because then there are 11 outcomes to explain rather than one. Depending on the precautions and what the key covariates are, studying the individual responses could be interesting. That can be done one probit at a time or using GEE for potential efficiency gains. Maybe only the sum is available, but only Max knows that.

                    • #11
                      As a sidelight, I feel fine about using count models when the DV is something like # of articles or doctor visits. In such cases you are doing the same thing over and over.

                      But when it is something like # of Covid precautions, I wonder about it. It isn't like you are doing something up to 11 times; you are doing up to 11 different things one time. Further, some of those things may have been super easy to do while others may have been far harder. Two different people might have done 5 things without a single shared activity between them. And, of course, they might have done 5 other things that they weren't given as options.

                      Do my concerns have any merit? Is there any unambiguous definition as to when things should be considered countable and when they shouldn't?
                      -------------------------------------------
                      Richard Williams, Notre Dame Dept of Sociology
                      StataNow Version: 19.5 MP (2 processor)

                      EMAIL: [email protected]
                      WWW: https://www3.nd.edu/~rwilliam

                      • #12
                        I think Richard Williams raises some excellent questions in #11. And to complicate things even further, the same COVID precaution may be relatively easy for some people and very difficult for others. E.g., for those of us who are able to work from home (WFH), maintaining physical distance from folks outside our homes is relatively easy. For others who use public transit to go to work in a warehouse, grocery store, meat processing plant, etc., it is far more difficult.
                        --
                        Bruce Weaver
                        Email: [email protected]
                        Version: Stata/MP 18.5 (Windows)

                        • #13
                          Originally posted by Bruce Weaver View Post
                          I think Richard Williams raises some excellent questions in #11. And to complicate things even further, the same COVID precaution may be relatively easy for some people and very difficult for others. E.g., for those of us who are able to work from home (WFH), maintaining physical distance from folks outside our homes is relatively easy. For others who use public transit to go to work in a warehouse, grocery store, meat processing plant, etc., it is far more difficult.
                          Yes, I wonder if, instead of having one DV, you should have 11, because the determinants of each precaution may differ.

                          Or, maybe reshape the data long, so you have 11 records for each person, one for each precaution. Then you could use something like clogit or melogit. Maybe toss in some interactions so effects could differ depending on the type of precaution. Or include variables that reflect difficulty of the precaution, e.g. for Bruce working from home is easy, for an "essential worker" it is hard. But wearing a mask may be about as easy or difficult for everyone.
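
                          A rough sketch along those lines, reusing the long-format names from #9 (pid, rsp, itm) plus a hypothetical person-level covariate x1:
                          Code:
                          * mixed-effects logit: random intercept per respondent, effects varying by item
                          melogit rsp i.itm##c.x1 || pid:
                          * or a conditional (fixed-effects) logit by respondent;
                          * person-level main effects such as x1 drop out here
                          clogit rsp i.itm i.itm#c.x1, group(pid)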

                          I suppose you could even consider constructing a scale or doing a factor analysis of the 11 items. You could see if all 11 items legitimately make a single scale, or whether two or more scales should be constructed. If you just add them up you are basically assuming that a single scale is fine. But in other instances, just adding up 11 variables would be considered questionable if you hadn't first determined they work fine as a scale. This handout briefly discusses scale construction and how you can test whether or not all items belong in the same scale: https://www3.nd.edu/~rwilliam/stats2/l23.pdf
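
                          A quick sketch of that check, assuming the wide item indicators are named rsp1-rsp11 (hypothetical names; tetrachoric correlations would be more defensible for binary items):
                          Code:
                          alpha rsp1-rsp11, item    // internal consistency if all 11 items form one scale
                          factor rsp1-rsp11, pcf    // rough look at whether more than one dimension is present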

                          I don't have any great statistical theory, but my intuition says some things are enough alike that they are countable (# of doctor visits) while others may not be.
                          -------------------------------------------
                          Richard Williams, Notre Dame Dept of Sociology
                          StataNow Version: 19.5 MP (2 processor)

                          EMAIL: [email protected]
                          WWW: https://www3.nd.edu/~rwilliam

                          • #14
                            I'm sympathetic to studying the outcomes individually, but I don't see how clogit or melogit does the job. Those are really intended for the case where you have the chance to make the same choice repeatedly. On this grocery trip, did you use cash or credit? Neither is meant for "Out of these 10 items, which did you buy?" If you want to use the disaggregated responses, it's like a seemingly unrelated regression for binary responses. But that's way too hard to do in a maximum likelihood framework. As Joseph suggested, GEE can be used. But so can modeling each one separately.
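
                            For example, the one-at-a-time approach could be sketched as a loop over the wide item indicators (names hypothetical):
                            Code:
                            foreach v of varlist rsp1-rsp11 {
                                probit `v' x1 x2, vce(robust)
                            }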

                            In any case, in using the total counts one can see which variables affect the count. One can then study individual components. They aren't mutually exclusive.

                            • #15
                              When in doubt, I think it is often good to try things multiple ways and hope that each way leads to more or less the same conclusions! Or that one method yields insights that the other misses.

                              With Covid precautions, it seems to me you do have the chance to make the same choice repeatedly -- either use the precaution or don't use it. And if not, why is it legit to then count up the choices?

                              I also like the idea of using alternative-specific measures -- is this choice easy or hard? Or, maybe have interactions between gender and other variables, e.g. does difficulty of task affect men and women differently?

                              In the grocery store example, the cost of the groceries would be important, e.g. I may pay $20 cash for something but I would almost always use a credit card for a $200 purchase. With count models that gets glossed over.

                              Knowing what these 11 precautions are might also help! I suspect some tend to go together (if I mask, I also social distance), while others are deemed unnecessary if you are already taking some other precaution. For example, some people wear masks in stores, but others just get everything delivered.

                              In particular, if one of the precautions is getting vaccinated, use of that option may cause the use of other options to plummet.

                              Anyway, I am not really sure how to handle this, but my instincts say that other things besides count models should at least be considered.
                              -------------------------------------------
                              Richard Williams, Notre Dame Dept of Sociology
                              StataNow Version: 19.5 MP (2 processor)

                              EMAIL: [email protected]
                              WWW: https://www3.nd.edu/~rwilliam
