
  • when is it more appropriate to model count data as binary data (i.e., 0 if count == 0 and 1 if count > 0)?

    Does anyone know of a test or criterion for deciding whether a count model or a binary model is more appropriate under the following circumstances? I haven't been able to find anything by Googling this question, and I have looked at numerous books on logistic regression and count data regression without finding anything that answers it.

    A colleague suggested that the following count data (numbers of traffic fatalities per county per day) have such a thin tail that they cannot "support" a count model such as negative binomial, and that it would be better to model them as binary data (as described in the headline). I posted a similar question to CrossValidated https://stats.stackexchange.com/ques...t588474_309702 but have not yet received any actionable answers (unless I misunderstood them).

    I could understand the suggestion if we had the same number of zeros and ones as below but only 20 cases with counts of 2 or more; in that case a binary model would perhaps seem more appropriate.
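
    To put a number on "thin tail", here is a minimal sketch that assumes the daily counts are roughly Poisson (an assumption made only for illustration; the actual data may be overdispersed). The mean values and the helper names `poisson_pmf` and `tail_summary` are hypothetical. It reports what share of the positive county-days would have 2 or more fatalities, which is the information a binary recoding discards.

```python
import math

def poisson_pmf(k, mu):
    """Poisson probability of exactly k events when the mean is mu."""
    return math.exp(-mu) * mu**k / math.factorial(k)

def tail_summary(mu):
    """Return P(Y=0), P(Y=1), P(Y>=2) and the share of positive days with 2+ events."""
    p0 = poisson_pmf(0, mu)
    p1 = poisson_pmf(1, mu)
    p2_plus = 1.0 - p0 - p1
    share_multi = p2_plus / (1.0 - p0)
    return p0, p1, p2_plus, share_multi

# Hypothetical mean fatalities per county-day (not taken from the posted data).
for mu in (0.01, 0.05, 0.2):
    p0, p1, p2_plus, share = tail_summary(mu)
    print(f"mu={mu}: P(0)={p0:.4f}  P(1)={p1:.4f}  P(2+)={p2_plus:.5f}  "
          f"share of positive days with 2+ = {share:.3f}")
```

    For small means the share of positive days with 2 or more events is close to mu/2, so at very low rates the binary indicator carries almost all of the information in the count.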

    Anyway thanks in advance for any advice!

    David

    PS I apologize for any errors in posting - I now finally grasp what the sandbox is for!
    [Attached image: fatalities.png]

    Last edited by David Beede; 27 Oct 2017, 09:36.

  • #2
    I don't know a guide here that isn't almost circular: model count problems as presence-absence problems whenever (a) presence-absence is the real issue and/or (b) precise counts are unattainable, irregular or highly error-prone.

    A slightly contrived example is the number of books (or computers) in a household. Households without books (or computers) are distinct. Counting books or computers could be tricky otherwise.

    Even with computers, people often have to think of whatever computers are used by partners, children, pets or robots and what is in storage somewhere and whether this or that machine was thrown out. And there are definition problems.

    Books are naturally worse.

    If the number is tens or so, someone can count quickly. If the number is thousands, forget it (usually).

    I bought a new watch recently. The sales person was very good and knowledgeable and mentioned that he had two of the brand I was buying. Cheekily I asked him how many watches he owned and he said "About 35". If you had that many watches, you might be uncertain.

    Another true story: The statistician John Tukey was testifying as an expert witness and the lawyer calling him was establishing his credentials to the court. How many honorary degrees did he have? "About 5". Note not just Tukey's caution in avoiding incorrect statements under oath but also the statistical thinking that even small counts are subject to error.



    • #3
      Thank you very much for your thoughtful comment, Nick - this is very helpful. Really frames the question nicely, and I like the anecdotes!
      I think now I'm inclined to estimate both types of models.



      • #4
        There are models that do both, e.g. hurdle models. They make sense if you think the process that moves a unit from zero to one is fundamentally different from the process that leads to later transitions. For a very brief discussion see pp. 20-22 of

        https://www3.nd.edu/~rwilliam/stats3/CountModels.pdf
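
        As a sketch of what "doing both" means, a hurdle model can be written as below. This assumes a logit for the zero/positive part and a zero-truncated Poisson for the positive counts (a negative binomial could be used instead); the symbols pi, mu, x, z, gamma and beta are notation introduced here, not taken from the linked handout.

```latex
\begin{aligned}
\Pr(Y_i = 0) &= 1 - \pi_i, \qquad \operatorname{logit}(\pi_i) = x_i'\gamma,\\
\Pr(Y_i = y) &= \pi_i \,\frac{e^{-\mu_i}\mu_i^{y}/y!}{1 - e^{-\mu_i}},
\qquad y = 1, 2, \dots, \qquad \log \mu_i = z_i'\beta .
\end{aligned}
```

        The log-likelihood splits into a binary part in gamma and a truncated-count part in beta, so the two parts can be fitted separately, and x and z may contain the same variables.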
        -------------------------------------------
        Richard Williams, Notre Dame Dept of Sociology
        Stata Version: 17.0 MP (2 processor)

        WWW: https://www3.nd.edu/~rwilliam



        • #5
          Thank you Richard - in my experience (i.e., with the data I have), I never seem to have a good theory for why certain explanatory variables go into one stage (logit) versus the other (count).



          • #6
            This may be both less entertaining and less useful than other replies here, but I'll note that since you are looking at "fatalities per county per day" there may be value in aggregating your time periods. If you are going to use a count model anyway, I don't think any information is lost in looking at 7-day intervals (say) if you have no daily risk factors.

            I have run into this looking at central line infections in ICUs, for which we have daily data, but the events are so rare that on a daily basis we have 99% zeroes. However, they are very serious events, so we are averse to collapsing all the positive counts into a single category, as your binary alternative would. Instead we look at monthly counts and use count models with an offset for the number of exposure days. This 'thickens' the tail, as it were.
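
            A minimal sketch of the aggregation idea, assuming an intercept-only Poisson model with log(exposure) as the offset and no covariates that vary within a block. The simulated data, the 30-day block width, and the helper names `poisson_loglik` and `aggregate` are hypothetical. It checks that the estimated rate, and the log-likelihood difference between two candidate rates, are the same at daily and monthly resolution.

```python
import math
import random

def poisson_loglik(counts, exposures, log_rate):
    """Poisson log-likelihood with log(exposure) as offset: mean = exposure * exp(log_rate)."""
    total = 0.0
    for y, t in zip(counts, exposures):
        mu = t * math.exp(log_rate)
        total += y * math.log(mu) - mu - math.lgamma(y + 1)
    return total

def aggregate(counts, exposures, width):
    """Sum counts and exposure over consecutive blocks of `width` observations."""
    agg_y = [sum(counts[i:i + width]) for i in range(0, len(counts), width)]
    agg_t = [sum(exposures[i:i + width]) for i in range(0, len(exposures), width)]
    return agg_y, agg_t

random.seed(1)
# Hypothetical daily data: roughly 1% of days have an event, one exposure day each.
daily_y = [1 if random.random() < 0.01 else 0 for _ in range(3600)]
daily_t = [1.0] * len(daily_y)
monthly_y, monthly_t = aggregate(daily_y, daily_t, 30)

# Intercept-only MLE of the rate is total events / total exposure at either resolution.
rate_daily = sum(daily_y) / sum(daily_t)
rate_monthly = sum(monthly_y) / sum(monthly_t)

# Compare two candidate rates: the log-likelihood difference matches across resolutions.
a, b = math.log(0.005), math.log(0.02)
diff_daily = poisson_loglik(daily_y, daily_t, a) - poisson_loglik(daily_y, daily_t, b)
diff_monthly = poisson_loglik(monthly_y, monthly_t, a) - poisson_loglik(monthly_y, monthly_t, b)
print(rate_daily, rate_monthly, diff_daily, diff_monthly)
```

            The equivalence holds only when nothing in the model varies within a block; with day-level covariates the aggregated model is no longer the same model.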

            hth,
            Jeph



            • #7
              Thank you very much, Jeph. I am looking at daily weather conditions so I don't think I can implement your suggested strategy without losing too much information.

