  • how to account for population (P) in Poisson regression on municipal level data: Y/P or offset(P)?

    I want to estimate the impact of deforestation (X) on the number of disease cases (Y), a count variable. The data are a panel aggregated at the municipal level, and municipalities are very heterogeneous in population (P) and area. I'm wondering how to correctly account for the differences in exposed population (which is the whole population, with no age or other restrictions) in each municipality.
    More specifically, when estimating the (fixed effects) Poisson or Negative Binomial, should I:

    a) compute the ratio Y/P and use that as the dependent variable
    b) include P as a regressor with coefficient locked at 1 (implemented with the "offset(P)" option).
    c) include P as a regressor without restricting its coefficient.

    I've found other threads that tangentially indicate that "b)" is the correct solution. Is there a good textbook section or paper explaining this that I could cite as an authoritative source? Also, does this affect the interpretation of the coefficients?

    Many of the articles we reviewed just use "a)", the number of cases per thousand inhabitants as the dependent variable. Does this bias the estimates or s.d. in any particular way?

    regards
    Lucas

  • #2
    Actually, it's none of the above. You want to use P in the -exposure()- option. (Alternatively, you can generate a new variable equal to the natural logarithm of P and use that in the -offset()- option.) The Stata [R] manual's section on the -poisson- command actually has a very good explanation of how all this works, including some references.

    With regard to a), that is something more likely to be used in a linear probability model than in a Poisson model. In fact, since the ratio Y/P is bounded between 0 and 1, its use as a dependent variable in a Poisson model would be strange.
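
    A minimal sketch of the two equivalent specifications, on simulated data. The variable names (Y, X, P, lnP) and the data-generating values are assumptions for illustration, not the original poster's data:

    Code:
    clear
    set seed 12345
    set obs 500
    * hypothetical municipalities with very different populations
    gen P = ceil(1000 + 99000*runiform())
    gen X = rnormal()
    * counts whose mean is proportional to population
    gen Y = rpoisson(P*exp(-6 + 0.3*X))
    * exposure() takes P itself; ln(P) enters with its coefficient fixed at 1
    poisson Y X, exposure(P)
    * offset() expects the log already taken, so it needs ln(P), not P
    gen lnP = ln(P)
    poisson Y X, offset(lnP)

    Both commands should report the same coefficients, since they fit the same model.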


    • #3
      Clyde, indeed, the explanation in the -poisson- entry of the Stata manual, on aggregating from small homogeneous groups to larger heterogeneous units, is very intuitive. That solves my problem. Thank you. I was learning about count data models from Cameron and Trivedi (2013); now I see that they only mention the exposure problem briefly, in an "exposure time per person" example in Chapter 13. I'll take a look at the other references listed in the manual.

      I'm still curious about the bias in a). In the articles, people use (Y/P)*1000, cases per thousand people, as the dependent variable (DV). Of course that is just a scaling issue, but the fact that the dependent variable ranges from 0 to ~1000 "hides" the 'strangeness' of it. Poisson regression can deal with the non-integer values. Testing on my data, I found almost identical results using "(Y/P)*1000" or "round((Y/P)*1000)" as the DV, and a 20% difference between the xtpoisson coefficients using DV=((Y/P)*1000) and the correct ones using DV=Y with exposure(P).

      regards
      Lucas


      • #4
        use of ratio variables requires great care; this has been discussed many times on Statalist (particularly the version run out of Harvard); at the very least you should read the following two articles:

        Kronmal, RA (1993), "Spurious correlation and the fallacy of the ratio standard revisited", _Journal of the Royal Statistical Society, Series A_, 156(3): 379-392

        Rosenbaum, PR and Rubin, DB (1984), "Difficulties with regression analyses of age-adjusted rates," _Biometrics_, 40(2): 437-443


        • #5
          Let's go back to the basics of the generalized linear model. The distribution here is Poisson throughout, so I'll omit it because it adds nothing to the present discussion.

          Code:
          poisson DV X, exposure(P)
          
          corresponds to the model:
          
          log(E(DV | X, P)) = Xb + log P
          
          whereas
          
          gen ratio = DV/P
          poisson ratio X
          
          corresponds to the model:
          
          log(E(DV/P | X)) = Xb
          
          which is rather different.  First, you are no longer conditioning on P, and second, even if you ignore that subtlety, you cannot transform this model algebraically into the first model unless P is constant.
          So I would not consider either model as a "biased" version of the other. They are different models that lead to estimation of different things. If your goal is to get unbiased estimates of the counts, conditional on the exposure and covariates, you go with the first model. If what you need is an unbiased estimate of the count:exposure ratio conditional on the covariates, you go with the second. You began your original post by saying you were looking to estimate the impact on a count variable, so the first option seemed to be what you were looking for.
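
          To compare the two specifications side by side, here is a minimal sketch on simulated data. All names and data-generating values are assumptions for illustration; vce(robust) is added because the ratio is not itself a count:

          Code:
          clear
          set seed 2015
          set obs 500
          * hypothetical exposure, covariate, and counts
          gen P = ceil(1000 + 99000*runiform())
          gen X = rnormal()
          gen DV = rpoisson(P*exp(-6 + 0.3*X))
          * first model: counts, conditioning on exposure
          poisson DV X, exposure(P) vce(robust)
          estimates store m_counts
          * second model: count:exposure ratio as the dependent variable
          gen ratio = DV/P
          poisson ratio X, vce(robust)
          estimates store m_ratio
          * coefficients and standard errors in one table
          estimates table m_counts m_ratio, b se

          How far apart the two columns are will depend on the data at hand, so this is only a way of inspecting the difference, not a statement about its size.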


          • #6
            Clyde and Rich, thanks.
            I'm also having some other related difficulties with the xtpoisson model and am finding little guidance in previous posts or a good reference book. (Although the topics are different, I'm continuing the thread; let me know if this should be a different thread.)

            1) I'm having difficulties calculating the Average Marginal Effects (AME). Some of what I tried is below. Is there a command to compute the AME automatically? How should I interpret these given that I'm controlling for P?



            Code:
            *DV=numbers of cases of the disease
            *X=deforested area in the municipality
            *P=population
            xtpoisson DV X i.year controls i.ano, fe exposure(pop)
            matrix B=e(b)
            * Bx is .0007524
            *AME:
            margins, dydx(X) // returns Bx= .0007524
            margins, dydx(X) predict(nu0) // dy/dx=20.94358  (s.e. .0000657)
            *does the above have an interpretation?
            
            * the xtpoisson postestimation manual indicates that, given the control for exposure, this is the correct marginal effect:
            margins, dydx(X)  predict(iru0) //  dy/dx=.0006343 (s.e. 7.05e-06)
            *should I interpret the above results as "per person"?
            
            *Following Cameron and Trivedi (2013), I tried to compute mean(Bx*exp(xb)), which I expected to equal Bx*y_av
            *Bx*exp( xb )
            predict a
            gen b=exp(a)*B[1,1]
            summ b  // = 19.85133
            
            *Bx*y_av
            summ  DV
            di B[1,1]*r(mean) // = 493.757
            
            *results above differ, shouldn't they be the same?

            2) I'm looking at the impact of deforestation on diseases at the municipal level, but the total municipal areas (TA) are very heterogeneous, and I'm wondering how to account for that. I tried using deforested area (X) and also the deforestation 'ratio' = X/TA. I estimated both specifications and I'm trying to reconcile the results (perhaps by comparing the marginal effects, as in #1).

            any advice or reference is appreciated.
            Last edited by Lucas Mation; 26 Feb 2015, 08:22.


            • #7
              Now also cross-posted at http://stats.stackexchange.com/quest...lation-and-siz
