
  • Negative binomial alternative to Sergio Correia’s ppmlhdfe command in Stata

    Hello everyone,

    I am working on a paper with a colleague using individual-level micro data from the US Census (2015-2019 ACS 5-year estimates). We are predicting the wages of individuals while controlling for a variety of covariates. One of the covariates is the state PUMA in which an individual resides. This becomes an issue because there are over 2,000 different state PUMAs across the country. Since we only want to control for state PUMA, we could absorb it using the areg command, which runs relatively quickly. However, since we are specifically interested in reporting accurate estimates of individuals’ wages to support our argument, we can’t rely on taking the anti-log of predictions from a regression with log wages as the dependent variable. Instead, to obtain accurate estimates we should use Poisson regression. But since there is dramatic over-dispersion of wages, we actually need to use negative binomial regression. We run into problems because the dataset is very large (~4 million observations) and the state PUMA variable has a large number of categories. Sometimes the model does not converge.

    My colleague and I are wondering if there is a negative binomial counterpart to Sergio Correia’s ppmlhdfe command in Stata. This command uses a pseudo-likelihood procedure instead of a maximum likelihood one, which dramatically speeds up the analysis. As I said, we don’t think we can use it for our analysis, because our dependent variable is over-dispersed and therefore requires an additional parameter to adequately model the over-dispersion.

    Any advice on this would be greatly appreciated.


  • #2
    Cross-posted at


    • #3
      But since there is dramatic over-dispersion of wages, we actually need to use negative binomial regression.
      See Jeff Wooldridge's comments in #3 of on why this assertion is incorrect. In general, you can implement high-dimensional fixed effects in linear models and in Poisson regression, but not in other nonlinear models.


      • #4
        You should use Poisson regression in this context, as you already said you want effects on the mean. Negative Binomial is not even a close second. As Andrew points out, you should ignore the canard that says one should not use Poisson regression when there is overdispersion. Poisson regression is completely robust; NegBin is not.

        If you have lots of individuals per PUMA -- as I believe you must -- then including the PUMA coefficients in a pooled Poisson estimation will work. I would suggest tricking Stata by using xtpoisson after an xtset puma, but then your choice of standard errors is somewhat restricted. You're forced to compute the nonrobust standard errors or cluster at the PUMA level. The first should not be used and the second might not be needed.
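
        A minimal sketch of the two approaches described above. The variable names wage, age, educ, and puma are hypothetical placeholders, not taken from the original data, and the covariate list is only illustrative.

        ```stata
        * Hypothetical variables: wage (outcome), age and educ (covariates),
        * puma (numeric state-PUMA identifier)

        * Approach 1: pooled Poisson with one indicator per PUMA.
        * With 2,000+ PUMAs this estimates many coefficients and can be slow.
        poisson wage age educ i.puma, vce(cluster puma)

        * Approach 2: the xtpoisson "trick". Declare puma as the panel
        * variable, then fit the fixed-effects Poisson estimator, which
        * removes the PUMA effects instead of estimating them.
        xtset puma
        xtpoisson wage age educ, fe vce(robust)
        ```

        Because wages are not integers, Stata may print a note about noncount outcomes when fitting the first model; the estimator is being used here as a quasi-likelihood method for the conditional mean, not as a model for count data.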


        • #5
          Thanks, everyone, for all the great feedback and for pointing out our common but incorrect assumptions about negative binomial regression.

          Jeff Wooldridge, we attempted the xtpoisson trick you suggested (specifying xtset puma and xtpoisson), but the FE specification for xtpoisson does not allow for clustered standard errors. Instead, we can choose oim (the default), robust, bootstrap, or jackknife. We specified a null HLM to check the ICC and found that about 9% of the variance in wages occurred between PUMAs, so we think clustering the SEs is in order. In light of this, should we reconsider using Sergio Correia's ppmlhdfe command, since it allows for clustering of standard errors?

          Thanks again, much appreciated.


          • #6
            Kasey: It's a quirk of xtpoisson (and a few other Stata commands) that vce(robust) and vce(cluster puma) are the same. The latter is not allowed for some reason. I think the idea is that with FE methods and T not so large, it's all or nothing. With xtreg, vce(robust) and vce(cluster puma) are both allowed and give identical standard errors: robust to serial correlation and heteroskedasticity. So if you want to cluster at the puma level, xtpoisson, fe vce(robust) does it. Jeff
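
            The equivalence described above can be checked directly with xtreg, where both options are accepted. The sketch below uses hypothetical variable names (wage, age, educ, puma) and assumes the community-contributed ppmlhdfe command is already installed; it also shows the ppmlhdfe call with explicit PUMA-level clustering that was asked about in #5.

            ```stata
            * Hypothetical variables: wage, age, educ, puma
            xtset puma

            * Linear FE model: these two calls should report identical
            * standard errors, because vce(robust) clusters on the panel variable.
            generate double lnwage = ln(wage)
            xtreg lnwage age educ, fe vce(robust)
            xtreg lnwage age educ, fe vce(cluster puma)

            * Poisson FE model: vce(cluster puma) is not accepted here,
            * but vce(robust) clusters at the puma level.
            xtpoisson wage age educ, fe vce(robust)

            * Alternative: ppmlhdfe (community-contributed; assumed installed)
            * absorbs the PUMA effects and accepts an explicit cluster variable.
            ppmlhdfe wage age educ, absorb(puma) vce(cluster puma)
            ```

            The slope estimates from the last two commands would be expected to agree closely, since both fit a Poisson conditional mean with PUMA effects; any small differences in the reported standard errors would most likely come from different finite-sample adjustments.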


            • #7
              Further to the excellent advice Jeff provided, I would just like to add that overdispersion does not even make sense when the dependent variable is not a count because we can change the relation between the mean and the variance simply by changing the scale. This also implies that the results of the negbin regression (or zero-inflated models) in this context will depend on the units used to measure the dependent variable.
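
              The scale argument above can be written out in one line. Here $Y$ denotes wages in some unit, with mean $\mu$ and variance $\sigma^2$, and $c > 0$ is an arbitrary change of units (the symbols are introduced only for this illustration).

              ```latex
              \[
                \mathrm{E}[cY] = c\,\mu, \qquad
                \mathrm{Var}(cY) = c^{2}\sigma^{2}, \qquad
                \frac{\mathrm{Var}(cY)}{\mathrm{E}[cY]}
                  = \frac{c^{2}\sigma^{2}}{c\,\mu}
                  = c \cdot \frac{\sigma^{2}}{\mu}.
              \]
              ```

              For example, the variance-to-mean ratio of wages measured in dollars is 1,000 times the ratio for the same wages measured in thousands of dollars ($c = 1000$), so whether the variable looks "overdispersed" relative to the Poisson benchmark of 1 depends entirely on the unit chosen.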


              • #8
                Thanks, Jeff Wooldridge, for the clarification on xtpoisson. That is not an obvious quirk, and I would not have picked up on it. Yes, that's a good point, Joao Santos Silva. I'll keep that in mind when I encounter overdispersion in the future. Thanks for the comments, both of you; very helpful.