Hello everyone,

I am working on a paper with a colleague using individual-level microdata from the US Census (2015-2019 ACS 5-year estimates). We are predicting individuals' wages while controlling for a variety of covariates. One of the covariates is the state PUMA in which an individual resides. This becomes an issue because there are over 2,000 different state PUMAs across the country. Since we only want to control for state PUMA, we could absorb it using the areg command, which runs relatively quickly. However, since we are specifically interested in reporting accurate estimates of individuals' wages to support our argument, we can't rely on taking the anti-log of predictions from a model of logged wages. Instead, to obtain accurate estimates we should use Poisson regression. But since wages are dramatically over-dispersed, we actually need to use negative binomial regression. We run into problems because the dataset is very large (~4 million observations) and the state PUMA variable has a large number of categories. Sometimes the model does not converge.
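To illustrate the anti-log concern, here is a minimal sketch in Python using simulated data (not our ACS data). The variable names, sample size, and parameter values are all made up for illustration, and the log-normal error is an assumption of the simulation. It fits OLS on logged wages and compares the average of the exponentiated predictions with the actual average wage.

```python
import numpy as np

# Simulated data only; every value below is an illustrative assumption.
rng = np.random.default_rng(0)
n = 200_000
x = rng.normal(size=n)                     # a single hypothetical covariate
sigma = 0.8                                # std. dev. of the log-scale error
log_wage = 10.0 + 0.1 * x + rng.normal(scale=sigma, size=n)
wage = np.exp(log_wage)

# OLS of log(wage) on a constant and x.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, log_wage, rcond=None)

# Naive retransformation: exponentiate the predicted log wage.
naive_pred = np.exp(X @ beta)

mean_wage = wage.mean()
mean_naive = naive_pred.mean()
ratio = mean_naive / mean_wage

print(f"mean actual wage:       {mean_wage:,.0f}")
print(f"mean naive prediction:  {mean_naive:,.0f}")
print(f"ratio (naive / actual): {ratio:.3f}")
```

Under these assumptions the naive predictions understate the average wage by roughly a factor of exp(-sigma^2 / 2), which is why we would prefer to model wages in levels.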

My colleague and I are wondering whether there is a negative binomial counterpart to Sergio Correia's ppmlhdfe command in Stata. This command uses a pseudo-likelihood procedure instead of a maximum likelihood one, which dramatically speeds up the analysis. As I said, we don't think we can use it for our analysis, because our dependent variable is over-dispersed and therefore requires an additional parameter to adequately model the over-dispersion.

Any advice on this would be greatly appreciated.

Best,

Kasey

## Comment