Large panel dataset and an unusual count outcome variable with zero-inflation

Salva Montenegro


08 Mar 2023, 20:49

I would like to ask for your kind opinion on predicting a count outcome variable that is zero-inflated and whose standard deviation is significantly larger than its mean. I apologize in advance for the long post and the multitude of questions.

I have over 1 million monthly observations from over 3,000 panels (unbalanced) over the course of 50 years. The outcome variable is the sum of two different counts (production volumes), one of which is multiplied by a non-integer weight. About 250,000 observations have a value of 0, and there are no negative values.

Code:

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
         dv1 |  1,007,421    25475.76    297671.7          0   1.78e+07

1. Because the scale of the resulting variable is extremely large, I rescaled it by dividing by 1,000. I then rounded the rescaled variable to the closest integer to remove the decimals introduced by the weighting and rescaling. Should I be concerned about this rounding operation? Rounding slightly increases the number of zeros. The results and coefficients from the xtpoisson and xtreg models did not differ notably with or without the rounding.
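For reference, the rescaling and rounding step can be sketched as follows. This is only an illustration on simulated data, not the actual dataset: the toy dv1, the helper variable dv_scaled, the seed, and the distribution are all assumptions made for this example.

```stata
* Minimal sketch on simulated data; dv1 here is a stand-in, not the real outcome.
clear
set seed 12345
set obs 1000

* Toy non-negative, non-integer outcome with a mass of zeros (assumed shape)
generate double dv1 = cond(runiform() < 0.25, 0, exp(rnormal(8, 2)))

* Rescale by 1,000, then round to the nearest integer
generate double dv_scaled = dv1 / 1000
generate long dv2 = round(dv_scaled)

* Compare the number of zeros before and after rounding
count if dv1 == 0
scalar zeros_before = r(N)
count if dv2 == 0
scalar zeros_after = r(N)
display "zeros before: " zeros_before "   zeros after: " zeros_after
```

Any positive value below 0.5 on the rescaled variable becomes 0 after round(), which is where the extra zeros come from.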

After rescaling and rounding to the closest integer:

Code:

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
         dv2 |  1,007,421    25.44277    297.6744          0      17789

On the other hand, without rescaling, some coefficients are extremely large and the AIC and BIC statistics are in the billions.

2. Because I would like to use cluster-robust standard errors as well as year and panel fixed effects, I lean toward xtpoisson/ppmlhdfe. Heteroscedasticity and omitted variable bias are definitely important concerns. I obtained identical results from xtpoisson and ppmlhdfe. Including an exposure variable substantially improved the AIC and BIC statistics of the xtpoisson models.
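To make the comparison concrete, the two specifications have roughly the following shape. This is a sketch on a small simulated panel, not my actual models: the names id, ym, year, x1, expvar, and y are hypothetical, and the data-generating process is an assumption. ppmlhdfe is community-contributed, so the sketch only calls it if it is installed.

```stata
* Sketch on a simulated panel; all variable names are hypothetical.
clear
set seed 2023
set obs 50
generate id = _n
generate double alpha = rnormal()            // panel-specific effect
expand 24                                    // 24 months per panel
bysort id: generate ym = tm(2000m1) + _n - 1
format ym %tm
generate year = yofd(dofm(ym))
generate double x1 = rnormal()
generate double expvar = runiform(1, 5)      // assumed exposure variable
generate long y = rpoisson(expvar * exp(0.5*x1 + alpha))

xtset id ym

* Panel fixed effects via xtpoisson, year effects as dummies,
* robust standard errors, log(exposure) with coefficient fixed at 1
xtpoisson y x1 i.year, fe vce(robust) exposure(expvar)

* Same specification with both sets of fixed effects absorbed
capture which ppmlhdfe
if _rc == 0 {
    ppmlhdfe y x1, absorb(id year) vce(cluster id) exposure(expvar)
}
```

In the simulated data the true coefficient on x1 is 0.5, so both commands should return an estimate close to that value.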

But I am also concerned about the extremely large difference between the mean and the standard deviation of this variable, and about potentially violating the assumptions of xtpoisson. I am aware that this topic has been covered on Statalist in the past, but the difference in my case is extreme. Should I be concerned? As an alternative, I tried an xtnbreg model with random effects, but its results were very odd from a theoretical standpoint.
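To put a number on that difference, the variance-to-mean ratio can be computed directly from the summary statistics reported above for dv2. Note that this is the unconditional ratio, so it is only a rough indication; the model's assumptions are stated conditional on the covariates and fixed effects.

```stata
* Unconditional variance-to-mean ratio of dv2 from the reported summary statistics
scalar vmr = (297.6744^2) / 25.44277
display "variance-to-mean ratio: " vmr
```

The ratio comes out on the order of 3,500, whereas it would equal 1 for an equidispersed variable.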

3. I also tried to transform this variable by adding a constant and then taking its log. This resulted in a more normal distribution, except for the spike caused by the large number of zeros.

I obtained almost identical results from xtreg and xtpoisson (when I don’t use an exposure variable and instead include it as another covariate) with year and panel fixed effects and cluster-robust standard errors. But I still don’t think that this variable can be modeled correctly with a linear model such as xtreg. Would you recommend another transformation for predicting this outcome variable with a linear fixed-effects model?
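The transformation and the linear model described above look roughly like this. Again, this is a sketch on simulated data with hypothetical names (id, t, x1, y, ln_y1), and the constant of 1 is an assumption chosen only for illustration.

```stata
* Sketch on a simulated panel; names and the constant c = 1 are assumptions.
clear
set seed 314
set obs 50
generate id = _n
generate double alpha = rnormal()            // panel-specific effect
expand 24
bysort id: generate t = _n
generate double x1 = rnormal()
generate long y = rpoisson(exp(1 + 0.5*x1 + alpha))

* Log of the outcome plus a constant, so that zeros stay in the sample
generate double ln_y1 = ln(y + 1)

xtset id t

* Linear fixed-effects model with standard errors clustered by panel
xtreg ln_y1 x1, fe vce(cluster id)
```

The zeros all map to ln(1) = 0 under this choice of constant, which is the spike mentioned above.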

4. The McFadden’s pseudo R2 that I calculated manually, the one reported by ppmlhdfe, and the R2 values reported by xtreg are all around 96%. These high R2 statistics seem to stem from the inclusion of fixed effects and a lagged outcome variable (as is the convention in my discipline). Should I be concerned? Should I try to difference or de-trend this variable?
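For clarity, the manual calculation follows the usual definition of McFadden’s pseudo R2, one minus the ratio of the full-model log likelihood to the constant-only log likelihood. The sketch below illustrates that definition on simulated data; it uses plain poisson with panel dummies as a stand-in for the fixed-effects models, and all names (id, x1, y, ll_null, ll_full, r2_manual) are hypothetical.

```stata
* Sketch on simulated data; poisson with panel dummies is a stand-in model.
clear
set seed 99
set obs 50
generate id = _n
generate double alpha = rnormal()            // panel-specific effect
expand 24
generate double x1 = rnormal()
generate long y = rpoisson(exp(2 + 0.5*x1 + 0.5*alpha))

* Constant-only model: log likelihood of the null
poisson y
scalar ll_null = e(ll)

* Full model with panel dummies
poisson y x1 i.id
scalar ll_full = e(ll)

* McFadden's pseudo R2 computed by hand, next to the value Stata stores
scalar r2_manual = 1 - ll_full/ll_null
display "manual: " r2_manual "   reported: " e(r2_p)
```

Running the full model with and without i.id is a quick way to see how much of the statistic is due to the panel dummies alone.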

5. If using linear models is not an option, how could I account for the potential endogeneity among my predictors in a count model? Theoretically, there might be a dynamic relationship between two of my predictors of interest and the lagged dependent variable.

Thank you.

Last edited by Salva Montenegro; 08 Mar 2023, 20:56.
Tags: fixed effects, regression, Time Series, xtpoisson
