
  • Large panel dataset and an unusual count outcome variable with zero-inflation

    I would like to ask for your kind opinion on predicting a count outcome variable that is zero-inflated and whose standard deviation is significantly larger than its mean. I apologize in advance for the long post and the multitude of questions.

    I have over 1 million monthly observations from over 3,000 panels (unbalanced) over the course of 50 years. The outcome variable is the sum of two different counts (production volumes), one of which is multiplied by a non-integer weight. The variable contains around 250,000 observations with a value of “0” and no negative values.

    Code:
        Variable |        Obs        Mean    Std. dev.       Min        Max
    -------------+---------------------------------------------------------
             dv1 |  1,007,421    25475.76    297671.7          0   1.78e+07

    1. Because the scale of the resulting variable is extremely large, I rescaled it by dividing by 1,000. I then rounded the result to remove the decimals introduced by the weighting and rescaling. Should I be concerned about this rounding operation? Rounding down to the closest integer slightly increases the number of zeros. In terms of coefficients, I could not find meaningful differences between the xtpoisson and xtreg results with and without this rounding operation.

    After rescaling and rounding to the closest integer:
    Code:
        Variable |        Obs        Mean    Std. dev.       Min        Max
    -------------+---------------------------------------------------------
             dv2 |  1,007,421    25.44277    297.6744          0      17789

    On the other hand, without rescaling, some coefficients are extremely large and the AIC and BIC statistics are in the billions.
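
    A minimal sketch of the rescaling and rounding step, run on simulated data so that it is self-contained. The names dv_scaled, dv2_round, and dv2_floor are hypothetical, and the simulated dv1 only stands in for the real outcome. Both round() and floor() are shown because the two rules can add a different number of zeros.

```stata
* Toy data so the snippet runs on its own; dv1 stands in for the raw outcome.
clear
set seed 12345
set obs 1000
generate double dv1 = cond(runiform() < 0.25, 0, exp(rnormal(8, 2)))

* Rescale by 1,000, then apply the two candidate rounding rules.
generate double dv_scaled = dv1 / 1000
generate long dv2_round = round(dv_scaled)   // nearest integer
generate long dv2_floor = floor(dv_scaled)   // always rounds down

* Rounding can add zeros but never removes them; compare the counts.
count if dv1 == 0
count if dv2_round == 0
count if dv2_floor == 0
summarize dv1 dv_scaled dv2_round dv2_floor
```

    Comparing the three counts shows how many small positive values each rule turns into zeros.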


    2. Because I would like to use cluster-robust standard errors as well as year and panel fixed effects, I lean toward xtpoisson/ppmlhdfe. Heteroscedasticity and omitted variable bias are definitely important concerns. I was able to obtain identical results from xtpoisson and ppmlhdfe. Including an exposure variable substantially improved the AIC and BIC statistics of the xtpoisson models.
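
    A sketch of the two specifications being compared, run on a simulated panel so that it is self-contained. All variable names (id, month, year, x, size, y) are hypothetical, and the data-generating process is an assumption made only so the commands have something to estimate. ppmlhdfe is a user-written command, so the sketch runs it only if it is installed.

```stata
* Toy panel purely for illustration; every name here is hypothetical.
clear
set seed 2023
set obs 200
generate id = _n
expand 24
bysort id: generate month = _n
generate year = ceil(month / 12)
generate double x = rnormal()
generate double size = exp(rnormal())        // stand-in exposure variable
generate long y = rpoisson(size * exp(0.5 * x))
xtset id month

* Poisson with panel fixed effects, year dummies, and cluster-robust SEs.
xtpoisson y x i.year, fe vce(robust) exposure(size)

* Same specification with ppmlhdfe, if installed (ssc install ppmlhdfe).
capture which ppmlhdfe
if _rc == 0 {
    ppmlhdfe y x, absorb(id year) exposure(size) vce(cluster id)
}
else {
    display "ppmlhdfe not installed; skipping"
}
```

    With exposure(size), ln(size) enters with its coefficient constrained to 1; entering ln(size) as an ordinary covariate instead would leave that coefficient free.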

    But I am also concerned about the extremely large difference between the mean and the standard deviation of this variable, and about potentially violating the assumptions of xtpoisson. I am aware that this topic has been covered on Statalist in the past, but the difference in my case is extreme. Should I be concerned? As an alternative, I tried an xtnbreg model with random effects, but it produced results that make little theoretical sense.
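
    To put a number on that gap, the unconditional variance-to-mean ratio can be computed from the dv2 summary table above. This is only a rough sketch: it uses the raw moments, not the variance conditional on covariates and fixed effects, and the scalar names are made up. The ratio comes out at roughly 3,500.

```stata
* Figures copied from the dv2 summary table above.
scalar dv2_mean = 25.44277
scalar dv2_sd   = 297.6744

* Unconditional variance divided by the mean.
display "variance-to-mean ratio = " scalar(dv2_sd)^2 / scalar(dv2_mean)
```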

    3. I also tried to transform this variable by adding a constant and then taking its log. This resulted in a more normal distribution, except for the spike caused by the large number of zeros.
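
    A sketch of this transformation on simulated data. The variable names are hypothetical, and the constant c = 1 is an assumption chosen for illustration, since any positive constant behaves the same way: every zero is mapped to the single value ln(c), which is what produces the spike.

```stata
* Toy data so the snippet runs on its own; y stands in for the rescaled outcome.
clear
set seed 2023
set obs 1000
generate long y = cond(runiform() < 0.25, 0, rpoisson(25))

* c is a placeholder for the added constant; 1 is assumed here.
local c = 1
generate double ln_y = ln(y + `c')

* All zeros collapse onto ln(c), so they pile up at one point.
count if y == 0
summarize ln_y, detail
```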



    I was able to obtain almost identical results from xtreg and xtpoisson (when I include the exposure variable as an ordinary covariate instead of as an exposure), both with year and panel fixed effects and cluster-robust standard errors. But I still doubt that this variable can be modeled appropriately with a linear model such as xtreg. Would you recommend another transformation that would allow me to model this outcome with a linear fixed-effects model?

    4. The McFadden pseudo-R2 that I calculated manually, the one reported by ppmlhdfe, and the R2 values reported by xtreg are all around 96%. These high R2 statistics seem to stem from the inclusion of fixed effects and a lagged outcome variable (as is the convention in my discipline). Should I be concerned? Should I try to difference or de-trend this variable?

    5. If using linear models is not an option, how could I account for potential endogeneity among my predictors in a count model? Theoretically, there might be a dynamic relationship between two of my predictors of interest and the lagged dependent variable.

    Thank you.
    Last edited by Salva Montenegro; 08 Mar 2023, 20:56.