Which regression model to use for zero-inflated distribution?

Conor Cotton

Join Date: Jul 2020

Posts: 8
#1

08 Aug 2020, 06:08

Hello,

I have a question regarding which type of regression model is right to use for a zero-inflated distribution.

Some info about the data:

- The dependent variable for one of my hypotheses is ‘distvolatility’ (shown in the table below).
- Its distribution is heavily zero-inflated (1304 out of 1459 observations are 0) and positively skewed. These zeroes are real/true values (not censored/truncated).
- Respondents with a non-zero ‘distvolatility’ can take one of only 15 possible scores, ranging from .2959995 to 32.373 (no values other than those shown below are possible).

Code:

tab distvolatility

distvolatil |
        ity |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      1,304       89.38       89.38
   .2959995 |         10        0.69       90.06
   .8690004 |         39        2.67       92.73
      4.661 |          7        0.48       93.21
     11.673 |          7        0.48       93.69
     12.542 |         12        0.82       94.52
     14.874 |         17        1.17       95.68
      15.17 |          4        0.27       95.96
     16.334 |          5        0.34       96.30
     17.203 |         11        0.75       97.05
     19.535 |          8        0.55       97.60
     19.831 |          3        0.21       97.81
     31.208 |          9        0.62       98.42
     31.504 |          3        0.21       98.63
     32.077 |         15        1.03       99.66
     32.373 |          5        0.34      100.00
------------+-----------------------------------
      Total |      1,459      100.00

My question is which type of regression model is best to use for this type of zero-inflated distribution. I have done a lot of research and have seen a lot of different suggestions, although none seem to be completely correct for my data.

- Zero-inflated Poisson/zero-inflated negative binomial - These both assume count data. Would it severely bias the results if I were to use one of these models (likely zinb, as the variance is much higher than the mean), given that my data are discrete but not count data?

- Two-step generalised linear model - Another option is to model the probability of distvolatility being 0/1 as a binary logistic regression, and then use a GLM function on the non-zero values.

- Tobit regression - I have also seen this mentioned as an option for zero-inflated distributions, although it assumes the zeroes are censored, which is not the case here.
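The two-step option above can be sketched as follows. This is a minimal Python illustration, not part of the original post: it assumes a logit for the switch/no-switch part and a log link for the GLM on the non-zero values, and the function names and the value 13.7 are hypothetical. The combined conditional mean is the product of the two parts.

```python
import math


def logistic(z):
    """Inverse logit: maps a linear index to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def two_part_mean(xb_switch, xb_distance):
    """E[y|x] = P(y>0|x) * E[y|y>0,x].

    xb_switch   : linear index from the logit for y > 0 (assumed first part)
    xb_distance : linear index from a log-link GLM fitted on y > 0 only
                  (the log link is an assumption of this sketch)
    """
    return logistic(xb_switch) * math.exp(xb_distance)


# Intercept-only check: 155 of the 1,459 respondents have a non-zero value.
p_switch = 155 / 1459
xb_switch = math.log(p_switch / (1 - p_switch))

# Illustrative (hypothetical) mean distance among switchers.
mean_distance_switchers = 13.7
xb_distance = math.log(mean_distance_switchers)

overall_mean = two_part_mean(xb_switch, xb_distance)
print(round(overall_mean, 3))
```

With covariates, each part gets its own coefficients, so a variable can affect whether a voter switches and how far they switch in different ways.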

The zero-inflated negative binomial seems to be the best option at the moment, but any advice would be greatly appreciated!
Joao Santos Silva

Join Date: Apr 2014

Posts: 3015
#2

08 Aug 2020, 09:56

Dear Conor Cotton,

First of all, let's get one thing clear: your data has zeros, but there are no grounds to say that it is zero inflated. Zero inflation means that you have more zeros than would be natural under a given benchmark distribution, so there is no zero inflation unless you specify a benchmark distribution. Unfortunately, many people say that the data is zero inflated if it has zeros, but that is misleading.
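To make the benchmark point concrete, here is a minimal Python sketch (not from the thread) that compares the observed share of zeros with the share predicted by a Poisson distribution with the same unconditional mean. The choice of a Poisson benchmark and the variable names are assumptions made for illustration; a different benchmark, or one that conditions on covariates, would give a different answer.

```python
import math

# Frequency table from post #1: value of distvolatility -> number of respondents
freq = {
    0.0: 1304, 0.2959995: 10, 0.8690004: 39, 4.661: 7, 11.673: 7,
    12.542: 12, 14.874: 17, 15.17: 4, 16.334: 5, 17.203: 11,
    19.535: 8, 19.831: 3, 31.208: 9, 31.504: 3, 32.077: 15, 32.373: 5,
}

n = sum(freq.values())
mean = sum(value * count for value, count in freq.items()) / n

# Share of zeros actually observed in the sample
observed_zero_share = freq[0.0] / n

# Share of zeros implied by a Poisson benchmark with the same mean:
# P(Y = 0) = exp(-mean). This benchmark is an assumption of the sketch.
poisson_zero_share = math.exp(-mean)

print(f"observed share of zeros : {observed_zero_share:.3f}")
print(f"Poisson benchmark share : {poisson_zero_share:.3f}")
```

Under these assumptions the observed share of zeros is well above the Poisson benchmark, but that is a statement about this particular benchmark rather than a property of the data alone.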

Now for the substance of your question: you are right in saying that the Tobit and zero inflated models are unlikely to work in this context. To be able to help, I would like to understand more about your data: what does it represent and what do the numbers mean? Also, you say that you want to use a regression; I presume that you just want to estimate the conditional mean, right?

Finally, although it will be far from perfect, I would suggest that you start by trying Poisson regression just to see what the results look like.

Best wishes,

Joao
Conor Cotton

Join Date: Jul 2020

Posts: 8
#3

08 Aug 2020, 11:13

Hello Joao,

Thanks very much for your reply!

Some more detailed information about the data:

- My study focuses on electoral volatility - that is, shifts in voters' party preferences during an election campaign. The specific hypothesis for my question here relates to the ideological distance of voter party preference shifts during an election campaign.

- The dependent variable 'distvolatility' measures the difference between the ideological left-right score of the party a voter said they planned to vote for pre-election and the ideological left-right score of the party they actually voted for in the election.

- For example, a voter who said in the pre-election survey wave that they planned to vote for the farthest-left party, but then voted for the farthest-right party in the election itself, receives the maximum distance volatility score of 32.373.

- Most voters in the sample (1304 out of 1459), however, did not switch, meaning that they receive a distvolatility value of 0.

- The data is therefore positively skewed, with most observations at 0, and 15 possible distvolatility values for those voters that did switch party preference between the pre- and post-election wave. No other values of distvolatility are possible other than those shown above.

- The reason I believe the data to be zero-inflated is that another study in this area (see link p.592, if it helps) using a similar dependent variable stated that the data was "very skewed towards zero", and used a negative binomial model (https://ejpr.onlinelibrary.wiley.com...475-6765.12049).

- I have tested a variety of models (Poisson, negative binomial, zero-inflated Poisson, zero-inflated negative binomial, GLM...), which have produced a variety of results. The unusual distribution of distvolatility is the main reason why I am unsure which model type and results are most valid.

Thanks again for your help,
Conor

Conor Cotton

Join Date: Jul 2020
Posts: 8

08 Aug 2020, 11:26

I have included the output of different model types below, if it helps. The independent variables of interest are intmedia and socialmedia.

POISSON REGRESSION

Code:

 note: you are responsible for interpretation of noncount dep. variable

Iteration 0:   log pseudolikelihood = -4285.1848  
Iteration 1:   log pseudolikelihood = -4273.3653  
Iteration 2:   log pseudolikelihood =  -4273.308  
Iteration 3:   log pseudolikelihood = -4273.3078  

Poisson regression                              Number of obs     =      1,449
                                                Wald chi2(11)     =      75.79
Log pseudolikelihood = -4273.3078               Prob > chi2       =     0.0000

-----------------------------------------------------------------------------------
                  |               Robust
   distvolatility |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
------------------+----------------------------------------------------------------
         intmedia |  -.0228221     .10833    -0.21   0.833    -.2351451    .1895009
      socialmedia |  -.0749317   .1079076    -0.69   0.487    -.2864267    .1365633
            age_i |  -.0149114   .0114957    -1.30   0.195    -.0374425    .0076197
           female |  -.8162035   .3536464    -2.31   0.021    -1.509338   -.1230692
       highincome |  -.0323331   .2922719    -0.11   0.912    -.6051754    .5405093
         highknow |   -.288478   .3224235    -0.89   0.371    -.9204164    .3434604
partyclose_binary |  -3.736187   .6763477    -5.52   0.000    -5.061804    -2.41057
      leftright_i |  -.1947974   .0515355    -3.78   0.000    -.2958051   -.0937896
   political_mood |  -.0166545   .0110181    -1.51   0.131    -.0382496    .0049406
   networkhet12_i |   .0665276   .0632883     1.05   0.293    -.0575152    .1905705
         nptvnews |   -.015744   .0474749    -0.33   0.740    -.1087931    .0773052
            _cons |   1.879182    .932596     2.02   0.044     .0513273    3.707037
-----------------------------------------------------------------------------------

NEGATIVE BINOMIAL REGRESSION

Code:

 note: you are responsible for interpretation of non-count dep. variable

Fitting Poisson model:

Iteration 0:   log pseudolikelihood = -4285.1848  
Iteration 1:   log pseudolikelihood = -4273.3653  
Iteration 2:   log pseudolikelihood =  -4273.308  
Iteration 3:   log pseudolikelihood = -4273.3078  

Fitting constant-only model:

Iteration 0:   log pseudolikelihood = -1993.0908  
Iteration 1:   log pseudolikelihood = -803.25524  
Iteration 2:   log pseudolikelihood = -795.96579  
Iteration 3:   log pseudolikelihood =  -795.7392  
Iteration 4:   log pseudolikelihood = -795.73908  
Iteration 5:   log pseudolikelihood = -795.73908  

Fitting full model:

Iteration 0:   log pseudolikelihood = -783.37498  
Iteration 1:   log pseudolikelihood = -775.46255  
Iteration 2:   log pseudolikelihood = -775.19466  
Iteration 3:   log pseudolikelihood = -775.19343  
Iteration 4:   log pseudolikelihood = -775.19343  

Negative binomial regression                    Number of obs     =      1,449
                                                Wald chi2(11)     =     113.83
Dispersion           = mean                     Prob > chi2       =     0.0000
Log pseudolikelihood = -775.19343               Pseudo R2         =     0.0258

-----------------------------------------------------------------------------------
                  |               Robust
   distvolatility |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
------------------+----------------------------------------------------------------
         intmedia |   .1255122   .1321453     0.95   0.342    -.1334878    .3845123
      socialmedia |  -.1072578   .1286068    -0.83   0.404    -.3593225    .1448069
            age_i |  -.0256229   .0109925    -2.33   0.020    -.0471678   -.0040779
           female |  -1.506849   .2982102    -5.05   0.000     -2.09133   -.9223676
       highincome |  -.2024665   .3138774    -0.65   0.519    -.8176549    .4127219
         highknow |  -.3929619   .3987052    -0.99   0.324     -1.17441    .3884859
partyclose_binary |  -4.201325   .5646032    -7.44   0.000    -5.307927   -3.094723
      leftright_i |  -.2738756   .0679834    -4.03   0.000    -.4071206   -.1406306
   political_mood |  -.0419311   .0126475    -3.32   0.001    -.0667197   -.0171425
   networkhet12_i |   .1365838   .0632954     2.16   0.031     .0125271    .2606405
         nptvnews |  -.0993774   .0713641    -1.39   0.164    -.2392485    .0404937
            _cons |   2.221691   .7427555     2.99   0.003     .7659175    3.677465
------------------+----------------------------------------------------------------
         /lnalpha |   3.521293    .120265                      3.285577    3.757008
------------------+----------------------------------------------------------------
            alpha |   33.82812    4.06834                      26.72441     42.8201
-----------------------------------------------------------------------------------

ZERO-INFLATED POISSON

Code:

Fitting constant-only model:

Iteration 0:   log pseudolikelihood = -3904.5654  (not concave)
Iteration 1:   log pseudolikelihood = -1655.8721  
Iteration 2:   log pseudolikelihood = -1426.9471  
Iteration 3:   log pseudolikelihood = -1310.2937  
Iteration 4:   log pseudolikelihood = -1309.9727  
Iteration 5:   log pseudolikelihood = -1309.9727  

Fitting full model:

Iteration 0:   log pseudolikelihood = -1309.9727  
Iteration 1:   log pseudolikelihood = -1199.3679  
Iteration 2:   log pseudolikelihood = -1190.7897  
Iteration 3:   log pseudolikelihood = -1190.0327  
Iteration 4:   log pseudolikelihood = -1190.0138  
Iteration 5:   log pseudolikelihood = -1190.0138  

Zero-inflated Poisson regression                Number of obs     =      1,449
                                                Nonzero obs       =        148
                                                Zero obs          =      1,301

Inflation model      = logit                    Wald chi2(11)     =      26.23
Log pseudolikelihood = -1190.014                Prob > chi2       =     0.0060

-----------------------------------------------------------------------------------
                  |               Robust
   distvolatility |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
------------------+----------------------------------------------------------------
distvolatility    |
         intmedia |   .1457483   .0620504     2.35   0.019     .0241318    .2673647
      socialmedia |  -.0902421   .0615806    -1.47   0.143    -.2109379    .0304537
            age_i |   .0088801   .0067993     1.31   0.192    -.0044463    .0222065
           female |  -.3061134   .2075961    -1.47   0.140    -.7129942    .1007674
       highincome |   .0301107   .1847404     0.16   0.871    -.3319739    .3921952
         highknow |  -.5226967   .2232216    -2.34   0.019    -.9602031   -.0851904
partyclose_binary |  -2.342958   1.179573    -1.99   0.047    -4.654879    -.031037
      leftright_i |  -.0818301   .0426122    -1.92   0.055    -.1653484    .0016883
   political_mood |  -.0191852   .0096471    -1.99   0.047    -.0380932   -.0002773
   networkhet12_i |   .0478215   .0482742     0.99   0.322    -.0467942    .1424373
         nptvnews |  -.0049585   .0452432    -0.11   0.913    -.0936335    .0837166
            _cons |    1.78096   .6539628     2.72   0.006     .4992168    3.062704
------------------+----------------------------------------------------------------
inflate           |
           female |   .4496005   .2412327     1.86   0.062    -.0232069    .9224079
            age_i |   .0281381   .0077531     3.63   0.000     .0129422    .0433339
partyclose_binary |    1.43222   1.188294     1.21   0.228    -.8967947    3.761234
      leftright_i |   .1277518   .0454648     2.81   0.005     .0386424    .2168611
            _cons |   .1349397    .398905     0.34   0.735    -.6468997     .916779
-----------------------------------------------------------------------------------

ZERO-INFLATED NEGATIVE BINOMIAL

Code:

Fitting constant-only model:

Iteration 0:   log pseudolikelihood = -1510.8976  (not concave)
Iteration 1:   log pseudolikelihood = -978.52393  (not concave)
Iteration 2:   log pseudolikelihood = -826.25909  (not concave)
Iteration 3:   log pseudolikelihood = -786.21785  
Iteration 4:   log pseudolikelihood = -785.55506  
Iteration 5:   log pseudolikelihood =  -772.3916  (not concave)
Iteration 6:   log pseudolikelihood = -771.64816  (not concave)
Iteration 7:   log pseudolikelihood = -771.19226  
Iteration 8:   log pseudolikelihood = -769.78822  (backed up)
Iteration 9:   log pseudolikelihood = -767.21199  
Iteration 10:  log pseudolikelihood = -766.12564  
Iteration 11:  log pseudolikelihood = -766.01186  
Iteration 12:  log pseudolikelihood = -766.00875  
Iteration 13:  log pseudolikelihood = -766.00874  

Fitting full model:

Iteration 0:   log pseudolikelihood = -766.00874  
Iteration 1:   log pseudolikelihood = -765.25855  
Iteration 2:   log pseudolikelihood = -761.59647  
Iteration 3:   log pseudolikelihood = -760.83115  
Iteration 4:   log pseudolikelihood = -760.82859  
Iteration 5:   log pseudolikelihood = -760.82859  

Zero-inflated negative binomial regression      Number of obs     =      1,449
                                                Nonzero obs       =        148
                                                Zero obs          =      1,301

Inflation model      = logit                    Wald chi2(11)     =      27.46
Log pseudolikelihood = -760.8286                Prob > chi2       =     0.0039

-----------------------------------------------------------------------------------
                  |               Robust
   distvolatility |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
------------------+----------------------------------------------------------------
distvolatility    |
         intmedia |   .1294945   .1323249     0.98   0.328    -.1298575    .3888465
      socialmedia |  -.1075608   .1173758    -0.92   0.359     -.337613    .1224915
            age_i |    .008255   .0090306     0.91   0.361    -.0094447    .0259547
           female |   -.465647   .3159615    -1.47   0.141     -1.08492    .1536262
       highincome |  -.2211759   .2872466    -0.77   0.441    -.7841688     .341817
         highknow |  -.5708536   .3631838    -1.57   0.116    -1.282681    .1409735
partyclose_binary |  -1.474671   1.276883    -1.15   0.248    -3.977316    1.027973
      leftright_i |  -.2626629   .1262407    -2.08   0.037    -.5100902   -.0152356
   political_mood |  -.0343388   .0147464    -2.33   0.020    -.0632412   -.0054363
   networkhet12_i |   .0505506   .0644821     0.78   0.433     -.075832    .1769332
         nptvnews |  -.0657494   .0805689    -0.82   0.414    -.2236615    .0921626
            _cons |   .9160213   .7554607     1.21   0.225    -.5646544    2.396697
------------------+----------------------------------------------------------------
inflate           |
           female |   3.723743   3.270855     1.14   0.255    -2.687016     10.1345
            age_i |   .1687383   .0742475     2.27   0.023     .0232158    .3142608
partyclose_binary |   7.102586   5.893729     1.21   0.228    -4.448911    18.65408
      leftright_i |   .2254805   .3197976     0.71   0.481    -.4013114    .8522723
            _cons |  -13.35607   6.778797    -1.97   0.049    -26.64227   -.0698728
------------------+----------------------------------------------------------------
         /lnalpha |   3.233527   .2283771    14.16   0.000     2.785916    3.681137
------------------+----------------------------------------------------------------
            alpha |   25.36896    5.79369                      16.21466    39.69151
-----------------------------------------------------------------------------------

TOBIT MODEL

Code:

Refining starting values:

Grid node 0:   log likelihood = -1980.1584

Fitting full model:

Iteration 0:   log pseudolikelihood = -1980.1584  
Iteration 1:   log pseudolikelihood = -1248.2141  
Iteration 2:   log pseudolikelihood = -978.95445  
Iteration 3:   log pseudolikelihood =  -833.2176  
Iteration 4:   log pseudolikelihood = -788.53571  
Iteration 5:   log pseudolikelihood = -786.99313  
Iteration 6:   log pseudolikelihood = -786.96993  
Iteration 7:   log pseudolikelihood = -786.96986  

Tobit regression                                Number of obs     =      1,449
                                                   Uncensored     =        143
Limits: lower = 0                                  Left-censored  =      1,301
        upper = 32.37                              Right-censored =          5

                                                F(  11,   1438)   =       4.52
                                                Prob > F          =     0.0000
Log pseudolikelihood = -786.96986               Pseudo R2         =     0.0349

---------------------------------------------------------------------------------------
                      |               Robust
       distvolatility |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
----------------------+----------------------------------------------------------------
             intmedia |  -1.347353   1.209848    -1.11   0.266    -3.720609    1.025902
          socialmedia |   -.949716   1.177725    -0.81   0.420     -3.25996    1.360528
                age_i |   -.383981   .1181594    -3.25   0.001    -.6157643   -.1521977
               female |  -9.659573     3.4983    -2.76   0.006    -16.52189   -2.797255
           highincome |  -1.329862   3.355703    -0.40   0.692    -7.912459    5.252736
             highknow |  -.7423384    3.70838    -0.20   0.841    -8.016752    6.532075
    partyclose_binary |   -25.1309   8.518189    -2.95   0.003    -41.84031   -8.421494
          leftright_i |  -2.235994   .6241807    -3.58   0.000    -3.460396   -1.011592
       political_mood |  -.1670533   .1262273    -1.32   0.186    -.4146626     .080556
       networkhet12_i |   .6267707   .6811763     0.92   0.358    -.7094349    1.962976
             nptvnews |  -.2141815   .6432956    -0.33   0.739     -1.47608    1.047717
                _cons |   -2.00573   10.18467    -0.20   0.844    -21.98413    17.97267
----------------------+----------------------------------------------------------------
 var(e.distvolatility)|   778.3652   130.3482                      560.4252    1081.058
---------------------------------------------------------------------------------------

Joao Santos Silva

Join Date: Apr 2014

Posts: 3015
#5

10 Aug 2020, 03:08

Dear Conor Cotton,

Thanks for providing the additional information; just out of curiosity, can you also please post a histogram of your dependent variable?

Given what you say, my feeling is that none of the approaches you tried is ideal, and I certainly would not use the Tobit and zero inflated models. From what you say, I would consider either an ordered model or even a discrete choice model.

Best wishes,

Joao
Conor Cotton

Join Date: Jul 2020

Posts: 8
#6

10 Aug 2020, 19:47

Hi Joao,

Thanks again for your reply. Attached is a screenshot of the dependent variable graph.

When you say an ordered model, do you mean an ordered logit model? I assume this would require recoding the dependent variable values into 16 different groups (i.e., 0-15)?

Thanks,
Conor
Joao Santos Silva

Join Date: Apr 2014

Posts: 3015
#7

10 Aug 2020, 20:42

Yes, that was the idea. I do not know if you lose information by doing that because I do not know if the variable is cardinal or just ordinal.
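The recoding discussed here can be sketched in Python as a rank transformation of the 16 observed values. This is an illustration only; the names `values` and `ordinal_code` are hypothetical, and an ordered model would then be fitted to the recoded variable in whatever software is used.

```python
# The 16 distinct values of distvolatility listed in post #1
values = [
    0.0, 0.2959995, 0.8690004, 4.661, 11.673, 12.542, 14.874, 15.17,
    16.334, 17.203, 19.535, 19.831, 31.208, 31.504, 32.077, 32.373,
]

# Map each value to its rank (0 = no switch, 15 = largest distance).
# Only the ordering is kept; the spacing between values is discarded.
ordinal_code = {value: rank for rank, value in enumerate(sorted(values))}

for value in (0.0, 0.2959995, 0.8690004, 4.661, 32.373):
    print(value, "->", ordinal_code[value])
```

The mapping shows where information could be lost: the steps from .2959995 to .8690004 and from .8690004 to 4.661 both become a single step in the ordinal coding, which matters only if the variable is cardinal.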

Joao