  • Which regression model to use for zero-inflated distribution?

    Hello,

    I have a question regarding which type of regression model is right to use for a zero-inflated distribution.

    Some info about the data:

    - The dependent variable for one of my hypotheses is ‘distvolatility’ (shown in the table below).
    - Its distribution is heavily zero-inflated (1304 out of 1459 observations are 0) and positively skewed. These zeroes are real/true values (not censored/truncated).
    - Respondents with a non-zero value can take one of 15 possible ‘distvolatility’ scores, ranging from .2959995 to 32.373 (no values other than those shown below are possible).
    Code:
     tab distvolatility 
    
    distvolatil |
            ity |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              0 |      1,304       89.38       89.38
       .2959995 |         10        0.69       90.06
       .8690004 |         39        2.67       92.73
          4.661 |          7        0.48       93.21
         11.673 |          7        0.48       93.69
         12.542 |         12        0.82       94.52
         14.874 |         17        1.17       95.68
          15.17 |          4        0.27       95.96
         16.334 |          5        0.34       96.30
         17.203 |         11        0.75       97.05
         19.535 |          8        0.55       97.60
         19.831 |          3        0.21       97.81
         31.208 |          9        0.62       98.42
         31.504 |          3        0.21       98.63
         32.077 |         15        1.03       99.66
         32.373 |          5        0.34      100.00
    ------------+-----------------------------------
          Total |      1,459      100.00
    My question is which type of regression model is best to use for this type of zero-inflated distribution. I have done a lot of research and have seen a lot of different suggestions, although none seem to be completely correct for my data.

    - Zero-inflated Poisson/zero-inflated negative binomial - These both assume count data. Would it severely bias the results if I were to use one of these models (likely zinb, as the variance is much higher than the mean), given that my data are discrete but not counts?

    - Two-step generalised linear model - Another option is to model whether distvolatility is zero or non-zero with a binary logistic regression, and then fit a GLM to the non-zero values only.

    - Tobit regression - I have also seen this mentioned as an option for zero-inflated distributions, although it assumes the zeroes are censored, which is not the case here.


    The zero-inflated negative binomial seems to be the best option at the moment, but any advice would be greatly appreciated!
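
    A minimal sketch of the two-step option described above, assuming it is run as a do-file. `switched` is a hypothetical indicator that is not in the original data, the covariate names are those used in the model output posted later in this thread, and the gamma family with a log link is only one possible choice for the non-zero part:

    ```stata
    * Hypothetical indicator: 1 if the voter switched party, 0 otherwise
    generate byte switched = distvolatility > 0 if !missing(distvolatility)

    * Covariates as they appear in the model output in this thread
    local xvars intmedia socialmedia age_i female highincome highknow ///
        partyclose_binary leftright_i political_mood networkhet12_i nptvnews

    * Part 1: probability of any switch
    logit switched `xvars', vce(robust)

    * Part 2: size of the switch among switchers only
    * (gamma family and log link are assumptions, not a recommendation)
    glm distvolatility `xvars' if distvolatility > 0, family(gamma) link(log) vce(robust)
    ```

    The two parts answer different questions (whether a voter switches, and how far they move if they do), so their coefficients need not agree in sign or significance.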

  • #2
    Dear Conor Cotton,

    First of all, let's get one thing clear: your data has zeros, but there are no grounds to say that it is zero inflated. Zero inflation means that there are more zeros than would be natural under a given benchmark distribution, so there can be no zero inflation unless you specify a benchmark distribution. Unfortunately, many people say that data are zero inflated simply because they contain zeros, but that is misleading.
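
    To make the benchmark idea concrete, one could compare the observed share of zeros with the share implied by a chosen benchmark. The sketch below uses a Poisson distribution with the sample mean as the benchmark, which is an assumption made purely for illustration; a benchmark that conditions on covariates would give a different answer. The scalar names are hypothetical.

    ```stata
    * Observed share of zeros in the sample
    quietly count if distvolatility == 0
    scalar n_zero = r(N)
    quietly summarize distvolatility
    scalar share_obs = n_zero / r(N)

    * Share of zeros implied by a Poisson with the same mean
    * (assumed benchmark; P(Y = 0) = exp(-mean) under a Poisson)
    scalar share_pois = exp(-r(mean))

    display "Observed share of zeros:        " share_obs
    display "Poisson-implied share of zeros: " share_pois
    ```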

    Now for the substance of your question: you are right in saying that the Tobit and zero inflated models are unlikely to work in this context. To be able to help, I would like to understand more about your data: what does it represent and what do the numbers mean? Also, you say that you want to use a regression; I presume that you just want to estimate the conditional mean, right?

    Finally, although it will be far from perfect, I would suggest that you start by trying Poisson regression just to see what the results look like.
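
    A minimal sketch of that starting point, assuming the goal is the conditional mean E(y|x) = exp(xb) and using the covariate names from the output posted in this thread. Robust standard errors are requested because the dependent variable is not a count, so the Poisson variance assumption cannot be expected to hold:

    ```stata
    * Poisson regression with robust standard errors (run as a do-file)
    poisson distvolatility intmedia socialmedia age_i female highincome highknow ///
        partyclose_binary leftright_i political_mood networkhet12_i nptvnews, vce(robust)

    * Average marginal effects of the two variables of interest on the conditional mean
    margins, dydx(intmedia socialmedia)
    ```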

    Best wishes,

    Joao


    • #3
      Hello Joao,

      Thanks very much for your reply!

      Some more detailed information about the data:

      - My study focuses on electoral volatility - that is, shifts in voters' party preferences during an election campaign. The specific hypothesis for my question here relates to the ideological distance of voter party preference shifts during an election campaign.

      - The dependent variable 'distvolatility' measures the difference between the ideological left-right score of the party a voter said they planned to vote for pre-election and the ideological left-right score of the party they actually voted for in the election.

      - For example, a voter who said they planned to vote for the farthest-left party in the pre-election survey wave, but then switched to vote for the farthest-right party in the election itself, receives the maximum distance volatility score of 32.373.

      - Most voters in the sample (1304 out of 1459), however, did not switch, meaning that they receive a distvolatility value of 0.

      - The data are therefore positively skewed, with most observations at 0 and 15 possible distvolatility values for those voters who did switch party preference between the pre- and post-election waves. No values of distvolatility other than those shown above are possible.

      - The reason I believe the data to be zero-inflated is that another study in this area, using a similar dependent variable, stated that the data were "very skewed towards zero" and used a negative binomial model (see p.592 of https://ejpr.onlinelibrary.wiley.com...475-6765.12049, if it helps).

      - I have tested a variety of models (Poisson, negative binomial, zero-inflated Poisson, zero-inflated negative binomial, GLM...), which have produced differing results. The unusual distribution of distvolatility is the main reason I am unsure which model and results are most valid.

      Thanks again for your help,
      Conor


      • #4
        I have included the output of different model types below, if it helps. The independent variables of interest are intmedia and socialmedia.

        POISSON REGRESSION
        Code:
         note: you are responsible for interpretation of noncount dep. variable
        
        Iteration 0:   log pseudolikelihood = -4285.1848  
        Iteration 1:   log pseudolikelihood = -4273.3653  
        Iteration 2:   log pseudolikelihood =  -4273.308  
        Iteration 3:   log pseudolikelihood = -4273.3078  
        
        Poisson regression                              Number of obs     =      1,449
                                                        Wald chi2(11)     =      75.79
        Log pseudolikelihood = -4273.3078               Prob > chi2       =     0.0000
        
        -----------------------------------------------------------------------------------
                          |               Robust
           distvolatility |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
        ------------------+----------------------------------------------------------------
                 intmedia |  -.0228221     .10833    -0.21   0.833    -.2351451    .1895009
              socialmedia |  -.0749317   .1079076    -0.69   0.487    -.2864267    .1365633
                    age_i |  -.0149114   .0114957    -1.30   0.195    -.0374425    .0076197
                   female |  -.8162035   .3536464    -2.31   0.021    -1.509338   -.1230692
               highincome |  -.0323331   .2922719    -0.11   0.912    -.6051754    .5405093
                 highknow |   -.288478   .3224235    -0.89   0.371    -.9204164    .3434604
        partyclose_binary |  -3.736187   .6763477    -5.52   0.000    -5.061804    -2.41057
              leftright_i |  -.1947974   .0515355    -3.78   0.000    -.2958051   -.0937896
           political_mood |  -.0166545   .0110181    -1.51   0.131    -.0382496    .0049406
           networkhet12_i |   .0665276   .0632883     1.05   0.293    -.0575152    .1905705
                 nptvnews |   -.015744   .0474749    -0.33   0.740    -.1087931    .0773052
                    _cons |   1.879182    .932596     2.02   0.044     .0513273    3.707037
        -----------------------------------------------------------------------------------

        NEGATIVE BINOMIAL REGRESSION
        Code:
         note: you are responsible for interpretation of non-count dep. variable
        
        Fitting Poisson model:
        
        Iteration 0:   log pseudolikelihood = -4285.1848  
        Iteration 1:   log pseudolikelihood = -4273.3653  
        Iteration 2:   log pseudolikelihood =  -4273.308  
        Iteration 3:   log pseudolikelihood = -4273.3078  
        
        Fitting constant-only model:
        
        Iteration 0:   log pseudolikelihood = -1993.0908  
        Iteration 1:   log pseudolikelihood = -803.25524  
        Iteration 2:   log pseudolikelihood = -795.96579  
        Iteration 3:   log pseudolikelihood =  -795.7392  
        Iteration 4:   log pseudolikelihood = -795.73908  
        Iteration 5:   log pseudolikelihood = -795.73908  
        
        Fitting full model:
        
        Iteration 0:   log pseudolikelihood = -783.37498  
        Iteration 1:   log pseudolikelihood = -775.46255  
        Iteration 2:   log pseudolikelihood = -775.19466  
        Iteration 3:   log pseudolikelihood = -775.19343  
        Iteration 4:   log pseudolikelihood = -775.19343  
        
        Negative binomial regression                    Number of obs     =      1,449
                                                        Wald chi2(11)     =     113.83
        Dispersion           = mean                     Prob > chi2       =     0.0000
        Log pseudolikelihood = -775.19343               Pseudo R2         =     0.0258
        
        -----------------------------------------------------------------------------------
                          |               Robust
           distvolatility |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
        ------------------+----------------------------------------------------------------
                 intmedia |   .1255122   .1321453     0.95   0.342    -.1334878    .3845123
              socialmedia |  -.1072578   .1286068    -0.83   0.404    -.3593225    .1448069
                    age_i |  -.0256229   .0109925    -2.33   0.020    -.0471678   -.0040779
                   female |  -1.506849   .2982102    -5.05   0.000     -2.09133   -.9223676
               highincome |  -.2024665   .3138774    -0.65   0.519    -.8176549    .4127219
                 highknow |  -.3929619   .3987052    -0.99   0.324     -1.17441    .3884859
        partyclose_binary |  -4.201325   .5646032    -7.44   0.000    -5.307927   -3.094723
              leftright_i |  -.2738756   .0679834    -4.03   0.000    -.4071206   -.1406306
           political_mood |  -.0419311   .0126475    -3.32   0.001    -.0667197   -.0171425
           networkhet12_i |   .1365838   .0632954     2.16   0.031     .0125271    .2606405
                 nptvnews |  -.0993774   .0713641    -1.39   0.164    -.2392485    .0404937
                    _cons |   2.221691   .7427555     2.99   0.003     .7659175    3.677465
        ------------------+----------------------------------------------------------------
                 /lnalpha |   3.521293    .120265                      3.285577    3.757008
        ------------------+----------------------------------------------------------------
                    alpha |   33.82812    4.06834                      26.72441     42.8201
        -----------------------------------------------------------------------------------

        ZERO-INFLATED POISSON
        Code:
        Fitting constant-only model:
        
        Iteration 0:   log pseudolikelihood = -3904.5654  (not concave)
        Iteration 1:   log pseudolikelihood = -1655.8721  
        Iteration 2:   log pseudolikelihood = -1426.9471  
        Iteration 3:   log pseudolikelihood = -1310.2937  
        Iteration 4:   log pseudolikelihood = -1309.9727  
        Iteration 5:   log pseudolikelihood = -1309.9727  
        
        Fitting full model:
        
        Iteration 0:   log pseudolikelihood = -1309.9727  
        Iteration 1:   log pseudolikelihood = -1199.3679  
        Iteration 2:   log pseudolikelihood = -1190.7897  
        Iteration 3:   log pseudolikelihood = -1190.0327  
        Iteration 4:   log pseudolikelihood = -1190.0138  
        Iteration 5:   log pseudolikelihood = -1190.0138  
        
        Zero-inflated Poisson regression                Number of obs     =      1,449
                                                        Nonzero obs       =        148
                                                        Zero obs          =      1,301
        
        Inflation model      = logit                    Wald chi2(11)     =      26.23
        Log pseudolikelihood = -1190.014                Prob > chi2       =     0.0060
        
        -----------------------------------------------------------------------------------
                          |               Robust
           distvolatility |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
        ------------------+----------------------------------------------------------------
        distvolatility    |
                 intmedia |   .1457483   .0620504     2.35   0.019     .0241318    .2673647
              socialmedia |  -.0902421   .0615806    -1.47   0.143    -.2109379    .0304537
                    age_i |   .0088801   .0067993     1.31   0.192    -.0044463    .0222065
                   female |  -.3061134   .2075961    -1.47   0.140    -.7129942    .1007674
               highincome |   .0301107   .1847404     0.16   0.871    -.3319739    .3921952
                 highknow |  -.5226967   .2232216    -2.34   0.019    -.9602031   -.0851904
        partyclose_binary |  -2.342958   1.179573    -1.99   0.047    -4.654879    -.031037
              leftright_i |  -.0818301   .0426122    -1.92   0.055    -.1653484    .0016883
           political_mood |  -.0191852   .0096471    -1.99   0.047    -.0380932   -.0002773
           networkhet12_i |   .0478215   .0482742     0.99   0.322    -.0467942    .1424373
                 nptvnews |  -.0049585   .0452432    -0.11   0.913    -.0936335    .0837166
                    _cons |    1.78096   .6539628     2.72   0.006     .4992168    3.062704
        ------------------+----------------------------------------------------------------
        inflate           |
                   female |   .4496005   .2412327     1.86   0.062    -.0232069    .9224079
                    age_i |   .0281381   .0077531     3.63   0.000     .0129422    .0433339
        partyclose_binary |    1.43222   1.188294     1.21   0.228    -.8967947    3.761234
              leftright_i |   .1277518   .0454648     2.81   0.005     .0386424    .2168611
                    _cons |   .1349397    .398905     0.34   0.735    -.6468997     .916779
        -----------------------------------------------------------------------------------

        ZERO-INFLATED NEGATIVE BINOMIAL
        Code:
        Fitting constant-only model:
        
        Iteration 0:   log pseudolikelihood = -1510.8976  (not concave)
        Iteration 1:   log pseudolikelihood = -978.52393  (not concave)
        Iteration 2:   log pseudolikelihood = -826.25909  (not concave)
        Iteration 3:   log pseudolikelihood = -786.21785  
        Iteration 4:   log pseudolikelihood = -785.55506  
        Iteration 5:   log pseudolikelihood =  -772.3916  (not concave)
        Iteration 6:   log pseudolikelihood = -771.64816  (not concave)
        Iteration 7:   log pseudolikelihood = -771.19226  
        Iteration 8:   log pseudolikelihood = -769.78822  (backed up)
        Iteration 9:   log pseudolikelihood = -767.21199  
        Iteration 10:  log pseudolikelihood = -766.12564  
        Iteration 11:  log pseudolikelihood = -766.01186  
        Iteration 12:  log pseudolikelihood = -766.00875  
        Iteration 13:  log pseudolikelihood = -766.00874  
        
        Fitting full model:
        
        Iteration 0:   log pseudolikelihood = -766.00874  
        Iteration 1:   log pseudolikelihood = -765.25855  
        Iteration 2:   log pseudolikelihood = -761.59647  
        Iteration 3:   log pseudolikelihood = -760.83115  
        Iteration 4:   log pseudolikelihood = -760.82859  
        Iteration 5:   log pseudolikelihood = -760.82859  
        
        Zero-inflated negative binomial regression      Number of obs     =      1,449
                                                        Nonzero obs       =        148
                                                        Zero obs          =      1,301
        
        Inflation model      = logit                    Wald chi2(11)     =      27.46
        Log pseudolikelihood = -760.8286                Prob > chi2       =     0.0039
        
        -----------------------------------------------------------------------------------
                          |               Robust
           distvolatility |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
        ------------------+----------------------------------------------------------------
        distvolatility    |
                 intmedia |   .1294945   .1323249     0.98   0.328    -.1298575    .3888465
              socialmedia |  -.1075608   .1173758    -0.92   0.359     -.337613    .1224915
                    age_i |    .008255   .0090306     0.91   0.361    -.0094447    .0259547
                   female |   -.465647   .3159615    -1.47   0.141     -1.08492    .1536262
               highincome |  -.2211759   .2872466    -0.77   0.441    -.7841688     .341817
                 highknow |  -.5708536   .3631838    -1.57   0.116    -1.282681    .1409735
        partyclose_binary |  -1.474671   1.276883    -1.15   0.248    -3.977316    1.027973
              leftright_i |  -.2626629   .1262407    -2.08   0.037    -.5100902   -.0152356
           political_mood |  -.0343388   .0147464    -2.33   0.020    -.0632412   -.0054363
           networkhet12_i |   .0505506   .0644821     0.78   0.433     -.075832    .1769332
                 nptvnews |  -.0657494   .0805689    -0.82   0.414    -.2236615    .0921626
                    _cons |   .9160213   .7554607     1.21   0.225    -.5646544    2.396697
        ------------------+----------------------------------------------------------------
        inflate           |
                   female |   3.723743   3.270855     1.14   0.255    -2.687016     10.1345
                    age_i |   .1687383   .0742475     2.27   0.023     .0232158    .3142608
        partyclose_binary |   7.102586   5.893729     1.21   0.228    -4.448911    18.65408
              leftright_i |   .2254805   .3197976     0.71   0.481    -.4013114    .8522723
                    _cons |  -13.35607   6.778797    -1.97   0.049    -26.64227   -.0698728
        ------------------+----------------------------------------------------------------
                 /lnalpha |   3.233527   .2283771    14.16   0.000     2.785916    3.681137
        ------------------+----------------------------------------------------------------
                    alpha |   25.36896    5.79369                      16.21466    39.69151
        -----------------------------------------------------------------------------------

        TOBIT MODEL
        Code:
        Refining starting values:
        
        Grid node 0:   log likelihood = -1980.1584
        
        Fitting full model:
        
        Iteration 0:   log pseudolikelihood = -1980.1584  
        Iteration 1:   log pseudolikelihood = -1248.2141  
        Iteration 2:   log pseudolikelihood = -978.95445  
        Iteration 3:   log pseudolikelihood =  -833.2176  
        Iteration 4:   log pseudolikelihood = -788.53571  
        Iteration 5:   log pseudolikelihood = -786.99313  
        Iteration 6:   log pseudolikelihood = -786.96993  
        Iteration 7:   log pseudolikelihood = -786.96986  
        
        Tobit regression                                Number of obs     =      1,449
                                                           Uncensored     =        143
        Limits: lower = 0                                  Left-censored  =      1,301
                upper = 32.37                              Right-censored =          5
        
                                                        F(  11,   1438)   =       4.52
                                                        Prob > F          =     0.0000
        Log pseudolikelihood = -786.96986               Pseudo R2         =     0.0349
        
        ---------------------------------------------------------------------------------------
                              |               Robust
               distvolatility |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
        ----------------------+----------------------------------------------------------------
                     intmedia |  -1.347353   1.209848    -1.11   0.266    -3.720609    1.025902
                  socialmedia |   -.949716   1.177725    -0.81   0.420     -3.25996    1.360528
                        age_i |   -.383981   .1181594    -3.25   0.001    -.6157643   -.1521977
                       female |  -9.659573     3.4983    -2.76   0.006    -16.52189   -2.797255
                   highincome |  -1.329862   3.355703    -0.40   0.692    -7.912459    5.252736
                     highknow |  -.7423384    3.70838    -0.20   0.841    -8.016752    6.532075
            partyclose_binary |   -25.1309   8.518189    -2.95   0.003    -41.84031   -8.421494
                  leftright_i |  -2.235994   .6241807    -3.58   0.000    -3.460396   -1.011592
               political_mood |  -.1670533   .1262273    -1.32   0.186    -.4146626     .080556
               networkhet12_i |   .6267707   .6811763     0.92   0.358    -.7094349    1.962976
                     nptvnews |  -.2141815   .6432956    -0.33   0.739     -1.47608    1.047717
                        _cons |   -2.00573   10.18467    -0.20   0.844    -21.98413    17.97267
        ----------------------+----------------------------------------------------------------
         var(e.distvolatility)|   778.3652   130.3482                      560.4252    1081.058
        ---------------------------------------------------------------------------------------


        • #5
          Dear Conor Cotton,

          Thanks for providing the additional information; just out of curiosity, could you also post a histogram of your dependent variable?
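
          For reference, one way such a plot could be produced in Stata. The bin width of 1 is an arbitrary choice for illustration, and the second command leaves out the zeros only so that the non-zero values remain visible:

          ```stata
          * Histogram of the dependent variable, with frequencies on the vertical axis
          histogram distvolatility, width(1) start(0) frequency

          * Same plot restricted to voters who switched
          histogram distvolatility if distvolatility > 0, width(1) start(0) frequency
          ```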

          Given what you say, my feeling is that none of the approaches you tried is ideal, and I certainly would not use the Tobit or zero-inflated models. From what you say, I would consider either an ordered model or even a discrete choice model.

          Best wishes,

          Joao


          • #6
            [Attached image: Screenshot 2020-08-11 at 02.40.40.png]

            Hi Joao,

            Thanks again for your reply. Attached is a screenshot of the dependent variable graph.

            When you say an ordered model, do you mean an ordered logit model? I assume this would require recoding the dependent variable values into 16 different groups (i.e., 0-15)?

            Thanks,
            Conor


            • #7
              Yes, that was the idea. I do not know if you lose information by doing that because I do not know if the variable is cardinal or just ordinal.
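
              A minimal sketch of the recoding and the ordered model, assuming it is run as a do-file. `distcat` is a hypothetical variable name; `egen ... group()` numbers the 16 distinct values 1 to 16 in ascending order rather than 0 to 15, which makes no difference to an ordered logit because only the ordering is used:

              ```stata
              * Map the 16 distinct values of distvolatility to ordered categories 1-16
              egen distcat = group(distvolatility)

              * Check the mapping: min and max should coincide within each category
              tabstat distvolatility, by(distcat) statistics(min max n)

              * Ordered logit on the recoded outcome; covariates as in the models above
              ologit distcat intmedia socialmedia age_i female highincome highknow ///
                  partyclose_binary leftright_i political_mood networkhet12_i nptvnews, vce(robust)
              ```

              Several categories contain only a handful of observations, so some of the cutpoints may be estimated imprecisely.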

              Joao
