Can I use a zero inflated negative binomial regression?

camila haux

Join Date: Aug 2019

Posts: 6
#1

31 Aug 2019, 02:30

Hello everyone,

This is my first post, so please be kind and understanding if I don't meet the forum norms.

My regression command was: reg amount12 ib1.lng_origins c.pca_generaltrst1 ib6.religion controls
The outcome variable is amount12 (amount remitted in the past 12 months); lng_origins is language origin in South Africa (Sotho, Venda, Tsonga, etc.); and religion is religious affiliation (atheist, Christian, etc.).

The very first problem I have is that if I only look at the amount remitted by those who remit, I may have selection bias. I cannot use a Heckman selection model because my selection equation has no variable that is excluded from the second stage, so the exclusion restriction is not satisfied.

Then I talked to a professor, and he said that I should simply recode the missing values in the amount remitted to zeros, because those people are not remitting any amount. So I did that. I also recoded to zero the missing values of two other variables that I want to include as controls: (1) relationship to remittance receiver and (2) frequency of remittances. I figured that otherwise Stata only uses the observations that are non-missing, but to account for selection bias it has to use all the observations, right?

Now I can't use OLS because the error terms are not normally distributed and I have a lot of zeros, which is why I thought I might be able to use a zero-inflated negative binomial regression. In inflate() I would then plug in the specification of my logit regression (all variables and controls, without the outcome variable), which estimates whether a person remits or not:

zinb new_amount12 ib1.pop_lngorigins c.pca_generaltrst1 ib6.religion controls, inflate(ib1.pop_lngorigins c.pca_generaltrst1 ib6.religion other controls)

Unfortunately, the inflate equation does not give me the same or similar results as the logit regression that I already ran. Why is that? Can I still use the coefficients that I get for the amount remitted?

Please note that this is a master's thesis and that it does not have to be perfect (I would like it to be, but I am pretty much new to these models, so I think it is normal that it will not be perfect right away). Thank you so much for your help in advance!
Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#2

31 Aug 2019, 03:31

Multiple imputation should be taken into consideration when dealing with missing values. If I understood correctly, the DV conveys expenses, hence it is not a count variable. That being so, a glm model with gamma family and log link could be considered.
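As a minimal sketch of the gamma/log-link suggestion, assuming simulated data (the variables y and x below are hypothetical and not from the survey) and a strictly positive outcome:

```stata
* Simulated, strictly positive outcome with mean exp(0.5 + 0.3*x)
clear
set seed 42
set obs 500
gen x = rnormal()
gen y = rgamma(2, exp(0.5 + 0.3*x)/2)   // shape 2, scale chosen so the mean is exp(0.5 + 0.3*x)

* GLM with gamma family and log link; coefficients are on the log-mean scale
glm y x, family(gamma) link(log) vce(robust)
```

The simulated outcome here has no zeros, so this sketch only covers positive amounts.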

Best regards,

Marcos
Comment
Jeff Wooldridge

Join Date: Apr 2014

Posts: 2171
#3

31 Aug 2019, 06:30

Camila: Several things. First, it is a misconception that you cannot use a linear model estimated by OLS "because the errors are not normally distributed." Unless the sample size is small, estimation of a linear model by OLS is always a good starting point, with one important caveat: your Y variable must be the variable you actually want to explain. To me, this is the real issue.

If a zero is a true zero then your professor is correct, and this seems to be the case for remittances. Set it to zero. But do not treat a true missing value as a zero. At this point, if you drop truly missing data, how large is your sample size? How many zeros? How many different nonzero outcomes does Y take on?

My recommendation is to use a two-part model. If you answer the above questions, I can continue to respond.

JW
Comment
camila haux

Join Date: Aug 2019

Posts: 6
#4

31 Aug 2019, 07:51

Dear JW,

thank you so much for your help!

My total sample size is 28,464. amount12 has 1,887 non-zero observations (plus 4 observations that are zero), and 26,573 are missing. So I don't know which of those missing values are truly missing.

I do know that in the survey they asked 22,706 people if they have sent remittances of any kind in the past 12 months and then if they said yes (2,169), they asked how much they sent. So probably 278 observations (2169-1891=278) are truly missing and the rest I can recode to zero?
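A minimal sketch of this recoding, on simulated data. The variable remit_any (1 = said yes, 0 = said no, missing = not asked) is a hypothetical name for the screening question; only amount12 and new_amount12 are names used in this thread:

```stata
* Simulated stand-in for the survey; remit_any is a hypothetical variable name
clear
set seed 12345
set obs 1000
* about 20% not asked (missing); of those asked, about 10% say yes
gen byte remit_any = cond(runiform() < 0.2, ., runiform() < 0.1)
* an amount is reported only by some of those who said yes
gen double amount12 = round(exp(rnormal(8.5, 1))) if remit_any == 1 & runiform() < 0.85

* true zeros: asked and answered no
gen double new_amount12 = amount12
replace new_amount12 = 0 if remit_any == 0

* not asked, or said yes without reporting an amount: left as missing
count if missing(new_amount12)
tab remit_any, missing
```

Keeping the original amount12 untouched and recoding into a new variable makes the recoding easy to document and to reverse.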

And I have also seen papers where they used a tobit model, I am just so confused what to use now...

Thanks again, really appreciate any help!
Comment
Jeff Wooldridge

Join Date: Apr 2014

Posts: 2171
#5

31 Aug 2019, 12:55

Your plan for recoding the response variable makes sense to me. Be sure to document what you did in writing up your findings.

I would start with a linear regression and just include the zeros as they are. Then I would do Tobit. You need to be sure to compute the average marginal effects (which can be compared with OLS coefficients). Then Cragg's hurdle model using the -churdle linear- command, being sure to include all covariates in the selection and outcome equations.
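The first two steps of this sequence could look roughly as follows. This is only a sketch on simulated data: y, x1, x2, and latent are hypothetical names, and the lower limit of zero is an assumption that fits an amount variable with a pile-up at zero:

```stata
* Simulated corner-solution outcome: zero whenever the latent value is negative
clear
set seed 2019
set obs 2000
gen x1 = rnormal()
gen byte x2 = runiform() < 0.5
gen latent = -1 + 1.5*x1 + 0.8*x2 + rnormal(0, 2)
gen y = max(0, latent)

* 1) linear model by OLS, zeros included as they are
regress y x1 i.x2, vce(robust)

* 2) Tobit with the corner at zero
tobit y x1 i.x2, ll(0) vce(robust)

* average marginal effects on the mean of the observed y,
* which are the quantities to set beside the OLS coefficients
margins, dydx(*) predict(ystar(0,.))
```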

Jeff
Comment
camila haux

Join Date: Aug 2019

Posts: 6
#6

31 Aug 2019, 16:26

Thank you so much for your suggestions. I was really frustrated yesterday and panicking that I would not be able to figure out what is best! It is very helpful.

Just a few more questions:
1) For which model would you use the recoded amount12 variable? For the Tobit and Cragg's hurdle only?
2) For the Tobit, do I have to put any upper or lower limit in the command? Or is it going to be just tobit amount12 ib1.lng_origins c.pca_generaltrst1 ib6.religion controls? I have run this command before (without specifying any lower or upper limit), but then it showed me that there were no censored values. Shouldn't the zeros be censored values?
3) And what do you mean by being sure to include all covariates in the selection? What exactly is your concern what I may do incorrectly?

Sorry for asking so many questions, I just want to make sure that I understand everything correctly.

Thanks, Jeff!
Comment
Jeff Wooldridge

Join Date: Apr 2014

Posts: 2171
#7

31 Aug 2019, 21:30

No problem, Camila. Recode the variable the same for all approaches. Make sure a zero is, as best you can tell, a zero. If you’re not very sure, code it to missing. This seems like a fairly small fraction.

In the Tobit and churdle, specify ll(0). What I mean in the churdle is don’t leave some variables out of the selection equation. I see this done fairly often. You have to specify the variables in each part. If you leave a variable out it generally causes inconsistency.
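A sketch of the hurdle model with the same covariate list in both parts, again on simulated data (y, x1, x2, and d are hypothetical names):

```stata
* Simulated two-part outcome: a participation decision, then a positive amount
clear
set seed 2019
set obs 2000
gen x1 = rnormal()
gen byte x2 = runiform() < 0.5
gen byte d = (-0.5 + 0.7*x1 + 0.4*x2 + rnormal()) > 0
gen y = cond(d, max(0.01, 5 + 1.0*x1 + 0.5*x2 + rnormal()), 0)

* identical covariates in the outcome equation and in select()
churdle linear y x1 i.x2, select(x1 i.x2) ll(0)

* average marginal effects based on churdle's default prediction
margins, dydx(*)
```

Writing the covariate list once in a local macro and reusing it in both places is one way to avoid accidentally dropping a variable from select().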
Comment
camila haux

Join Date: Aug 2019

Posts: 6
#8

01 Sep 2019, 02:43

Fantastic! I applied your suggestions but now I encountered new problems.

While working on the Tobit postestimation, I tried to estimate the AMEs for the censored sample using this command:
margins, dydx(*) predict(ystar(0,.))
Stata tells me:
inconsistent estimation sample levels 1 and 2 of factor pop_lngorigins
Do you have any idea what this may mean?

When I use a plain margins, dydx(*) command, it takes abnormally long. I am still waiting for the results.

Regarding the churdle linear command, Stata tells me:
invalid selection model;
no observations

And if I try it with fewer variables, just to see whether one variable is the problem, it shows me:
initial values not feasible

Thanks so much for your advice so far, it is very helpful. And even if I just get the Tobit model to work, I can compare that to the OLS results. I think that is already very helpful and adds value!
Comment
Jeff Wooldridge

Join Date: Apr 2014

Posts: 2171
#9

01 Sep 2019, 09:03

Camila:

I probably have to see at least the output, if not also a data extract. Please show at least summary statistics of amount12 and the "selection" variable, along with the Tobit output. For churdle, are you sure selectvar is defined for all observations? It should be set to missing whenever amount12 is missing. It should be one when amount12 > 0 and zero if amount12 = 0.

If you just use dydx(*) with Tobit it should simply return the coefficients, so I'm puzzled by your finding. The command margins, dydx(*) predict(ystar(0,.)) is correct. Unfortunately, in my opinion, Stata has reversed the proper notation. ystar should refer to the underlying latent variable, but in the margins command it appears to mean y (the observed variable). I've checked this with "by hand" calculations.
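One way such a "by hand" check might be set up is sketched below on simulated data (y, x1, x2 are hypothetical names). It assumes Stata 15 or later, where tobit stores var(e.y) as the last element of e(b); in earlier versions that element is sigma itself and the sqrt() should be dropped. For a continuous regressor, the marginal effect on the mean of the observed y is the coefficient times Phi(xb/sigma):

```stata
clear
set seed 101
set obs 2000
gen x1 = rnormal()
gen x2 = rnormal()
gen y = max(0, -1 + 1.5*x1 + 0.8*x2 + rnormal(0, 2))

tobit y x1 x2, ll(0)

* by-hand AME of x1: coefficient times the sample average of Phi(xb/sigma)
matrix b = e(b)
scalar sig = sqrt(b[1, colsof(b)])   // assumption: last element of e(b) is var(e.y)
predict double xbhat, xb
gen double cdf_xb = normal(xbhat/scalar(sig))
summarize cdf_xb, meanonly
scalar ame_hand = _b[x1]*r(mean)
display "by-hand AME of x1 = " scalar(ame_hand)

margins, dydx(x1)                        // default xb: returns the coefficient
margins, dydx(x1) predict(ystar(0,.))    // mean of observed y, zeros included
margins, dydx(x1) predict(e(0,.))        // mean of y conditional on y > 0
margins, dydx(x1) predict(pr(0,.))       // probability that y > 0
```

If the by-hand number agrees with the ystar(0,.) line, that supports reading ystar(0,.) as the mean of the observed outcome.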
Comment
Jeff Wooldridge

Join Date: Apr 2014

Posts: 2171
#10

01 Sep 2019, 09:17

A couple more things. I just looked more closely at the churdle command, and I see that you don't put in a separate selection outcome. It is determined by the y variable. So it should be

Code:

churdle linear y x1 x2 ... xk, select(x1 x2 ... xk) ll(0)

When you use the tobit command, how many "censored" (unfortunate choice of word) observations does it report?
Comment

camila haux

Join Date: Aug 2019

Posts: 6
#11

01 Sep 2019, 11:04

Dear Jeff,

1) So I guess I must go through each control and independent variable and recode it to missing if amount12 is missing, recode it to zero if amount12=0, and leave the value as it is if amount12>0?

And this is the information you asked me for. I hope this helps?

This is the tobit output:

Code:

Refining starting values:

Grid node 0:   log likelihood =  -41906192

Fitting full model:

Iteration 0:   log pseudolikelihood =  -41906192  
Iteration 1:   log pseudolikelihood =  -34233897  
Iteration 2:   log pseudolikelihood =  -32270815  
Iteration 3:   log pseudolikelihood =  -31054566  
Iteration 4:   log pseudolikelihood =  -31021777  
Iteration 5:   log pseudolikelihood =  -31010035  
Iteration 6:   log pseudolikelihood =  -31009764  
Iteration 7:   log pseudolikelihood =  -31009734  
Iteration 8:   log pseudolikelihood =  -31009731  
Iteration 9:   log pseudolikelihood =  -31009731  
Iteration 10:  log pseudolikelihood =  -31009730  
Iteration 11:  log pseudolikelihood =  -31009730  
Iteration 12:  log pseudolikelihood =  -31009730  
Iteration 13:  log pseudolikelihood =  -31009730  
Iteration 14:  log pseudolikelihood =  -31009730  
Iteration 15:  log pseudolikelihood =  -31009730  

Tobit regression                                Number of obs     =      8,353
                                                   Uncensored     =      1,169
Limits: lower = 0                                  Left-censored  =      7,184
        upper = +inf                               Right-censored =          0

                                                F(  66,   8288)   =          .
                                                Prob > F          =          .
Log pseudolikelihood =  -31009730               Pseudo R2         =     0.1676

-------------------------------------------------------------------------------------------
                          |               Robust
             new_amount12 |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------------------+----------------------------------------------------------------
           pop_lngorigins |
West Germanic - coloured  |  -1745.101   3950.441    -0.44   0.659    -9488.953    5998.752
   West Germanic - black  |   20534.04   12802.14     1.60   0.109    -4561.352    45629.44
   West Germanic - other  |  -2832.269   4096.955    -0.69   0.489    -10863.33    5198.789
                   Nguni  |  -110.8592   3523.101    -0.03   0.975    -7017.018      6795.3
                   Sotho  |   328.1406    3542.09     0.09   0.926    -6615.241    7271.523
                   Venda  |   6955.608   4870.478     1.43   0.153    -2591.747    16502.96
                  Tsonga  |   2160.247   4687.433     0.46   0.645    -7028.295    11348.79
                          |
         pca_generaltrst1 |   180.8257   377.5668     0.48   0.632    -559.2998    920.9512
                          |
                 religion |
          1. No religion  |   3886.939    1872.01     2.08   0.038     217.3303    7556.548
            2. Christian  |   1589.692   1502.533     1.06   0.290    -1355.648    4535.032
               3. Jewish  |   183.1857   5191.352     0.04   0.972    -9993.163    10359.53
               4. Muslim  |   133.1594   6014.309     0.02   0.982    -11656.39    11922.71
                5. Hindu  |   85236.76   9503.989     8.97   0.000     66606.56      103867
                7. Other  |   21882.63    6133.22     3.57   0.000     9859.986    33905.28
                          |
                  brnprov |
         2. Eastern Cape  |  -2653.259   2568.617    -1.03   0.302    -7688.392    2381.874
        3. Northern Cape  |  -7056.392   2898.603    -2.43   0.015    -12738.38   -1374.405
           4. Free State  |    -6136.3    3197.62    -1.92   0.055    -12404.43    131.8354
        5. KwaZulu-Natal  |  -4319.346   2698.866    -1.60   0.110      -9609.8    971.1073
           6. North West  |  -1142.944   3206.951    -0.36   0.722    -7429.372    5143.483
              7. Gauteng  |  -261.0968   3063.668    -0.09   0.932    -6266.653    5744.459
           8. Mpumalanga  |  -3854.059    2888.41    -1.33   0.182    -9516.065    1807.947
              9. Limpopo  |  -7429.577   3192.173    -2.33   0.020    -13687.04   -1172.118
                          |
                 best_gen |
               2. Female  |  -1699.624   938.9882    -1.81   0.070    -3540.276     141.028
                          |
                  edu_lev |
      incomplete primary  |   5325.529   4031.587     1.32   0.187     -2577.39    13228.45
       primary completed  |  -5443.667    2396.17    -2.27   0.023    -10140.76   -746.5746
    incomplete secondary  |  -1236.809   1255.305    -0.99   0.325     -3697.52    1223.903
          lower tertiary  |   516.3127   1173.086     0.44   0.660    -1783.229    2815.855
         higher tertiary  |  -5258.348   5123.291    -1.03   0.305    -15301.28    4784.585
                   other  |   3607.203   3962.133     0.91   0.363    -4159.568    11373.97
                      25  |  -1803.203   2746.166    -0.66   0.511    -7186.375    3579.969
                          |
                empl_stat |
                employed  |    244.643   1654.367     0.15   0.882    -2998.331    3487.617
                          |
              best_marstt |
  2. Living with Partner  |   -846.198   1314.147    -0.64   0.520    -3422.256     1729.86
        3. Widow/Widower  |  -2935.213   2191.087    -1.34   0.180    -7230.293    1359.867
4. Divorced or Seperated  |  -2572.785   2863.814    -0.90   0.369    -8186.577    3041.007
        5. Never married  |   833.9811   1238.124     0.67   0.501    -1593.051    3261.014
                          |
            age_intervals |
                6. 20-24  |  -5369.367   4015.577    -1.34   0.181     -13240.9    2502.168
                7. 25-29  |  -4645.755   3936.643    -1.18   0.238    -12362.56     3071.05
                8. 30-34  |  -1777.543   4106.531    -0.43   0.665    -9827.371    6272.285
                9. 35-39  |  -617.2596   4122.129    -0.15   0.881    -8697.665    7463.145
               10. 40-44  |  -3296.241   4234.272    -0.78   0.436    -11596.47    5003.991
               11. 45-49  |  -1063.552   4551.222    -0.23   0.815    -9985.087    7857.983
               12. 50-54  |  -687.5889   5074.855    -0.14   0.892    -10635.57    9260.397
               13. 55-59  |   4087.078   4724.316     0.87   0.387    -5173.764    13347.92
               14. 60-64  |  -45.82361   5443.571    -0.01   0.993    -10716.58    10624.94
                          |
                   hhsize |  -862.4076   229.6415    -3.76   0.000    -1312.562   -412.2529
                  tot_ass |   .0010023   .0004453     2.25   0.024     .0001294    .0018753
                          |
         new_rel_receiver |
                       0  |  -434741.6     158323    -2.75   0.006    -745094.3   -124388.9
                       4  |  -15984.62    3104.15    -5.15   0.000    -22069.53   -9899.709
                       5  |  -20006.91   3409.704    -5.87   0.000    -26690.78   -13323.03
                       6  |  -24278.85   3561.181    -6.82   0.000    -31259.66   -17298.05
                       8  |  -15933.78   3026.305    -5.27   0.000     -21866.1   -10001.47
                       9  |  -13713.71   4298.144    -3.19   0.001    -22139.15    -5288.27
                      12  |  -18911.23   2979.554    -6.35   0.000    -24751.91   -13070.56
                      13  |  -19981.52   14545.46    -1.37   0.170    -48494.27    8531.226
                      14  |  -17180.63   3892.493    -4.41   0.000    -24810.89    -9550.37
                      15  |  -25784.32   3970.853    -6.49   0.000    -33568.18   -18000.45
                      16  |  -16750.65   5063.353    -3.31   0.001    -26676.09   -6825.206
                      17  |  -19335.63   8281.317    -2.33   0.020    -35569.09    -3102.18
                      18  |  -16162.21   3323.644    -4.86   0.000    -22677.38   -9647.037
                      19  |  -16157.45   4917.914    -3.29   0.001    -25797.79   -6517.104
                      20  |  -23051.83   3519.221    -6.55   0.000    -29950.38   -16153.28
                      21  |  -12692.27   3236.184    -3.92   0.000       -19036   -6348.539
                      25  |   -10808.1   6086.545    -1.78   0.076    -22739.25    1123.054
                      26  |  -21748.88   3608.557    -6.03   0.000    -28822.55    -14675.2
                      30  |  -12915.42   3478.922    -3.71   0.000    -19734.97   -6095.859
                          |
                new_frq12 |    187.705    103.522     1.81   0.070     -15.2241    390.6341
         new_inkind12_frq |  -254.8363   112.8775    -2.26   0.024    -476.1045    -33.5681
                    _cons |   30518.05   6701.564     4.55   0.000     17381.31    43654.79
--------------------------+----------------------------------------------------------------
       var(e.new_amount12)|   1.28e+08   1.76e+07                      9.79e+07    1.68e+08
-------------------------------------------------------------------------------------------




sum new_amount12, detail

      f3_7_1 - Total amount of remittance in money sent
                     in past 12 months:1
-------------------------------------------------------------
      Percentiles      Smallest
 1%          200              0
 5%          600              0
10%         1000              0       Obs               1,891
25%         2800              0       Sum of Wgt.       1,891
50%         6000                      Mean           10043.85
                        Largest       Std. Dev.      13972.29
75%        12000         150000
90%        24000         150000       Variance       1.95e+08
95%        30000         180000       Skewness       6.083797
99%        60000         240000       Kurtosis       69.18961


                      pca_generaltrst1
-------------------------------------------------------------
      Percentiles      Smallest
 1%    -3.277674      -3.277674
 5%    -2.927758      -3.277674
10%    -2.463739      -3.277674       Obs              22,458
25%    -1.468497      -3.277674       Sum of Wgt.      22,458

50%    -.6174005                      Mean          -.6743963
                        Largest       Std. Dev.      1.260248
75%     .2374917       2.521248
90%     .8311156       2.521248       Variance       1.588224
95%     1.190528       2.521248       Skewness      -.1316909
99%     2.255755       2.521248       Kurtosis       2.577096

          m8 - Religious affiliation of respondent
-------------------------------------------------------------
      Percentiles      Smallest
 1%            1              1
 5%            1              1
10%            2              1       Obs              22,659
25%            2              1       Sum of Wgt.      22,659

50%            2                      Mean           2.305486
                        Largest       Std. Dev.      1.252439
75%            2              7
90%            4              7       Variance       1.568604
95%            6              7       Skewness       2.408399
99%            6              7       Kurtosis       7.722633

             b11_3 - Province respondent born in
-------------------------------------------------------------
      Percentiles      Smallest
 1%            1              1
 5%            1              1
10%            2              1       Obs               9,121
25%            3              1       Sum of Wgt.       9,121

50%            5                      Mean           5.011622
                        Largest       Std. Dev.      2.356839
75%            7              9
90%            9              9       Variance       5.554689
95%            9              9       Skewness       .0462005
99%            9              9       Kurtosis       2.091913

                         Best gender
-------------------------------------------------------------
      Percentiles      Smallest
 1%            1              1
 5%            1              1
10%            1              1       Obs              28,445
25%            1              1       Sum of Wgt.      28,445

50%            2                      Mean           1.547864
                        Largest       Std. Dev.      .4977125
75%            2              2
90%            2              2       Variance       .2477177
95%            2              2       Skewness      -.1923405
99%            2              2       Kurtosis       1.036995

                           edu_lev
-------------------------------------------------------------
      Percentiles      Smallest
 1%            1              1
 5%            1              1
10%            2              1       Obs              25,044
25%            3              1       Sum of Wgt.      25,044

50%            3                      Mean           4.281025
                        Largest       Std. Dev.      4.466597
75%            4             25
90%            5             25       Variance       19.95049
95%            6             25       Skewness       4.098736
99%           25             25       Kurtosis       19.27936

               Employment status - Adult only
-------------------------------------------------------------
      Percentiles      Smallest
 1%            0              0
 5%            0              0
10%            0              0       Obs              22,721
25%            0              0       Sum of Wgt.      22,721

50%            0                      Mean           .4150786
                        Largest       Std. Dev.      .4927464
75%            1              1
90%            1              1       Variance        .242799
95%            1              1       Skewness       .3446938
99%            1              1       Kurtosis       1.118814

                     Best marital status
-------------------------------------------------------------
      Percentiles      Smallest
 1%            1              1
 5%            1              1
10%            1              1       Obs              23,957
25%            2              1       Sum of Wgt.      23,957

50%            5                      Mean           3.689819
                        Largest       Std. Dev.      1.746261
75%            5              5
90%            5              5       Variance       3.049426
95%            5              5       Skewness      -.6991092
99%            5              5       Kurtosis        1.62883

                        Age Intervals
-------------------------------------------------------------
      Percentiles      Smallest
 1%            5              5
 5%            5              5
10%            5              5       Obs              28,464
25%            6              5       Sum of Wgt.      28,464

50%            8                      Mean           8.463673
                        Largest       Std. Dev.      2.741433
75%           11             14
90%           13             14       Variance       7.515456
95%           14             14       Skewness       .4851433
99%           14             14       Kurtosis       2.083575

                Number of household residents
-------------------------------------------------------------
      Percentiles      Smallest
 1%            1              1
 5%            1              1
10%            2              1       Obs              23,900
25%            3              1       Sum of Wgt.      23,900

50%            5                      Mean           5.261339
                        Largest       Std. Dev.      3.355386
75%            7             30
90%           10             30       Variance       11.25862
95%           11             30       Skewness       1.538315
99%           16             30       Kurtosis       7.322676

                        Total Assets
-------------------------------------------------------------
      Percentiles      Smallest
 1%         2500            401
 5%     7563.433            401
10%        14000            401       Obs              22,104
25%      39726.6            500       Sum of Wgt.      22,104

50%     103280.7                      Mean           586326.6
                        Largest       Std. Dev.       5004999
75%       304500       2.23e+08
90%       940700       2.23e+08       Variance       2.51e+13
95%      1818718       3.50e+08       Skewness       48.26468
99%      7372000       3.50e+08       Kurtosis       2877.197

                      new_rel_receiver
-------------------------------------------------------------
      Percentiles      Smallest
 1%            0              0
 5%            0              0
10%            0              0       Obs              28,461
25%            0              0       Sum of Wgt.      28,461

50%            0                      Mean           .7711957
                        Largest       Std. Dev.      3.476242
75%            0             30
90%            0             30       Variance       12.08426
95%            4             30       Skewness       5.998841
99%           25             30       Kurtosis       43.04409

                          new_frq12
-------------------------------------------------------------
      Percentiles      Smallest
 1%            0              0
 5%            0              0
10%            0              0       Obs              28,426
25%            0              0       Sum of Wgt.      28,426

50%            0                      Mean           .6865897
                        Largest       Std. Dev.      3.737786
75%            0            200
90%            0            200       Variance       13.97104
95%            7            200       Skewness       35.34731
99%           12            300       Kurtosis       2303.839

                      new_inkind12_frq
-------------------------------------------------------------
      Percentiles      Smallest
 1%            0              0
 5%            0              0
10%            0              0       Obs              28,430
25%            0              0       Sum of Wgt.      28,430

50%            0                      Mean           .1471333
                        Largest       Std. Dev.       1.25058
75%            0             15
90%            0             20       Variance        1.56395
95%            0             48       Skewness       13.21677
99%            6             60       Kurtosis       327.3163

I hope you have a splendid Sunday evening! Thank you for being so helpful.
