
  • non-count outcomes and use of Poisson regression

    I have data from a study assessing spores/g from 3 different sources (bee brood, brood honey, and super honey) from 10-20 beehives within 2 bee yards per bee producer. One of the bee yards per producer was clinically affected with disease X and the other was clinically unaffected. The goal is to compare the affected and unaffected yards to see if the outcome (spores/g) differs. I have chosen to focus this post on just one of the outcome sources -- bee brood -- for simplicity's sake.

    Many of the outcome values are zeros (true zeros, not just a matter of detection limits), and the data look skewed toward zero. However, these are not counts (integer values), as there are decimal places. I think I should be using Poisson, but there is no offset as the outcome is already spores/g. Can I still use Poisson regression? Or do I need to use negative binomial or even zero-inflated models?

    The following are the median values for spores/g from bee brood for all hives in each yard (where 0 is clinically not affected and 1 is clinically affected yards). The IQR is listed below the median for the bee yards.

    table affected beeid, contents(median broodbees iqr broodbees)

    ----------------------------------
             |          BEEID
    affected |      1       2       3
    ---------+------------------------
           0 |      0      .4     .45
             |      0     5.4    36.2
             |
           1 |    1.3      .4    3.05
             |    1.9    1.15    6.15
    ----------------------------------

    If I do use a Poisson regression (accounting for bee producer id as a fixed factor), I get a note about interpretation as the outcomes are not counts (which I am worried about).

    . poisson broodbees affected ib(first).beeid
    note: you are responsible for interpretation of noncount dep. variable

    Iteration 0:   log likelihood = -5499716.6
    Iteration 1:   log likelihood = -5446501.9
    Iteration 2:   log likelihood =   -5440771
    Iteration 3:   log likelihood = -5439890.3
    Iteration 4:   log likelihood = -5439723.5
    Iteration 5:   log likelihood = -5439712.9
    Iteration 6:   log likelihood = -5439712.8
    Iteration 7:   log likelihood = -5439712.8

    Poisson regression                              Number of obs   =         90
                                                    LR chi2(3)      = 2051748.44
                                                    Prob > chi2     =     0.0000
    Log likelihood = -5439712.8                     Pseudo R2       =     0.1587

    ------------------------------------------------------------------------------
       broodbees |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
        affected |   1.083511   .0021099   513.53   0.000     1.079375    1.087646
                 |
           beeid |
              2  |  -10.25816   .1517964   -67.58   0.000    -10.55568   -9.960647
              3  |  -.7625571   .0015938  -478.44   0.000     -.765681   -.7594332
                 |
           _cons |   9.793063   .0020161  4857.48   0.000     9.789112    9.797014
    ------------------------------------------------------------------------------

    Any advice would be greatly appreciated.

  • #2
    Tasha: This is a good application of Poisson regression without a count. All you have to do is assume the exponential mean function is a reasonable approximation. However, you must use the vce(robust) option to get valid standard errors. The ones you have computed assume the variance and the mean are the same, and that's almost certainly false.

    I actually wish Stata would drop that message. It would be like giving a message when using OLS: Caution, your dependent variable does not take on negative values. It's not relevant provided you use robust standard errors.

    As a robustness check, you can try Tobit estimation and then use margins to compute the semi-elasticities. These would be comparable to the Poisson coefficients.
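    In Stata terms, the suggestion amounts to something like the following (a sketch using the variable names from #1):

    Code:
    * Poisson quasi-MLE: the coefficients estimate the exponential mean
    * even though broodbees is not a count; vce(robust) gives sandwich
    * standard errors that do not assume variance = mean
    poisson broodbees affected ib(first).beeid, vce(robust)

    * robustness check: Tobit with a corner at zero, then average
    * semi-elasticities via margins, comparable to Poisson coefficients
    tobit broodbees affected ib(first).beeid, ll(0)
    margins, eydx(*) predict(ystar(0,.))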

    JW



    • #3
      Without meaning to suggest that Professor Wooldridge's textbook does not address the issue well -- the appropriateness of Poisson regression in this context is discussed thoroughly in Chapter 18 of Econometric Analysis of Cross Section and Panel Data, and for a deep understanding of the issue that is probably the right reading.

      Yet I found this Stata Blog post a very accessible and easy-to-read treatment of using Poisson regression for non-negative outcomes:

      https://blog.stata.com/2011/08/22/us...tell-a-friend/

      and, in a bit more detail, this reading:

      https://www.stata.com/meeting/boston...10_nichols.pdf

      Professor Jeff Wooldridge, if you have time, can you please elaborate on your closing statement that Poisson regression is roughly comparable to a Tobit, and suggest some readings on the topic?



      • #4
        What about the issue of overdispersion in the Poisson regression, even with the vce(robust) option? When I run the goodness-of-fit test, it seems to suggest problems with overdispersion:

        . estat gof

        Deviance goodness-of-fit = 1.09e+07
        Prob > chi2(86) = 0.0000

        Pearson goodness-of-fit = 3.80e+07
        Prob > chi2(86) = 0.0000


        Stata suggests that under these circumstances one should run nbreg:

        . nbreg broodbees affected ib(first).beeid, vce(robust)
        note: you are responsible for interpretation of non-count dep. variable

        Negative binomial regression                    Number of obs   =         90
                                                        Wald chi2(3)    =     267.51
        Dispersion           = mean                     Prob > chi2     =     0.0000
        Log pseudolikelihood = -307.93677               Pseudo R2       =     0.0562

        ------------------------------------------------------------------------------
                     |               Robust
           broodbees |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
        -------------+----------------------------------------------------------------
            affected |  -.0147325   .7258476    -0.02   0.984    -1.437368    1.407903
                     |
               beeid |
                  2  |  -10.26628   1.103188    -9.31   0.000    -12.42849   -8.104072
                  3  |  -.7692339    1.20164    -0.64   0.522    -3.124404    1.585937
                     |
               _cons |   10.64216   1.245637     8.54   0.000     8.200759    13.08357
        -------------+----------------------------------------------------------------
            /lnalpha |   2.506893   .1015553                      2.307849    2.705938
        -------------+----------------------------------------------------------------
               alpha |   12.26676   1.245754                      10.05278    14.96835
        ------------------------------------------------------------------------------

        Thanks



        • #5
          Well no, if only because -- but not only because -- this is judging Poisson regression by a criterion irrelevant in your case. For example, variance/mean is a quantity without units for counts, but that is not true for measurements. The variance/mean ratio for measurements can't be compared meaningfully with 1 as a reference standard, because its units are arbitrary.

          Part of the problem here is names. Poisson was a big fish in 19th century mathematics and physics, but he didn't discover the Poisson distribution -- let alone Poisson regression. The latter name does make sense in so far as Poisson distribution is a reference case for regression with this functional form for a counted response. We might be better off if people just talked about log-linear regression except that the term log-linear models was hijacked some 50 years ago for use in categorical data analysis.

          There is important small print, but anyone who has soaked up the culture of generalized linear models just thinks of Poisson regression as one flavour of generalized linear models with a logarithmic link function.

          The big deal here is the functional form Y = exp(Xb), not any assumptions or ideal conditions about conditional distributions. That doesn't mean that the latter are irrelevant, just that robust standard errors are the tool of choice.
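          For anyone who wants to see the generalized linear models view in Stata directly, poisson and glm with a Poisson family and log link produce identical point estimates (a sketch using the variable names from #1):

          Code:
          * same exponential-mean fit by two routes
          poisson broodbees affected ib(first).beeid, vce(robust)
          glm broodbees affected ib(first).beeid, family(poisson) link(log) vce(robust)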



          • #6
            Joro: I should train myself not to use imprecise language on this site. Here is what I meant to say. With Poisson regression, we are simply modeling E(y|x) = exp(x*b), and using an appealing objective function to do so. The coefficients, bj, are semi-elasticities (or elasticities if xj is itself a natural log). The Tobit model -- as with any model that implies a complete distribution -- implies a form for E(y|x). It's more complicated than exp(x*b) but still not too difficult to work with. Therefore, we can compare average partial effects -- whether in level or percentage-change form -- from the two models. Saying that the Tobit semi-elasticities "would be comparable" to those from Poisson regression -- which are just the coefficients -- is a bit imprecise. Those are the things about the models that we can compare. Whether they're similar is a different matter, but I've done it a few times and tend to get similar answers.
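            To spell out the semi-elasticity interpretation: with E(y|x) = exp(x*b), we have

            dlog E(y|x) / dxj = bj,

            so a one-unit change in xj changes E(y|x) by roughly 100*bj percent; and if xj = log(zj), then bj is the elasticity of E(y|x) with respect to zj.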

            The general point is that we can always compare apples with apples across different models. Tasha's concerns result from comparing apples with oranges. As Nick helpfully pointed out, testing the Poisson distributional assumption is wrongheaded because we don't care whether the Poisson distribution holds. We don't care if there is overdispersion. It's a bit discouraging that, in 2020, students are still being taught incorrectly about the merits of Poisson versus negative binomial. The NB regression is not even consistent in Tasha's case because she's applying it to a non-count, and we don't know any robustness properties of NB when the distribution is not negative binomial. By contrast, the Poisson estimates are fully robust for the mean parameters.

            Another reminder that I still need to revise my MIT Press book to make these points more forcefully ....



            • #7
              Dear Jeff Wooldridge,

              Thank you for these clarifying comments. As you can imagine, I could not agree more with what you say about the merits of Poisson regression and about how these topics are taught.

              I have, however, some apprehension about using the Tobit for this kind of data. I understand that the form of E(y|x) that is implicit in the Tobit would be suitable for corner-solutions data, but the first-order conditions of the estimator do not depend on E(y|x) directly, rather on things such as Pr(y=0|x) and the variance of the errors, which are estimated under very strong assumptions. These quantities are then combined to obtain an estimate of E(y|x), and I find it difficult to understand how the Tobit can deliver reasonable estimates of E(y|x) unless its very restrictive assumptions are met, but maybe I am missing something. Of course, this contrasts sharply with what happens with Poisson regression, because in that case the first-order conditions depend directly on the conditional expectation.
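              For concreteness, the conditional mean implied by the Tobit with a corner at zero is

              E(y|x) = Phi(x*b/sigma)*x*b + sigma*phi(x*b/sigma),

              where Phi and phi are the standard normal cdf and density: the first term combines Pr(y>0|x) with the linear index, and both pieces are estimated under normality and homoskedasticity of the latent error.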

              Best wishes and many thanks,

              Joao



              • #8
                Hi Joao. Thanks for your comments. A couple of reactions. First, every model is wrong so we have to use statistics to compare the performance. I agree that one might hesitate to use a Tobit model because, nominally, it imposes a lot of assumptions. But that's being too literal. Who says the Tobit mean -- which adds one more parameter, the standard deviation of the latent error -- can't fit better than an exponential mean? The shapes of E(y|x) are similar. Neither Poisson regression nor Tobit chooses parameter estimates to minimize the sum of squared residuals, although I agree that Poisson regression uses moment conditions directly for the mean. As you say, Tobit does not.

                So let's agree the Tobit model is misspecified. How should we evaluate it compared with Poisson regression? Really only one way: how well do they fit E(y|x)? We can compute R-squareds based on the estimated mean functions. I just grabbed a data set on charitable contributions (charity.dta, for those interested -- it comes with my intro book), which has a weird distribution because of focal points at 10 dollars, 25 dollars, and so on. We'd have to work hard to fit that distribution with a standard parametric model.

                But how do the methods do at estimating the mean? By two measures of R-squared, the Tobit fits better -- and it's not even trying to fit the mean. For the SSR-based measure, the Tobit R-squared is .256 and the Poisson R-squared is .209. The difference is not trivial. And by the way, the linear regression actually fits better than Poisson regression, so Poisson is in last place in this example. As you know, I'm a big fan of exponential means estimated by Poisson regression, but it's not always the best fit.

                The semi-elasticities give the same general story but there are some notable differences in magnitude. I'd be hard pressed to choose those generated by the Poisson regression given it's the worst fit.

                A long time ago I started to write a paper that began, "Suppose E(y|x) follows the mean function from a Tobit model," and then estimated the parameters by, say, Poisson regression or weighted nonlinear least squares, without any distributional assumptions. There is a way to justify it, actually. In the end, I think I gave it as a problem in my second-year PhD course.

                Code:
                . use charity, clear
                
                . set more off
                
                . 
                . tab gift
                
                  amount of |
                gift, Dutch |
                   guilders |      Freq.     Percent        Cum.
                ------------+-----------------------------------
                          0 |      2,561       60.00       60.00
                          2 |         25        0.59       60.59
                          3 |          6        0.14       60.73
                          4 |          1        0.02       60.75
                          5 |        158        3.70       64.46
                          7 |         14        0.33       64.78
                          8 |          1        0.02       64.81
                         10 |        702       16.45       81.26
                         12 |          1        0.02       81.28
                         15 |        152        3.56       84.84
                         20 |         86        2.01       86.86
                         22 |          2        0.05       86.90
                         24 |          1        0.02       86.93
                         25 |        387        9.07       95.99
                         30 |         36        0.84       96.84
                         35 |          7        0.16       97.00
                         40 |          4        0.09       97.09
                         50 |         86        2.01       99.11
                         55 |          1        0.02       99.13
                         60 |          1        0.02       99.16
                         75 |          3        0.07       99.23
                         90 |          1        0.02       99.25
                         95 |          1        0.02       99.27
                        100 |         25        0.59       99.86
                        120 |          1        0.02       99.88
                        150 |          1        0.02       99.91
                        200 |          1        0.02       99.93
                        250 |          3        0.07      100.00
                ------------+-----------------------------------
                      Total |      4,268      100.00
                
                . 
                . reg gift resplast weekslast mailsyear propresp lavggift, robust
                
                Linear regression                               Number of obs     =      4,268
                                                                F(5, 4262)        =     118.81
                                                                Prob > F          =     0.0000
                                                                R-squared         =     0.2335
                                                                Root MSE          =     13.195
                
                ------------------------------------------------------------------------------
                             |               Robust
                        gift |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                -------------+----------------------------------------------------------------
                    resplast |   .8323287   .6714167     1.24   0.215    -.4839976    2.148655
                   weekslast |   -.017745   .0056432    -3.14   0.002    -.0288086   -.0066813
                   mailsyear |   .3447685   .3962544     0.87   0.384    -.4320965    1.121633
                    propresp |   13.20904   1.196681    11.04   0.000     10.86292    15.55516
                    lavggift |    8.87639   .7987143    11.11   0.000     7.310494    10.44229
                       _cons |  -21.87232   1.918434   -11.40   0.000    -25.63345   -18.11119
                ------------------------------------------------------------------------------
                
                . scalar sst = e(mss) + e(rss)
                
                . 
                . tobit gift i.resplast weekslast mailsyear propresp lavggift, ll(0)
                
                Refining starting values:
                
                Grid node 0:   log likelihood =  -10590.01
                
                Fitting full model:
                
                Iteration 0:   log likelihood =  -10590.01  
                Iteration 1:   log likelihood = -9422.2419  
                Iteration 2:   log likelihood = -9120.7983  
                Iteration 3:   log likelihood = -9115.3634  
                Iteration 4:   log likelihood = -9115.3322  
                Iteration 5:   log likelihood = -9115.3322  
                
                Tobit regression                                Number of obs     =      4,268
                                                                   Uncensored     =      1,707
                Limits: lower = 0                                  Left-censored  =      2,561
                        upper = +inf                               Right-censored =          0
                
                                                                LR chi2(5)        =    1110.76
                                                                Prob > chi2       =     0.0000
                Log likelihood = -9115.3322                     Pseudo R2         =     0.0574
                
                ------------------------------------------------------------------------------
                        gift |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                -------------+----------------------------------------------------------------
                  1.resplast |   .6063509   1.223622     0.50   0.620    -1.792586    3.005287
                   weekslast |  -.1277379   .0160406    -7.96   0.000    -.1591857     -.09629
                   mailsyear |   1.543729   .6828994     2.26   0.024     .2048911    2.882568
                    propresp |    34.6707   2.413806    14.36   0.000     29.93838    39.40302
                    lavggift |   13.10639   .6775495    19.34   0.000     11.77804    14.43474
                       _cons |  -55.06357   2.947979   -18.68   0.000    -60.84314     -49.284
                -------------+----------------------------------------------------------------
                  var(e.gift)|   592.1187   22.37586                       549.836     637.653
                ------------------------------------------------------------------------------
                
                . scalar sigmah = sqrt(_b[/var(e.gift)])
                
                . margins, eydx(*) predict(ystar(0,.))
                
                Average marginal effects                        Number of obs     =      4,268
                Model VCE    : OIM
                
                Expression   : E(gift*|gift>0), predict(ystar(0,.))
                ey/dx w.r.t. : 1.resplast weekslast mailsyear propresp lavggift
                
                ------------------------------------------------------------------------------
                             |            Delta-method
                             |      ey/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
                -------------+----------------------------------------------------------------
                  1.resplast |   .0372591   .0750136     0.50   0.619    -.1097648    .1842831
                   weekslast |  -.0078663   .0009979    -7.88   0.000    -.0098221   -.0059104
                   mailsyear |   .0950652   .0420408     2.26   0.024     .0126667    .1774636
                    propresp |   2.135073   .1504902    14.19   0.000     1.840118    2.430029
                    lavggift |   .8071112   .0449785    17.94   0.000     .7189549    .8952674
                ------------------------------------------------------------------------------
                Note: ey/dx for factor levels is the discrete change from the base level.
                
                . 
                . predict xbh, xb
                
                . gen gifth_t = normal(xbh/sigmah)*xbh + sigmah*normalden(xbh/sigmah)
                
                . gen rh = gift - gifth_t
                
                . gen rhsq = rh^2
                
                . sum rhsq
                
                    Variable |        Obs        Mean    Std. Dev.       Min        Max
                -------------+---------------------------------------------------------
                        rhsq |      4,268    168.7765    1420.425   .0000315   48449.32
                
                . scalar ssr_t = r(N)*r(mean)
                
                . di 1 - ssr_t/sst
                .25592575
                
                . 
                . poisson gift i.resplast weekslast mailsyear propresp lavggift, vce(robust)
                
                Iteration 0:   log pseudolikelihood =   -29295.7  
                Iteration 1:   log pseudolikelihood = -28878.956  
                Iteration 2:   log pseudolikelihood = -28877.948  
                Iteration 3:   log pseudolikelihood = -28877.948  
                
                Poisson regression                              Number of obs     =      4,268
                                                                Wald chi2(5)      =    1184.35
                                                                Prob > chi2       =     0.0000
                Log pseudolikelihood = -28877.948               Pseudo R2         =     0.3245
                
                ------------------------------------------------------------------------------
                             |               Robust
                        gift |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
                -------------+----------------------------------------------------------------
                  1.resplast |  -.0328039   .0766209    -0.43   0.669    -.1829782    .1173704
                   weekslast |  -.0085732   .0014008    -6.12   0.000    -.0113187   -.0058277
                   mailsyear |   .0866153   .0377713     2.29   0.022     .0125849    .1606458
                    propresp |    1.43388   .1258484    11.39   0.000     1.187221    1.680538
                    lavggift |   .9080363   .0699659    12.98   0.000     .7709056    1.045167
                       _cons |  -1.119548   .2536891    -4.41   0.000    -1.616769   -.6223265
                ------------------------------------------------------------------------------
                
                . predict gifth_p
                (option n assumed; predicted number of events)
                
                . gen uh = gift - gifth_p
                
                . gen uhsq = uh^2
                
                . sum uhsq
                
                    Variable |        Obs        Mean    Std. Dev.       Min        Max
                -------------+---------------------------------------------------------
                        uhsq |      4,268    179.5211    2349.589   .0000164   139854.4
                
                . scalar ssr_p = r(N)*r(mean)
                
                . di 1 - ssr_p/sst
                .20855682
                
                . corr gift gifth_t gifth_p
                (obs=4,268)
                
                             |     gift  gifth_t  gifth_p
                -------------+---------------------------
                        gift |   1.0000
                     gifth_t |   0.5120   1.0000
                     gifth_p |   0.4895   0.7923   1.0000



                • #9
                  Dear Jeff Wooldridge,

                  Thank you so much for taking the time to reply and for the additional clarifying comments. Obviously I agree that all models are misspecified and that there is absolutely no reason to assume that the exponential model estimated by Poisson regression will always provide the best fit.

                  The example you provide illustrates that the Tobit (when properly interpreted) can indeed fit better than Poisson regression, but I wonder how general this example is.

                  For example, the Tobit assumes a conditional expectation that is linear when we are far from zero, whereas Poisson assumes an exponential mean. Therefore it is natural to expect the regressors to enter the two models differently. In your example, some continuous regressors enter the model in levels and others in logs; if we let all of these variables enter both in levels and in logs, the results of the Poisson regression are substantially better than those of the Tobit (of course, the margins are not meaningful in this example).

                  More importantly, if we regress gift on the resplast dummy alone, both OLS and Poisson produce fitted values that equal the average of the dependent variable when the dummy is 0 or 1. This is what we would expect from a reasonable estimator of the conditional mean when the only regressor is a dummy. In contrast, the Tobit fitted values are below the average of gift when resplast is 0, and above it when resplast is 1.
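                  The single-dummy check is easy to reproduce (a sketch assuming a fresh session with charity.dta loaded; the fitted-value names are hypothetical, and the Tobit mean is built from its pieces as in #8):

                  Code:
                  reg gift i.resplast
                  predict m_ols

                  poisson gift i.resplast, vce(robust)
                  predict m_pois

                  tobit gift i.resplast, ll(0)
                  predict xbh1, xb
                  scalar sig1 = sqrt(_b[/var(e.gift)])
                  gen m_tobit = normal(xbh1/sig1)*xbh1 + sig1*normalden(xbh1/sig1)

                  * OLS and Poisson fitted values match the cell means of gift;
                  * the Tobit fitted values do not
                  tabstat gift m_ols m_pois m_tobit, by(resplast)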

                  In short, I agree with you on all of this but, perhaps because of my experience, I tend to be sceptical about the ability of the Tobit to produce meaningful results in this context, although that does not mean we should rule it out.

                  Once again, thank you for taking the time to discuss these issues; hopefully this will be as useful to other users as it was to me.

                  Best wishes,

                  Joao

                  Code:
                  . use charity.dta, clear
                  
                  . set more off
                  
                  . g lavggift=log(avggift)
                  
                  . g lweekslast=log(weekslast)
                  
                  . g lmailsyear=log(mailsyear)
                  
                  . g lpropresp=log(propresp)
                  
                  .
                  . reg gift resplast weekslast mailsyear propresp avggift l*, robust
                  
                  Linear regression                               Number of obs     =      4,268
                                                                  F(9, 4258)        =      79.98
                                                                  Prob > F          =     0.0000
                                                                  R-squared         =     0.2437
                                                                  Root MSE          =     13.113
                  
                  ------------------------------------------------------------------------------
                               |               Robust
                          gift |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                  -------------+----------------------------------------------------------------
                      resplast |   3.526834   .9902288     3.56   0.000      1.58547    5.468199
                     weekslast |   -.086694   .0168873    -5.13   0.000    -.1198018   -.0535861
                     mailsyear |  -3.831536   1.299148    -2.95   0.003    -6.378543   -1.284529
                      propresp |   8.255573   3.079576     2.68   0.007        2.218    14.29315
                       avggift |  -.0035145   .0060786    -0.58   0.563    -.0154317    .0084027
                      lavggift |   9.150577   .8103962    11.29   0.000     7.561778    10.73938
                    lweekslast |   5.007975   1.099412     4.56   0.000     2.852555    7.163395
                    lmailsyear |   7.388224   1.802861     4.10   0.000     3.853677    10.92277
                     lpropresp |   2.592077   1.453919     1.78   0.075    -.2583617    5.442515
                         _cons |    -29.711   4.407001    -6.74   0.000    -38.35102   -21.07098
                  ------------------------------------------------------------------------------
                  
                  .
                  . predict gifth_ols, xb
                  
                  .
                  . scalar sst = e(mss) + e(rss)
                  
                  .
                  . tobit gift i.resplast weekslast mailsyear propresp avggift l*, ll(0)
                  
                  Refining starting values:
                  
                  Grid node 0:   log likelihood = -10561.066
                  
                  Fitting full model:
                  
                  Iteration 0:   log likelihood = -10561.066  
                  Iteration 1:   log likelihood = -9369.2893  
                  Iteration 2:   log likelihood = -9069.0548  
                  Iteration 3:   log likelihood = -9062.3019  
                  Iteration 4:   log likelihood =  -9062.257  
                  Iteration 5:   log likelihood = -9062.2569  
                  
                  Tobit regression                                Number of obs     =      4,268
                                                                     Uncensored     =      1,707
                  Limits: lower = 0                                  Left-censored  =      2,561
                          upper = +inf                               Right-censored =          0
                  
                                                                  LR chi2(9)        =    1216.91
                                                                  Prob > chi2       =     0.0000
                  Log likelihood = -9062.2569                     Pseudo R2         =     0.0629
                  
                  ------------------------------------------------------------------------------
                          gift |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                  -------------+----------------------------------------------------------------
                    1.resplast |   7.993452   1.528021     5.23   0.000     4.997734    10.98917
                     weekslast |  -.3429517   .0416402    -8.24   0.000    -.4245881   -.2613153
                     mailsyear |  -7.323156    2.23718    -3.27   0.001    -11.70919   -2.937117
                      propresp |   8.461574   7.606064     1.11   0.266    -6.450276    23.37342
                       avggift |  -.0026719   .0049799    -0.54   0.592    -.0124351    .0070913
                      lavggift |   13.47739   .7348953    18.34   0.000     12.03661    14.91817
                    lweekslast |   14.77376   2.074052     7.12   0.000     10.70754    18.83998
                    lmailsyear |   16.60953   3.340836     4.97   0.000     10.05975    23.15931
                     lpropresp |   13.32008   3.818178     3.49   0.000     5.834465     20.8057
                         _cons |  -69.78557   9.920698    -7.03   0.000    -89.23531   -50.33583
                  -------------+----------------------------------------------------------------
                    var(e.gift)|   574.3013   21.65885                      533.3705    618.3732
                  ------------------------------------------------------------------------------
                  
                  .
                  . scalar sigmah = sqrt(_b[/var(e.gift)])
                  
                  .
                  . margins, eydx(*) predict(ystar(0,.))
                  
                  Average marginal effects                        Number of obs     =      4,268
                  Model VCE    : OIM
                  
                  Expression   : E(gift*|gift>0), predict(ystar(0,.))
                  ey/dx w.r.t. : 1.resplast weekslast mailsyear propresp avggift lavggift lweekslast
                                 lmailsyear lpropresp
                  
                  ------------------------------------------------------------------------------
                               |            Delta-method
                               |      ey/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
                  -------------+----------------------------------------------------------------
                    1.resplast |   .4921176   .0917819     5.36   0.000     .3122285    .6720068
                     weekslast |  -.0216996    .002665    -8.14   0.000    -.0269228   -.0164763
                     mailsyear |   -.463358     .14166    -3.27   0.001    -.7410065   -.1857095
                      propresp |   .5353891   .4809005     1.11   0.266    -.4071586    1.477937
                       avggift |  -.0001691   .0003152    -0.54   0.592    -.0007868    .0004486
                      lavggift |   .8527549   .0498834    17.09   0.000     .7549854    .9505245
                    lweekslast |   .9347799   .1322584     7.07   0.000     .6755583    1.194002
                    lmailsyear |   1.050934   .2117079     4.96   0.000     .6359945    1.465874
                     lpropresp |   .8428015   .2424222     3.48   0.001     .3676628     1.31794
                  ------------------------------------------------------------------------------
                  Note: ey/dx for factor levels is the discrete change from the base level.
                  
                  .
                  . predict xbh, xb
                  
                  .
                  . gen gifth_t = normal(xbh/sigmah)*xbh + sigmah*normalden(xbh/sigmah)
                  
                  .
                  . gen rh = gift - gifth_t
                  
                  .
                  . gen rhsq = rh^2
                  
                  .
                  . sum rhsq
                  
                      Variable |        Obs        Mean    Std. Dev.       Min        Max
                  -------------+---------------------------------------------------------
                          rhsq |      4,268    162.2827    1414.651   .0000169   48426.59
                  
                  .
                  . scalar ssr_t = r(N)*r(mean)
                  
                  .
                  . di 1 - ssr_t/sst
                  .28455455
                  
                  .
                  . poisson gift i.resplast weekslast mailsyear propresp avggift l*, vce(robust)
                  
                  Iteration 0:   log pseudolikelihood = -27434.011  
                  Iteration 1:   log pseudolikelihood = -27315.268  
                  Iteration 2:   log pseudolikelihood = -27302.472  
                  Iteration 3:   log pseudolikelihood = -27301.803  
                  Iteration 4:   log pseudolikelihood = -27301.802  
                  
                  Poisson regression                              Number of obs     =      4,268
                                                                  Wald chi2(9)      =    1534.92
                                                                  Prob > chi2       =     0.0000
                  Log pseudolikelihood = -27301.802               Pseudo R2         =     0.3613
                  
                  ------------------------------------------------------------------------------
                               |               Robust
                          gift |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
                  -------------+----------------------------------------------------------------
                    1.resplast |   .3684179   .0870178     4.23   0.000     .1978662    .5389696
                     weekslast |  -.0226319   .0036232    -6.25   0.000    -.0297332   -.0155306
                     mailsyear |  -.0780192   .1158396    -0.67   0.501    -.3050608    .1490223
                      propresp |  -.3812796   .5377059    -0.71   0.478    -1.435164    .6726047
                       avggift |  -.0007583   .0000885    -8.56   0.000    -.0009319   -.0005848
                      lavggift |   1.029715   .0503443    20.45   0.000     .9310425    1.128388
                    lweekslast |   .9168623   .1392761     6.58   0.000     .6438863    1.189838
                    lmailsyear |   .4244867   .1656707     2.56   0.010     .0997781    .7491954
                     lpropresp |   .9358605   .3259499     2.87   0.004     .2970105    1.574711
                         _cons |  -2.514954   .7574839    -3.32   0.001    -3.999595   -1.030313
                  ------------------------------------------------------------------------------
                  
                  .
                  . predict gifth_p
                  (option n assumed; predicted number of events)
                  
                  .
                  . gen uh = gift - gifth_p
                  
                  .
                  . gen uhsq = uh^2
                  
                  .
                  . sum uhsq
                  
                      Variable |        Obs        Mean    Std. Dev.       Min        Max
                  -------------+---------------------------------------------------------
                          uhsq |      4,268    137.4855    887.5746   .0030101   32187.38
                  
                  .
                  . scalar ssr_p = r(N)*r(mean)
                  
                  .
                  . di 1 - ssr_p/sst
                  .39387638
                  
                  .
                  . corr gift gifth_ols gifth_t gifth_p
                  (obs=4,268)
                  
                               |     gift gifth_~s  gifth_t  gifth_p
                  -------------+------------------------------------
                          gift |   1.0000
                     gifth_ols |   0.4936   1.0000
                       gifth_t |   0.5432   0.9016   1.0000
                       gifth_p |   0.6276   0.7934   0.8696   1.0000
                  
                  
                  .
                  .
                  end of do-file
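The fit comparisons in the log above score each model's fitted values with 1 - SSR/SST, where SST is the OLS total sum of squares saved earlier in the scalar sst, and separately look at the correlation between gift and the fitted values. The relation between those two measures can be sketched outside Stata (Python with made-up data; the variable names here are illustrative, not from the charity data):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=500)
# nonnegative outcome with an exponential conditional mean plus noise
y = np.exp(0.5 + 0.8 * x) + rng.normal(scale=1.0, size=500)

# fitted values from an exponential mean, as a Poisson regression would produce
yhat = np.exp(0.5 + 0.8 * x)

sst = np.sum((y - y.mean()) ** 2)          # total sum of squares
ssr = np.sum((y - yhat) ** 2)              # sum of squared residuals
r2_ssr = 1 - ssr / sst                     # the 1 - SSR/SST measure used in the log
r2_corr = np.corrcoef(y, yhat)[0, 1] ** 2  # squared correlation, as with -corr- above
```

Because the squared correlation is the R2 of the best affine transformation of yhat, it can never fall below 1 - SSR/SST computed from yhat directly; when the two diverge a lot, the fitted values track the shape of E(y|x) but are off in level or scale.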
                  
                  .
                  Code:
                  . use charity.dta, clear
                  
                  . set more off
                  
                  .
                  .
                  . reg gift resplast , robust
                  
                  Linear regression                               Number of obs     =      4,268
                                                                  F(1, 4266)        =     159.83
                                                                  Prob > F          =     0.0000
                                                                  R-squared         =     0.0408
                                                                  Root MSE          =     14.754
                  
                  ------------------------------------------------------------------------------
                               |               Robust
                          gift |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                  -------------+----------------------------------------------------------------
                      resplast |   6.443508   .5096817    12.64   0.000     5.444267    7.442749
                         _cons |   5.287073   .2573627    20.54   0.000     4.782508    5.791638
                  ------------------------------------------------------------------------------
                  
                  .
                  . predict gifth_ols, xb
                  
                  .
                  . scalar sst = e(mss) + e(rss)
                  
                  .
                  . tobit gift i.resplast , ll(0)
                  
                  Refining starting values:
                  
                  Grid node 0:   log likelihood = -10892.828
                  
                  Fitting full model:
                  
                  Iteration 0:   log likelihood = -10892.828  
                  Iteration 1:   log likelihood = -9808.6177  
                  Iteration 2:   log likelihood = -9501.0502  
                  Iteration 3:   log likelihood = -9496.2084  
                  Iteration 4:   log likelihood =  -9496.171  
                  Iteration 5:   log likelihood =  -9496.171  
                  
                  Tobit regression                                Number of obs     =      4,268
                                                                     Uncensored     =      1,707
                  Limits: lower = 0                                  Left-censored  =      2,561
                          upper = +inf                               Right-censored =          0
                  
                                                                  LR chi2(1)        =     349.08
                                                                  Prob > chi2       =     0.0000
                  Log likelihood =  -9496.171                     Pseudo R2         =     0.0180
                  
                  ------------------------------------------------------------------------------
                          gift |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
                  -------------+----------------------------------------------------------------
                    1.resplast |   19.56241   1.078413    18.14   0.000     17.44816    21.67666
                         _cons |  -15.76089   .8068922   -19.53   0.000    -17.34281   -14.17896
                  -------------+----------------------------------------------------------------
                    var(e.gift)|   806.9589   30.67495                      749.0064    869.3954
                  ------------------------------------------------------------------------------
                  
                  .
                  . scalar sigmah = sqrt(_b[/var(e.gift)])
                  
                  .
                  . margins, eydx(*) predict(ystar(0,.))
                  
                  Conditional marginal effects                    Number of obs     =      4,268
                  Model VCE    : OIM
                  
                  Expression   : E(gift*|gift>0), predict(ystar(0,.))
                  ey/dx w.r.t. : 1.resplast
                  
                  ------------------------------------------------------------------------------
                               |            Delta-method
                               |      ey/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
                  -------------+----------------------------------------------------------------
                    1.resplast |   .9507644   .0501154    18.97   0.000       .85254    1.048989
                  ------------------------------------------------------------------------------
                  Note: ey/dx for factor levels is the discrete change from the base level.
                  
                  .
                  . predict xbh, xb
                  
                  .
                  . gen gifth_t = normal(xbh/sigmah)*xbh + sigmah*normalden(xbh/sigmah)
                  
                  .
                  . gen rh = gift - gifth_t
                  
                  .
                  . gen rhsq = rh^2
                  
                  .
                  . sum rhsq
                  
                      Variable |        Obs        Mean    Std. Dev.       Min        Max
                  -------------+---------------------------------------------------------
                          rhsq |      4,268    218.4543    1830.925   .0234679   59949.96
                  
                  .
                  . scalar ssr_t = r(N)*r(mean)
                  
                  .
                  . di 1 - ssr_t/sst
                  .03691444
                  
                  .
                  . poisson gift i.resplast , vce(robust)
                  
                  Iteration 0:   log pseudolikelihood = -40261.433  
                  Iteration 1:   log pseudolikelihood = -40261.428  
                  Iteration 2:   log pseudolikelihood = -40261.428  
                  
                  Poisson regression                              Number of obs     =      4,268
                                                                  Wald chi2(1)      =     168.23
                                                                  Prob > chi2       =     0.0000
                  Log pseudolikelihood = -40261.428               Pseudo R2         =     0.0582
                  
                  ------------------------------------------------------------------------------
                               |               Robust
                          gift |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
                  -------------+----------------------------------------------------------------
                    1.resplast |   .7969344   .0614419    12.97   0.000     .6765105    .9173584
                         _cons |   1.665265    .048672    34.21   0.000     1.569869     1.76066
                  ------------------------------------------------------------------------------
                  
                  .
                  . predict gifth_p
                  (option n assumed; predicted number of events)
                  
                  .
                  . gen uh = gift - gifth_p
                  
                  .
                  . gen uhsq = uh^2
                  
                  .
                  . sum uhsq
                  
                      Variable |        Obs        Mean    Std. Dev.       Min        Max
                  -------------+---------------------------------------------------------
                          uhsq |      4,268    217.5807    1839.969   .0725864   59884.41
                  
                  .
                  . scalar ssr_p = r(N)*r(mean)
                  
                  .
                  . di 1 - ssr_p/sst
                  .04076598
                  
                  .
                  . su gift gifth_ols gifth_t gifth_p if resplast==0
                  
                      Variable |        Obs        Mean    Std. Dev.       Min        Max
                  -------------+---------------------------------------------------------
                          gift |      2,839    5.287073    13.71207          0        250
                     gifth_ols |      2,839    5.287073           0   5.287073   5.287073
                       gifth_t |      2,839    5.153193           0   5.153193   5.153193
                       gifth_p |      2,839    5.287073           0   5.287073   5.287073
                  
                  . su gift gifth_ols gifth_t gifth_p if resplast==1
                  
                      Variable |        Obs        Mean    Std. Dev.       Min        Max
                  -------------+---------------------------------------------------------
                          gift |      1,429    11.73058    16.63227          0        250
                     gifth_ols |      1,429    11.73058           0   11.73058   11.73058
                       gifth_t |      1,429    13.33485           0   13.33485   13.33485
                       gifth_p |      1,429    11.73058           0   11.73058   11.73058
                  
                  .
                  .
                  end of do-file
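The hand-computed gifth_t above implements the corner-solution Tobit mean, E[max(0, xb + u)] = Phi(xb/sigma)*xb + sigma*phi(xb/sigma). As a sanity check (a Python/scipy sketch, with the coefficient values copied from the simple Tobit output above), the closed form agrees with brute-force numerical integration and reproduces the group fitted means 5.153193 and 13.33485 reported by -summarize-:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def tobit_mean(xb, sigma):
    """E[max(0, xb + u)] with u ~ N(0, sigma^2): Phi(xb/s)*xb + s*phi(xb/s)."""
    return norm.cdf(xb / sigma) * xb + sigma * norm.pdf(xb / sigma)

def tobit_mean_numeric(xb, sigma):
    """The same expectation by direct numerical integration over the positive part."""
    integrand = lambda y: y * norm.pdf(y, loc=xb, scale=sigma)
    val, _ = quad(integrand, 0.0, xb + 12 * sigma)
    return val

# _cons, coefficient on 1.resplast, and var(e.gift) from the Tobit of gift on resplast
cons, b_resplast, sigma = -15.76089, 19.56241, np.sqrt(806.9589)

for resplast in (0, 1):
    xb = cons + b_resplast * resplast
    closed = tobit_mean(xb, sigma)
    # closed form and numerical integral agree to high precision
    assert abs(closed - tobit_mean_numeric(xb, sigma)) < 1e-6
```

Note how strongly the formula shrinks and re-centers the index: the linear index for resplast==0 is -15.76, yet the implied conditional mean is about 5.15, which is why the Tobit coefficients themselves are so much larger than the OLS ones.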

                  Comment


                  • #10
                    To Jeff and Joao:

                    1. Your Statalist exchange should be required reading for students studying applied microeconometrics.

                    2. Regarding fit: to quote Jeff, "How well do they fit E(y|x)?" Beyond in-sample R2, it might be interesting to see how Poisson and Tobit perform in out-of-sample prediction of E(y|x) (e.g., cross-validation). It's not immediately obvious that one would be more susceptible to, say, in-sample overfitting than the other, but it's certainly possible that this could be so.

                    Comment


                    • #11
                      Hi John. Thanks for your kind words. I agree that looking out-of-sample would be interesting; I haven't seen it done in comparing Poisson to Tobit. I think in some cases it's not clear whether the slope of the conditional mean should flatten out at high values of the x. As Joao pointed out, in the Tobit it does; in the exponential it does not.

                      A few reactions to Joao's most recent post. As I think he knows, it's practically preaching to the choir with me about the merits of Poisson regression. But I don't like being dogmatic about one approach always being the best. I wasn't suggesting Tobit in place of Poisson regression, but we often do robustness checks using models we know can't both be true to determine sensitivity of key parameters (usually average partial effects).

                      The point about wanting fitted values to average to ybar is a good one. In fact, it's the basis for the IPWRA doubly robust estimators I proposed for treatment effects. That's why, in that context, I always recommend Poisson regression (for nonnegative responses), or linear regression (for any response), or logistic regression (for binary or fractional response). Also, in a recent paper (Negi and Wooldridge, 2020), we show that the same combinations of mean/quasi-likelihood functions ensure consistency of regression adjustment despite arbitrary misspecification of the mean function. In fact, we argue against using a Tobit for corner solutions. Again, Poisson regression with an exponential mean is the only sensible choice in the treatment effects case for y >= 0 without an upper bound. In the charity example, the key variable is mailsyear, which takes on more than 10 values. One could assign a dummy for every different treatment level, but parsimony calls for estimating the effect of one more mailing per year. Thus, trying both Poisson regression and Tobit seems sensible to me.

                      Finally, regarding the functional forms in Joao's extended analysis of my example: it's somewhat weird to include both levels and logs of explanatory variables, mainly because the coefficients are almost impossible to interpret. That said, when it is done, Poisson regression fits a lot better, and so it's worth exploring more flexible functional forms. My simple example, which was taken from a problem in my introductory text, clearly missed some nonlinearities. This is another good illustration of the value of checking for such things, and I suspect more common devices, such as squares and interactions, might pick up the same thing. At a minimum, using "margins" in Joao's model doesn't make sense: one cannot hold log(mailsyear) fixed and increase mailsyear. One needs to compute the partial derivative, and it isn't clear whether it is negative or positive, or for what values of the x it changes sign. It would be much easier to include squares and interactions and then use the margins command.
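That partial derivative is easy to write down for the level-plus-log specification. Using the robust Poisson coefficients reported earlier in the thread, the semi-elasticity of E(gift|x) with respect to mailsyear is the sum of the level coefficient and the log coefficient divided by mailsyear, and it changes sign within a plausible range of the data (a quick check, sketched in Python for convenience):

```python
# coefficients on mailsyear and log(mailsyear) from the robust Poisson fit above
b_level, b_log = -0.0780192, 0.4244867

def semi_elasticity(mailsyear):
    """d log E(gift|x) / d mailsyear when both the level and the log enter."""
    return b_level + b_log / mailsyear

# the effect of one more mailing is positive for few mailings, negative for many;
# the sign flips where b_level + b_log/mailsyear = 0
turning_point = -b_log / b_level   # about 5.44 mailings per year
```

So neither a uniformly negative nor a uniformly positive effect is implied; the combined derivative has to be evaluated at specific values of mailsyear, which is exactly why "margins" on the raw coefficients is uninformative here.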

                      Comment


                      • #12
                        Dear Jeff,

                        Just to clarify, I know that you are well aware of the merits of Poisson regression: I remember discussing it with you when you visited the Bank of Portugal at the time I was working on the Log of Gravity paper.

                        As I said, I tend to be sceptical about the merits of the Tobit for corner-solutions data because there is nothing in the estimator to suggest that it will lead to a meaningful estimate of the conditional expectation, but your example shows that it is at least possible to find some cases where it works very well. So, thank you for providing that illustration.

                        Finally, as you mention and I had noted, using "margins" in my example does not produce interesting results; that was just a simple way to avoid choosing whether levels or logs would be more suitable for the "linear" and exponential conditional expectations estimated by the Tobit and Poisson regressions. Clearly it is not an approach I would recommend.

                        All good wishes and thanks again for your comments,

                        Joao

                        Comment


                        • #13
                          Maybe if we keep pushing Poisson regression the inferior alternatives will eventually disappear. 😉

                          Comment


                          • #14
                            Dear all,

                            First of all, I agree with John; I am a doctoral student in microeconometrics, and have learned a lot from great minds by reading this thread.

                            I have a question concerning the use of the Tobit model, therefore related to this thread - at least partially. Please let me know if another thread should be started for this.

                            Much like Tasha, I have a dependent variable that is essentially a count. All values are non-negative. Conceptually, the dependent variable is the number of job positions offered, which is no longer an integer because of a complex aggregation process. The main explanatory variable, unemployment, is continuous.

                            I have about 7.5M observations; for 7.45M observations, the dependent variable is equal to 0, for 50,000 the dependent variable is strictly positive. When the dependent variable is equal to 0, this means that the respondent (our dependent variable stems from a survey) did not recruit any workers.

                            I have already run fixed-effects regressions and pseudo Poisson max likelihood HDFE estimation (using the community-contributed command ppmlhdfe) as recommended in #2.

                            Johnston and DiNardo (1997), however, assert that if one ignores observability/selection, the "naively" estimated coefficient is inconsistent and attenuated by the probability that the outcome variable is positive. The dependent variable and the regressor are not jointly normally distributed in this case (far from it).

                            One solution would be the Tobit model, however running a Tobit model is computationally unfeasible in this case due to multiple fixed-effect vectors.

                            My question is the following: Am I stuck with an attenuated estimator, and should I simply specify that the estimate is a lower bound for the true estimator? What would be the best route to follow?

                            Thanks a lot in advance!

                            Comment


                            • #15
                              Dear Maxence Morlet,

                              I think it all depends on what you are trying to do. As Jeff makes clear in his textbooks, the purpose of a model is to answer a question and therefore the model to use depends on the question you are asking.

                              If you just want to see if the conditional mean of y depends on x, your PPML results should be enough. This will also be enough if you think that the decision to hire and how much to hire is determined by a single process.

                              However, if you believe that there are two separate processes at work, then you may consider a two-part model or even a zero-inflated model (if some firms will never hire for some reason). In that case you should look at John's work, namely his 1986 and 1998 papers (just check his most cited papers on Google Scholar).

                              In the hurdle models discussed in John's 1986 paper, the two parts are typically independent; this contrasts with the sample selection models and Tobit that allow for dependence. However, you can also modify the hurdle model to account for that, as done in this paper (see Section 5).
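The logic behind the two-part model is the exact decomposition E(y|x) = Pr(y > 0|x) * E(y|y > 0, x), with each factor modeled separately (e.g., a logit for the first part and a truncated count or exponential-mean model for the second). A minimal numerical illustration of the decomposition itself on simulated data (Python; this demonstrates the identity hurdle models exploit, not Cragg's estimator):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.integers(0, 2, size=n)                 # a binary covariate, like resplast

# part 1: participation -- whether any positive amount occurs
p_pos = np.where(x == 1, 0.6, 0.3)
d = rng.random(n) < p_pos

# part 2: the positive amount, here lognormal, conditional on participation
amount = np.exp(rng.normal(1.0 + 0.5 * x, 0.8, size=n))
y = np.where(d, amount, 0.0)

# within each covariate cell: Pr(y>0|x), E(y|y>0,x), and E(y|x)
results = {}
for g in (0, 1):
    m = x == g
    results[g] = ((y[m] > 0).mean(), y[m][y[m] > 0].mean(), y[m].mean())
    # the two parts multiply back to the overall conditional mean exactly
```

Independence of the two parts, as in the basic hurdle model, means each factor can be estimated on its own; allowing dependence between them is what the modification mentioned above addresses.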

                              I will not comment on what Johnston and DiNardo (1997) say because I do not have enough information about it; can you please provide the full reference including the page number?

                              Best wishes,

                              Joao

                              Comment
