
  • Outlier Identification for GLM (poisson)

    Hi all,

    I am using Stata/SE 13.1 (Windows).

    I want to do some regression diagnostics after running a GLM (Poisson family).
    Unfortunately, there seem to be fewer postestimation commands for glm than for reg.

    In particular, I am concerned that outliers are driving my results.

    I have already investigated Cook's distance and looked at the residuals.
    However, I am not sure what to do next, so I wondered whether there is...

    1) a way to run dfbeta after glm?
    2) a command for glm analogous to rreg for reg?

    I know that dealing with outliers is difficult.
    Are there, in general, more options than just reporting results with and without the outliers and running robust regressions?
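
    For reference, a minimal sketch of the kind of setup described above (y, x1, and x2 are placeholder names, not the actual variables):

    Code:
    * Poisson GLM, then the built-in influence/residual diagnostics
    glm y x1 x2, family(poisson) link(log)
    predict double cd, cooksd                 // Cook's distance
    predict double rsp, pearson standardized  // standardized Pearson residuals
    predict double lev, hat                   // leverage (hat diagonals)
    gsort -cd
    list y cd rsp lev in 1/10                 // the 10 most influential observations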


    Best
    Christoph

    Last edited by Christoph Winter; 01 Dec 2015, 14:24.

  • #2
    There's nothing in Stata that I'm aware of. R has the glmrob function in the robustbase package. See this presentation.

    Note that rreg is an old command, advanced in its time, but now about 25 years old. It's not all that robust, with a breakdown point of \(\epsilon^* = \sqrt{0.5/p}\), where \(p\) is the number of predictors (Maronna et al., 2006, p. 113). The breakdown point is, roughly, the smallest proportion of outliers that can cause an estimate to exceed any bound. Thus, as \(p\) increases, the breakdown point of rreg decreases. rreg has been superseded in Stata by the user-written mmregress package by Verardi and Croux (SSC), which simultaneously identifies high-leverage points and outliers and has a breakdown point of 50%. There's a Stata Journal article that can be downloaded from here.
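
    A minimal sketch of the kind of call involved (y, x1, and x2 are placeholders; the exact options of mmregress depend on the installed version, so check its help file):

    Code:
    * locate and install the package, then fit the MM-estimator
    findit mmregress
    mmregress y x1 x2
    * compare the coefficients with the older robust command and with plain OLS
    rreg y x1 x2
    regress y x1 x2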

    Reference:

    Maronna, R., Martin, R. D., and Yohai, V. J. (2006). Robust Statistics: Theory and Methods. Chichester, England: Wiley.
    Steve Samuels
    Statistical Consulting
    [email protected]

    Stata 14.2



    • #3
      modeldiag (SJ) has various commands that typically work after glm.

      Are the outliers extreme on the response, the predictors or both?



      • #4
        First, thanks for the replies! I will have a look at mmregress and modeldiag.

        I am analyzing count data, and there are certainly two outliers with very large counts (those are easy to detect).
        Since the outcome is Poisson distributed with a long tail, there are even more observations with a very large number of counts relative to the others (mean count = 2, and those have more than 20).
        However, I think that the Poisson regression deals with that at least to some extent.

        On the other hand, there are some observations with a large Cook's distance but zero counts. So there the predictors seem to be "the problem".
        In order to address that, I thought about looking at dfbeta, but unfortunately that is not available after glm.
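
        A hedged sketch of flagging both kinds of observations just described (y, x1, and x2 are placeholders; the cutoffs 20 and 4/N are rough conventions, not rules):

        Code:
        glm y x1 x2, family(poisson) link(log)
        predict double cd, cooksd
        * very large counts, or zero counts with a large Cook's distance
        list y cd if y > 20 | (y == 0 & cd > 4/_N)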



        • #5
          You can use these tricks to compute dfbeta after most estimation commands: http://blog.stata.com/2014/05/08/usi...ential-points/
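
          Not necessarily the blog's exact method, but a hedged leave-one-out (jackknife-style) sketch in the same spirit (y, x1, and x2 are placeholders; it records the unstandardized change in the coefficient on x1 when each observation is dropped, refitting the model N times, so it can be slow):

          Code:
          glm y x1 x2, family(poisson) link(log)
          scalar b_full = _b[x1]
          gen double dfb_x1 = .
          local N = _N
          forvalues i = 1/`N' {
              quietly glm y x1 x2 if _n != `i', family(poisson) link(log)
              quietly replace dfb_x1 = b_full - _b[x1] in `i'
          }
          gen double absdfb = abs(dfb_x1)
          gsort -absdfb
          list dfb_x1 in 1/10    // the 10 largest changes in the x1 coefficient
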
          ---------------------------------
          Maarten L. Buis
          University of Konstanz
          Department of history and sociology
          box 40
          78457 Konstanz
          Germany
          http://www.maartenbuis.nl
          ---------------------------------



          • #6
            In broad terms, if I use a Poisson model I expect some large counts occasionally. Given your situation, I think I would try some sensitivity analysis and jiggle the outliers a little to see what happens. You could also try vce(robust), which may give some indirect indication of the uncertainty here.
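
            A minimal sketch of that option (y, x1, and x2 are placeholders):

            Code:
            * the same Poisson model with robust (sandwich) standard errors
            glm y x1 x2, family(poisson) link(log) vce(robust)
            * or, equivalently, with the dedicated command
            poisson y x1 x2, vce(robust)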



            • #7
              Thanks again! I will definitely try the jackknife solution to obtain the dfbetas.
              Nick, what do you mean by jiggling the outliers?
              It is indeed the case that when I drop one outlier from the sample (sample size about 2,000), the effect of one variable of interest becomes insignificant.
              So reporting the regression with and without it seems reasonable anyway.
              However, the question now is how to deal with other influential observations.
              Can you elaborate on why vce(robust) gives me an indication?
              Might it be reasonable to take log(count + 1) or an inverse hyperbolic sine transformation and then consider mmregress instead of glm (poisson)?




              • #8
                With and without is one strategy. Otherwise just add or subtract constants and see how sensitive the results are.
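
                A hedged sketch of that kind of perturbation (y, x1, and x2 are placeholders; the cutoff of 20 and the shift of 5 are arbitrary illustrations, not recommendations):

                Code:
                * perturb the extreme counts a little and refit
                clonevar y_jig = y
                replace y_jig = y - 5 if y > 20
                glm y_jig x1 x2, family(poisson) link(log)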

                I hate adding 1 as a fudge for counts personally. Working out what that all means is a real pain.

                I think the main feature of Poisson is the log link. Everything else is secondary. I wouldn't lightly give up on the log link at all.



                • #9
                  Ok, thanks for the advice!



                  • #10
                    Dear Christoph,

                    To add to all the excellent advice you have already received: you could also follow the strategy suggested by Daryl Pregibon. Essentially, any GLM can be estimated as a sequence of weighted OLS regressions (IRLS). Therefore, you can get "one-step" diagnostics by applying the usual OLS diagnostics to the last weighted regression implicitly used in the estimation. Pregibon's paper is for the logit, but Poisson regression is similar.
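
                    A hedged sketch of that idea for the Poisson/log case (not taken from the paper; y, x1, and x2 are placeholders). It rebuilds the final IRLS weighted regression and applies the standard OLS diagnostics, folding the square-root weights into the variables so that regress computes them directly:

                    Code:
                    glm y x1 x2, family(poisson) link(log)
                    predict double eta, xb                // linear predictor
                    predict double mu, mu                 // fitted mean
                    gen double z  = eta + (y - mu)/mu     // IRLS working response (log link)
                    gen double sw = sqrt(mu)              // sqrt of the IRLS weight (= mu for Poisson/log)
                    gen double zs  = z*sw
                    gen double x1s = x1*sw
                    gen double x2s = x2*sw
                    regress zs x1s x2s sw, noconstant     // sw plays the role of the constant
                    dfbeta                                // one-step dfbeta approximations
                    predict double cd1, cooksd            // one-step Cook's distances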

                    All the best,

                    Joao



                    • #11
                      Thanks a lot, Joao!

