Effective Outlier Analysis for Unbalanced Panel Data in Stata

Tony Annan

Join Date: Sep 2017

Posts: 26
#1

22 Sep 2017, 11:19

Can someone recommend the most effective way of performing both univariate and multivariate outlier analysis in Stata? I have panel data by ID and Year, with attrition. I suspect there are significant outliers in my data, and I would like to use the most effective means Stata provides to detect and deal with them. Thank you for your help. Tony
Phil Bromiley

Join Date: Apr 2014

Posts: 4348
#2

24 Sep 2017, 15:56

There has been a lot of debate over outliers and there are no generally accepted solutions. If need be, you can run a regression with dummy variables for panel and then use the diagnostics available after regression. Attrition is a separate issue which suggests sample selection problems.
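As a minimal sketch of that suggestion, assuming a response y, predictors x1 and x2, and panel and time identifiers id and year (all hypothetical names, not taken from the question), the panel dummies can be entered with factor-variable notation and the usual diagnostics requested with predict after regress:

```stata
* Sketch only: y, x1, x2, id and year are hypothetical variable names
xtset id year

* Pooled OLS with one indicator variable per panel
regress y x1 x2 i.id

* Diagnostics available after regress
predict double rstud, rstudent   // studentized residuals
predict double cook, cooksd      // Cook's distance
predict double lev, leverage     // leverage (diagonal of the hat matrix)

* Show the observations with the largest Cook's distance for inspection
gsort -cook
list id year y rstud cook lev in 1/20
```

One indicator per panel is practical only with a modest number of panels, and the 20 observations listed are an arbitrary number for eyeballing, not a threshold for what counts as an outlier.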
Tony Annan

Join Date: Sep 2017

Posts: 26
#3

05 Oct 2017, 09:15

Phil, thank you for taking up my question. Can you recommend any easy-to-understand materials that would clarify this issue for me? My descriptive analysis shows severe skewness and kurtosis in the data, and extreme maximums and minimums. I like the idea of regression with dummy variables, but how do I diagnose or interpret the results to identify the outliers?
Nick Cox

Join Date: Mar 2014

Posts: 35730
#4

05 Oct 2017, 09:24

Outliers are whatever surprises the analyst given a model for the generating process. No model, no worthwhile measure of surprise. That's not really my definition; it is based on Venables and Ripley, and it's the best I've seen. Reference within https://stats.stackexchange.com/ques...an/78067#78067

Not trying to be flippant here, but considering Atlantic hurricanes, mass shootings, etc. it is clear that outliers however rare can be real and -- whenever so -- the way to deal with them (statistically, scientifically, not talking policy or politics) is to build a model accommodating them.

If the reply is "Sure, but my data are not like that; outliers are measurement blunders or contaminants from outside", then fine, but you still need a model, or minimally a working definition of what is within bounds, to define what is outside bounds.
Tony Annan

Join Date: Sep 2017

Posts: 26
#5

05 Oct 2017, 09:43

Thanks so much, Nick. I understand. Which of the tools Stata offers for modeling unbalanced panel data would you suggest I look at? Tony
Nick Cox

Join Date: Mar 2014

Posts: 35730
#6

05 Oct 2017, 11:45

This is going to be even more than usually opinionated.

Let's knock down a straw person: Someone asks for a command to identify observations that are interesting. This is hopeless without a definition of interesting. So, why should surprising be any easier?

Unbalanced panel data means, I take it, that outliers, if they exist, exist in a context of

1. marginal distribution of all data and of each panel

2. variation in time, likewise for all data and for each panel

3. gaps and/or unequal spacing of data that make panels unbalanced

4. whatever other variables are contextual or predictive

5. realisation that thinking on a transformed scale may make more sense.
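
A few official commands give a first look at several of these points. This is only a sketch, assuming a variable y and identifiers id and year (hypothetical names); it describes the context and does not flag outliers by itself:

```stata
* Sketch only: y, id and year are hypothetical variable names
xtset id year

* Points 1 and 2: overall, between-panel and within-panel variation in y
xtsum y

* Point 3: participation patterns, showing which years each panel is observed
xtdescribe

* Point 3: distribution of the number of observations per panel
bysort id: generate long nobs = _N
egen byte tag = tag(id)          // marks one observation per panel
tabulate nobs if tag

* Point 5: compare the raw and log scales (log defined only where y > 0)
generate double log_y = ln(y) if y > 0
summarize y log_y, detail
```

The log is just one candidate transformation; whether it makes sense depends on the variable.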

I don't know a command that will do that for you, and I don't see any hint yet of what you think your generating process is. Nor am I clear whether your interest is in identifying panels that are surprising or observations within panels.

I find skewness and kurtosis interesting but dubiously useful for small samples. Perhaps you have large samples.

I sometimes have the same problem and I'd rely on plotting panel time series, marginal distributions and bivariate scatter plots, as well as on subject-matter knowledge.
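
Those plots are all available from official commands. A sketch, assuming variables y and x1 and identifiers id and year (hypothetical names); with many panels, a subset or random sample of panels would be needed for the first graph to stay readable:

```stata
* Sketch only: y, x1, id and year are hypothetical variable names
xtset id year

* Panel time series, all panels overlaid on one graph
xtline y, overlay

* Marginal distribution: histogram and quantile plot
histogram y
quantile y

* Distribution of y within each year
graph box y, over(year)

* Bivariate scatter plot
scatter y x1
```

Each command replaces the previous graph unless a name() option is added to keep them side by side.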

At the other extreme, significance tests for outliers are nothing IMO but a snare and a delusion (sorry, on that point I won't be Humble).
Felix Hofmann

Join Date: Feb 2018

Posts: 10
#7

06 Mar 2018, 02:02

Dear all,

I have three questions that refer to post #2: "you can run a regression with dummy variables for panel and then use the diagnostics available after regression".

Assume that we have panel data on 100,000 individuals observed over 5 years (xtset ID YEAR).

1. Does creating dummy variables for panel mean that I enter dummies for each time of observation (e.g. when having 5 years of observation I use 5-1 = 4 dummies) or for each case as suggested here: https://www.stata.com/statalist/arch...msg00373.html?

2. After creating such dummies, can I then use regress and run outlier diagnostics such as Cook's D or DFFITS (see Aguinis et al., 2013*)?

3. Can I also perform other regression diagnostics (such as: https://stats.idre.ucla.edu/stata/we...n-diagnostics/) using the panel dummy approach outlined above or would these diagnostics that are designed for OLS not be appropriate for panel data?

Thank you for your help! Felix

*Aguinis, H., Gottfredson, R. K., & Joo, H. (2013). Best-practice recommendations for defining, identifying, and handling outliers. Organizational Research Methods, 16(2), 270-301.