
  • How to deal with outliers in panel data set?

    Dear all,
    I am working on a panel data set (220 observations: countries × years). After implementing an outlier test in Stata (interquartile test), I found that I have 25 outliers across different countries: 11 in one country and the rest spread among the others. I made sure these are not human error. I don't want to remove them, as they could reflect the genuine behavior of the data, since my data set covers the MENA region.
    thanks in advance

  • #2
    Very likely that you

    * are just seeing “big” countries in some sense

    * would be better off on logarithmic scale.

    What's the interquartile test anyway? There need be no alarm at e.g. points lying above upper quartile + 1.5 IQR -- which is my wild guess at what you mean.
    Last edited by Nick Cox; 11 Apr 2019, 03:17.



    • #3
      I'm wondering what sort of variable is in question: whether, for example, the DV refers to wages, or is a count variable, or something else. For the first scenario, as pointed out in #2, log-transforming helps a lot. For the second scenario, Poisson-like models with some fine-tuning (gamma, power, etc.) may help as well. In the worst scenario, categorizing by quantiles may turn into a handy approach. Quantile regression as well as non-parametric regression are options to consider, even though I'm not aware that they allow for panel data. Hopefully that helps.
      Last edited by Marcos Almeida; 11 Apr 2019, 03:00.
      Best regards,

      Marcos



      • #4
        Thank you Cox and Marcos. I agree that a logarithmic scale makes the problem less severe; however, one of my variables has negative values. Is there any other solution?



        • #5
          You need to tell us more about your data. For example, show the results of summarize.
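          For example, with a placeholder variable name y (adjust to your own variables), the detail option adds quartiles and extreme values, which matter for outlier questions:

          Code:
          * y is a placeholder name for the variable of interest
          summarize y, detail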



          • #6
            You might find Nick Cox's Stata tip 96: Cube roots helpful.
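            A sketch of the signed cube root, assuming a variable x that takes both signs (the sign()/abs() construction is needed because ^ is not defined for negative bases in Stata):

            Code:
            * signed cube root: preserves sign, pulls in both tails,
            * and is defined for negative and zero values too
            generate cuberoot_x = sign(x) * abs(x)^(1/3)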



            • #7
              Although Abdelmoneam Khaled seems to have bailed out, he might be enticed back in, and in any case the topic remains of fairly wide interest.

              Belatedly I realised that MENA in #1 must mean Middle East and North Africa. (Don't assume on Statalist that people here all work in your field and so can read through your abbreviations, acronyms and allusions. Some of us are geographers and so ignorant about everything.) A smattering of general knowledge underlines that countries in MENA vary over a wide range in size -- demographically, economically, geographically -- and that even per capita measures vary enormously too. Given further some explosive growth over time, that all points up the virtues of a logarithmic scale as default (which does not mean compulsory).

              Lacking a sight of the data underlying the original question, I reach for the Grunfeld data as a panel dataset which fortunately, if not fortuitously, shows the same issue graphically, with just a twist that it is companies, not countries, which vary greatly in size.

              Underlying #1, I guess, is the criterion that points lying outside the interval from lower quartile MINUS 1.5 IQR to upper quartile PLUS 1.5 IQR are suspect. (IQR (interquartile range) = upper quartile - lower quartile.) I have read the canonical Exploratory Data Analysis by John W. Tukey many times since buying a copy in 1977, and can confirm that this criterion is not a test, and not even a criterion for outliers, despite what some texts and courses seem to think. It's just a rule of thumb for plotting points separately on a box plot while thinking about them. The thinking about them should include whether you need a transformed scale.
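              As a sketch, that rule of thumb can be computed directly; y, country and year are placeholder names here:

              Code:
              summarize y, detail
              scalar q1 = r(p25)
              scalar q3 = r(p75)
              scalar iqr = q3 - q1
              generate byte outside = (y < q1 - 1.5*iqr) | (y > q3 + 1.5*iqr) if !missing(y)
              list country year y if outside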

              It's quite common to meet the ideas that outliers are

              1* a nuisance to be excluded from the dataset

              2* identifiable with simple methods, just as a few giraffes trying to hide among gazelles can't escape careful scrutiny

              In my experience 1* is wildly pessimistic. Sometimes an outlier is evidently the result of some grotesque measurement error and can't be rescued, but most of the time an outlier is just something that is genuinely BIG, like the Amazon (or Amazon), and informative. (On occasion, BIG means SMALL.)

              In my experience 2* is wildly optimistic. Outliers, like some other things or qualities, are often in the eye of the beholder. There may be a continuum all the way from BIG!!! (e.g. Bill Gates' income or wealth -- he redefined BIG) to not at all big or even small (e.g. mine). Given a good model, with good structure and good predictors and good choice of scale or link function, an outlier might seem unsurprising or even expected.

              Entire books have been written about outliers, and much more could be said. https://stats.stackexchange.com/ques...iers-with-mean mentions many remedies, some simple and sensible if not necessarily successful, others less so. (The thread is more wide-ranging than its title.)

              My specific purpose here is to show that while box plots can be helpful, they can conceal as well as reveal and that it's easy to do better.

              Code first and then results and then discussion. Note that some of the comments flag community-contributed commands which must be installed before you can use them in your Stata.

              Code:
              webuse grunfeld, clear
              
              * box plots first
              local names
              foreach v in invest mvalue kstock {
                  graph box `v', name(`v', replace)
                  gen log_`v' = log(`v')
                  graph box log_`v', name(log_`v', replace)
                  local names `names' `v' log_`v'
              }
              
              graph combine `names', row(3) name(allbox, replace)
              graph drop `names'
              
              * quantile plots next
              * install multqplot from the Stata Journal -- you need -qplot- too.
              multqplot invest log_invest mvalue log_mvalue kstock log_kstock, trscale(invnormal(@)) xla(-3/3) combine(row(3) name(mq1, replace))  
              multqplot invest log_invest mvalue log_mvalue kstock log_kstock, trscale(invnormal(@)) xla(-3/3) combine(row(3) name(mq2, replace)) yla(#5)
              Three variables in the Grunfeld dataset need careful attention. Results of summarize (not shown here) confirm that all values are positive, so logarithms are valid. I didn't work hard, or even at all, at making the box plots look good. They are just exploratory plots.



              The box plots of the original variables do clearly indicate positive skewness but also hint at the 1.5 rule being an arbitrary cut-off and not a magic device to flush out the giraffes from the gazelles. The box plots of the logarithmic transforms show that we are much nearer symmetry, but again several individual data points are flagged in two out of three box plots. Have we made some things worse while making other things better?

              Incidentally, see https://www.stata.com/support/faqs/g...ithmic-scales/ if you want or need an explanation of why you can't just go yscale(log).

              We need something better. It's possible to have the macro and the micro views together -- to see all the fine detail as well as the coarse structure. One way is through quantile plots, in essence plots of ordered values versus cumulative probabilities. quantile is an official command of long standing, with siblings such as qnorm. Here I use multqplot, which in turn is a convenience wrapper for qplot. Both commands were published through the Stata Journal. The main point of multqplot is to automate what underlies the first block of code: looping over variables, drawing a plot for each, and then combining the plots.
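              For a sense of what multqplot automates, a hand-rolled version along the lines of the earlier box plot loop might look like this (a sketch; qplot must be installed):

              Code:
              local names
              foreach v in invest mvalue kstock {
                  qplot `v', name(q_`v', replace)
                  local names `names' q_`v'
              }
              graph combine `names', row(1) name(allq, replace)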




              The default shows minimum, maximum, median and quartiles as labelled points on the y axis. That explains why the values shown are typically not round or "nice" numbers at all. These are the five measures that can be read off a box plot, even if some or all of them coincide. As another nod to box plot conventions, the default shows cumulative probabilities 0(0.25)1 as labelled points on the x axis. Thus if you trace grid lines left and right and up and down you can identify the box of the box plot underneath the quantile array. The default grid line colour and width are subdued, but you could ramp them up if so inclined.

              Clearly some of the vertical labels are scrunched together whenever variables are highly skewed. Consider this as a deliberate feature to underline that you have to think about substantial skewness.

              The cumulative probability scale can be pushed through any transformation you can specify with Stata functions. Thus we can get normal quantile plots on the fly. If you do this you will need to reach in to change the axis labels.

              [Image: outlier_mq1.png]

              The horizontal scale is in standard normal deviates, i.e. quantiles of a normal distribution with mean 0 and standard deviation 1.

              The point about using the normal distribution as reference is not that we need marginal normal distributions for anything much -- really, we don't -- but that it provides a reference standard: the plot works to tell us about our data, fine and coarse structure alike. Here, although the logarithmic transforms look lumpy in the middle, I see nothing pathological that would mess up later modelling. (Remember, this is panel data, so the data are a mixture.)

              If you're disturbed or distracted by the vertical axis labels, then you can always reach in and change the style. For different variables with quite different measures the # syntax is ideal.


              [Image: outliers_mq2.png]

              So, this is my alternative to lots of box plots. Naturally much else could be said about checking for outliers and looking at distributions, such as

              * These are panels. So time series plots are needed as well.

              * It is easy to focus too much on marginal distributions. Look at bivariate and multivariate patterns too. So, scatter plot matrix, plot principal component scores, and so forth.

              * Sometimes it is best just to go quickly to your model but then check very carefully indeed whether outliers as defined by the model are messing up the fit.
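              On the first point, xtline draws time series plots by panel; a minimal sketch with the Grunfeld data:

              Code:
              webuse grunfeld, clear
              xtset company year
              xtline invest, overlay name(ts_invest, replace)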



              • #8
                I make here a separate note on negative values. Although Phil Bromiley flagged in #6 that cube roots can help pull in tails when values are variously positive and negative (and even zero too), I would want to know more about the variables before giving advice.

                For example a variable like GDP growth rate, which is mostly positive but occasionally negative, is often best left as it comes. Being open to transformations doesn't commit you to transforming every variable in sight.

