  • Outliers on Stata

    Is there any way to identify outliers using STATA?

  • #2
    There are many, many ways, depending on your definition of outliers. A good one is to plot your data and think about data points that seem surprising.



    • #3
      Hello!

      There are many ways to identify outliers. Here is a good document for reference: http://www3.nd.edu/~rwilliam/stats2/l24.pdf

      Anton



      • #4
        Thank you for the answers!
        I want to know if there is any STATA command that I can use! I don't want to use graphics!



        • #5
          You have to give much more specific detail on exactly what you are interested in to make fuller answers likely.

          Ignoring graphics here is a personal choice, as would be ignoring questions based on such a blinkered attitude.

          Please also note our preference for using full real names and for the correct spelling "Stata". See the FAQ Advice for more detail on this and other advice on posing questions.
          Last edited by Nick Cox; 23 Sep 2014, 11:00.



          • #6
            Hi Nick Cox,
            Thank you for the answer!
            I have a database with many variables, and I want to compare values between two groups of countries. I want to compare the mean, minimum, maximum and SD. But I want to eliminate the outliers, because I see that some values are too high.

            And my reason for not choosing graphics is that I have thousands of observations, so it will be more difficult to identify outliers! So I want to know if there is any command I can use that can tell me that a value of, for example, more than 500 is an outlier.



            • #7
              Keep in mind that you need strong theoretical justification in order to eliminate outliers from the analysis.



              • #8
                Thanks for trying to provide detail, but my answer remains pretty much the same.

                In effect, you are asking if there is a Stata command that will tell you if values are "too high". If you can translate that into some statistical criterion, then there will be Stata code to do it.

                In any case, eliminating outliers is a highly debatable tactic. It's just one of several possible actions and in my view usually one of the worst imaginable.

                There are entire books and many, many articles on treatment of outliers; the discussion by Richard Williams Anton cited in #3 is good and linked to Stata; another discussion is at http://stats.stackexchange.com/quest...iers-with-mean

                On graphics: I think you have it precisely the wrong way round. The more data you have, the easier it usually is to identify possible outliers or -- more importantly -- decide what to do given skewed or heavy-tailed distributions.
                Last edited by Nick Cox; 23 Sep 2014, 11:42.
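As a rough illustration of the point about graphics, here is a minimal sketch. It uses Stata's bundled auto dataset as a stand-in (an assumption; the poster's data are not shown), with price and foreign playing the roles of a measured variable and a two-group indicator.

```stata
* Sketch only: the auto dataset stands in for the poster's data.
sysuse auto, clear

* Box plots of price by group; points beyond the whiskers are drawn individually.
graph box price, over(foreign)

* Quantile plot: every value is plotted against its rank fraction,
* so the shape of both tails is visible however many observations there are.
quantile price
```

Both plots show every extreme value at once, which is why the number of observations is not in itself an obstacle.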



                • #9
                  Thanks for the answers!

                  Yes, that is what I want to know: whether it is possible for Stata to tell me that a value is "too high"!!!

                  I will follow your suggestion and see if I can resolve my problem.

                  Thank you once again for all the comments!!



                  • #10
                    "Think on a logarithmic scale" solves many more problems than eliminating outliers.
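A minimal sketch of what working on a logarithmic scale might look like, assuming a strictly positive, right-skewed variable. The auto dataset and the variable name ln_price are illustrative choices only, not anything from the poster's data.

```stata
* Sketch only: price in the auto dataset stands in for a skewed variable.
sysuse auto, clear

* ln() returns missing for zero or negative values, so restrict explicitly.
generate double ln_price = ln(price) if price > 0

* Compare skewness and the tails on the raw and logarithmic scales.
summarize price ln_price, detail

* ameans reports arithmetic, geometric and harmonic means; the geometric mean
* is the mean on the log scale, transformed back to the original units.
ameans price
```

If the large values look unremarkable on the log scale, they may simply belong to a skewed distribution rather than be errors.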



                    • #11
                      litosbrito: you got it the wrong way around: You need to tell Stata when a value is "too high". "Too high" is necessarily a subjective statement, so it can only be made by humans. You can think of a criterion and ask a computer (Stata) to apply that criterion, but you, and only you, can choose the criterion. But before you start on that road, try to answer this question: How can you hope to find anything new, if you first remove all surprising observations from your data?
                      ---------------------------------
                      Maarten L. Buis
                      University of Konstanz
                      Department of history and sociology
                      box 40
                      78457 Konstanz
                      Germany
                      http://www.maartenbuis.nl
                      ---------------------------------
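Once a criterion has been chosen, applying it is a one-line job. The sketch below uses simulated data: the variable x and the simulation settings are hypothetical, and the cutoff of 500 is simply the example value mentioned in #6, not a recommendation.

```stata
* Sketch only: simulated, right-skewed data with some missing values.
clear
set seed 12345
set obs 1000
generate double x = exp(rnormal(4, 1.5))
replace x = . in 1/10

* Flag values above the user-chosen cutoff; keep missing values missing
* rather than letting them count as "too high".
generate byte too_high = (x > 500) if !missing(x)

* How many values are flagged, and what do they look like?
tabulate too_high, missing
summarize x if too_high == 1
```

The `if !missing(x)` qualifier matters because Stata treats missing as larger than any number, so `x > 500` is true for missing values.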



                      • #12
                        Litosbrito (please, as per the FAQ, re-register with your full name and surname. Just click on the Contact us button at the bottom-right corner of the screen):
                        While I second all the previous sound advice, one option for detecting outliers is to loop over a variable list, as in the following toy example:
                        Code:
                        clear
                        set obs 100
                        g A=runiform()
                        g B=runiform()
                        g C=runiform()
                        foreach var of varlist A-C {
                            quietly summarize `var'
                            // Z_`var' flags the values beyond a threshold you decide to set
                            // (let's say, more than 3 standard deviations)
                            g Z_`var'=1 if `var'>3*r(sd)
                            replace Z_`var'=0 if Z_`var'==.
                            list `var' Z_`var' if Z_`var'==1
                        }
                        Kind regards,
                        Carlo
                        (Stata 16.0 SE)



                        • #13
                          Thank you for the suggestion, Carlo Lazzaro! I will try it!

                          Thank you Maarten Buis for the suggestion of analysis!



                          • #14
                            Carlo wrote code for an indicator variable flagging values more than 3 times the standard deviation (SD). But consider a bundle of countries with life expectancy mean 60 years and SD 10 years. Then all values >30 years would be flagged as outliers, but not those with <30 years (which on most other criteria would be staggering outliers).

                            I guess Carlo was just plucking a criterion out of the air as an example, or he was thinking about some criterion of the form |value - mean| > k SD, but made a slip in his coding. Even then the mean and SD are both likely to be strongly affected by outliers when they exist, so wouldn't we be better off using median and interquartile range (IQR), say, as the basis for any rule of thumb?

                            Hang on: we are rediscovering box plot criteria. For those who want tables, I wrote extremes (SSC) but don't use it much. It deliberately (or so I suppose) doesn't offer hooks for dropping outliers, which is almost always bad practice in my view.
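The two kinds of criterion described above can be sketched as follows. This is illustration only: the auto dataset is a stand-in, and the multipliers 3 (for the SD rule) and 1.5 (for the IQR rule, the usual box plot convention) are assumptions rather than recommendations. The code flags values; it does not drop them.

```stata
* Sketch only: price in the auto dataset stands in for the variable of interest.
sysuse auto, clear
local k 3    // arbitrary multiplier for the SD-based rule

* Store the summary statistics before any other command runs.
quietly summarize price, detail
local mean = r(mean)
local sd   = r(sd)
local p25  = r(p25)
local p75  = r(p75)
local iqr  = `p75' - `p25'

* Two-sided SD rule: |value - mean| > k SD
generate byte out_sd = (abs(price - `mean') > `k' * `sd') if !missing(price)

* Box plot style rule: more than 1.5 IQR beyond the quartiles
generate byte out_iqr = (price < `p25' - 1.5 * `iqr' | ///
                         price > `p75' + 1.5 * `iqr') if !missing(price)

* Compare what the two rules flag.
tabulate out_sd out_iqr
list make price if out_iqr == 1, noobs
```

The two rules need not agree, since the mean and SD are themselves pulled around by extreme values while the quartiles are not.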

                            P.S. A side-effect of Carlo's code is that missing values will be flagged as 1. Taking his criterion literally, the code might be rewritten

                            Code:
                              
                            foreach var of varlist A-C {    
                               quietly summarize `var'    
                               g Z_`var'= (`var' > 3*r(sd)) if `var' < .      
                               list `var' Z_`var' if Z_`var' == 1
                            }
                            A bigger problem is that looking for univariate outliers is only part of the problem. It's entirely possible to have bivariate outliers that aren't univariate outliers, trivariate outliers that aren't bivariate outliers, and so forth. There are naturally ways of finding these.
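One common way to look for bivariate oddities is through regression diagnostics. The sketch below is one possibility among many: the choice of price and weight from the auto dataset, and the cutoff of 2 for studentized residuals, are arbitrary assumptions made for illustration.

```stata
* Sketch only: the auto dataset and the model price ~ weight are stand-ins.
sysuse auto, clear
regress price weight

* Studentized residuals: how far each point lies from the fitted relationship.
predict double rstud, rstudent

* Leverage: how unusual each point is in terms of the predictor.
predict double lev, leverage

* List points that sit far from the joint pattern (cutoff of 2 is arbitrary).
list make price weight rstud lev if abs(rstud) > 2 & !missing(rstud), noobs
```

A car can have an unremarkable price and an unremarkable weight and still be listed here if the combination is unusual.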



                            • #15
                              Dear Nick,
                              I "was just plucking a criterion out of the air as an example..."
                              All your remarks are, as always, sound.

                              Kind regards,
                              Carlo
                              (Stata 16.0 SE)

