  • Identifying and eliminating outliers in extremely large data sets

    Hello,

    I am working with a dataset with >200K observations and potential outliers that number in the low thousands. I can easily identify outliers by, say, Cook's distance and the like. I can even generate a list of all these outliers in Stata without much trouble. The question is: what Stata code will quickly and efficiently drop all of these essentially random observations whose only common link is, say, a high Cook's distance? So far the only way I have come up with is to do it manually, and the number of outliers is considerable.


  • #2
    -help drop-

    That said, you probably shouldn't do this in the first place.



    • #3
      For the dataset in hand, there is likely considerable input error (i.e., misplaced decimals), so we are considering it.
      Regardless, the problem is that I have to drop each observation one at a time, manually. For thousands of observations, that may be a problem.
      I don't know of a way to simply tell Stata to drop observations with high Cook's distances.



      • #4
        So you calculate Cook's distance and store the result in a variable. Let's call the variable cook_d. Suppose your threshold for an outlier (actually an influential observation) is 1. Then the command is

        Code:
        drop if cook_d > 1 & !missing(cook_d)
        It seems that you are brand new to Stata. I suggest that before you attempt to undertake any real work, you invest time in learning the basics. Your Stata installation comes with PDF manuals installed. Select PDF Documentation on the Help menu. Read the Getting Started [GS] and User's Guide [U] volumes of the documentation. These will acquaint you with basic Stata operations, the approach to data management, and elementary data analysis. You'll be exposed to the commands that are used every day when working with Stata. There are many worked examples in the documentation to illustrate how things are done. You won't remember every detail, but you will learn enough that in most situations you will know what commands are likely to be useful for your current task, and then you will be able to refer to the help files and the PDF documentation to fill in the details of syntax and the details of just how the commands are implemented.



        • #5
          Dear Luke,

          I believe that what you want can be done like this:

          Code:
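          sysuse auto, clear
          regress price mpg rep78 trunk
          * store Cook's distance in cd (missing where rep78 is missing)
          predict cd, cooksd
          * missing values count as larger than any number, so exclude them explicitly
          drop if cd > 0.2 & !missing(cd)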
          Note, however, that this will drop influential observations, not observations with misplaced decimals. So, just like Clyde Schechter, I wonder whether this is a sensible thing to do; probably it is not!
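
          Since dropping may not be sensible, a minimal sketch of flagging and inspecting the influential observations before deciding anything might look like this. The variable name high_cd is hypothetical, and the 0.2 cutoff is only illustrative, not a recommended threshold:

          Code:
          sysuse auto, clear
          regress price mpg rep78 trunk
          predict cd, cooksd
          * flag rather than drop; high_cd stays missing where cd is missing
          generate byte high_cd = cd > 0.2 if !missing(cd)
          * see how many observations are flagged and what they look like
          count if high_cd == 1
          list make price mpg cd if high_cd == 1
          Looking at the flagged rows first makes it easier to tell a data-entry error from a genuine, merely influential, observation.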

          Best wishes,

          Joao
          PS: Oops... too late!



          • #6
            If your data have a natural range, then you can legitimately drop observations outside that range. E.g., if x can only range from 0 to 1, it is legitimate to delete values >1 or <0. All other techniques to handle extreme values have generated substantial debate (which can be seen by looking at discussions of outliers on Statalist).
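
            As a minimal sketch of the natural-range case, assuming a hypothetical variable x that can only take values from 0 to 1:

            Code:
            * check how many observations fall outside the natural range
            count if (x < 0 | x > 1) & !missing(x)
            * drop them, leaving observations with missing x untouched
            drop if (x < 0 | x > 1) & !missing(x)
            The !missing(x) condition matters because Stata treats missing values as larger than any number, so x > 1 on its own would also drop observations where x is missing.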

