  • Efficient way of removing outliers in many variables?

    Hi all,
    I need to remove outliers (using the 1.5 × IQR criterion at both ends) in about 70 variables. I'm looking for software that can help me do this as simply and straightforwardly as possible, i.e., without having to first compute the 1.5 × IQR cutoffs for each variable, then flag outliers with a binary variable, then remove them or generate a new variable, one variable at a time. I want to do it all in one go: remove all outliers on each individual variable in a varlist and generate new variables without outliers, named with a new prefix or suffix.

    I understand that the IQR rule is not the best approach, and I understand its limitations and the potential problems of identifying outliers this way, but the project I'm working on requires it.

    Is there user-written software that makes this process efficient?

    With thanks,
    Massao
    Last edited by Senor Massao; 02 Nov 2022, 08:56.

  • #2
    To my understanding, taking literally what you ask for leads to something impossible: The IQR is a property of an individual variable, so one would *have* to work on one variable at a time. Also, I'm not sure what you mean by "remove." I'm going to assume that by "remove" you mean "change the value to missing," and that you want to avoid having to make changes manually, one variable at a time. With that understanding, I'd suggest this loop:
    Code:
    foreach v of varlist YourVar1 ..... YourVar70 {  // put in your own variable list
       quiet summ `v', detail
       quiet replace `v' = . if !inrange(`v', r(p25), r(p75))
    }


    • #3
      I think Mike Lacy means

      Code:
      quiet replace `v' = . if !inrange(`v', r(p25) - 1.5 * (r(p75) - r(p25)), r(p75) + 1.5 * (r(p75) - r(p25))
      although the code might be speeded up doing some of those calculations just once, and I think this would be terrible practice for anything close to what I do.
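
      The speed-up mentioned above could look something like the following sketch, which computes the IQR and the two cutoffs once per variable and holds them in local macros. The macro names iqr, lo and hi are illustrative, and the variable names are placeholders, not anything from the original data.

```stata
foreach v of varlist YourVar1 YourVar2 YourVar3 {   // placeholder names: substitute your own varlist
    quietly summarize `v', detail
    local iqr = r(p75) - r(p25)          // interquartile range, computed once
    local lo  = r(p25) - 1.5 * `iqr'     // lower cutoff
    local hi  = r(p75) + 1.5 * `iqr'     // upper cutoff
    quietly replace `v' = . if !inrange(`v', `lo', `hi')
}
```

      Values that are already missing stay missing, since inrange() returns 0 for a missing first argument.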


      • #4
        Yes, thanks go to Nick Cox for that friendly correction. I got so caught up in other features of the question that I ignored the 1.5. <grin> Nick's suggestion makes me wonder, though, about the point relative to which outliers are to be defined. His code interprets Senor's intention as defining outliers relative to the 25th and 75th percentiles, but perhaps Senor had in mind to define them relative to some central point, e.g. median +/- 1.5 * IQR.
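
        If the intended rule were instead median +/- 1.5 * IQR, a sketch along the same lines might use r(p50), which summarize with the detail option also returns. Whether that rule is what is wanted is an assumption here; the macro name half and the variable names are placeholders.

```stata
foreach v of varlist YourVar1 YourVar2 YourVar3 {   // placeholder names: substitute your own varlist
    quietly summarize `v', detail
    local half = 1.5 * (r(p75) - r(p25))             // 1.5 * IQR, computed once
    quietly replace `v' = . if !inrange(`v', r(p50) - `half', r(p50) + `half')
}
```

        Note that this interval is narrower than one built outward from the quartiles, so it would set more values to missing.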


        • #5
          The criterion mentioned was one that Tukey in the 1970s used to identify on boxplots interesting points, at least modestly extreme, to show individually and think about. Thinking about might include, say, realising that a transformation would help mightily.

          Somewhen in the last 50 years someone took that to mean bad data points to discard as quickly as possible. Where is this justified?


          • #6
            Originally posted by Mike Lacy
            To my understanding, taking literally what you ask for leads to something impossible: The IQR is a property of an individual variable, so one would *have* to work on one variable at a time.
            Sorry for being unclear. Yes, I think your code is in the right direction. Many thanks. As IQR is an individual property of each variable, some sort of a loop would be needed to utilize IQR from each variable.
            Originally posted by Mike Lacy
            Also, I'm not sure what you mean by "remove." I'm going to assume that by "remove" you mean "change the value to missing,"
            Yes, that is what I meant, to convert them into missing.
            Originally posted by Mike Lacy
            and that you want to avoid having to make changes manually, one variable at a time. With that understanding, I'd suggest this loop:
            Code:
            foreach v of varlist YourVar1 ..... YourVar70 { // put in your own variable list
            quiet summ `v', detail
            quiet replace `v' = . if !inrange(`v', r(p25), r(p75))
            }
            Many thanks. I tried the code with suggested change by Nick, as follows:
            Code:
            foreach v of varlist logsa93 loga772 logh897 {
               quiet summ `v', detail
               quiet replace `v' = . if !inrange(`v', r(p25) - 1.5 * (r(p75) - r(p25)), r(p75) + 1.5 * (r(p75) - r(p25))
            }
            Only three variables included for simplicity. I’m getting an error:
            too few ‘)’ or ’]’ included
            Any ideas where I might have been wrong in transferring the code?


            • #7
              Originally posted by Nick Cox
              I think Mike Lacy means

              Code:
              quiet replace `v' = . if !inrange(`v', r(p25) - 1.5 * (r(p75) - r(p25)), r(p75) + 1.5 * (r(p75) - r(p25))
              although the code might be speeded up doing some of those calculations just once
              Many thanks :-)
              Originally posted by Nick Cox
              , and I think this would be terrible practice
              I completely agree. I hope nobody else uses this without realizing the potential issues with it. I have been 'asked' to do it, and I know this is absolutely bad, to say the least. But a quicker coding solution should reduce my frustration somewhat :-)


              • #8
                Originally posted by Nick Cox
                The criterion mentioned was one that Tukey in the 1970s used to identify on boxplots interesting points, at least modestly extreme, to show individually and think about. Thinking about might include, say, realising that a transformation would help mightily.

                Somewhen in the last 50 years someone took that to mean bad data points to discard as quickly as possible. Where is this justified?
                Agreed. Transformation does help and yes, the criterion was originally meant as part of making box plots by hand in Tukey's book from 1977. And yes, it is bad practice to remove univariate outliers like this, but as I indicated, it's part of the 'job', so even with 'disclaimers', I still need to apply this approach :-)


                • #9
                  #6 Needs extra ) at the end. My bad.
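
                  With the closing parenthesis added, the loop from #6 would read roughly as below. Since the original question asked for new variables rather than overwriting the old ones, this sketch first copies each variable with clonevar; the suffix _trim is a made-up name for illustration only.

```stata
foreach v of varlist logsa93 loga772 logh897 {
    clonevar `v'_trim = `v'              // hypothetical suffix; the original variable is left untouched
    quietly summarize `v', detail
    quietly replace `v'_trim = . if !inrange(`v', r(p25) - 1.5 * (r(p75) - r(p25)), r(p75) + 1.5 * (r(p75) - r(p25)))
}
```

                  clonevar stops with an error if the new name already exists, so the copies would need to be dropped before rerunning the loop.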


                  • #10
                    Originally posted by Nick Cox
                    #6 Needs extra ) at the end. My bad.
                    Many thanks. Kind regards, Massao
