  • Getting rid of entire variables due to single outlier observations

    I have a dataset where values should be between 0.25 and 1.25 to make sense. Values outside this range are not useful, and the entire variable is then likely not worth analyzing.
    Therefore, I would like to drop every variable that has at least one observation outside this range.

    Example data:
    date var1001 var2001 var3001 ...
    21937 0.9 0.8 0.7
    21938 0.8 0.7 0.6
    21939 0.6 1.4 0.9
    21940 1.0 0.2 1.0
    In this example I would like to totally drop the entire variable var2001 - and I would like to do this in an automated way, as I have 1500 variables, and ~900 observations per variable.
    I really can't figure out how to do it, in either long or wide format.

    Really grateful for any help.

  • #2
    Code:
    foreach v of varlist var* {
        summ `v', meanonly
        drop `v' if r(min) < 0.25 | r(max) > 1.25
    }
    will do this.

    That said, this strikes me as a very extreme approach to outlying values. Perhaps in your context it makes sense, but would it not be better to investigate these outlying values to find out what the correct values would be and then substitute them in? Or perhaps just remove the offending values and perhaps consider some imputation process to deal with them? With your approach, a variable could have just one error in 900 observations and you will jettison the whole thing. You are discarding information at a breathtaking pace! Throwing out the baby with the bathwater.
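
    A minimal sketch of the gentler alternative mentioned above (removing only the offending values), assuming the variables are all named var* as in #1 and that setting out-of-range values to missing is acceptable for the analysis. Any later imputation step is not shown.

    Code:
    foreach v of varlist var* {
        * count nonmissing values outside the plausible range
        quietly count if !inrange(`v', 0.25, 1.25) & !missing(`v')
        if r(N) > 0 {
            display "`v': " r(N) " value(s) outside [0.25, 1.25]"
            * blank out only the offending values; the variable itself is kept
            replace `v' = . if !inrange(`v', 0.25, 1.25) & !missing(`v')
        }
    }
    Here the -if- qualifier on -replace- is appropriate, because the command is meant to act on individual observations.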


    • #3
      Dear Clyde Schechter, I ran the code in #2, but Stata reports an error. Could you please check it for me? Thank you.

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input int date double(var1001 var2001 var3001)
      21937 .9  .8 .7
      21938 .8  .7 .6
      21939 .6 1.4 .9
      21940  1  .2  1
      end
      Code:
      . set trace on
      
      . do "C:\Users\chen\AppData\Local\Temp\STD27e0_000000.tmp"
      
      . foreach v of varlist var* {
        2.     summ `v', meanonly
        3.     drop `v' if r(min) < 0.25 | r(max) > 1.25
        4. }
      - foreach v of varlist var* {
      - summ `v', meanonly
      = summ var1001, meanonly
      - drop `v' if r(min) < 0.25 | r(max) > 1.25
      = drop var1001 if r(min) < 0.25 | r(max) > 1.25
      invalid syntax
        }
      r(198);
      
      end of do-file
      
      r(198);
      Another version of the code is also reported as invalid:
      Code:
      . do "C:\Users\chen\AppData\Local\Temp\STD27e0_000000.tmp"
      
      . foreach v of varlist var* {
        2.  drop `v' if inrange(`v',0.25,1.25)==0
        3.  }
      - foreach v of varlist var* {
      - drop `v' if inrange(`v',0.25,1.25)==0
      = drop var1001 if inrange(var1001,0.25,1.25)==0
      invalid syntax
        }
      r(198);


      • #4
        Sorry, that was a truly dumb mistake on my part. I wrote an -if- qualifier when it should have been an -if- command:

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input int date double(var1001 var2001 var3001)
        21937 .9  .8 .7
        21938 .8  .7 .6
        21939 .6 1.4 .9
        21940  1  .2  1
        end
        
        foreach v of varlist var* {
            summ `v', meanonly
            if r(min) < 0.25 | r(max) > 1.25 {
                drop `v'
            }
        }
        By the way, your approach with -inrange()- will not work. -inrange()- cannot look at the entire variable: it looks at one observation at a time. The post in #1 calls for dropping the variable if it has any values < 0.25 or > 1.25.
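
        -inrange()- can still be used for this if its per-observation results are first aggregated, for example with -count-. A sketch under the assumption that missing values should not count as out of range (-summarize- likewise ignores missing values):

        Code:
        foreach v of varlist var* {
            * number of nonmissing observations outside [0.25, 1.25]
            quietly count if !inrange(`v', 0.25, 1.25) & !missing(`v')
            * -if- command: one yes/no decision for the whole variable
            if r(N) > 0 {
                drop `v'
            }
        }
        With the example data above, this should drop var2001 and keep var1001 and var3001, the same result as the -summarize- version.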
        Last edited by Clyde Schechter; 11 Nov 2021, 17:35.


        • #5
          Thank you Clyde, it works! I had thought there was a mistake in my do-file editor or something else. Does this mean that when we need to -drop- variables we cannot use an -if- qualifier in a loop?


          • #6
            Even outside a loop, -drop variable(s)- does not allow an if qualifier.* If you think about it this makes perfect sense. An -if- qualifier designates which observations to apply a command to. But -drop variable- cannot be so restricted: you either drop the whole variable or you keep it. On the other hand, you may want to only drop the variable(s) if it(they) satisfy some overall condition: and the syntax for that is an -if- command. An -if- command does not designate what observations to apply commands to: rather, it designates some condition on the current state of Stata and the data that doesn't necessarily refer to any particular observations, and then the command(s) inside the braces are executed or not, depending on whether the condition is satisfied as a whole.

            *Just to keep you confused: there are of course perfectly good commands like -drop if !inrange(var, 0.25, 1.25)-. The key difference is that there is no variable to be dropped here. Here the observations that satisfy the condition are to be dropped--and of course the way to distinguish which observations are the targets of this command is with an -if- qualifier. That's what -if- qualifiers do!
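
            The two forms can be put side by side. This is only an illustrative sketch using var2001 from the example data in #4; the two parts are alternatives, not steps to run in sequence.

            Code:
            * -if- qualifier: evaluated per observation.
            * Drops only the observations where var2001 is out of range.
            drop if !inrange(var2001, 0.25, 1.25)

            * -if- command: a single condition, evaluated once.
            * Drops the whole variable, or does nothing at all.
            summarize var2001, meanonly
            if r(min) < 0.25 | r(max) > 1.25 {
                drop var2001
            }
            One related pitfall: when a variable is referenced directly in an -if- command, as in -if var2001 < 0.25 {-, Stata evaluates it using the first observation only. That is why the condition should be built from a summary such as r(min), r(max), or r(N) rather than from the variable itself.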


            • #7
              Thank you Clyde, I should re-learn -help drop-.


              • #8
                Originally posted by Clyde Schechter
                That said, this strikes me as a very extreme approach to outlying values. Perhaps in your context it makes sense, but would it not be better to investigate these outlying values to find out what the correct values would be and then substitute them in? Or perhaps just remove the offending values and perhaps consider some imputation process to deal with them? With your approach, a variable could have just one error in 900 observations and you will jettison the whole thing. You are discarding information at a breathtaking pace! Throwing out the baby with the bathwater.
                Wow, all of this was tremendously useful for me (and apparently for others as well).
                I had been trying to use an if command, but it was either throwing out all my values - or none of them.

                And I agree, this is an extreme way of cleaning my data, but I'm still in the early process of looking at it, and I need this mostly to be able to explore the data. I will need to return to the outliers and identify why they are what they are - but that is a later step once I understand it better. Most likely nearly all of those outlying data points are faulty (or due to errors/variability in the data collection process) - making the entire variable uncertain and of little use. As the variables I am interested in are ratios of activity in 2019 versus 2020, they cannot plausibly vary much beyond that range (or, based on the hypothesis I'm looking at, such variation would be irrelevant for the outcome).

                Each variable is connected to a geographic location, with varying population - and I could exclude only those values that fulfill the combined criteria of low population, high variability (some arbitrary SD cutoff), AND extending beyond the given range (0.25 to ~1.25). However, this would be a later step, once I understand the data better. Also, the data I'm collecting will need to be analyzed using a different method in the coming years, as I will have more baseline data, so that I can do more than just use ratios.

                I can perhaps give an example of what the data looks like (very messy), when all of the variables are plotted in a graph - this is 2020 and 2021:

                [Image: test_4_3.jpg]

                Hope this makes sense, otherwise I'm very open to suggestions!

                I will get back to you once I can tell if the analysis works.

                EDIT:
                It works!

                Now I can start looking into the data in more depth (not plotting all variables in one graph). The major point is understanding what differentiates those variables within this main series that trend high from those that trend low.
                Graph after cleaning:
                [Image: test_5_2.jpg]
                Last edited by Carl Fredrik; 12 Nov 2021, 03:27.
