
  • Stata 15 Dropping outliers based on mean and stdev drops everything

    Hi all,

    I'm trying to drop observations using the r(mean) and r(sd) results for a price variable, which I get from the summarize command. For some reason, Stata drops all observations, even when the price does not fall under my restriction. My code is below:

    sum price, d
    drop if price < (r(mean) - 5 * r(sd) ) | price> (r(mean) + 5 * r(sd))



    Below is my log when I run the above command:

    sum price, d

                                price
    -------------------------------------------------------------
          Percentiles      Smallest
     1%     .0969173              0
     5%     .1055639              0
    10%     .1224311              0       Obs           2,584,269
    25%     .1498837              0       Sum of Wgt.   2,584,269

    50%     .1888972                      Mean           .1920862
                            Largest       Std. Dev.      .0571629
    75%     .2265367        6.92407
    90%     .2674918        6.92407       Variance       .0032676
    95%     .2977228        6.92407       Skewness       3.061594
    99%     .3403101       6.984922       Kurtosis       303.4658

    . drop if price < (r(mean) - 5 * r(sd) ) | price> (r(mean) + 5 * r(sd))
    (2,584,269 observations deleted)


    The calculated value of r(mean) - 5 * r(sd) equals -0.0937 (i.e., 0.1920862 - 5*0.0571629), and, similarly, the calculated value of r(mean) + 5 * r(sd) equals 0.4779.

    Based on the above statistics, there are no negative price values, and fewer than 1% of the observations have a price greater than 0.4779. However, Stata drops all of them.


    Could someone please help me understand what is causing this?

    Thanks,
    DP

  • #2
    Hard to reproduce and I can't see anything obviously wrong. Gratuitous comment: intervals of mean +/- so many SDs make little sense for something so skewed. I would work with log of price. On that scale perhaps outliers will make sense anyway.
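
    A minimal sketch of the log-scale idea, assuming the auto dataset shipped with Stata as a stand-in for the real data and log_price as a hypothetical variable name. Note that the posted log shows zero prices among the smallest values, and ln(0) is missing, so it is worth counting how many observations would be lost on the log scale:

    ```stata
    * Sketch only: sysuse auto stands in for the poster's data.
    sysuse auto, clear

    * ln() returns missing for price <= 0, so zero prices drop out here.
    generate double log_price = ln(price)

    * How many nonmissing prices became missing on the log scale?
    count if missing(log_price) & !missing(price)

    * Inspect skewness and kurtosis on the log scale.
    summarize log_price, detail
    ```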



    • #3
      I'm wondering if perhaps dleasy pete ran some other command in between his -summ- and -drop- commands that might have obliterated the contents of r(). For example, contrast the behavior of these two code blocks:
      Code:
      . clear*
      
      . set obs 1000000
      number of observations (_N) was 0, now 1,000,000
      
      . set seed 1234
      
      . gen price = rgamma(1.1, .19/1.1)
      
      . summ price
      
          Variable |        Obs        Mean    Std. Dev.       Min        Max
      -------------+---------------------------------------------------------
             price |  1,000,000    .1900046    .1814986   2.24e-07   2.735444
      
      . drop if price > r(mean) + 5*r(sd) | price < r(mean) - 5*r(sd)
      (2,288 observations deleted)
      
      . 
      . 
      . clear*
      
      . set obs 1000000
      number of observations (_N) was 0, now 1,000,000
      
      . set seed 1234
      
      . gen price = rgamma(1.1, .19/1.1)
      
      . summ price
      
          Variable |        Obs        Mean    Std. Dev.       Min        Max
      -------------+---------------------------------------------------------
             price |  1,000,000    .1900046    .1814986   2.24e-07   2.735444
      
      . count if price > r(mean) + 5*r(sd) | price < r(mean) - 5*r(sd)
        2,288
      
      . drop if price > r(mean) + 5*r(sd) | price < r(mean) - 5*r(sd)
      (1,000,000 observations deleted)
      What happens in the second one is that the -count- command overwrites what -summ- had put in r(), so r(mean) and r(sd) are no longer defined. The -drop- command then treats them as missing values. Because Stata treats a missing value as larger than any number, price < missing is true for every nonmissing price, so every observation is dropped.
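
      One way to guard against this, sketched here under assumptions: copy the r() results into local macros immediately after -summ-, so that any later r-class command cannot affect the cutoffs. The auto dataset stands in for the real data, and the local names lo and hi are hypothetical.

      ```stata
      * Sketch only: sysuse auto stands in for the poster's data.
      sysuse auto, clear
      summarize price

      * Save the cutoffs right away, before any other command touches r().
      local lo = r(mean) - 5*r(sd)
      local hi = r(mean) + 5*r(sd)

      * -count- overwrites r(), but the locals are unaffected.
      count if price < `lo' | price > `hi'

      * The !missing() guard keeps missing prices from matching "price > `hi'".
      drop if (price < `lo' | price > `hi') & !missing(price)
      ```

      Running -return list- just before the -drop- is a quick way to see whether r(mean) and r(sd) are still there.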

      That said, I agree with Nick that trimming the data on mean +/- some number of SD is probably not useful for a variable with this level of skewness.

      By the way, is dleasy pete your real name? It is the norm in this community, to promote collegiality and professionalism, that we use our real given name and surname as our username. If dleasy pete is not your real name, please click on Contact Us in the lower right corner of the page and ask the system administrator to change your username. Thank you.



      • #4
        There's a typo in your command. You've left out an opening parenthesis after the "|".
        It should be:
        Code:
        drop if price < (r(mean) - 5 * r(sd) ) | (price> (r(mean) + 5 * r(sd))
        There are other problems with your criterion:
        • It is not robust: Extreme outliers will pull the mean towards them and inflate the S.D. For that reason, criteria like yours have been abandoned. In the simplest replacement, the mean is replaced by the median and the SD by a multiple of the inter-quartile distance.
        • Extreme outliers might constitute an interesting group in themselves. Have you looked at a histogram or density plot?
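
        A rough sketch of the median/inter-quartile alternative, flagging rather than dropping so the extreme group can be inspected. The multiplier of 3, the local names, and the variable flag_out are assumptions for illustration, and the auto dataset stands in for the real data:

        ```stata
        * Sketch only: sysuse auto stands in for the poster's data.
        sysuse auto, clear
        summarize price, detail

        * Save the robust summaries before another command overwrites r().
        local med = r(p50)
        local iqr = r(p75) - r(p25)
        local k   = 3                 // assumed multiplier, not a recommendation

        * Flag, rather than drop, prices far from the median.
        generate byte flag_out = (price < `med' - `k'*`iqr' | price > `med' + `k'*`iqr') & !missing(price)
        tabulate flag_out

        * Look at the distribution before deciding what to do with flagged cases.
        histogram price
        ```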

        Pasting from your log into the Forum editor lost the column lineups. Please in future posts paste commands and results between CODE delimiters [CODE] and [/CODE].
        Last edited by Steve Samuels; 10 Aug 2018, 12:41.
        Steve Samuels
        Statistical Consulting
        [email protected]

        Stata 14.2



        • #5
          Hi Nick,

          The program actually runs perfectly fine in Stata 14, but in Stata 15 it drops everything. Is there a way to check for differences in settings between the two versions of Stata?

          Thanks,
          DP



          • #6
            What happens if you correct the typo in Stata 15? I have Version 14.2 and the problem code, unbalanced parenthesis and all, runs on a small example.

            I would have expected Stata to generate an "invalid ')' " r(198) error; indeed it does if I correct the typo and omit the first parenthesis in your code.

            I've written to Tech Support about this.
            Last edited by Steve Samuels; 10 Aug 2018, 13:45.


            • #7
              Oops: there's no typo. Sorry for the excitement. I think I'll stop for the weekend.

              Steve
              Last edited by Steve Samuels; 10 Aug 2018, 14:23.