  • Dropping outliers without dropping missing values

    Dear Stata-users,

    I have a strongly balanced panel data set. I have generated a new variable holding the growth rate of the variable "x":

    bys id: gen x_growth= (x[_n]-x[_n-1])/x[_n-1]
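
    A caution about the line above: bys id: sorts on id only, so the order of observations within each panel is not guaranteed to be chronological. A minimal sketch of a time-ordered alternative, assuming the time variable is called year (a hypothetical name; the post does not give one):

    ```stata
    * "year" is a hypothetical name for the panel's time variable.
    * Declare the panel structure so time-series operators respect id and time.
    xtset id year

    * D.x is x minus its previous-period value, L.x is the previous-period value.
    * The first period of each panel gets a missing growth rate.
    gen x_growth = D.x / L.x
    ```

    With a balanced panel and no gaps in year, this should give the same values as the [_n-1] formula applied to correctly sorted data.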

    Next, I would like to drop the outliers. I'm using the following code:

    drop if x_growth < r(p1) | x_growth > r(p99)

    However, this code also drops every observation where x_growth is missing, and I would like to keep those observations.

    Thank you in advance,

    Pierre


  • #2
    Stata treats numeric missing values as larger than any nonmissing number, so they satisfy the condition x_growth > r(p99) and are dropped.
    (http://www.stata.com/support/faqs/da...issing-values/)


    The following should work:

    drop if (x_growth < r(p1) | x_growth > r(p99)) & x_growth < .
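
    For completeness, r(p1) and r(p99) are only defined after a command that stores them, which neither snippet in this thread shows. A minimal sketch of the full sequence, assuming the percentiles come from summarize with the detail option and that x_growth already exists:

    ```stata
    * summarize, detail stores the 1st and 99th percentiles in r(p1) and r(p99).
    * Missing values are excluded from the percentile calculation.
    summarize x_growth, detail

    * Run the drop immediately, before another command overwrites r().
    * !missing() protects the missing growth rates from being dropped.
    drop if !missing(x_growth) & (x_growth < r(p1) | x_growth > r(p99))
    ```

    Writing the missing-value check with missing() and explicit parentheses avoids relying on the fact that & is evaluated before | in Stata.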



    • #3
      I know no good rationale for this procedure. "Outliers" are observations separated from the rest of the data. Being in the upper or lower 1% is not a definition. If it were, 2% of any data set would consist of outliers, an assumption I would not care to defend. On the one hand, the upper and lower 1% could be perfectly good data points with no evidence of separation. On the other, there could be more "real outliers" than 2%, and these will be missed. Even more interesting, extreme groups could be evidence of multi-modality, and this could be the most important feature of the data, one that you should investigate. I suggest that you take a close look at histograms and density plots.
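
      The plots suggested above could be sketched as follows, assuming x_growth already exists; the bin count and graph names are arbitrary illustrative choices, not recommendations:

      ```stata
      * Histogram of the growth rate; observations with missing x_growth are ignored.
      * bin(50) is an arbitrary choice made for this sketch.
      histogram x_growth, bin(50) name(h_growth, replace)

      * Kernel density estimate of the same variable, kept as a separate graph.
      kdensity x_growth, name(k_growth, replace)
      ```

      Gaps between the bulk of the distribution and the tails, or more than one peak, are the kinds of features to look for before deciding that any value is an outlier.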
      Last edited by Steve Samuels; 24 Jul 2014, 18:46.
      Steve Samuels
      Statistical Consulting

      Stata 14.2



      • #4
        Not only do I support Steve's statement, but I would go further. To me, outliers are values that are surprising, and that statement implies some "model" of what is going on; maybe your model is wrong. When there are "separated" values, you must think about what is going on and whether you are looking at the data in the right way.
