  • litosbrito
    started a topic Outliers on Stata

    Outliers on Stata

    Is there any way to identify outliers using Stata?

  • Nick Cox
    replied
    #25 John W. Tukey proposed a rule of thumb to plot points separately on a box plot if greater than p75 + 1.5 IQR or less than p25 - 1.5 IQR.

    So far, so good. This wasn't a recipe for identifying points to drop. In most cases the occurrence of outliers was, at least for Tukey, a signal to think about a transformation.
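As a rough illustration of treating outliers as a cue to consider a transformation, the sketch below compares box plots on the raw and logarithmic scales. It assumes the `auto` dataset shipped with Stata and the variable `price`, neither of which comes from this thread; `ln_price` and the graph names are illustrative.

```stata
* Sketch only: auto data and price are stand-ins for your own data
sysuse auto, clear

* Box plot on the raw scale: points beyond the 1.5 IQR fences are plotted separately
graph box price, name(raw, replace)

* Same variable on a logarithmic scale (ln() needs strictly positive values)
generate ln_price = ln(price)
graph box ln_price, name(logged, replace)
```

If the points plotted individually on the raw scale fall inside the whiskers after logging, the skewness of the raw scale rather than bad data may be what made them look unusual.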

  • Kithinji Charles
    replied
    *This example shows how to highlight outliers using percentiles
    input x
    1
    2
    12
    14
    15
    14
    16
    15
    14
    98
    76
    end
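    * Show outliers using a box plot
    graph box x
    * Then summarize with the detail option to get the quartiles in r()
    summarize x, detail
    return list
    * Flag values at or beyond 1.5 IQR from the quartiles; missing values are never flagged
    gen x_outlier=1 if (x<=r(p25)-(1.5*(r(p75)-r(p25)))|x>=r(p75)+(1.5*(r(p75)-r(p25)))) & !missing(x)
    * Keep only the flagged observations for inspection (this discards the rest of the data in memory)
    keep if x_outlier==1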

  • Denila Jinny
    replied
    Originally posted by Nick Cox
    I don't know anything you don't about SEM. My advice is to start a new thread with a title like "Non-normality and structural equation models" so that people who know about SEM can see that. Also, I would show some graphs of the distributions of your continuous variables to give us some flavour.
    OK. Thank you very much.

  • Nick Cox
    replied
    I don't know anything you don't about SEM. My advice is to start a new thread with a title like "Non-normality and structural equation models" so that people who know about SEM can see that. Also, I would show some graphs of the distributions of your continuous variables to give us some flavour.
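A minimal sketch of the kind of distribution graphs suggested here, assuming the `auto` dataset and the variables `price`, `mpg` and `weight` as stand-ins for the poster's continuous variables (none of these are from the thread):

```stata
* Sketch only: replace the dataset and variable list with your own
sysuse auto, clear

foreach v of varlist price mpg weight {
    * Histogram with a normal density overlaid for comparison
    histogram `v', normal name(hist_`v', replace)

    * Normal quantile plot: marked curvature signals non-normality
    qnorm `v', name(qnorm_`v', replace)
}
```

Quantile plots show every observation, so they remain informative about the tails even with thousands of observations.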

  • Denila Jinny
    replied
    Originally posted by Nick Cox
    Denila Jinny Not at all. A large sample can be highly non-normal too. To give a better answer, we need to know more about your data and your goals, especially on whether or why you think your data "should be" normal.
    Thank you very much sir for your immediate reply.
    I am working with cross-sectional data. My objective is to study the causal relationships between funding, profitability and productivity. The literature suggests bi-directional relationships among these variables. Therefore I intend to fit a non-recursive SEM, one of the assumptions of which is normality. I have 4 continuous variables, 1 interaction variable that interacts 2 continuous variables, 2 interaction variables that each interact one continuous variable with 1 dichotomous variable, and a few other categorical variables. Will this be enough for you to help me with this issue?

  • Nick Cox
    replied
    Denila Jinny Not at all. A large sample can be highly non-normal too. To give a better answer, we need to know more about your data and your goals, especially on whether or why you think your data "should be" normal.

  • Denila Jinny
    replied
    Originally posted by Nick Cox
    "Think on a logarithmic scale" solves many more problems than eliminating outliers.
    Sir,

    What should I do, if the log value is also not normal?

    I have a dataset with 9000 observations. Can I assume normality just because the sample is large?

    Denila.

  • Wesley Mokkink
    replied
    Steve Samuels I started a new thread.
    http://www.statalist.org/forums/forum/general-stata-discussion/general/1336660-remove-outliers-on-stata

  • Steve Samuels
    replied
    You have now tacked a question on to a thread that was closed over a year ago. Start a new thread.

  • Wesley Mokkink
    replied
    Nick Cox Dear Nick,

    I installed the "extremes" code written by you. I would like to use it to remove extreme values in my sample. However, I do not know how to actually remove those extreme values instead of just listing them. Is there any way to do this?

    ​Thanks in advance!

    Kind regards,

    Wesley

  • Attaullah Shah
    replied
    Nick is right in his point about multivariate outliers. As a matter of fact, I have seen many papers in finance that winsorize or drop values that are more than 3 SD away from the mean. In that case, we can adopt the following code:
    sysuse auto, clear
    foreach x of varlist price mpg {
        quietly summarize `x'
        * drop values more than 3 SD from the mean in either direction; missing values are kept
        drop if abs(`x' - r(mean)) > 3*r(sd) & !missing(`x')
    }
    Regards
    Attaullah Shah
    Last edited by Attaullah Shah; 24 Sep 2014, 04:10.
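The post mentions winsorizing as the alternative to dropping. A minimal sketch of that idea follows, assuming cutoffs at the 1st and 99th percentiles; the cutoffs, the `_w` suffix and the local macro names are illustrative choices, not something stated in the thread.

```stata
* Sketch only: percentile cutoffs are an assumption, not a recommendation
sysuse auto, clear

foreach x of varlist price mpg {
    quietly summarize `x', detail
    local lo = r(p1)
    local hi = r(p99)

    * Work on a copy so the original variable is left untouched
    generate `x'_w = `x'
    replace `x'_w = `lo' if `x' < `lo'
    * Missing values sort above every number in Stata, so exclude them explicitly
    replace `x'_w = `hi' if `x' > `hi' & !missing(`x')
}
```

Unlike `drop`, this keeps every observation and only pulls the extreme values in to the cutoffs, so the sample size does not change.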

  • Carlo Lazzaro
    replied
    Dear Nick,
    I was just plucking a criterion out of the air as an example...
    All your remarks are, as always, sound.

    Kind regards,
    Carlo

  • Nick Cox
    replied
    Carlo wrote code for an indicator variable flagging values more than 3 times the standard deviation (SD). But consider a bundle of countries with life expectancy mean 60 years and SD 10 years. Then all values >30 years would be flagged as outliers, but not those with <30 years (which on most other criteria would be staggering outliers).

    I guess Carlo was just plucking a criterion out of the air as an example, or he was thinking about some criterion of the form |value - mean| > k SD, but made a slip in his coding. Even then the mean and SD are both likely to be strongly affected by outliers when they exist, so wouldn't we be better off using median and interquartile range (IQR), say, as the basis for any rule of thumb?

    Hang on: we are rediscovering box plot criteria. For those who want tables, I wrote extremes (SSC) but don't use it much. It deliberately (or so I suppose) doesn't offer hooks for dropping outliers, which is almost always bad practice in my view.

    P.S. A side-effect of Carlo's code is that missing values will be flagged as 1. Taking his criterion literally, the code might be rewritten

    Code:
      
    foreach var of varlist A-C {    
       quietly summarize `var'    
       g Z_`var'= (`var' > 3*r(sd)) if `var' < .      
       list `var' Z_`var' if Z_`var' == 1
    }
    A bigger problem is that looking for univariate outliers is only part of the problem. It's entirely possible to have bivariate outliers that aren't univariate outliers, trivariate outliers that aren't bivariate outliers, and so forth. There are naturally ways of finding these.
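For comparison, the two univariate rules of thumb discussed in this post, |value - mean| > k SD and the median/IQR (box plot) criterion, could be sketched as indicator variables as follows. This is only a sketch: the `auto` dataset, the variables `price` and `mpg`, k = 3 and the `sd_`/`iqr_` prefixes are all assumptions for illustration, and values are flagged for inspection, not dropped.

```stata
* Sketch only: flags values for inspection, does not drop anything
sysuse auto, clear
local k 3

foreach var of varlist price mpg {
    quietly summarize `var', detail

    * Symmetric mean/SD rule; the if qualifier leaves missing values unflagged
    generate byte sd_`var' = abs(`var' - r(mean)) > `k' * r(sd) if !missing(`var')

    * Box plot rule: beyond 1.5 IQR from the nearer quartile
    generate byte iqr_`var' = (`var' < r(p25) - 1.5 * (r(p75) - r(p25)) | ///
        `var' > r(p75) + 1.5 * (r(p75) - r(p25))) if !missing(`var')

    list `var' sd_`var' iqr_`var' if sd_`var' == 1 | iqr_`var' == 1
}
```

The two indicators need not agree: the mean and SD are themselves pulled around by extreme values, while the quartiles are much less affected.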

  • litosbrito
    replied
    Thank you for the suggestion, Carlo Lazzaro! I will try it!

    Thank you Maarten Buis for the suggestion of analysis!
