
  • Dropping outliers based on different criteria

    Dear all,

    I use Stata 13 and want to calculate the average value of X after removing outliers (observations below the 1st or above the 99th percentile) in several datasets. However, when many values of X are equal to the percentile value, I want to choose which of those observations to drop based on the values of other variables (for instance, those with the smallest values of variable Z). Is there a command that allows me to tag outliers this way?

    I have written a program (see code below) that creates an outlier dummy accordingly, but it is slow and I have to repeat the procedure many times. Would you know of a more efficient way to do this? Thank you so much.

    Code:
    program define trim_criteria 
    
    qui centile `1', centile(1 99)
    scalar p1= r(c_1)
    scalar p99= r(c_2)
    scalar tot= r(N)
    scalar tot1 = round(tot*0.01)
    
    
    gsort -v elasticity_sign -imp_v_share -exp_v_share imp_iso3 aff_iso3     // I sort the data according to these six variables
    
    * outlier dummy: top 1% first, then bottom 1%
    qui gen d1_`1'=1 if float(`1')>float(p99) & `1'!=.
    qui count if float(p99)==float(`1')
    if r(N)>0 {
        qui count if d1_`1'==1
        qui scalar j1= tot1-r(N)
        qui gen f1 = _n if float(p99)==float(`1')
        qui egen g1=rank (f1), field
        qui replace d1_`1' = 1 if g1<=j1
    }
    qui count if d1_`1'==1
    cap assert r(N)==tot1 
    if _rc!=0 {
        assert r(N)==tot1-1
    }
    
    qui replace d1_`1'=1 if float(`1')<float(p1) & `1'!=.
    qui count if float(p1)==float(`1')
    if r(N)>0 {
        qui count if d1_`1'==1
        qui scalar j2= (2*tot1)-r(N)
        qui gen f2 = _n if float(p1)==float(`1') & d1_`1'!=1
        qui egen g2=rank (f2), field
        qui replace d1_`1' = 1 if g2<=j2
    }
    qui count if d1_`1'==1
    cap assert r(N)==2*tot1
    if _rc!=0 {
        cap assert r(N)==2*tot1-1
        if _rc!=0 {
            assert r(N)==2*tot1-2
        }
    }
    foreach var in f1 g1 f2 g2 {
        cap drop `var'
    }
    end

  • #2
    You'll increase your chances of a useful answer by following the FAQ on asking questions - provide Stata code in code delimiters, readable Stata output, and sample data using dataex.

    If your program works, I wouldn't waste time trying to make it faster - let it run overnight or over the weekend. You only need to winsorize variables once.
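
    If speed does turn out to matter, one possible alternative is to tag by sort position instead of comparing against centile values. The following is only a rough sketch, not a drop-in replacement for the program in #1: the program name tag_tails, the variable names x, z and out_x, and the single tie-breaking variable are all assumptions made for illustration. It tags exactly round(0.01*N) observations in each tail, which may differ slightly from the counts the original program asserts, and observations tied on both x and z are picked in arbitrary order.

    ```stata
    * Hypothetical sketch: tag the k lowest and k highest values of a variable,
    * breaking ties by the smallest values of a second variable.
    capture program drop tag_tails
    program define tag_tails
        version 13
        syntax varname(numeric), TIEvar(varname numeric) GENerate(name)
        confirm new variable `generate'

        * k = 1% of the nonmissing observations, rounded
        quietly count if !missing(`varlist')
        local n = r(N)
        local k = round(`n' * 0.01)

        * negated copy so the upper tail can be reached with a plain sort
        tempvar negx
        quietly gen double `negx' = -`varlist'

        * lower tail: smallest x first; among ties, smallest z first
        sort `varlist' `tievar'
        quietly gen byte `generate' = (_n <= `k') & !missing(`varlist')

        * upper tail: largest x first; among ties, smallest z first
        sort `negx' `tievar'
        quietly replace `generate' = 1 if _n <= `k' & !missing(`varlist')
    end
    ```

    Assuming variables named x and z exist, it could be called as "tag_tails x, tievar(z) generate(out_x)" followed by "summarize x if !out_x". Missing values sort last in both passes, so they are never tagged. The program leaves the data sorted by descending x.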
