Deleting observations to match percentiles of two datasets

Chinmay Sharma

Join Date: Nov 2015

Posts: 351
#1

26 Jul 2018, 11:42

Dear Statalist users,

I am not sure whether the following problem is well-posed, but I'll try my best.
Suppose I have a dataset on two variables as follows:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(x1 x2)
 21 123
 21 321
321 123
213 312
213   3
123  23
123   3
123   2
  2   1
  3   2
  2   3
 23  21
  2   2
 21   2
  2   2
  .   .
  .   .
  .   .
 32  32
end

These two variables denote the same underlying variable, but some observations in var2 are outliers that skew any inference I would like to make with var2. For instance, a regression of y on var1 gives a certain set of results that are not replicated in a regression of y on var2. This is quite perplexing, as corr(var1, var2) is in excess of 0.9. What I wish to do is as follows:

1. Create a percentile variable for each of var1 and var2. This is achieved using the following command (shown here for var1):

Code:

egen percentile = xtile(var1), nq(100)

2. I use var1 as the benchmark for observations. In that sense, think of this as the "population" vector of variables, whereas var2 is an observed sample.
3. Now, in line with that reasoning, I want to drop those outliers that are causing differences in the unconditional distribution of var2 relative to var1. As a potential solution, what I wish to do is to create a loop that drops 5% (say) of observations each time, and then computes the percentiles of var1 and var2.
4. This loop goes through all possible unique combinations of the data (this will be numerous I imagine), all the while computing the percentile for each new set of data.
5. For each constructed dataset, I then calculate a distance metric, say the sum of squared differences between the corresponding percentiles of var2 and var1.
6. I then choose the dataset which minimizes this sum of squared differences.
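
A note on feasibility, with a rough sketch. Step 4 as written is combinatorial: dropping 5% of N observations in every possible way means "N choose 0.05N" candidate datasets, far too many to enumerate for any sizeable N. The sketch below swaps that exhaustive search for a much cruder one-dimensional search, under the assumption that the outliers sit in the upper tail of the second variable: it trims the top k% of x2 for k = 0, ..., 10, recomputes the 99 percentile cut points with pctile, and records the sum of squared differences from the benchmark percentiles of x1. The data are simulated, and the names x1, x2, rank2, p1, p2, sqdiff and the 0 to 10% grid are illustrative assumptions, not anything from the original post.

```stata
* Simulated stand-in data: x1 is the benchmark, x2 a noisy copy with outliers
clear
set seed 12345
set obs 1000
generate x1 = rnormal(50, 10)
generate x2 = x1 + rnormal(0, 2)
replace x2 = x2 + 200 in 1/20          // 2% simulated upper-tail outliers

* Benchmark: 99 percentile cut points of x1, stored in the first 99 observations
pctile p1 = x1, nquantiles(100)

* Rank x2 from the top so that rank 1 is the largest value
egen rank2 = rank(-x2)

local best_ssd = .
local best_k = .
forvalues k = 0/10 {
    * Percentiles of x2 after discarding its top k percent
    capture drop p2 sqdiff
    pctile p2 = x2 if rank2 > `k' * _N / 100, nquantiles(100)

    * Distance metric: sum of squared differences across the 99 cut points
    generate sqdiff = (p2 - p1)^2
    quietly summarize sqdiff
    local ssd = r(sum)
    display "trim top `k'%: SSD = " %12.2f `ssd'

    * Keep track of the trimming level with the smallest distance
    if `ssd' < `best_ssd' {
        local best_ssd = `ssd'
        local best_k = `k'
    }
}
display "smallest SSD at k = `best_k'"
```

A two-sided or data-driven trimming rule would need a different candidate set. Whatever rule is used, the resulting sample is selected on the variable of interest, so any estimates based on it should be read with that in mind.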

Any help/guidance on this is much appreciated.

Best,
Chinmay
Phil Bromiley

Join Date: Apr 2014

Posts: 4348
#2

27 Jul 2018, 11:09

You'll increase your chances of a helpful answer by following the FAQ on asking questions - provide Stata code in code delimiters, readable Stata output, and sample data using dataex.

This doesn't seem like a very sensible exercise. You're cooking the observations on var2 to get the same results as you did with var1. What's the point? Any analysis using var2 is then based only on your systematic manipulation of the data. You're creating sample selection bias intentionally.

If you're worried about outliers, look at the documentation (in regress and regress postestimation) on influential observations and consider winsorizing the data.
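
For readers who want to try these two suggestions, here is a minimal sketch on simulated data. It winsorizes x2 by hand at the 5th and 95th percentiles (an arbitrary choice made for illustration) and then computes Cook's distance and DFBETAs after regress. The outcome y, the variable names x2_w and cook, and the data-generating process are all assumptions introduced for the example.

```stata
* Simulated stand-in data with a hypothetical outcome y
clear
set seed 12345
set obs 1000
generate x1 = rnormal(50, 10)
generate x2 = x1 + rnormal(0, 2)
replace x2 = x2 + 200 in 1/20              // simulated outliers in x2
generate y = 1 + 0.5 * x1 + rnormal(0, 5)

* Winsorize x2 by hand: clamp values outside the 5th/95th percentiles
_pctile x2, percentiles(5 95)
generate x2_w = min(max(x2, r(r1)), r(r2)) if !missing(x2)

* Compare the benchmark regression with the raw and winsorized versions
regress y x1
regress y x2
regress y x2_w

* Influence diagnostics for the regression on the raw x2
quietly regress y x2
predict cook, cooksd                       // Cook's distance
dfbeta                                     // one DFBETA variable per regressor
summarize cook, detail
```

Winsorizing keeps every observation and only caps extreme values, so unlike deletion it does not change the sample size.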
Chinmay Sharma

Join Date: Nov 2015

Posts: 351
#3

27 Jul 2018, 11:19

@Phil, the point is not to cook the data. The point is that these two datasets should in fact be equivalent, and I want to see precisely which data points are causing them not to be. Using dfbeta statistics, Cook's distance measures, etc. doesn't solve the problem, as the dataset is huge and it is very unlikely that any one observation is going to be influential. But anyway, point taken.
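
One rough way to see which data points differ, without deleting anything, is to compare each observation's percentile rank under the two variables. The sketch below assumes the two variables are paired row by row, meaning each row measures the same unit twice; the thread does not state this explicitly. The data are simulated, and pct1, pct2 and gap are illustrative names.

```stata
* Simulated stand-in data: x2 is a noisy copy of x1 with a few outliers
clear
set seed 12345
set obs 1000
generate x1 = rnormal(50, 10)
generate x2 = x1 + rnormal(0, 2)
replace x2 = x2 + 200 in 1/20          // simulated outliers

* Percentile rank (1-100) of each observation under each variable
xtile pct1 = x1, nquantiles(100)
xtile pct2 = x2, nquantiles(100)

* Rows whose rank differs most between the two measurements
generate gap = abs(pct1 - pct2)
gsort -gap
list x1 x2 pct1 pct2 gap in 1/10
```

This uses the official xtile command rather than the egen xtile() function; the rows at the top of the listing are candidates for closer inspection, not for automatic deletion.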