Dear Statalist users,
I am not sure whether the following problem is well-posed, but I'll try my best.
Suppose I have a dataset on two variables as follows:
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input float(var1 var2)
 21 123
 21 321
321 123
213 312
213   3
123  23
123   3
123   2
  2   1
  3   2
  2   3
 23  21
  2   2
 21   2
  2   2
  .   .
  .   .
  .   .
 32  32
end
These two variables denote the same underlying variable, but var2 contains some outlying observations that skew any inference I would like to make on it. For instance, a regression of y on var1 gives a certain set of results that are not replicated in a regression of y on var2. This is quite perplexing, as corr(var1,var2) is in excess of 0.9.
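The checks behind this claim look like the following (y here stands in for my outcome variable, which is not shown in the data excerpt above):
Code:
* the two regressions whose results disagree, plus the correlation check
regress y var1
regress y var2
correlate var1 var2
What I wish to do is as follows: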
1. Create a percentile variable for each of var1 and var2. This is achieved by using the xtile() function of egen (from the egenmore package on SSC):
Code:
* requires egenmore: ssc install egenmore
egen percentile1 = xtile(var1), nq(100)
egen percentile2 = xtile(var2), nq(100)
2. I use var1 as the benchmark for observations. In that sense, think of var1 as the "population" vector, whereas var2 is an observed sample.
3. Now, in line with that reasoning, I want to drop the outliers that cause the differences in the unconditional distribution of var2 relative to var1. As a potential solution, I wish to write a loop that drops, say, 5% of observations at a time and then recomputes the percentiles of var1 and var2 (a rough sketch follows this list).
4. This loop goes through all possible unique combinations of the data (these will be numerous, I imagine), computing the percentiles for each new subset of the data.
5. For each constructed dataset, I then calculate a distance metric, say the sum of squared differences between each percentile of var2 and the corresponding percentile of var1.
6. I then choose the dataset which minimizes this sum of squared differences.
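To make steps 3-6 concrete: looping over literally all combinations is infeasible for any realistic sample size, so the sketch below (untested, with purely illustrative names such as q1, q2, and best_k) substitutes a much cruder search that trims the tails of var2 in 5% steps and keeps the trim level that minimizes the distance metric:
Code:
* Rough sketch only. Assumes the sample stays large enough (>99 obs)
* for -pctile, nq(100)- at every trim level.
local best_ssd = .
local best_k   = .
forvalues k = 0(5)25 {
    preserve
    if `k' > 0 {
        * drop the most extreme k percent of var2 (k/2 from each tail)
        _pctile var2, percentiles(`=`k'/2' `=100-`k'/2')
        drop if var2 < r(r1) | var2 > r(r2)
    }
    * percentile values of each variable on the trimmed sample
    pctile double q1 = var1, nq(100)
    pctile double q2 = var2, nq(100)
    * distance metric: sum of squared differences across the 99 percentiles
    generate double d2 = (q1 - q2)^2
    quietly summarize d2
    local ssd = r(sum)
    display "trim `k'%: SSD = `ssd'"
    if `ssd' < `best_ssd' {    // missing is larger than any number
        local best_ssd = `ssd'
        local best_k   = `k'
    }
    restore
}
display "best trim level: `best_k'% (SSD = `best_ssd')"
The trimming shortcut obviously searches a much smaller space than step 4 envisions, so it may well miss the true minimizer.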
Any help/guidance on this is much appreciated.
Best,
Chinmay