  • Deleting observations to match percentiles of two datasets

    Dear Statalist users,


    I am not sure whether the following problem is well posed, but I'll try my best.
    Suppose I have a dataset on two variables as follows:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float(x1 x2)
     21 123
     21 321
    321 123
    213 312
    213   3
    123  23
    123   3
    123   2
      2   1
      3   2
      2   3
     23  21
      2   2
     21   2
      2   2
      .   .
      .   .
      .   .
     32  32
    end
    These two variables (x1 and x2 in the example above, referred to below as var1 and var2) measure the same underlying quantity, but some observations in var2 are outliers that skew any inference I would like to make from var2. For instance, a regression of y on var1 gives a set of results that is not replicated in a regression of y on var2. This is quite perplexing, as corr(var1, var2) is in excess of 0.9. What I wish to do is as follows:

    1. Create a percentile variable for each of var1 and var2. This is achieved using:
    Code:
    egen percentile = xtile(var1), nq(100)
    2. I use var1 as the benchmark. In that sense, think of var1 as the "population" vector, whereas var2 is an observed sample.
    3. In line with that reasoning, I want to drop the outliers that cause the unconditional distribution of var2 to differ from that of var1. As a potential solution, I would like to write a loop that drops 5% (say) of the observations each time and then computes the percentiles of var1 and var2.
    4. This loop goes through all possible unique combinations of the data (I imagine there will be a great many), computing the percentiles for each new set of data.
    5. For each constructed dataset, I then calculate a distance metric, say the sum of the squared differences between each percentile of var2 and the corresponding percentile of var1.
    6. I then choose the dataset that minimizes this sum of squared differences.
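
    The distance in step 5 can be sketched for a single candidate dataset as follows. This is only a rough illustration, not the full search over combinations in step 4: it assumes var1 and var2 are in memory, trims only the top 5% of var2, and compares the percentile values themselves rather than percentile-group indicators. The names keep2, cut, q1, and ssd are made up for the example.

    ```stata
    * Sketch only: one 5% trim of var2, then the distance from step 5.
    * Assumes var1 (benchmark) and var2 are in memory; keep2, cut, q1,
    * and ssd are illustrative names, not part of any package.
    _pctile var2, percentiles(95)
    scalar cut = r(r1)
    generate byte keep2 = (var2 <= scalar(cut)) if !missing(var2)

    scalar ssd = 0
    forvalues p = 1/99 {
        * p-th percentile of the benchmark
        _pctile var1, percentiles(`p')
        scalar q1 = r(r1)
        * p-th percentile of the trimmed sample
        _pctile var2 if keep2 == 1, percentiles(`p')
        scalar ssd = scalar(ssd) + (scalar(q1) - r(r1))^2
    }
    display "Sum of squared percentile differences: " scalar(ssd)
    ```

    Repeating this for different trimming rules and keeping the rule with the smallest ssd would approximate step 6.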


    Any help/guidance on this is much appreciated.


    Best,
    Chinmay

  • #2
    You'll increase your chances of a helpful answer by following the FAQ on asking questions: provide Stata code in code delimiters, readable Stata output, and sample data using dataex.

    This doesn't seem like a very sensible exercise. You're cooking the observations on var2 to get the same results as you did with var1. What's the point? Any analysis using var2 would then just reflect your systematic manipulation of the data. You're creating sample selection bias intentionally.

    If you're worried about outliers, look at the documentation (in regress and regress postestimation) on influential observations and consider winsorizing the data.
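
    A minimal sketch of both suggestions, assuming an outcome variable y and the regressor var2 are in memory. The 4/N cutoff for Cook's distance and the 1st/99th percentile bounds are common rules of thumb chosen here for illustration, and the names cooksd, lo, hi, and var2_w are made up for the example.

    ```stata
    * Influence diagnostics after the regression of y on var2.
    regress y var2
    predict double cooksd, cooksd      // Cook's distance per observation
    dfbeta                             // creates DFBETA variable(s) for the regressors
    count if cooksd > 4/e(N) & !missing(cooksd)

    * Manual winsorizing of var2 at the 1st and 99th percentiles.
    _pctile var2, percentiles(1 99)
    scalar lo = r(r1)
    scalar hi = r(r2)
    generate double var2_w = min(max(var2, scalar(lo)), scalar(hi)) if !missing(var2)
    regress y var2_w
    ```

    Comparing the two sets of regress output shows how much the extreme values of var2 drive the coefficient.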



    • #3
      @Phil, the point is not to cook the data. The point is that these two datasets should in fact be equivalent, and I want to see precisely which data points are causing them not to be. Using dfbeta statistics, Cook's distance measures, etc. doesn't solve the problem, as the dataset is huge and it is very unlikely that any one observation is going to be influential. But anyway, point taken.

