  • Comparing Observations in Twos

    Hello Stata,


    I have 38,420 observations and 12 variables. Each variable is a numerical attribute of an observation, and higher values are better. I want to select observations based on these variables by first throwing out dominated observations. That is, if observation 2 is lower than observation 1 on all attributes, I delete observation 2. Given the number of observations in my data, and assuming that I compare two observations at a time, I will be doing 16 rounds of comparisons, or 38,419 comparisons in total.

    Is there any Stata command or package that can help me do it? If not, do you have any recommendations on the algorithm I should use?

    Thank you so much.

  • #2
    Please provide an extract of your data using -dataex-. See also the Statalist FAQ (esp. #12).

    • #3
      It is not clear why you think that at most 38,419 comparisons will need to be made.

      I can reproduce that number by assuming that in the first round, you compare 19,210 pairs of observations and drop from each pair one observation that is dominated, and proceed forward with 9,605 pairs of observations, again dropping one from each pair, and so on.

      But there is nothing that guarantees that in any comparison one of the two observations will dominate the other.

      Suppose there are 2 variables x and y, and it turns out that in every observation, x+y=1, or stated differently, y=1-x. Then in any pair of observations, say (x1,y1) and (x2,y2), if x1>x2 then (1-x1)<(1-x2) and thus y1<y2. So no matter what two observations you choose to compare, neither will dominate the other.
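
To illustrate the point above, here is a minimal sketch on made-up data. The six observations, the variable names x and y, and the local macro ndominated are assumptions for illustration only. It counts ordered pairs in which one observation is strictly larger on both variables, and with y = 1 - x it should find none.

```stata
* Hypothetical toy data: every observation satisfies x + y = 1
clear
set obs 6
generate double x = (_n - 1) / 5
generate double y = 1 - x

* Count ordered pairs (i, j) in which i is strictly larger than j on both x and y
local ndominated = 0
forvalues i = 1/`=_N' {
    forvalues j = 1/`=_N' {
        if x[`i'] > x[`j'] & y[`i'] > y[`j'] {
            local ++ndominated
        }
    }
}
display "dominating pairs found: `ndominated'"
```

Because no pair qualifies, a scheme that relies on dropping one observation per comparison would never shrink this data set.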

      • #4
        Here is an MWE:

        Code:
        sysuse auto
        dataex make price mpg headroom gear_ratio in 1/5
        list
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input str18 make int(price mpg) float(headroom gear_ratio)
        "AMC Concord"   4099 22 2.5 3.58
        "AMC Pacer"     4749 17   3 2.53
        "AMC Spirit"    3799 22   3 3.08
        "Buick Century" 4816 20 4.5 2.93
        "Buick Electra" 7827 15   4 2.41
        end

        I would like to compare the observations in pairs and eliminate the dominated ones. For example, Buick Century dominates AMC Pacer because it is larger on all 4 variables. I cannot throw out any other observations because there is no strict dominance.
        Here are my questions:
        1. Is there any package that can help me achieve this task?
        2. If not, what algorithm can do this job as quickly as possible?
        3. For those observations that cannot be eliminated, is there a way to judge which ones are better?

        Thank you!
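
For a data set as small as the example above, the pairwise check can be sketched by brute force. This is only an illustration under stated assumptions, not a tuned solution: the variable name dominated is introduced here, "dominated" is taken to mean strictly smaller on all four variables, and the data are assumed to have no missing values (Stata treats missing as larger than any number, so a missing value would wrongly count as dominating). The /// line continuation works in a do-file, not when typed interactively.

```stata
* Example data as posted above (output of -dataex-)
clear
input str18 make int(price mpg) float(headroom gear_ratio)
"AMC Concord"   4099 22 2.5 3.58
"AMC Pacer"     4749 17   3 2.53
"AMC Spirit"    3799 22   3 3.08
"Buick Century" 4816 20 4.5 2.93
"Buick Electra" 7827 15   4 2.41
end

* dominated is a hypothetical name introduced for this sketch
* Assumes no missing values: Stata treats missing as larger than any number
generate byte dominated = 0
forvalues i = 1/`=_N' {
    * count observations strictly larger than observation i on all four variables
    quietly count if price > price[`i'] & mpg > mpg[`i'] ///
        & headroom > headroom[`i'] & gear_ratio > gear_ratio[`i']
    quietly replace dominated = (r(N) > 0) in `i'
}

list make dominated, noobs
drop if dominated
```

On the five example observations this should flag only AMC Pacer. Each pass of the loop scans the whole data set, so the work grows with the square of the number of observations and may be slow on 38,420 rows.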

        Originally posted by Hemanshu Kumar:
        Please provide an extract of your data using -dataex-. See also the Statalist FAQ (esp. #12).

        • #5
          Yes, you are absolutely right. I guess I should have been more rigorous and mentioned this is the minimum number of comparisons. In the worst-case scenario, I would have to make 38,420 * 38,419 = 1,476,057,980 comparisons (?).
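
The counts mentioned in this thread can be checked with -display-. This is a quick sketch; the only assumption is that, in the worst case, every pair of observations has to be examined.

```stata
* Best case: each comparison eliminates one observation, so N - 1 comparisons
display %15.0fc 38420 - 1
* 38,419

* Ordered pairs (i, j) with i != j, as in the product quoted above
display %15.0fc 38420 * 38419
* 1,476,057,980

* Unordered pairs: each pair examined once, checking both directions in one go
display %15.0fc 38420 * 38419 / 2
* 738,028,990
```

If each comparison checks both directions at once, the worst case is the smaller, unordered count.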

          Originally posted by William Lisowski:
          It is not clear why you think that at most 38,419 comparisons will need to be made.

