  • Efficient Alternatives to cross or joinby for Unique Pairwise Comparisons

    Hello everyone,

    I am using the cross and joinby commands to create pairwise comparisons of individual outcomes.

    Suppose individuals in my dataset are indexed by id = 1, 2, ..., n. For my analysis, it is irrelevant whether I use the pair (1,2) or (2,1) because both provide the same information.

    This means that after generating all pairs with cross or joinby, I always discard one of the duplicates.

    Is there a more efficient way to generate unique, non-redundant pairs directly?

    Thank you.

    Best regards,

    Vinicius Lima

  • #2
    That depends on what exactly you are finding inefficient. I understand that you have to eliminate half of all permutations of pairs, but that in itself is not necessarily slow. The merge-style commands (cross and joinby included) are too general and don't allow you to create only the subset of pairs you need. Here's one way to make the pairs; from there you can decide which data you wish to merge in, since that may require renaming variables. The key is to make yourself a "map" frame, which is fairly space- and time-efficient to produce. You'll also notice that I still use the Cartesian product and then filter out what I don't need. It is also general enough not to require the original subject identifier to be of any specific type or definition.

    Code:
    clear *
    cls
    
    mkf Mydata
    cwf Mydata
    input pid out1 out2
    10 1 4
    20 2 5
    30 3 6
    end
    sort pid
    
    tempfile id1 id2
    
    frame put pid , into(ID1)
    cwf ID1
    save `id1'
    
    frame put pid , into(ID2)
    cwf ID2
    save `id2'
    
    mkf AllIDs
    cwf AllIDs
    append using `id1' `id2'
    sort pid
    by pid: keep if _n==1
    gen `c(obs_t)' pseudoid = _n
    
    qui summ pseudoid, meanonly
    local Npid = r(max)
    
    mkf Pairs
    cwf Pairs
    qui set obs `Npid'
    gen long paira = _n
    expand `Npid'
    sort paira
    by paira : gen long pairb = _n
    qui keep if paira < pairb
    qui compress
    
    frlink m:1 paira , frame(AllIDs pseudoid) gen(la)
    frlink m:1 pairb , frame(AllIDs pseudoid) gen(lb)
    frget pid1=pid, from(la)
    frget pid2=pid, from(lb)
    drop paira pairb la lb
    Result

    Code:
         +-------------+
         | pid1   pid2 |
         |-------------|
      1. |   10     20 |
      2. |   10     30 |
      3. |   20     30 |
         +-------------+
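
    A minimal sketch of merging the outcomes in, in case it helps. It assumes the frames Mydata and Pairs from the code above still exist; the link names l1 and l2 and the _a/_b variable names are just illustrative choices, not anything the method requires.

    Code:
    cwf Pairs
    
    // link each member of the pair back to its row in Mydata
    frlink m:1 pid1 , frame(Mydata pid) gen(l1)
    frlink m:1 pid2 , frame(Mydata pid) gen(l2)
    
    // copy the outcomes over under new names so the two members don't collide
    frget out1_a = out1, from(l1)
    frget out2_a = out2, from(l1)
    frget out1_b = out1, from(l2)
    frget out2_b = out2, from(l2)
    drop l1 l2

    Each row of Pairs then carries the outcomes of both individuals side by side.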

    • #3
      Here's a simpler way. (Well, it's simpler because the complexity is hidden inside the -rangejoin- command. But it is fast and pretty conservative with memory.)

      Code:
      // CREATE A DEMONSTRATION DATA SET
      clear*
      set obs 1000
      gen `c(obs_t)' id = _n
      
      //  MAKE A TEMPORARY COPY OF THE DATA SET
      tempfile copy
      save `copy'
      
      //  FORM THE SET OF UNORDERED NON-DIAGONAL PAIRS
      rangejoin id . -1 using `copy'
      drop if missing(id_U)

      This requires that the id variable take integer values (though it can be stored as a float or double if necessary). If the id variable is a string, -encode- or -egen, group()- will give you a suitable integer variable in 1:1 correspondence with the original id variable that can be used instead.
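
      For instance, a sketch with a string identifier (the variable names id_str and id_num are hypothetical, and the three values are made up for illustration):

      Code:
      clear
      input str3 id_str
      "abc"
      "def"
      "ghi"
      end
      
      // integer id in 1:1 correspondence with the string id
      egen long id_num = group(id_str)
      
      tempfile copy
      save `copy'
      
      // pair each observation with all observations having a smaller id_num
      rangejoin id_num . -1 using `copy'
      drop if missing(id_num_U)

      Variables coming from the using data set get the _U suffix, so the original string ids of each pair should end up in id_str and id_str_U.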

      -rangejoin- is written by Robert Picard and is available from SSC. To use it, you must also install -rangestat-, by Robert Picard, Nick Cox, and Roberto Ferrer, also available from SSC.
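
      Assuming your Stata has internet access, both can be installed with:

      Code:
      ssc install rangestat
      ssc install rangejoin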

      Added: By the way, that final -drop if missing(id_U)- command is needed to drop just a single observation, namely the one where id takes on its smallest value, for which there is no possible pairing to a smaller value of id.
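
      As a quick sanity check (a sketch that assumes the 1,000-observation demonstration data above), n ids should yield n(n-1)/2 unordered pairs, and within each pair the id from the copy should be the smaller one:

      Code:
      // run after the -drop if missing(id_U)- line
      assert _N == 1000 * 999 / 2
      assert id_U < id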

      • #4
        Thank you both for the advice!
