Efficient Alternatives to cross or joinby for Unique Pairwise Comparisons

Vinicius Lima

Join Date: Dec 2019

Posts: 10
#1

23 Nov 2024, 06:25

Hello everyone,

I am using the cross and joinby commands to create pairwise comparisons of individual outcomes.

Suppose individuals in my dataset are indexed by id = 1, 2, ..., n. For my analysis, it is irrelevant whether I use the pair (1,2) or (2,1) because both provide the same information.

This means that after generating all pairs with cross or joinby, I always discard one of the duplicates.

Is there a more efficient way to generate unique, non-redundant pairs directly?

Thank you.

Best regards,

Vinicius Lima
Leonardo Guizzetti

Join Date: Jul 2016

Posts: 2402
#2

23 Nov 2024, 08:06

That depends on what exactly you are finding inefficient. I understand you are having to eliminate half of all permutations of pairs, but that in itself is not necessarily slow. The merge-style commands (cross and joinby included) are too general and don't allow you to create only the subset of pairs you need. Here's one way to make the pairs; from there you can decide which data you wish to merge in, since that may require renaming variables. The key is to make yourself a "map" frame, which is fairly space- and time-efficient to produce. You'll also notice that I use the Cartesian product of the pseudo-identifiers and then filter out what I don't need. The approach is general enough that it does not require the original subject identifier to be of any specific type or definition.

Code:

clear *
cls

mkf Mydata
cwf Mydata
input pid out1 out2
10 1 4
20 2 5
30 3 6
end
sort pid

tempfile id1 id2
frame put pid , into(ID1)
cwf ID1
save `id1'
frame put pid , into(ID2)
cwf ID2
save `id2'

mkf AllIDs
cwf AllIDs
append using `id1' `id2'
sort pid
by pid: keep if _n==1
gen `c(obs_t)' pseudoid = _n
qui summ pseudoid, meanonly
local Npid = r(max)

mkf Pairs
cwf Pairs
qui set obs `Npid'
gen long paira = _n
expand `Npid'
sort paira
by paira : gen long pairb = _n
qui keep if paira < pairb
qui compress

frlink m:1 paira , frame(AllIDs pseudoid) gen(la)
frlink m:1 pairb , frame(AllIDs pseudoid) gen(lb)
frget pid1=pid, from(la)
frget pid2=pid, from(lb)
drop paira pairb la lb

Result

Code:

     +-------------+
     | pid1   pid2 |
     |-------------|
  1. |   10     20 |
  2. |   10     30 |
  3. |   20     30 |
     +-------------+
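As a possible next step, the outcomes can be attached to each side of a pair using the same frlink/frget pattern. The following is only a sketch under stated assumptions: the frame names Mydata and Pairs and the variables pid, out1, out2 follow the toy example in this post, while the link names l1 and l2 and the new variable names out1_1, out2_1, out1_2, out2_2 are hypothetical choices made for illustration. The pairs are typed in by hand here so that the sketch runs on its own.

```stata
* Sketch: attach each member's outcomes to a frame of unordered pairs.
clear all

* Toy data, as in the example in this post
mkf Mydata
cwf Mydata
input pid out1 out2
10 1 4
20 2 5
30 3 6
end

* The three unordered pairs, entered by hand for this sketch
mkf Pairs
cwf Pairs
input pid1 pid2
10 20
10 30
20 30
end

* Link each side of the pair to Mydata (l1 and l2 are hypothetical names)
frlink m:1 pid1, frame(Mydata pid) generate(l1)
frlink m:1 pid2, frame(Mydata pid) generate(l2)

* Copy the outcomes under new names so the two sides do not clash
frget out1_1 = out1, from(l1)
frget out2_1 = out2, from(l1)
frget out1_2 = out1, from(l2)
frget out2_2 = out2, from(l2)

drop l1 l2
list, noobs
```

Renaming on the way in (out1_1 versus out1_2) is what keeps the two members' outcomes apart within one observation.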
Clyde Schechter

Join Date: Apr 2014

Posts: 30095
#3

23 Nov 2024, 10:37

Here's a simpler way. (Well, it's simpler because the complexity is hidden inside the -rangejoin- command. But it is fast and pretty conservative with memory.)

Code:

// CREATE A DEMONSTRATION DATA SET
clear*
set obs 1000
gen `c(obs_t)' id = _n

// MAKE A TEMPORARY COPY OF THE DATA SET
tempfile copy
save `copy'

// FORM THE SET OF UNORDERED NON-DIAGONAL PAIRS
rangejoin id . -1 using `copy'
drop if missing(id_U)

This requires that the id variable be an integer (though it can be stored as a float or double if necessary). If the id variable is a string, the use of -encode- or -egen, group()- will give you a suitable integer variable in 1:1 correspondence with the original id variable that can be used instead.

-rangejoin- is written by Robert Picard and is available from SSC. To use it, you must also install -rangestat-, by Robert Picard, Nick Cox, and Roberto Ferrer, also available from SSC.

Added: By the way, that final -drop if missing(id_U)- command is needed to drop just a single pair, namely the pair where id takes on its smallest value--for which there is no possible pairing to a smaller value of id.
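For readers who cannot install community-contributed commands, the pairs can also be built directly with the built-in -expand- command. This is a different technique from the -rangejoin- approach above, offered only as a minimal sketch. It assumes the ids have already been mapped to the consecutive integers 1, ..., n (for example with -egen, group()-). The local macro n and the variable names id1 and id2 are hypothetical.

```stata
* Sketch: unordered, non-diagonal pairs using built-in commands only.
* Assumes ids are the consecutive integers 1..n.
clear
local n = 5                      // hypothetical number of individuals
set obs `n'
gen long id1 = _n

drop if id1 == 1                 // the smallest id has no smaller partner
expand id1 - 1                   // id1 = k ends up with k-1 observations
bysort id1: gen long id2 = _n    // partners 1, ..., k-1 within each id1

* Sanity checks: every pair has id2 < id1, and there are n(n-1)/2 pairs
assert id2 < id1
assert _N == `n' * (`n' - 1) / 2
```

Because only pairs with id2 < id1 are ever created, nothing has to be discarded afterwards, and the result holds n(n-1)/2 observations rather than the n^2 produced by a full cross.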
Vinicius Lima

Join Date: Dec 2019

Posts: 10
#4

25 Nov 2024, 07:15

Thank you both for the advice!