Creating "random" pairs with restrictions on which cases can be paired together

IYH Svien

Join Date: Sep 2017

Posts: 14
#1

19 Nov 2018, 11:01

I have a data set that includes three variables, a person ID, a family ID, and a neighborhood ID. To help fix ideas, the structure of the data looks like this:

Code:

person_id  family_id  neigh_id
 1         1          1
 2         1          1
 3         2          1
 4         2          1
 5         3          1
 6         3          1
 7         4          2
 8         4          2
 9         5          2
10         5          2
11         6          2
12         6          2

What I'd like to do is randomly pair each person with another person in their neighborhood who isn't also in their family. Any advice on how to proceed -- even if it's only enough to get me on the right track -- would be very much appreciated.
Clyde Schechter

Join Date: Apr 2014

Posts: 30147
#2

19 Nov 2018, 11:26

So, you do this in stages. First you create a copy of the data file, with the variables person_id and family_id renamed so that you can distinguish them from the originals. Then you join the two files together, pairing each person_id with every other person_id in the same neighborhood, and drop those cases that come from the same family. Finally, you assign a random number to each observation, and for each person_id, you pick the one with the smallest random number.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte(person_id family_id neigh_id)
 1 1 1
 2 1 1
 3 2 1
 4 2 1
 5 3 1
 6 3 1
 7 4 2
 8 4 2
 9 5 2
10 5 2
11 6 2
12 6 2
end

preserve
rename person_id person_id2
rename family_id family_id2
tempfile matches
save `matches'
restore

joinby neigh_id using `matches'
drop if family_id == family_id2

set seed 1234
gen double shuffle = runiform()
by person_id (shuffle), sort: keep if _n == 1

Note: if what you are looking for is pairing for some kind of matched-pairs statistical analysis, this will work fine. You may notice that the same person may be re-used as a match for more than one person_id. For matched-pairs analysis this is perfectly OK. If you have some other situation where you must have a different match for each person, then the third stage of the process is different and more complicated. You can post back for help with that if this is your situation.

In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 15.1 or a fully updated version 14.2, it is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
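
As a minimal sketch of the -dataex- advice above, the following enters the example data from #1 and then asks -dataex- to print it in a form that can be pasted straight into a post. The variable list passed to -dataex- is just the three variables from this thread; on versions where -dataex- is not yet installed, -ssc install dataex- would need to be run first.

```stata
* Enter the example data from post #1
clear
input byte(person_id family_id neigh_id)
 1 1 1
 2 1 1
 3 2 1
 4 2 1
 5 3 1
 6 3 1
 7 4 2
 8 4 2
 9 5 2
10 5 2
11 6 2
12 6 2
end

* Print the data as a ready-to-paste -input- block
* (by default -dataex- shows at most the first 100 observations)
dataex person_id family_id neigh_id
```

The output can be copied from the Results window and pasted between code delimiters in a forum post.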
IYH Svien

Join Date: Sep 2017

Posts: 14
#3

19 Nov 2018, 11:45

Thanks, Clyde. I am indeed in a situation where it's inappropriate to reuse the same person in more than one pair.

(Thanks for the tip on how to show data examples. I knew there was probably a better way than putting it in a code block, but didn't know about -dataex-.)
Clyde Schechter

Join Date: Apr 2014

Posts: 30147
#4

19 Nov 2018, 12:11

So replace the final "paragraph" of the code (the one that begins with set seed 1234) with:

Code:

set seed 1234
gen double shuffle = runiform()
sort person_id shuffle
local i = 1
while `i' < _N {
    local matched_id = person_id2[`i']
    local pid = person_id[`i']
    drop if person_id2 == `matched_id' in `=`i'+1'/L
    drop if person_id == `pid' in `=`i'+1'/L
    local ++i
}

Notes:
1. In your example data, things work out so that everybody can be matched uniquely with somebody else. In the full data you may find that you run out of matches for some people if you can't use the same person as a match twice.

2. The number 1234 in the -set seed- command is arbitrary. Pick any positive integer you like. You will get different results with different seeds, but what matters is that you be able to reproduce the results if you need to do it over--that is why you need to set a seed.

3. In the event you have a very large data set, say more than 1,000,000 people, then you need to generate two random numbers, call them shuffle1 and shuffle2, and then you have to -sort person_id shuffle1 shuffle2-.
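
For illustration, note 3 might look like the sketch below. It rebuilds the pair-level data from the example in #2 and then sorts on two random numbers instead of one. The names shuffle1 and shuffle2 come from the note; the seed is arbitrary, as in note 2.

```stata
* Rebuild the pair-level data from the example in #2
clear
input byte(person_id family_id neigh_id)
 1 1 1
 2 1 1
 3 2 1
 4 2 1
 5 3 1
 6 3 1
 7 4 2
 8 4 2
 9 5 2
10 5 2
11 6 2
12 6 2
end

preserve
rename person_id person_id2
rename family_id family_id2
tempfile matches
save `matches'
restore

joinby neigh_id using `matches'
drop if family_id == family_id2

* Two random numbers: shuffle2 orders any observations that tie on shuffle1
set seed 1234
gen double shuffle1 = runiform()
gen double shuffle2 = runiform()
sort person_id shuffle1 shuffle2
```

The matching loop then runs unchanged on the sorted data.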
IYH Svien

Join Date: Sep 2017

Posts: 14
#5

19 Nov 2018, 12:48

Thanks, Clyde. This is really helpful.

One remaining question: When I run your code on the example data, I end up with this:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(person_id family_id neigh_id person_id2 family_id2)
 1 1 1  4 2
 2 1 1  3 2
 3 2 1  6 3
 4 2 1  2 1
 5 3 1  1 1
 7 4 2 10 5
 8 4 2  9 5
 9 5 2  8 4
10 5 2  7 4
end

If I'm understanding correctly, it looks as though person 1 is linked to person 4. But person 4 is not linked to person 1 (as they should be); instead they're linked to person 2. Do you have advice on how to edit the code so that the links are internally consistent (i.e., 1 is linked to 4 and 4 is linked to 1)?
Clyde Schechter

Join Date: Apr 2014

Posts: 30147
#6

19 Nov 2018, 13:05

Well, rather than do that, here is code that just gives you a list with each pairing given once, with the lower numbered person in the pair listed first. The parts that are changed from before are the two -drop- commands inside the loop and everything after the loop.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte(person_id family_id neigh_id)
 1 1 1
 2 1 1
 3 2 1
 4 2 1
 5 3 1
 6 3 1
 7 4 2
 8 4 2
 9 5 2
10 5 2
11 6 2
12 6 2
end

preserve
rename person_id person_id2
rename family_id family_id2
tempfile matches
save `matches'
restore

joinby neigh_id using `matches'
drop if family_id == family_id2

set seed 1234
gen double shuffle = runiform()
sort person_id shuffle
local i = 1
while `i' < _N {
    local matched_id = person_id2[`i']
    local pid = person_id[`i']
    * drop every later pair that involves either member of the current pair
    drop if inlist(`matched_id', person_id, person_id2) & inrange(_n, `i'+1, _N)
    drop if inlist(`pid', person_id, person_id2) & inrange(_n, `i'+1, _N)
    local ++i
}

drop shuffle
gen p1 = min(person_id, person_id2), before(person_id)
gen p2 = max(person_id, person_id2), after(p1)
* take each family ID from the same person as p1/p2, so the result stays
* correct even when family IDs do not rise with person IDs
gen f1 = cond(person_id < person_id2, family_id, family_id2), after(p1)
gen f2 = cond(person_id < person_id2, family_id2, family_id), after(p2)
drop person_id* family_id*
rename p? person_id?
rename f? family_id?

Added: With these changes, you no longer necessarily end up with everybody having a match. In some cases, a person who might be the only match possible for person k gets matched instead to person j, with j < k, and is thereby removed from the pool of potential matches. This leaves person k unmatched. The extent and pattern of this kind of non-matching varies with the random number seed. If your first attempt to apply this to your real data leaves you with an unacceptably large number of unmatched people, you can change the random number seed and re-run it. You might do better, or you might do worse. You can keep trying until you get satisfactory results, provided your definition of "satisfactory" isn't too stringent. It may or may not even be possible to match everybody without reuse, and even if possible, the probability of generating such a match randomly may be so low that it won't happen in your lifetime.

This propensity to fail to find a match is one of the drawbacks of matching without reuse, and is the reason it is usually not used in creating matched-pair samples for statistical studies.
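
One way to try several seeds and see how many people each leaves unmatched is sketched below. It reuses the matching loop from the code above on the example data. The seeds 2345 and 3456 and the names `people', `n_people' and `unmatched' are arbitrary choices made for this illustration, not part of the original code.

```stata
clear
input byte(person_id family_id neigh_id)
 1 1 1
 2 1 1
 3 2 1
 4 2 1
 5 3 1
 6 3 1
 7 4 2
 8 4 2
 9 5 2
10 5 2
11 6 2
12 6 2
end

* Keep a copy of the person-level file and note how many people it holds
tempfile people
save `people'
local n_people = _N

* Renamed copy to serve as the pool of potential matches
rename person_id person_id2
rename family_id family_id2
tempfile matches
save `matches'

* Hypothetical list of seeds to try; any positive integers would do
foreach seed of numlist 1234 2345 3456 {
    use `people', clear
    joinby neigh_id using `matches'
    drop if family_id == family_id2

    set seed `seed'
    gen double shuffle = runiform()
    sort person_id shuffle
    local i = 1
    while `i' < _N {
        local matched_id = person_id2[`i']
        local pid = person_id[`i']
        drop if inlist(`matched_id', person_id, person_id2) & inrange(_n, `i'+1, _N)
        drop if inlist(`pid', person_id, person_id2) & inrange(_n, `i'+1, _N)
        local ++i
    }

    * Each remaining observation is one pair, i.e. two matched people
    local unmatched = `n_people' - 2*_N
    display "seed `seed': " _N " pairs, `unmatched' people unmatched"
}
```

After the loop, the data in memory are the pairs from the last seed tried. To keep the best result, re-run the matching with whichever seed left the fewest people unmatched.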

Last edited by Clyde Schechter; 19 Nov 2018, 13:10.
IYH Svien

Join Date: Sep 2017

Posts: 14
#7

19 Nov 2018, 13:48

Thanks again, Clyde. This is all quite helpful.