  • Creating "random" pairs with restrictions on which cases can be paired together

    I have a data set that includes three variables: a person ID, a family ID, and a neighborhood ID. To help fix ideas, the structure of the data looks like this:

    Code:
    person_id   family_id   neigh_id
    1           1           1
    2           1           1
    3           2           1
    4           2           1
    5           3           1
    6           3           1
    7           4           2
    8           4           2
    9           5           2
    10          5           2
    11          6           2
    12          6           2
    What I'd like to do is randomly pair each person with another person in their neighborhood who isn't also in their family. Any advice on how to proceed -- even if it's only enough to get me on the right track -- would be very much appreciated.

  • #2
    So, you do this in stages. First you create a copy of the data file, with the variables person_id and family_id renamed so that you can distinguish them from the originals. Then you join the two files together, pairing each person_id with every other person_id in the same neighborhood, and drop those cases that come from the same family. Finally, you assign a random number to each observation, and for each person_id, you pick the one with the smallest random number.

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte(person_id family_id neigh_id)
     1 1 1
     2 1 1
     3 2 1
     4 2 1
     5 3 1
     6 3 1
     7 4 2
     8 4 2
     9 5 2
    10 5 2
    11 6 2
    12 6 2
    end
    
    preserve
    rename person_id person_id2
    rename family_id family_id2
    tempfile matches
    save `matches'
    restore
    
    joinby neigh_id using `matches'
    drop if family_id == family_id2
    
    set seed 1234
    gen double shuffle = runiform()
    by person_id (shuffle), sort: keep if _n == 1
    Note: if what you are looking for is pairing for some kind of matched-pairs statistical analysis, this will work fine. You may notice that the same person may be re-used as a match for more than one person_id. For matched-pairs analysis this is perfectly OK. If you have some other situation where you must have a different match for each person, then the third stage of the process is different and more complicated. You can post back for help with that if this is your situation.
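
    To see how much reuse actually occurs, one option is to count how often each person_id2 was chosen as a match. This is only a sketch: the four input rows are a made-up result set, and n_uses is a variable name introduced here for illustration. On real data, the same commands would be run on whatever is left in memory after the matching code.

    ```stata
    * Hypothetical matching result: person 3 was picked as the match for both 1 and 2
    clear
    input byte(person_id person_id2)
    1 3
    2 3
    3 1
    4 2
    end

    * Number of times each person serves as somebody's match
    bysort person_id2: gen n_uses = _N
    tabulate n_uses

    * Show the rows that share a match
    list if n_uses > 1, sepby(person_id2)
    ```

    Any rows listed by the last command are cases where the same person is reused.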

    In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 15.1 or a fully updated version 14.2, it is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.



    • #3
      Thanks, Clyde. I am indeed in a situation where it's inappropriate to reuse the same person in more than one pair.

      (Thanks for the tip on how to show data examples. I knew there was probably a better way than putting it in a code block, but didn't know about -dataex-.)


      • #4
        So replace the final "paragraph" of the code (the one that begins with set seed 1234) with:

        Code:
        set seed 1234
        gen double shuffle = runiform()
        sort person_id shuffle
        local i = 1
        while `i' < _N {
            local matched_id = person_id2[`i']
            local pid = person_id[`i']
            drop if person_id2 == `matched_id' in `=`i'+1'/L
            drop if person_id == `pid' in `=`i'+1'/L
            local ++i
        }
        Notes:
        1. In your example data, things work out so that everybody can be matched uniquely with somebody else. In the full data you may find that you run out of matches for some people if you can't use the same person as a match twice.

        2. The number 1234 in the -set seed- command is arbitrary. Pick any positive integer you like. You will get different results with different seeds, but what matters is that you be able to reproduce the results if you need to do it over--that is why you need to set a seed.

        3. In the event you have a very large data set, say more than 1,000,000 people, then you need to generate two random numbers, call them shuffle1 and shuffle2, and then you have to -sort person_id shuffle1 shuffle2-.
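
        A minimal sketch of note 3, assuming the reason for the second key is that tied random numbers become more likely in very large data sets and would leave the order within person_id indeterminate. The toy data (10 observations, 2 persons) are invented for illustration; -isid- only confirms that the three sort keys jointly identify each observation.

        ```stata
        * Toy data, hypothetical: 2 persons with 5 candidate rows each
        clear
        set obs 10
        gen long person_id = ceil(_n/5)

        set seed 1234
        gen double shuffle1 = runiform()
        gen double shuffle2 = runiform()

        * shuffle2 breaks any ties in shuffle1, so the order is reproducible
        sort person_id shuffle1 shuffle2

        * Exits with an error if the sort keys do not uniquely identify observations
        isid person_id shuffle1 shuffle2
        ```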


        • #5
          Thanks, Clyde. This is really helpful.

          One remaining question: When I run your code on the example data, I end up with this:

          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input float(person_id family_id neigh_id person_id2 family_id2)
           1 1 1  4 2
           2 1 1  3 2
           3 2 1  6 3
           4 2 1  2 1
           5 3 1  1 1
           7 4 2 10 5
           8 4 2  9 5
           9 5 2  8 4
          10 5 2  7 4
          end
          If I'm understanding correctly, it looks as though person 1 is linked to person 4. But person 4 is not linked to person 1 (as they should be); instead they're linked to person 2. Do you have advice on how to edit the code so that the links are internally consistent (i.e., 1 is linked to 4 and 4 is linked to 1)?
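
          One way to check which links are reciprocal is to merge the result with a copy of itself in which the two ID variables are swapped. This is a sketch only: the four input rows are taken from the output above, while reversed and reciprocal are names introduced here for illustration.

          ```stata
          * Four rows from the result above: 1->4, 4->2, 8->9, 9->8
          clear
          input byte(person_id person_id2)
          1 4
          4 2
          8 9
          9 8
          end

          * Save a copy with the two ID variables swapped
          preserve
          rename (person_id person_id2) (person_id2 person_id)
          tempfile reversed
          save `reversed'
          restore

          * A link a->b is reciprocal only if b->a is also present
          merge 1:1 person_id person_id2 using `reversed', keep(master match)
          gen byte reciprocal = _merge == 3
          list person_id person_id2 reciprocal
          ```

          Here the links between 8 and 9 are flagged as reciprocal, while 1->4 and 4->2 are not.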


          • #6
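            Well, rather than do that, here is code that just gives you a list with each pairing given once, with the lower numbered person in the pair listed first. The parts that are changed from before are the two -drop- commands inside the loop and everything after the loop (these were italicized in the original post).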

            Code:
            * Example generated by -dataex-. To install: ssc install dataex
            clear
            input byte(person_id family_id neigh_id)
             1 1 1
             2 1 1
             3 2 1
             4 2 1
             5 3 1
             6 3 1
             7 4 2
             8 4 2
             9 5 2
            10 5 2
            11 6 2
            12 6 2
            end
            
            preserve
            rename person_id person_id2
            rename family_id family_id2
            tempfile matches
            save `matches'
            restore
            
            joinby neigh_id using `matches'
            drop if family_id == family_id2
            
            set seed 1234
            gen double shuffle = runiform()
            sort person_id shuffle
            local i = 1
            while `i' < _N {
                local matched_id = person_id2[`i']
                local pid = person_id[`i']
                drop if inlist(`matched_id', person_id, person_id2) & inrange(_n, `i'+1, _N)
                drop if inlist(`pid', person_id, person_id2) & inrange(_n, `i'+1, _N)
                local ++i
            }
            
            drop shuffle
            gen p1 = min(person_id, person_id2), before(person_id)
            gen p2 = max(person_id, person_id2), after(p1)
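            * take each family ID from the same person as p1/p2, not the min/max of the family IDs
            gen f1 = cond(person_id < person_id2, family_id, family_id2), after(p1)
            gen f2 = cond(person_id < person_id2, family_id2, family_id), after(p2)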
            drop person_id* family_id*
            rename p? person_id?
            rename f? family_id?
            Added: With these changes, you no longer necessarily end up with everybody having a match. In some cases, a person who might be the only match possible for person k gets matched instead to person j, with j < k, and is thereby removed from the pool of potential matches. This leaves person k unmatched. The extent and pattern of this kind of non-matching varies with the random number seed. If your first attempt to apply this to your real data leaves you with an unacceptably large number of unmatched people, you can change the random number seed and re-run it. You might do better, or you might do worse. You can keep trying until you get satisfactory results, provided your definition of "satisfactory" isn't too stringent. It may or may not even be possible to match everybody without reuse, and even if possible, the probability of generating such a match randomly may be so low that it won't happen in your lifetime.

            This propensity to fail to find a match is one of the drawbacks of matching without reuse, and is the reason it is usually not used in creating matched-pair samples for statistical studies.
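
            To judge whether a given seed leaves too many people unmatched, one possibility is to compare the final pair list against a roster of everyone in the original data. The sketch below is self-contained and uses made-up data: a roster of six people and two hypothetical pairs. The names everyone, pair, and member are introduced here for illustration. In practice the roster would be saved from the original data before the -joinby- step.

            ```stata
            * Hypothetical roster of all persons
            clear
            input byte person_id
            1
            2
            3
            4
            5
            6
            end
            tempfile everyone
            save `everyone'

            * Hypothetical pair list in the layout produced by the code above
            clear
            input byte(person_id1 person_id2)
            1 4
            2 3
            end

            * Stack both members of each pair into one person_id column
            gen long pair = _n
            reshape long person_id, i(pair) j(member)

            * _merge == 2 marks people on the roster who appear in no pair
            merge 1:1 person_id using `everyone'
            count if _merge == 2
            list person_id if _merge == 2
            ```

            With these inputs, persons 5 and 6 are the ones reported as unmatched.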
            Last edited by Clyde Schechter; 19 Nov 2018, 13:10.


            • #7
              Thanks again, Clyde. This is all quite helpful.
