  • Matching without replacement from a file of pairs for case-control and other applications

    Short version: I’m seeking a way to do 1:m matching of cases and controls without replacement, *given a file in long (“edge”) format of all possible matching pairs.* Code to create sample data appears at the end of this post.

    Longer version:
    In corresponding off-list with Rima Saliba about this problem, I discovered that the code I had posted here several years ago in response to a question about this problem is wrong.

    While there have been a number of threads over the years about case-control matching (see, e.g., here), I have now encountered and refined the problem in a way that seems different enough from previous work to be worth posting in a more generalized form. What follows is my refined version of the problem, to which I'm interested in solutions.

    Via the use of -joinby- or perhaps -rangejoin- or -cross-, one can have a file that pairs up cases or treatment subjects with matching potential controls. In such situations, analysts may want to have 1:m matching *without replacement.* Complications include that:
    1) The controls that match one case/treatment subject may match many others.
    2) The number of controls available varies from case to case, and may be more or fewer than m.
    3) If possible, one wants to avoid too “greedy” an algorithm, which in the extreme can result in one case being assigned all m controls while some other similar case gets 0.
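
    For concreteness, here is a minimal sketch of how such a file of pairs might be built with -joinby-. It assumes hypothetical case and control files matched exactly on sex and within 2 years of age; the variable names (sex, age_case, age_control), the sample sizes, and the matching criteria are illustrative assumptions only, not part of the problem as posed.

```stata
* Hypothetical control file: names and criteria are illustrative assumptions.
clear
set seed 12345
set obs 200
gen int controlid = _n
gen byte sex = runiform() < 0.5
gen int age_control = 40 + floor(runiform() * 30)
tempfile controls
save `controls'

* Hypothetical case file.
clear
set obs 20
gen int caseid = _n
gen byte sex = runiform() < 0.5
gen int age_case = 40 + floor(runiform() * 30)

* Form every case-control pair with the same sex,
* then keep only pairs within 2 years of age.
joinby sex using `controls'
keep if abs(age_case - age_control) <= 2

* The result is an "edge" file: one observation per eligible pair.
keep caseid controlid
sort caseid controlid
```

    Cases with no eligible control drop out of the result, because -joinby- by default keeps only matched observations.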

    I have the idea that some solution involving -merge- should be possible, per some earlier threads, but I have not successfully figured out how to do that. I also have the thought that one of the many built-in or community-contributed matching commands might be used, but I have not worked that out either. I *have* discovered that some of the “greediness” problem can be avoided by having an algorithm that picks only *1* control without replacement for every case, and then applying this iteratively, so a solution that only picks one control per case would solve the problem.
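
    To make the iterative idea concrete, here is a rough sketch under stated assumptions, not a worked-out solution. In each of m rounds, every unused control nominates one of its pairs at random, every case accepts at most one nominated pair, and the accepted controls are retired. The variable names (matched, used, cand, pick, anypick, u), the value m = 3, and the small data set are all illustrative.

```stata
* Sketch only: all names and sizes here are illustrative assumptions.
* Small self-contained edge file, similar in spirit to the sample data in this post.
clear
set seed 2718
set obs 20
gen int caseid = _n
gen int navail = ceil(runiform() * 10)
expand navail
gen int controlid = ceil(runiform() * 60)
duplicates drop caseid controlid, force   // random draws can repeat a pair

local m = 3               // desired number of controls per case
gen byte matched = 0      // round in which a pair was selected; 0 = not selected
gen byte used = 0         // 1 once a control has been assigned to some case
gen double u = .

forvalues r = 1/`m' {
    replace u = runiform()
    * Each still-unused control nominates one of its pairs at random.
    bysort controlid (u): gen byte cand = (_n == 1) & !used
    * Each case accepts at most one of the pairs nominated for it.
    bysort caseid cand (u): gen byte pick = cand & (_n == 1)
    replace matched = `r' if pick
    * Retire the selected controls so that they cannot be reused.
    bysort controlid: egen byte anypick = max(pick)
    replace used = 1 if anypick
    drop cand pick anypick
}

* Keep the selected pairs: at most one per case per round.
keep if matched
keep caseid controlid matched
sort caseid matched
```

    Because a case can gain at most one control per round, no case can take all m controls before the others have had a turn. The sketch does not guarantee that every case gets a control in every round, even when one is available, so it may leave some cases short of m.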

    In that context, here is a code snippet to create what I’d consider a representative kind of data set with which to work:

    Code:
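    // Create an “edge” file of matched pairs.
    set seed 82743
    local ncases = 100          // number of cases
    local maxmatch = 100        // maximum number of potential controls per case
    local maxcontrolid = 2000   // control IDs run from 1 to maxcontrolid
    clear
    set obs `ncases'
    gen int caseid = _n
    // Each case gets between 1 and maxmatch potential controls.
    gen navail = ceil(runiform() * `maxmatch')
    label var navail "# controls matched to this case"
    // One observation per case-control pair.
    expand navail
    // Control IDs are drawn at random, so a control can be paired with several cases.
    gen int controlid = ceil(runiform() * `maxcontrolid')
    // Note: after -expand-, this summarizes navail over pairs, not over cases.
    summ navail
    order caseid controlid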

    I realize that “without replacement” is not necessarily analytically preferable, but that’s another issue.