  • Matching participants per cases and controls

    Dear List

    I have a dataset of participants and reference individuals.
    As an example the data is:
    case 0/1
    sex 0/1
    age integer in whole years

    I found this post in the old forum (see link below).
    It describes how to match 2:1 by age, using the following example.

    Code:
    clear
    // mock up control data
    set seed 846
    set obs 500 // don't know how many controls you have
    gen byte case = 0
    gen byte age = 20 +ceil(65*runiform()) // broad age range assumed
    tempfile controls
    sort age
    save `controls'
    clear
    
    // mock up cases
    set obs 63
    gen byte case = 1
    gen byte age = 20 +ceil(65*runiform())
    
    //The real stuff starts here; you have an existing control file you can append to your cases.
    
    append using `controls'
    gen rand = runiform()
    sort age case rand
    by age: egen ncases = sum(case)
    keep if (ncases >=1) // age groups with no cases are irrelevant
    
    // The following keeps the first 2 controls for each case within each age group
    
    by age: keep if (case ==1) | ((_n <= 2*ncases) & (case == 0))
    tab2 age case
    by age: egen ncontrols = sum(case == 0)
    count if (ncontrols < 2*ncases)
    Now, I don't have an exact match, and would like to match by sex and age ±3 years.
    I would prefer not to use the same control for more than one case.

    Can anyone tell me how to do that, preferably with an indicator of which case is matched to which two controls?

    Thank you.
    Lars

  • #2
    I was the author of that earlier, rather poorly designed suggestion that you show above. Now, with the -rangejoin- command available from SSC, and some better (?) thinking on my part, here's a new solution. The only "hard part" here is sampling controls without replacement, for which I offered a clumsy solution in another post several years ago. I believe the two -bysort- lines under the "Sample `nctl' controls w/o replacement" comment in the code below accomplish this in an easy way. This solution seems so much easier that I have some doubts, but it seems to work.
    Critique welcome.

    If the use of -rangejoin- can be extended to accommodate multiple "caliper-matched" variables, that would be nice as regards a more general solution.

    Code:
    local totalcases = 100       // for example
    local totalcontrols  = 10000
    local nctl = 4  // I'm choosing 4 controls per case
    //  mock control data
    set seed 8846
    set obs `totalcontrols'
    gen int id_ctl = _n
    gen byte case = 0
    gen byte age = 20 +ceil(65*runiform()) // broad age range assumed
    gen int sex = runiform() > 0.5
    tempfile controls
    save `controls'
    clear
    // mock case data
    set obs `totalcases'
    gen int id = _n
    gen byte case = 1
    gen byte age = 20 +ceil(65*runiform())
    gen sex = runiform() > 0.5
    tempfile cases
    save `cases'
    //
    *******************************************************************************
    // Actual solution starts here
    use `controls'
    gen rand = runiform()
    sort rand // random order for controls
    drop rand
    save `controls', replace
    //
    use `cases'
    // exact match on sex, within +/- 3 years for age
    compress
    rangejoin age -3 3 using `controls', by(sex)  
    drop *_U  // clean up
    rename (id age sex) (id_case age_case sex_case) // clean up
    //  Sample `nctl' controls w/o replacement 
    bysort id_ctl: keep if _n ==1  // use each control only once
    bysort id_case: keep if _n <= `nctl'  // keep up to `nctl' controls for each case
    //
    // put the control data onto the file and check it out
    merge 1:1 id_ctl using `controls', keep(match)
    rename (age sex) (age_ctl sex_ctl)
    // Check how many controls were found for every case
    bysort id_case: gen byte numcontrols = _N if _n ==1
    tab numcontrols

    • #3
      Hi Mike,

      Thanks for posting this code. It has been incredibly useful.

      I was using it on a study in which some (<5) cases had less than the optimal (n=3) number of controls. However the number of records with <3 controls kept changing each time I ran the code. This made the results irreproducible.

      If you run the code above with "local totalcontrols = 400" you will see what I mean.


      Changing the section of the code below:

      // Sample `nctl' controls w/o replacement bysort id_ctl: keep if _n ==1 // use each control only once bysort id_case: keep if _n <= `nctl' // keep up to `nctl' controls for each case // to
      /* Sample `nctl' controls w/o replacement */
      set seed 543543
      gen random=runiform()
      sort random
      bysort id_ctl (random): keep if _n ==1 // use each control only once
      bysort id_case: keep if _n <= `nctl' // keep up to `nctl' controls for each case

      meant the data linkage (ie. number of records per case) was identical each time I ran it.

      Note that those records with no matches are simply deleted. For completeness, it would be good to have a way of flagging cases with no matches.
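
      One possible way to flag such cases is sketched below. It assumes it runs at the very end of the code in #2, in the same do-file run, so that the matched pairs are in memory (keyed by id_case and id_ctl) and the tempfile `cases' still exists. The variable name n_matched is made up for this illustration.

      ```stata
      preserve
      // one row per matched case, with the number of controls it received
      collapse (count) n_matched = id_ctl, by(id_case)
      rename id_case id
      // bring back every original case; cases that lost all their rows
      // in the matching step come in from `cases' with n_matched missing
      merge 1:1 id using `cases', nogenerate
      replace n_matched = 0 if missing(n_matched)
      tab n_matched
      list id age sex if n_matched == 0   // cases with no matched control
      restore
      ```

      Because of the -preserve- / -restore- pair, the matched file itself is left unchanged; the block only reports on it.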

      • #4
        Peter, it's certainly possible I made some kind of mistake, but I think there are ordinary (non-error) reasons for what you're seeing.

        // Sample `nctl' controls w/o replacement bysort id_ctl: keep if _n ==1 // use each control only once
        I don't understand why you want to comment out this line. This line ensures sampling w/o replacement.
        Perhaps you didn't mean to comment out this line?

        > I was using it on a study in which some (<5) cases had less than the optimal (n=3) number of controls.
        > However the number of records with <3 controls kept changing each time I ran the code.

        If you did not use "set seed ..., " you would get a different sorting of cases and controls each time, which would result in different results of the kind you describe. This is "greedy" matching, so if you have a case "Jane Smith" for whom relevant controls are in short supply, they may already be assigned to some other case before it's Jane's "turn."

        > meant the data linkage (ie. number of records per case) was identical each time I ran it.
        I'm confused by this sentence, since it lacks a subject. If you mean "I was getting the results I describe even though I had not changed my choice of nctl," I think my previous comment speaks to that problem.

        > If you run the code above with "local totalcontrols = 400" you will see what I mean.

        I would not expect the preceding code, or almost any matching algorithm, to work well with too small a number of controls from which to choose. With 100 cases and only 400 controls, it's quite likely that many cases will not have suitable matches, since similar case subjects will "compete" for the same set of controls. Some subjects may not have 4 suitable matches at all.
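
        To see how scarce eligible controls are before any sampling happens, a check along the following lines could be run. This is only a sketch: it assumes it is placed in the code from #2 directly after the -rangejoin- and -rename- lines and before the two -bysort ... keep- lines, and the names n_candidates and first_row are made up here.

        ```stata
        // number of eligible controls (same sex, age within +/- 3) per case;
        // cases with no eligible control have missing id_ctl and get 0
        bysort id_case: egen n_candidates = count(id_ctl)
        // mark one row per case so each case is counted once
        egen byte first_row = tag(id_case)
        summarize n_candidates if first_row, detail
        // cases that cannot reach `nctl' controls even without competition
        count if first_row & n_candidates < `nctl'
        ```

        Note that a case with at least `nctl' candidates can still end up with fewer controls after the greedy step, because other cases may claim the same controls first.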



        • #5
          Is it possible for this code to be modified somewhat to allow matching with replacement?

          Thanks,

          Rob.

          • #6
            Yes. The -rangejoin- command pairs each case with *all matching controls.* So, you can just leave out the line:
            Code:
            bysort id_ctl: keep if _n ==1 // use each control only once
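
            A sketch of what the tail of the code from #2 might then look like, under some assumptions: the seed value and the variable u are made up for illustration, and the explicit random draw is an addition so that the controls kept for each case are a repeatable random pick. One consequence of dropping that line is that a control can appear under several cases, so id_ctl no longer uniquely identifies rows and the later -merge 1:1- would need to become -merge m:1-.

            ```stata
            // matching WITH replacement: the "use each control only once" line is omitted
            set seed 2718                      // hypothetical seed, for repeatable draws
            sort id_case id_ctl                // unique key, so row order is fixed before drawing
            gen double u = runiform()
            bysort id_case (u): keep if _n <= `nctl'   // random `nctl' controls per case
            drop u
            // id_ctl can repeat across cases now, hence m:1 rather than 1:1
            merge m:1 id_ctl using `controls', keep(match) nogenerate
            rename (age sex) (age_ctl sex_ctl)
            ```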

            • #7
              Hi Mike,

              Thanks for your response, and my apologies for my lack of clarity.

              The line that was commented out was your original code, and included for comparison purposes only.

              I would suggest that a situation with 100 cases and 1000 potential controls, with wanting 4 controls per case, would reflect a potential real world situation.

              With the change below, each time I ran your original code it generated different results for "tab numcontrols".

              local totalcontrols = 1000


              However, if you modify your original code slightly, as below, then the same result will be returned each time.

              /* Sample `nctl' controls w/o replacement */
              gen random=runiform()
              sort random
              bysort id_ctl (random): keep if _n ==1 // use each control only once
              bysort id_case: keep if _n <= `nctl' // keep up to `nctl' controls for each case


              Any matching algorithm should be reproducible. When I made the changes above, the cohort of matched controls remained consistent each time the code was run.

              I hope that explanation is clearer.

              Peter.
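
              A likely explanation for the difference, offered as an assumption rather than something confirmed in this thread: -sort- (and so -bysort-) orders tied observations arbitrarily unless the stable option is used, so "bysort id_ctl: keep if _n == 1" can keep a different case for the same control on each run, even with a seed set at the top. A minimal sketch of a fully repeatable version of the sampling step follows. The seed value is the one from #3, and sorting on the unique pair (id_case, id_ctl) before drawing is an addition not in the posts above.

              ```stata
              /* Sample `nctl' controls w/o replacement, reproducibly */
              set seed 543543                   // any fixed seed will do
              sort id_case id_ctl               // unique key: row order is now fixed
              gen double random = runiform()    // same draws on every run
              bysort id_ctl (random): keep if _n == 1        // use each control only once
              bysort id_case (random): keep if _n <= `nctl'  // keep up to `nctl' controls per case
              drop random
              ```

              Sorting within each by-group on random leaves no ties for -sort- to break arbitrarily, which is what makes both -keep- steps repeatable.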
