
  • Case-control study matching 2 types of controls

    Hello,

    I am conducting a case-control study in which I would like to match cases 1:1 to two types of controls (cancer and healthy).

    I have successfully matched my cases to cancer controls by age (within 5 years) and gender using the following syntax:
    preserve
    keep if group_id == 1
    tempfile controls
    save `controls'
    restore

    keep if group_id==0
    rangejoin age -5 5 using `controls', by(sex)
    set seed 1234
    gen double shuffle = runiform()
    by patientid (shuffle), sort: keep if _n==1
    drop shuffle

    My questions now are:
    1) How do I also match the second group of healthy controls (i.e., group_id==2, which no longer appears within my dataset) to my cases? Can it be done within this syntax?
    2) Now that the cases and cancer controls appear paired, how do I rearrange my data so that the variables are listed in single column including both cases and cancer controls? Given that the control data now appears beside the case data with _U variables, I don't think this can be done with the reshape command.

    Thanks in advance



    Last edited by Rene McCrae; 18 Oct 2018, 13:18.

  • #2
    Well, you could do the two matches simultaneously:

    Code:
    preserve
    keep if group_id == 1
    tempfile controls
    save `controls'
    restore
    
    keep if inlist(group_id, 0, 2)
    rangejoin age -5 5 using `controls', by(sex)
    set seed 1234
    gen double shuffle = runiform()
    by patientid group_id (shuffle), sort: keep if _n==1
    drop shuffle
    Now each case will appear in up to two observations, one with a group_id 1 control and another with a group_id 2 control.

    The next step is to get this into fully long layout. You are correct that this is not a good task for -reshape-. Instead, you have to split out the case and control variables and then append them together.

    Code:
    clonevar case_id = patientid // INDICATE WHICH CASE IN THE TRIPLET
    preserve
    clonevar sex_U = sex // CREATE _U VERSIONS OF SEX & CASE ID
    rename case_id case_id_U
    keep *_U
    rename *_U *
    tempfile controls
    save `controls'
    
    restore
    drop *_U
    append using `controls'
    Note: No sample data was provided, so this is not tested. Beware of typos or substantive errors.
    Last edited by Clyde Schechter; 18 Oct 2018, 13:51.



    • #3
      Thanks for your help with this. When I use the code you suggested to match my cases to the two types of controls (syntax below; I adjusted it slightly because I had labelled my cases group_id==0 and so on), I see cases being used in more than 2 observations (one patientid was used 15 times, for example), and there are now observations with group_id_U==. being matched to my controls.

      Syntax:
      preserve
      keep if group_id == 0
      tempfile cases
      save `cases'
      restore

      keep if inlist(group_id, 1, 2)
      rangejoin age -1 1 using `cases', by(sex)
      set seed 1234
      gen double shuffle = runiform()
      by patientid group_id (shuffle), sort: keep if _n==1
      drop shuffle

      Data example:
      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input float(group_id group_id_U) double age long sex double age_U
      2 0 72.32 2 72.95
      2 0 72.51 2 73.51
      2 0    54 1 53.52
      2 0 87.77 2 88.67
      1 0 69.02 1    69
      2 0 75.34 1 75.15
      2 0  53.9 2 53.85
      2 0 60.17 2 60.12
      1 0 68.65 1 69.45
      2 0  80.1 1 79.84
      2 0 78.63 1 77.79
      2 0 66.51 1 65.84
      1 . 55.41 1     .
      2 0 54.23 1 53.52
      2 0 73.71 2 72.95
      2 0 78.57 2    78
      2 0  67.9 1  67.4
      1 0 69.44 2 68.99
      2 0 65.23 1 66.16
      2 0 53.42 2 53.85
      2 0 74.03 2 73.62
      2 0 56.68 1 57.03
      2 0 77.51 2    78
      2 . 48.81 1     .
      2 0 58.83 2 58.76
      2 0 56.74 2 57.16
      1 0    65 1 65.84
      2 0 68.11 2  68.4
      2 0 62.96 1 62.49
      1 0 55.15 2 55.01
      2 0 53.02 1 53.01
      1 . 47.63 1     .
      2 0 65.31 2 65.86
      2 0 50.02 2    50
      2 0 67.96 2 68.26
      2 0 65.31 1 66.16
      2 . 64.27 1     .
      2 0 79.09 1 79.84
      2 0 49.14 1 50.07
      1 0 74.29 2 75.06
      2 0 54.42 2 53.85
      2 0  66.6 1 66.93
      2 0 49.76 2    50
      2 0 50.27 2    50
      2 0 76.66 2 76.97
      1 0 49.83 1 50.07
      2 0 70.61 1 70.32
      2 0 46.62 2  45.7
      2 0 73.49 2 74.37
      2 0 57.09 1 57.03
      2 0 65.53 1 65.84
      1 . 55.34 1     .
      2 0 67.12 2 66.32
      2 0 47.43 2 47.04
      1 0 73.37 2 74.37
      2 0 68.55 1 68.91
      2 0  60.9 1 61.32
      2 . 55.54 1     .
      2 0 63.01 1 63.12
      2 0 78.24 2 78.65
      2 0 81.93 1 81.43
      2 0 63.12 2 63.73
      2 0 65.19 1 65.98
      2 0 67.85 2 68.26
      2 0 72.05 2 71.13
      2 0 64.11 2  64.1
      2 0  71.7 2 70.96
      2 0 50.55 2    50
      1 0 55.26 2 55.01
      2 0 81.71 2 81.51
      1 0 63.67 1 63.12
      2 0 79.02 2 78.65
      2 0 58.41 1 59.05
      2 0 79.83 1 80.38
      2 0 63.57 2 64.49
      2 0 67.68 2  68.4
      2 0 75.15 2  75.4
      2 0 70.96 1 70.29
      2 0 66.41 1 66.66
      2 0 79.73 2 79.22
      1 0 58.04 2 58.76
      2 0 64.88 2  64.5
      2 0 58.88 1 58.84
      2 0 46.12 2  45.7
      1 0 68.96 1 69.45
      2 0 67.38 1 67.56
      1 0 66.99 2 67.58
      2 0 76.39 2 76.86
      1 0 70.28 1 70.29
      2 0 69.16 2 68.99
      2 0 76.14 2 76.97
      2 0 79.77 1 79.89
      2 0 67.24 2 66.55
      2 0 54.04 2 55.01
      2 0 72.35 1 72.51
      2 0 62.21 2  61.9
      2 0 47.45 2 46.88
      2 0 67.59 2 67.58
      2 0 56.53 1 57.03
      2 0 65.66 1 66.66
      end
      label values group_id group_id
      label values group_id_U group_id
      label def group_id 1 "control", modify
      label def group_id 2 "non-cancer surgery", modify
      label def group_id 0 "case", modify
      label values sex sex
      label def sex 1 "Female", modify
      label def sex 2 "Male", modify
      This is my first time using -dataex- so please let me know if there would be something more helpful than what I provided. Any idea why I am getting so many observations with my current approach? I only have 183 cases and using the syntax I had before worked out to 183 observations with cases: 1 type of controls.

      Thanks in advance



      • #4
        Yes, I forgot to -drop if missing(group_id_U)-. You can insert that right after the -rangejoin- command.

        As for some controls being used more than once, that is expected. The algorithm provides simple random sampling from among the eligible controls, and this means that some controls will be used more than once. From a statistical perspective, there is nothing wrong with that. And, in fact, it is most likely to happen in cases where only one or a small number of controls is available to match a case. So, while there is a different algorithm that does not reuse controls, that aesthetic improvement (and it is only aesthetic) may come at the price of having no control at all for some cases, and therefore excluding them from the analysis altogether. If you want that, post back and code can be provided.



        • #5
          Thanks a lot for your reply, I realize now that the syntax I was using was combining each of my types of controls, cancer (n=807) and non-cancer (n=2211) and using my cases multiple times to create 3018 observations.

          The following syntax creates 183 observations that only use my 183 cases once. However, it does not match the controls 1:1 and 1:1 as I would like, instead my 183 cases are matched to 50 cancer controls and 132 non-cancer controls.
          preserve
          keep if group_id== 1 | group_id==2
          tempfile controls
          save `controls'
          restore

          keep if group_id==0
          rangejoin age -1 1 using `controls', by(sex)
          drop if missing(group_id_U)
          set seed 1234
          gen double shuffle = runiform()
          by patientid group_id (shuffle), sort: keep if _n==2
          drop shuffle

          Am I correct in thinking _n specifies the number of matches? Is there any way to match them 1:1 and 1:1 within this syntax?

          Thanks in advance



          • #6
            The _n in that code does not refer to the number of matches. See -help subscripting- for what it does mean.
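
            As a small illustration of what _n and _N mean under a -by- prefix, here is a sketch with made-up data (the variables id and x are hypothetical and are not from this thread). Under -by-, _n is the position of an observation within its by-group after sorting, and _N is the number of observations in that group.

            Code:
            //    TOY DATA: id AND x ARE HYPOTHETICAL NAMES USED ONLY FOR ILLUSTRATION
            clear
            input byte id double x
            1 0.7
            1 0.2
            2 0.9
            2 0.4
            2 0.1
            end
            by id (x), sort: gen obs_in_group = _n    // POSITION WITHIN id, ORDERED BY x
            by id: gen group_size = _N                // NUMBER OF OBSERVATIONS WITHIN id
            list, sepby(id)

            So -keep if _n==1- after sorting on a random number keeps the first observation in each by-group, that is, one randomly chosen pairing per group.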

            That said, there is an error in the code, which you inherited from me. (This is the peril of asking for code without providing example data: the code is untested, and as I warned, could be incorrect. It was incorrect.)

            Here I've generated some almost realistic demonstration data and then given corrected code: it works this time! The correction is in the final -by- prefix, which now uses group_id_U instead of group_id.

            Code:
            //    CREATE DEMONSTRATION DATA
            clear*
            set obs 500
            gen byte group_id = 0 in 1/100 // CASES
            replace group_id = 1 in 101/300 // CONTROL GROUP 1
            replace group_id = 2 in 301/L    // CONTROL GROUP 2
            gen long patientid = _n
            set seed 1234
            label define sex    0    "Male"    1    "Female"
            gen sex:sex = runiform() < 0.5
            gen int age = round(rgamma(30, 2))
            tabstat age, by(group_id) statistics(mean sd min max)
            
            // DO THE MATCHING
            preserve
            keep if group_id== 1 | group_id==2
            tempfile controls 
            save `controls'
            restore
            
            keep if group_id==0
            rangejoin age -1 1 using `controls', by(sex)
            drop if missing(group_id_U)
            gen double shuffle = runiform()
            by patientid group_id_U (shuffle), sort: keep if _n==1
            drop shuffle
            This will almost give you 1:1:1 matching. Where it falls short of that is, in this case, 4 observations for which there is no eligible match in one of the control groups. It is likely that your real data will have unmatchable cases as well, in fact, probably more of them. It may even have some cases for which there is no eligible match in either control group. Matching on age within 1 year is a very stringent matching criterion, especially if there are any very old or very young people in the data. If you are left with too few matches, I would recommend relaxing that to 2 or 5 years.
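
            One way to see which cases fell short of a full triplet is sketched below. It assumes the matched data produced by the code above are still in memory; n_matched and first_obs are names introduced here for illustration. Cases with no eligible control in either group were already removed by -drop if missing(group_id_U)-, so they will not show up in this tabulation.

            Code:
            //    ASSUMES THE MATCHED DATA FROM THE CODE ABOVE ARE IN MEMORY
            by patientid, sort: gen byte n_matched = _N    // CONTROLS KEPT FOR THIS CASE (1 OR 2)
            egen byte first_obs = tag(patientid)           // FLAGS ONE OBSERVATION PER CASE
            tabulate n_matched if first_obs                // COUNTS CASES, NOT OBSERVATIONS
            list patientid sex age group_id_U if n_matched < 2, noobs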





            • #7
              I applied the corrected code and it gave me 1:1:1 matching, thank you very much for your help with this!



              • #8
                Hello, I had a question about -rangejoin-: is there a way to do it without replacement, so that each control is used only once?



                • #9
                  The use of -rangejoin- has nothing to do with whether you end up getting matches with or without replacement. The command
                  Code:
                  by patientid group_id_U (shuffle), sort: keep if _n==1
                  is what produces the matches with replacement. For matches without replacement, you have to replace that command with a loop over the observations that sequentially removes all repetitions of an already used matched control.

                  There is no statistical advantage to using matching without replacement. In fact, most of the standard statistical results based on simple random sampling assume sampling with replacement. When you do matching without replacement you degrade the quality of your analysis in two ways. First, some cases will fail to find a matched control, because the only potential matches get taken by some other case(s). Thus the sample size goes down. Worse, the elimination of cases that cannot find a match may leave you with a biased sample--the hard to match cases are usually ones with extreme values on some of the variables. In addition, the quality of the matching itself degrades because the best match for one case may have already been taken by another case, so it gets left either with no match at all (as already discussed) or with a match of poorer quality. Matching without replacement really has nothing going for it. I don't recommend it.
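
                  For readers who need it anyway, the loop described above might be sketched as follows. This is an untested sketch, not a definitive implementation. It assumes the data in memory are the result of -rangejoin- followed by -drop if missing(group_id_U)-, and that the control's identifier is held in patientid_U (an assumption about how -rangejoin- named it). The variables set_id, chosen, and used are names introduced here.

                  Code:
                  //    GREEDY MATCHING WITHOUT REPLACEMENT (SKETCH; patientid_U IS ASSUMED TO EXIST)
                  set seed 1234
                  gen double shuffle = runiform()
                  egen long set_id = group(patientid group_id_U)    // ONE SET PER CASE AND CONTROL TYPE
                  gen byte chosen = 0    // PAIRINGS KEPT AS MATCHES
                  gen byte used = 0      // PAIRINGS WHOSE CONTROL WENT TO AN EARLIER CASE
                  summarize set_id, meanonly
                  local nsets = r(max)
                  forvalues s = 1/`nsets' {
                      summarize shuffle if set_id == `s' & !used, meanonly
                      if r(N) == 0 {
                          continue    // NO UNUSED CONTROL LEFT FOR THIS CASE
                      }
                      replace chosen = 1 if set_id == `s' & !used & shuffle == r(min)
                      summarize patientid_U if set_id == `s' & chosen, meanonly
                      replace used = 1 if patientid_U == r(min) & !chosen
                  }
                  keep if chosen
                  drop shuffle set_id chosen used

                  Sets are processed in order of patientid, so cases handled later have fewer controls to choose from, and some may end up with no match at all. That is exactly the loss of sample and match quality discussed above.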



                  • #10
                    Oh no! What you said makes sense. However, my superior said they wanted matching without replacement, as otherwise those few repeating controls would be overrepresented. I used -calipmatch-, and then for the unmatched cases I had to loosen the matching criteria.
                    What I had done initially with -rangejoin- was this: after joining on age, sex, and race, I calculated the age differences between the cases and their matches, then sorted by age difference within each case and kept the first two.
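
                    The nearest-age approach just described could look roughly like the sketch below. It assumes the -rangejoin- result is in memory with the case in patientid and age and the control's age in age_U, as in the earlier posts; age_diff and shuffle are names introduced here. Note that this still matches with replacement, since nothing stops two cases from keeping the same control.

                    Code:
                    //    SKETCH: KEEP THE TWO CONTROLS CLOSEST IN AGE TO EACH CASE
                    drop if missing(age_U)                     // CASES WITH NO ELIGIBLE CONTROL
                    gen double age_diff = abs(age - age_U)
                    set seed 1234
                    gen double shuffle = runiform()            // BREAKS TIES IN age_diff AT RANDOM
                    by patientid (age_diff shuffle), sort: keep if _n <= 2
                    drop shuffle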



                    • #11
                      A followup on the statistical as opposed to data management issues here: I'm curious here about the preferred analysis for with-replacement matching: Let's say we need the odds-ratio as the estimate of effect for a matched case-control study, and we were thinking to use conditional logit. What should be done to account for the re-use of some controls? I could see ignoring it on the idea that the number of re-used controls is small enough to likely make only a trivial difference in any SE estimates, but I'd presume there's a more principled approach or argument. I didn't easily find any info. on recommended methods, so I'd be interested to hear comments on this.



                      • #12
                        In a multi-level model, you can use the matched-pair (or matched-tuple) itself as a level and use cluster robust standard errors. You could add a level for control-id to the model in a multiple-membership relationship to the matched pair (tuple).
                        Last edited by Clyde Schechter; 08 Sep 2022, 13:18.



                        • #13
                          In response to both #10 and #11: you use weights to account for the re-use of controls. The basic issue, conceptually, is that each case-control set should have weights that sum to 2: 1 for the case and 1 for the control. Some of the user-written programs (e.g., ultimatch) set up the weights for you as part of the matching process.
