Case-control study matching 2 types of controls

Rene McCrae

Join Date: Oct 2018

Posts: 7
#1

18 Oct 2018, 13:15

Hello,

I am conducting a case-control study in which I would like to match cases 1:1 to each of two types of controls (cancer and healthy).

I have successfully matched my cases to cancer controls by age (within 5 years) and gender using the following syntax:
preserve
keep if group_id == 1
tempfile controls
save `controls'
restore

keep if group_id==0
rangejoin age -5 5 using `controls', by(sex)
set seed 1234
gen double shuffle = runiform()
by patientid (shuffle), sort: keep if _n==1
drop shuffle

My questions now are:
1) How do I also match the second group of healthy controls (i.e., group_id==2, which no longer appears in my dataset) to my cases? Can it be done within this syntax?
2) Now that the cases and cancer controls appear paired, how do I rearrange my data so that each variable is a single column covering both cases and cancer controls? Given that the control data now appear beside the case data as _U variables, I don't think this can be done with the reshape command.

Thanks in advance

Last edited by Rene McCrae; 18 Oct 2018, 13:18.
Clyde Schechter

Join Date: Apr 2014

Posts: 30095
#2

18 Oct 2018, 13:48

Well, you could do the two matches simultaneously:

Code:

preserve
keep if group_id == 1
tempfile controls
save `controls'
restore

keep if inlist(group_id, 0, 2)
rangejoin age -5 5 using `controls', by(sex)
set seed 1234
gen double shuffle = runiform()
by patientid group_id (shuffle), sort: keep if _n==1
drop shuffle

Now each case will appear in up to two observations, one with a group_id 1 control and another with a group_id 2 control.

The next step is to get this into fully long layout. You are correct that this is not a good task for -reshape-. Instead, you have to split out the case and control variables and then append them together.

Code:

clonevar case_id = patientid // INDICATE WHICH CASE IN THE TRIPLET
preserve
clonevar sex_U = sex // CREATE _U VERSIONS OF SEX & CASE ID
rename case_id case_id_U
keep *_U
rename *_U *
tempfile controls
save `controls'
restore
drop *_U
append using `controls'

Note: No sample data was provided, so this is not tested. Beware of typos or substantive errors.

Last edited by Clyde Schechter; 18 Oct 2018, 13:51.

Rene McCrae

Join Date: Oct 2018
Posts: 7
#3

21 Oct 2018, 19:25

Thanks for your help with this. When I use the code you suggested to match my cases to the two types of controls (syntax below; I adjusted it slightly because I had labelled my cases group_id==0, and so on), I see cases being used in more than 2 observations (one patientid was used 15 times, for example), and there are now observations with group_id_U==. being matched to my controls.

Syntax:
preserve
keep if group_id == 0
tempfile cases
save `cases'
restore

keep if inlist(group_id, 1, 2)
rangejoin age -1 1 using `cases', by(sex)
set seed 1234
gen double shuffle = runiform()
by patientid group_id (shuffle), sort: keep if _n==1
drop shuffle

Data example:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(group_id group_id_U) double age long sex double age_U
2 0 72.32 2 72.95
2 0 72.51 2 73.51
2 0    54 1 53.52
2 0 87.77 2 88.67
1 0 69.02 1    69
2 0 75.34 1 75.15
2 0  53.9 2 53.85
2 0 60.17 2 60.12
1 0 68.65 1 69.45
2 0  80.1 1 79.84
2 0 78.63 1 77.79
2 0 66.51 1 65.84
1 . 55.41 1     .
2 0 54.23 1 53.52
2 0 73.71 2 72.95
2 0 78.57 2    78
2 0  67.9 1  67.4
1 0 69.44 2 68.99
2 0 65.23 1 66.16
2 0 53.42 2 53.85
2 0 74.03 2 73.62
2 0 56.68 1 57.03
2 0 77.51 2    78
2 . 48.81 1     .
2 0 58.83 2 58.76
2 0 56.74 2 57.16
1 0    65 1 65.84
2 0 68.11 2  68.4
2 0 62.96 1 62.49
1 0 55.15 2 55.01
2 0 53.02 1 53.01
1 . 47.63 1     .
2 0 65.31 2 65.86
2 0 50.02 2    50
2 0 67.96 2 68.26
2 0 65.31 1 66.16
2 . 64.27 1     .
2 0 79.09 1 79.84
2 0 49.14 1 50.07
1 0 74.29 2 75.06
2 0 54.42 2 53.85
2 0  66.6 1 66.93
2 0 49.76 2    50
2 0 50.27 2    50
2 0 76.66 2 76.97
1 0 49.83 1 50.07
2 0 70.61 1 70.32
2 0 46.62 2  45.7
2 0 73.49 2 74.37
2 0 57.09 1 57.03
2 0 65.53 1 65.84
1 . 55.34 1     .
2 0 67.12 2 66.32
2 0 47.43 2 47.04
1 0 73.37 2 74.37
2 0 68.55 1 68.91
2 0  60.9 1 61.32
2 . 55.54 1     .
2 0 63.01 1 63.12
2 0 78.24 2 78.65
2 0 81.93 1 81.43
2 0 63.12 2 63.73
2 0 65.19 1 65.98
2 0 67.85 2 68.26
2 0 72.05 2 71.13
2 0 64.11 2  64.1
2 0  71.7 2 70.96
2 0 50.55 2    50
1 0 55.26 2 55.01
2 0 81.71 2 81.51
1 0 63.67 1 63.12
2 0 79.02 2 78.65
2 0 58.41 1 59.05
2 0 79.83 1 80.38
2 0 63.57 2 64.49
2 0 67.68 2  68.4
2 0 75.15 2  75.4
2 0 70.96 1 70.29
2 0 66.41 1 66.66
2 0 79.73 2 79.22
1 0 58.04 2 58.76
2 0 64.88 2  64.5
2 0 58.88 1 58.84
2 0 46.12 2  45.7
1 0 68.96 1 69.45
2 0 67.38 1 67.56
1 0 66.99 2 67.58
2 0 76.39 2 76.86
1 0 70.28 1 70.29
2 0 69.16 2 68.99
2 0 76.14 2 76.97
2 0 79.77 1 79.89
2 0 67.24 2 66.55
2 0 54.04 2 55.01
2 0 72.35 1 72.51
2 0 62.21 2  61.9
2 0 47.45 2 46.88
2 0 67.59 2 67.58
2 0 56.53 1 57.03
2 0 65.66 1 66.66
end
label values group_id group_id
label values group_id_U group_id
label def group_id 1 "control", modify
label def group_id 2 "non-cancer surgery", modify
label def group_id 0 "case", modify
label values sex sex
label def sex 1 "Female", modify
label def sex 2 "Male", modify

This is my first time using -dataex-, so please let me know if something else would be more helpful than what I provided. Any idea why I am getting so many observations with my current approach? I only have 183 cases, and the syntax I had before produced 183 observations, each pairing a case with 1 type of control.

Thanks in advance

Clyde Schechter

Join Date: Apr 2014

Posts: 30095
#4

22 Oct 2018, 10:23

Yes, I forgot to -drop if missing(group_id_U)-. You can insert that right after the -rangejoin- command.

As for some controls being used more than once, that is expected. The algorithm provides simple random sampling from among the eligible controls, and that means that some controls will be used more than once. From a statistical perspective, there is nothing wrong with that. And, in fact, it is most likely to happen in cases where only one or a small number of controls is available to match a case. So, while there is a different algorithm that does not reuse controls, that aesthetic improvement (and it is only aesthetic) may come at the price of having no control at all for some cases, and therefore excluding them from the analysis altogether. If you want that, post back and code can be provided.
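
To see how much reuse actually occurs after matching, one can count how many cases share each control. The following is a minimal sketch on made-up pairs, not output from the data in this thread; it assumes the matched control's identifier is called patientid_U, as -rangejoin- would name it.

```stata
* Hypothetical matched pairs: patientid is the case, patientid_U its control
clear
input long patientid long patientid_U
1 101
2 102
3 101
4 103
5 101
end

* Number of cases that share each control
by patientid_U, sort: gen int n_uses = _N

* Tabulate once per distinct control rather than once per pair
egen byte first = tag(patientid_U)
tabulate n_uses if first
```

Here control 101 serves three cases and the other two controls serve one case each.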
Rene McCrae

Join Date: Oct 2018

Posts: 7
#5

22 Oct 2018, 11:36

Thanks a lot for your reply. I realize now that the syntax I was using was combining both of my types of controls, cancer (n=807) and non-cancer (n=2211), and using my cases multiple times to create 3018 observations.

The following syntax creates 183 observations that only use my 183 cases once. However, it does not match the controls 1:1 and 1:1 as I would like, instead my 183 cases are matched to 50 cancer controls and 132 non-cancer controls.
preserve
keep if group_id== 1 | group_id==2
tempfile controls
save `controls'
restore

keep if group_id==0
rangejoin age -1 1 using `controls', by(sex)
drop if missing(group_id_U)
set seed 1234
gen double shuffle = runiform()
by patientid group_id (shuffle), sort: keep if _n==2
drop shuffle

Am I correct in thinking _n specifies the number of matches? Is there any way to match them 1:1 and 1:1 within this syntax?

Thanks in advance
Clyde Schechter

Join Date: Apr 2014

Posts: 30095
#6

22 Oct 2018, 12:12

The _n in that code does not refer to the number of matches. See -help subscripting- for what it does mean.
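
As a small illustration of that point: under a -by- prefix, _n is the position of an observation within its by-group after sorting, and _N is the size of the group. The sketch below uses made-up variables (grp, x) that are not part of the thread's data.

```stata
* Toy data: two groups with 2 and 3 observations
clear
input byte grp double x
1 0.9
1 0.2
2 0.7
2 0.1
2 0.4
end

* Within each grp, sort by x and number the observations
by grp (x), sort: gen obs_in_grp = _n   // 1, 2, ... restarting in each group
by grp (x): gen grp_size = _N           // number of observations in the group

* _n == 1 keeps the first observation of each group in the sorted order
by grp (x): keep if _n == 1
list, noobs
```

So `keep if _n==1` keeps one observation per by-group, whereas `keep if _n==2` keeps only the second one and drops every group that has a single observation.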

That said, there is an error in the code, which you inherited from me. (This is the peril of asking for code without providing example data: the code is untested, and as I warned, could be incorrect. It was incorrect.)

Here I've generated some almost realistic demonstration data and then given corrected code: it works this time! The correction (originally shown in bold face) is the use of group_id_U instead of group_id in the -by- prefix.

Code:

// CREATE DEMONSTRATION DATA
clear*
set obs 500
gen byte group_id = 0 in 1/100 // CASES
replace group_id = 1 in 101/300 // CONTROL GROUP 1
replace group_id = 2 in 301/L // CONTROL GROUP 2
gen long patientid = _n
set seed 1234
label define sex 0 "Male" 1 "Female"
gen sex:sex = runiform() < 0.5
gen int age = round(rgamma(30, 2))
tabstat age, by(group_id) statistics(mean sd min max)

// DO THE MATCHING
preserve
keep if group_id== 1 | group_id==2
tempfile controls
save `controls'
restore

keep if group_id==0
rangejoin age -1 1 using `controls', by(sex)
drop if missing(group_id_U)
gen double shuffle = runiform()
by patientid group_id_U (shuffle), sort: keep if _n==1
drop shuffle

This will almost give you 1:1:1 matching. Where it falls short of that is, in this case, 4 observations for which there is no eligible match in one of the control groups. It is likely that your real data will have unmatchable cases as well, in fact, probably more of them. It may even have some cases for which there is no eligible match in either control group. Matching on age within 1 year is a very stringent matching criterion, especially if there are any very old or very young people in the data. If you are left with too few matches, I would recommend relaxing that to 2 or 5 years.
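
One way to check how many cases end up short of a full triplet is to count, per case, the number of control groups in which a match was found. The sketch below is an illustration under assumptions, not the code from this thread: it builds similar demonstration data and uses official -joinby- plus an age filter in place of -rangejoin- so that it runs without community-contributed packages. The names n_matched and one_per_case are made up for the example.

```stata
* Demonstration data similar in shape to the post above
clear
set seed 1234
set obs 500
gen long patientid = _n
gen byte group_id = cond(_n <= 100, 0, cond(_n <= 300, 1, 2))
gen byte sex = runiform() < 0.5
gen int age = round(rgamma(30, 2))

* Controls get a _U suffix, mimicking the names -rangejoin- produces
preserve
keep if group_id != 0
rename (patientid group_id age) (patientid_U group_id_U age_U)
tempfile controls
save `controls'
restore

keep if group_id == 0
local n_cases = _N

* joinby + caliper filter stands in for: rangejoin age -1 1 using ..., by(sex)
joinby sex using `controls'
keep if abs(age - age_U) <= 1

* One randomly chosen control per case and control group
gen double shuffle = runiform()
by patientid group_id_U (shuffle), sort: keep if _n == 1
drop shuffle

* In how many control groups did each case find a match?
by patientid, sort: gen byte n_matched = _N
egen byte one_per_case = tag(patientid)
count if one_per_case
display "cases with at least one match: " r(N) " of `n_cases'"
tabulate n_matched if one_per_case
```

Changing the 1 in the age filter to 2 or 5 and rerunning shows how the number of incomplete triplets responds to a wider caliper.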
Rene McCrae

Join Date: Oct 2018

Posts: 7
#7

22 Oct 2018, 20:58

I applied the corrected code and it gave me 1:1:1 matching, thank you very much for your help with this!
Sakshi Rajatbhai Tewari

Join Date: Apr 2022

Posts: 53
#8

07 Sep 2022, 09:43

Hello, I had a question about -rangejoin-: is there a way to do it without replacement, so that each control is used only once?
Clyde Schechter

Join Date: Apr 2014

Posts: 30095
#9

07 Sep 2022, 15:15

The use of -rangejoin- has nothing to do with whether you end up getting matches with or without replacement. The command

Code:

by patientid group_id_U (shuffle), sort: keep if _n==1

is what produces the matches with replacement. For matches without replacement, you have to replace that command with a loop over the observations that sequentially removes all repetitions of an already used matched control.

There is no statistical advantage to using matching without replacement. In fact, most of the standard statistical results based on simple random sampling assume sampling with replacement. When you do matching without replacement you degrade the quality of your analysis in two ways. First, some cases will fail to find a matched control, because the only potential matches get taken by some other case(s). Thus the sample size goes down. Worse, the elimination of cases that cannot find a match may leave you with a biased sample--the hard to match cases are usually ones with extreme values on some of the variables. In addition, the quality of the matching itself degrades because the best match for one case may have already been taken by another case, so it gets left either with no match at all (as already discussed) or with a match of poorer quality. Matching without replacement really has nothing going for it. I don't recommend it.
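
For readers who nevertheless need it, the loop described above might be sketched as follows. This is one possible greedy implementation under assumptions, not code posted in this thread: it visits the candidate pairs in random order, accepts a pair if both its control and its case-by-control-group slot are still free, and then retires the competing candidates. It uses -joinby- with an age filter in place of -rangejoin- so that it runs in official Stata, and the variables selected and available are made up for the example. The loop is quadratic in the number of candidate pairs, so it is only practical for modest data sizes.

```stata
* Demonstration data: 60 cases, 120 + 120 controls
clear
set seed 1234
set obs 300
gen long patientid = _n
gen byte group_id = cond(_n <= 60, 0, cond(_n <= 180, 1, 2))
gen byte sex = runiform() < 0.5
gen int age = round(rgamma(30, 2))

preserve
keep if group_id != 0
rename (patientid group_id age) (patientid_U group_id_U age_U)
tempfile controls
save `controls'
restore

* All eligible case-control pairs (same sex, age within 1 year)
keep if group_id == 0
joinby sex using `controls'
keep if abs(age - age_U) <= 1

* Visit candidate pairs in random order
gen double shuffle = runiform()
sort shuffle
gen byte selected = 0
gen byte available = 1

forvalues i = 1/`=_N' {
    if available[`i'] {
        quietly replace selected = 1 in `i'
        // retire this control everywhere, and retire this case's other
        // candidates from the same control group
        quietly replace available = 0 if patientid_U == patientid_U[`i'] ///
            | (patientid == patientid[`i'] & group_id_U == group_id_U[`i'])
    }
}
keep if selected
drop shuffle selected available
```

Afterwards each control appears at most once, and each case has at most one control from each group; cases whose candidates were all taken by other cases simply drop out, which is the loss of sample discussed above.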
Sakshi Rajatbhai Tewari

Join Date: Apr 2022

Posts: 53
#10

08 Sep 2022, 12:43

Oh no! What you said makes sense. However, my superior said they wanted matching without replacement, as otherwise those few repeating controls would be overrepresented. I used -calipmatch- and then, for the unmatched cases, I had to loosen the matching criteria.
What I had done initially with -rangejoin- was this: after range-joining on age, sex and race, I calculated the age differences between the cases and the matches, then sorted the age differences within each case and kept the first two.
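
The closest-two selection described here could look roughly like the sketch below. It starts from a small made-up set of already joined pairs, so the variable names (age_U, patientid_U, age_diff) and the data are assumptions for illustration, and ties in age difference are broken at random.

```stata
* Hypothetical candidate pairs as they might look after the join
clear
input long patientid double age long patientid_U double age_U
1 60 101 60.5
1 60 102 59
1 60 103 64
2 45 104 45
2 45 101 47
3 70 105 66
end

set seed 1234
gen double age_diff = abs(age - age_U)
gen double shuffle = runiform()     // random tie-breaker

* Keep the two closest controls for each case
by patientid (age_diff shuffle), sort: keep if _n <= 2
drop shuffle
list, sepby(patientid) noobs
```

Note that this is still matching with replacement: control 101 is kept for both case 1 and case 2.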
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#11

08 Sep 2022, 12:58

A followup on the statistical as opposed to data management issues here: I'm curious here about the preferred analysis for with-replacement matching: Let's say we need the odds-ratio as the estimate of effect for a matched case-control study, and we were thinking to use conditional logit. What should be done to account for the re-use of some controls? I could see ignoring it on the idea that the number of re-used controls is small enough to likely make only a trivial difference in any SE estimates, but I'd presume there's a more principled approach or argument. I didn't easily find any info. on recommended methods, so I'd be interested to hear comments on this.
Clyde Schechter

Join Date: Apr 2014

Posts: 30095
#12

08 Sep 2022, 13:08

In a multi-level model, you can use the matched-pair (or matched-tuple) itself as a level and use cluster robust standard errors. You could add a level for control-id to the model in a multiple-membership relationship to the matched pair (tuple).

Last edited by Clyde Schechter; 08 Sep 2022, 13:18.
Rich Goldstein

Join Date: Mar 2014

Posts: 4462
#13

08 Sep 2022, 13:57

In response to both #10 and #11: you use weights to control for the re-use of controls. The basic issue, conceptually, is that each case-control set should have weights that sum to 2: 1 for the case and 1 for the control. Some of the user-written programs (e.g., ultimatch) set up the weights for you as part of the matching process.