  • Question on matching in a nested case control study

    I am conducting a nested case-control study in which 3 controls are matched to cases on visit number and ethnicity. The sttocc command selects controls randomly, with replacement, from those members of the cohort who are at risk at the failure time of the case, so the same control could be selected multiple times at each visit. However, we would like to select controls without replacement.

    I have been searching the boards trying to find information about running "sttocc" without replacement. I tried the above code (posted by Mike) but it did not work for me. I don't see a cluster_set_id variable?

    Breslow's text describes a nested case-control study in which controls are sampled at each point in time without replacement, but notes that they could be sampled again later as a control, or become a case.

    I thought I could use the ccmatch command at each visit, but unfortunately it doesn't work for more than 1:1 matching. We are trying to do 3:1 matching.

    Can anybody point me to a post or a reference on how to code this incidence density sampling design in Stata, or on how to do multiple case-control matching? This seems like a common issue, so I am surprised to be having this much trouble with it.

    Thanks much,
    Rebecca


  • #2
    Some glitch happened here so that "the above code" you reference in your post does not appear. I haven't used -sttocc-, but I may have some relevant code around somewhere for something like your problem. One clumsy but perhaps workable approach that occurs to me would be to start by constructing a file with *many* more controls matched to each case (say 15:1), and then going back through the file and deleting, from each later case's set, any controls that had already been chosen for an earlier case.
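
    A rough sketch of that oversample-then-prune idea using -sttocc- itself (a sketch only, not tested; visitnum and ethnicity are placeholder names for the matching variables, and the data are assumed to be already -stset- with an id() variable):

    Code:
    sttocc, match(visitnum ethnicity) n(15)     // deliberately oversample controls per case
    bysort id _case (_set): gen byte repeatctl = (_case == 0) & (_n > 1)
    drop if repeatctl                           // keep only each control's first appearance
    bysort _set: egen nctl = total(_case == 0)  // controls remaining in each set
    drop if nctl < 3                            // discard sets left with fewer than 3 controls

    Any set still holding more than 3 controls would then be trimmed back down to 3 at random.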

    So, I'd have a couple of suggestions: First, post the code you mentioned. Second, show a small mockup of your data to clarify your situation. Third, indicate how your "point in time" variable is measured (as an actual date variable?). With enough information, I suspect your situation might not be hard to solve from "first principles" if -sttocc- does not work.

    Regards, Mike (probably but not certainly a different one than you reference in your post <grin>)



    • #3
      Hi Mike:

      You were in fact the person who helped on the previous case! The Statalist administrator suggested that, instead of replying to an old post, I start a new one, so he moved my post to a new thread. Here is the prior post: http://www.statalist.org/forums/foru...ttocc-function

      I'm conducting a nested case-control study with incidence density sampling. The cohort was followed in a 12-month study with 5 visits, and controls will be matched to cases at the same visit number.

      Before I get into the coding issue, would you mind if I picked your brain about the issue of control sampling with or without replacement (in a study of incidence density sampling)?

      Norman Breslow has a chapter in which he writes: "A nested case-control study is then possible in which controls are selected by finite population random sampling, without replacement, from non-cases in the risk set." My understanding, then, is that controls are sampled at each point in time without replacement, but that they could be sampled again later as a control, or become a case.
      http://depts.washington.edu/epidem/E...Breslow%29.pdf (See page 292.)

      In contrast, Clayton and Hills (1993) and Wacholder (1992) suggest sampling controls with replacement. They say that restricting subjects to serve as controls for only one case may lead to biased estimates, because controls would then not be sampled independently from the risk sets. Stata's command for this task, sttocc, samples with replacement.
      • Clayton, D & M Hills (1993): Statistical Models in Epidemiology. New York: Oxford University Press, Chapter 33.
      • Wacholder S, DT Silverman, JK McLaughlin, & JS Mandel (1992): Selection of Controls in Case-Control Studies, III. Design Options. American Journal of Epidemiology, 135(9): 1042-1050.

      I suspect that we could pick controls with replacement, but we are certainly not going to assay the identical sample twice, so perhaps doing it without replacement gives us the most power: to the degree that a woman's identical sample serves as a control for 2+ cases, there is a loss of information.

      Attached is a mockup of the data. Since posting yesterday, I think I found a solution to my original question using "radmatch" at each study visit. But I'm still confused by this issue of with or without replacement.

      Thanks for your advice.

      Best,
      Rebecca




      • #4
        Ordinarily, it should not be a problem if the same control is matched with different cases, as long as the control fits into both risk sets. The goal is to get the best control match for a particular case, which is presumably why the authors of -sttocc- did not program an option to select controls without replacement.

        If you were to insist on selecting controls without replacement, then you have to take into account that the trade-off is a large loss in the number of matched cases, especially if you require a large number of controls in a given case-control set. Recall that there are only so many controls that match a particular case; if two cases are similar and you exclude the controls assigned to the first case when selecting controls for the second, there may not be enough controls left for the second case, and that case is then dropped.

        This, in my opinion, is not very different from selecting controls with replacement and then identifying and keeping only the unique case-control sets, since Stata already does a good job of minimizing repeated controls. I use Example 1, p. 526 of the Stata manual to illustrate:

        http://www.stata.com/manuals13/st.pdf



        Code:
        webuse diet
        stset dox, failure(fail) enter(time doe) id(id) origin(time dob) scale(365.25)

        Note that we have 337 subjects in the sample and 80 failures in single failure-per-subject data. Therefore, 80 is the number of case-control sets that can be generated. You can also see this by typing:


        . sum id if fail>0

            Variable |        Obs        Mean    Std. Dev.       Min        Max
        -------------+---------------------------------------------------------
                  id |         80     165.775    95.15663          6        330



        We now want to convert our survival-time data to case-control data, matching cases and controls on both time and job, and using 3 controls for each case.


        Code:
        sttocc, match(job) n(3)
        gen ageentry=(doe-dob)/365.25
        gen ageexit=(dox-dob)/365.25
        list _set id _case _time ageentry ageexit job, sepby(_set)
        The last command allows us to examine whether the controls were correctly selected, and also to note that some controls appear in more than one case-control set, e.g., in rows 97 and 101.


        . list _set id _case _time ageentry ageexit job, sepby(_set)

              +-------------------------------------------------------------+
              | _set    id   _case        _time   ageentry    ageexit   job |
              |-------------------------------------------------------------|
           1. |    1    74       0     42.57358   37.09788   53.39083     0 |
           2. |    1    66       0     42.57358   40.09309    56.9692     0 |
           3. |    1    72       0     42.57358   30.33265   46.37372     0 |
           4. |    1    90       1     42.57358    31.4141   42.57358     0 |
              |-------------------------------------------------------------|
              [OUTPUT OMITTED]
              |-------------------------------------------------------------|
          97. |   25   333       0    55.381246   46.37645    61.8371     2 |
          98. |   25   308       0    55.381246   49.53046   65.68378     2 |
          99. |   25   217       0    55.381246   46.57084   65.94935     2 |
         100. |   25   311       1    55.381246   51.86311   55.38124     2 |
              |-------------------------------------------------------------|
         101. |   26   333       0    55.800137   46.37645    61.8371     2 |
         102. |   26   314       0    55.800137   41.26215   56.80767     2 |
         103. |   26   334       0    55.800137   47.32923   62.70773     2 |
         104. |   26   329       1    55.800137   42.14648   55.80014     2 |
              |-------------------------------------------------------------|
         105. |   27     7       0    56.462697   45.13895   63.18138     0 |
              +-------------------------------------------------------------+





        To identify unique case-control sets, we proceed as if we are dealing with duplicate observations.

        Code:
        quietly bysort id:  gen dup = cond(_N==1,0,_n)
        The variable dup is 0 when an id appears only once, and otherwise numbers each occurrence of a repeated id:


        . tab dup

                dup |      Freq.     Percent        Cum.
        ------------+-----------------------------------
                  0 |        156       48.75       48.75
                  1 |         71       22.19       70.94
                  2 |         71       22.19       93.13
                  3 |         17        5.31       98.44
                  4 |          5        1.56      100.00
        ------------+-----------------------------------
              Total |        320      100.00


        In our data set, we see that out of 320 observations, 164 belong to repeated ids and 156 are unique. Recall that the same control can also be a case, so when selecting unique case-control sets we have to be careful not to delete any case observations (each case appears at most once per id).

        Code:
         replace  dup=0 if  _case==1

        . tab dup

                dup |      Freq.     Percent        Cum.
        ------------+-----------------------------------
                  0 |        180       56.25       56.25
                  1 |         57       17.81       74.06
                  2 |         61       19.06       93.13
                  3 |         17        5.31       98.44
                  4 |          5        1.56      100.00
        ------------+-----------------------------------
              Total |        320      100.00



        Here, I take the easy way out and delete the duplicate observations where dup>1, but you could instead choose at random which of the duplicates to delete, i.e., the 1st (for 57 observations), 2nd (for 61 observations), 3rd (for 17 observations), or 4th (for 5 observations).

        Code:
         drop if dup>1
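
        As a sketch of the random alternative mentioned above (my own variable names, not part of the manual example), one could instead keep every case row and, for each repeated id, one randomly chosen control row rather than the first one:

        Code:
        set seed 20150204
        gen double u = runiform()
        sort id _case u                                  // within id, control rows first, in random order
        by id: gen byte keeprow = (_case == 1) | (_n == 1)
        drop if keeprow == 0
        drop u keeprow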
        Finally, identify and keep only unique case-control sets

        Code:
        gen ones=1
        bysort _set: egen count= total(ones)
        drop if count<4
        list _set id _case _time ageentry ageexit job, sepby(_set)

        . list _set id _case _time ageentry ageexit job, sepby(_set)

              +-------------------------------------------------------------+
              | _set    id   _case        _time   ageentry    ageexit   job |
              |-------------------------------------------------------------|
           1. |    1   101       0     42.57358   38.29706    49.1718     0 |
           2. |    1    90       1     42.57358    31.4141   42.57358     0 |
           3. |    1   100       0     42.57358   36.15332   46.86653     0 |
           4. |    1    94       0     42.57358   32.12594   47.91513     0 |
              |-------------------------------------------------------------|
           5. |    2   319       0      47.8987   43.52361   59.06913     2 |
           6. |    2   327       0      47.8987   43.07461   58.53525     2 |
           7. |    2   196       1      47.8987   45.46475    47.8987     2 |
           8. |    2   332       0      47.8987   42.24778   57.70842     2 |
              |-------------------------------------------------------------|
              [OUTPUT OMITTED]
              |-------------------------------------------------------------|
          85. |   74   280       1    67.964408   52.75838   67.96441     2 |
          86. |   74   211       0    67.964408   53.48939   69.99863     2 |
          87. |   74   273       0    67.964408   55.21424   69.99863     2 |
          88. |   74   208       0    67.964408   52.16975   69.99863     2 |
              |-------------------------------------------------------------|
          89. |   77   216       1    68.221766   50.62834   68.22176     2 |
          90. |   77   255       0    68.221766   50.51609    69.4757     2 |
          91. |   77   233       0    68.221766   54.05886   69.99863     2 |
          92. |   77   264       0    68.221766    52.7666   69.99863     2 |
              +-------------------------------------------------------------+




        Now you have unique case-control sets. Initially you had 80 matched cases, each with 3 controls, but of these only 23 sets are unique! In the same way that the number of unique matches declines here, you also lose a lot of cases if you choose to select controls without replacement. On the other hand, if you can live with a varying number of controls per case, e.g., ranging from a maximum of 4 down to a minimum of 1, then you stop the process right after dropping the duplicates (the -drop if dup>1- step). Each of the controls will still be unique, but their number will not be even across case-control sets.
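
        If you take that route (stopping right after -drop if dup>1- and skipping the -drop if count<4- step), a quick sketch for checking how many controls each set retains might be (nctlleft is my own variable name):

        Code:
        bysort _set: egen nctlleft = total(_case == 0)   // controls remaining in each set
        tab nctlleft if _case == 1                       // distribution of controls per retained case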




        Last edited by Andrew Musau; 04 Feb 2015, 12:44.



        • #5
          Thanks for your response, Andrew. I was hoping to find a way to retain the 3:1 matching; we have plenty of control visits. There seems to be a lot of controversy about whether to do control selection with replacement at the exact same failure time.



          • #6
            Presuming that the matching is just as good under the with- and without-replacement scenarios, I would expect sampling without replacement to give lower standard errors, since you get a larger effective N, so I think your idea is on target. However, I have looked at similar issues in propensity-score matched analyses, and there the typical argument is that the validity benefits of better-quality matching trump the precision advantages of matching without replacement.

            Anyway, I can see that the problem you have is considerably harder than it was when I tried to answer before, since incidence-density sampling makes my earlier approach infeasible. Again, how finely is time measured in your situation? That might affect the possible programming solutions. And how many persons are in the complete cohort, and how many eventually become cases? That would affect whether a brute-force approach might work.

            Regards, Mike




            • #7
              Thanks Mike. This is what I was thinking.

              In our study, time is discrete rather than continuous, due to the fixed nature of clinical study visits. In that setting, at visit 2, say, there are 20 cases and 200 potential controls (some of whom might become cases in the future, or might be selected again in the future).

              When we select controls for cases at time 2, we are trying to figure out whether we put the selected controls back in the pool to be re-selected AT TIME 2, or whether selection at that point in time should be without replacement. The textbooks are not clear on this, from my reading.

              This all said, I know there are other very important issues in control selection, and we are really getting down to minor details here.



              • #8
                I think that you should look at what other researchers in your field favor. If the number of controls is large enough, I do not see an obvious reason why your results should differ depending on the selection method that you use. The argument by Clayton and Hills (1993) and Wacholder (1992) would be relevant if "substitute" controls were very different across a wide range of attributes in the sample. If, for the most part, the differences are not significant, then one should not worry about sampling independently.

                Of course, you may want to run both methods to check this; doing so also enhances the robustness of your results!
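
                For instance, a minimal sketch of such a check (the file names are placeholders, and hienergy stands in for your exposure variable, as in the diet example): fit the same conditional logistic model to both matched data sets and compare the estimates.

                Code:
                use cc_with_replacement, clear          // matched data from -sttocc- as is
                clogit _case hienergy, group(_set) or
                estimates store withrep
                use cc_no_replacement, clear            // matched data after removing repeated controls
                clogit _case hienergy, group(_set) or
                estimates store norep
                estimates table withrep norep, b(%9.3f) se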
                Last edited by Andrew Musau; 04 Feb 2015, 14:58.



                • #9
                  Setting aside considerations of whether sampling with or without replacement is preferable, I have some code that I think does incidence density sampling without replacement. The overall strategy is to start with a file of both the cases and the controls, and use it to make a file of all possible pairs of cases and controls. Only the pairs in which the case and control match on the covariates are retained, and then the pairs that don't meet the risk-set condition are dropped. At this point, using a loop over all sets of case-control pairs, a sort of greedy sampling is performed: the controls for each case are picked, and then any other pairs involving those controls are deleted. I'm not certain that what I have done is right, or that it is the most efficient approach, but I think it's close and fast enough.

                  Code:
                  // Matched case-control sampling using incidence density sampling, with no replacement.
                  //
                  // Create example files of cases and controls to work with.
                  // Example conditions
                  clear
                  set seed 33245
                  local matchvars = "x y z"
                  local maxtime = 50
                  local pcase = 0.05 // proportion of case events among all observations
                  local ControlsPerCase = 3
                  set obs 100000 // total number of persons, cases and potential controls
                  //
                  //
                  gen int id = _n
                  gen int evtime = ceil(runiform() * `maxtime')  // time of disease event
                  replace evtime = . if runiform() > `pcase' // Many persons never have the disease event
                  // Create variables beside event time on which cases and controls would be matched.
                  foreach v of local matchvars {
                    gen `v'  = ceil(3*runiform())   // 3 values for each match variable
                  }
                  // End of preparing example data
                  //  *********************************************
                  //
                  // Within this file of cases and controls, everyone is a potential control to start with,
                  // so save everyone in this file as a source of controls.
                  compress // important to save memory
                  tempfile filecase filectl
                  rename id idctl
                  rename evtime evtimectl
                  // randomize the order of the controls
                  gen rand = runiform()
                  sort rand, stable
                  drop rand
                  save `filectl'      // file of controls
                  rename idctl idcase
                  rename evtimectl evtimecase
                  //
                  //
                  // Strip the current file down to just the cases.
                  drop if missing(evtimecase)
                  qui count
                  di r(N) " event cases in file"
                  //
                  // Pair up each case with each of the potential controls that match on the matching variables.
                  // We will worry about time of event, risk set, etc. later.
                  joinby `matchvars' using `filectl'
                  //
                  //
                  // Drop impossible pairs
                  drop if (idcase == idctl)          // self pairs
                  drop if (evtimecase >= evtimectl)  // control member is not in risk set
                  //
                  // A few details before we start incidence sampling
                  gen rand = runiform()
                  sort idcase rand, stable // randomize the order of the cc pairs
                  drop rand evtime* x y z // don't need these anymore
                  by idcase: gen byte first = (_n ==1)  // just to count cases
                  qui count if first ==1
                  di r(N) " event cases that have a potential match after considering event time"
                  //
                  //
                  // Keep the desired number of controls for each case.  For each case,
                  // remove her/his controls from all other case-control pairs so
                  // as to give sampling w/o replacement.
                  // I use a loop here over all case-control pairs, generally not Stata-ish,
                  // but it seems like a good approach here.
                  by idcase: gen int seqnum = _n   // sequence number of the c-c pair for each case
                  qui levelsof idcase, local(caselist)  // a list of all the cases
                  gen byte casedone = 0 // to mark each case as we process it.
                  foreach c of local caselist {
                      // Keep desired number of c-c pairs for this case
                      qui drop if (idcase == `c') & (seqnum > `ControlsPerCase')
                      qui replace casedone = 1 if (idcase == `c')
                      //
                      // Make a list of the controls just used. I used a clumsy approach with preserve/restore.
                      preserve
                         qui keep if (idcase == `c')   // current case/control pairs
                         local used ""  // will hold the list of controls just used
                         forval i = 1/`ControlsPerCase' {
                           local used = "`used' " + string(idctl[`i'])
                         }   
                      restore
                      //
                      //  Drop all remaining unexamined pairs that involve the controls just used
                      local used = subinstr(ltrim("`used'"), " " , ",", .)
                      qui drop if (casedone == 0 ) & inlist(idctl, `used')
                  }
                  // Report on number of cases and controls matched.
                  by idcase: gen NCtl = _N
                  tab NCtl if first, missing
                  Regards, Mike
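
                  A usage sketch (hypothetical file and variable names, not part of the code above): to run this on a real cohort, only the example-data block at the top needs replacing, along the lines of

                  Code:
                  use cohortfile, clear                     // one row per person
                  local matchvars = "visitgroup ethnicity"  // placeholder matching variables
                  local ControlsPerCase = 3
                  gen int id = _n                           // or use an existing unique person identifier
                  gen int evtime = casevisit                // visit at which the person became a case;
                                                            // must be missing for those who never become cases

                  The rest of the code can then run unchanged, except that the -drop rand evtime* x y z- line should drop your own matching variables instead of x y z.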




                    • #11
                      Mike, thanks for your posts. Much appreciated.



                      • #12
                        You're welcome. Let me know if it does not work in some respect. I did notice that the ">" character was not tolerated by the forum software, and so came out as "&gt;".

                        Regards, Mike



                        • #13
                          Hello,

                          Mike, is the "&amp;" in your code above just supposed to be "&"?

                          Thank you



                          • #14
                            Natalie--yes. The forum software sometimes takes over and does this sort of thing when it encounters logical operators.



                            • #15
                              Thank you, this code is very helpful!

