
  • Creating a fast loop for selecting controls meeting conditions in a case-control design

    Dear Stata users,

    I have a loop for selecting control observations without replacement for my case observations in a case-control study. Unfortunately, it runs slowly because the dataset is quite large (around 5000 cases and 8 million controls to select from). Thus, I'm looking to recreate the following loop in Mata, with any tweaks that may improve the execution time, as it has to be run for multiple case-control configurations. Currently, a single run is an overnight endeavor using Stata 18 on Windows Server 2012 R2.

    Due to legal reasons, I cannot disclose the dataset. Instead, I will try my best to illustrate the loop with the auto dataset.

    I am aware of the excellent calipmatch package, which addresses 95% of my issue. However, one of the matching criteria is that the control observations must have a date value that is larger than that of the case observation. To my understanding this is not possible with calipmatch, because it only allows exact matches or caliper widths extending above and below a certain score, not 'greater than' criteria. I tried to understand the calipmatch source code and modify it to my needs, but it is beyond my Mata skill level.

    Part of the reason the current loop runs slowly is that it first identifies all possible matches and only then keeps the first X of them. In this example, X is two. In my actual data, there will be 30 matches per case and two more exact matching criteria, i.e., having the same value on a categorical variable. In the following example, imagine that the variable 'price' corresponds to the date variable in the actual dataset. I have annotated the code to illustrate my intention.

    sysuse auto, clear
    generate byte case = strpos(make, "Buick") > 0       // Make a case group, in this case it is the Buick cars.
    count if case == 0                                   // 67 control observations to select from.
    count if case == 1                                   // 7 case observations that need controls.
    local ncase = r(N)                                   // The loop below runs once per case.
    gen float tmp = runiform()
    sort tmp                                             // Setting the order of the observation at random
    gen tmp2 = .                                         // Making temporary variable that will be rewritten throughout the loop.
    gen sto = .                                          // Making a storage variable that will assign matched observations with the same ID as each case.
    bysort sto case (tmp): replace sto = _n if case == 1 // Adding IDs to case numbers, keeping the random order within groups.
    forvalues i = 1/`ncase' {
        levelsof price if sto == `i', local(A)               // Saving the value of price in a local.
        replace tmp2 = `i' if price > `A' & missing(sto)     // Identifying all observations with prices higher than the ith case.
        replace sto = `i' if price > `A' & missing(sto) & sum(tmp2 == `i') <= 2    // Saving the first two matches as controls.
        replace tmp2 = .                                     // Resets the temporary storage.
    }

    Please let me know if I can help by elaborating anything further.

    Best regards Soeren

  • #2
    (A first note here is that I don't understand the approach to matching in your code -- sorry -- so I'm ignoring it.)

    Anyway, I would not assume that using Mata is the best solution to your problem, and I would not assume that a general purpose matching command like -calipmatch- (a very nice program) will offer a good approach in your situation. Rather, I'd first look to some different algorithm. I think that your thread would best be placed in the Stata section of the forum.

    Also, if you haven't already, you should look at the numerous previous threads on Statalist about selecting controls for case-control studies. (Try, e.g., "case control select replacement" in your preferred search engine.)

    Devising a good program for your situation would be easier with more realistic example data to work with. So, as a small step toward helping, here's a way to create example data for you or other people to work with, which presumes your categorical variables have "nice" distributions, which may not be true, but is reasonable for starters. The number of different values for your date variable may matter, but I've assumed about 5 years of different possible days.

    // Simulate data set.
    // Include this or similar code if you repost in the Stata section.
    clear
    set seed 12345     // any seed; makes the example reproducible
    local nctl = 8e6   // smaller values for testing would be good
    local ncase = 5000
    local years = 5
    set obs `=`nctl' + `ncase''
    gen long id = _n
    gen byte case = _n <=`ncase'
    label def cclabel 0 "control" 1 "case"
    label values case cclabel
    gen byte cat1 = runiformint(1,3)
    gen byte cat2 = runiformint(1,3)
    gen int date = runiformint(1,trunc(365.25 * `years'))   // about 5 years of distinct days

    Some approaches to a problem like this might involve -cross- or -joinby-, either of which could require excessive amounts of memory if you try to find matches for all the cases at once. You might want to consider an approach in which you randomly group the cases into sets of (say) 100, and find matches among all the available controls for that smaller set of cases, then do the next set of cases, etc.
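
    The batching idea could be sketched roughly as follows. This is only an illustration under assumptions, not a tested solution for the real data: the names k, batchsize, caseid, ctlid, casedate, ctldate, batch and u are made up here, the simulated data set is deliberately small, and conflicts within a batch are resolved in a simple way (each control is handed to one randomly chosen eligible case), which can leave a case with fewer than k controls.

```stata
* Sketch only: all variable, local and tempfile names are hypothetical.
clear
set seed 12345
set obs 20500
gen long id   = _n
gen byte case = _n <= 500
gen byte cat1 = runiformint(1, 3)
gen byte cat2 = runiformint(1, 3)
gen int  date = runiformint(1, 1826)

local k         = 2      // controls wanted per case (30 in the real data)
local batchsize = 100    // cases handled per -joinby- call

* Controls go to their own file.
preserve
keep if case == 0
rename (id date) (ctlid ctldate)
drop case
tempfile controls
save `controls'
restore

* Cases are assigned to random batches.
keep if case == 1
rename (id date) (caseid casedate)
drop case
gen double u = runiform()
sort u
gen long batch = ceil(_n / `batchsize')
drop u
summarize batch, meanonly
local nbatch = r(max)
tempfile cases matched
save `cases'

forvalues b = 1/`nbatch' {
    use if batch == `b' using `cases', clear
    joinby cat1 cat2 using `controls'                    // all pairs agreeing on cat1 and cat2
    keep if ctldate > casedate & !missing(ctldate)       // control date must exceed case date
    gen double u = runiform()
    bysort ctlid (u): keep if _n == 1                    // a control may serve only one case
    replace u = runiform()
    bysort caseid (u): keep if _n <= `k'                 // at most k random controls per case
    keep caseid ctlid
    if `b' > 1 append using `matched'
    save `matched', replace

    use `controls', clear                                // remove used controls before the next batch
    merge 1:1 ctlid using `matched', keep(master) nogenerate
    drop caseid
    save `controls', replace
}
use `matched', clear                                     // one row per selected case-control pair
```

    The batch size trades memory for the number of passes: with 8 million controls spread over nine category cells, each case in a batch can pair with several hundred thousand controls before the date filter, so a batch of 100 may already be large.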


    • #3
      Thank you for your recommendations, Mike.
      I apologize for any confusion. The approach to matching resembles that of calipmatch, where cases and their matched controls are assigned the same identification number in one variable.
      I agree that the topic may be a better fit for the general Stata section of the forum. However, the loop above can be implemented in Mata, which is what I would like to achieve. Thus, I will wait for suggestions on how to proceed with implementing this loop in Mata.
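
      For reference, a minimal Mata sketch of such a loop might look like the following. It is an assumption-laden illustration, not calipmatch code: pickcontrols() and its arguments are hypothetical names, controls are scanned in a random order, and the scan for each case stops as soon as k controls are found instead of flagging every eligible control first. Controls with a missing date are skipped.

```stata
mata:
// pickcontrols(): hypothetical helper, not part of calipmatch or official Stata.
// casevar: 0/1 case indicator; datevar: control value must exceed the case value;
// catvars: space-separated exact-match variables; setvar: existing numeric
// variable that receives the matched-set ID; k: controls wanted per case.
void pickcontrols(string scalar casevar, string scalar datevar,
                  string scalar catvars, string scalar setvar,
                  real scalar k)
{
    real colvector iscase, dt, setid, cases, ctrls
    real matrix    cats
    real scalar    i, j, c, r, found

    iscase = st_data(., casevar)
    dt     = st_data(., datevar)
    cats   = st_data(., tokens(catvars))
    setid  = J(rows(dt), 1, .)

    cases = selectindex(iscase :== 1)
    ctrls = jumble(selectindex(iscase :== 0))    // random scan order for controls

    for (i = 1; i <= rows(cases); i++) {
        c = cases[i]
        setid[c] = i
        found = 0
        // stop scanning as soon as k controls have been taken
        for (j = 1; j <= rows(ctrls) & found < k; j++) {
            r = ctrls[j]
            if (setid[r] < .) continue                  // already used (no replacement)
            if (dt[r] >= .) continue                    // missing date: never eligible
            if (dt[r] <= dt[c]) continue                // must be strictly greater
            if (cats[r, .] != cats[c, .]) continue      // exact match on all categories
            setid[r] = i
            found++
        }
    }
    st_store(., setvar, setid)
}
end

* Usage on the auto data, with price standing in for the date variable
sysuse auto, clear
set seed 2024
generate byte case = strpos(make, "Buick") > 0
generate long sto = .
mata: pickcontrols("case", "price", "foreign", "sto", 2)
```

      The early exit is the main difference from the -replace- based loop; whether it is fast enough for 8 million controls would need to be checked on realistic data, since cases with late dates force longer scans.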