Match based on two variables

Nienke Krijnen

Join Date: Jul 2024

Posts: 3
#1

Match based on two variables

24 Jul 2024, 08:53

Hello,
I am trying to do the following, but unfortunately I have not managed to find the correct code.
I have a database with patient IDs, treatment (1 or 3), gender (0 or 1) and age. I want to match all patients with treatment = 3 (72 in total) with patients with treatment = 1 (584 in total) in a 1:3 ratio (so I would like to get 216 unique IDs). The matching should be based on (closest) age and (exact) gender. Does anybody know the code for this?
I tried making two datasets and using joinby. However, I only got 188 IDs instead of 216.
Clyde Schechter

Join Date: Apr 2014

Posts: 30063
#2

24 Jul 2024, 09:11

Code:

// CREATE DEMONSTRATION DATA SET
clear*
set obs 584
gen `c(obs_t)' id = _n
gen treatment = cond(_n <= 72, 3, 1)
set seed 1234
gen sex = runiformint(0, 1)
gen age = rnormal(40, 10)

// SEPARATE INTO TWO DATA SETS
preserve
keep if treatment == 1
rename (id age) =_1
drop treatment
tempfile treatment1
save `treatment1'

restore
keep if treatment == 3
rename (id age) =_3
drop treatment

// COMBINE POTENTIAL MATCHES & SELECT BEST 3, BREAKING TIES AT RANDOM
joinby sex using `treatment1'
gen double shuffle = runiform()
gen delta = abs(age_3 - age_1)
by id_3 (delta shuffle), sort: keep if _n <= 3

This produces 216 observations consisting of 72 matched triplets.
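A quick way to confirm a result like this is to run a few checks on the match file. The sketch below is not part of the code above. It uses a tiny hypothetical stand-in data set (two cases with made-up IDs and age differences) so that it runs by itself. The variable name case_tag is also just an illustration. The same commands can be run on the real result instead.

```stata
// Hypothetical stand-in for the match file: 2 cases, 3 matches each
clear
input id_3 id_1 sex delta
1 101 0 0.1
1 102 0 0.4
1 103 0 0.9
2 201 1 0.2
2 202 1 0.3
2 203 1 1.1
end

// Every case should have exactly three matched controls;
// assert stops with an error if that is not true
by id_3, sort: assert _N == 3

// Number of distinct cases (2 in this toy file)
egen byte case_tag = tag(id_3)
count if case_tag

// Size of the age differences among the kept matches
summarize delta
```

Because joinby was done on sex, exact agreement on sex is automatic and needs no separate check.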

In the future, when asking for help with code, it is best to show the actual code you used. Saying you "tried making two datasets and using joinby" isn't really sufficient: evidently that approach can work, but you must have done something wrong along the way. Without seeing what you did, nobody can tell what that was. Also, you should show example data from your data set using the -dataex- command. In this particular case, it was easy enough to mock up a data set that matches your description. But it is better to have an example from yours, because there may be ways in which my mock-up differs from yours that will break the code I show.

If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
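As a minimal sketch of what that looks like, the block below builds a small stand-in data set and then runs -dataex- on a handful of observations. The variable names (id, treatment, gender, age) are guesses based on the description in #1, not the real names, and the values are made up.

```stata
// Stand-in data with assumed variable names; replace with the real data set
clear
set obs 20
gen id = _n
gen treatment = cond(_n <= 5, 3, 1)
gen gender = mod(_n, 2)
gen age = 30 + _n

// Show only the relevant variables and a few observations,
// then copy the output from the Results window into the post
dataex id treatment gender age in 1/10
```

Restricting the variable list and using an -in- range keeps the example short even when the full data set is large.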
Nienke Krijnen

Join Date: Jul 2024

Posts: 3
#3

24 Jul 2024, 10:00

Thank you, Clyde Schechter, for your fast and helpful response!
However, using your code, I get duplicate treatment 1 IDs (id_1), while I want to match the id_3 values with 216 unique id_1 values.
Do you know how I could solve this?

. duplicates report id_1

Duplicates in terms of id_1

--------------------------------------
   copies | observations       surplus
----------+---------------------------
        1 |          135             0
        2 |           62            31
        3 |           15            10
        4 |            4             3
--------------------------------------

Would you also like me to post the -dataex- output? It is quite extensive.
Clyde Schechter

Join Date: Apr 2014

Posts: 30063
#4

24 Jul 2024, 12:02

So, what I gave you is called matching with replacement. You are asking for matching without replacement. Many people prefer matching without replacement for aesthetic reasons: it seems nicer not to reuse the same control for different cases. These concerns are aesthetic only; matching without replacement has no statistical advantages whatsoever. And it has a few disadvantages. Sometimes you just run out of suitable matches. That could happen, for example, if the sex distribution of the two groups is markedly different: there just might not be enough different people of one sex to come up with three different matches for each case. (By case, here, I mean people with treatment = 3.) Of course, sex is usually distributed about 50/50 and is rarely the obstacle.

In most matching schemes, when one matches on age, one sets a "caliper," a maximum difference in age that is acceptable. It is not at all uncommon to run out of acceptable age matches. Now, you didn't set a caliper: you just said closest age matches. So we won't have that problem. But you still might be unhappy with the results because some of the matches might have large age differences that defeat the purpose of matching on age. (I worry about that in a situation like yours because in clinical practice, treatment selection is often heavily influenced by age.)
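Both concerns can be looked at before any matching is done. The following is only a sketch. It rebuilds demonstration data in the same way as the code in #2 rather than using the real data, and the choice of summaries is an assumption about what is useful to inspect.

```stata
// Demonstration data built as in #2 (not the real data)
clear
set obs 584
gen id = _n
gen treatment = cond(_n <= 72, 3, 1)
set seed 1234
gen sex = runiformint(0, 1)
gen age = rnormal(40, 10)

// Cases and potential controls by sex: for 1:3 matching without
// replacement, each sex needs at least 3 controls per case
tabulate sex treatment

// Age distribution within each sex and treatment group; very different
// distributions warn of large age differences among the matches
bysort sex treatment: summarize age
```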

So, with the understanding that there is no good statistical reason to do this and it may produce poor results, here is how you can do matching without replacement.

Code:

// CREATE DEMONSTRATION DATA SET
clear*
set obs 584
gen `c(obs_t)' id = _n
gen treatment = cond(_n <= 72, 3, 1)
set seed 1234
gen sex = runiformint(0, 1)
gen age = rnormal(40, 10)

// SEPARATE INTO TWO DATA SETS
preserve
keep if treatment == 1
rename (id age) =_1
drop treatment
tempfile treatment1
save `treatment1'

restore
keep if treatment == 3
rename (id age) =_3
drop treatment

// COMBINE POTENTIAL MATCHES & SELECT BEST 3, BREAKING TIES AT RANDOM
joinby sex using `treatment1'
gen double shuffle = runiform()
gen delta = abs(age_3 - age_1)

local allocation_ratio 3
local current 1

sort id_3 (delta shuffle)
while `current' < _N {
    local end_current = `current' + `allocation_ratio' - 1
    // KEEP REQUIRED # OF MATCHES FOR THE CURRENT CASE
    drop if id_3 == id_3[`current'] in `=`end_current'+1'/L
    // REMOVE THE SELECTED MATCHES FROM FURTHER CONSIDERATION
    forvalues i = 0/`=`allocation_ratio'-1' {
        drop if id_1 == id_1[`current'+`i'] & _n > `end_current'
    }
    local current = `end_current' + 1
}

If, as in my demonstration data, the sex-specific age distributions of the two treatment groups are nearly the same, then you should be able to get reasonable matches this way. Try it and see. But if they aren't, you may find that some id_3's end up with matches having an unreasonably large age difference. If that happens, that is the inescapable price you pay for matching without replacement.
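One way to "try it and see" is sketched below. It is an illustration, not part of the code above: it uses a small hypothetical stand-in for the matched file, and the 2-year cutoff for listing poor matches is arbitrary. The same commands can be run on the real matched data.

```stata
// Hypothetical stand-in for the result of matching without replacement
clear
input id_3 id_1 delta
1 101 0.1
1 102 0.4
1 103 0.9
2 104 0.2
2 105 0.3
2 106 3.5
end

// Stops with an error if any control (id_1) appears more than once
isid id_1

// Distribution of the age differences, to spot poor matches
summarize delta, detail

// List the worst matches, here those more than 2 years apart
list id_3 id_1 delta if delta > 2
```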
Nienke Krijnen

Join Date: Jul 2024

Posts: 3
#5

24 Jul 2024, 15:12

Thank you so much. This was exactly what I was looking for.
If I summarize the delta, the max age difference is 3.62 years and the mean is 0.36, which is very reasonable for matching purposes in our study.
Clyde Schechter

Join Date: Apr 2014

Posts: 30063
#6

24 Jul 2024, 20:30

O.P. achieved success with the code in #4. But I want to emphasize that that code was simplified a bit. The price of that simplification is that it does not work correctly when the matching is not guaranteed to produce the desired number of matches for every case. Here is the complete code that would be used when there is the possibility that not every case can achieve the full number of (or possibly even any) matches:

Code:

// CREATE DEMONSTRATION DATA SET
clear*
set obs 584
gen `c(obs_t)' id = _n
gen treatment = cond(_n <= 72, 3, 1)
set seed 1234
gen sex = runiformint(0, 1)
gen age = rnormal(40, 10)

// SEPARATE INTO TWO DATA SETS
preserve
keep if treatment == 1
rename (id age) =_1
drop treatment
tempfile treatment1
save `treatment1'

restore
keep if treatment == 3
rename (id age) =_3
drop treatment

// COMBINE POTENTIAL MATCHES & SELECT BEST 3, BREAKING TIES AT RANDOM
joinby sex using `treatment1'
gen double shuffle = runiform()
gen delta = abs(age_3 - age_1)

// ENFORCE AGE CALIPER OF 0.5 YR
local caliper 0.5
drop if delta > `caliper'

local allocation_ratio 3
local current 1

sort id_3 (delta shuffle)
while `current' < _N {
    local end_current = `current' + `allocation_ratio' - 1
    while id_3[`end_current'] != id_3[`current'] {
        local end_current = `end_current' - 1
    }
    // KEEP REQUIRED # OF MATCHES FOR THE CURRENT CASE
    drop if id_3 == id_3[`current'] in `=`end_current'+1'/L
    // REMOVE THE SELECTED MATCHES FROM FURTHER CONSIDERATION
    forvalues i = 0/`=`allocation_ratio'-1' {
        drop if id_1 == id_1[`current'+`i'] & _n > `end_current'
    }
    local current = `end_current' + 1
}

Here, we imposed the restriction that the age match must always be within 0.5 years. With that restriction, it is not possible to match every case to three non-cases in these data. In fact, only 67 of the original 72 cases find any match at all. This code leaves in memory each case matched with up to 3 non-cases, as many as can be found after we remove already matched non-cases from being considered again. The previous code would produce an unacceptable result here: it would fail to remove some already matched non-cases, so that the match would partially be with replacement.
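To see how many matches each case ended up with, something like the following could be run on the result. This is a sketch on a hypothetical stand-in file in which the three cases have 3, 2 and 1 matches. The names n_matches and case_tag are illustrative only.

```stata
// Hypothetical stand-in: case 1 has 3 matches, case 2 has 2, case 3 has 1
clear
input id_3 id_1 delta
1 101 0.1
1 102 0.2
1 103 0.4
2 104 0.1
2 105 0.3
3 106 0.2
end

// Number of controls retained for each case
by id_3, sort: gen n_matches = _N

// Tag one row per case, then tabulate matches per case
egen byte case_tag = tag(id_3)
tabulate n_matches if case_tag

// Cases with no match at all are absent from the file, so compare this
// count with the original number of cases
count if case_tag

// Stops with an error if any control was used more than once
isid id_1
```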

And again, let me emphasize the statistical point I made earlier. There is no statistical advantage at all to matching without replacement. Moreover, matching without replacement may lead to some cases going unmatched or receiving fewer than the desired number of matches.
Rich Goldstein

Join Date: Mar 2014

Posts: 4457
#7

25 Jul 2024, 06:19

While I generally agree with Clyde Schechter, there is one case where matching with replacement can be a problem: when bootstrapping the results. As shown in Abadie, A. and Imbens, G.W. (2008), "On the failure of the bootstrap for matching estimators," Econometrica, 76(6): 1537-1557, the combination is not consistent (where "consistent" is defined to mean that as N gets larger and larger, the estimated value gets closer and closer to (but is not required to reach) the population value).
Clyde Schechter

Join Date: Apr 2014

Posts: 30063
#8

25 Jul 2024, 08:21

Rich Goldstein Thanks for pointing that out.
