Matching cases and controls based on age and gender

Kate Ingarfield

Join Date: Nov 2018

Posts: 4
#46

05 Nov 2018, 03:00

Originally posted by Clyde Schechter

So, something like this:

Code:

use my_data, clear

// SEPARATE CASES FROM CONTROLS
// AND DISTINGUISH VARIABLE NAMES
preserve
keep if group == 2
rename * *_control
rename age_control age
rename sex_control sex
tempfile controls
save `controls'
restore

keep if group == 1
rename * *_case
rename age_case age
rename sex_case sex

// NOW JOIN ON AGE AND SEX
joinby age sex using `controls'

// RANDOMLY SELECT ONE MATCH IF THERE ARE MORE
set seed 1234 // OR WHATEVER RANDOM NUMBER SEED YOU LIKE
gen double shuffle = runiform()
by case_id (shuffle), sort: keep if _n == 1
drop shuffle

The above will provide exact matches on age and sex. Now, in most real world situations, you won't be able to get enough matches with exact age. So typically people set some window, maybe 5 years, and require that the match be at least that close, if not exact. The code would be largely the same:

Code:

use my_data, clear

// SEPARATE CASES FROM CONTROLS
// AND DISTINGUISH VARIABLE NAMES
preserve
keep if group == 2
tempfile controls
save `controls'
restore

keep if group == 1

// NOW JOIN ON AGE AND SEX
// ALLOW WINDOW FROM 5 YEARS BELOW TO 5 YEARS ABOVE
rangejoin age -5 5 using `controls', by(sex)

// RANDOMLY SELECT ONE MATCH IF THERE ARE MORE
set seed 1234 // OR WHATEVER RANDOM NUMBER SEED YOU LIKE
gen double shuffle = runiform()
by case_id (shuffle), sort: keep if _n == 1
drop shuffle

Evidently, if you want a narrower or wider window, you can just change the -5 and 5 in the -rangejoin- command to whatever you like.

Note that when using -rangejoin-, it is unnecessary to rename variables as -rangejoin- will do it for you automatically.

To run the second version, you need to have the -rangejoin- command installed. It was written by Robert Picard and is available from SSC. -ssc install rangejoin-

Hello Clyde,

I find this code very helpful and I wondered if you could advise a little further, please! I have used this code to select one control per case; however, I get duplicate controls (i.e., one control matches to multiple cases). Is there any way to stop this?

Also, I would like to select 4 controls per case. How could I expand this code to do that please? Or would I need to use different code?

Thanks!

Kate
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30084
#47

05 Nov 2018, 08:45

I have used this code to select one control per case; however, I get duplicate controls (i.e., one control matches to multiple cases). Is there any way to stop this?

Yes, there is a way to stop this. But why do you want to? There is no statistical reason to avoid re-using controls for different cases. In fact, by doing so, you increase the probability of some cases finding no match at all. So think about it. If you have a really compelling reason to do this, post back and I will show you the substantially more complicated code that is needed. If you choose to do this, please be sure also to post example data that I can customize the code for. Use -dataex- for that.

Also, I would like to select 4 controls per case. How could I expand this code to do that please? Or would I need to use different code?

Just change the penultimate line from

Code:

by case_id (shuffle), sort: keep if _n == 1

to

Code:

by case_id (shuffle), sort: keep if _n <= 4
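For readers who want to see the change in context, here is a minimal sketch of the exact-match version run end to end on made-up data. The toy dataset and the variable names id, control_id, and case_id are assumptions for illustration only, not taken from anyone's real data; the only substantive line is the `_n <= 4` condition described above.

```stata
* Toy data: 3 cases (group 1) and 30 controls (group 2); all names hypothetical
clear
set seed 2018
set obs 33
gen long id = _n
gen byte group = cond(_n <= 3, 1, 2)
gen byte sex = mod(_n, 2)
gen int age = 50 + mod(_n, 3)

* Separate the controls and keep only what is needed for the join
preserve
keep if group == 2
rename id control_id
keep control_id age sex
tempfile controls
save `controls'
restore

* Pair each case with every control of the same age and sex
keep if group == 1
rename id case_id
joinby age sex using `controls'

* Randomly keep up to 4 controls per case
gen double shuffle = runiform()
by case_id (shuffle), sort: keep if _n <= 4
drop shuffle
```

A case with fewer than 4 eligible controls simply keeps all of the controls it has, so it is worth tabulating the number of controls per case afterwards.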
Comment
gayathri WDDG Abeywickrama

Join Date: Dec 2018

Posts: 1
#48

13 Feb 2019, 07:10

Dear Stata experts, I am new to Stata. I am getting large beta coefficients (such as 93.6 and 66.3 for wealth quintiles) in my multiple linear regression analysis. My outcome variable is birth weight (continuous, 400 g to 6500 g). Is that normal, or do I need to use adjusted mean birth weights to avoid this problem?

Many Thanks
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30084
#49

13 Feb 2019, 09:28

This post is completely unrelated to the topic of the thread. Please repost as a New Topic. Also, before doing that, please read the Forum FAQ for excellent advice about how to maximize your chance of getting a timely and helpful response. In particular, your post provides far too little information for anybody to give you a sensible answer. At a minimum you need to show the actual code you ran and the actual output you got from Stata. In addition, for your particular question, an example of your data would be helpful. Finally, for those who do not normally work in this domain, provide some explanation why you think that the coefficients you are getting are unreasonably large. (This is a multi-disciplinary forum; when posting you should never assume that others here are familiar with the subject matter of your research. The only common knowledge here is statistics, Stata, and whatever any college-educated person around the world could be assumed to know.)
Comment
ashraf abugroun

Join Date: Nov 2018

Posts: 37
#50

13 Feb 2019, 21:23

Very helpful thread. The code above works for 1:1 matching. How can I tweak that code to work for 4:1 matching?
Thanks.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30084
#51

13 Feb 2019, 22:01

See #47, where that very question was answered.
Comment
Vincent Samuel

Join Date: Jan 2017

Posts: 1
#52

08 Jul 2019, 04:45

I have a similar but slightly different question. I want to match cases to controls. We have two IDs. The first one is the participant ID and the second one is the case ID, which is used to match controls to a case. Each case has 3 controls. How do I create the matched set ID in order to run a conditional logistic regression?
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2413
#53

08 Jul 2019, 09:11

You say you have a case ID that is used to match controls to a case. This sounds like a matched set ID to me, so it's hard to understand what the problem is, as this variable would simply be specified in the -group()- option of -clogit-. I presume I'm misunderstanding, so in order to clarify the problem, I'd suggest you post some sample data using -dataex-, as is described and recommended in the Statalist FAQ.
Comment
Abby Chew

Join Date: Mar 2020

Posts: 13
#54

04 Mar 2020, 14:04

I am doing a matched case-control study as well, and thank you for your code. However, my problem right now is that the data are not in long format: each case and its controls are in the same row. How can I use the -clogit- command? I want to see which variables are the main predictors of case status using -clogit-. Do I need to restructure my data? Thank you in advance.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30084
#55

04 Mar 2020, 14:38

You need to use the -reshape- command to convert your data from wide to long. Read -help reshape-. It is a somewhat difficult command for people to grasp at first. If you do not see how to apply it to your data, post back, using -dataex- to show example data.

If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
Comment

Abby Chew

Join Date: Mar 2020
Posts: 13

#56

04 Mar 2020, 15:52

I matched by age ±1 and sex, so each of my cases (_ca) has 2 controls (_c and _c_U), I guess. But actually I only need 1 case. Moreover, I need the data in long format with a match_ID for the -clogit- command to work. What is the best way to generate that? Thank you in advance.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input double(clinic_ca bmi_v1_ca calciumfromdairyservings_v1_ca clinic_c bmi_v1_c calciumfromdairyservings_v1_c clinic_c_U bmi_v1_c_U calcium_v1_c_U)
 770179 19.8       . 1786684 21.1 . 1786684 21.1 2622.44246575342
 935100 27.5       . 2640255 32.3 . 2894943 26.3 1040.66395547945
1227138 26.8       . 2873105 33.5 . 1839203 26.2 870.183133561644
1503561 23.5       . 2989521 33.6 . 3138082 29.3 994.717465753425
1598525 36.7  1.3589 2873105 33.5 . 1839203 26.2 870.183133561644
1620273 25.8 3.53973 2766647   35 . 2411038 28.4 1025.26746575342
1666901 32.3       . 7075806 28.8 . 3357238 21.7  1863.0676369863
1789203   27       . 1591829 40.8 . 3839462   24 1689.32808219178
1795922 38.4  .56986 3138082 29.3 . 1459900 41.1 762.363955479452
1802766 29.7       . 1591829 40.8 . 7075806 28.8         461.1875
end

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30084
#57

04 Mar 2020, 16:14

This will set you up to use -clogit-:

Code:

gen long match_id = _n
reshape long clinic bmi_v1 calciumfromdairyservings_v1, i(match_id) j(cc) string
gen byte case_status = cc == "_ca"

The variable case_status will be the outcome variable, and match_id will be the -group()- variable.
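As a rough illustration of how the reshaped data then feed into -clogit-, the sketch below builds a small made-up wide dataset (one row per matched pair, a single hypothetical predictor bmi with the suffixes _ca and _c), reshapes it in the same way, and fits the model. The data are random, so the estimates mean nothing; only the structure of the commands is the point.

```stata
* Toy wide data: one row per matched pair (variable names are hypothetical)
clear
set seed 57
set obs 50
gen long match_id = _n
gen double bmi_ca = rnormal(28, 5)   // value for the case
gen double bmi_c  = rnormal(27, 5)   // value for the matched control

* Wide to long: the suffix after the stub "bmi" becomes the string variable cc
reshape long bmi, i(match_id) j(cc) string

* Outcome is 1 for the case row and 0 for the control row
gen byte case_status = cc == "_ca"

* Conditional logistic regression, conditioning on the matched set
clogit case_status bmi, group(match_id)
```

Adding the -or- option to -clogit- reports odds ratios instead of coefficients.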
Comment
Nicoletta Riva

Join Date: Feb 2018

Posts: 29
#58

11 Jun 2020, 18:58

Hello!

I found this post while searching for tips on age/sex matching and I found it extremely useful!

However, I was trying to match without replacement (to randomly select one match from both the cases and the controls). I used the code suggested in post #2 to randomly select one match from the cases, and then I tried to replicate the same code to select one match from the controls, as follows:

Code:

set seed 1234
gen double shuffle = runiform()
by id_case (shuffle), sort: keep if _n == 1
drop shuffle

set seed 1234
gen double shuffle = runiform()
by id_control (shuffle), sort: keep if _n == 1
drop shuffle

Could this code be correct?

Thanks in advance

Nicoletta

Last edited by Nicoletta Riva; 11 Jun 2020, 19:01.
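One thing to note about the two-step code above: it does leave each control used at most once, but a case whose single retained control is also retained for another case is dropped entirely, even if it had other eligible controls. A possible alternative, offered only as a sketch and not as the more complicated code alluded to in #47, is a single greedy pass over the randomly ordered candidate pairs. The toy data below are invented; id_case and id_control follow the names used in the post, and chosen is a hypothetical helper variable. Greedy selection does not guarantee the largest possible number of matched pairs.

```stata
* Toy candidate pairs: every one of 4 cases paired with every one of 6 controls
clear
set seed 1234
set obs 4
gen long id_case = _n
expand 6
bysort id_case: gen long id_control = 100 + _n

* Put the candidate pairs in random order
gen double shuffle = runiform()
sort shuffle

* Greedy pass: accept a pair only if neither member is already in an accepted pair
gen byte chosen = 0
local N = _N
forvalues i = 1/`N' {
    quietly count if chosen == 1 & (id_case == id_case[`i'] | id_control == id_control[`i'])
    if r(N) == 0 {
        quietly replace chosen = 1 in `i'
    }
}
keep if chosen == 1
drop shuffle chosen

* Each case and each control now appears at most once
isid id_case
isid id_control
```

The loop runs one -count- per candidate pair, so it will be slow when the joined dataset is very large.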
Comment
kjiyengar

Join Date: Apr 2014

Posts: 13
#59

15 Jun 2020, 10:00

Hello all,
I have 24000 observations, of which about 1000 are cases and about 23000 are controls. I want to match 1 case with 1 control based on industry (Ind) and similar size (20% up or down in terms of market capitalization). I have variables such as Ind, Clean (where 1 = case and 0 = control), MarketCap (in millions of dollars), and a host of other variables which I want to compare between the case and control groups. I have Stata version 15. What would be the best way to do this? Any assistance is welcome!
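One way this could be approached with the -rangejoin- code shown earlier in the thread is sketched below. It assumes, as the -rangejoin- help file describes, that the lower and upper bounds may be given as variables rather than fixed offsets, and that variables from the using file that clash with names in memory receive the default _U suffix. The data are simulated, and firm_id, cap_low, and cap_high are hypothetical names; only Ind, Clean, and MarketCap come from the question.

```stata
* Requires the community-contributed commands rangejoin and rangestat:
*   ssc install rangestat
*   ssc install rangejoin

* Simulated data: 100 cases (Clean == 1) and 2300 controls in 5 industries
clear
set seed 59
set obs 2400
gen long firm_id = _n
gen byte Clean = _n <= 100
gen byte Ind = 1 + mod(_n, 5)
gen double MarketCap = exp(rnormal(6, 1))

* Set the controls aside
preserve
keep if Clean == 0
keep firm_id Ind MarketCap
tempfile controls
save `controls'
restore

* For each case, the acceptable size window is 80% to 120% of its own MarketCap
keep if Clean == 1
gen double cap_low  = 0.8 * MarketCap
gen double cap_high = 1.2 * MarketCap
rangejoin MarketCap cap_low cap_high using `controls', by(Ind)

* Randomly keep one eligible control per case
gen double shuffle = runiform()
by firm_id (shuffle), sort: keep if _n == 1
drop shuffle
```

Cases with no eligible control should remain in the data with missing values in the _U variables, so they can be identified and counted before any comparison between groups.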
Comment
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17706
#60

20 Jun 2020, 09:43

Raj:
I'm not an expert with this kind of stuff, so take what follows as a tentative reply.
My gut feeling is that you have to match one case with more than one control (see Example 3 under the -teffects psmatch- entry in the Stata .pdf manual).

Kind regards,
Carlo
(Stata 19.0)
Comment
