Rangejoin on STATA

Juliana Rosali

Join Date: Sep 2017

Posts: 4
#1


25 Sep 2017, 01:45

Hello everyone,
I have an enquiry regarding rangejoin. After I matched my cases to controls, I am still unable to get the pairID for the matched cases and controls. As such, I am unable to carry out any further regression analyses. Please help me on how I can get the pairID.

This is my STATA rangejoin command.
preserve
keep if faller_yn == 0
tempfile controls
save `controls'
restore
keep if faller_yn == 1
rangejoin ageofpatient -5 5 using `controls', by( gender )
set seed 1234
gen double shuffle = runiform()
by patientid (shuffle), sort: keep if _n <= 10
drop shuffle

What further commands do I need to get my pairID?

Any help is appreciated
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

25 Sep 2017, 08:16

Well, you don't have pairs. You have 10-tuples. And these 10-tuples are identified uniquely by patientid. (patientid is the id variable from the cases; the patientid from the control in each pairing will have been renamed patientid_U by -rangejoin-.) The 10-tuples, identified by variable patientid, represent the groups that you will need to specify in your analyses for purposes of cluster robust VCEs or as a level in a multi-level model.
Juliana Rosali

Join Date: Sep 2017

Posts: 4
#3

26 Sep 2017, 20:21

Thank you so much for your help!

I have some follow-up questions.
1. After matching cases to controls (on age and gender) using rangejoin, is there a way to automatically get the matches to proceed with regression directly rather than needing to manually transfer the matches to the data file?
2. Since rangejoin gives many-to-many match, is it correct to match the controls to the cases manually?
3. For case-control analysis, is it a must that one control must be matched to one case? Thank you!
Any help would be appreciated.

Last edited by Juliana Rosali; 26 Sep 2017, 20:33.
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#4

26 Sep 2017, 22:33

1. After matching cases to controls (on age and gender) using rangejoin, is there a way to automatically get the matches to proceed with regression directly rather than needing to manually transfer the matches to the data file?

First of all, you should never MANUALLY do anything in data analysis. It's error prone. And, in addition, you must keep an audit trail of everything you do: a manual operation leaves no records behind. It's hard for me to write a general description of what you need to do to proceed from your -rangejoin- results to the regression. So post back with an example of your data as it looks post-rangejoin and I'll craft some code for you. (Be sure to use the -dataex- command to post your example. Run -ssc install dataex- to get the command and then run -help dataex- to read the simple instructions for using it. -dataex- is the only helpful way to show example data in this forum.)

Since rangejoin gives many-to-many match, is it correct to match the controls to the cases manually?

As above, never do anything manually. But in any case, I don't understand your question. Your cases are already matched to the controls. What is it you were thinking of doing?

For case-control analysis, is it a must that one control must be matched to one case?

No, there is no need for that.
Juliana Rosali

Join Date: Sep 2017

Posts: 4
#5

27 Sep 2017, 10:11

Dear Professor Schechter,

Thank you for the swift reply and your kind offer to help.

I have matched my cases to controls (by age and gender) using -rangejoin-.

After matching, I would like to perform conditional logistic regression to examine the association between my independent variable and dependent variable using the -clogit- command.

The -rangejoin- results (posted below) consist of PatientID, the matching variables (age and gender), dependent variable, independent variable of interest and covariates that I would like to adjust for.

The results give many-to-many match i.e. a case can be matched with multiple controls and vice versa

My results are shown below.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte(patientid dependant_variable age gender indepedant_variable covariate1 covariate2 patientid_u dependant_variable_u age_u indepedant_variable_u covariate1_u covariate2_u)
1 1 74 0 1 0 1 10 0 75 1 1 1
1 1 74 0 1 0 1  8 0 72 0 0 1
2 1 78 1 2 1 1  9 0 79 1 0 1
3 1 94 1 1 0 0  . .  . . . .
4 1 93 0 2 1 1  6 0 92 1 0 0
4 1 93 0 2 1 1  7 0 92 0 1 0
5 1 82 1 2 0 0  9 0 79 1 0 1
end

ID1 (Case) is matched with ID8 and ID10 (Controls);

ID9 (Control) is matched with ID2 and ID5 (Cases)

What I have done previously was that I created a new column in my original dataset and manually transferred the matches.

This is because -clogit- command requires me to indicate the "IDs" that link the cases with their matched controls in one single column.
However, since the -rangejoin- results place cases and controls (as well as their respective (in)dependent variables and covariates in separate columns), I am not sure how to proceed to regression automatically.

Thank you very much in advance for your kind help.

Clyde Schechter

Join Date: Apr 2014
Posts: 30100
#6

27 Sep 2017, 10:36

This is a -reshape- problem. It just requires a little fiddling with the data first. Also, after -reshape- there will be as many copies of each case observation as there are matched controls, but we only want to have one, so we have to remove the extras. Once those things are taken care of, you are ready for -clogit-.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte(patientid dependant_variable age gender indepedant_variable covariate1 covariate2 patientid_u dependant_variable_u age_u indepedant_variable_u covariate1_u covariate2_u)
1 1 74 0 1 0 1 10 0 75 1 1 1
1 1 74 0 1 0 1  8 0 72 0 0 1
2 1 78 1 2 1 1  9 0 79 1 0 1
3 1 94 1 1 0 0  . .  . . . .
4 1 93 0 2 1 1  6 0 92 1 0 0
4 1 93 0 2 1 1  7 0 92 0 1 0
5 1 82 1 2 0 0  9 0 79 1 0 1
end

//    PREPARE DATA FOR RESHAPE
ds *_u, not
local vbles `r(varlist)'
rename (`vbles') =_case
rename *_u *_ctrl
gen long obs_num = _n

//    THE CASE PATIENT ID UNIQUELY IDENTIFIES TUPLES
clonevar group_id = patientid_case

//    GO TO LONG LAYOUT
reshape long  `vbles', i(obs_num) j(cc) string
drop obs_num

//    REMOVE DUPLICATE OBSERVATIONS FOR CASES
duplicates drop if cc == "_case"

//    READY FOR CLOGIT
clogit dependant_variable i.indepedant_variable covariate*, group(group_id) iterate(1)

Note: In your example data, -clogit- does not converge. I don't think you will have this problem with your real data as it arises from the small sample and some patterns in the example data. If you do have this problem, however, it is not attributable to this case-control setup.


Maria Carabello

Join Date: Dec 2018

Posts: 3
#7

01 Aug 2019, 22:19

Hello,

I am attempting to run a very similar analysis to what Juliana first mentioned on this thread. I was able to successfully reshape my data into long format, and now there is a _case and _ctrl version of every variable in my dataset. As such, I would not be able to run the -clogit- model as you specified here, as I now have two versions of my dependent variable (depvar_case + depvar_ctrl), and this is also true for the independent variable, covariates, etc. Did I do something incorrectly?

Thank you in advance, any help would be greatly appreciated.
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#8

01 Aug 2019, 22:42

I was able to successfully reshape my data into long format, and now there is a _case and _ctrl version of every variable in my dataset.

That statement is self-contradictory. If there is a _case and _ctrl version of every variable then you have reshaped your data into wide format. So either you took the right data layout and goofed by -reshape-ing it to wide, or you didn't successfully get through the -reshape long- part of the code in #6.

Either way, the solution is to get the data into long format (where there are separate observations, not separate variables, for cases and controls.) This will prove helpful not just for running the -clogit- command, but for nearly everything. Data in wide layout is unwieldy to work with in Stata, except for a small number of commands.
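
For anyone unsure which layout they have, here is a minimal sketch of the difference using a made-up toy data set. The variables y and x are hypothetical stand-ins for a dependent variable and a predictor; only obs_num and cc mirror the names used in #6.

Code:

* Toy data in WIDE layout: separate _case and _ctrl variables (hypothetical values)
clear
input byte(obs_num y_case x_case y_ctrl x_ctrl)
1 1 2 0 1
2 1 1 0 0
end

* Go to LONG layout: one observation per person, with cc marking case or control
reshape long y x, i(obs_num) j(cc) string
list, sepby(obs_num)

After the -reshape long- there should be a single y and a single x, plus the string variable cc holding "_case" or "_ctrl". If -describe- still shows y_case and y_ctrl, the data are still in wide layout.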
jennifer Thornton

Join Date: Sep 2018

Posts: 18
#9

17 Nov 2022, 17:26

I am attempting a very similar analysis as well. I am attempting to use the rangejoin command to match on two different variables, each within a certain range. I want to match on age at diagnosis (within 3 years), presumed diagnosis date (within 1 year) and sex (exact). Here is my code. I am getting an error message about an extra argument after key variable age_at_diag.

preserve
keep if case== 1
rename * *_CASE
rename age_at_diag_CASE age_at_diag
rename male_sex_CASE male_sex
rename presumed_mm_diag_date_CASE presumed_mm_diag_date
tempfile CASE
save "D:\shared\CASE.dta"

restore
keep if case == 0
rename * *_CONTROL
rename age_at_diag_CONTROL age_at_diag
rename male_sex_CONTROL male_sex
rename presumed_mm_diag_date_CONTROL presumed_mm_diag_date

rangejoin presumed_mm_diag_date -1 1 age_at_diag -3 3 using "D:\shared\CASE.dta", by(male_sex)

If anyone has any advice I would sincerely appreciate it. Thank you!
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#10

17 Nov 2022, 17:34

-rangejoin- does not have the ability to range-match on two different variables in a single command. You have to pick one of them, and then use a -keep if- command to enforce the second restriction. For example:

Code:

rangejoin presumed_mm_diag_date -1 1 using "D:\shared\CASE.dta", by(male_sex)
keep if abs(age_at_diag - age_at_diag_U) <= 3
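
As a quick sanity check after the -rangejoin- and -keep if- commands, something like the following should run without error. This is only a sketch: it assumes -rangejoin- added its default _U suffix to the variables brought in from the using file, as in the -keep if- condition above.

Code:

* Every remaining pairing should satisfy both restrictions
assert abs(age_at_diag - age_at_diag_U) <= 3
assert inrange(presumed_mm_diag_date_U - presumed_mm_diag_date, -1, 1)

* How many pairings survived, by sex
tabulate male_sex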
George Ford

Join Date: Aug 2014

Posts: 3152
#11

18 Nov 2022, 11:31

Try -cem- instead. You can specify bins, which will get you close to the same result, and you will then have the weights/match variables to use for regression.
jennifer Thornton

Join Date: Sep 2018

Posts: 18
#12

07 Dec 2022, 13:50

Thank you so much for the help! Using the "keep if" command worked perfectly! I really appreciate it.
jennifer Thornton

Join Date: Sep 2018

Posts: 18
#13

09 Dec 2022, 13:30

I have a follow-up question on the rangejoin command. I realized I made a mistake by trying to match my cases and controls using the interval presumed_mm_diag_date -1 1, since that would only match on dates within 1 day of each other. What I really want to do is match on dates within 365 days (i.e., 1 year) of one another.

However, when I try to amend my code as shown below, STATA goes into the "not responding" mode after a few minutes of showing the "hourglass" symbol. Do you have any suggestions please? Thank you.
rangejoin presumed_mm_diag_date -365 365 using "D:\shared\CASE.dta", by(male_sex)
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#14

09 Dec 2022, 14:13

There are a couple of possibilities here.

If your data set is very large, it may be that the number of cases of the same sex with presumed mm diagnosis date within 1 year of a given control is also very large. This may simply mean that you're going to have to be very patient while the process goes to completion.

Let's say you start with 100,000 observations, half of them cases and half controls, and each of those groups half men and half women. If all of the diagnosis dates in the entire data set fell within a single range of 365 days (so the interval condition is always met), then each of the 50,000 controls would find 25,000 acceptable matches. The resulting data set that -rangejoin- is trying to create would have 1.25 billion observations! That's going to be very slow. Now, what I have presented here is a worst case scenario. But it might be that your actual situation is not all that far from this situation.

You can do the calculation for your actual situation to see if you are leading -rangejoin- to create some gigantic data set that doesn't seem sensible for your needs. If so, the solution is usually to make the match stricter. The easy way to do this is to add more discrete variables to the -by()- option, assuming there are some that make sense to match on. Another way is to narrow the acceptable range of diagnosis dates. Agreed that 1 day is too short, but what about 3 months, or 6 months?
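
One rough way to do that calculation is sketched below. It assumes the full data set (before splitting into cases and controls) is in memory, that it contains both cases and controls, and that the variables are named case and male_sex as in #9. It ignores the date window entirely, so it gives an upper bound on the number of pairings, not the actual count.

Code:

* Upper bound on the number of pairings -rangejoin- could create, by sex
preserve
contract male_sex case, nomiss               // _freq = observations per sex-by-status cell
reshape wide _freq, i(male_sex) j(case)      // _freq0 = controls, _freq1 = cases
gen double max_pairs = _freq0 * _freq1
format max_pairs %15.0fc
list male_sex _freq0 _freq1 max_pairs
restore

With the hypothetical numbers above, each sex would show 25,000 controls times 25,000 cases, or 625 million potential pairings, for a total of 1.25 billion.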

I would not be concerned about the "not responding" behavior. You can see this whenever Stata is engaged in an operation that is programmed to be uninterruptible and happens to be taking a long time. By itself, this means nothing except that you need to be patient.

Another possibility has nothing to do with -rangejoin- at all. If this "D:" drive is a network drive, it may be that Stata is simply being delayed by the long time required to access that drive, read from it, and write temporary things on it. If that is what's happening, possible solutions are to run your program at a time of day when network traffic is lighter, request higher access priority from the IT administrator of the network, or run this on a local drive instead of the network.
jennifer Thornton

Join Date: Sep 2018

Posts: 18
#15

12 Dec 2022, 10:11

Thank you very much for your help. All of this is extremely informative. You are correct that my dataset is very large, so I am working to add more discrete variables to match on. And I am glad to know that I don't need to worry about the "not responding" behavior of STATA; I will be patient. I am also looking into the possibility of running the program at a time of day when the D: drive network traffic is lighter (I am not permitted to move the data to a local drive). Thank you again!