New on SSC: range join - a program to form pairwise combinations using a range

Robert Picard

Join Date: Mar 2014
Posts: 1536

New on SSC: range join - a program to form pairwise combinations using a range

30 Mar 2016, 15:25

Thanks for Kit Baum, a new program called rangejoin is now available from SSC. Stata 11 is required.

rangejoin forms pairwise combinations between observations in memory and observations in a using dataset when the value of a key variable in the using dataset is within the range specified by observations in the data in memory.

To install, type in Stata's command window:

Code:

ssc install rangejoin

Once installed, type

Code:

help rangejoin

to get more information.

rangejoin leverages the power of the recently announced rangestat (from SSC, with Roberto Ferrer and Nick Cox) to implement what amounts to a joinby over a range of values. The observations in memory define the low and high bounds of the interval to use and the values of a key variable in the using dataset determines which observations are to be paired.

Without rangejoin, you would have to follow these steps to achieve the same results:

rename variables in memory or in the using dataset to avoid name conflicts, and
use cross or joinby to form all pairwise combinations, and
use keep if inrange(keyvar, low, high) to reduce the data to pairwise combinations that are within the desired range.

With even moderately small datasets, the overhead required to form all pairwise combinations first and then pare down to observations in range is hugely inefficient. For larger datasets, this brute force approach may not be feasible at all.

rangejoin can be used with a stored copy of the data in memory to form the same pairwise combinations that rangestat makes to calculates its statistics. This could be useful if you need to impose an additional condition that can't be specified with rangestat.

You can also do things that you could not do with rangestat, that is you can form pairwise combinations over completely different datasets. Suppose that you have data on medical events:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(patient event_id event_type) str10 event_date float eventday
1 1 345 "01/02/2015" 20090
1 2 543 "01/02/2015" 20090
1 3 676 "01/20/2015" 20108
2 1 567 "01/10/2015" 20098
2 2 999 "01/20/2015" 20108
end
format %td eventday
save "events.dta", replace

and you have in memory data on prescription fills:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(patient Rx_id) str10 fill_date float(supply drug_id fillday)
1 1 "01/01/2015" 30 88 20089
1 2 "01/01/2015" 90 99 20089
1 3 "01/15/2015" 15 77 20103
1 4 "01/17/2015" 21 33 20105
2 1 "01/01/2015" 15 22 20089
2 2 "01/07/2015" 30 44 20095
end
format %td fillday

You can match each prescription with medical events that occurred during the fill period:

Code:

* the day of the last pill
gen lastday = fillday + supply - 1
format %td lastday

rangejoin eventday fillday lastday using "events.dta", by(patient)
list patient Rx_id drug_id fillday lastday event_date event_type , sepby(patient Rx_id)

and the results:

Code:

. list patient Rx_id drug_id fillday lastday event_date event_type , sepby(patient Rx_id)

     +---------------------------------------------------------------------------+
     | patient   Rx_id   drug_id     fillday     lastday   event_date   event~pe |
     |---------------------------------------------------------------------------|
  1. |       1       1        88   01jan2015   30jan2015   01/02/2015        345 |
  2. |       1       1        88   01jan2015   30jan2015   01/02/2015        543 |
  3. |       1       1        88   01jan2015   30jan2015   01/20/2015        676 |
     |---------------------------------------------------------------------------|
  4. |       1       2        99   01jan2015   31mar2015   01/02/2015        345 |
  5. |       1       2        99   01jan2015   31mar2015   01/02/2015        543 |
  6. |       1       2        99   01jan2015   31mar2015   01/20/2015        676 |
     |---------------------------------------------------------------------------|
  7. |       1       3        77   15jan2015   29jan2015   01/20/2015        676 |
     |---------------------------------------------------------------------------|
  8. |       1       4        33   17jan2015   06feb2015   01/20/2015        676 |
     |---------------------------------------------------------------------------|
  9. |       2       1        22   01jan2015   15jan2015   01/10/2015        567 |
     |---------------------------------------------------------------------------|
 10. |       2       2        44   07jan2015   05feb2015   01/10/2015        567 |
 11. |       2       2        44   07jan2015   05feb2015   01/20/2015        999 |
     +---------------------------------------------------------------------------+

You can reorder the observations and show, for each medical event, which prescriptions were being taken at the time of the event:

Code:

. sort patient event_id Rx_id

. list patient event_id eventday event_type Rx_id drug_id fillday lastday, sepby(patient event_id)

     +-------------------------------------------------------------------------------------+
     | patient   event_id    eventday   event~pe   Rx_id   drug_id     fillday     lastday |
     |-------------------------------------------------------------------------------------|
  1. |       1          1   02jan2015        345       1        88   01jan2015   30jan2015 |
  2. |       1          1   02jan2015        345       2        99   01jan2015   31mar2015 |
     |-------------------------------------------------------------------------------------|
  3. |       1          2   02jan2015        543       1        88   01jan2015   30jan2015 |
  4. |       1          2   02jan2015        543       2        99   01jan2015   31mar2015 |
     |-------------------------------------------------------------------------------------|
  5. |       1          3   20jan2015        676       1        88   01jan2015   30jan2015 |
  6. |       1          3   20jan2015        676       2        99   01jan2015   31mar2015 |
  7. |       1          3   20jan2015        676       3        77   15jan2015   29jan2015 |
  8. |       1          3   20jan2015        676       4        33   17jan2015   06feb2015 |
     |-------------------------------------------------------------------------------------|
  9. |       2          1   10jan2015        567       1        22   01jan2015   15jan2015 |
 10. |       2          1   10jan2015        567       2        44   07jan2015   05feb2015 |
     |-------------------------------------------------------------------------------------|
 11. |       2          2   20jan2015        999       2        44   07jan2015   05feb2015 |
     +-------------------------------------------------------------------------------------+

Last edited by Robert Picard; 30 Mar 2016, 15:42.

Tags: fuzzy match, Joinby, merge, range, rangestat

Robert Picard

Join Date: Mar 2014

Posts: 1536
#2

05 Apr 2016, 17:38

Thanks again to Kit Baum, an updated version of rangejoin is now available on SSC. The previous version did not correctly handle observations with missing values for the interval bounds which could lead, under certain scenarios, to spurious pairings even though such observations should never match because the interval is undefined. I encourage everyone to update their copy by typing in Stata's Command window:

Code:

adoupdate rangejoin
1 like
Comment
Marvin Aliaga

Join Date: Feb 2015

Posts: 255
#3

10 Jan 2017, 10:52

Hi Robert Picard ,

A quick question, is there a way to keep the unmatched observations? For example, the events that were not matched using the ID. Something like the value 2 in the regular _merge variable.

Thanks,
Marvin
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

10 Jan 2017, 15:13

I'll assume that you do not want to match each observations in the master to observations in the using that did not match (just use joinby if that's the case).

Otherwise, just make a list of all the observations that matched, reduce to one observation per identifier, and use merge to get the unmatched observations. Here's a replay of the example in #1 with an additional 2 events that do not match.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(patient event_id event_type) str10 event_date float eventday
1 1 345 "01/02/2015" 20090
1 2 543 "01/02/2015" 20090
1 3 676 "01/20/2015" 20108
2 1 567 "01/10/2015" 20098
2 2 999 "01/20/2015" 20108
2 3 888 "01/20/2016" 20473
3 1 888 "01/20/2015" 20108
end
format %td eventday
save "events.dta", replace

* make sure each observation is uniquely identified
isid patient event_id

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(patient Rx_id) str10 fill_date float(supply drug_id fillday)
1 1 "01/01/2015" 30 88 20089
1 2 "01/01/2015" 90 99 20089
1 3 "01/15/2015" 15 77 20103
1 4 "01/17/2015" 21 33 20105
2 1 "01/01/2015" 15 22 20089
2 2 "01/07/2015" 30 44 20095
end
format %td fillday

* the day of the last pill
gen lastday = fillday + supply - 1
format %td lastday

rangejoin eventday fillday lastday using "events.dta", by(patient)
list patient Rx_id drug_id fillday lastday event_date event_type , sepby(patient Rx_id)

* find unmatched observations
bysort patient event_id: keep if _n == 1
keep patient event_id
merge 1:1 patient event_id using "events.dta", keep(using) nogen
list

Comment

Lotte van Ouw

Join Date: Feb 2022

Posts: 1
#5

25 Mar 2022, 06:48

Hi Robert Picard , thank you for explaining the option 'rangejoin'. I was wondering if you can also use multiple keyvars?
I need to match medication start and stopdates with visitdates. However, I don't want to 'match' the med to the visitdate if both the start- and enddate are before the visitdate. Because than that person is not using the medication anymore at the moment of the visit.
Can you maybe help me out?

thank you, kind regards,
Lotte
Comment
Lorens Elvang

Join Date: Apr 2023

Posts: 22
#6

12 May 2023, 04:12

Hi Lotte van Ouw ,
Did you figure out if this is possible? I'm currently interested in doing something similar.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30089
#7

12 May 2023, 10:16

#5 is a good example of why you should not address questions to particular people on Statalist. Robert Picard has not been active here for quite a while now. Others who could have answered the question posed, like me, glanced at the question, saw it was addressed to him, and didn't bother to read on. Only address your question to a particular person if a) you are asking them to elaborate on a previous response in the thread, or b) you are quite certain that they are the only person who can provide an answer.

Turning to the question itself, no, -rangejoin- does not support multiple keyvars. But multiple keyvars are not needed for this task. To pair up visits with those pharmacy records where the visit date falls between the pharmacy's stop and start dates, you only need one key variable. The trick is to get the master and using data sets right. Unlike -merge- or -joinby-, which produce the same results regardless of which data set is master and which is using, -rangejoin- treats these asymmetrically.

Code:

use pharmacy_data, clear rangejoin visit_date start_date stop_date using visit_data, // POSSIBLY SOME OPTIONS HERE
Comment
Chiara Gottheil

Join Date: Feb 2024

Posts: 3
#8

12 Feb 2024, 08:42

Hello,

I'm in a similar situation as the above user where I need to join two datasets using rangejoin, but also keep unmatched observations. I don't quite understand the code suggested by Robert above. I understand that it allows us to join the datasets, then merge it back with the original using dataset and only keep the unmatched observations, but I don't quite understand how it works (particularly the bysort and keep part). If someone could please clarify this, that would be great.

Additionally, a couple other clarifications:
1. This code leaves us with all of the unmatched observations from the using dataset, which would then need to be appended into the joined dataset to complete it, correct?
2. Any unmatched observations in the dataset in memory remain in the joined dataset, correct?

Thanks in advance!
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30089

12 Feb 2024, 11:29

I think you are not understanding what was written in #4. If you re-read the full post, not just the code,you will see that that code is not intended to, in effect, re-create the original data set but with the additional information from events.dta merged in where it is suitably matched and in range. Rather #4 assumes you want to keep the matched data and then also include in the resulting data set a single observation containing the patient and event_id variables for each patient that did not match.

If what you want to do is, instead, re-create the original data but with the additional information from events.dta, then you would use different code.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(patient event_id event_type) str10 event_date float eventday
1 1 345 "01/02/2015" 20090
1 2 543 "01/02/2015" 20090
1 3 676 "01/20/2015" 20108
2 1 567 "01/10/2015" 20098
2 2 999 "01/20/2015" 20108
2 3 888 "01/20/2016" 20473
3 1 888 "01/20/2015" 20108
end
format %td eventday
tempfile events
save `events'

* make sure each observation is uniquely identified
isid patient event_id

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(patient Rx_id) str10 fill_date float(supply drug_id fillday)
1 1 "01/01/2015" 30 88 20089
1 2 "01/01/2015" 90 99 20089
1 3 "01/15/2015" 15 77 20103
1 4 "01/17/2015" 21 33 20105
2 1 "01/01/2015" 15 22 20089
2 2 "01/07/2015" 30 44 20095
end
format %td fillday

* the day of the last pill
gen lastday = fillday + supply - 1
format %td lastday
tempfile drugs
save `drugs'


rangejoin eventday fillday lastday using `events', by(patient)
list patient Rx_id drug_id fillday lastday event_date event_type , sepby(patient Rx_id)

//    NOW MERGE WITH THE ORIGINAL DATA SET
merge m:1 patient Rx_id using `drugs', keep(match using)

Comment

Chiara Gottheil

Join Date: Feb 2024

Posts: 3
#10

16 Feb 2024, 08:11

Hi Clyde,

Thanks for your response. I tried this code, and it did not merge in the unmatched observations from the original using dataset, so I tried merging in the original using dataset (events) instead of the original master dataset (drugs) instead. I believe this gave me my desired result (the rangejoined data plus unmatched observations from the original datasets). Does this seem flawed to you? If not, thanks for helping me figure it out.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30089
#11

16 Feb 2024, 09:12

Oh, sorry. Yes, merging in the events data set is the correct solution to your problem. I really don't recall what I was thinking when I -merge-d in the drugs data set.
Comment
Chiara Gottheil

Join Date: Feb 2024

Posts: 3
#12

16 Feb 2024, 12:43

No problem, thanks again for your help!
Comment

Announcement

New on SSC: range join - a program to form pairwise combinations using a range

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment