Merge by value ranges

William Blackmon

Join Date: Oct 2016

Posts: 40
#1

Merge by value ranges

29 Jul 2022, 10:19

Hello,

Imagine you have a payment schedule for the number of widgets you produce. In the code below, this payment schedule is laid out in the "lookup" dataset. If you produce 0-3 widgets, you get paid 500. If you produce 3.0001 - 6, you get paid 600, etc.

I have a second dataset that has the number of widgets produced for a bunch of individuals ("detail" dataset, value = number of widgets produced). How might I do this in a memory-efficient way? Solutions that involve looping over individual observations don't work well in my case where I'm dealing with in the range of 100k individuals.

Thank you!!

Code:

// prep example lookup dataset clear set obs 3 gen range_lo = . gen range_hi = . gen output = . replace range_lo = 0 in 1 replace range_lo = 3 in 2 replace range_lo = 6 in 3 replace range_hi = 3 in 1 replace range_hi = 6 in 2 replace range_hi = 10 in 3 replace output = 500 in 1 replace output = 600 in 2 replace output = 800 in 3 tempfile lookup save `lookup' // prep example detail dataset clear set obs 1000 gen value = _n/100 gen output = .
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30355
#2

29 Jul 2022, 11:15

This is precisely what -rangejoin- was created to do. It is written by Robert Picard and is available from SSC. To use it, you must also install -rangestat-, by Robert Picard, Nick Cox, and Roberto Ferrer, also available from SSC.

Code:

// prep example lookup dataset clear set obs 3 gen range_lo = . gen range_hi = . gen output = . replace range_lo = 0 in 1 replace range_lo = 3 in 2 replace range_lo = 6 in 3 replace range_hi = 3 in 1 replace range_hi = 6 in 2 replace range_hi = 10 in 3 replace output = 500 in 1 replace output = 600 in 2 replace output = 800 in 3 tempfile lookup save `lookup' // prep example detail dataset clear set obs 1000 gen value = _n/100 //gen output = . UNNECESSARY, AND ACTUALLY GETS IN THE WAY tempfile detail save `detail' use `lookup', clear rangejoin value range_lo range_hi using `detail' drop if range_lo == 3 & value == 3 drop if range_lo == 6 & value == 6

Notes:
1. The variable output in the detail data set is not only unnecessary, it actually gets in the way. So omit that.
2. The -drop if...- commands at the end are needed because -rangejoin- treats the intervals as closed on both sides (except when one of the bounds is missing). This results, in your situation, in each observation where value equals one of the endpoints, in two observations: one using the lower value and the other using the higher value. Since you want the intervals to be treated as open at the lower end, these two commands fix that.

You will find -rangejoin- is both memory efficient and fast.
1 like
Comment
William Blackmon

Join Date: Oct 2016

Posts: 40
#3

29 Jul 2022, 11:17

Fantastic. Perfect response. All the gold stars go to you.
Comment

Announcement

Merge by value ranges

Comment

Comment