Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merge by value ranges

    Hello,

    Imagine you have a payment schedule for the number of widgets you produce. In the code below, this payment schedule is laid out in the "lookup" dataset. If you produce 0-3 widgets, you get paid 500. If you produce 3.0001 - 6, you get paid 600, etc.

    I have a second dataset that has the number of widgets produced for a bunch of individuals ("detail" dataset, value = number of widgets produced). How might I do this in a memory-efficient way? Solutions that involve looping over individual observations don't work well in my case where I'm dealing with in the range of 100k individuals.

    Thank you!!


    Code:
    // prep example lookup dataset
    clear
    set obs 3
    gen range_lo = .
    gen range_hi = .
    gen output = .
    replace range_lo = 0 in 1
    replace range_lo = 3 in 2
    replace range_lo = 6 in 3
    replace range_hi = 3 in 1
    replace range_hi = 6 in 2
    replace range_hi = 10 in 3
    
    replace output = 500 in 1
    replace output = 600 in 2
    replace output = 800 in 3
    
    
    tempfile lookup
    save `lookup'
    
    
    // prep example detail dataset
    clear
    set obs 1000
    gen value = _n/100
    gen output = .

  • #2
    This is precisely what -rangejoin- was created to do. It is written by Robert Picard and is available from SSC. To use it, you must also install -rangestat-, by Robert Picard, Nick Cox, and Roberto Ferrer, also available from SSC.
    Code:
    // prep example lookup dataset
    clear
    set obs 3
    gen range_lo = .
    gen range_hi = .
    gen output = .
    replace range_lo = 0 in 1
    replace range_lo = 3 in 2
    replace range_lo = 6 in 3
    replace range_hi = 3 in 1
    replace range_hi = 6 in 2
    replace range_hi = 10 in 3
    
    replace output = 500 in 1
    replace output = 600 in 2
    replace output = 800 in 3
    
    tempfile lookup
    save `lookup'
    
    
    // prep example detail dataset
    clear
    set obs 1000
    gen value = _n/100
    //gen output = .    UNNECESSARY, AND ACTUALLY GETS IN THE WAY
    
    tempfile detail
    save `detail'
    
    
    use `lookup', clear
    rangejoin value range_lo range_hi using `detail'
    drop if range_lo == 3 & value == 3
    drop if range_lo == 6 & value == 6
    Notes:
    1. The variable output in the detail data set is not only unnecessary, it actually gets in the way. So omit that.
    2. The -drop if...- commands at the end are needed because -rangejoin- treats the intervals as closed on both sides (except when one of the bounds is missing). This results, in your situation, in each observation where value equals one of the endpoints, in two observations: one using the lower value and the other using the higher value. Since you want the intervals to be treated as open at the lower end, these two commands fix that.

    You will find -rangejoin- is both memory efficient and fast.

    Comment


    • #3
      Fantastic. Perfect response. All the gold stars go to you.

      Comment

      Working...
      X