  • rangejoin updated on SSC

    Thanks to Kit Baum, a new version of rangejoin is now available on SSC. Stata 11 is required. You also need to install the latest version of rangestat (1.1.0) in order to use this updated version of rangejoin.

    To update both rangestat and rangejoin, type in Stata's Command window
    Code:
    adoupdate rangestat
    adoupdate rangejoin
    For a first install, type
    Code:
    ssc install rangestat
    ssc install rangejoin
    Once installed/updated, type
    Code:
    help rangejoin
    to get more information.
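
    Since this update depends on rangestat 1.1.0, it can be worth checking which copies Stata actually finds on the ado-path. A minimal sketch using the built-in which command, assuming both ado-files carry a version line that which can display:

    ```stata
    * Show the location of each installed ado-file, plus any version
    * lines (those starting with *!) stored at the top of the file
    which rangestat
    which rangejoin
    ```

    If which reports that a command is not found, the package is not installed, so use ssc install rather than adoupdate.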

    rangejoin forms pairwise combinations between observations in memory and observations in a using dataset when the value of a key variable in the using dataset is within the range specified by observations in the data in memory. rangejoin leverages the power of rangestat (from SSC, with Nick Cox and Roberto Ferrer) to implement what amounts to a joinby over a range of values. The observations in memory define the low and high bounds of the interval to use, and the values of a key variable in the using dataset determine which observations are to be paired.
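
    To make the pairing rule concrete, here is a brute-force sketch of the same idea using only built-in commands. This is not how rangejoin is implemented: cross forms every possible pairing first and the out-of-range pairs are then dropped, so it is only practical for small data, and it says nothing about how rangejoin treats observations without a match. The tempfile name houses is made up for the illustration, and the block is meant to be run as a do-file.

    ```stata
    * Using data: one observation per house, keyed on the asking price
    clear
    input house asking
    4 444
    5 555
    6 666
    end
    tempfile houses
    save "`houses'"

    * Data in memory: each buyer defines a low and high bound
    clear
    input str5 name low high
    Peter 300 500
    Paul  400 600
    Mary  600 700
    end

    * Form all 3 x 3 pairings, then keep those where asking is in range
    cross using "`houses'"
    keep if inrange(asking, low, high)
    sort name house
    list, sepby(name)
    ```

    With these data, four pairings survive: Mary with house 6, Paul with houses 4 and 5, and Peter with house 4.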

    This version fixes a problem in the previous version when an interval bound was computed by adding an offset to the value of the key variable and the computed bound could not be stored in a variable of the key variable's data type. This was most likely to bite when the key variable was a byte. The overflow would be treated as a missing bound and affected observations were excluded from the sample. Many thanks to Clyde Schechter for bringing this to our attention.
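
    The kind of overflow described above is easy to reproduce with plain Stata. The sketch below is only an illustration of the storage-type limit, not rangejoin's internal code; the variable names key, hi_byte and hi_int are invented for the example. A byte can hold values up to 100, so a bound of 100 + 50 cannot be stored in a byte.

    ```stata
    clear
    set obs 1
    generate byte key = 100

    * 150 does not fit in a byte, so generate stores a missing value
    generate byte hi_byte = key + 50

    * The same bound fits without trouble in a larger storage type
    generate int hi_int = key + 50

    list
    ```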

    Missing interval bounds are now allowed and handled using the rules that Stata uses for its inrange() function: if the low bound is missing, observations will match up to and including the value of the high bound. Under the same rules, if the high bound is missing, observations will match from the value of the low bound upward. If both low and high bounds are missing, all observations will match.
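
    The inrange() rules referred to above can be checked directly. The lines below exercise inrange() itself, not rangejoin, with made-up values:

    ```stata
    * Low bound missing: matches anything up to and including the high bound
    * displays 1
    display inrange(450, ., 500)
    * displays 0 because 550 is above the high bound
    display inrange(550, ., 500)

    * High bound missing: matches anything at or above the low bound
    * displays 1
    display inrange(450, 300, .)

    * Both bounds missing: any nonmissing value matches
    * displays 1
    display inrange(450, ., .)
    ```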

    The help file has been completely revamped and features a new way to present examples that can be easily tried via a click-to-run link. Each example is presented in a code block and is run as a do-file. The code that runs is exactly what is shown in the help file. Long lines are split using the /// line continuation indicator.

  • #2
    Unfortunately, I have discovered a bug in the April refresh of rangejoin that will lead to missing matches if the keyvar is of type float AND there are more observations in the dataset in memory than in the using dataset, or within groups if the by() option is specified. Here's a quick example, derived from the first example in the help file:

    Code:
    clear
    input house asking
    4 444
    5 555
    6 666
    end
    save "house_asking.dta", replace
    
    clear
    input str5 name low high
    Peter 300 500
    Paul  400 600
    Mary  600 700
    Peter 300 500
    Paul  400 600
    Mary  600 700
    end
    
    rangejoin asking low high using "house_asking.dta"
    sort name house
    list, sepby(name)
    and the results
    Code:
    . list, sepby(name)
    
         +-------------------------------------+
         |  name   low   high   house   asking |
         |-------------------------------------|
      1. |  Mary   600    700       6      666 |
      2. |  Mary   600    700       .        . |
         |-------------------------------------|
      3. |  Paul   400    600       4      444 |
      4. |  Paul   400    600       5      555 |
      5. |  Paul   400    600       .        . |
         |-------------------------------------|
      6. | Peter   300    500       4      444 |
      7. | Peter   300    500       .        . |
         +-------------------------------------+
    A new version of rangejoin that fixes the problem has been sent to Kit Baum, but it may take a few days until it makes its way to the SSC servers. In the meantime, you can sidestep the problem by recasting the keyvar to another data type. Here's how to do this using the same example as above:

    Code:
    clear
    input house asking
    4 444
    5 555
    6 666
    end
    recast double asking
    save "house_asking.dta", replace
    
    clear
    input str5 name low high
    Peter 300 500
    Paul  400 600
    Mary  600 700
    Peter 300 500
    Paul  400 600
    Mary  600 700
    end
    
    rangejoin asking low high using "house_asking.dta"
    sort name house
    list, sepby(name)
    and the correct results
    Code:
    . list, sepby(name)
    
         +-------------------------------------+
         |  name   low   high   house   asking |
         |-------------------------------------|
      1. |  Mary   600    700       6      666 |
      2. |  Mary   600    700       6      666 |
         |-------------------------------------|
      3. |  Paul   400    600       4      444 |
      4. |  Paul   400    600       4      444 |
      5. |  Paul   400    600       5      555 |
      6. |  Paul   400    600       5      555 |
         |-------------------------------------|
      7. | Peter   300    500       4      444 |
      8. | Peter   300    500       4      444 |
         +-------------------------------------+
    For the curious, rangejoin combines the data in memory (master) with the using dataset via an unmatched merge, effectively putting both datasets side by side in memory. It then uses rangestat to find matches that are in range. If there are more observations in the master than in the using, the extra observations have missing values for all variables in the using, including the keyvar. With rangestat, observations where keyvar is missing are excluded from the sample. To get around that, rangejoin replaces the missing values for keyvar with c(maxdouble) for these padded observations. Stata will promote any integer type to a larger integer type, or to a float or double, as needed (see help replace). As it turns out, Stata will not promote a float to a double; it replaces the value with a missing, which leads rangestat to exclude the observation. I should have picked up on this but none of my certification scripts had the necessary elements. My bad.
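
    The difference in how replace treats integer and float variables can be seen in isolation. This is a small sketch of the behavior described above, not an extract from rangejoin; the variable names key_byte and key_float are invented for the example.

    ```stata
    clear
    set obs 2
    generate byte  key_byte  = 1
    generate float key_float = 1

    * Integer type: replace is expected to promote the variable so that
    * it can hold the new value (see help replace)
    replace key_byte = c(maxdouble) in 2

    * Float: no promotion takes place; c(maxdouble) is too large for a
    * float, so the second observation ends up missing
    replace key_float = c(maxdouble) in 2

    * Compare the resulting storage types and values
    describe key_byte key_float
    list
    ```

    This is also why recasting the keyvar to double before calling rangejoin sidesteps the problem: a double can hold c(maxdouble) as is.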


    • #3
      Thanks to Kit Baum, a new version of rangejoin that fixes the bug described in the above post is now available from SSC. You can get the update by typing:
      Code:
      adoupdate rangejoin
      To install rangejoin for the first time, type
      Code:
      ssc install rangejoin
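
      To confirm that an installed copy is not affected, one option is to rerun the example from the previous post with the float keyvar and let assert check the result against the correct listing shown there (8 observations, none with missing house or asking). This is only a sketch of such a check, meant to be run as a do-file; the tempfile name houses is made up.

      ```stata
      clear
      input house asking
      4 444
      5 555
      6 666
      end
      * The bug only bites when the keyvar is stored as a float
      confirm float variable asking
      tempfile houses
      save "`houses'"

      * More observations in memory than in the using dataset
      clear
      input str5 name low high
      Peter 300 500
      Paul  400 600
      Mary  600 700
      Peter 300 500
      Paul  400 600
      Mary  600 700
      end

      rangejoin asking low high using "`houses'"

      * Expected with a fixed version: 8 pairings and no missing matches
      assert _N == 8
      assert !missing(house, asking)
      ```

      If either assert fails, the copy on the ado-path is most likely still the version with the bug.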
