Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • New on SSC: range join - a program to form pairwise combinations using a range

    Thanks for Kit Baum, a new program called rangejoin is now available from SSC. Stata 11 is required.

    rangejoin forms pairwise combinations between observations in memory and observations in a using dataset when the value of a key variable in the using dataset is within the range specified by observations in the data in memory.

    To install, type in Stata's command window:

    Code:
    ssc install rangejoin
    Once installed, type

    Code:
    help rangejoin
    to get more information.

    rangejoin leverages the power of the recently announced rangestat (from SSC, with Roberto Ferrer and Nick Cox) to implement what amounts to a joinby over a range of values. The observations in memory define the low and high bounds of the interval to use and the values of a key variable in the using dataset determines which observations are to be paired.

    Without rangejoin, you would have to follow these steps to achieve the same results:
    1. rename variables in memory or in the using dataset to avoid name conflicts, and
    2. use cross or joinby to form all pairwise combinations, and
    3. use keep if inrange(keyvar, low, high) to reduce the data to pairwise combinations that are within the desired range.
    With even moderately small datasets, the overhead required to form all pairwise combinations first and then pare down to observations in range is hugely inefficient. For larger datasets, this brute force approach may not be feasible at all.

    rangejoin can be used with a stored copy of the data in memory to form the same pairwise combinations that rangestat makes to calculates its statistics. This could be useful if you need to impose an additional condition that can't be specified with rangestat.

    You can also do things that you could not do with rangestat, that is you can form pairwise combinations over completely different datasets. Suppose that you have data on medical events:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float(patient event_id event_type) str10 event_date float eventday
    1 1 345 "01/02/2015" 20090
    1 2 543 "01/02/2015" 20090
    1 3 676 "01/20/2015" 20108
    2 1 567 "01/10/2015" 20098
    2 2 999 "01/20/2015" 20108
    end
    format %td eventday
    save "events.dta", replace
    and you have in memory data on prescription fills:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input float(patient Rx_id) str10 fill_date float(supply drug_id fillday)
    1 1 "01/01/2015" 30 88 20089
    1 2 "01/01/2015" 90 99 20089
    1 3 "01/15/2015" 15 77 20103
    1 4 "01/17/2015" 21 33 20105
    2 1 "01/01/2015" 15 22 20089
    2 2 "01/07/2015" 30 44 20095
    end
    format %td fillday
    You can match each prescription with medical events that occurred during the fill period:
    Code:
    * the day of the last pill
    gen lastday = fillday + supply - 1
    format %td lastday
    
    rangejoin eventday fillday lastday using "events.dta", by(patient)
    list patient Rx_id drug_id fillday lastday event_date event_type , sepby(patient Rx_id)
    and the results:

    Code:
    . list patient Rx_id drug_id fillday lastday event_date event_type , sepby(patient Rx_id)
    
         +---------------------------------------------------------------------------+
         | patient   Rx_id   drug_id     fillday     lastday   event_date   event~pe |
         |---------------------------------------------------------------------------|
      1. |       1       1        88   01jan2015   30jan2015   01/02/2015        345 |
      2. |       1       1        88   01jan2015   30jan2015   01/02/2015        543 |
      3. |       1       1        88   01jan2015   30jan2015   01/20/2015        676 |
         |---------------------------------------------------------------------------|
      4. |       1       2        99   01jan2015   31mar2015   01/02/2015        345 |
      5. |       1       2        99   01jan2015   31mar2015   01/02/2015        543 |
      6. |       1       2        99   01jan2015   31mar2015   01/20/2015        676 |
         |---------------------------------------------------------------------------|
      7. |       1       3        77   15jan2015   29jan2015   01/20/2015        676 |
         |---------------------------------------------------------------------------|
      8. |       1       4        33   17jan2015   06feb2015   01/20/2015        676 |
         |---------------------------------------------------------------------------|
      9. |       2       1        22   01jan2015   15jan2015   01/10/2015        567 |
         |---------------------------------------------------------------------------|
     10. |       2       2        44   07jan2015   05feb2015   01/10/2015        567 |
     11. |       2       2        44   07jan2015   05feb2015   01/20/2015        999 |
         +---------------------------------------------------------------------------+
    You can reorder the observations and show, for each medical event, which prescriptions were being taken at the time of the event:
    Code:
    . sort patient event_id Rx_id
    
    . list patient event_id eventday event_type Rx_id drug_id fillday lastday, sepby(patient event_id)
    
         +-------------------------------------------------------------------------------------+
         | patient   event_id    eventday   event~pe   Rx_id   drug_id     fillday     lastday |
         |-------------------------------------------------------------------------------------|
      1. |       1          1   02jan2015        345       1        88   01jan2015   30jan2015 |
      2. |       1          1   02jan2015        345       2        99   01jan2015   31mar2015 |
         |-------------------------------------------------------------------------------------|
      3. |       1          2   02jan2015        543       1        88   01jan2015   30jan2015 |
      4. |       1          2   02jan2015        543       2        99   01jan2015   31mar2015 |
         |-------------------------------------------------------------------------------------|
      5. |       1          3   20jan2015        676       1        88   01jan2015   30jan2015 |
      6. |       1          3   20jan2015        676       2        99   01jan2015   31mar2015 |
      7. |       1          3   20jan2015        676       3        77   15jan2015   29jan2015 |
      8. |       1          3   20jan2015        676       4        33   17jan2015   06feb2015 |
         |-------------------------------------------------------------------------------------|
      9. |       2          1   10jan2015        567       1        22   01jan2015   15jan2015 |
     10. |       2          1   10jan2015        567       2        44   07jan2015   05feb2015 |
         |-------------------------------------------------------------------------------------|
     11. |       2          2   20jan2015        999       2        44   07jan2015   05feb2015 |
         +-------------------------------------------------------------------------------------+
    Last edited by Robert Picard; 30 Mar 2016, 15:42.

  • #2
    Thanks again to Kit Baum, an updated version of rangejoin is now available on SSC. The previous version did not correctly handle observations with missing values for the interval bounds which could lead, under certain scenarios, to spurious pairings even though such observations should never match because the interval is undefined. I encourage everyone to update their copy by typing in Stata's Command window:

    Code:
    adoupdate rangejoin

    Comment


    • #3
      Hi Robert Picard ,

      A quick question, is there a way to keep the unmatched observations? For example, the events that were not matched using the ID. Something like the value 2 in the regular _merge variable.

      Thanks,
      Marvin

      Comment


      • #4
        I'll assume that you do not want to match each observations in the master to observations in the using that did not match (just use joinby if that's the case).

        Otherwise, just make a list of all the observations that matched, reduce to one observation per identifier, and use
        merge to get the unmatched observations. Here's a replay of the example in #1 with an additional 2 events that do not match.

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input float(patient event_id event_type) str10 event_date float eventday
        1 1 345 "01/02/2015" 20090
        1 2 543 "01/02/2015" 20090
        1 3 676 "01/20/2015" 20108
        2 1 567 "01/10/2015" 20098
        2 2 999 "01/20/2015" 20108
        2 3 888 "01/20/2016" 20473
        3 1 888 "01/20/2015" 20108
        end
        format %td eventday
        save "events.dta", replace
        
        * make sure each observation is uniquely identified
        isid patient event_id
        
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input float(patient Rx_id) str10 fill_date float(supply drug_id fillday)
        1 1 "01/01/2015" 30 88 20089
        1 2 "01/01/2015" 90 99 20089
        1 3 "01/15/2015" 15 77 20103
        1 4 "01/17/2015" 21 33 20105
        2 1 "01/01/2015" 15 22 20089
        2 2 "01/07/2015" 30 44 20095
        end
        format %td fillday
        
        * the day of the last pill
        gen lastday = fillday + supply - 1
        format %td lastday
        
        rangejoin eventday fillday lastday using "events.dta", by(patient)
        list patient Rx_id drug_id fillday lastday event_date event_type , sepby(patient Rx_id)
        
        * find unmatched observations
        bysort patient event_id: keep if _n == 1
        keep patient event_id
        merge 1:1 patient event_id using "events.dta", keep(using) nogen
        list

        Comment


        • #5
          Hi Robert Picard , thank you for explaining the option 'rangejoin'. I was wondering if you can also use multiple keyvars?
          I need to match medication start and stopdates with visitdates. However, I don't want to 'match' the med to the visitdate if both the start- and enddate are before the visitdate. Because than that person is not using the medication anymore at the moment of the visit.
          Can you maybe help me out?

          thank you, kind regards,
          Lotte

          Comment


          • #6
            Hi Lotte van Ouw ,
            Did you figure out if this is possible? I'm currently interested in doing something similar.

            Comment


            • #7
              #5 is a good example of why you should not address questions to particular people on Statalist. Robert Picard has not been active here for quite a while now. Others who could have answered the question posed, like me, glanced at the question, saw it was addressed to him, and didn't bother to read on. Only address your question to a particular person if a) you are asking them to elaborate on a previous response in the thread, or b) you are quite certain that they are the only person who can provide an answer.

              Turning to the question itself, no, -rangejoin- does not support multiple keyvars. But multiple keyvars are not needed for this task. To pair up visits with those pharmacy records where the visit date falls between the pharmacy's stop and start dates, you only need one key variable. The trick is to get the master and using data sets right. Unlike -merge- or -joinby-, which produce the same results regardless of which data set is master and which is using, -rangejoin- treats these asymmetrically.

              Code:
              use pharmacy_data, clear
              rangejoin visit_date start_date stop_date using visit_data, // POSSIBLY SOME OPTIONS HERE

              Comment


              • #8
                Hello,

                I'm in a similar situation as the above user where I need to join two datasets using rangejoin, but also keep unmatched observations. I don't quite understand the code suggested by Robert above. I understand that it allows us to join the datasets, then merge it back with the original using dataset and only keep the unmatched observations, but I don't quite understand how it works (particularly the bysort and keep part). If someone could please clarify this, that would be great.

                Additionally, a couple other clarifications:
                1. This code leaves us with all of the unmatched observations from the using dataset, which would then need to be appended into the joined dataset to complete it, correct?
                2. Any unmatched observations in the dataset in memory remain in the joined dataset, correct?

                Thanks in advance!

                Comment


                • #9
                  I think you are not understanding what was written in #4. If you re-read the full post, not just the code,you will see that that code is not intended to, in effect, re-create the original data set but with the additional information from events.dta merged in where it is suitably matched and in range. Rather #4 assumes you want to keep the matched data and then also include in the resulting data set a single observation containing the patient and event_id variables for each patient that did not match.

                  If what you want to do is, instead, re-create the original data but with the additional information from events.dta, then you would use different code.

                  Code:
                  * Example generated by -dataex-. To install: ssc install dataex
                  clear
                  input float(patient event_id event_type) str10 event_date float eventday
                  1 1 345 "01/02/2015" 20090
                  1 2 543 "01/02/2015" 20090
                  1 3 676 "01/20/2015" 20108
                  2 1 567 "01/10/2015" 20098
                  2 2 999 "01/20/2015" 20108
                  2 3 888 "01/20/2016" 20473
                  3 1 888 "01/20/2015" 20108
                  end
                  format %td eventday
                  tempfile events
                  save `events'
                  
                  * make sure each observation is uniquely identified
                  isid patient event_id
                  
                  * Example generated by -dataex-. To install: ssc install dataex
                  clear
                  input float(patient Rx_id) str10 fill_date float(supply drug_id fillday)
                  1 1 "01/01/2015" 30 88 20089
                  1 2 "01/01/2015" 90 99 20089
                  1 3 "01/15/2015" 15 77 20103
                  1 4 "01/17/2015" 21 33 20105
                  2 1 "01/01/2015" 15 22 20089
                  2 2 "01/07/2015" 30 44 20095
                  end
                  format %td fillday
                  
                  * the day of the last pill
                  gen lastday = fillday + supply - 1
                  format %td lastday
                  tempfile drugs
                  save `drugs'
                  
                  
                  rangejoin eventday fillday lastday using `events', by(patient)
                  list patient Rx_id drug_id fillday lastday event_date event_type , sepby(patient Rx_id)
                  
                  //    NOW MERGE WITH THE ORIGINAL DATA SET
                  merge m:1 patient Rx_id using `drugs', keep(match using)

                  Comment


                  • #10
                    Hi Clyde,

                    Thanks for your response. I tried this code, and it did not merge in the unmatched observations from the original using dataset, so I tried merging in the original using dataset (events) instead of the original master dataset (drugs) instead. I believe this gave me my desired result (the rangejoined data plus unmatched observations from the original datasets). Does this seem flawed to you? If not, thanks for helping me figure it out.

                    Comment


                    • #11
                      Oh, sorry. Yes, merging in the events data set is the correct solution to your problem. I really don't recall what I was thinking when I -merge-d in the drugs data set.

                      Comment


                      • #12
                        No problem, thanks again for your help!

                        Comment

                        Working...
                        X