Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merge (/ joinby / rangejoin) on most recent date

    Hi folks

    Apologies if answered elsewhere, but I couldn't find it.

    I have two sets of data (examples below). Both have multiple observations per person, and each observation is associated with a date and some other information. I'd like to merge from my using data on the basis of most recent past date (with respect to the data in memory).

    For context, these are health service use data. I am trying to determine the activity in the first dataset according to the most recent 'outcome' that's recorded in the second dataset.

    So...


    The first:
    ID date Var1 Var2
    1 16feb2007 4 346
    1 26nov2007 34 34
    1 02apr2007 78 346
    1 29oct2008 325 8
    2 03jan2008 13 57
    3 23apr2008 26 586
    3 30apr2008 3 1
    4 25jun2008 6 976
    5 31mar2008 23 5

    The second:
    ID date outcome
    1 04jan2007 0
    1 21may2007 1
    1 01sep2008 0
    2 01jan2008 1
    2 10feb2009 0
    3 04may2008 0
    4 17may2007 0
    4 16jan2009 1
    5 19feb2008 0

    What I'm hoping to achieve:
    ID date Var1 Var2 date_outcome outcome
    1 16feb2007 4 346 04jan2007 0
    1 26nov2007 34 34 21may2007 1
    1 02apr2007 78 346 04jan2007 0
    1 29oct2008 325 8 01sep2008 0
    2 03jan2008 13 57 01jan2008 1
    3 23apr2008 26 586 . .
    3 30apr2008 3 1 . .
    4 25jun2008 6 976 17may2007 0
    5 31mar2008 23 5 19feb2008 0

    I'm sure there's a straightforward solution, but I just can't get my head around it!

    Note that this is a large dataset (>1million obs) with many observations per person, so a solution in which the data had to be converted to wide format would not be ideal.

    Thanks in advance.


  • #2
    1) To make it easy for people to help you, please post your example data using -dataex-, per item 12.2 of the FAQ.
    2) I don't understand what you mean by "determine" in "I am trying to determine the activity in the first dataset according to the most recent 'outcome' that's recorded in the second dataset." So, I am taking you to mean "I want to merge variables onto the observations in the first data set using observations from the second data set that match the first on ID, and for which the outcome date is the largest date that is less than or equal to the date in the first data set."
    3) You did not mention what to do if there are ties in the outcome data set, so I'm presuming you know that they cannot happen.
    4) I presume some observations in your "first" data set might not have outcome information. I presume you want to keep them..

    On that understanding, -joinby- will pair up all the observations that match on ID. Then, you can drop the irrelevant observations, and keep only the most recent one among the relevant paired observations within ID.

    So, give this a try:

    Code:
    clear
    input ID str9 datestring outcome
    1 04jan2007 0
    1 21may2007 1
    1 01sep2008 0
    2 01jan2008 1
    2 10feb2009 0
    3 04may2008 0
    4 17may2007 0
    4 16jan2009 1
    5 19feb2008 0
    end
    gen date_outcome = date(datestring, "DMY")
    drop datestring
    tempfile outcome
    save `outcome'
    //
    clear
    // first data set
    input ID str9 datestring Var1 Var2
    1 16feb2007 4 346
    1 26nov2007 34 34
    1 02apr2007 78 346
    1 29oct2008 325 8
    2 03jan2008 13 57
    3 23apr2008 26 586
    3 30apr2008 3 1
    4 25jun2008 6 976
    5 31mar2008 23 5
    6 01May2010 20 16   
    end
    gen date = date(datestring, "DMY")
    drop datestring
    //
    // Example data now ready, do the real work
    // Form all possible pairs, matched on ID
    joinby ID using `outcome', unmatched(master)
    
    drop if (date_outcome > date) & !missing(date_outcome)  
    // Within ID, keep observation with latest date of outcome
    bysort ID date (date_outcome): keep if (_n == _N)
    //
    format %td date*
    list ID date Var1 Var2 date_outcome outcome


    Comment


    • #3
      Here's an alternative approach that appends the two datasets and then sorts the observations by date within ID groups. Values from the outcome dataset are carried forward when outcome is missing (when the observations come from the master dataset).

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input byte ID str9 date byte outcome
      1 "04jan2007" 0
      1 "21may2007" 1
      1 "01sep2008" 0
      2 "01jan2008" 1
      2 "10feb2009" 0
      3 "04may2008" 0
      4 "17may2007" 0
      4 "16jan2009" 1
      5 "19feb2008" 0
      end
      save "statalist_outcome.dta", replace
      
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input byte ID str9 date int(Var1 Var2) float order
      1 "16feb2007"   4 346 1
      1 "26nov2007"  34  34 2
      1 "02apr2007"  78 346 3
      1 "29oct2008" 325   8 4
      2 "03jan2008"  13  57 5
      3 "23apr2008"  26 586 6
      3 "30apr2008"   3   1 7
      4 "25jun2008"   6 976 8
      5 "31mar2008"  23   5 9
      end
      
      * combine the master with the outcome dataset
      append using "statalist_outcome.dta"
      gen master = !mi(order)
      
      * convert to Stata numeric dates, define outcome date if the obs is not from the master
      gen ndate = date(date, "DMY")
      gen date_outcome = ndate if master == 0
      format %td ndate date_outcome
      
      * order by date within ID groups and carry forward the outcome
      sort ID ndate master
      by ID: replace date_outcome = date_outcome[_n-1] if mi(outcome)
      by ID: replace outcome = outcome[_n-1] if mi(outcome)
      
      * return to the original observations in memory
      keep if master
      sort ID order
      list, sepby(ID)
      and the results:
      Code:
      . list, sepby(ID)
      
           +---------------------------------------------------------------------------------+
           | ID        date   Var1   Var2   order   outcome   master       ndate   date_ou~e |
           |---------------------------------------------------------------------------------|
        1. |  1   16feb2007      4    346       1         0        1   16feb2007   04jan2007 |
        2. |  1   26nov2007     34     34       2         1        1   26nov2007   21may2007 |
        3. |  1   02apr2007     78    346       3         0        1   02apr2007   04jan2007 |
        4. |  1   29oct2008    325      8       4         0        1   29oct2008   01sep2008 |
           |---------------------------------------------------------------------------------|
        5. |  2   03jan2008     13     57       5         1        1   03jan2008   01jan2008 |
           |---------------------------------------------------------------------------------|
        6. |  3   23apr2008     26    586       6         .        1   23apr2008           . |
        7. |  3   30apr2008      3      1       7         .        1   30apr2008           . |
           |---------------------------------------------------------------------------------|
        8. |  4   25jun2008      6    976       8         0        1   25jun2008   17may2007 |
           |---------------------------------------------------------------------------------|
        9. |  5   31mar2008     23      5       9         0        1   31mar2008   19feb2008 |
           +---------------------------------------------------------------------------------+

      Comment


      • #4
        Considering the size of Chris's file, Robert's approach, which avoids creating a very large file of pairs, is likely preferable.

        Comment


        • #5
          Mike Lacy Robert Picard - thanks very much for your fantastic responses.

          Apologies for not adhering to 12.2. I will do better in future! You both interpreted my ramblings perfectly.

          It seems that both methods achieve the desired result. Thank you!

          Comment


          • #6
            @Mike Lacy Hi Mike, can I ask you a question about your post #2? What if "I want to merge variables onto the observations in the first quarterly data set using observations from the second annual data set that match the first on ID based on most recent observations of fiscal-year-end (fym)"? For example, for fym is 1986m5, the most recent historical observation for ym is at 1986m3. Do you know how to merge that way?

            The following is an example:
            quarter fyear permno gvkey tmi1 ym wtio und_inv fym
            1987q1 1986 10003 16950 0 1987m3 0.000885 1986m5
            1987q2 1986 10003 16950 0 1987m6 0.000351 1986m5
            1987q3 1986 10003 16950 0 1987m9 0.00029 1986m5
            1987q4 1986 10003 16950 0 1987m12 0.00031 1986m5
            1994q1 1994 10003 16950 0 1994m3 2.76E-07 1994m12
            1994q2 1994 10003 16950 0 1994m6 5.20E-06 1994m12
            1994q3 1994 10003 16950 0 1994m9 4.26E-06 1994m12
            1994q4 1994 10003 16950 0 1994m12 3.09E-07 1994m12
            How would you adjust your code to work for both occasions? Many thanks for your help in advance!
            Last edited by Jae Li; 12 May 2019, 09:49.

            Comment

            Working...
            X