Merge (/ joinby / rangejoin) on most recent date

Chris Sampson

Join Date: Jun 2014
Posts: 3

Merge (/ joinby / rangejoin) on most recent date

14 May 2017, 01:54

Hi folks

Apologies if answered elsewhere, but I couldn't find it.

I have two sets of data (examples below). Both have multiple observations per person, and each observation is associated with a date and some other information. I'd like to merge from my using data on the basis of most recent past date (with respect to the data in memory).

For context, these are health service use data. I am trying to determine the activity in the first dataset according to the most recent 'outcome' that's recorded in the second dataset.

So...

The first:

ID	date	Var1	Var2
1	16feb2007	4	346
1	26nov2007	34	34
1	02apr2007	78	346
1	29oct2008	325	8
2	03jan2008	13	57
3	23apr2008	26	586
3	30apr2008	3	1
4	25jun2008	6	976
5	31mar2008	23	5

The second:

ID	date	outcome
1	04jan2007	0
1	21may2007	1
1	01sep2008	0
2	01jan2008	1
2	10feb2009	0
3	04may2008	0
4	17may2007	0
4	16jan2009	1
5	19feb2008	0

What I'm hoping to achieve:

ID	date	Var1	Var2	date_outcome	outcome
1	16feb2007	4	346	04jan2007	0
1	26nov2007	34	34	21may2007	1
1	02apr2007	78	346	04jan2007	0
1	29oct2008	325	8	01sep2008	0
2	03jan2008	13	57	01jan2008	1
3	23apr2008	26	586	.	.
3	30apr2008	3	1	.	.
4	25jun2008	6	976	17may2007	0
5	31mar2008	23	5	19feb2008	0

I'm sure there's a straightforward solution, but I just can't get my head around it!

Note that this is a large dataset (>1million obs) with many observations per person, so a solution in which the data had to be converted to wide format would not be ideal.

Thanks in advance.

Tags: data, date, Joinby, merge, rangejoin

Mike Lacy

Join Date: Apr 2014

Posts: 2425
#2

14 May 2017, 08:55

1) To make it easy for people to help you, please post your example data using -dataex-, per item 12.2 of the FAQ.
2) I don't understand what you mean by "determine" in "I am trying to determine the activity in the first dataset according to the most recent 'outcome' that's recorded in the second dataset." So, I am taking you to mean "I want to merge variables onto the observations in the first data set using observations from the second data set that match the first on ID, and for which the outcome date is the largest date that is less than or equal to the date in the first data set."
3) You did not mention what to do if there are ties in the outcome data set, so I'm presuming you know that they cannot happen.
4) I presume some observations in your "first" data set might not have outcome information. I presume you want to keep them..

On that understanding, -joinby- will pair up all the observations that match on ID. Then, you can drop the irrelevant observations, and keep only the most recent one among the relevant paired observations within ID.

So, give this a try:

Code:

clear input ID str9 datestring outcome 1 04jan2007 0 1 21may2007 1 1 01sep2008 0 2 01jan2008 1 2 10feb2009 0 3 04may2008 0 4 17may2007 0 4 16jan2009 1 5 19feb2008 0 end gen date_outcome = date(datestring, "DMY") drop datestring tempfile outcome save `outcome' // clear // first data set input ID str9 datestring Var1 Var2 1 16feb2007 4 346 1 26nov2007 34 34 1 02apr2007 78 346 1 29oct2008 325 8 2 03jan2008 13 57 3 23apr2008 26 586 3 30apr2008 3 1 4 25jun2008 6 976 5 31mar2008 23 5 6 01May2010 20 16 end gen date = date(datestring, "DMY") drop datestring // // Example data now ready, do the real work // Form all possible pairs, matched on ID joinby ID using `outcome', unmatched(master) drop if (date_outcome > date) & !missing(date_outcome) // Within ID, keep observation with latest date of outcome bysort ID date (date_outcome): keep if (_n == _N) // format %td date* list ID date Var1 Var2 date_outcome outcome
1 like
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

14 May 2017, 17:14

Here's an alternative approach that appends the two datasets and then sorts the observations by date within ID groups. Values from the outcome dataset are carried forward when outcome is missing (when the observations come from the master dataset).

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte ID str9 date byte outcome
1 "04jan2007" 0
1 "21may2007" 1
1 "01sep2008" 0
2 "01jan2008" 1
2 "10feb2009" 0
3 "04may2008" 0
4 "17may2007" 0
4 "16jan2009" 1
5 "19feb2008" 0
end
save "statalist_outcome.dta", replace

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte ID str9 date int(Var1 Var2) float order
1 "16feb2007"   4 346 1
1 "26nov2007"  34  34 2
1 "02apr2007"  78 346 3
1 "29oct2008" 325   8 4
2 "03jan2008"  13  57 5
3 "23apr2008"  26 586 6
3 "30apr2008"   3   1 7
4 "25jun2008"   6 976 8
5 "31mar2008"  23   5 9
end

* combine the master with the outcome dataset
append using "statalist_outcome.dta"
gen master = !mi(order)

* convert to Stata numeric dates, define outcome date if the obs is not from the master
gen ndate = date(date, "DMY")
gen date_outcome = ndate if master == 0
format %td ndate date_outcome

* order by date within ID groups and carry forward the outcome
sort ID ndate master
by ID: replace date_outcome = date_outcome[_n-1] if mi(outcome)
by ID: replace outcome = outcome[_n-1] if mi(outcome)

* return to the original observations in memory
keep if master
sort ID order
list, sepby(ID)

and the results:

Code:

. list, sepby(ID)

     +---------------------------------------------------------------------------------+
     | ID        date   Var1   Var2   order   outcome   master       ndate   date_ou~e |
     |---------------------------------------------------------------------------------|
  1. |  1   16feb2007      4    346       1         0        1   16feb2007   04jan2007 |
  2. |  1   26nov2007     34     34       2         1        1   26nov2007   21may2007 |
  3. |  1   02apr2007     78    346       3         0        1   02apr2007   04jan2007 |
  4. |  1   29oct2008    325      8       4         0        1   29oct2008   01sep2008 |
     |---------------------------------------------------------------------------------|
  5. |  2   03jan2008     13     57       5         1        1   03jan2008   01jan2008 |
     |---------------------------------------------------------------------------------|
  6. |  3   23apr2008     26    586       6         .        1   23apr2008           . |
  7. |  3   30apr2008      3      1       7         .        1   30apr2008           . |
     |---------------------------------------------------------------------------------|
  8. |  4   25jun2008      6    976       8         0        1   25jun2008   17may2007 |
     |---------------------------------------------------------------------------------|
  9. |  5   31mar2008     23      5       9         0        1   31mar2008   19feb2008 |
     +---------------------------------------------------------------------------------+

Comment

Mike Lacy

Join Date: Apr 2014

Posts: 2425
#4

15 May 2017, 07:39

Considering the size of Chris's file, Robert's approach, which avoids creating a very large file of pairs, is likely preferable.
Comment
Chris Sampson

Join Date: Jun 2014

Posts: 3
#5

17 May 2017, 11:10

Mike Lacy Robert Picard - thanks very much for your fantastic responses.

Apologies for not adhering to 12.2. I will do better in future! You both interpreted my ramblings perfectly.

It seems that both methods achieve the desired result. Thank you!
Comment

Jae Li

Join Date: May 2017
Posts: 184

12 May 2019, 09:32

@Mike Lacy Hi Mike, can I ask you a question about your post #2? What if "I want to merge variables onto the observations in the first quarterly data set using observations from the second annual data set that match the first on ID based on most recent observations of fiscal-year-end (fym)"? For example, for fym is 1986m5, the most recent historical observation for ym is at 1986m3. Do you know how to merge that way?

The following is an example:

quarter	fyear	permno	gvkey	tmi1	ym	wtio	und_inv	fym
1987q1	1986	10003	16950	0	1987m3	0.000885		1986m5
1987q2	1986	10003	16950	0	1987m6	0.000351		1986m5
1987q3	1986	10003	16950	0	1987m9	0.00029		1986m5
1987q4	1986	10003	16950	0	1987m12	0.00031		1986m5
1994q1	1994	10003	16950	0	1994m3	2.76E-07		1994m12
1994q2	1994	10003	16950	0	1994m6	5.20E-06		1994m12
1994q3	1994	10003	16950	0	1994m9	4.26E-06		1994m12
1994q4	1994	10003	16950	0	1994m12	3.09E-07		1994m12

How would you adjust your code to work for both occasions? Many thanks for your help in advance!

Last edited by Jae Li; 12 May 2019, 09:49.

Announcement

Merge (/ joinby / rangejoin) on most recent date

Comment

Comment

Comment

Comment

Comment