Joinby to combine two datasets, both of which do not uniquely identify individuals

Jacki STrenio

Join Date: Oct 2017

Posts: 3
#1

Joinby to combine two datasets, both of which do not uniquely identify individuals

26 Apr 2021, 16:38

I have two datasets that I would like to combine, but it's proving to be quite complicated. The unit of analysis will ultimately be 'assessment taken', but unfortunately there is no unique assessment identifier in either dataset. Instead, the first dataset is organized by youthid, however, youthid does not uniquely identify each row as many youth have taken multiple assessments. Therefore, I've been using both "youthid" and "assessment date" to uniquely identify each observation. I need to merge this dataset to another dataset that has many different outcomes for each of these individuals and choose JUST the outcome that is nearest to, but after, the assessment. Unfortunately, in this using dataset, the identifier for individuals is again youthid, but youthid does not uniquely identify each row since many youth have multiple outcomes. Therefore, I'm using 'youthid' and 'disposition start date' to uniquely identify each row.

My current plan is to use joinby to make all pairwise matches between youthid-assessment date and disposition start date, then calculate the time differences between assessment date and disposition start date and only keep the smallest time difference that is positive for each youthid-assessment date observation.

After joinby, something along the lines of:

Code:

gen durationassess=dispositionstartdate-assessmentdate sort youthid (durationass) /// sort by youthid and by durationassess within youth, sorts from smallest (largest negative) to largest egen freq = count(1), by (youthid) drop if freq>1 & durationasess<0 /// drop observations in which the disposition date is before the assessment date by youthid (durationassess), sort: gen nvals2 = _n==1 /// this tags the first observation in each youthid (first is shortest duration between assessment and disposition) drop if nvals2==0 /// only keep the youthid-assessment date joined by disposition date that is closest

This seems inelegant--is there something simpler? Am I missing something? I have included dataex of the key variables for the joinby in the examples below, but please let me know if it would be helpful to have any additional information or examples.

dataex of dataset one

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input double youthid float assessmentdatelockeddate 3107260 20163 850167563496 20158 852445987089 21369 852445987089 21483 852642606056 21390 852719779437 21308 852813799260 21853 852813799260 21930 852813799260 22011 852813799260 22042 end format %td assessmentdatelockeddate

and, dataex of dataset two

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input double youthid int dispositionstartdate 6854 19799 6854 19801 6854 19870 272306 19655 272306 19655 1006522 19575 1006522 19627 3107260 19977 3107260 20012 3107260 20012 end format %td dispositionstartdate
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30050
#2

26 Apr 2021, 19:59

Well, your example data is a bit unfortunate in that it doesn't actually contain any potential matches. The only value of youthid that appears in both data sets is 3107260. But all of the dispositionstartdate values are earlier than the assessmentdatelockeddate for that one. So when you put these two example datasets together following your rules, you end up with nothing matched.

Nevertheless, I believe the following code does what you want, a bit more cleanly.

Code:

use dataset1, clear rangejoin dispositionstartdate assessmentdatelockeddate . using dataset2, /// by(youthid) by youthid (dispositionstartdate), sort: keep if _n == 1

-rangejoin- is written by Robert Picard. To use it you must also install -rangestat-, by Robert Picard, Nick Cox, and Roberto Ferrer. Both are available from SSC.

The use of -rangejoin- instead of -joinby- brings in only those values of dispositionstartdate that are greater than or equal to assessmentdatelockeddate in the first place, so it is unnecessary to remove the observations where the order is wrong. Then it just comes down to retaining the one closest observation after, which is a single line. If your data set is large, -rangejoin- will also run noticeably faster than -joinby-, and it may prevent you from exceeding available memory as well.
2 likes
Comment
Jacki STrenio

Join Date: Oct 2017

Posts: 3
#3

27 Apr 2021, 12:07

Thank you, Clyde! And, oops, you're right. I just randomly grabbed observations, but should have verified there would be at least one match. I'm looking forward to testing out rangejoin. I will update this thread after I do and try to post a better selection of example data as well.
Comment

Announcement

Joinby to combine two datasets, both of which do not uniquely identify individuals

Comment

Comment