I have two datasets that I would like to combine, but it's proving to be quite complicated. The unit of analysis will ultimately be 'assessment taken', but unfortunately there is no unique assessment identifier in either dataset. Instead, the first dataset is organized by youthid, however, youthid does not uniquely identify each row as many youth have taken multiple assessments. Therefore, I've been using both "youthid" and "assessment date" to uniquely identify each observation. I need to merge this dataset to another dataset that has many different outcomes for each of these individuals and choose JUST the outcome that is nearest to, but after, the assessment. Unfortunately, in this using dataset, the identifier for individuals is again youthid, but youthid does not uniquely identify each row since many youth have multiple outcomes. Therefore, I'm using 'youthid' and 'disposition start date' to uniquely identify each row.
My current plan is to use joinby to make all pairwise matches between youthid-assessment date and disposition start date, then calculate the time differences between assessment date and disposition start date and only keep the smallest time difference that is positive for each youthid-assessment date observation.
After joinby, something along the lines of:
This seems inelegant--is there something simpler? Am I missing something? I have included dataex of the key variables for the joinby in the examples below, but please let me know if it would be helpful to have any additional information or examples.
dataex of dataset one
and, dataex of dataset two
My current plan is to use joinby to make all pairwise matches between youthid-assessment date and disposition start date, then calculate the time differences between assessment date and disposition start date and only keep the smallest time difference that is positive for each youthid-assessment date observation.
After joinby, something along the lines of:
Code:
gen durationassess=dispositionstartdate-assessmentdate sort youthid (durationass) /// sort by youthid and by durationassess within youth, sorts from smallest (largest negative) to largest egen freq = count(1), by (youthid) drop if freq>1 & durationasess<0 /// drop observations in which the disposition date is before the assessment date by youthid (durationassess), sort: gen nvals2 = _n==1 /// this tags the first observation in each youthid (first is shortest duration between assessment and disposition) drop if nvals2==0 /// only keep the youthid-assessment date joined by disposition date that is closest
dataex of dataset one
Code:
* Example generated by -dataex-. To install: ssc install dataex clear input double youthid float assessmentdatelockeddate 3107260 20163 850167563496 20158 852445987089 21369 852445987089 21483 852642606056 21390 852719779437 21308 852813799260 21853 852813799260 21930 852813799260 22011 852813799260 22042 end format %td assessmentdatelockeddate
and, dataex of dataset two
Code:
* Example generated by -dataex-. To install: ssc install dataex clear input double youthid int dispositionstartdate 6854 19799 6854 19801 6854 19870 272306 19655 272306 19655 1006522 19575 1006522 19627 3107260 19977 3107260 20012 3107260 20012 end format %td dispositionstartdate
Comment