Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Joinby to combine two datasets, both of which do not uniquely identify individuals

    I have two datasets that I would like to combine, but it's proving to be quite complicated. The unit of analysis will ultimately be 'assessment taken', but unfortunately there is no unique assessment identifier in either dataset. Instead, the first dataset is organized by youthid, however, youthid does not uniquely identify each row as many youth have taken multiple assessments. Therefore, I've been using both "youthid" and "assessment date" to uniquely identify each observation. I need to merge this dataset to another dataset that has many different outcomes for each of these individuals and choose JUST the outcome that is nearest to, but after, the assessment. Unfortunately, in this using dataset, the identifier for individuals is again youthid, but youthid does not uniquely identify each row since many youth have multiple outcomes. Therefore, I'm using 'youthid' and 'disposition start date' to uniquely identify each row.

    My current plan is to use joinby to make all pairwise matches between youthid-assessment date and disposition start date, then calculate the time differences between assessment date and disposition start date and only keep the smallest time difference that is positive for each youthid-assessment date observation.

    After joinby, something along the lines of:

    Code:
    gen durationassess=dispositionstartdate-assessmentdate
    sort youthid (durationass) /// sort by youthid and by durationassess within youth, sorts from smallest (largest negative) to largest
    egen freq = count(1), by (youthid)
    drop if freq>1 & durationasess<0 /// drop observations in which the disposition date is before the assessment date
    by youthid (durationassess), sort: gen nvals2 = _n==1 /// this tags the first observation in each youthid (first is shortest duration between assessment and disposition)
    drop if nvals2==0 /// only keep the youthid-assessment date joined by disposition date that is closest
    This seems inelegant--is there something simpler? Am I missing something? I have included dataex of the key variables for the joinby in the examples below, but please let me know if it would be helpful to have any additional information or examples.

    dataex of dataset one
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input double youthid float assessmentdatelockeddate
         3107260 20163
    850167563496 20158
    852445987089 21369
    852445987089 21483
    852642606056 21390
    852719779437 21308
    852813799260 21853
    852813799260 21930
    852813799260 22011
    852813799260 22042
    end
    format %td assessmentdatelockeddate

    and, dataex of dataset two
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input double youthid int dispositionstartdate
       6854 19799
       6854 19801
       6854 19870
     272306 19655
     272306 19655
    1006522 19575
    1006522 19627
    3107260 19977
    3107260 20012
    3107260 20012
    end
    format %td dispositionstartdate

  • #2
    Well, your example data is a bit unfortunate in that it doesn't actually contain any potential matches. The only value of youthid that appears in both data sets is 3107260. But all of the dispositionstartdate values are earlier than the assessmentdatelockeddate for that one. So when you put these two example datasets together following your rules, you end up with nothing matched.

    Nevertheless, I believe the following code does what you want, a bit more cleanly.

    Code:
    use dataset1, clear
    rangejoin dispositionstartdate assessmentdatelockeddate . using dataset2, ///
        by(youthid)
    by youthid (dispositionstartdate), sort: keep if _n == 1
    -rangejoin- is written by Robert Picard. To use it you must also install -rangestat-, by Robert Picard, Nick Cox, and Roberto Ferrer. Both are available from SSC.

    The use of -rangejoin- instead of -joinby- brings in only those values of dispositionstartdate that are greater than or equal to assessmentdatelockeddate in the first place, so it is unnecessary to remove the observations where the order is wrong. Then it just comes down to retaining the one closest observation after, which is a single line. If your data set is large, -rangejoin- will also run noticeably faster than -joinby-, and it may prevent you from exceeding available memory as well.

    Comment


    • #3
      Thank you, Clyde! And, oops, you're right. I just randomly grabbed observations, but should have verified there would be at least one match. I'm looking forward to testing out rangejoin. I will update this thread after I do and try to post a better selection of example data as well.

      Comment

      Working...
      X