Overly complicated merging advice

Christopher Barnes

Join Date: Aug 2014

Posts: 6
#1

Overly complicated merging advice

31 Oct 2014, 07:43

Hi, I need to merge two datasets of dissimilar patient test data, and I'm absolutely stuck with this problem with no idea what approach to take.

Here's the situation:

Dataset 1 - patients having test no. 1
Lab numbers (unique), dates/times, patient identifier
- multiple tests per patient

Dataset 2 - patients having test no. 2, all these patients have had test no. 1
Lab numbers (unique, not shared with dataset 1), dates/times, patient identifier
- multiple tests per patient
- many more of test no. 2 than test 1

I need to merge dataset 2 into dataset 1, taking the highest result of test 2 within 48 hours of each obs of test 1, such that I can eventually check the performance of test 2 against test 1. I would be very appreciative with some advice on how to go about doing this.

Thanks in advance,
Chris
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#2

31 Oct 2014, 08:39

So this is not really a merge, because you have many instances of the same patient in each data set, and combining patient with date still does not uniquely identify: you have to select a date within 48 hours. -merge- doesn't do that. -joinby- will enable you to match up every record in dataset1 with each record for the same patients in dataset2. Then you can select the potential matches.

I'm going to assume that your dates/time are Stata datetime variables, not strings, and not simple date variables. I'm also going to assume that you literally mean 48 hours, and not just done on the same date or the day after or the day after that. And I'm also going to assume that "within 48 hours" means within the subsequent 48 hours, not within 48 hours before or after. If these assumptions are not correct, you can modify the code accordingly. I will also assume that while a patient may have had test1 more than once, he/she cannot have had it more than once on any given date. (If that's not true, your problem is ill-posed as there is no way to identify how to assign a test2 that matches both of those. In fact, to be correctly posed, it should be the case that test1 is never done twice on the same patient within 48 hours.)

Code:

use dataset1, clear isid patient_id test_date // ONLY ONE TEST1 PER PATIENT PER DATE rename test_date test_date1 // BRING IN TEST2 DATA joinby patient_id using datset2 rename test_date test_date2 // REDUCE TO WITHIN 48 HOUR MATCHES local 48hrs = 1000*60*60*48 // milliseconds keep if test_date2 - test_date1 <= `48hrs' // NOW SELECT HIGHEST VALUE AS THE UNIQUE MATCH drop if missing(test2_result) by patient_id test_date1 (test_2_result), sort: keep if _n == _N

With the above code you will end up with a data set that excludes any patients in dataset1 for whom no match is found. If you want to bring them in with missing values for the dataset2 variables you can accomplish that by -merge-ing the results on patient_id test_date with dataset1.

This code may fail under certain circumstances. For a patient had test1 on day 1, and had test1 again on day2, and then had test2 on day3, you will end up with test2 on day 3 matched to both instances of test1. That may or may not be what you want. Your description of the required match is consistent with that, but it may not have been a complete description. If you would only want one of those matches, you would need to decide how to select which test1 to match with that test2.
Comment
Mike Barker

Join Date: Apr 2014

Posts: 37
#3

31 Oct 2014, 09:04

I think you probably want to append these data sets, not merge them. When you merge data sets, you are matching the observations between both data sets. Right now, for both data sets, each observation is a patient-test. There is no patient-test in data set 1 that is the same patient-test in data set 2. The matching that you want to do (taking the highest result of test 2 within 48 hours of each obs of test 1), requires information from both data sets. So you can't do that before somehow combining them. In order to merge these data sets, you would have to first change both data sets so that each row is a single patient. Then you could match patient records to patient records.

Instead, just append the two data sets. You're new data set should have an additional variable, testtype, equal to 1 or 2. Create the testtype variable before appending the two data sets.

Then you can:
1. sort the data by patient id and date/time.
2. Identify observations where testtype 2 is within 48 hours of testtype 1
3. Drop testtype 2 observations that are not within 48 hours of testtype 1
4. collapse the data by patient id and testtype, keeping the (max) value of testtype2
5. number each test-pair, so you can either xtset the data or reshape it from long to wide

Probably the most difficult step will be number 2. If you run into trouble, you should post again, showing what the data looks like at that point.

Mike
Comment
Christopher Barnes

Join Date: Aug 2014

Posts: 6
#4

31 Oct 2014, 09:10

Thanks for the help, I have a rough idea what to do now. I was thinking of doing some horrifically complicated affair and then using merge, but using joinby or append would make much more sense.
Comment
Christopher Barnes

Join Date: Aug 2014

Posts: 6
#5

02 Nov 2014, 09:33

Tried it out using joinby and the bit of code Clyde offered, which works brilliantly and does almsot everything I want. The only question I have is that I would like to keep the results in dataset 1, and have the variables from dataset 2 as missing values. How would I go about doing that using the merge command however? Use of multiple data sets I find is a bit confusing for me.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#6

02 Nov 2014, 11:38

To bring back the observations of dataset 1 that were dropped, after running the code posted previously just do:

Code:

clonevar test_date = test_date1 merge 1:1 patient_id test_date using dataset1

This will bring back any dataset1 records that were lost, filling in the dataset2 variables with missing values, and it will leave unchanged those records that were retained.
Comment

Announcement

Overly complicated merging advice

Comment

Comment

Comment

Comment

Comment