Duplicates merges due to dates overlap included in the range values chosen in rangejoint

Eduardo Nunez

Join Date: Apr 2014

Posts: 3
#1

Duplicates merges due to dates overlap included in the range values chosen in rangejoint

12 Sep 2023, 12:00

Dear Statalisters,

I am presently working in a project that involves an Electronic Health Records (EHR) dataset, comprising multiple data modules linked via the patient ID (pID). For the sake of this discussion, I will narrow the focus to two key modules: the patient's hospitalization data and the Lab data module. My aim is to merge specific laboratory values from the lab_data module to the patient's hospitalization records. In this context, each patient's hospitalization record is marked by a pID and a corresponding date, while the lab module similarly captures the pID alongside the date of the laboratory test.

To address this task, I employed rangejoin — a tool developed by Robert Picard and accessible via SSC. This facilitates the merging of laboratory data based on a range encompassing the patient's hospital admission date. It is pertinent to mention that both datasets are structured in a long format.

My dilemma centers on circumventing the potential duplication of lab data. This is especially problematic as the date range specified in rangejoin might encapsulate adjacent dates of hospital admissions. It's my understanding that rangejoin doesn't natively support the if command. Consequently, I'm eager to understand if the merging process, within the defined range, would cease if the upper date range encompasses the date of a subsequent row, or conversely, if the lower date range comprises the date of the antecedent row.

Appended below are truncated versions of both datasets, complemented by my scripting. Notably, my script incorporates SSC community-provided packages - rangejoin — a tool developed by Robert Picard, and - datacheck— (credited to Krishnan Bhaskaran).

* Dataset 1 – Patient's hospitalizations
* Example generated by -dataex-. For more info, type help dataex
clear
input long pID byte visit int(fecha1 time_d1 time_d2)
73709 1 21782 0 7
73709 2 21789 7 8
73709 3 21797 15 8
73709 4 21805 23 .
74442 1 22539 0 14
74442 2 22553 14 70
74442 3 22623 84 91
74442 4 22714 175 119
74442 5 22833 294 271
74442 6 23104 565 .
75124 1 21963 0 8
75124 2 21971 8 7
75124 3 21978 15 27
75124 4 22005 42 91
75124 5 22096 133 .
76664 1 22371 0 5
76664 2 22376 5 7
76664 3 22383 12 21
76664 4 22404 33 70
76664 5 22474 103 23
76664 6 22497 126 103
76664 7 22600 229 .
77099 1 22110 0 3
77099 2 22113 3 20
77099 3 22133 23 5
77099 4 22138 28 27
77099 5 22165 55 38
77099 6 22203 93 7
77099 7 22210 100 26
77099 8 22236 126 3
77099 9 22239 129 4
77099 10 22243 133 7
77099 11 22250 140 .
87535 1 21691 0 29
87535 2 21720 29 35
87535 3 21755 64 35
87535 4 21790 99 91
87535 5 21881 190 124
87535 6 22005 314 107
87535 7 22112 421 350
87535 8 22462 771 .
89364 1 22449 0 10
89364 2 22459 10 67
89364 3 22526 77 73
89364 4 22599 150 .
99934 1 21797 0 12
99934 2 21809 12 7
99934 3 21816 19 14
99934 4 21830 33 21
99934 5 21851 54 49
99934 6 21900 103 .
109747 1 22098 0 21
109747 2 22119 21 21
109747 3 22140 42 154
109747 4 22294 196 204
109747 5 22498 400 .
end
format %td fecha1
save file1, replace

* Dataset 2 – Patient's Labs
* Example generated by -dataex-. For more info, type help dataex
clear
input long pID float lab_date int lab_time_d double ca125_lab
73709 19550 0 24
73709 19561 11 31
73709 19856 306 13
73709 20100 550 18
73709 20149 599 17
73709 20268 718 16
73709 20634 1084 23
73709 20828 1278 21
73709 20921 1371 18
73709 21098 1548 92
73709 21285 1735 14
73709 21510 1960 15
73709 21773 2223 39
73709 21782 2232 28
73709 21784 2234 25
74442 20486 0 7
74442 22083 1597 6
74442 22532 2046 46
74442 22551 2065 12
74442 22574 2088 9
74442 22614 2128 7
74442 22707 2221 7
74442 22826 2340 8
75124 21864 0 7
75124 21957 93 11
75124 21971 107 8
75124 22096 232 7
75124 22781 917 8
75124 22963 1099 8
76664 21535 0 20
76664 22001 466 31
76664 22193 658 20
76664 22371 836 61
76664 22376 841 61
76664 22379 844 59
76664 22396 861 32
76664 22466 931 19
76664 22473 938 22
76664 22494 959 24
76664 22538 1003 52
76664 22599 1064 31
76664 22683 1148 30
76664 22699 1164 31
76664 22705 1170 38
76664 22733 1198 77
76664 22761 1226 85
77099 22033 0 33
77099 22035 2 35
77099 22053 20 33
77099 22104 71 29
77099 22113 80 30
77099 22132 99 31
77099 22159 126 22
77099 22202 169 14
77099 22232 199 11
77099 22239 206 13
77099 22246 213 13
77099 22295 262 11
77099 22351 318 11
77099 22361 328 11
87535 21616 0 77
87535 21663 47 28
87535 21685 69 16
87535 21717 101 14
87535 21748 132 11
87535 21783 167 9
87535 21878 262 11
87535 22005 389 10
87535 22102 486 10
87535 22118 502 11
87535 22124 508 12
87535 22174 558 12
87535 22216 600 10
87535 22448 832 11
87535 22483 867 12
87535 22516 900 12
87535 22613 997 12
87535 22792 1176 11
87535 22942 1326 13
87535 22953 1337 14
87535 22970 1354 13
89364 19521 0 9
89364 20256 735 13
89364 20264 743 14
89364 20858 1337 25
89364 21020 1499 12
89364 21208 1687 14
89364 21364 1843 12
89364 21747 2226 12
89364 22111 2590 16
89364 22231 2710 14
89364 22442 2921 31
89364 22449 2928 23
89364 22453 2932 20
89364 22487 2966 14
89364 22522 3001 15
89364 22599 3078 16
89364 22650 3129 17
89364 22771 3250 10
89364 22939 3418 15
end
format %td lab_date
rename lab_date fecha1
save file2, replace

// Merging Lab values from dataset B to dataset A using rangejoin ++++++++++++++++++++++++++++++++++++++++++++++++++ +++
use file1, clear

// Creating range dates ++++++++++++++++++++++++++++++++++++++++++++++++++ +++++++++
gen int date1 = fecha1 - 3
gen int date2 = fecha1 + 3
format %td date1 date2
sum date1 date2

sort pID fecha1
list pID visit fecha1 date1 date2, noobs sepby(pID) ab(20)

// Merging Lab values - using rangejoin ++++++++++++++++++++++++++++++++++++++++++++++++++ +++++++++++++++++++++++++
count // n = 56
rangejoin fecha1 date1 date2 using file2, by(pID) keepusing(ca125_lab)
count // n = 58

sort pID fecha1
list pID visit fecha1 date1 date2 fecha1_U time_d1 time_d2 ca125_lab, noobs sepby(pID) ab(20)

The list shows that pID # 77099 on visits # 8, 9, and 10 had the same lab values because the range chosen in rangejoint include the hospitalization dates for the following admission dates.

// Creating difference in dates(fecha1 fecha1_U) ++++++++++++++++++++++++++++++++++++++++++++++++++ +++++++++++++++++++++++++
gen fecha1_gap = fecha1_U - fecha1 if !missing(fecha1, fecha1_U)
label var fecha1_gap "fecha1_U - fecha1"
order fecha1_gap, after(fecha1_U)

tab fecha1_gap, m
/*
fecha1_U - |
fecha1 | Freq. Percent Cum.
------------+-----------------------------------
-3 | 3 5.17 5.17
-2 | 1 1.72 6.90
-1 | 4 6.90 13.79
0 | 10 17.24 31.03
2 | 1 1.72 32.76
3 | 4 6.90 39.66
. | 35 60.34 100.00
------------+-----------------------------------
Total | 58 100.00
*/

sort pID fecha1
list pID visit fecha1 fecha1_U fecha1_gap, noobs sepby(pID) ab(30)

// Checking for duplicates ++++++++++++++++++++++++++++++++++++++++++++++++++ +++
duplicates list pID visit, sepby(pID visit)

duplicates tag pID visit, gen(dups)
tab dups, m

sort pID fecha1
datacheck dups == 0, by(pID fecha1) varshow(pID fecha1 fecha1_U fecha1_gap ca125_lab) abbreviate(25) noobs flag
sort pID fecha1
list pID fecha1 fecha1_U fecha1_gap ca125_lab if _contra == 1, noobs sepby(pID fecha1) ab(25)

egen new_pid = group(pID visit)

sort pID fecha1
list new_pid pID fecha1 fecha1_U fecha1_gap ca125_lab if _contra == 1, noobs sepby(pID fecha1) ab(25)

bys new_pid (fecha1_U): gen byte new_visit = _n

sort pID new_visit
list new_pid pID new_visit fecha1 fecha1_U fecha1_gap ca125_lab if _contra == 1, noobs sepby(pID fecha1) ab(25)

sort pID fecha1 fecha1_U
list new_pid pID new_visit fecha1 fecha1_U fecha1_gap ca125_lab, noobs sepby(pID fecha1) ab(25)

tab new_visit, m

foreach varname of varlist fecha1_U ca125_lab {
bys new_pid (fecha1_U): gen `varname'_2 = `varname' if (_contra == 1 & new_visit == 2)
bys new_pid (fecha1_U): replace `varname'_2 = `varname'_2[2] if _contra == 1
}

format fecha1_U_2 %td
sort pID fecha1 fecha1_U
list new_pid pID new_visit visit fecha1 fecha1_U fecha1_U_2 ca125_lab ca125_lab_2 if _contra == 1, noobs sepby(new_pid) ab(25)

// Deleting duplicated rows ++++++++++++++++++++++++++++++++++++++++++++++++++ +++
bys new_pid (fecha1_U): drop if (new_visit != 1 & _contra == 1)

count // n = 56

sort pID fecha1
list pID visit fecha1 fecha1_U fecha1_U_2 ca125_lab ca125_lab_2, noobs sepby(pID) ab(25)

I deeply appreciate any guidance or insights you can provide.

Eduardo Nunez, MD, MPH
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30164
#2

12 Sep 2023, 13:26

-rangejoin- does not know that you don't want the same lab to be attributed to two different hospitalizations that both fall within 3 days of the lab. And there is no way, in -rangejoin- to specify that restriction. So, yes, you will be getting duplicates of some of the labs. If the same lab result falls within three days of two (or more) hospitalizations, it will appear associated with both (all) of those.

If that is not what you want, then you have to come up with some rule for which hospitalization to associate the lab value with. Normally, when I am combining lab data with hospitalization data, the hospitalization data set has both an admission and discharge date, and I -rangejoin- those labs that fall between those two dates. But you have only one date (I can't tell if it's the admission date, the discharge date, or something else from what you've shown). So you need a rule. Associate to the first admission? The last? Pick the one with the closest date (which might have ties and then you need a rule for breaking the ties)? Pick one at random? Once you have settled on a rule, it's a matter of writing a -keep- or -drop- command that applies the rule after -rangejoin- has run.
1 like
Comment
Eduardo Nunez

Join Date: Apr 2014

Posts: 3
#3

12 Sep 2023, 16:44

The date listed as fecha1 is admission date. I also have discharge date for each patient. Following your advice, I will establish the range using these two dates.
Thank you Clyde.
Comment

Announcement

Duplicates merges due to dates overlap included in the range values chosen in rangejoint

Comment

Comment