Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Duplicates merges due to dates overlap included in the range values chosen in rangejoint

    Dear Statalisters,

    I am presently working in a project that involves an Electronic Health Records (EHR) dataset, comprising multiple data modules linked via the patient ID (pID). For the sake of this discussion, I will narrow the focus to two key modules: the patient's hospitalization data and the Lab data module. My aim is to merge specific laboratory values from the lab_data module to the patient's hospitalization records. In this context, each patient's hospitalization record is marked by a pID and a corresponding date, while the lab module similarly captures the pID alongside the date of the laboratory test.

    To address this task, I employed rangejoin — a tool developed by Robert Picard and accessible via SSC. This facilitates the merging of laboratory data based on a range encompassing the patient's hospital admission date. It is pertinent to mention that both datasets are structured in a long format.

    My dilemma centers on circumventing the potential duplication of lab data. This is especially problematic as the date range specified in rangejoin might encapsulate adjacent dates of hospital admissions. It's my understanding that rangejoin doesn't natively support the if command. Consequently, I'm eager to understand if the merging process, within the defined range, would cease if the upper date range encompasses the date of a subsequent row, or conversely, if the lower date range comprises the date of the antecedent row.

    Appended below are truncated versions of both datasets, complemented by my scripting. Notably, my script incorporates SSC community-provided packages - rangejoin — a tool developed by Robert Picard, and - datacheck— (credited to Krishnan Bhaskaran).



    * Dataset 1 – Patient's hospitalizations
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long pID byte visit int(fecha1 time_d1 time_d2)
    73709 1 21782 0 7
    73709 2 21789 7 8
    73709 3 21797 15 8
    73709 4 21805 23 .
    74442 1 22539 0 14
    74442 2 22553 14 70
    74442 3 22623 84 91
    74442 4 22714 175 119
    74442 5 22833 294 271
    74442 6 23104 565 .
    75124 1 21963 0 8
    75124 2 21971 8 7
    75124 3 21978 15 27
    75124 4 22005 42 91
    75124 5 22096 133 .
    76664 1 22371 0 5
    76664 2 22376 5 7
    76664 3 22383 12 21
    76664 4 22404 33 70
    76664 5 22474 103 23
    76664 6 22497 126 103
    76664 7 22600 229 .
    77099 1 22110 0 3
    77099 2 22113 3 20
    77099 3 22133 23 5
    77099 4 22138 28 27
    77099 5 22165 55 38
    77099 6 22203 93 7
    77099 7 22210 100 26
    77099 8 22236 126 3
    77099 9 22239 129 4
    77099 10 22243 133 7
    77099 11 22250 140 .
    87535 1 21691 0 29
    87535 2 21720 29 35
    87535 3 21755 64 35
    87535 4 21790 99 91
    87535 5 21881 190 124
    87535 6 22005 314 107
    87535 7 22112 421 350
    87535 8 22462 771 .
    89364 1 22449 0 10
    89364 2 22459 10 67
    89364 3 22526 77 73
    89364 4 22599 150 .
    99934 1 21797 0 12
    99934 2 21809 12 7
    99934 3 21816 19 14
    99934 4 21830 33 21
    99934 5 21851 54 49
    99934 6 21900 103 .
    109747 1 22098 0 21
    109747 2 22119 21 21
    109747 3 22140 42 154
    109747 4 22294 196 204
    109747 5 22498 400 .
    end
    format %td fecha1
    save file1, replace


    * Dataset 2 – Patient's Labs
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long pID float lab_date int lab_time_d double ca125_lab
    73709 19550 0 24
    73709 19561 11 31
    73709 19856 306 13
    73709 20100 550 18
    73709 20149 599 17
    73709 20268 718 16
    73709 20634 1084 23
    73709 20828 1278 21
    73709 20921 1371 18
    73709 21098 1548 92
    73709 21285 1735 14
    73709 21510 1960 15
    73709 21773 2223 39
    73709 21782 2232 28
    73709 21784 2234 25
    74442 20486 0 7
    74442 22083 1597 6
    74442 22532 2046 46
    74442 22551 2065 12
    74442 22574 2088 9
    74442 22614 2128 7
    74442 22707 2221 7
    74442 22826 2340 8
    75124 21864 0 7
    75124 21957 93 11
    75124 21971 107 8
    75124 22096 232 7
    75124 22781 917 8
    75124 22963 1099 8
    76664 21535 0 20
    76664 22001 466 31
    76664 22193 658 20
    76664 22371 836 61
    76664 22376 841 61
    76664 22379 844 59
    76664 22396 861 32
    76664 22466 931 19
    76664 22473 938 22
    76664 22494 959 24
    76664 22538 1003 52
    76664 22599 1064 31
    76664 22683 1148 30
    76664 22699 1164 31
    76664 22705 1170 38
    76664 22733 1198 77
    76664 22761 1226 85
    77099 22033 0 33
    77099 22035 2 35
    77099 22053 20 33
    77099 22104 71 29
    77099 22113 80 30
    77099 22132 99 31
    77099 22159 126 22
    77099 22202 169 14
    77099 22232 199 11
    77099 22239 206 13
    77099 22246 213 13
    77099 22295 262 11
    77099 22351 318 11
    77099 22361 328 11
    87535 21616 0 77
    87535 21663 47 28
    87535 21685 69 16
    87535 21717 101 14
    87535 21748 132 11
    87535 21783 167 9
    87535 21878 262 11
    87535 22005 389 10
    87535 22102 486 10
    87535 22118 502 11
    87535 22124 508 12
    87535 22174 558 12
    87535 22216 600 10
    87535 22448 832 11
    87535 22483 867 12
    87535 22516 900 12
    87535 22613 997 12
    87535 22792 1176 11
    87535 22942 1326 13
    87535 22953 1337 14
    87535 22970 1354 13
    89364 19521 0 9
    89364 20256 735 13
    89364 20264 743 14
    89364 20858 1337 25
    89364 21020 1499 12
    89364 21208 1687 14
    89364 21364 1843 12
    89364 21747 2226 12
    89364 22111 2590 16
    89364 22231 2710 14
    89364 22442 2921 31
    89364 22449 2928 23
    89364 22453 2932 20
    89364 22487 2966 14
    89364 22522 3001 15
    89364 22599 3078 16
    89364 22650 3129 17
    89364 22771 3250 10
    89364 22939 3418 15
    end
    format %td lab_date
    rename lab_date fecha1
    save file2, replace


    // Merging Lab values from dataset B to dataset A using rangejoin ++++++++++++++++++++++++++++++++++++++++++++++++++ +++
    use file1, clear


    // Creating range dates ++++++++++++++++++++++++++++++++++++++++++++++++++ +++++++++
    gen int date1 = fecha1 - 3
    gen int date2 = fecha1 + 3
    format %td date1 date2
    sum date1 date2

    sort pID fecha1
    list pID visit fecha1 date1 date2, noobs sepby(pID) ab(20)


    // Merging Lab values - using rangejoin ++++++++++++++++++++++++++++++++++++++++++++++++++ +++++++++++++++++++++++++
    count // n = 56
    rangejoin fecha1 date1 date2 using file2, by(pID) keepusing(ca125_lab)
    count // n = 58

    sort pID fecha1
    list pID visit fecha1 date1 date2 fecha1_U time_d1 time_d2 ca125_lab, noobs sepby(pID) ab(20)

    The list shows that pID # 77099 on visits # 8, 9, and 10 had the same lab values because the range chosen in rangejoint include the hospitalization dates for the following admission dates.


    // Creating difference in dates(fecha1 fecha1_U) ++++++++++++++++++++++++++++++++++++++++++++++++++ +++++++++++++++++++++++++
    gen fecha1_gap = fecha1_U - fecha1 if !missing(fecha1, fecha1_U)
    label var fecha1_gap "fecha1_U - fecha1"
    order fecha1_gap, after(fecha1_U)

    tab fecha1_gap, m
    /*
    fecha1_U - |
    fecha1 | Freq. Percent Cum.
    ------------+-----------------------------------
    -3 | 3 5.17 5.17
    -2 | 1 1.72 6.90
    -1 | 4 6.90 13.79
    0 | 10 17.24 31.03
    2 | 1 1.72 32.76
    3 | 4 6.90 39.66
    . | 35 60.34 100.00
    ------------+-----------------------------------
    Total | 58 100.00
    */

    sort pID fecha1
    list pID visit fecha1 fecha1_U fecha1_gap, noobs sepby(pID) ab(30)


    // Checking for duplicates ++++++++++++++++++++++++++++++++++++++++++++++++++ +++
    duplicates list pID visit, sepby(pID visit)

    duplicates tag pID visit, gen(dups)
    tab dups, m

    sort pID fecha1
    datacheck dups == 0, by(pID fecha1) varshow(pID fecha1 fecha1_U fecha1_gap ca125_lab) abbreviate(25) noobs flag
    sort pID fecha1
    list pID fecha1 fecha1_U fecha1_gap ca125_lab if _contra == 1, noobs sepby(pID fecha1) ab(25)

    egen new_pid = group(pID visit)

    sort pID fecha1
    list new_pid pID fecha1 fecha1_U fecha1_gap ca125_lab if _contra == 1, noobs sepby(pID fecha1) ab(25)

    bys new_pid (fecha1_U): gen byte new_visit = _n

    sort pID new_visit
    list new_pid pID new_visit fecha1 fecha1_U fecha1_gap ca125_lab if _contra == 1, noobs sepby(pID fecha1) ab(25)

    sort pID fecha1 fecha1_U
    list new_pid pID new_visit fecha1 fecha1_U fecha1_gap ca125_lab, noobs sepby(pID fecha1) ab(25)

    tab new_visit, m


    foreach varname of varlist fecha1_U ca125_lab {
    bys new_pid (fecha1_U): gen `varname'_2 = `varname' if (_contra == 1 & new_visit == 2)
    bys new_pid (fecha1_U): replace `varname'_2 = `varname'_2[2] if _contra == 1
    }

    format fecha1_U_2 %td
    sort pID fecha1 fecha1_U
    list new_pid pID new_visit visit fecha1 fecha1_U fecha1_U_2 ca125_lab ca125_lab_2 if _contra == 1, noobs sepby(new_pid) ab(25)


    // Deleting duplicated rows ++++++++++++++++++++++++++++++++++++++++++++++++++ +++
    bys new_pid (fecha1_U): drop if (new_visit != 1 & _contra == 1)

    count // n = 56

    sort pID fecha1
    list pID visit fecha1 fecha1_U fecha1_U_2 ca125_lab ca125_lab_2, noobs sepby(pID) ab(25)



    I deeply appreciate any guidance or insights you can provide.

    Eduardo Nunez, MD, MPH

  • #2
    -rangejoin- does not know that you don't want the same lab to be attributed to two different hospitalizations that both fall within 3 days of the lab. And there is no way, in -rangejoin- to specify that restriction. So, yes, you will be getting duplicates of some of the labs. If the same lab result falls within three days of two (or more) hospitalizations, it will appear associated with both (all) of those.

    If that is not what you want, then you have to come up with some rule for which hospitalization to associate the lab value with. Normally, when I am combining lab data with hospitalization data, the hospitalization data set has both an admission and discharge date, and I -rangejoin- those labs that fall between those two dates. But you have only one date (I can't tell if it's the admission date, the discharge date, or something else from what you've shown). So you need a rule. Associate to the first admission? The last? Pick the one with the closest date (which might have ties and then you need a rule for breaking the ties)? Pick one at random? Once you have settled on a rule, it's a matter of writing a -keep- or -drop- command that applies the rule after -rangejoin- has run.

    Comment


    • #3
      The date listed as fecha1 is admission date. I also have discharge date for each patient. Following your advice, I will establish the range using these two dates.
      Thank you Clyde.

      Comment

      Working...
      X