Collapsing Subsets of Grouped Observations

Joshua Rosen

Join Date: Nov 2020
Posts: 4

Collapsing Subsets of Grouped Observations

07 Jan 2021, 16:05

Hello,
I have a data file that lists changes in the type of oxygen delivery device being used by a patient in the hospital. It is in long format with each row representing a change to the oxygenation device. A sample of data for 3 patients is given below. Note that there are additional variables here that are not shown for clarity. elapsed_time is the time in hours that the patient was receiving oxygen through the specified type of device (calculated using the difference in the date_change field between the n and n+1 observation). escalation_therapy_nonvent is an indicator variable denoting if the delivery device counts as a "therapy escalation.
Note: all dates and IDs in the sample data-set have been modified for posting here

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str5 registryid double date_change float elapsed_time str19 delivery_device float escalation_therapy_notvent
"1" 1746208620000   .5497222 "bvm"                 .
"1" 1.7462109e+12  26.583334 "ventilator"          .
"1" 1.7463066e+12   7.933055 "nc"                  .
"1" 1746335159000   .4833333 "simple mask"         .
"1" 1746336899000  35.916943 "nc"                  .
"1" 1.7464662e+12   5.816389 "room air"            .
"1" 1746487139000  18.733612 "nc"                  .
"1" 1746554580000  124.94972 "ventilator"          .
"1" 1747004399000  2.0002778 "nc"                  .
"1" 1.7470116e+12   309.5333 "room air"            .
"1" 1748125920000       1.45 "nc"                  .
"1" 1748131140000 .016666668 "room air"            .
"1" 1.7481312e+12   .3330556 "non-rebreather"      1
"1" 1748132399000   2.283611 "nc"                  .
"1" 1748140620000       1.45 "non-rebreather"      1
"1" 1748145840000   3.483333 "hfnc"                1
"1" 1748158380000  1.1666666 "non-rebreather"      1
"1" 1748162580000  14.133333 "hfnc"                1
"1" 1.7482179e+12   334.9333 "room air"            .
"1" 1749423660000  153.11665 "ventilator"          .
"1" 1749974880000  .11666667 "bvm"                 .
"1" 1.7499753e+12       58.6 "ventilator"          .
"1" 1750187220000          . "ventilator"          .
"3" 1.7457636e+12   2.499722 "non-rebreather"      1
"3" 1745772599000  25.166945 "nc"                  .
"3" 1.7458632e+12  .08333334 "room air"            .
"3" 1.7458635e+12   56.91667 "nc"                  .
"3" 1.7460684e+12          . "room air"            .
"4" 1.7458956e+12  20.433056 "ventilator"          .
"4" 1745969159000  1.7166667 "venti mask"          1
"4" 1745975339000   29.03333 "hfnc"                1
"4" 1746079859000   .8169444 "high flow generator" 1
"4" 1.7460828e+12      17.75 "hfnc"                1
"4" 1.7461467e+12         30 "nc"                  .
"4" 1.7462547e+12 .033333335 "room air"            .
"4" 1746254820000  27.116667 "nc"                  .
"4" 1746352440000  .08305556 "room air"            .
"4" 1746352739000   79.51694 "nc"                  .
"4"  1.746639e+12         .5 "room air"            .
"4" 1.7466408e+12          . "nc"                  .
end
format %tc date_change

What I would like to do is collapse consecutive entries that represent an escalation of therapy (escalation_therapy_notvent == 1), which I would refer to as an "escalation episode" into a single observation with date_change == the first observation in the group being collapsed and the elapsed_time == total(elapsed_time for collapsed observations). There are additional variables not shown here and similar to date_change, I would want the entries from the first observation in an episode to be present in the collapsed observation.

I have been playing around with using various options with sortyby and referencing the _n+1th observation, but haven't been able to come up with something that will work with the variability present in my data (both in terms of the number of consecutive observations that may be present) and the number of "groups" within each patient. For example, a patient can have multiple "episodes" of escalation that are separated by normal oxygen delivery.

The purpose of this manipulation is that I want to see the distribution of typical durations of an escalation episode, but charting practices by providers and the clinical course of patients can vary considerably so some patients have 20 entries that correspond to a single "episode" but other patients have 1 or 2.

Thank you for any help that you can provide.

-JR

Tags: None

Clyde Schechter

Join Date: Apr 2014

Posts: 30095
#2

07 Jan 2021, 16:39

Code:

recode escalation_therapy_notvent (. = 0) by registryid (date_change), sort: gen spell = /// !escalation_therapy_notvent | /// (escalation_therapy_notvent != escalation_therapy_notvent[_n-1]) by registryid (date_change): replace spell = sum(spell) ds registryid spell elapsed_time, not local keep_first `r(varlist)' collapse (sum) elapsed_time (first) `keep_first', by(registry_id spell)

Notes:
1. Your variable escalation_therapy_notvent is constructed as 1 = yes, . = no. This is, in general, not a useful way to create a yes/no variable in Stata, and it is particularly noxious in this situation. First essential is to recode no as zero so that Stata's logical expressions will handle it properly.

2. Following what you said, elapsed time is summed, and registry_id and spell are constant in each collapsed observation. All other variables take the chronologically first from their spell. That's what you asked for. But for delivery_device that strikes me as wrong. Actually, I'm not sure what would be an appropriate thing to do with the delivery device variable here, as it is not consistent within spells of escalation.
Comment
Joshua Rosen

Join Date: Nov 2020

Posts: 4
#3

07 Jan 2021, 17:27

Thanks very much Clyde, this is great and again demonstrates for me the power of using the by command and thinking creatively with it.

To your points:
1) Thank you for pointing it out, this was definitely sloppy on my part

2) For the delivery_device variable, it is actually no longer important for the purposes of this study once you collapse it as we are treating them as equivalent for our outcome of interest. All of those values represent an escalation of care above baseline (captured by the escalation_therapy_notvent variable) that may be chosen for different reasons, but the outcome of interest is escalation in care, not the particular device that is used. I probably could have left it out of the example but wanted to give a bit more context for the data's purpose
Comment

Announcement

Collapsing Subsets of Grouped Observations

Comment

Comment