Finding rows across dataset in row-specific intervals: those used to compute stats in -rangestat-

Paula de Souza Leao Spinola

Join Date: Jun 2015
Posts: 384

Finding rows across dataset in row-specific intervals: those used to compute stats in -rangestat-

15 Aug 2022, 10:23

I have a dataset where each row is identified by a hospitalization id. Each row contains information on the hospitalization id, hospital id where it took place as well as date of admission and date of discharge of the given hospitalization.

I would like to know, for each hospitalization, the id of all other hospitalizations concluded in the 30 days before the beginning of the given hospitalization in the given hospital where it took place.

Below is a simple example of 6 hospitalizations taking place in 2 hospitals. I also created the date range encompassing the 30-day interval prior to the beginning of each given hospitalization.

Code:

clear
set obs 6
gen hospitalization_id = _n
gen hospital_id = "A" if _n<4
replace hospital_id = "B" if _n>3
gen date_start = mdy(1,1,2000) if _n==1
replace date_start = mdy(1,12,2000) if _n==2
replace date_start = mdy(1,20,2000) if _n==3
replace date_start = mdy(2,10,2000) if _n==4
replace date_start = mdy(2,12,2000) if _n==5
replace date_start = mdy(2,12,2000) if _n==6
gen date_end = mdy(1,3,2000) if _n==1
replace date_end = mdy(1,18,2000) if _n==2
replace date_end = mdy(1,22,2000) if _n==3
replace date_end = mdy(2,11,2000) if _n==4
replace date_end = mdy(2,14,2000) if _n==5
replace date_end = mdy(2,17,2000) if _n==6
format date* %td

gen date_range1 = date_start - 31
gen date_range2 = date_start - 1
format date* %td

I am trying to write a code that adds a column with all the hospitalizations that took place in the 30-day interval range before the given hospitalization started (and in sample hospital). I managed to write a code that tabulates the information I am looking for.

Code:

levelsof hospitalization_id, local(ids)
foreach i of local ids {
    display "hospitalization `i': list of hospitalizations in 30-day period before given hospitalization started, in the same hospital"
    tab hospitalization_id if inrange(date_end,date_range1[`i'],date_range2[`i']) & hospital_id == hospital_id[`i']
}

The output of this code is copied below

Code:

hospitalization 1: list of hospitalizations in 30-day period before given hospitalization started, in the sa
> me hospital
no observations
hospitalization 2: list of hospitalizations in 30-day period before given hospitalization started, in the sa
> me hospital

hospitaliza |
    tion_id |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          1      100.00      100.00
------------+-----------------------------------
      Total |          1      100.00
hospitalization 3: list of hospitalizations in 30-day period before given hospitalization started, in the sa
> me hospital

hospitaliza |
    tion_id |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          1       50.00       50.00
          2 |          1       50.00      100.00
------------+-----------------------------------
      Total |          2      100.00
hospitalization 4: list of hospitalizations in 30-day period before given hospitalization started, in the sa
> me hospital
no observations
hospitalization 5: list of hospitalizations in 30-day period before given hospitalization started, in the sa
> me hospital

hospitaliza |
    tion_id |      Freq.     Percent        Cum.
------------+-----------------------------------
          4 |          1      100.00      100.00
------------+-----------------------------------
      Total |          1      100.00
hospitalization 6: list of hospitalizations in 30-day period before given hospitalization started, in the sa
> me hospital

hospitaliza |
    tion_id |      Freq.     Percent        Cum.
------------+-----------------------------------
          4 |          1      100.00      100.00
------------+-----------------------------------
      Total |          1      100.00

.

I would like to add this information into the dataset into a new column. The command -rangestat- informs how many hospitalizations fall into the given row's time interval. If more than 1 hospitalization fall into the given row's interval, the entire row should be repeated so as to inform all of them. I managed to do this below:

Code:

. rangestat (count) nhosp=hospitalization_id, interval(date_end date_range1 date_range2) by(hospital_id)

. replace nhosp=0 if nhosp==.
(2 real changes made)

. expand nhosp
(2 zero counts ignored; observations not deleted)
(1 observation created)

. sort hospitalization_id

. list, sepby(hospital_id)

     +-----------------------------------------------------------------------------+
     | hos~n_id   hos~l_id   date_st~t    date_end   date_ra~1   date_ra~2   nhosp |
     |-----------------------------------------------------------------------------|
  1. |        1          A   01jan2000   03jan2000   01dec1999   31dec1999       0 |
  2. |        2          A   12jan2000   18jan2000   12dec1999   11jan2000       1 |
  3. |        3          A   20jan2000   22jan2000   20dec1999   19jan2000       2 |
  4. |        3          A   20jan2000   22jan2000   20dec1999   19jan2000       2 |
     |-----------------------------------------------------------------------------|
  5. |        4          B   10feb2000   11feb2000   10jan2000   09feb2000       0 |
  6. |        5          B   12feb2000   14feb2000   12jan2000   11feb2000       1 |
  7. |        6          B   12feb2000   17feb2000   12jan2000   11feb2000       1 |
     +-----------------------------------------------------------------------------+

Now I would like to add the information I managed to tabulate into a new column in this dataset. It would do somehow a similar job as -rangestat- but instead of computing a given statistic using the rows that fall in the given interval it would inform which these rows are.

Tags: None

Clyde Schechter

Join Date: Apr 2014

Posts: 30355
#2

15 Aug 2022, 10:40

Well, I'm not sure exactly what you want here. You are going to, potentially, identify more than one previous hospitalization for each hospitalization_id in your data, but you want the end result to be a single variable. I'll take that to mean that you want the final result to be a string variable that lists the id's of the previous hospitalizations.

Code:

preserve keep hospitalization_id hospital_id date_end tempfile end_dates save `end_dates' restore gen lower_bound = date_start - 30 rangejoin date_end lower_bound date_start using `end_dates', by(hospital_id) rename hospitalization_id_U other_hospitalization by hospitalization_id (date_end_U), sort: gen int seq = _n drop date_end_U lower_bound reshape wide other_hospitalization, i(hospitalization_id) j(seq) egen wanted = concat(other_hospitalization*), punct(" ") drop other_hospitalization*

-rangejoin- is written by Robert Picard and is available from SSC. To use it, you must also have -rangejoin- installed, which, it appears, you already do have.
Comment

Paula de Souza Leao Spinola

Join Date: Jun 2015
Posts: 384

15 Aug 2022, 12:31

This is awesome, Clyde Schechter! The first part of the code is exactly what I wanted - thank you!

Now I am trying to add another layer of complexity to the code. I also have physician id in my data and would like to find out the hospitalizations taking place during the 30-day period period to the start of a given hospitalization that were performed int the same hospital by *other physicians*.

I wrote the code below, which seems to work. Basically, I just repeated your code and then dropped hospitalizations performed by the same physician. Given that my data is massive and each physician is very active, I was wondering if there was a more efficient way of doing so.

Code:

clear
set obs 8
gen hospitalization_id = _n
gen physician_id = 1 if _n<4
replace physician_id = 2 if inrange(_n,4,5)
replace physician_id = 3 if inrange(_n,6,8)
gen hospital_id = "A" if inrange(_n,1,5)
replace hospital_id = "B" if _n>5
gen date_start = mdy(1,1,2000) if _n==1
replace date_start = mdy(1,12,2000) if _n==2 | _n==4
replace date_start = mdy(1,20,2000) if _n==3 | _n==5
replace date_start = mdy(2,10,2001) if _n==6
replace date_start = mdy(2,11,2001) if _n==7 
replace date_start = mdy(2,12,2001) if _n==8
gen date_end = mdy(1,3,2000) if _n==1
replace date_end = mdy(1,18,2000) if _n==2 | _n==4
replace date_end = mdy(1,22,2000) if _n==3 | _n==5
replace date_end = mdy(2,11,2001) if _n==6
replace date_end = mdy(2,14,2001) if _n==7
replace date_end = mdy(2,17,2001) if _n==8
format date* %td

// Lists all of these hospitalizations
preserve
    keep hospitalization_id hospital_id physician_id date_end
    tempfile end_dates
    save `end_dates'
restore

gen lower_bound = date_start - 31
gen upper_bound = date_start - 1
format *bound %td

rangejoin date_end lower_bound upper_bound using `end_dates', by(hospital_id)
drop date_end_U

list, sepby(hospital_id physician_id)

replace hospitalization_id_U = . if physician_id == physician_id_U
bys hospitalization_id: drop if hospitalization_id_U == . & _N>1
drop physician_id_U

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30355
#4

15 Aug 2022, 13:50

Well, you could combine the second and third to last commands into a single one:

Code:

bys hospitalization_id: drop if physician_id == physician_id_U & _N > 1

Theoretically, this will be a little faster than what you have, but I'd be very surprised if the difference is noticeable even in a very big data set.

I would definitely remove the -list- command. With a huge data set it's going to waste a lot of time listing out an enormous amount of output that will be too large for you to manage or use in any effective way anyhow. That should make a noticeable difference in the performance of the code.

Other than that, I don't see anything that can be improved.
Comment

Announcement

Finding rows across dataset in row-specific intervals: those used to compute stats in -rangestat-

Comment

Comment

Comment