Help with large dataset: getting average number of surgeries performed by hospital and surgeon

Christopher Dall

Join Date: Sep 2023

Posts: 11
#1

Help with large dataset: getting average number of surgeries performed by hospital and surgeon

11 Sep 2023, 12:00

Hi all,

First time posting, so please forgive any errors.

I have a large dataset of patients undergoing surgery (approx 140k). This data set is large but the relevant data is organized as follows

Patient ID Hospital_ID Year Surgeon ID

1 1 2005

2 1 2006

3 2 2007

4 3 2008

5 4

My goal is to determine the average number of surgeries performed by hospital and by surgeon per year and then divide the patients into 4 groups based on hospital/surgeon volume. To do this, I need to first sort by year or capture the unique hospital and surgeon IDs, then find the average number of observations (aka surgeries performed) by the hospitals and surgeons for that year, and then generate a new variable assigning patients into the quartiles based on hospital and surgeon volume.

I've thus far tried to generate code to find the unique hospital_ID codes by using the following:
bys Hospital_ID: generate _first = (_n == 1)

This has allowed me to determine if it's the first instance of the hospital ID performing the surgery on the patient, but honestly I have no idea where to go from here. Any insight would be greatly appreciated!

Thanks,
Chris
Tags: data, multilevel
Clyde Schechter

Join Date: Apr 2014

Posts: 30164
#2

11 Sep 2023, 12:09

I don't understand what you are trying to do. In a given year, a hospital, or a surgeon does a certain number of procedures. What are you trying to average across? Do you want the average number of procedures each year across all surgeons, and all hospitals? Or do you actually want the number of procedures each surgeon or hospital does? Or do you want to calculate the number of procedures done by each surgeon, and in each hospital in every year and then average that across the years? Or something else?

Also, the same surgeon might operate at multiple hospitals, right? So do you want to just add it up for the surgeon across all hospitals, or do you want a separate total for each combination of surgeon and hospital?

In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 18, 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
Comment
Christopher Dall

Join Date: Sep 2023

Posts: 11
#3

11 Sep 2023, 12:41

Thank you for such a quick reply. I am trying to get the average number of procedures performed each year on both a surgeon level and hospital level. So for example, say hospital ID 1 does 25 surgeries, hospital ID 2 does 50 surgeries, hospital 3 does 75 surgeries and hospital 4 does 100 and a fifth hospital does 0 (just to make the math easy). My goal is to determine the median and IQR for the number of surgeries performed by the hospitals ( in this case would be 50, 25-75)). I could then assign patients who underwent surgery into an appropriate quartile of hospital volume. I would do the same with surgeon volume separately, so it would be two separate variables created (making It or if the surgeon operates at multiple hospitals)

I will certainly update this post asap with the Dataex information. Thank you for the information!!!

Last edited by Christopher Dall; 11 Sep 2023, 12:43.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30164
#4

11 Sep 2023, 12:58

OK, based on the table you showed in #1, and assuming the data are actually of suitable types, you can probably do his most simply with:

Code:

frame put surgeon_id year, into(surgeons) frame surgeons { contract surgeon_id year, freq(procedures_this_surgeon) egen quartile_this_surgeon = xtile(procedures_this_surgeon), /// nq(4) by(year) } frame put hospital_id year, into(hospitals) frame hospitals { contract hospital_id year, freq(procedures_this_hospital) egen quartile_this_hospital = xtile(procedures_this_hospital), /// nq(4) by(year) } frlink m:1 surgeon_id year, frame(surgeons) frget *_this_surgeon, from(surgeons) frlink m:1 hospital_id year, frame(hospitals) frget *_this_hospital, from(hospitals)

Notes:
1. The -egen xtile()- function is not part of official Stata. You can get it as part of the -egenmore- package, from SSC.
2. The code evidently relies on frames, which means you must have version 16 or later to run this. Note that when not using the current version of Stata (18 as of now) one should include the version you are using in your post to avoid getting responses that you cannot use.
Comment
Christopher Dall

Join Date: Sep 2023

Posts: 11
#5

11 Sep 2023, 14:24

This worked excellently! Thank you so much!!!! I also appreciate your patience-I'm very new to stata (trying as much as possible to learn).
Comment

Christopher Dall

Join Date: Sep 2023
Posts: 11

18 Sep 2023, 13:35

Sorry to bring this up, while I had thought this worked, I'm not sure it has in hindsight. I've attached the data code and some more information to help troubleshoot.

My goal was to place the hospitals & surgeons into quartiles based on volume of surgeries performed, but it looks like they are unequally distributed within the quartiles, but I am unsure as to why. I wonder if it has to do with the frequency command. When I tabulate the hospital quartiles following this code, I get the code output below (just showing one year for ease of reading). Only showing the data for hospitals, but having the same issue with surgeons (and the data is arranged the same way)

Code:

-> year = 2016

quartile_th |
is_hospital | Freq. Percent Cum.
------------+-----------------------------------
1 | 286 2.71 2.71
2 | 1,118 10.58 13.28
3 | 2,516 23.80 37.09
4 | 6,650 62.91 100.00
------------+-----------------------------------
Total | 10,570 100.00

For reference, the code I used was the one you gave me (Copied below), and I am using STATA18.

Code:

frame put PROV_ID year, into(hospitals)
frame hospitals {
contract PROV_ID year, freq(procedures_this_hospital)
egen quartile_this_hospital = xtile(procedures_this_hospital), nq(4) by(year)
}

frlink m:1 PROV_ID year, frame(hospitals)
frget *_this_hospital, from(hospitals)

MY dataset (from dataex) is listed below as well.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input long PAT_KEY float year long PROV_ID
-1884967568 2005 103
-1884966826 2004 103
-1884966619 2005 103
-1884966139 2005 103
-1884965640 2005 103
-1884965393 2004 103
-1884965239 2005 103
-1884964785 2004 103
-1884964367 2005 103
-1884963641 2005 103
-1884963495 2005 103
-1884963306 2005 103
-1884963225 2004 103
-1884962942 2005 103
-1884962900 2005 103
-1884962524 2005 103
-1884962476 2005 103
-1884961710 2005 103
-1884961466 2005 103
-1884960422 2005 103
-1884960418 2005 103
-1884960362 2005 103
-1884960342 2005 103
-1884960060 2005 103
-1884959454 2005 103
end

Last edited by Christopher Dall; 18 Sep 2023, 13:42.

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35775
#7

19 Sep 2023, 03:45

The limitations of quantile binning are frequently raised here. Stata's procedures are necessarily constrained by the frequencies that occur, given the rule that observations with the same value must be assigned to the same quantile bin (class, interval). If there are few distinct values and/or many ties, the problem is often severe. The same problem would bite in any other software.

Your data example does not show procedures per hospital and in any case is not large enough to show the problem, but it should be apparent if you look carefully at

Code:

tab procedures year

after contract.

For some discussions, see https://journals.sagepub.com/doi/pdf...867X1201200413 Section 4

https://journals.sagepub.com/doi/pdf...867X1801800311 Section 6

Your choices include avoiding binning altogether and accepting that unequal frequencies are not at all surprising.

Last edited by Nick Cox; 19 Sep 2023, 04:06.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35775
#8

19 Sep 2023, 04:04

Cross-posted at https://www.reddit.com/r/stata/comme...of_procedures/

It is a request here and a rule on Reddit's r/stata that you flag cross-postings.

For our policy see https://www.statalist.org/forums/help#crossposting

The good folks at Reddit can fend for themselves, I trust.
Comment

Patient ID	Hospital_ID	Year	Surgeon ID
1	1	2005
2	1	2006
3	2	2007
4	3	2008
5	4

Announcement