Restructuring database based on a different identification key

Kevin Liu

Join Date: Jul 2022

Posts: 5
#1

Restructuring database based on a different identification key

06 Aug 2022, 10:37

Hi everyone,

I am working with a large healthcare administrative database. As it is, the database is organized by a unique patient encounter ID# which constitute each row. Associated with each encounter are a number of variables, which define the number of columsn in this dataset. These variables/columns include information for surgery type, treating physician, complications, treating hospital characteristics (size, location, etc) and so on.

I want to restructure the database so that rather than each row being defined by the specific patient encounter ID, I use a unique physician ID instead. Basically turning this database a physician-based dataset rather than a patient-based dataset so that I can describe surgeon trends.

I've been using the collapse function to this point though it's pretty cumbersome. Does anyone know of any alternative functions that could point me to? Thank you!
Tags: None
Daniel Schaefer

Join Date: Mar 2020

Posts: 842
#2

06 Aug 2022, 14:09

Hi Kevin,

I take it you are describing a single table that you are working with, and not a full SQL database, correct? At the risk of being pedantic "restructure the database" sounds a bit violent. Surely you don't want to mutate the database. Rather, you want to take the current dataset as an input and produce a new dataset as an output, preserving the original dataset in the process? And this new dataset should include summary statistics for your physicians?

Data frames are very useful for this kind of procedure. It's hard to say exactly which features you need without knowing more about the problem. If you want specific advice on how to take better advantage of data frames for your problem, please post the details of what you are trying to do along with any relevant code. An example dataset generated with -dataex- would also be very useful. If you look through the FAQ at the top of the page I'm sure you will see that this is typically the preferred format for asking questions.

Finally, I take it you have programming experience in a language other than Stata. Note that there is a distinction in Stata between a "command" and a "function." Consider the following line of code:

Code:

egen newv1 = group(v1 v2)

Here, -egen- is a command whereas group() is a function. A line will always start with a command, and not every command will take advantage of functions.
1 like
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2449
#3

06 Aug 2022, 16:54

I doubt any "restructuring" needs to be done. To my understanding, every observation in your data set has a code or name for the physician. Perhaps you might like to assign a numeric code for each physician, which could be done with:

Code:

encode PhysicianName, generate(PhysicianCode)

Of course, spelling errors might cause problems with the preceding, and there are solutions to that (e.g., probabilistic matching), but that's another issue.

I suspect that you somehow think that Stata privileges one variable or another in some way, e.g. PatientID over PhysicianName, but that's most likely a misunderstanding. Or, perhaps you think you want a data set in which each observation contains all the data on each patient that that physician encountered. That would be a very undesirable thing to do in Stata, as that so-called "wide" layout is rarely useful in Stata.

Each of your current observations is an encounter, and whether that encounter is "owned" by the physician or by the patient wouldn't matter to Stata for all the purposes I can think of right now. Now, perhaps there is something you want to do which you think requires or would be easier with another kind of layout or structure. If that's true, I'd encourage you to post again here and explain what that is so we can explain how you might do that. You mention use of -collapse-, which makes me think that you are wanting to produce summaries by physician (e.g., mean number of patients treated). There are better ways to do that.

You can, for visual and perhaps other purposes, put the physician variable in the first column (I like to do that kind of thing) using -order Physician Name-. You might also like to sort your observations by PhysicianName and perhaps other things as well, in which case see -help sort- and -help gsort-. However, these things, which matter to human beings, don't in general matter to Stata.
Comment
Kevin Liu

Join Date: Jul 2022

Posts: 5
#4

15 Aug 2022, 18:33

Hi everyone,

Thank you all for your comments, which were extremely helpful.

For context, I am working with a dataset where each row represents a unique patient encounter with a corresponding unique patient identifier. Each encounter is associated with a physician ID number as well, though these may repeat as physicians treat multiple patients.
Other variables (columns of the table) which are pertinent to my question include: type of surgery (hip or knee arthroplasty), year in which surgery occurred, patient age (at time of encounter), sex, presence/absence of comorbidities (each comorb is one column), presence/absence of a complication (each complication is a column with 0/1).

My outcome of interest is the odds of a joint infection. My independent variable at this time is the total # of surgeries performed by a given surgeon in a given year.
I was able to calculate the odds ratio of each surgeon-year unit manually using the equation (OR=a*d/b*c), where a=infections in a year for surgeon , b = (total surgeries - # of infections) or non-infections. C= total infections for all surgeons in a given year. D= (total surgeries - # of total infections).

The overarching goal, however, is to plot on y-axis: adjusted odds ratio of a joint infection for each surgeon-year unit and on the x-axis: annual volume of surgeries.
The adjusted odds ratio would hopefully account for patient-level factors (age, sex, comorbidities) which might increase or decrease the odds of a joint infection. For example, a surgeon taking on very complex, sick surgeries is likely going to have more joint infections. Hopefully there is a way to control for that such that his/her given OR is not higher than it should be.

I've attached an image of the unadjusted OR vs annual volume graph that I made. I want to eventually show that at a certain annual volume, the odds ratio of joint infection reaches 1 (or some asymptote, whatever that may be).

It seems that I may need to run an individual regression for each surgeon-year unit (about 18,000). Hopefully there is a more elegant way to accomplish this than for me to manually do it.

Any help would be greatly appreciated. Thank you!

Attached Files
Comment
Kevin Liu

Join Date: Jul 2022

Posts: 5
#5

15 Aug 2022, 21:38

An example of what I'm trying to do is done in this paper: https://pubmed.ncbi.nlm.nih.gov/29122321/
Comment

Announcement

Restructuring database based on a different identification key

Comment

Comment

Comment

Comment