  • Data Manipulation

    I am working on a big database from which I have extracted the key subjects that fit the cohort I want to study, based on common characteristics.
    To be more specific, I am extracting patients who had a specific type of surgery. I generated the variables based on procedure codes that are available in the data set.
    Now that I have the list of patients who have had this particular surgery, I want to use the surgeon-identifier code that appears in each patient's observation, go back to the original data set, and extract every surgery performed by that surgeon.
    How do I go back to the original data set and tell it to look only for the surgeons I found in my first extraction?

    In my .do file, I reference the extracted data for people who have had the surgery I am studying, then collapse the data by the surgeon ID code listed in the database.
    Now that I have these unique ID codes (which are unique to the year of the surgery), I want to go back to the original data set and pull out every surgery done by these surgeons.
    Any help would be greatly appreciated.

    Thank you so much for any and all help.

  • #2
    It would have been helpful to see examples of the data sets in order to give concrete advice. Absent that, the generic answer is that -merge- is your friend here. You have, it appears, already created a data set consisting of the IDs of the surgeons who performed the cases you are interested in, and have reduced it to one observation (not "row") per surgeon. So let's call this data set surgeons.dta; it contains, crucially, the variable surgeon_id, and all values of surgeon_id are distinct (i.e., no value of surgeon_id occurs twice in this data set). You also have the big database, which I'll just call big_database.dta, and it, too, contains the variable surgeon_id, along with other information that you want to retrieve. Then it's just this:

    Code:
    use surgeons, clear
    merge 1:m surgeon_id using big_database, keep(match) nogenerate
    This will cause Stata to match the surgeons.dta data set with the big database, pairing up observations that have the same value of surgeon_id. The -keep(match)- option then tells Stata to retain only those observations where the surgeon_id appears in both data sets. That will give you everything there is in big_database.dta that pertains to the particular surgeons you are interested in.
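    To make the mechanics concrete, here is a toy sketch you can run as is (all file and variable names here are made up purely for illustration):

    Code:
    * toy surgeons.dta: one observation per surgeon of interest
    clear
    input long surgeon_id
    101
    102
    end
    save surgeons_toy, replace

    * toy big database: several cases per surgeon, plus one surgeon (999)
    * whose cases we do not want
    clear
    input long surgeon_id str10 procedure
    101 "hip"
    101 "knee"
    102 "hip"
    999 "shoulder"
    end
    save big_toy, replace

    * keep(match) retains only observations whose surgeon_id appears in both
    use surgeons_toy, clear
    merge 1:m surgeon_id using big_toy, keep(match) nogenerate
    list, clean

    The result keeps all three cases for surgeons 101 and 102 and drops surgeon 999's case.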

    Do read -help merge-, and also click through to the linked manual section. -merge- is one of the fundamental data management tools that every Stata user needs to become proficient at using. Invest some time in learning it in greater generality than what is shown in this highly focused response to your question. That time will be amply repaid in the long run.

    Comment


    • #3
      Mr.Schechter,

      Thank you so much for your response. I did not want to inundate readers with the code but am happy to provide it:
      --
      use "Extracted_database.dta"  // database of extracted patients that have had the 2 surgeries I am calling into question
      log using "Analysis.log", replace
      set mem 4g

      generate count=1

      collapse (count) key (sum) Surgery_1 Surgery_2, by(surgid_1 surgid_2 year)
      save "Volume.dta", replace

      ---

      Up until this point, I am telling SAS to reference the data file which had all patients who received surgery_1 or surgery_2.
      Then I collapsed the data to show simply the surgeon ID (surgid_1 or surgid_2, depending on the year of the surgery, as the variable name differed by year) and saved this as another data file.

      Based on your response, my next step would be to do the following in the same .do file

      --
      local grains ""Big_database_year1.dta" "big_database_year2.dta""  // Calling SAS to now look to the big_database for the years in question
      foreach x of local grains {



      sort surgid_1 surgid_2 year  // Calling SAS to now sort the big database based on the surgeon ID and year
      save temp1.dta, replace  // Saving this as a temporary file
      use "`x'", clear

      use Volume.dta, clear
      merge 1:m surgid_1 surgid_2 year using temp1.dta, keep(match) nogenerate  // this will merge the surgid_1 and surgid_2 information from the surgeon IDs I have pulled with the big database, which will then give me every patient that the surgeons I have specified operated on?

      tab _merge
      drop _merge

      erase temp1.dta

      }


      stop
      --

      Comment


      • #4
        The only thing that confuses me at this point is that your unit of analysis here seems to be a pair of surgeons, defined by surgid_1 and surgid_2, rather than individual surgeons. Since most surgery is performed by a team, this is certainly plausible, but I just want to make sure that is what you intend.

        Assuming it is, the code you show looks like it should work properly in Stata, but not so much in SAS I think!

        Comment


        • #5
          Yes, I meant Stata. I apologize... I have training in SAS and am currently learning Stata, which is why I am posting.

          It's not that there are pairs of surgeons. It is that the database assigns a unique surgeon ID code to the surgeon who performs the surgery (it's a national database). In the years I am studying, two different variable names were used for surgeon IDs (surgid_1 and surgid_2). That is why I am using both, so that both are captured: some data points will have entries under the surgid_1 variable and others will have entries under the surgid_2 variable.

          Again, thank you so much for all your help. It has been invaluable.

          Best,
          Adam

          Comment


          • #6
            I see. That makes sense.
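
            As an aside, if it ever becomes convenient to work with a single surgeon ID variable rather than two, the pair can be combined like this (a sketch, assuming the two variables are stored the same way and at most one is filled in per observation):

            Code:
            * combine the year-specific ID variables into one surgid,
            * taking whichever is non-missing in each observation
            generate surgid = surgid_1
            replace surgid = surgid_2 if missing(surgid)

            If the IDs are strings, the same logic applies, with missing() treating the empty string "" as missing.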

            So, welcome to Stata. I think you will find that Stata is very different from SAS in the way you think about data and in the way you write code. Personally, I never liked working in SAS and was relieved to get away from it. When I first started using Stata, back in version 4, it was "love at first sight." I abandoned SAS and never looked back. Of course, SAS has its partisans, too, and they tend to dislike Stata. It's a different cognitive style, I suppose. Anyway, I hope you enjoy Stata.

            Comment


            • #7
              So I ran the program overnight (it's a large Big_database.dta). Unfortunately, it did not pull any new observations from big_database.dta. It simply kept the observations from the Volume.dta file that I generated in the original extraction to obtain the original surgeon IDs.

              This is the exact code I ran:

              ____

              use "Extraction.dta"
              log using "Analysis1.log", replace
              set mem 4g

              generate count=1

              collapse (count) MRN (sum) Surg_1 Surg_2, by(surgid_1 surgid_2 year)
              save "Volume.dta", replace


              local grains ""big_Database_year1.dta" "big_Database_year2.dta""

              foreach x of local grains {

              sort surgid_1 surgid_2 year
              save "temp1.dta", replace
              use " 'x' ", clear

              use "Volume.dta", clear
              merge 1:m surgid_1 surgid_2 year using "temp1.dta", keep(match) nogenerate

              erase temp1.dta

              }


              stop


              ____

              The goal is to use the Surgeon ID's generated in Volume.dta to extract every procedure done by that surgeon from the Big_database.dta.

              Any thoughts why it did not merge the two datasets?

              Thank you again,
              Adam

              Comment


              • #8
                So I see a few problems in
                foreach x of local grains {

                sort surgid_1 surgid_2 year
                save "temp1.dta", replace
                use " 'x' ", clear

                use "Volume.dta", clear
                merge 1:m surgid_1 surgid_2 year using "temp1.dta", keep(match) nogenerate

                erase temp1.dta

                }
                First, -use "'x'"- is not going to read in any files: it will just produce the error message "file 'x'.dta not found". The left quote in front of x needs to be the special ` character located to the left of the 1 key on a US keyboard, not the ordinary single quote character. In fact, it should have broken execution altogether, so perhaps you have not copied here accurately what you ran.

                Next, if you trace the flow of what data is where in the loop, it does not seem to lead where you want it to go. The goal is to use Volume.dta as a template of surgid's to pull information from the two big_Database's. Here's how I would structure this.

                Code:
                local grains big_Database_year_1 big_Database_year_2
                
                use Volume, clear
                foreach g of local grains {
                    merge 1:m surgid_1 surgid_2 year using `g', keep(match master)
                    tab _merge
                    drop _merge
                }
                At the end of that, the data in memory should be what you want.

                So try running that. When it runs, review the Stata output. After each merge command you will get some output that shows how many observations matched. If it turns out that no observations are matching, then there is a problem with the data. Perhaps there is a difference in the way surgid_1 and surgid_2 were coded in the two databases, issues of upper vs lower case, or padding with blanks, or something like that? A visual inspection of the two data sets in the Browser might shed light on that. Feel free to post back with examples of the data and with output if this isn't working out.

                Note that I have changed the options on the -merge- command from what came before. That is because -keep(match)- would have been fine if only one data set were being merged, but the removal of unmatched observations after the first merge would bleed the second merge of possibilities to match, so that had to be expanded to -keep(match master)-. I also dropped the -nogenerate- option because that would have suppressed the information about how many matches you were getting. I was being too optimistic! (In fact, in my own practice, when I am developing a program I never use the -nogenerate- option until I'm sure all my -merge-s are working as expected. Then I sprinkle them in at the end to eliminate the no-longer-needed output. I shouldn't have jumped the gun with yours.)

                Comment
