foreach, reshape or merge for giant dataset?

Geoff Dougherty

Join Date: May 2015

Posts: 42
#1

foreach, reshape or merge for giant dataset?

02 Jul 2015, 10:16

Hi, all. I have a dataset containing 26 million hospital discharge records. Each record has 25 fields containing a code for a procedure performed during the patient's stay.

I have a passel of flags that I need to set for various procedures (i.e., did the patient undergo coronary bypass surgery?) that can appear in any one of the 25 procedure fields. For obvious reasons, I'm interested in doing this as efficiently as possible.

Currently I've implemented this as follows:

Code:

foreach v of varlist icd_prcdr_cd* { replace cabg=1 if inlist(`v', "3610", "3611", "3612", "3613", "3614","3615", "3616") // flag for CABG operation replace valve=1 if strpos(`v', "350") | strpos(`v', "351") | strpos(`v', "352") // flag for valve operation etc. etc. // do this for 50 more flags }

This works, but is fairly slow. I'm wondering if it might be more productive to create a separate, normalized table with a unique id for patient record and a single field for procedure code, and simply run through each record of that table without -foreach-. I've tried this for one or two procedures and it seems speedy, but I don't know if it will scale.

Alternatively, I could create a third table containing the procedure code and flag variable for each procedure I'm interested in, and merge it with the normalized table.

Any thoughts on which of these approaches is likely to be the fastest would be great.

Thanks much.
Tags: None
Joe Canner

Join Date: Mar 2014

Posts: 580
#2

02 Jul 2015, 10:33

Geoff,

I do a lot of this sort of thing with very similar (if not identical) datasets on a regular basis and I use the method you outline. reshape is probably not a good idea in this case, as it is quite time consuming for large data sets (at least it was in Stata 12). However, once you have finished the reshape, the merge you describe might be fairly efficient. So, if you are only going to do one reshape and many merges, the total time might be faster. If you are only doing one merge, your original method is probably faster.

BTW, be careful with the strpos() function. In your example, strpos(`v',"350") will find the string "350" anywhere in the ICD-9 code variable, which is probably not what you want. Rather, you probably want something like substr(`v',1,3)=="350" or (better) inrange(substr(`v',1,3),"350","352) which implements the entire if clause.

Regards,
Joe
Comment
Geoff Dougherty

Join Date: May 2015

Posts: 42
#3

03 Jul 2015, 09:56

Thanks, Joe. This is really helpful.
Comment

Announcement

foreach, reshape or merge for giant dataset?

Comment

Comment