Hi - I am working with a very large datasets (~15-20GB) and I would like to reduce the number of observations in each dataset.
For example, say I have medical procedures dataset (PROCEDURES) with variables ID (patient's ID), proc (procedure code for the patients procedure), and gender (gender of the patient).
Let's say I only care about patients who underwent proc 23 BUT if the patient did undergo proc 23, then I also want to keep all their other observations (ie, ID == 1 underwent proc 23 so I want to keep the first two rows).
My approach so far has been to read in the PROCEDURES dataset and write 'drop if proc != 23', this way I can obtain the unique values of ID for which I want to keep in my datasets (but these are large datasets so this list ends up being ~100,000). Since I am working with many datasets that use ID as the unique identifier (one dataset with procedure codes like above, one dataset with patient address information, one dataset with insurance information, etc.), I would like to preserve this list so that I can go into each separate datasets and only keep the observations that have an ID in the unique list.
Since the list is so long, I don't think it would be useful to write them all out in a keep if command. Does anybody have any suggestions of how to do this effectively? I'm hoping there is something more simple than what I have tried below - which seems to work but is extremely slow compared to running drop commands on these datasets.
I saved the unique list as a separate dataset calling it `Unique ID' and then performing a merge command as follows:
merge 1:m ID using PROCEDURES
keep if _merge == 3
drop _merge
Maybe there is some way to store the list in a macro? I've also been referring to https://www.stata.com/support/faqs/d...-observations/
Cheers,
Peter
For example, say I have medical procedures dataset (PROCEDURES) with variables ID (patient's ID), proc (procedure code for the patients procedure), and gender (gender of the patient).
ID | proc | gender |
1 | 23 | M |
1 | 20 | M |
2 | 19 | F |
2 | 18 | F |
3 | 23 | F |
4 | 13454 | F |
Let's say I only care about patients who underwent proc 23 BUT if the patient did undergo proc 23, then I also want to keep all their other observations (ie, ID == 1 underwent proc 23 so I want to keep the first two rows).
My approach so far has been to read in the PROCEDURES dataset and write 'drop if proc != 23', this way I can obtain the unique values of ID for which I want to keep in my datasets (but these are large datasets so this list ends up being ~100,000). Since I am working with many datasets that use ID as the unique identifier (one dataset with procedure codes like above, one dataset with patient address information, one dataset with insurance information, etc.), I would like to preserve this list so that I can go into each separate datasets and only keep the observations that have an ID in the unique list.
Since the list is so long, I don't think it would be useful to write them all out in a keep if command. Does anybody have any suggestions of how to do this effectively? I'm hoping there is something more simple than what I have tried below - which seems to work but is extremely slow compared to running drop commands on these datasets.
I saved the unique list as a separate dataset calling it `Unique ID' and then performing a merge command as follows:
merge 1:m ID using PROCEDURES
keep if _merge == 3
drop _merge
Maybe there is some way to store the list in a macro? I've also been referring to https://www.stata.com/support/faqs/d...-observations/
Cheers,
Peter
Comment