  • Keeping/Dropping variables based on large list

Hi - I am working with very large datasets (~15-20 GB each) and I would like to reduce the number of observations in each dataset.

For example, say I have a medical procedures dataset (PROCEDURES) with variables ID (patient's ID), proc (procedure code for the patient's procedure), and gender (gender of the patient).
    ID proc gender
    1 23 M
    1 20 M
    2 19 F
    2 18 F
    3 23 F
    4 13454 F

Let's say I only care about patients who underwent proc 23, BUT if a patient did undergo proc 23, then I also want to keep all their other observations (i.e., ID == 1 underwent proc 23, so I want to keep the first two rows).

My approach so far has been to read in the PROCEDURES dataset and run 'drop if proc != 23'; this way I can obtain the unique values of ID that I want to keep in my datasets (but these are large datasets, so this list ends up being ~100,000 IDs long). Since I am working with many datasets that use ID as the unique identifier (one dataset with procedure codes like above, one with patient address information, one with insurance information, etc.), I would like to preserve this list so that I can go into each separate dataset and keep only the observations whose ID is in the unique list.

Since the list is so long, I don't think it would be useful to write them all out in a keep if command. Does anybody have suggestions on how to do this effectively? I'm hoping there is something simpler than what I have tried below, which seems to work but is extremely slow compared to running drop commands on these datasets.

I saved the unique list as a separate dataset, calling it `Unique ID', and then performed a merge command as follows:

    merge 1:m ID using PROCEDURES
    keep if _merge == 3
    drop _merge

    Maybe there is some way to store the list in a macro? I've also been referring to https://www.stata.com/support/faqs/d...-observations/

    Cheers,
    Peter



  • #2
    Code:
    by ID, sort: egen had_proc_23 = max(proc == 23)
    keep if had_proc_23
    keep ID
    by ID, sort: keep if _n == 1
    save ids_who_had_proc_23, replace
will give you a data set containing all and only the IDs of patients who had procedure 23 at some point, sorted and ready to -merge- with other data sources.
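
For instance (a minimal sketch, not from the original post, assuming a hypothetical address file that is also keyed on ID and that only matched observations are wanted), the saved list can then be applied to any of the other datasets like this:

Code:
use patient_addresses, clear    // hypothetical dataset that also contains ID
merge m:1 ID using ids_who_had_proc_23, keep(match) nogenerate
save patient_addresses_proc23, replace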

    In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.



    • #3
      Hi Dr. Schechter,

Thanks for your response, and especially for the pointer on how to interact in the forum, as I'm a medical student new to Stata.

One thing I overlooked is that my proc variable is actually a string variable, since some patients have HCPCS codes in addition to CPT codes. I tried the code and got a type mismatch error - do you know of a way to resolve this issue? The -dataex- output is below; I had to recreate the data since the medical data I am working with are sensitive.

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input byte ID str5 proc str1 gender
      1 "23"    "M"
      1 "J20"   "M"
      2 "19"    "F"
      2 "18"    "F"
      3 "23"    "F"
      4 "13454" "F"
      end

      I really appreciate the help!

      Cheers,
      Peter



      • #4
        It's a very simple change:
        Code:
        by ID, sort: egen had_proc_23 = max(proc == "23")
        The rest of the code is unchanged.



        • #5
          Hi Dr. Schechter,

Thanks for your response - the code you wrote works great! However, the 1:m merges based on the unique IDs seem to run quite slowly compared to other commands I've run on these datasets.

I've created the loop (code below), and Stata (64-bit, version 15) hasn't finished the first merge of the loop within 12 hours. Each dataset is ~10-20 GB, so maybe the size is an issue? Since the keep commands you wrote above took only a few minutes to run, I wasn't sure whether there was a way to save the unique IDs in a vector, as a global or something, and then use that to perform simpler drop/keep commands. I imagine Stata is working very hard during the merging process.

          Do you have any suggestions as to how to speed up this process of dropping unwanted IDs from each dataset?

          Code:
          foreach x in "CLAIM" "CONDITION_CODE" "DIAG" "LOCATION" "NDC" "OCCURRENCE" "OCCURRENCE_SPAN" "PAYOR" "PROVIDER" "RACE" "REV" "VALUE" {
          clear
          *the next line reads in the unique IDs generated from the PROCEDURES dataset
          use "/Volumes/easystore/BREAST CANCER DTAs/Outpatient/2016 OP/SPARCS_PROCEDURES_2016_Limited_OP_2016 breast cancer_IDs.dta"
          *the following code is to merge into the unique IDs to just extract data for patients who underwent a specific procedure
          merge 1:m CLM_TRANS_ID using "/Volumes/easystore/BREAST CANCER DTAs/Outpatient/2016 OP/SPARCS_`x'_2016_Limited_OP_2016 breast cancer.dta"
          keep if _merge == 3
          drop _merge
          save "/Volumes/easystore/BREAST CANCER DTAs/Outpatient/2016 OP/SPARCS_`x'_2016_Limited_OP_2016 breast cancer_merged.dta"
          clear
          }
          Thanks so much for your help!

          Best,
          Peter



          • #6
            You may or may not be able to speed this up.

You've already noticed that even just the -by ID, sort:- step is time consuming. That's not really surprising. Sorting time is proportional to N log N, where N is the number of observations in the data set, so if N is big, as yours apparently is, sorting is going to be slow. -merge- has to sort both data sets as its first step, so -merge- is a double whammy: it has to read in a large data set (and reading files is slow) and then sort it (again, slow).
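
As a rough illustration of what that growth means (back-of-the-envelope arithmetic based on the N log N scaling above, not a benchmark of Stata itself): going from 1 million to 100 million observations multiplies the sorting work by about 100 × log(10^8)/log(10^6) ≈ 133, so the time grows somewhat faster than the data itself.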

There isn't a whole lot you can do about this, but here are a few things that may help a little bit:

1. Install Sergio Correia's -ftools- suite (available from SSC) and use -fsort- to sort the data sets. So, for example, rather than -use-ing the breast cancer IDs data, I would start by -use-ing the 2016 breast cancer data set, -fsort- it on CLM_TRANS_ID, and then do the -merge- (a minimal sketch follows this list). That may be faster. -fsort- isn't always faster than -sort-, but it sometimes is.

            2. File reading is always slow, but put the data set on the fastest drive you can. Network drives are especially slow because you have to wait for everything to move over the network, during which time you are competing with everything else on your network for its bandwidth. So if your file is on a network drive, try to get it moved to a local drive, preferably an SSD, or, if not, then the fastest hard drive you have.
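
In case it helps, here is a minimal sketch of what point 1 might look like (file names are shortened placeholders for the paths in #5, and -ftools- is assumed to be installed via ssc install ftools):

Code:
* load the large file first, pre-sort it on the key with -fsort-, then merge in the ID list
use "SPARCS_CLAIM_2016_Limited_OP_2016 breast cancer.dta", clear
fsort CLM_TRANS_ID
merge m:1 CLM_TRANS_ID using "SPARCS_PROCEDURES_2016_Limited_OP_2016 breast cancer_IDs.dta", keep(match) nogenerate
save "SPARCS_CLAIM_2016_Limited_OP_2016 breast cancer_merged.dta", replace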

            These may help a bit, but don't expect miracles. Working with large files is slow.

I wasn't sure whether there was a way to save the unique IDs in a vector, as a global or something, and then use that to perform simpler drop/keep commands.
I'm not sure what you mean: there's nothing complicated about the -keep- commands in #2. While you might be able to store the IDs you want in a local macro or a matrix, any way I can think of to then use that structure to identify which observations to keep would actually be much slower.



            • #7
The string function -regexm()- can accept up to 1,000 strings in a single pattern, so I am not convinced that a brute-force approach using locals cannot do better than 12 hours. In fact, because you will load the data set only once, I am thinking minutes instead of hours. Because the suggested solution makes use of -levelsof-, make sure that your ID strings are single words before starting. For this, see

              Code:
              help strtoname
With roughly 100,000 IDs to match, you just need roughly 100 locals. Once you have the dataset with the list of IDs to keep (ensure there are no duplicate IDs), run the following:

              Code:
gen order = _n
* the first local holds IDs 1-1000; each subsequent local holds the next batch of 1,000 IDs
levelsof ID if order <= 1000, local(keep1) sep(|) clean
forval i = 2/100 {
    levelsof ID if inrange(order, `=`i'000-999', `=`i'000'), local(keep`i') sep(|) clean
}
              Open the full dataset and then run the following:

              Code:
gen tokeep = 0
forval i = 1/100 {
    * flag observations whose ID matches one of the alternatives in the i-th batch
    replace tokeep = regexm(" " + ID + " ", "['!?,\. ](`keep`i'')['!?,\. ]") if !tokeep
}
keep if tokeep



              • #8
I'm hoping there is something simpler than what I have tried below, which seems to work but is extremely slow compared to running drop commands on these datasets.

I saved the unique list as a separate dataset, calling it `Unique ID', and then performed a merge command as follows:

                merge 1:m ID using PROCEDURES
                keep if _merge == 3
                drop _merge
Thinking about it again, there is an easier solution to this problem: just append the ID list to the main dataset, generate an indicator flagging IDs that appear in the list, drop the appended observations, and keep only the flagged ones. Here is an MWE.

                Code:
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input str4 ID
                "id1"  
                "id2"  
                "id7"  
                "id9"  
                "id14"
                end
                
                save matches
                
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input str4 ID float somevar
                "id1"  45
                "id2"  35
                "id3"  18
                "id4"   4
                "id5"  45
                "id6"  10
                "id7"  42
                "id7"  72
                "id7"  27
                "id10" 75
                "id11" 48
                "id12" 51
                "id13" 16
                "id14" 97
                "id14" 89
                "id14" 88
                "id14" 59
                "id18"  9
                "id19" 44
                "id20" 93
                end
                save dataset2
                
use matches, clear
gen group = 1               // flag observations that come from the ID list
keep ID group
save matches, replace

use dataset2, clear
append using matches
* within each ID, group[1] is 1 if that ID appears in the list, missing otherwise
bys ID (group): gen tokeep = group[1]
drop if group==1            // drop the appended list observations
keep if tokeep==1           // keep only IDs found in the list
                Res.:

                Code:
                . l, sep(10)
                
                     +---------------------------------+
                     |   ID   somevar   group   tokeep |
                     |---------------------------------|
                  1. |  id1        45       .        1 |
                  2. | id14        97       .        1 |
                  3. | id14        88       .        1 |
                  4. | id14        89       .        1 |
                  5. | id14        59       .        1 |
                  6. |  id2        35       .        1 |
                  7. |  id7        42       .        1 |
                  8. |  id7        27       .        1 |
                  9. |  id7        72       .        1 |
                     +---------------------------------+
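
If it is of any use, here is a rough sketch of how this approach could slot into the loop from #5 (paths and the dataset list are shortened placeholders, the key is CLM_TRANS_ID as in #5, and this is untested on the real files):

Code:
* prepare the ID list once: one row per CLM_TRANS_ID, with group = 1
use "breast_cancer_IDs.dta", clear
gen group = 1
keep CLM_TRANS_ID group
save "breast_cancer_ID_matches.dta", replace

foreach x in "CLAIM" "DIAG" {    // placeholder subset of the datasets listed in #5
    use "SPARCS_`x'_2016_Limited_OP_2016 breast cancer.dta", clear
    append using "breast_cancer_ID_matches.dta"
    * within each ID, group[1] is 1 if the ID appears in the list, missing otherwise
    bys CLM_TRANS_ID (group): gen tokeep = group[1]
    drop if group==1
    keep if tokeep==1
    drop tokeep group
    save "SPARCS_`x'_2016_Limited_OP_2016 breast cancer_kept.dta", replace
}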

