Managing large data-sets with limited IDs

Alex Slade

Join Date: Apr 2020

Posts: 3
#1

Managing large data-sets with limited IDs

30 Jan 2023, 14:27

I am working with several large data tables (e.g., A,B,C... F) each of which have millions of patient observations and outcomes unique to each of the data set. I have a list of only a few patient IDs that I am interested in. I wish to create a data set that has multiple columns A, B, C ....Z for only the patients I am interested in. Is there a user-written merge command that will allow me to do this easily? Thank you so much!
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30357
#2

30 Jan 2023, 14:54

In the simplest situation it is fairly easy to do. The simplest situation is where each of the data sets A, B, C, ... contains only one observation for each patient ID. Create a Stata data set containing only one variable, the patient ID, and only those which you are interested in. Make sure you include each ID only once. Make sure you give that variable the same name that the patient ID variable has in all of those other data sets. (And make sure that those other data sets also use the same name for the patient ID variable--if not, fix them so they all conform.) Then it's very simple:

Code:

use my_patient_ID_data_set, clear foreach x in A B C D E F { merge 1:1 patient_ID_variable using `x', keep(master match) nogenerate }

If, however, some of the data sets contain multiple observations of the same patient, then it is more complicated and your question cannot be answered without a clear explanation of what the final results should look like and more information about those data sets (that is, examples from the data set--you can fake the actual data if there are confidentiality issues). Use the -dataex- command to show the example data. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

By a clear explanation of what the results should be like, I mean: suppose data set B contains 2 observations for some patient, and data set E contains 3. How would we decide which observations from E to pair up with which observations from B? For example, you might want to match the observations not just on patient ID but also the date, or something like that.
Comment
Alex Slade

Join Date: Apr 2020

Posts: 3
#3

30 Jan 2023, 15:07

Thanks so much for your reply. There are many observations per patient in each of the data sets A through Z. At this point, I am just trying to winnow down the data sets to create a master file with all the patient observations and outcomes A--->Z (in each table) only for the patient IDs of interest. I am not trying to merge anything from A to B or B to D, but just capture a file that has all the observations from tables A to Z merged to the patient ID.
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30357

30 Jan 2023, 15:28

Oh, well that's simpler:

Code:

foreach x in A B C D E F {
    use my_patient_ID_data_set, clear
    merge 1:m patient_ID_variable using `x', keep(match) nogenerate
    save `x'_IDs_of_interest_only, replace
}

Announcement

Managing large data-sets with limited IDs

Comment

Comment

Comment