Help with merging two datasets

Maria AbouKhalil

Join Date: Jul 2016

Posts: 6
#1

Help with merging two datasets

24 Jul 2016, 14:26

Hi,

I am trying to merge two datasets (or extract information from one dataset using another) and I am running into some problems and would love to have somebody's opinion on what the best way to proceed:

- My first dataset includes amongst other variables : id (patient's unique identifier), date1 (the date of their surgery). This dataset includes only patients from my whole cohort who have had a surgery.
Here is an example from dataex command: input str10 id long date1
"1002706622" 17179
"1004835910" 15929
"1005071077" 15620

-My second dataset includes amongst other variables: id (patient's unique identifier), no_seq_sej_hosp (an identification specific to each hospitalization), date_admis (admission date), date_depart (discharge date), date_arriv (date patients arrived to the ER). This dataset includes information on all hospitalizations for all patients in my cohort. So one patient will have one "id" but can have different "no_seq_sej_hosp" as they can have different hospitalizations for different reasons.

Here is an an example from dataex command:

input str10 id str16 no_seq_sej_hosp long(date_admis date_depart)
"0057128888" "9956322975623467" 14418 14420
"0057128888" "2954352785623469" 14101 14101
"0057128888" "9951302355603363" 15596 15601
"0057128888" "9957372985683867" 17078 17085
"0165250763" "0957382445643364" 16019 16030
"0165250763" "1956382275683369" 16554 16554

So a single patient could've been hospitalized for a variety of reasons, but I am only interested in the ones who have a hospitalization for surgery which are patients in the first dataset, and these patients will inevitably have their date1 from the first dataset included between date_depart and date_admis from the second dataset, and their hospitalization code no_seq_sej_hosp will then allow me to find these specific hospitalizations in different datasets.

Is there any way I can extract from the second dataset information only concerning patients who are in the first dataset by using the fact that their hospitalization day has to be included between the date_depart and date_admis of the second dataset?

I would really appreciate anybody's insight,
Thank you for your help,

Maria Abou Khalil
Tags: None

Oded Mcdossi

Join Date: Jun 2014
Posts: 577

25 Jul 2016, 00:19

Hi Maria, in the help file of merge command you'll find the option to keep only unmatched and matched observations from the master file and leave out all those from the using file. I've changed a bit your data so that you can see the results.

Code:

clear*
input str10 id str16 no_seq_sej_hosp long(date_admis date_depart)
"0057128888" "9956322975623467" 14418 14420
"0057128888" "2954352785623469" 14101 14101
"0057128888" "9951302355603363" 15596 15601
"0057128888" "9957372985683867" 17078 17085
"0165250763" "0957382445643364" 16019 16030
"0165250763" "1956382275683369" 16554 16554
end
tempfile using
save `using'
clear
input str10 id long date1
"0057128888" 17179
"1004835910" 15929
"1005071077" 15620
end
isid id, sort

merge 1:m id using `using', keep(master match)
l, sep(0)


     +-------------------------------------------------------------------------------+
     |         id   date1    no_seq_sej_hosp   date_a~s   date_d~t            _merge |
     |-------------------------------------------------------------------------------|
  1. | 0057128888   17179   9956322975623467      14418      14420       matched (3) |
  2. | 1004835910   15929                             .          .   master only (1) |
  3. | 1005071077   15620                             .          .   master only (1) |
  4. | 0057128888   17179   2954352785623469      14101      14101       matched (3) |
  5. | 0057128888   17179   9951302355603363      15596      15601       matched (3) |
  6. | 0057128888   17179   9957372985683867      17078      17085       matched (3) |
     +-------------------------------------------------------------------------------+

Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

25 Jul 2016, 08:53

If the first dataset can contain more than one surgery per id then merge is not the correct command to use as this would lead to a many to many merge. The appropriate alternative would then be joinby. Since there's an additional requirement to match surgeries to hospital stays, you can use rangejoin (from SSC) to perform the task. To install rangejoin, type in Stata's Command window:

Code:

ssc install rangejoin

Here a quick example that follows the format of your data. I include multiple surgeries per id and a case where two surgeries occur within the same hospital stay.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str10 id long date1
"0057128888" 15929
"0057128888" 17080
"1004835910" 15929
"0165250763" 16020
"0165250763" 16021
end
format %td date1
save "surgeries.dta", replace

* Example generated by -dataex-. To install: ssc install dataex
clear
input str10 id str16 no_seq_sej_hosp long(date_admis date_depart)
"0057128888" "9956322975623467" 14418 14420
"0057128888" "2954352785623469" 14101 14101
"0057128888" "9951302355603363" 15596 15601
"0057128888" "9957372985683867" 17078 17085
"0165250763" "0957382445643364" 16019 16030
"0165250763" "1956382275683369" 16554 16554
end
format %td date_admis
format %td date_depart

* find all surgeries from patients with the same -id- where the
* value of -date1- is within -date_admis- and -date_depart-
rangejoin date1 date_admis date_depart using "surgeries.dta", by(id)

list, sepby(id) noobs

and the results

Code:

. list, sepby(id) noobs

  +-------------------------------------------------------------------+
  |         id    no_seq_sej_hosp   date_ad~s   date_de~t       date1 |
  |-------------------------------------------------------------------|
  | 0057128888   9956322975623467   23jun1999   25jun1999           . |
  | 0057128888   2954352785623469   10aug1998   10aug1998           . |
  | 0057128888   9951302355603363   13sep2002   18sep2002           . |
  | 0057128888   9957372985683867   04oct2006   11oct2006   06oct2006 |
  |-------------------------------------------------------------------|
  | 0165250763   0957382445643364   10nov2003   21nov2003   11nov2003 |
  | 0165250763   0957382445643364   10nov2003   21nov2003   12nov2003 |
  | 0165250763   1956382275683369   28apr2005   28apr2005           . |
  +-------------------------------------------------------------------+

Comment

Maria AbouKhalil

Join Date: Jul 2016

Posts: 6
#4

25 Jul 2016, 09:12

Fantastic thank you very much for your help Oded and Robert! I will try this on my dataset right away!
Greatly appreciate the guidance!
Maria
Comment

Announcement