  • Data manipulation - your experience with merging

    Hi, I’m working with very large datasets - ones that contain over 1 million observations with 20+ variables.

    Let’s call these datasets the hospital dataset and the procedure dataset.

    Question: in your experience, would you recommend keeping the number of merges to a minimum - merging just once and then working on the replicates from that one merge? Or do you think the following process is equally effective? I would think the process below is simply more manageable, since one knows exactly that all values are unique and there shouldn’t be any issues when merging 1:1.

    Step 1:
    Identify the unique HES values in the hospital dataset and merge 1:1 onto my procedure dataset.

    Step 2:
    Identify the replicate values in the hospital dataset, pick the first one, and merge onto my procedure dataset using 1:1.

    Step 3:
    Look at the ones with the same procedureno but different hospitalid and identify what’s different. Then do the final merge onto the procedure dataset.

    That’s a total of 3 merges. Would you recommend this, or would you recommend just doing 1 merge from the hospital dataset to the procedure dataset and then dealing with the replicates within that dataset, since the above requires 3 merges?



  • #2
    Identify the unique HES values in the hospital dataset and merge 1:1 onto my procedure dataset.
    I assume you mean something like
    Code:
    use proceduredataset
    merge 1:1 hospitalid using hospitaldataset_uniques    // guessing that hospitalid is the key variable
    A 1:1 merge will not work in this case because proceduredataset has multiple observations for each hospital. What you would need is
    Code:
    use proceduredataset
    merge m:1 hospitalid using hospitaldataset_uniques
    which reflects the fact that proceduredataset has multiple observations for each hospital.
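
    For what it is worth, a file like hospitaldataset_uniques could be built along the lines sketched below. This is only a guess at your setup - I am assuming from the variable names mentioned in #1 that hospitalid is the key.
    Code:
    * build hospitaldataset_uniques: keep hospitals whose hospitalid appears exactly once
    * (hospitalid is an assumption on my part - substitute your actual key variable)
    use hospitaldataset, clear
    duplicates tag hospitalid, generate(dup)
    keep if dup == 0
    drop dup
    save hospitaldataset_uniques, replace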

    Now as an alternative you suggest
    just do 1 merge from hospitaldataset to proceduredataset and then deal with replicates within that dataset
    For this you might then be thinking something like
    Code:
    use proceduredataset
    merge m:m hospitalid using hospitaldataset
    since both datasets have multiple observations for each hospital.

    The following is copied word-for-word from the documentation of the merge command in the Stata Data Management Reference Manual PDF included in the Stata installation and accessible from Stata's Help menu.

    m:m merges

    m:m specifies a many-to-many merge and is a bad idea. In an m:m merge, observations are matched within equal values of the key variable(s), with the first observation being matched to the first; the second, to the second; and so on. If the master and using have an unequal number of observations within the group, then the last observation of the shorter group is used repeatedly to match with subsequent observations of the longer group. Thus m:m merges are dependent on the current sort order—something which should never happen.

    Because m:m merges are such a bad idea, we are not going to show you an example. If you think that you need an m:m merge, then you probably need to work with your data so that you can use a 1:m or m:1 merge. Tips for this are given in Troubleshooting m:m merges below.
    If you are thinking about using merge m:m, it is a near certainty that at least one of the following is true (a short sketch addressing possibilities 1 to 3 follows this list):

    1. Your merge key actually does uniquely identify the observations in one of your data sets, so you can use merge 1:m or merge m:1 or maybe even merge 1:1.

    2. You are failing to take account of one or more additional variables in your data set that, combined with the variables you are trying to use as the merge key, uniquely identify the observations in one or both of your data sets, so you can use merge 1:m or merge m:1 or merge 1:1 with the expanded merge key.

    3. You are really trying to accomplish what joinby, a different command, does: create in the output dataset every possible combination of an observation from the first dataset and an observation from the second dataset, both having the same key. (SQL users take note! I fell for this myself: to Stata m-to-m does not mean m-by-m.)

    4. You actually need to append your datasets rather than merge them.

    5. The data sets you are trying to merge are incorrectly configured or contain data errors that need to be fixed.
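
    Regarding possibilities 1 to 3, the sketch below shows quick ways to check. The variable names are again my guesses from #1, and admissiondate is a purely hypothetical second key variable.
    Code:
    * possibility 1: does hospitalid alone uniquely identify the observations?
    use hospitaldataset, clear
    isid hospitalid                  // exits with an error if it does not
    duplicates report hospitalid     // counts the surplus copies

    * possibility 2: does an expanded key do the job? (admissiondate is hypothetical)
    isid hospitalid admissiondate

    * possibility 3: joinby forms all pairwise combinations within hospitalid
    use proceduredataset, clear
    joinby hospitalid using hospitaldataset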

    Let me add one more possibility. I found that your question isn't fully clear without more detail, or at a minimum it is too difficult to guess at a good answer from what you have shared. Please help us help you. Show example data that display the problem you are trying to resolve - either actual data, or made-up data that has the important features, not a verbal description. Show the code you would anticipate running - the actual commands, not a verbal description. The Statalist FAQ provides advice on effectively posing your questions, posting data, and sharing Stata output.
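
    For example, the dataex command recommended by the FAQ produces a listing that can be copied directly into a post (the variable names below are, once more, only my guesses):
    Code:
    * dataex ships with current versions of Stata; on older versions: ssc install dataex
    use proceduredataset, clear
    dataex hospitalid procedureno in 1/20
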
    Last edited by William Lisowski; 14 Dec 2022, 17:00.



    • #3
      I wasn’t really asking for code - I am aware of what to use - but rather about your experience, in terms of:

      1. Are there disadvantages if I have to merge 3 separate, very large .dta files onto one massive dataset {let’s call this the procedure dataset}?
      A.
      Two .dta files have unique values, so 1:1
      (these I saved myself after cleaning them from the original dataset {let’s call this originaldata}, using code I wrote to identify replicates and drop some values)

      B.
      Another .dta file has replicates, which need further work when merging onto {proceduredataset} to identify which ones need to be dropped.

      2. Or do you think it’s better if I just merge {originaldata} onto {proceduredataset}? This would be one merge; I would then run the replicate code I wrote to obtain the 2 clean unique-value .dta files and move on to identifying the further replicates in B.
      This means 1 merge rather than 3 merges.



      • #4
        The description in #3 is substantially different from the description in post #1, which did not discuss 3 datasets. I addressed the question in post #1 with my discussion in post #2.

        Perhaps someone will understand the situation as described in post #3 confidently enough to advise you, but that someone is not me.
