Merging

Denise Vella

Join Date: Aug 2022

Posts: 187
#1

Merging

15 Dec 2022, 05:40

When I think I'm nearly there - another problem crops up.

I have two big datasets - here's a sample

hospital dataset

Code:

* Example generated by -dataex-. For more info, type help dataex clear input float(procedureno ssi dehiscence pe) str3 uniqueno float admissiondate 112 0 1 1 "11A" 22645 113 1 1 1 "12A" 22690 113 1 1 1 "13A" 22705 114 1 0 1 "14A" 22737 115 1 0 1 "15A" 22738 end format %td admissiondate

procedure dataset

Code:

* Example generated by -dataex-. For more info, type help dataex clear input float(procedureno indexcode new_dvt new_mi pe date) 112 9 . 1 1 22646 112 10 1 1 1 22646 113 13 1 1 . 22706 113 15 1 0 . 22706 114 16 1 0 1 22738 115 17 . . . 22739 end format %td date

Aim:

I would like to merge the hospital data set onto the procedure dataset.

The hospital dataset consists of duplicate patients with the same procedureno but different uniqueno. In order to decide row to keep, I need to merge the hospital dataset onto the proceduredataset.
Then create a dummy variable whereby Admissiondate - date. The one with the admissiondate closest to the opdate will be kept, the other dropped.

*date is the day the operation took place

However:

A.
There are duplicates in the procedure dataset due to patient having a Right + Left operation - therefore have a different indexcode but the same procedureno (as the operation was done in the same sitting)
Both right and left operation need to remain in the data set.
Only procedureno is the variable that is shared amongst the two datasets

Intially I was going to merge 1:m - however on realising the problem with A of course I couldn't do it.

Then I thought of merge m:m --> however there has been a lot of critism about this.

So then I though of what if I pull the date variable only from the procedure dataset merge this onto the hospital dataset, create the dummy variable Admission - date. Keep the ones I need. Save this as unique dataset.

use "hospitaldataset.dta"
merge procedureno using "proceduredataset.dta", keep (date)

It then tells me i'M USING old syntax - but it still does the job. Any critisims?

Last edited by Denise Vella; 15 Dec 2022, 06:16.
Tags: None
Daniel Feenberg

Join Date: Oct 2014

Posts: 323
#2

15 Dec 2022, 06:10

Would it help to define a new double procedure 2112 that comprises left plus right? Then remove the two 112 procedure observations and add a 2112 record with the same information? I think that would simplify the merge.
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#3

15 Dec 2022, 10:33

In the output of help merge there is a link to the help file for the old syntax, it is

Code:

help merge_10

Careful reading of that documentation suggests that the following applies in your case.

Code:

If none of the three unique options are specified, observations in neither the master nor the using dataset are required to be unique, although they could be. If they are not unique, records that have the same values of the match variables are joined by observation until all the records on one side or the other are matched; after that, the final record on the shorter side is duplicated over and over again to match with the remaining records needing to be matched on the longer side.

In other words, you have done the equivalent of a merge m:m in the current syntax. You will note the similarity to the following explanation copied word-for-word from the documentation of the merge command in the Stata Data Management Reference Manual PDF included in the Stata installation and accessible from Stata's Help menu.

m:m merges

m:m specifies a many-to-many merge and is a bad idea. In an m:m merge, observations are matched within equal values of the key variable(s), with the first observation being matched to the first; the second, to the second; and so on. If the master and using have an unequal number of observations within the group, then the last observation of the shorter group is used repeatedly to match with subsequent observations of the longer group. Thus m:m merges are dependent on the current sort order—something which should never happen.

Because m:m merges are such a bad idea, we are not going to show you an example. If you think that you need an m:m merge, then you probably need to work with your data so that you can use a 1:m or m:1 merge. Tips for this are given in Troubleshooting m:m merges below.

With that said, if you are thinking about using merge m:m, it is a near certainty that at least one of the following is true:

1. Your merge key actually does uniquely identify the observations in one of your data sets, so you can use merge 1:m or merge m:1 or maybe even merge 1:1.

2. You are failing to take account of one or more additional variables in your data set that, combined with the variables you are trying to use as the merge key, uniquely identify the observations in one or both of your data sets, so you can use merge 1:m or merge m:1 or merge 1:1 with the expanded merge key.

3. You are really trying to accomplish what joinby, a different command, does: create in the output dataset every possible combination of an observation from the first dataset and an observation from the second dataset, both having the same key. (SQL users take note! I fell for this myself: to Stata m-to-m does not mean m-by-m.)

4. You actually need to append your datasets rather than merge them.

5. The data sets you are trying to merge are incorrectly configured or contain data errors that need to be fixed.
Comment

Announcement

Comment

Comment