How can I combine datasets when m:m merge seems appropriate but not?

Lucy Zhao

Join Date: Jan 2017

Posts: 9
#1

How can I combine datasets when m:m merge seems appropriate but not?

15 Sep 2022, 16:06

Hi everyone!

I have the following two datasets:

input country year var1
1 2011 0.6
1 2012 0.5
1 2015 0.4
end

input country year var2
1 2012 30
1 2013 28
1 2014 22
end

I wish to combine them to get

input country year var1 var2
1 2011 0.6 .
1 2012 0.5 30
1 2013 . 28
1 2014 . 22
1 2015 0.4 .
end

I tried

merge m:m country year using merge_try2_MPI.dta

It works fine with the above two small datasets.

But I read in Baum (2009)'s "An Introduction to Stata Programming" that "A many-to-many merge has no unique outcome ... To quota the Stata Data Management Refernce Manual:' Because m:m merges are such a bad idea, we are not going to show you an example' " (p. 116-7).

And my actual datasets are large. I wonder how I can combine them in the above fashion properly, if not using m:m merge?

Many thanks to your help!!
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30354
#2

15 Sep 2022, 16:15

At least in your example data, the combination of country and year identifies unique observations in both data sets. Assuming the full data sets are like that it's -merge 1:1 country year using merge_try2_MPI-.

The statements that you have quoted about m:m merges are, frankly, very understated. -merge m:m- produces data salad, associating observations in one data set with observations in the other that have nothing to do with each other. The only reason it appears to work in your example is because country and year uniquely identify observations in both data sets, and in that situation (i.e. the situation where -merge 1:1- works, -merge m:m- ends up doing the same thing.) BUT even if you "know" that your data are uniquely identified by the merge key variables, you should not rely on it. Because we are often mistaken about what we think we "know" about our data. Even professionally curated data sets sometimes contain nasty surprises. And even a data set that starts out pristine can easily get corrupted during the course of later data management. So if you "know" that your data are uniquely identified by the merge key variables, if you are right, -merge 1:1- will work. If you are wrong about that, then -merge 1:1- will do you the enormous favor of telling you that something is wrong, so that you can investigate and fix it. By contrast -merge m:m- will just blunder on and produce a corrupted and useless data set and not even have the courtesy to tell you about it. If you are lucky, further work with the damaged data set will soon produce obviously wrong results and you will eventually figure out what went wrong. If you are unlucky, that won't happen until you are presenting your findings to others and somebody in your audience calls it out. And if you are really unlucky, it won't happen until somebody uses and relies on your results and gets harmed as a result.

-merge m:m- is an atrocity. It should NEVER be used. Do yourself a favor and forget that it even exists.

Added: If you try -merge 1:1 country year- and you get "country and year do not uniquely identify observations..." in one of the data sets, then your full dataset(s) are not like the examples shown. In that case you need to find the offending observations, come to an understanding of how they got there, and figure out what to do about it. As this post is already getting long, I won't go into all the possibilities here. Suffice it to say, if you run into this problem, do post back showing the code and Stata's output.

Last edited by Clyde Schechter; 15 Sep 2022, 16:20.
1 like
Comment
Lucy Zhao

Join Date: Jan 2017

Posts: 9
#3

15 Sep 2022, 16:46

Hello Professor Schechter,

Thanks so much!

Indeed, country and year actually do uniquely identify observations in my data sets (both those I made up, and fortunately those I actually work with).

My apologies for being less careful before I posted here.

But I'm glad I did --- it is good to know that

"-merge m:m- is an atrocity. It should NEVER be used. Do yourself a favor and forget that it even exists."

Thank you so much! Also, over the years I benefited from many of your posts/replies here -- THANK YOU for all that too
Comment
Jared Greathouse

Join Date: Sep 2021

Posts: 2172
#4

15 Sep 2022, 17:13

Think about it like this. Let's say we have students and classes and class times, and we have unique IDs for students. In one dataset we have the class times and CRN or whatever they use at other schools, and in the other we have students and their unique numbers and the name of the course.

I don't know about you, but I can't be in more than one class at a time. So here we use 1:1 merges because you can only have one unique student in one class at one time. But let's say now we wanna merge students to which departments these classes are in.

Students CAN be in multiple departments at once, or at least, they can take classes that are situated within more than one department at a time/semester. Here, we'd do 1:m cuz of what I just mentioned. All of this is premised on a set of variables uniquely ID-ing the thing you're merging/merging on.

In m:m, that rule is thrown out the window, and produces the monstrosity that we see it as. If you ever DO need to do an m:m, the command you're REALLY looking for is joinby. I only even use this when for some reason I'm merging wide datasets, but barring that, you'll never even need this.
1 like
Comment
Lucy Zhao

Join Date: Jan 2017

Posts: 9
#5

18 Sep 2022, 10:38

Hello Jared

Thank you for your kind reply! The example you gave is brilliant and you articulated it so well. Indeed --- I agree joinby is better for this purpose and with wide datasets.

So we have a consensus that -m:m merge- is bad. Good to know that.

Thank you again!
Comment

Announcement

How can I combine datasets when m:m merge seems appropriate but not?

Comment

Comment

Comment

Comment