Randomness in many-to-many merge

Michael Wiebe

Join Date: Jul 2016

Posts: 19
#1

Randomness in many-to-many merge

09 Nov 2023, 16:04

I'm debugging a many-to-many merge in someone else's code (I swear!), and I noticed that it was giving unreproducible results. Here's an example to illustrate:

I want to merge data2 with data1, to add the variable 'type' to data1. There are four observations in data2, but all have the same id (id=2), so we don't have a unique identifier.

There are two observations in data1 with id=2: one with type="y" and one with type="x". According to the manual:

Matching is performed by combining observations with equal values of varlist; within matching values, the first observation in the master dataset is matched with the first matching observation in the using dataset; the second, with the second; and so on. If there is an unequal number of observations within a group, then the last observation of the shorter group is used repeatedly to match with subsequent observations of the longer group.

So I'd expect this m:m merge to match "a" with "y", and "b", "c", and "d" with "x".
And if I run this merge many times, the count of type=="x" should be 3 every time.

However, when I run this code, I find that the count of type=="x" is either 3 or 1, split roughly in half. So half of the time it's matching "a" with "y", and b/c/d with "x"; and the other half it's the reverse.

What's going on?

Code:

clear input long id str1 type 1 "x" 3 "y" 1 "y" 2 "y" 2 "x" end save data1, replace clear input long id str1 name 2 "a" 2 "b" 2 "c" 2 "d" end save data2, replace clear set seed 1 local N 500 matrix O = J(`N',1,.) forval i = 1/`N' { use data2, clear merge m:m id using data1, keepusing(type) keep(1 3) nogen count if type=="x" matrix O[`i',1] = r(N) } svmat O keep O1 tab O
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30147
#2

09 Nov 2023, 16:28

Well, you only needed to read one sentence beyond where you stopped in the manual:

Thus m:m merges are dependent on the current sort
order—something which should never happen.

That's all there is to it.

Now, you are probably thinking that according to the sort order in which you show your data in your examples you expect the results to be what you you said. The reason that doesn't happen is that -merge- requires that the data sets being -merge-d be already sorted on the link variable(s) (id, in your case) before the -merge-ing beings. As a courtesy, if the data sets are not already sorted on the link variable(s), Stata will first sort them that way. But because your link variable id does not uniquely identify the observations in either data set, the resulting order of the other variable is not determined and Stata's -sort- procedure randomizes indeterminate sorts. That is why you are getting indeterminate results.

You can produce determinate results if you pre-sort the data sets on id. I illustrate that result, here -sort-ing with the -stable- option, which preserves the pre-existing order of the other variables within id rather than randomizing it:

Code:

clear input long id str1 type 1 "x" 3 "y" 1 "y" 2 "y" 2 "x" end sort id, stable save data1, replace clear input long id str1 name 2 "a" 2 "b" 2 "c" 2 "d" end sort id, stable save data2, replace clear set seed 1 local N 500 matrix O = J(`N',1,.) forval i = 1/`N' { use data2, clear merge m:m id using data1, keepusing(type) keep(1 3) nogen sorted count if type=="x" matrix O[`i',1] = r(N) } svmat O keep O1 tab O

for which the end result is:

Code:

. tab O O1 | Freq. Percent Cum. ------------+----------------------------------- 3 | 500 100.00 100.00 ------------+----------------------------------- Total | 500 100.00

All of that said, you also left out the first sentence of the manual paragraph you quoted, which is

m:m specifies a many-to-many merge and is a bad idea.

and a follow in in the subsequent paragraph:

Because m:m merges are such a bad idea, we are not going to show you an example. If you think
that you need an m:m merge, then you probably need to work with your data so that you can use a
1:m or m:1 merge. Tips for this are given in Troubleshooting m:m merges below.

This is excellent advice. If anything, it is too tepid. (And it also overlooks the possibility that what is really needed is -joinby-, a different command.) I have been using Stata on a more or less daily basis since 1994. In all of that time, I have only once come upon a situation where -merge m:m- would have produced a usable result. Even in that situation, there was a better way to get to the same result. In short, in lengthy experience with Stata, I have never come upon a single situation in which I would use -merge m:m-. So, I'm wondering what your situation is that is causing you to look to -merge m:m-, and strongly suspecting that you have not thought it through.

Last edited by Clyde Schechter; 09 Nov 2023, 16:34.
Comment
Michael Wiebe

Join Date: Jul 2016

Posts: 19
#3

09 Nov 2023, 16:43

The reason that doesn't happen is that -merge- requires that the data sets being -merge-d be already sorted on the link variable(s) (id, in your case) before the -merge-ing beings. As a courtesy, if the data sets are not already sorted on the link variable(s), Stata will first sort them that way.

Ah, that explains it, thanks! (Note: `sort` randomly orders the data before sorting, so each run of the code produces a different order within tied values).

As I said, I'm debugging someone else's code. I'm writing a replication, so I need to fix it to make my own code reproducible.

Last edited by Michael Wiebe; 09 Nov 2023, 16:53.
Comment

Announcement

Randomness in many-to-many merge

Comment

Comment