Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Randomness in many-to-many merge

    I'm debugging a many-to-many merge in someone else's code (I swear!), and I noticed that it was giving unreproducible results. Here's an example to illustrate:

    I want to merge data2 with data1, to add the variable 'type' to data1. There are four observations in data2, but all have the same id (id=2), so we don't have a unique identifier.

    There are two observations in data1 with id=2: one with type="y" and one with type="x". According to the manual:
    Matching is performed by combining observations with equal values of varlist; within matching values, the first observation in the master dataset is matched with the first matching observation in the using dataset; the second, with the second; and so on. If there is an unequal number of observations within a group, then the last observation of the shorter group is used repeatedly to match with subsequent observations of the longer group.
    So I'd expect this m:m merge to match "a" with "y", and "b", "c", and "d" with "x".
    And if I run this merge many times, the count of type=="x" should be 3 every time.

    However, when I run this code, I find that the count of type=="x" is either 3 or 1, split roughly in half. So half of the time it's matching "a" with "y", and b/c/d with "x"; and the other half it's the reverse.

    What's going on?

    Code:
    clear
    input long id str1 type
    1 "x"
    3 "y"
    1 "y"
    2 "y"
    2 "x"
    end
    save data1, replace
    
    clear
    input long id str1 name
    2 "a"
    2 "b"    
    2 "c"   
    2 "d"   
    end
    save data2, replace
    
    clear
    set seed 1
    local N 500
    matrix O = J(`N',1,.)
    
    forval i = 1/`N' {
    use data2, clear
    merge m:m id using data1, keepusing(type) keep(1 3) nogen
    count if type=="x"
    matrix O[`i',1] = r(N)
    }
    svmat O
    keep O1
    tab O

  • #2
    Well, you only needed to read one sentence beyond where you stopped in the manual:
    Thus m:m merges are dependent on the current sort
    order—something which should never happen.
    That's all there is to it.

    Now, you are probably thinking that according to the sort order in which you show your data in your examples you expect the results to be what you you said. The reason that doesn't happen is that -merge- requires that the data sets being -merge-d be already sorted on the link variable(s) (id, in your case) before the -merge-ing beings. As a courtesy, if the data sets are not already sorted on the link variable(s), Stata will first sort them that way. But because your link variable id does not uniquely identify the observations in either data set, the resulting order of the other variable is not determined and Stata's -sort- procedure randomizes indeterminate sorts. That is why you are getting indeterminate results.

    You can produce determinate results if you pre-sort the data sets on id. I illustrate that result, here -sort-ing with the -stable- option, which preserves the pre-existing order of the other variables within id rather than randomizing it:
    Code:
    clear
    input long id str1 type
    1 "x"
    3 "y"
    1 "y"
    2 "y"
    2 "x"
    end
    sort id, stable
    save data1, replace
    
    clear
    input long id str1 name
    2 "a"
    2 "b"    
    2 "c"   
    2 "d"   
    end
    sort id, stable
    save data2, replace
    
    clear
    set seed 1
    local N 500
    matrix O = J(`N',1,.)
    
    forval i = 1/`N' {
        use data2, clear
        merge m:m id using data1, keepusing(type) keep(1 3) nogen sorted
        count if type=="x"
        matrix O[`i',1] = r(N)
    }
    svmat O
    keep O1
    tab O
    for which the end result is:
    Code:
    . tab O
    
             O1 |      Freq.     Percent        Cum.
    ------------+-----------------------------------
              3 |        500      100.00      100.00
    ------------+-----------------------------------
          Total |        500      100.00

    All of that said, you also left out the first sentence of the manual paragraph you quoted, which is
    m:m specifies a many-to-many merge and is a bad idea.
    and a follow in in the subsequent paragraph:
    Because m:m merges are such a bad idea, we are not going to show you an example. If you think
    that you need an m:m merge, then you probably need to work with your data so that you can use a
    1:m or m:1 merge. Tips for this are given in Troubleshooting m:m merges below.
    This is excellent advice. If anything, it is too tepid. (And it also overlooks the possibility that what is really needed is -joinby-, a different command.) I have been using Stata on a more or less daily basis since 1994. In all of that time, I have only once come upon a situation where -merge m:m- would have produced a usable result. Even in that situation, there was a better way to get to the same result. In short, in lengthy experience with Stata, I have never come upon a single situation in which I would use -merge m:m-. So, I'm wondering what your situation is that is causing you to look to -merge m:m-, and strongly suspecting that you have not thought it through.
    Last edited by Clyde Schechter; 09 Nov 2023, 16:34.

    Comment


    • #3
      The reason that doesn't happen is that -merge- requires that the data sets being -merge-d be already sorted on the link variable(s) (id, in your case) before the -merge-ing beings. As a courtesy, if the data sets are not already sorted on the link variable(s), Stata will first sort them that way.
      Ah, that explains it, thanks! (Note: `sort` randomly orders the data before sorting, so each run of the code produces a different order within tied values).

      As I said, I'm debugging someone else's code. I'm writing a replication, so I need to fix it to make my own code reproducible.
      Last edited by Michael Wiebe; 09 Nov 2023, 16:53.

      Comment

      Working...
      X