Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merge m:m vs joinby

    I have been told and seen a whole lot of posts that merging m:m is just a horrible idea. But nobody has seemed to be able to explain clearly why that is.

    My master data set has one observation per patent and the using data set is a census dataset. I am trying to find as many possible matches as possible using the sound of the inventors name, initials, city, etc. I can't really collapse my first data set into individuals since I don't know if two people with the same name are actually the same person, so I don't really want to wrangle my data to do a merge 1:m, although I guess I theoretically could. I have been told that joinby is the best solution, which is great. I ran them both (joinby and merge m:m) and the merge returned more matches than joinby.

    So ultimately my question is what is the difference and why is merge m:m a problem?
    Last edited by Daniel Sabey; 06 May 2020, 09:56.

  • #2
    Welcome to Statalist.

    The following is copied word-for-word from the documentation of the merge command in the Stata Data Management Reference Manual PDF included in the Stata installation and accessible from Stata's Help menu.

    m:m merges

    m:m specifies a many-to-many merge and is a bad idea. In an m:m merge, observations are matched within equal values of the key variable(s), with the first observation being matched to the first; the second, to the second; and so on. If the master and using have an unequal number of observations within the group, then the last observation of the shorter group is used repeatedly to match with subsequent observations of the longer group. Thus m:m merges are dependent on the current sort order—something which should never happen.

    Because m:m merges are such a bad idea, we are not going to show you an example. If you think that you need an m:m merge, then you probably need to work with your data so that you can use a 1:m or m:1 merge. Tips for this are given in Troubleshooting m:m merges below.
    If you are thinking about using merge m:m, it is a near certainty that at least one of the following is true:

    1. Your merge key actually does uniquely identify the observations in one of your data sets, so you can use merge 1:m or merge m:1 or maybe even merge 1:1.

    2. You are failing to take account of one or more additional variables in your data set that, combined with the variables you are trying to use as the merge key, uniquely identify the observations in one or both of your data sets, so you can use merge 1:m or merge m:1 or merge 1:1 with the expanded merge key.

    3. You are really trying to accomplish what joinby, a different command, does: create in the output dataset every possible combination of an observation from the first dataset and an observation from the second dataset, both having the same key. (SQL users take note! I fell for this myself: to Stata m-to-m does not mean m-by-m.)

    4. The data sets you are trying to merge are incorrectly configured or contain data errors that need to be fixed.

    Comment


    • #3
      An example may illustrate this. If you're familiar with database terminology, joinby is an inner join. merge m:m is just... wrong on so many levels.

      Dataset 1:
      var1 var2
      1 a
      1 b
      2 a
      2 b
      Dataset 2:
      var1 var3
      1 c
      1 d
      2 a
      2 a

      joinby on var1 will generate the following:
      var1 var2 var3
      1 a c
      1 a d
      1 b c
      1 b d
      2 a a
      2 a a
      2 b a
      2 b a

      merge m:m var1 generates the following:
      var 1 var2 var3
      1 a c
      1 b d
      2 a a
      2 b a

      Comment

      Working...
      X