Can I m:m merge here? Is it always bad?

Jaycob Applegate

Join Date: Jan 2023
Posts: 40

Can I m:m merge here? Is it always bad?

26 Jan 2023, 13:17

Hi all,

I have read throughout several posts that we should always avoid m:m merging in Stata, however, I am in a situation where I think it offers the appropriate solution. Data set A (shown below) is a file that has parents linked to children, meaning that children can be in the data multiple times if they have multiple parents in the data (at times both biological and maybe step). Here, house hold is hhid and opn is people in the household that are not the respondent, in this case children

Data set B (shown below) is a file that has extended information (essentially a variable that would be be k13educ) but only contains HHID and OPN. When I merge these two using "m:m hhid opn", I seem to get the desired result, however, I am concerned that this is not appropriate.

Data set A

Code:

. list hhidpn hhid pn opn k11educ k12educ kidid karel if hhid ==  "011378"

        +------------------------------------------------------------------------+
        |   hhidpn     hhid    pn   opn   k11educ   k12educ        kidid   karel |
        |------------------------------------------------------------------------|
   829. | 11378010   011378   010   101        .p        .p   0113780101   1.kid |
   830. | 11378010   011378   010   102        .p        .p   0113780102   1.kid |
   831. | 11378020   011378   020   101        .p        .p   0113780101   1.kid |
   832. | 11378020   011378   020   102        .p        .p   0113780102   1.kid |
   833. | 11378020   011378   020   151         .         .   0113780151   1.kid |
        +------------------------------------------------------------------------+

Data set B

Code:

list hhid opn pe029 if hhid == "011378"

       +----------------------+
       |   hhid   opn   pe029 |
       |----------------------|
  340. | 011378   101       . |
  341. | 011378   101       . |
  342. | 011378   102       . |
  343. | 011378   102       . |
  344. | 011378   504       . |
       |----------------------|
  345. | 011378   504       . |
  346. | 011378   505       . |
  347. | 011378   505       . |
       +----------------------+

Data set A

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input long hhidpn str6 hhid str3(pn opn) byte(k11educ k12educ) str10 kidid byte karel
11378010 "011378" "010" "101" .p .p "0113780101" 1
11378010 "011378" "010" "102" .p .p "0113780102" 1
11378020 "011378" "020" "101" .p .p "0113780101" 1
11378020 "011378" "020" "102" .p .p "0113780102" 1
11378020 "011378" "020" "151"  .  . "0113780151" 1
end
label values k11educ EDUC
label values k12educ EDUC
label values karel RELR
label def RELR 1 "1.kid", modify

Data set B

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str6 hhid str3 opn byte pe029
"011378" "101" .
"011378" "101" .
"011378" "102" .
"011378" "102" .
"011378" "504" .
"011378" "504" .
"011378" "505" .
"011378" "505" .
end

Tags: None

George Ford

Join Date: Aug 2014

Posts: 3142
#2

26 Jan 2023, 13:41

or use joinby
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30084
#3

26 Jan 2023, 14:04

It may be that in this particular instance -merge m:m- will, fortuitously, give you correct results. But that will be purely happenstance. It will appear to work correctly with the example data because for each combination of hhid and pn, either the exact same number of observations appear in both data sets (and, for it to actually work correctly for this example, they must be in the "right" order concerning any other variables you might not have shown), or the combination appears in only one of the data sets. It is unlikely that this fortunate circumstances prevails throughout a data set of any substantial size, and you cannot rely on it. I suspect if you were to carefully inspect the full results of your -merge m:m- you would find that it is, in fact, pairing up wrong observations.

Your data set B appears just to be a malformed dataset that contains only two distinct observations, each with 4 surplus exact copies. If the full data set also has only one distinct observation for each hhid pn pair, the rest being exact copies, then the simplest approach that will safely merge this data to dataset A will be to re-create data set B so that it doesn't have these surplus copies, and then -merge m:1- with that re-created data set B. In the course of doing that, you should also review carefully all of the data management that created data set B. Usually, such a data set has been incorrectly created in the first place, and you may find, beside the surplus exact copies, additional errors that need to be fixed to assure that its contents are correct. If the data management was correct other than that surplus copies were not deleted along the way, then you can quickly fix data set B with -duplicates drop- and then -merge m:1-, again, will do the job.

Now it may be that a fuller sample from data set B would reveal that the surplus observations on hhid pn pairs are not always exact copies, but disagree on the value of pe029. In that case, you have a serious problem, because there is no way for Stata, or for that matter, you or I, to know which value of pe029 belongs with which record in dataset A. -merge m:m- will not care about that problem and just pair them up in a meaningless way. It will also give you no warning of what has happened. You are likely to discover this problem only later, when further work with the data set produces absurd results due to the mismatched data. If you are lucky, this will be before you have presented your findings to others for them to rely on. Sometimes in life it is better to be lucky than smart, but I don't think this is one of those situations.

So, without a fuller understanding of your data sets, I cannot confidently advise you on the proper way to combine them. But I can assure you that one of the following is almost certainly true:
One of these data sets would, if correctly created, contain only a single observation for any combination of hhid and pn. This might be as simple as just removing the surplus observations if they all agree exactly, or might require identifying which of the surplus variables to retain, or some way of combining the surplus variables to create a single observation that has correct values for the variables.

There are other variables in one or both data sets, perhaps not shown in your examples, that, in combination with hhid and pn, enable you to carry out a 1:1, 1:m or m:1 match. Such a variable, in data sets about people in households, might be a time variable, for instance.

The kind of additional variables referred to in 2 exist in principle, but your data sets do not contain them. In that case, you need to augment your data sets with the additional information needed, re-creating them, perhaps from different or better sources, or just correcting the data management that led to the loss of those additional variables if the raw data files have them.

What you really want to do is not even a -merge- in the first place. If you want to pair up every observation for a given hhid pn combination in dataset A with every observation for the same hhid pn combination in dataset B, that is not possible with the -merge- command (and, no, it is not what -merge m:m- does). For that task you use the -joinby- command.

There is, in fact, a very unusual situation where -merge m:m- is the appropriate command. But I have been using Stata since 1994 on a more or less daily basis, and in that entire time I have encountered it only once. Moreover, even in that situation, there was a better way to get the desired result. You will not go wrong if you just forget about that obscure possibility and subscribe to the maxim that -merge m:m- should never be used.

Added: Crossed with #2.

Last edited by Clyde Schechter; 26 Jan 2023, 14:10.
1 like
Comment
Jaycob Applegate

Join Date: Jan 2023

Posts: 40
#4

26 Jan 2023, 16:03

Hey Clyde,

Thank you for such a thorough write up. I wasn't too familiar with the joinby command, but it does seem to accomplish what I intend to do. I will move forward assuming that I should always avoid m:m
Comment
Jared Greathouse

Join Date: Sep 2021

Posts: 2170
#5

26 Jan 2023, 17:28

Is it always bad?

in a word, 'yep'
Comment

Announcement

Can I m:m merge here? Is it always bad?

Comment

Comment

Comment

Comment