merging two data sets

Vishal Sharma

Join Date: Sep 2018

Posts: 60
#1

13 May 2019, 14:44

Hello,
My project involves merging two very large data sets. I used an id number to merge the two data sets and got 95% to match using the code: merge m:m id using data1.dta

I was told to use an additional id code to try to get the other 5% to match. I'm not sure how to do this in Stata. I just want to match/merge the observations that were not matched by the first merge.

Any feedback on whether this can be done?

Thanks,
Vishal
Mike Lacy

Join Date: Apr 2014

Posts: 2425
#2

13 May 2019, 15:15

1) I can't see how an additional id code in a merge would lead to more matches. I'd expect the opposite, as more key (id) fields should make the match more restrictive.
2) In line with an opinion commonly voiced here, I am unable to imagine a situation in which an m:m merge does something desirable.
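
To make point 1 concrete, here is a minimal sketch on made-up data. The variable names id, id2, x, and y and the values are invented for illustration and are not the poster's actual variables. It assumes each id occurs once per file, so a 1:1 merge is valid.

```stata
* Toy master file (all names and values are hypothetical)
clear
input id id2 x
1 10 100
2 20 200
3 30 300
end
tempfile first
save `first'

* Toy using file: id 2 carries a different id2 here
clear
input id id2 y
1 10 5
2 99 6
3 30 7
end
tempfile second
save `second'

* Key = id only: all three ids match
use `first', clear
merge 1:1 id using `second'
tabulate _merge

* Key = id and id2: only ids 1 and 3 match; id 2 is left
* unmatched once from the master and once from the using file
use `first', clear
merge 1:1 id id2 using `second'
tabulate _merge
```

Adding id2 to the key can only remove matches relative to matching on id alone, which is why the extra id would have to be used in some other way to recover the remaining 5%.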

On these two counts, I'd encourage you to post a small data sample of your two files using -dataex-, and to explain your situation so that we can help you be sure that an m:m merge really does what you want. If the latter is true, I'd find it really interesting--seriously, no sarcasm intended.
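
As a rough sketch of what that could look like, assuming the key variable is called id and the file is data1.dta as in the original post (repeat for the other file, whose name was not given):

```stata
* Assumes data1.dta is in the working directory and has a variable id
use data1, clear

* How many ids occur more than once in this file?
duplicates report id

* Reports an error (without stopping the do-file) if id is not unique
capture noisily isid id

* Small, copy-pasteable sample for the forum.
* dataex ships with Stata 15.1 and fully updated 14.2;
* on older versions, first run: ssc install dataex
dataex id in 1/10
```

If id turns out to be unique in at least one of the two files, that already suggests a 1:m or m:1 merge rather than m:m.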
William Lisowski

Join Date: Dec 2014

Posts: 10150
#3

13 May 2019, 16:07

Let me add to Mike's comment the following, which is copied word-for-word from the documentation of the merge command in the Stata Data Management Reference Manual PDF included in the Stata installation and accessible from Stata's Help menu.

m:m merges

m:m specifies a many-to-many merge and is a bad idea. In an m:m merge, observations are matched within equal values of the key variable(s), with the first observation being matched to the first; the second, to the second; and so on. If the master and using have an unequal number of observations within the group, then the last observation of the shorter group is used repeatedly to match with subsequent observations of the longer group. Thus m:m merges are dependent on the current sort order—something which should never happen.

Because m:m merges are such a bad idea, we are not going to show you an example. If you think that you need an m:m merge, then you probably need to work with your data so that you can use a 1:m or m:1 merge. Tips for this are given in Troubleshooting m:m merges below.

So the opinion commonly voiced here reflects the advice of the authors of Stata on the use of merge m:m - or rather, on not using it.
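
The manual declines to give an example, but the behavior it describes is easy to reproduce on toy data. This is only a sketch with invented variable names (m_obs, u_obs) and values, not anything taken from the poster's files.

```stata
* Using file: two observations for id 1 (hypothetical data)
clear
input id str1 u_obs
1 "a"
1 "b"
end
tempfile two
save `two'

* Master file: three observations for the same id
clear
input id str1 m_obs
1 "x"
1 "y"
1 "z"
end

* Observations are paired by position within id: x-a, y-b, and
* then the last using observation is reused, giving z-b
merge m:m id using `two'
list, sepby(id)

* Three observations result, not the 2 x 3 = 6 pairwise combinations
assert _N == 3
```

Which master observation is paired with which using observation depends entirely on the order of the rows within id, so re-sorting either file beforehand changes the result.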

Do let us better understand your data so we can advise a better approach. If there is one observation per individual in at least one dataset, then merge 1:m or merge m:1 is likely to work. If you really need to match two observations of an individual in one dataset with three observations of the same individual from the other dataset, giving six observations in the (what now will become very large) output dataset, then joinby may be what you need rather than merge.
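
Both cases can be sketched on made-up data. All file contents and variable names below (age, visit, test) are hypothetical and chosen only to illustrate the two situations.

```stata
* Case 1: one observation per id in one file, so m:1 works
clear
input id age
1 30
2 40
end
tempfile persons
save `persons'

clear
input id visit
1 1
1 2
2 1
end
merge m:1 id using `persons'
* Every visit row finds its person row
assert _merge == 3

* Case 2: several observations per id in both files, all pairs wanted
clear
input id test
1 1
1 2
1 3
end
tempfile tests
save `tests'

clear
input id visit
1 1
1 2
end
joinby id using `tests'
* 2 visits x 3 tests = 6 observations for id 1
assert _N == 6
list, sepby(id)
```

By default joinby keeps only ids present in both files; its unmatched() option controls whether unmatched observations from either side are retained.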

Last edited by William Lisowski; 13 May 2019, 16:22.