merging many data sets

Ngawang Dendup

Join Date: Nov 2015

Posts: 13
#1

merging many data sets

17 Aug 2016, 03:23

Hi there,
I am trying to merge 150 data sets and I realized my way of merging data is not efficient. I merge as follow:

use data1.dta, clear
sort id
save data1.dta, replace

use data2.dta, clear
sort id
save data2.dta, replace

merge 1:1 _n using data1.dta
drop _merge
sort id
save merge1.dta, replace

use data3.dta, clear
sort id
save data3.dta, replace

merge 1:1 _n uisng merg1.dta
drop _merge.dta
sort id
save merge2.dta, replace
.
.
.
.so on ....I realized its not efficient way to merge data when you have large data set.

Then I tried to use to loop function, and I am fairly new in terms of using loop function and I tried to merge with first 10 data set and I some how failed to merge. Following is my syntax:

local data data1-data10
foreach x of local data{
use `x', clear
sort id
save `x'.dta, replace

merge 1:1 _n using `x'
drop _merge
save merge_all.dta

end

Above function does not merge at all. Your help will highly appreciated. Many thanks.
Tags: None
Jesse Wursten

Join Date: Jan 2016

Posts: 915
#2

17 Aug 2016, 03:28

Originally posted by Ngawang Dendup View Post

Hi there,
I am trying to merge 150 data sets and I realized my way of merging data is not efficient. I merge as follow:

use data1.dta, clear
sort id
save data1.dta, replace

use data2.dta, clear
sort id
save data2.dta, replace

merge 1:1 _n using data1.dta
drop _merge
sort id
save merge1.dta, replace

use data3.dta, clear
sort id
save data3.dta, replace

merge 1:1 _n uisng merg1.dta
drop _merge.dta
sort id
save merge2.dta, replace
.
.
.
.so on ....I realized its not efficient way to merge data when you have large data set.

Then I tried to use to loop function, and I am fairly new in terms of using loop function and I tried to merge with first 10 data set and I some how failed to merge. Following is my syntax:

local data data1-data10
foreach x of local data{
use `x', clear
sort id
save `x'.dta, replace

merge 1:1 _n using `x'
drop _merge
save merge_all.dta

end

Above function does not merge at all. Your help will highly appreciated. Many thanks.

I think it's best to first open all the data and sort it, then just merge all of them.

E.g.

Code:

local data data1-data10 foreach x of local data{ use `x', clear sort id save `x'.dta, replace } local data data2-data10 use data1, clear foreach x of local data{ merge 1:1 id using `x', nogen } save merge_all.dta
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35726
#3

17 Aug 2016, 03:57

I don't think this code will work yet. If you put the text data1-data10 into a local, foreach won't expand it into a list. It's not a varlist.

Code:

forval k = 1/10 { use data`k', clear sort id save data`k'.dta, replace } use data1, clear forval k = 2/10 { merge 1:1 id using data`k', nogen } save merge_all.dta
Comment
Ngawang Dendup

Join Date: Nov 2015

Posts: 13
#4

17 Aug 2016, 04:32

Thank you so much Jesse. I tried with your code but it is merging only data1.dta and data10.dta. Your help is really appreciated and would be grateful if you could also look into it once again. Thank you.
Comment
Ngawang Dendup

Join Date: Nov 2015

Posts: 13
#5

17 Aug 2016, 04:38

Thank you Nick Cox, and really really appreciated for the help.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30119
#6

17 Aug 2016, 08:41

There is something strange about this whole thread. After sorting out the syntax errors, Ngawang is sorting each data set on a variable id, and then serially -merge-ing them 1:1 _n.

Evidently, if one is going to -merge 1:1 _n-, then the data sets need to be in their proper sort order before you -merge- them. But if -sort id- can accomplish that, it is simpler to just -merge 1:1 id-, which does not require that any of the data sets be pre-sorted. This will work just fine if id uniquely identifies observations in each data set.

If id does not uniquely identify observations in each data set, then you won't be able to -merge 1:1 id-. But, if that is the case, your current procedure, which sorts on id, will produce a random shuffle of the observations within groups having the same id, a different random shuffle in each data set. Those will then be -merge-d together to produce a data set in which observations from different data sets are randomly matched with each other. Perhaps that is what you want, but it would be unusual.

So I think you need to be clear about what you are doing. If id uniquely identifies observations, all you need is:

Code:

use data1, clear forvalues i = 2/10 { merge 1:1 id using data`i', nogenerate }

If id does not uniquely identify observations, with -merge 1:1 _n- you are in the process of creating a randomly shuffled melange of data sets. Caution!
Comment
Jesse Wursten

Join Date: Jan 2016

Posts: 915
#7

17 Aug 2016, 08:45

Originally posted by Clyde Schechter View Post

...

I recall seeing the message, "using data not sorted" quite frequently after merging. But I think that may have been using a very old Stata version (where I now discovered you can simply specify "sort" in the merge syntax to fix this...)?

Last edited by Jesse Wursten; 17 Aug 2016, 08:49.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35726
#8

17 Aug 2016, 08:49

Clyde: Good point. People who have used Stata for a while will recall that merge did require a prior sort. I have not checked when that requirement was dropped -- most likely when the syntax 1:1 etc was introduced. However, I can't see that the sorting will do any harm, which is the converse of Clyde's point that it is no longer necessary, so I can't see that there is a risk of melange. If the 1:1 merge is not possible, then the command will fail.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30119
#9

17 Aug 2016, 09:09

If id does not uniquely identify observations, then -merge 1:1 id- will fail, but -merge 1:1 _n- will not. And with the sort order within the data sets having been randomized, -merge 1:1 _n- will produce the melange I referred to.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35726
#10

17 Aug 2016, 11:02

OK. That was the original code in #1. I started reading the thread at #2. It wasn't included in anything I was recommending.
Comment
Ngawang Dendup

Join Date: Nov 2015

Posts: 13
#11

17 Aug 2016, 16:55

Thank you Nick, Clyde and Jesse for very useful discussion. I have solved this problem now and thanks to all of you. Highly appreciated for the response, discussion and help.
Comment
Jesse Wursten

Join Date: Jan 2016

Posts: 915
#12

18 Aug 2016, 02:12

Originally posted by Nick Cox View Post

OK. That was the original code in #1. I started reading the thread at #2. It wasn't included in anything I was recommending.

This is my bad - I swapped the _n for id for the exact reason Clyde mentioned, but forgot to mention this in the end.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35726
#13

18 Aug 2016, 02:44

Jesse: Not your fault to improve code! I didn't read #1, so that was my fault.
Comment

Announcement

merging many data sets

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment