how to merge data in stata

lisa bäcker

Join Date: Aug 2015

Posts: 62
#1

how to merge data in stata

19 Aug 2015, 01:40

hello eyeryone,

i have a few data sets and need to merge them (it is panel data):

data set 1:
id date age income ....

data set 2 :
id date address dummy_1 dummy_2

data set 3:
id date address dummy_1 dummy_2

data set 1 should contain all observation (id, date).
In data set 2 you get the information about 50% of the observations (id, date) and in data set 3 the other 50 %, so I guess the main data set I need to take is the first one.
So, how can I merge the data, so that I have just one data set and how can I control it?

thanks a lot in advance!!
lisa

Last edited by lisa bäcker; 19 Aug 2015, 01:47.
Tags: None
lisa bäcker

Join Date: Aug 2015

Posts: 62
#2

19 Aug 2015, 02:30

i tried it like this:
use C:\Main_Document.dta
order id date
drop Dummy_9
drop country
sort id date
save C:\Main_Document2.dta, replace
use C:\First_Document.dta
order id date
drop Dummy_8
label var Dummy_7 “Income“
save C:\First_Document2.dta, replace
merge id date C:\Main_Document2.dta save C:\First_Document2.dta, replace
save C:\merged_data, replace

but it doesnt work. stata tells me:

(note: you are using old merge syntax; see [D] merge for new syntax)
master data not sorted

would be great to get a response
lisa
Comment
Evelyn Ersanilli

Join Date: May 2014

Posts: 43
#3

19 Aug 2015, 02:53

The clue is in the error message
'master data not sorted' -> your master data is 'first_document2'. Your syntax does not include a command for sorting this dataset.
'first_document2' is your master data because this is the file that is open when you run the merge command.

I would also recommend following Stata's advice and reading the manual section on merging data. This includes several helpful examples
From what I gather from your description, 'merge 1:1' is the command for you as each case is present only once in each dataset

Code:

use "C:\Main_Document2.dta", clear merge 1:1 id date using "C:\First_document2.dta", keep(master) tab _merge

The option 'keep(master)' tells Stata to keep all cases in your master data set (main_document2), also if there is no match with cases in the data sets you are merging with.

Code:

use "C:\First_Document2.dta", clear merge 1:1 id date using "C:\Main_document2.dta", keep(using)

Should give you the same data set. Now 'main_document' is the using data and 'first_document2' the master data.
Comment
lisa bäcker

Join Date: Aug 2015

Posts: 62
#4

19 Aug 2015, 02:58

great I will try it! thanks!!
but is it possible to merge more data sets at the same time in my master data?
because the observations for my variables I need to merge in the master data are seperated in 10 data sets.
dont know how to handle this.

Last edited by lisa bäcker; 19 Aug 2015, 03:11.
Comment
Evelyn Ersanilli

Join Date: May 2014

Posts: 43
#5

19 Aug 2015, 03:05

I think this should do the trick

Code:

use "C:\Main_Document2.dta", clear merge 1:1 id date using "C:\First_document2.dta" "C:\Second_document2.dta" , keep(master)

But you will need to sort your datafiles on id and date first

Please note that I'm assuming noone entered the panel after the first round.
If this did happen and you want to include those cases in your merged data set you should drop the 'keep(master)' option

Last edited by Evelyn Ersanilli; 19 Aug 2015, 03:09.
Comment
lisa bäcker

Join Date: Aug 2015

Posts: 62
#6

19 Aug 2015, 03:54

thank you very much!

if I use keep(master) i get this result:

master only (1) | 4,326,566

when i do not use it, I get this one:
master only (1) | 4,326,566
matched (3) | 458,450

Do I understand it right: matched (3) means, that it worked perfectly and all data from my second data set is now in the data master set. I do not need to make another test to be sure, right?

Thanks so much! You helped me so much!
lisa
Comment
Evelyn Ersanilli

Join Date: May 2014

Posts: 43
#7

19 Aug 2015, 04:25

It seems I gave you the wrong advise on the keep option.
I looked it up:

keep(master): observation appeared in master only
keep(using): observation appeared in using only
keep(match): observation appeared in both

If you want to preserve the cases that drop out in later waves, you should not add the 'keep(master) option.
If you only want to keep cases that were present in all waves of the panel you can add 'keep(match)', however if you do this you cannot determine if there was attrition bias.

Last edited by Evelyn Ersanilli; 19 Aug 2015, 05:17.
Comment
lisa bäcker

Join Date: Aug 2015

Posts: 62
#8

19 Aug 2015, 05:09

thanks for your help,

but it still doesn't work.
if I try to merge more than one data set I get this response: invalid '"C:\Second_Document2.dta'
any idea how I can solve the problem?
thank you!!
Comment
Evelyn Ersanilli

Join Date: May 2014

Posts: 43
#9

19 Aug 2015, 05:15

Start off by posting your full code + error message.
Comment
lisa bäcker

Join Date: Aug 2015

Posts: 62
#10

19 Aug 2015, 05:25

use C:\Main_Document.dta
order id date
drop Dummy_9
drop country
sort id date
save C:\Main_Document2.dta, replace

use C:\First_Document.dta
order id date
drop Dummy_8
label var Dummy_7 “Income“
save C:\First_Document2.dta, replace

use C:\Second_Document.dta
order id date
drop Dummy_8
label var Dummy_7 “Income“
save C:\Second_Document2.dta, replace

use C:\Third_Document.dta
order id date
drop Dummy_8
label var Dummy_7 “Income“
save C:\Third_Document2.dta, replace

merge 1:1 id date using "C:\First_document2.dta" "C:\Second_Document2.dta" "C:\Third_Document2.dta"
tab _merge

(note that First, Second and Third dataset contains the same Information, but for different individuals, Master document contains all individuals)

I get the error message:

merge 1:1 id date using "C:\First_document2.dta" "C:\Second_Document2.dta" "C:\Third_Document2.dta"
invalid '"C:\Second_Document2.dta'
r(198);

end of do-file

r(198);
Comment
Evelyn Ersanilli

Join Date: May 2014

Posts: 43
#11

19 Aug 2015, 05:46

Please post your code in code format (see Stata list FAQs).

If you leave out '1:1' do you get what you want?

Code:

merge id date using "C:\First_document2.dta" "C:\Second_Document2.dta" "C:\Third_Document2.dta"

Otherwise you can merge the files one at a time.

Code:

use "C:\Main_Document2.dta", clear foreach file in First Second Third { merge 1:1 id using `file'_Document2.dta rename _merge merge`file' }

Each merge will produce a _merge variable. Because you cannot have multiple variables by the same name you need to rename them in a structured way that allows you to connect them back to the files you are merging.
Comment
Robert Picard

Join Date: Mar 2014

Posts: 1536
#12

19 Aug 2015, 09:03

For those that are puzzled, the merge syntax was changed (for the better) with the introduction of Stata 11 and the functionality of specifying multiple using files was removed.

My reading of the original post and the following comment in #10

(note that First, Second and Third dataset contains the same Information, but for different individuals, Master document contains all individuals)

suggests that the best approach here is to use append to combine datasets with the same variables but for different individuals and then merge the combined dataset with the main dataset. Something like

Code:

* --------- create toy datasets that match description in post ----------------- clear input id date age income 1 2001 20 55 1 2002 21 57 2 2001 30 40 2 2002 31 45 end save "Main_Document.dta", replace clear input id date str20 address dummy_1 dummy_2 1 2001 "here" 1 1 1 2002 "here" 1 0 end save "First_Document.dta", replace clear input id date str20 address dummy_1 dummy_2 2 2001 "there" 0 0 2 2002 "there" 1 1 end save "Second_Document.dta", replace * --------- append datasets from different individuals and then merge ---------- clear append using "First_Document.dta" append using "Second_Document.dta" merge 1:1 id date using "Main_Document.dta" list

Note that you can achieve the same using merge with the update option but I prefer the append/merge solution because you are less likely to miss the error if it happens that there are individuals that appear in more than one dataset. It's also harder to track each individual merge results (_merge variable). This would look like

Code:

clear use "Main_Document.dta" merge 1:1 id date using "First_Document.dta", nogen merge 1:1 id date using "Second_Document.dta", update list
Comment
lisa bäcker

Join Date: Aug 2015

Posts: 62
#13

19 Aug 2015, 09:52

thank you so much, Robert!
That's exactly what I was looking for !
Comment
lisa bäcker

Join Date: Aug 2015

Posts: 62
#14

19 Aug 2015, 10:22

my data set is huge, it's getting confusing..

I just got another error message again:

variables id date do not uniquely identify observations in the using data
r(459);
end of do-file
r(459);

I think there are more observations in the douments than in the master data set, but I just need the observations for them in the master data set.
is there a command I can use?

thank you very much in advance!
Comment
Robert Picard

Join Date: Mar 2014

Posts: 1536
#15

19 Aug 2015, 10:46

If you are following my append/merge example, then the using dataset is your "Main_Document.dta". You need to understand why you have multiple observations with the same values for the variables id and date in "Main_Document.dta". If you are satisfied that these make sense (for example, an individual can have more than one job in a year) and that you want to pair these with annual observations from the appended datasets, then all you need if to modify the merge to allow for multiple observations per key variables (in this case id date).

If you want the final dataset to ignore observations from the documents for values of id and date not found in the "Main_Document.dta", then you add the option to keep only the observations that match or that come from the using dataset ("Main_Document.dta") to the merge as well. Here's a revised example

Code:

* --------- create toy datasets that match description in post ----------------- clear input id date age income jobid 1 2001 20 55 1 1 2001 20 55 2 1 2002 21 57 1 2 2001 30 40 1 2 2002 31 45 1 end save "Main_Document.dta", replace clear input id date str20 address dummy_1 dummy_2 1 2001 "here" 1 1 1 2002 "here" 1 0 end save "First_Document.dta", replace clear input id date str20 address dummy_1 dummy_2 2 2001 "there" 0 0 2 2002 "there" 1 1 3 2001 "whothat" 1 1 end save "Second_Document.dta", replace * --------- append datasets from different individuals and then merge ---------- clear append using "First_Document.dta" append using "Second_Document.dta" merge 1:m id date using "Main_Document.dta", keep(match using) sort id date list
Comment

Announcement

how to merge data in stata

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment