Merging two datasets

Giovanni Palmioli

Join Date: Apr 2016
Posts: 28

Merging two datasets

06 Sep 2016, 15:04

Hi,

I would like to merge two datasets (see below).

Dataset 1 consists of pairs
Dataset 2 doesn't have pairs but covers a longer time period.

I tried with merge but the new observations in dataset 2 (ie. those before 2010 and after 2015q2) didn't get matched with the ID2 variable. I basically would like each pair to cover the longer time period (as in dataset 2).

Any tips would be appreciated.

Dataset 1

Dataset 2

What the merge should look like

ID1

ID2

TIME

ID1

TIME

ID1

ID2

TIME

2010Q1

1999Q4

2010Q2

2000Q1

2010Q3

2000Q2

2010Q4

…

2016Q2

2015Q2

1999Q4

2010Q1

2000Q1

2010Q2

2000Q2

2010Q3

2010Q4

…

2016Q2

…

1999Q4

2015Q2

2000Q1

2010Q1

2000Q2

2010Q2

2010Q3

2010Q4

2016Q2

…

1999Q4

2000Q1

2015Q2

2000Q2

2010Q1

2010Q2

2010Q3

2016Q2

2010Q4

…

2015Q2

Tags: None

Clyde Schechter

Join Date: Apr 2014

Posts: 30177
#2

06 Sep 2016, 16:09

Well, you have posted your data example in a way that makes it really cumbersome to try to import it into Stata and experiment on it. In the future, please use -dataex- to do that. (-ssc install dataex-, -help dataex-). Moreover, your use of ellipsis makes it difficult to perceive exactly what you have and what you are trying to accomplish.

So I'm going to just make a somewhat wild guess that what you actually want to do is to pair each observation in data set 1 with every observation in data set 2 that has the same value of ID1. (I can't really infer that from what you've shown, but it comes up often on the forum when people are unable to get what they want from -merge-). If that's what you have in mind, then the command you need is:

Code:

use dataset1 joinby ID1 using dataset2

Please do read the FAQ for excellent advice on how to ask clear questions and how most helpfully to show useful supporting information such as example data, code that was tried, results that were obtained, etc.
1 like
Comment
Oded Mcdossi

Join Date: Jun 2014

Posts: 577
#3

06 Sep 2016, 16:13

I think you are looking for joinby. Type help joinby for more details.
Next time please use code tags (# in the advanced editor tools) to post your data and dataex to list sample of your data.
Comment
Bruce Weaver

Join Date: May 2014

Posts: 1142
#4

06 Sep 2016, 16:16

Are you trying to stack the files or match-merge them? I think it would help greatly if you showed the command syntax you attempted previously.

--
Bruce Weaver
Email: [email protected]
Version: Stata/MP 19.5 (Windows)
2 likes
Comment
Giovanni Palmioli

Join Date: Apr 2016

Posts: 28
#5

07 Sep 2016, 04:14

Originally posted by Clyde Schechter View Post

Well, you have posted your data example in a way that makes it really cumbersome to try to import it into Stata and experiment on it. In the future, please use -dataex- to do that. (-ssc install dataex-, -help dataex-). Moreover, your use of ellipsis makes it difficult to perceive exactly what you have and what you are trying to accomplish.

So I'm going to just make a somewhat wild guess that what you actually want to do is to pair each observation in data set 1 with every observation in data set 2 that has the same value of ID1. (I can't really infer that from what you've shown, but it comes up often on the forum when people are unable to get what they want from -merge-). If that's what you have in mind, then the command you need is:

Code:

use dataset1 joinby ID1 using dataset2

Please do read the FAQ for excellent advice on how to ask clear questions and how most helpfully to show useful supporting information such as example data, code that was tried, results that were obtained, etc.

Sorry about that, I thought the data example I posted was easy to interpret.

I did try

Code:

joinby ID1 using "dataset2"

but it did quite not achieve what I am after. There are lots of duplicates and still can't figure out how the variables in dataset 2 (i.e. Z1, Z2, Z3) got matched. I basically want to add those 3 variables for each pair and make sure that the data for each pair has the time frame of dataset 2 (ie 1999q4-2016q2). Indeed variables Y1 Y2 Y3 will be empty before 2010q4 and after 2015q2.
I hope it is clearer now.
Comment
Giovanni Palmioli

Join Date: Apr 2016

Posts: 28
#6

07 Sep 2016, 04:23

Originally posted by Bruce Weaver View Post

Are you trying to stack the files or match-merge them? I think it would help greatly if you showed the command syntax you attempted previously.

Kind of both...please see the data example.

I tried

Code:

joinby ID1 using "dataset2" merge m:m ID1 using "dataset2"
Comment
Oded Mcdossi

Join Date: Jun 2014

Posts: 577
#7

07 Sep 2016, 05:08

You ignore the advices in #2 and #3 so it is impossible to say much. Please follow the forum FAQ to get better suggestions and to help us to solve your problem.

I think the following will do the trick:

Code:

g str1 ID2n="" levelsof ID2, local (levels) foreach i of local levels { append using data2, gen(ID2_`i') replace ID2n="`i'" if ID2_`i'==1 drop ID2_`i' } replace ID2=ID2n if ID2=="" drop ID2n sort ID1 ID2 TIME

Last edited by Oded Mcdossi; 07 Sep 2016, 05:52.
Comment
Giovanni Palmioli

Join Date: Apr 2016

Posts: 28
#8

07 Sep 2016, 07:29

Originally posted by Oded Mcdossi View Post

You ignore the advices in #2 and #3 so it is impossible to say much. Please follow the forum FAQ to get better suggestions and to help us to solve your problem.

I think the following will do the trick:

Code:

g str1 ID2n="" levelsof ID2, local (levels) foreach i of local levels { append using data2, gen(ID2_`i') replace ID2n="`i'" if ID2_`i'==1 drop ID2_`i' } replace ID2=ID2n if ID2=="" drop ID2n sort ID1 ID2 TIME

Thanks for your input. The problem with this is that the observations (those with the common time frame between the 2 datasets) get duplicated as follows:

Code:

time Y1 Y2 Y2 Z1 Z2 Z3

2006Q1 3 45 2 . . .

2006Q1 . . . 11 24 56

2006Q2 9 25 18 . . .

2006Q2 . . . 20 31 58

..

..
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30177
#9

07 Sep 2016, 09:42

Giovanni,

Several of our best responders have tried to answer your question, and it is clear that none of us is able to understand what you are trying to get as a result here. In my opinion, that is because, as I said earlier, the example you posted is very, very unclear. The use of ... really makes a problem because, evidently, nobody can figure out what is, and what is not, supposed to be in the ... places. Please post, using -dataex-, a small example of your dataset1, a small example of dataset2 that should match up with what you show of dataset1, and then a handworked complete example showing exactly what the result should look like for those examples.
2 likes
Comment
Jesse Wursten

Join Date: Jan 2016

Posts: 915
#10

07 Sep 2016, 09:56

Perhaps try

Code:

merge m:1 ID1 TIME

As far as I can read the example, that would give you start of the desired output.

Then you'd have to do some expanding to recreate the pairs outside the matched area, but without sample data I can't really help there.

Edit: Added "TIME" as per #11 and #12
Edit2: Added part on expanding, which I think will be necessary.

Last edited by Jesse Wursten; 07 Sep 2016, 10:33.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30177
#11

07 Sep 2016, 10:12

Jesse,

That clearly won't work because ID1 does not uniquely identify the observations in either data set.
Comment
Jesse Wursten

Join Date: Jan 2016

Posts: 915
#12

07 Sep 2016, 10:14

Originally posted by Clyde Schechter View Post

Jesse,

That clearly won't work because ID1 does not uniquely identify the observations in either data set.

Sorry, I forgot to add the TIME variable. (I have edited the post)
Comment
Oded Mcdossi

Join Date: Jun 2014

Posts: 577
#13

07 Sep 2016, 10:17

While I'm reluctant to help any more, something inside me keep think about your results. You can get rid of the duplicates by using something like collapse (max) Y* Z* , by(ID1 ID2 TIME)
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

#14

07 Sep 2016, 10:21

Here's my guess at what is wanted:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str2 ID1 str1 ID2 str6 TIME float(Y1 Y2 Y3)
"A1" "B" "2010Q1"  1  2  3
"A1" "B" "2010Q2"  2  4  6
"A1" "B" "2010Q3"  3  6  9
"A1" "B" "2010Q4"  4  8 12
"A1" "B" "2015Q2"  5 10 15
"A1" "C" "2010Q1"  6 12 18
"A1" "C" "2010Q2"  7 14 21
"A1" "C" "2010Q3"  8 16 24
"A1" "C" "2010Q4"  9 18 27
"A1" "C" "2015Q2" 10 20 30
"A2" "B" "2010Q1" 11 22 33
"A2" "B" "2010Q2" 12 24 36
"A2" "B" "2010Q3" 13 26 39
"A2" "B" "2010Q4" 14 28 42
"A2" "B" "2015Q2" 15 30 45
"A2" "C" "2010Q1" 16 32 48
"A2" "C" "2010Q2" 17 34 51
"A2" "C" "2010Q3" 18 36 54
"A2" "C" "2010Q4" 19 38 57
"A2" "C" "2015Q2" 20 40 60
end
save "statalist_data1.dta", replace

* save a list of unique pairs of identifiers in dataset1
bysort ID1 ID2: keep if _n == 1
keep ID1 ID2
save "statalist_data1b.dta", replace
list, sepby(ID1)

* Example generated by -dataex-. To install: ssc install dataex
clear
input str2 ID1 str6 TIME float(Z1 Z2 Z3)
"A1" "1999Q4" -1  -2  -3
"A1" "2000Q1" -2  -4  -6
"A1" "2000Q2" -3  -6  -9
"A1" "2016Q2" -4  -8 -12
"A2" "1999Q4" -5 -10 -15
"A2" "2000Q1" -6 -12 -18
"A2" "2000Q2" -7 -14 -21
"A2" "2016Q2" -8 -16 -24
end
save "statalist_data2.dta", replace

* create pairwise combinations using unique ID1 ID2 codes
joinby ID1 using "statalist_data1b.dta"
list, sepby(ID1)

* match observations by ids and time code
merge 1:1 ID1 ID2 TIME using "statalist_data1.dta"

sort ID1 ID2 TIME
order ID1 ID2 TIME Y1 Y2 Y3 Z1 Z2 Z3
list, sepby(ID1)

Comment

Giovanni Palmioli

Join Date: Apr 2016
Posts: 28

#15

07 Sep 2016, 12:44

Sorry everyone for the lack of clarity, I will be more careful next time.

I am unable to use dataex at the moment but please see below an extract of each dataset and what I would like to achieve further down. I hope this is OK.

id1 id2 - int
id2n - string
year quarter - float
y1 y2 y3 z1 z2 z3 - double

Code:

 DATASET 1

DATASET 2

id1
id2n
id2
year
quarter
y1
y2
y3

id1
year
quarter
z1
z2
z3

1005
AAA
7005
2013
1
.
.
.

1005
2012
1
96
49
25

1005
AAA
7005
2013
2
189
1082
19212

1005
2012
2
92
9
158

1005
AAA
7005
2013
3
.
.
.

1005
2012
3
87
3
94

1005
AAA
7005
2013
4
697
1010
63602

1005
2012
4
47
129
201

1005
AAA
7005
2014
1
.
.
.

1005
2013
1
.
.
.

1005
AAA
7005
2014
2
219
200
50025

1005
2013
2
7
146
221

1005
AAA
7005
2014
3
84
56
100

1005
2013
3
.
.
.

1005
AAA
7005
2014
4
58
97
5623

1005
2013
4
23
69
92

1005
AAA
7005
2015
1
.
.
.

1005
2014
1
1
57
38

1005
AAA
7005
2015
2
984
31
11638

1005
2014
2
.
.
.

1005
BBB
8005
2013
1
114
328
203

1005
2014
3
143
35
97

1005
BBB
8005
2013
2
721
163
323

1005
2014
4
136
69
108

1005
BBB
8005
2013
3
447
723
12

1005
2015
1
81
40
203

1005
BBB
8005
2013
4
549
239
220

1005
2015
2
67
56
39

1005
BBB
8005
2014
1
193
29
388

1005
2015
3
187
161
98

1005
BBB
8005
2014
2
308
75
189

1005
2015
4
1
34
204

1005
BBB
8005
2014
3
82
588
414

1005
2016
1
51
107
143

1005
BBB
8005
2014
4
176
61
147

1005
2016
2
23
45
33

1005
BBB
8005
2015
1
170
809
581








1005
BBB
8005
2015
2
301
207
409










recast double y1
recast double y2
recast double y3
recast float quarter
recast float year

recast double z1
recast double z2
recast double z3
recast float year
recast float quarter
  id1
id2n
id2
year
quarter
y1
y2
y3
z1
z2
z3

1005
AAA
7005
2012
1
.
.
.
96
49
25

1005
AAA
7005
2012
2
.
.
.
92
9
158

1005
AAA
7005
2012
3
.
.
.
87
3
94

1005
AAA
7005
2012
4
.
.
.
47
129
201

1005
AAA
7005
2013
1
.
.
.
.
.


1005
AAA
7005
2013
2
189
1082
19212
7
146
221

1005
AAA
7005
2013
3
.
.
.
.
.


1005
AAA
7005
2013
4
697
1010
63602
23
69
92

1005
AAA
7005
2014
1
.
.
.
1
57
38

1005
AAA
7005
2014
2
219
200
50025
.
.


1005
AAA
7005
2014
3
84
56
100
143
35
97

1005
AAA
7005
2014
4
58
97
5623
136
69
108

1005
AAA
7005
2015
1
.
.
.
81
40
203

1005
AAA
7005
2015
2
984
31
11638
67
56
39

1005
AAA
7005
2015
3
.
.
.
187
161
98

1005
AAA
7005
2015
4
.
.
.
1
34
204

1005
AAA
7005
2016
1
.
.
.
51
107
143

1005
AAA
7005
2016
2
.
.
.
23
45
33

1005
BBB
8005
2012
1
.
.
.
96
49
25

1005
BBB
8005
2012
2
.
.
.
92
9
158

1005
BBB
8005
2012
3
.
.
.
87
3
94

1005
BBB
8005
2012
4
.
.
.
47
129
201

1005
BBB
8005
2013
1
114
328
203
.
.


1005
BBB
8005
2013
2
721
163
323
7
146
221

1005
BBB
8005
2013
3
447
723
12
.
.


1005
BBB
8005
2013
4
549
239
220
23
69
92

1005
BBB
8005
2014
1
193
29
388
1
57
38

1005
BBB
8005
2014
2
308
75
189
.
.


1005
BBB
8005
2014
3
82
588
414
143
35
97

1005
BBB
8005
2014
4
176
61
147
136
69
108

1005
BBB
8005
2015
1
170
809
581
81
40
203

1005
BBB
8005
2015
2
301
207
409
67
56
39

1005
BBB
8005
2015
3
.
.
.
187
161
98

1005
BBB
8005
2015
4
.
.
.
1
34
204

1005
BBB
8005
2016
1
.
.
.
51
107
143

1005
BBB
8005
2016
2
.
.
.
23
45
33

time	Y1	Y2	Y2	Z1	Z2	Z3
2006Q1	3	45	2	.	.	.
2006Q1	.	.	.	11	24	56
2006Q2	9	25	18	.	.	.
2006Q2	.	.	.	20	31	58
..
..

Announcement

Merging two datasets

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment