Problem with merging two datasets

Saliha Amrani

Join Date: Sep 2018
Posts: 14

Problem with merging two datasets

27 Sep 2018, 07:35

Hi,
I've been trying to merge two data sets for hours unsuccessfully. Here is a snapshot of the two data sets.

I've tried merge 1:1 id ymonth.

This is data1

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str28 diagnosis str24 municip_code float(ymonth born_year) long id float(benefit_startdate benefit_stopdate male)
"disease in digestive system" "36" 634 1981 1 634 648 0
"disease in digestive system" "36" 638 1981 1   .   . 0
"disease in digestive system" "36" 641 1981 1   .   . 0
"disease in digestive system" "36" 642 1981 1   .   . 0
"disease in digestive system" "36" 645 1981 1   .   . 0
"disease in digestive system" "36" 646 1981 1   .   . 0
"disease in digestive system" "36" 648 1981 1   .   . 0
"disease in digestive system" "36" 679 1981 1   .   . 0
"unknown"                     "22" 614 1969 2   .   . 0
"unknown"                     "22" 620 1969 2   .   . 0
"unknown"                     "22" 661 1969 2 661 687 0
"unknown"                     "22" 666 1969 2   .   . 0
"unknown"                     "22" 687 1969 2   .   . 0
"Mental disorders"            "83" 598 1988 3 598 648 0
"Mental disorders"            "83" 659 1988 3 659 662 0
"Mental disorders"            "83" 672 1988 3 672 689 0
"Mental disorders"            "83" 689 1988 3   .   . 0
"Mental disorders"            "83" 693 1988 3 693 693 0
"MSDs"                        "42" 596 1969 4 596 648 1
"MSDs"                        "42" 644 1969 4   .   . 1
"disease in digestive system" "11" 526 1969 5 526 567 0
"disease in digestive system" "11" 577 1969 5 577 648 0
"unknown"                     "55" 624 1984 6 624 654 0
"unknown"                     "55" 627 1984 6   .   . 0
"unknown"                     "55" 635 1984 6   .   . 0
"unknown"                     "55" 667 1984 6 667 687 0
end
format %tm ymonth
format %ty born_year
format %tm benefit_startdate
format %tm benefit_stopdate

And I'm trying to merge it with this data set.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float(id ymonth startwork stopwork occupation_code industry worktime)
1 579 579   . 3310  75     39.73
1 594 594   . 3310  85     39.73
1 594 594 633 3310  85     32.23
1 638 638 640 3310  85         7
1 638 638   . 3310  85     13.23
1 641 641   . 3310  85      2.23
1 642 642   . 3310  85      7.77
1 645 645 659 3310  85     32.23
1 645 645   . 3310  85     12.61
1 648 648 678 3310  85     32.23
1 648 648   . 3310  85     32.23
1 679 679   . 3310  85     32.23
2 314 314   . 3491  22     39.73
2 314 314 612 3491  58      9.73
2 314 314   . 3492  22      9.73
2 614 614 663 3492  58      9.73
2 614 614   . 3492  58      9.73
2 620 620 659 3492  58      9.73
2 620 620   . 3492  58      9.73
2 666 666   . 1310  82     39.73
2 687 687   . 3492 349      9.73
3 689 689   . 3423 342 74.170006
4 547 547   . 7237 727     39.73
4 644 644   . 7231 723     39.73
4 644 644   . 7232 721     39.73
5   .   .   .    .   .         .
6 570 570   . 9132  91     39.73
6 580 580   . 5139 512     11.46
6 605 605   . 5140 515       8.8
6 611 611   . 5140 515     14.23
6 627 627   . 5140 515     11.74
6 635 635   . 5140 515     11.35
end
format %tm ymonth
format %tm startwork
format %tm stopwork

In the real dataset I have more observations per individual, one each month between jan 2011 - feb 2018, but because of work history there are some "outliers". I have other data sets that are also going to me merged with these two. They also contain monthly information.

I think the problem is because that for some months I have more observations per month per individual. This is mainly because there has been a change in working hours during that same work period. I would like to create new variables, a dummy for whether there has been a change in working hours, one for the "new" number of working hours. But I'm not sure how to do this.

Can someone please guide me

Regards,
Saliha

Last edited by Saliha Amrani; 27 Sep 2018, 08:24.

Tags: None

Mike Lacy

Join Date: Apr 2014

Posts: 2423
#2

27 Sep 2018, 08:52

An attempt to merge these files gives an error message that tells explicitly what the problem is: "variables id ymonth do not uniquely identify observations in the using data." You need a 1:m merge, corresponding to the fact that your second file (the "using" file) contains multiple observations for combinations of id and ymonth. You'd find it helpful to read about 1:1, 1:m, and m:1 merges within -help merge-. Because you don't say in what respect you find your attempts to merge "unsuccessful," it's hard to go beyond the preceding in trying to help you.
Comment

Martin Paul

Join Date: Jun 2017
Posts: 36

25 Feb 2019, 08:39

Hi statalist,

I am facing a similar problem
I have these two datasets I am trying to merge. please find them attached

. dataex qnno year

----------------------- copy starting from the next line -----------------------

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
(qnno year)
 1 2008
 1 2008
 1 2010
 1 2010
 1 2010
 1 2010
 1 2014
 1 2014
 1 2014
 1 2014
 2 2008
 2 2008
 2 2008
 2 2008
 2 2010
 2 2010
 2 2014
 2 2014
 3 2008
 3 2008
 3 2008
 3 2010
 3 2010
 3 2010
 3 2014
 3 2014
 3 2014
 3 2014
 4 2008
 4 2008
 4 2008
 4 2010
 4 2010
 4 2010
 4 2014
 4 2014
 4 2014
 5 2008
 5 2008
 5 2010
 5 2010
 5 2010
 5 2014
 5 2014
 6 2008
 6 2008
 6 2010
 6 2010
 6 2014
 6 2014
 7 2008
 7 2008
 7 2010
 7 2010
 7 2010
 7 2014
 7 2014
 8 2008
 8 2008
 8 2008
 8 2010
 8 2010
 8 2010
 8 2010
 8 2014
 8 2014
 8 2014
 8 2014
 9 2008
 9 2008
 9 2008
 9 2008
 9 2010
 9 2010
 9 2014
 9 2014
 9 2014
10 2008
10 2008
10 2008
10 2008
10 2008
10 2010
10 2010
10 2010
10 2014
10 2014
10 2014
11 2008
11 2008
11 2010
11 2010
11 2010
11 2014
11 2014
11 2014
11 2014
12 2008
12 2008
12 2008
end
label values qnno qnno
label def qnno 1 "shiwangazaw tefera", modify
label def qnno 2 "ato demssie hailu", modify
label def qnno 3 "taddes gurumu", modify
label def qnno 4 "ato gerema kebede", modify
label def qnno 5 "wlo tezeru teshome", modify
label def qnno 6 "ato demssie denegdea", modify
label def qnno 7 "ato derebe demessie", modify
label def qnno 8 "ato dergu abera", modify
label def qnno 9 "ato gassaye teshome", modify
label def qnno 10 "ato tadese sheberu", modify
label def qnno 11 "ato cherenet hayoo", modify
label def qnno 12 "ato tsegaye empyu", modify

------------------ copy up to and including the previous line ------------------

Listed 100 out of 5353 observations
Use the count() option to list more

The other dataset is
Listed 100 out of 5353 observations
Use the count() option to list more

. use "E:\MSc thesis\table 7 crop production inputs and outputs_panel2008_2010_2015.dta"

. dataex qnno year

----------------------- copy starting from the next line -----------------------

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int(qnno year)
1 2008
1 2008
1 2008
1 2008
1 2008
1 2008
1 2008
1 2008
1 2010
1 2010
1 2010
1 2010
1 2010
1 2010
1 2014
1 2014
1 2014
2 2008
2 2008
2 2008
2 2008
2 2008
2 2008
2 2008
2 2008
2 2008
2 2008
2 2008
2 2010
2 2010
2 2010
2 2010
2 2010
2 2010
2 2014
2 2014
2 2014
2 2014
2 2014
2 2014
2 2014
3 2008
3 2008
3 2008
3 2008
3 2008
3 2008
3 2010
3 2010
3 2010
3 2010
3 2010
3 2014
3 2014
3 2014
3 2014
3 2014
3 2014
4 2008
4 2008
4 2008
4 2008
4 2008
4 2008
4 2008
4 2008
4 2008
4 2008
4 2008
4 2010
4 2010
4 2010
4 2010
4 2010
4 2010
4 2010
4 2010
4 2010
4 2010
4 2014
4 2014
4 2014
4 2014
4 2014
4 2014
5 2008
5 2008
5 2008
5 2008
5 2008
5 2008
5 2010
5 2010
5 2010
5 2010
5 2014
5 2014
5 2014
6 2008
6 2008
end
label values qnno qnno
label def qnno 1 "shiwangazaw tefera", modify
label def qnno 2 "ato demssie hailu", modify
label def qnno 3 "taddes gurumu", modify
label def qnno 4 "ato gerema kebede", modify
label def qnno 5 "wlo tezeru teshome", modify
label def qnno 6 "ato demssie denegdea", modify

------------------ copy up to and including the previous line ------------------

Listed 100 out of 12698 observations
Use the count() option to list more

I perform the merge:
merge 1:1 qnno using "E:\MSc thesis\table 7 crop production inputs and outputs_panel2008_2010_2015.dta"
and i get this :

variable qnno does not uniquely identify observations in the master data
r(459);

What could the issue be?

If i perform a many to many merge, my sample size increases. I have been avoiding this because i later have to merge with another data set that has no duplicates

[merge 1:1 qnno using "E:\MSc thesis\table 7 crop production inputs and outputs_panel2008_2010_2015.dta"][/QUOTE]

Announcement

Problem with merging two datasets

Comment

Comment