merge datasets; r(459); variable does not uniquely identify observations in the using data

Jessica Berger

Join Date: Nov 2015

Posts: 9
#1

merge datasets; r(459); variable does not uniquely identify observations in the using data

22 Nov 2015, 01:13

Hello everyone,

I just startet to use stata and have a problem with merging 3 different datasets (with Stata 12).

I used the merge (m:1) command and got an error message r(459) "variable does not uniquely identify observations in the using data".

merge m:1 company_ID using Set 2.dta

merge m:1 company_ID using Set 3.dta

The Datasets look like this:

Set 1 (master):

company_ID job_request_date graduate_id
123 12.11.2014 57878
123 12.10.2014 78878
123 16.11.2014 99899
121 14.11.2014 55744
345 12.10.2014 55879
876 12.09.2014 55879
876 19.09.2014 14787
1000 19.09.2014 14787 (--> not available in Set 2)
. (missing) . (missing) 68994 (--> no job offer --> no company_ID)
... .....

Set 1 contains multiple obersevations for one company.

Set 2:

company_ID number_employees
123 100
121 50
345 600
876 800
... .....

Set 2 contains one observation for each company.

Set 3:

company_ID export_number
123 1
121 1
345 5
876 6
1000 1
... .....

Set 3 contains one observation for each company. Not every Company_ID of Set 3 is included in Set 2.

I want to add the information of Set 2 and Set 3 for each observation in Set 1:

Set merged:

company_ID job_request_date graduate_id number_employees export_number
123 12.11.2014 57878 100 1
123 12.10.2014 78878 100 1
123 16.11.2014 99899 100 1
121 14.11.2014 55744 50 1
345 12.10.2014 55879 600 5
876 12.09.2014 55879 800 6
876 19.09.2014 14787 800 6
1000 19.09.2014 14787 . (missing) 1
... .....

It is possible that the company_ID and the graduate_id are the same but since I define the company_ID as the keyvariable there should not be a problem?

I think that it might be a problem that the different sets contain company_IDs that are not in all ofthe other datasets?! I only know how to merge Data wih exel "vlookup". Add information if you find a matching pair f.e. company_ID = 123 in both files. Does it work the same with the merge command?

I hope you can help me with this problem.
Thank you!

Last edited by Jessica Berger; 22 Nov 2015, 01:28.
Tags: None
Nick Cox

Join Date: Mar 2014

Posts: 35730
#2

22 Nov 2015, 02:59

I suggest using duplicates to check that should be true is indeed so.
Comment

Oded Mcdossi

Join Date: Jun 2014
Posts: 577

22 Nov 2015, 03:09

Jessica, if the example above represents your real data, you can try this code.

Code:

clear*
input company_ID str10 job_request_date graduate_id
123 "12.11.2014" 57878
123 "12.10.2014" 78878
123 "16.11.2014" 99899
121 "14.11.2014" 55744
345 "12.10.2014" 55879
876 "12.09.2014" 55879
876 "19.09.2014" 14787
1000 "19.09.2014" 14787
end
//SAVE THE MASTER FILE
tempfile main
save `main'

clear
input company_ID number_employees
123 100
121 50
345 600
876 800
end
//SAVE THE FIRST USING FILE

tempfile set2
save `set2'

clear
input company_ID export_number
123 1
121 1
345 5
876 6
1000 1
end
//SAVE THE SECOND USING FILE

tempfile set3
save `set3'
//MERGE THE FIRST AND THE SECOND USING WITH THE MASTER FILE
use `main', clear
merge m:1 company_ID using `set2', keep(1 3) nogen
merge m:1 company_ID using `set3', keep(1 3) nogen
list

I've just read Nick's comment and indeed you should check for duplicates in the using files (set2 set3).

Last edited by Oded Mcdossi; 22 Nov 2015, 03:11.

Comment

wbuchanan

Join Date: Mar 2014

Posts: 1362
#4

22 Nov 2015, 03:35

You can also use the

Code:

isid

command to check your assumptions about the primary key in a file.
Comment
Jessica Berger

Join Date: Nov 2015

Posts: 9
#5

30 Nov 2015, 03:42

Thank you all for your advice! I checked for duplicates in my datasets, but there are none. I managed to merge set 1 and set 2 with the command merge m:m company_id using x.dta

I tried to do it the same way with set 3 but I get an error: r(909)

op. sys. refuses to provide memory
Stata's data-storage memory manager has already allocated 1760m bytes and it just attempted to allocate another
16m bytes. The operating system said no. Perhaps you are running another memory-consuming task and the command
will work later when the task completes. Perhaps you are on a multiuser system that is especially busy and the
command will work later when activity quiets down. Perhaps a system administrator has put a limit on what you
can allocate; see help memory. Or perhaps that's all the memory your computer can allocate to Stata.

The merged dataset (set1_set2) has the size of 1,3 GB and set 3 has the size of 40 MB. I already tried to shut down other running programms; to restart my computer (which I bought just a few month ago). I also did run the compress command. Any thoughts how to solve this problem?!

Thank you!
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

30 Nov 2015, 09:48

I would like to reiterate my plea to the powers to be at StataCorp that the many-to-many merge syntax be eliminated.

Jessica stated in #1

Set 2 contains one observation for each company.
Set 3 contains one observation for each company.

But did not quite connect the dots when Stata complained that the following command

Code:

merge m:1 company_ID using Set 2.dta

yielded an error message that would look like

Code:

variable company_ID does not uniquely identify observations in the using data
r(459);

She choose to ignore the good advice she received in this thread and decided to resolve the issue using a many-to-many merge.

Here's an example where company_ID 123 has two observations in Set 2.

Code:

clear
input company_ID number_employees
123 100
123 999
121 50
345 600
876 800
end
save "Set 2.dta", replace

clear
input company_ID str10 job_request_date graduate_id
123 "12.11.2014" 57878
123 "12.10.2014" 78878
123 "16.11.2014" 99899
121 "14.11.2014" 55744
345 "12.10.2014" 55879
876 "12.09.2014" 55879
876 "19.09.2014" 14787
1000 "19.09.2014" 14787
end

merge m:m company_ID using "Set 2.dta"

isid company_ID job_request_date, sort
list if company_ID == 123

Each time you run the code above, you are likely to get a different match. For example:

Code:

     +-----------------------------------------------------------+
     | compan~D   job_requ~e   gradua~d   number~s        _merge |
     |-----------------------------------------------------------|
  2. |      123   12.10.2014      78878        100   matched (3) |
  3. |      123   12.11.2014      57878        999   matched (3) |
  4. |      123   16.11.2014      99899        999   matched (3) |
     +-----------------------------------------------------------+

and

Code:

     +-----------------------------------------------------------+
     | compan~D   job_requ~e   gradua~d   number~s        _merge |
     |-----------------------------------------------------------|
  2. |      123   12.10.2014      78878        100   matched (3) |
  3. |      123   12.11.2014      57878        999   matched (3) |
  4. |      123   16.11.2014      99899        100   matched (3) |
     +-----------------------------------------------------------+

The many-to-many merge issue resurfaces regularly on Statalist. This shows that as long as the option remains available, some users will convince themselves that it is a good option for them, notwithstanding all the words of caution in the official documentation. I fear there are far more users who go many-to-many that we ever hear about on this list.

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30121
#7

30 Nov 2015, 10:06

+1 to Robert Picard's comments in #6.

Jessica should go back to her original data sets. While she believes that there is just one observation per company_ID in data sets 2 and 3, the -merge- command is telling her otherwise.

Code:

use dataset2, clear duplicates list company_ID use dataset3, clear duplicates list company_ID
Comment
Jessica Berger

Join Date: Nov 2015

Posts: 9
#8

01 Dec 2015, 04:12

Thank you for your replies...I did not realize that the m:m command will lead to a false match.

I finally managed to merge the datasets!
The problem was, that the datasets had missing values, that were counted as duplicates, which I didn't realize when I first looked for duplicates. After dropping the missing values in set 2 and set 3 the merge command m:1 worked perfectly.

Thank you for your help!
3 likes
Comment

Announcement