variable date does not uniquely identify observations in the master data stata

Stephen Long

Join Date: May 2017

Posts: 4
#16

08 May 2017, 14:41

Ah! That's it. Limiting the duplicate report to the two identifying variables showed me the problem, a simple data entry error in the using data set. Thanks! You guys are life savers.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30122
#17

08 May 2017, 15:47

variables ccode1 year do not uniquely identify observations in the using data
r(459); [Emphasis added[

So it is your using dataset, that has multiple observations with the same value(s) of ccode1 and year. So what you want is -merge 1:m-

Now, it may be that with your original -merge 1:1- you got a message about ccode1 year not identifying unique observations in the master data, prompting you to change it to -merge m:1-. Do NOT, by analogy go to -merge m:m-. What -merge m:m- does, though it is legal syntax, is a very bizarre way of putting the two data sets together that is almost never what is wanted and usually just results in a jumble of unrelated observations being thrown together.

If ccode1 and year do not, jointly, identify unique observations in either data set, then basically you cannot -merge- them together with ccode1 year as the merge key. There are several possibilities here:

1. One or both of the data sets really should have its observations uniquely identified by ccode1 and year. This means that your data sets are incorrect and you need to re-trace how they were built and fix the problem. It may be as simple as running -duplicates drop- to get rid of purely identical observations, but often this situation is a symptom of a more profound problem with the data that requires more extensive work to repair.

2. There is no real reason why ccode1 and year should jointly identify unique observations. Then there are three subscases here.
2a. There are some additional variables (or perhaps just one variable) that, along with ccode1 and year should jointly identify unique observations. In that case, including those additional variables in the merge key will resolve the problem. (For example, maybe ccode1 year and month might do the trick, or ccode1 year and industry, or something like that.)
2b. There are no such additional variables in your data set, but in principle they exist and could be retrieved. Expanding your data sets to include such information then leads back to 2a.
2c. There is no reason at all why ccode1 and year should jointly identify unique observations even with additional variables thrown into the merge key. In that case the very concept of merging the data sets is ill-defined. If I have three observations in the first data set with a given value of ccode1 and year, and four in the second data set with those same values, how are we to know which of the three gets matched with which of the four (and which goes unmatched). It's simply undefined. What is possible under these circumstances is to match each of the three with each of the four, resulting in 12 pairs. This, however, is not done with -merge-; the command to do this is -joinby-.
Added: Crossed with #16, which makes clear that this fell under case 1 and was readily resolved. Perhaps others reading the thread, however, will derive some benefit from the general discussion laid out here.

Last edited by Clyde Schechter; 08 May 2017, 15:49.
Comment
Myriam Bck

Join Date: Aug 2017

Posts: 1
#18

23 Aug 2017, 02:04

I have a similar problem. I'm writing my Master thesis and got the problem by merging several data sets of panel data. There is no problem by merging the first six, but when it comes to the 7th and last, the message "variables ID_t wave do not uniquely identify observations in the master data" occurs.
The data contains information about school children and their transition after school. The first six datasets contain the Cohort profile, the asked parents, the children asked in school, the children asked after leaving school, the school dataset, which contains information about the institution (like end of schooling time). The last data set which I like to merge contains information about Vocational training. I need this data set because of one variable: the starting date of vocational training, so I can calculate the time between school and vocational training. My supervisor said it may be due to the fact, that one person is in two institutions in the same wave and stata has problems matching them. If I would use the m:m command some IDs may be matched to the wrong time points which would be a huge problem.
For a better understanding: I'm examing the transition from school to vocational training in germany for non-A leveler and the impact of social capital on this transition.
Maybe you can help me in this situation.
Thanks and best wishes,
Myriam
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30122
#19

23 Aug 2017, 08:49

I don't think anyone will be able to help you unless you post examples of the data you are working with. Read FAQ #12 for instructions about using the -dataex- command to post example data, and follow them. The examples you need to show are, 1) an example of the result of merging the first 6 data sets, and 2) an example of the 7th data set. Be sure that the particular examples you show include some observations that should match up with each other in the merge.

I'm very glad that you are aware that -merge m:m- is inappropriate.
Comment
NimaDarbari

Join Date: May 2014

Posts: 24
#20

28 Aug 2018, 13:07

I am trying to merge two data sets. My key variable is the date. The date format in both data sets is dd/mm/yy. The data set in memory contains only the date variable with a big number of sorted observations. The "using" data set contains only a handful of observations so I can see all the dates are unique there. I keep receiving the ...not uniquely identify... message. Does someone has a clue what may be wrong?
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30122
#21

28 Aug 2018, 13:43

Without seeing the actual -merge- command you tried and the exact error message Stata gave you it is difficult to be specific.

As best I can imagine from your brief description, you believe that the date variable uniquely identifies observations with the using data. GIven that the master data in memory has a large number of observations, my guess is that there are multiple observations for some or all dates. So the command to merge them should be:

Code:

merge m:1 name_of_date_variable using name_of_using_data_set_here

If you tried -merge 1:1- then the command will, appropriately, fail if any value of the date variable occurs more than once in the data in memory.

Another possible source of the problem is that although you think the using data set has no duplicate values of the date variable, you may be wrong. So it is worth running the following before trying to -merge-:

Code:

isid using name_of_using_data_set_here

If the date variable observations are truly unique in the using data set, then Stata will say nothing and move on to your next command. But if you are mistaken about that, Stata will give you an error message telling you how many duplicate observations there are. Then your next task is to go about finding them:

Code:

use name_of_using_data_set_here, clear duplicates tag name_of_date_variable, gen(flag) browse if flag

and figuring out what to do about them.

If what you tried is what I showed above, then please post back showing:

1. Examples of the data from each of the two data sets; use the -dataex- command to do this.

2. The exact -merge- command you tried.

3. The exact response you got from Stata, including any error messages.

For #2 and #3, please be sure to wrap them in code delimiters to enhance their readability here. If you are not familiar with code delimiters or with -dataex-, please read Forum FAQ #12 for detailed instructions.
1 like
Comment
NimaDarbari

Join Date: May 2014

Posts: 24
#22

29 Aug 2018, 02:29

Thank you for your reply Clyde. I found the problem. Though the "using" data set was containing only a handful of visible observations, the number of observations was much larger and Stata consider all those blank date observations as duplicates
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30122
#23

29 Aug 2018, 09:10

I'm glad you found the solution to your problem. I have encountered this situation fairly often myself. My experience is that it typically arises when using data that was initially imported from spreadsheets. For reasons I do not understand, sometimes -import excel- fails to recognize the end of the data in a spreadsheet and brings in a number of blank observations at the end. These blank observations can cause havoc during coding as variables that are supposed to be unique identifiers fail to be so. It has happened to me often enough that I have made it a habit to check for this, and eliminate the blank observations if there are any, after importing data from spreadsheets.
1 like
Comment

Niclas Kammler

Join Date: May 2019
Posts: 3

#24

29 May 2019, 10:20

Hello, I have a similar Problem.
Table_one contains Data on a daily basis and table_two contains monthly averages stated on the 15th of each month. Only one row for one month is existing in table_two. I want the two tables to be merged so that Every Day in table_one is filled with the average Data for each month of table_two.

Table_one

Datum	r_stock1	r_stock2	Monats
1/1/1980	0,001	0,122	1980m1
1/2/1980	0,002	-0,001	1980m1

...

Table_two

Datum	infl_1	infl_2	Monats
1/15/1980	0,0001	0,2	1980m1
2/15/1980	0,0002	0,00009	1980m2

...

Table_expected

Datum	r_stock1	r_stock2	infl_1	infl_2	Monats
1/1/1980	0,001	0,122	0,0001	0,2	1980m1
1/2/1980	0,002	-0,001	0,0001	0,2	1980m1
...	...	...	...	...	...
2/1/1980	0,053	-1,000002	0,0002	0,00009	1980m2
2/2/1980	-0,0092	-0,0001	0,0002	0,00009	1980m2

What i tried so far was:
-> merge 1:m Monats using 'file'
with the outcome "variable Monats does not uniquely identify observations in the master data"
But i think that is not the only problem here...

Can someone please help me? Thanks in advance

Last edited by Niclas Kammler; 29 May 2019, 10:26.

Comment

Clyde Schechter

Join Date: Apr 2014

Posts: 30122
#25

29 May 2019, 12:57

Well, when Stata says that a variable does not uniquely identify observations in a data set, I have never known it to be wrong.

So the first step is to identify the observations where the same value of Monats recurs:

Code:

duplicates tag Monats, gen(flag) browse if flag

After that, you will have to figure out why those observations are there and what to do about it. It may be that such observations are completely duplicate observations, i.e. they agree on all of the variables. In that case -duplicates drop- will resolve the problem with no loss of information and you can go on to your -merge- command.

If the observations with the same values of Monats disagree on some other variable(s), then you have to figure out which one to keep, or, perhaps, how to combine them all into a single variable, such as by taking means or medians or something like that.

The other thing to bear in mind in either of these situations is this. You have the expectation that the second data set should contain only a unique observation for any value of Monats. But the data do not conform to that expectation. If your expectation was wrong, then fine--you now know better and understand why. But if you had good reasons to expect unique identification by Monats, it means that the data set was incorrectly created. When a data set contains one error like this, it may well contain others: the entire data management process leading up to it is suspect. So you should confer with whoever provided you with this data set and discuss what has gone wrong. Then you should ignore this version and work with the corrected one.

In the future, when showing data examples, please use the -dataex- command to do so. If you are running version 15.1 or a fully updated version 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
1 like
Comment
Sarah Edgington

Join Date: Apr 2014

Posts: 284
#26

29 May 2019, 13:04

Am I understanding correctly that you started with table_one in memory (the master file) and table_two is your using file? If so I think you have mis-specified your merge command. My understanding is that you expect duplicates of the variable Monats in table_one but not table_two. So your command would be

Code:

merge m:1 Monats using "table_two"

This tells Stata that you expect duplicates on Monats in your master file and for it to uniquely identify your using file. Your previous command told Stata you expected Monats to be unique in your master. If you started with table_one you already know that Monats should not be unique.
1 like
Comment
Doug Hassanali

Join Date: Sep 2018

Posts: 14
#27

07 Oct 2020, 07:15

hello everyone, I am faced with a similar problem where I cannot merge

code:
***IMPORTING LARGE EXCEL FILE
set excelxlsxlargefile on

import excel "globalterrorismdb_0919dist.xlsx", sheet("Data") firstrow

drop if country!=147
keep if iyear>=2009 & iyear<=2013
keep if gname=="Boko Haram"
drop if provstate=="Unknown"

g new_event = 1

bysort provstate iyear: egen n_event = total(new_event)

rename provstate state

***convert str39 to numeric
encode state, g(Sstate)

*recoding the original variable values to match master
recode sstate (1=140) (2=70) (3=90) (4=180) (5=60) (6=350) (7=80) (8=320) (9=40) (10=110) (11=100) (12=30) (13=190) (14=360) (15=150) (16=130) (17=240) (18=160) (19=10) (20=170) (21=50) (22=20)

lab define newlabel 140 "abuja" 70 "adamawa" 90 "bauchi" 180 "benue" 60 "borno" 350 "delta" 80 " gombe" 320 "imo" 40 "jigawa" 110 "kaduna" 100 "kano" 30 "katsina" 190 "kogi" 360 "lagos" 150 "nasarawa" 130 "niger" 240 "ondo" 160 "plateau" 10 "sokoto" 170 "taraba" 50 "yobe" 20 "zamafara" , modify
lab val sstate newlabel

save "bh1.dta"
**clear all and load master NDHS
merge m:1 sstate using "bh1.dta" variable sstate does not uniquely identify observations in the using data r(459); At the moment, I am at wits end what the problem could be, any advice given is greatly appreciated
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30122
#28

07 Oct 2020, 09:24

Well, the error message you are getting says very clearly that the problem is in the using data, i.e. in bh1.dta. And you have said nothing at all about that data set. So it is not possible to help you based on the information you gave.

Please post back, showing, not the code that creates the data sets, but example data from both data sets. Make sure that when you run your -merge- command on those example data sets you get the same error message.

When posting back with the example data, be sure to use the -dataex- command to do so. If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
Comment
Doug Hassanali

Join Date: Sep 2018

Posts: 14
#29

19 Oct 2020, 05:33

Originally posted by Clyde Schechter View Post

Well, the error message you are getting says very clearly that the problem is in the using data, i.e. in bh1.dta. And you have said nothing at all about that data set. So it is not possible to help you based on the information you gave.

Please post back, showing, not the code that creates the data sets, but example data from both data sets. Make sure that when you run your -merge- command on those example data sets you get the same error message.

When posting back with the example data, be sure to use the -dataex- command to do so. If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

Thanks Clyde, I was able to solve it
Comment
Suhail Mir

Join Date: Oct 2021

Posts: 5
#30

22 Dec 2021, 10:51

Hello,

I was trying to merge national sample survey, India (NSSO) 55th round data sets using merge 1:1 command. Before merging data, I had to create House hold ID (HHID) and personal ID (PID) for 14 files

When I tried to merge (1:1, m:1 or 1:m) on these data sets, I got the error message "variable date does not uniquely identify observations in the master data stata".

Can someone help me on this.
Thanks

. merge 1:1 HHID using "C:\Users\suhail.mir\Desktop\PhD daily work\Data Work PhD 2021\55th Round STATA files\Block4-sch10-and-10d
> ot1-Records-combined.dta"
variable HHID does not uniquely identify observations in the master data
r(459);

. merge m:1 HHID using "C:\Users\suhail.mir\Desktop\PhD daily work\Data Work PhD 2021\55th Round STATA files\Block4-sch10-and-10d
> ot1-Records-combined.dta"
variable HHID does not uniquely identify observations in the using data
r(459);

. merge 1:m HHID using "C:\Users\suhail.mir\Desktop\PhD daily work\Data Work PhD 2021\55th Round STATA files\Block4-sch10-and-10d
> ot1-Records-combined.dta"
variable HHID does not uniquely identify observations in the master data
r(459);
Comment

Announcement

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment