merging datasets

Hannah Tsay

Join Date: Jan 2021

Posts: 3
#1

merging datasets

20 Jan 2021, 18:03

I am trying to merge two datasets, which have the date variable in common. However, when I try to merge the datasets (one-to-one on key variables), this message pops up:

"variable date does not uniquely identify observations in the master data"

Not sure why this is happening.. any ideas?
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30058
#2

20 Jan 2021, 18:36

The message means exactly what it says. When you want to -merge 1:1 date- it must be the case that every value of the date variable occurs only once in both the data already in active memory (the master data) and in the other data set to be merged in (the using data). Stata is telling you that your data set already in active memory contains some value(s) of date that occur more than once. I have never known Stata to be wrong about this.

To see the duplicate values that are causing this problem run:

Code:

duplicates tag date, gen(flag) browse if flag

and you will see all of them.

Then you will have to figure out what is going on. Why are those duplicate values of date present? Are they supposed to be there? If so, then you probably need to do a -merge m:1 date-. If they're not supposed to be there, then you need to figure out why they are there and what to do about it.

Note: it is possible that there are also duplicate values of the date variable in the using data set as well. Stata reports the first error it encounters and stops. So don't be astonished if you find out that there is a similar problem in the using data after you fix the master data. But be warned: the fix in that case is not to do -merge m:m-. -merge m:m- is, in my view, the only serious design defect in Stata. It is a command that produces data salad. I have urged StataCorp on a few occasions to eliminate it, but, for reasons known only to them, they don't. It is particularly insidious because it issues no warnings or error messages and then proceeds to combine the two data sets in a way that is almost never what is wanted and almost never makes any sense. (I have been using Stata since 1994 and in all that time I have only once seen a situation where what -merge m:m- does makes any sense--and even in that case there was another way to get the result.) If you find yourself tempted to write -merge m:m-, stop and post back with examples from both of your data sets (use the -dataex- command to show example data) and also explaining what you want the combined result to look like.

If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#3

20 Jan 2021, 18:41

It is happening because your merge command was something like

Code:

merge 1:1 date using ...

and in the dataset in memory (the "master data") there are dates that have more than one observation.

Did you expect that? If so, then perhaps

Code:

merge m:1 date using ...

will solve your problem.

Otherwise, perhaps you need to have more than one key variable, something like

Code:

merge 1:1 id date using ...

where each combination of id and date identifies a single observation.

Or, perhaps you have errors in your data. Did you import your data from Excel and not closely look at the results? A very common problem with the import excel command is that, when it tries to figure out where the data begin and end, it includes extra rows at the end, and in Stata those extra rows will have missing values - so the same (missing) date will occur in all of those observations.
Comment
Hannah Tsay

Join Date: Jan 2021

Posts: 3
#4

20 Jan 2021, 23:49

Yes, I expected duplicate values of date to be present. This is because for every entry in one particular dataset, the date is the same for all entries. (The date that the dataset was published is the date variable.)

I tried "merge m:1 date using ..." and the error "variable date does not uniquely identify observations in the using data" appears.

Is there something wrong with the way I'm formatting my data? I feel like my method is a bit redundant with the duplicate dates, but I'm not sure how else I could format the date variable and merge the two datasets.
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#5

21 Jan 2021, 07:36

Is there something wrong with the way I'm formatting my data?

We can't answer that without knowing more about your data. Please follow the advice given by Clyde in post #2 and use the dataex command to post example data from both of your datasets.

Think about it: if you have two observations on some date in your "master" data and three observations on the same date in your "using" data, how would Stata know which observations to match to each other in the merged dataset? Or as I asked in post #3, is there some other variable in both datasets that should also be used as a key variable for matching?

Actually, the more I think about it the less I understand your data. You write

This is because for every entry in one particular dataset, the date is the same for all entries.

Do you mean that all the date values in your "master" dataset are the same date, and all the date values in the "using" dataset are also the same date, although perhaps the date in the "using" dataset is different than the one in the "master" dataset?

If this is the case, you do not at all want merge, because nothing will match. Perhaps you want append or perhaps you want joinby or perhaps you want something entirely different. Without achieving an understanding of your data, there is nothing on which we can base a recommendation. And there is little interest in writing an essay about all the different things you could try.
1 like
Comment
Hannah Tsay

Join Date: Jan 2021

Posts: 3
#6

21 Jan 2021, 16:54

When I tried
duplicates tag date, gen(flag) browse if flag
as suggested in #2, this is what I got for my first dataset (note that parts of the data to the left have been cut off due to size limits):

This is essentially survey data looking at various individuals' attitudes toward COVID-19. So each dataset has a different date (with the date being the same for all entries in one dataset). This seems to be what you were implying for the append/joinby methods, so I'll try those out.
For some reason, when I try dataex, it returns "input statement exceeds line size limit. Try specifying fewer variables." any idea why this is happening?
Attached Files

Last edited by Hannah Tsay; 21 Jan 2021, 16:57.
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#7

21 Jan 2021, 19:08

This is essentially survey data looking at various individuals' attitudes toward COVID-19. So each dataset has a different date (with the date being the same for all entries in one dataset).

And is it the case that different people responded to the surveys on the different dates? If so, you need the append command. The joinby command will do nothing useful for you. Nor will the merge command.
Comment

Announcement

Comment

Comment

Comment

Comment

Comment

Comment