Merging two datasets

Hirindu Kawshala

Join Date: Nov 2022

Posts: 37
#1


03 Dec 2022, 12:31

I want to merge the following two datasets using GVKEY & fyear. I used 1:1 as the merge type, but Stata says that "variables GVKEY fyear do not uniquely identify observations in the master data".
Please advise me on this. Thank you.

Last edited by Hirindu Kawshala; 03 Dec 2022, 13:12.
Clyde Schechter

Join Date: Apr 2014

Posts: 30066
#2

03 Dec 2022, 12:55

Well, I don't see the problem in the screenshots you show. And, as you didn't show your code, I don't know which data set is the master. But, what Stata is telling you, and I've never known it to be wrong about this, is that somewhere in that data set there is some gvkey (possibly more than one) which has two or more observations with the same fyear. So your first step is to find them:

Code:

duplicates tag gvkey fyear, gen(flag)
browse if flag
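
A complementary check, as a minimal sketch: it assumes the master data set is in memory and that the key variables are named gvkey and fyear. Stata is case-sensitive, and the error message quotes GVKEY, so substitute that spelling if it is the actual variable name.

```stata
* Sketch only: run with the master data set in memory.
* Variable names are assumed; use GVKEY if that is how the variable is spelled.

* Count how many observations share each gvkey-fyear combination
duplicates report gvkey fyear

* isid exits with an error if gvkey fyear do not uniquely identify observations;
* capture noisily shows the message without stopping a do-file
capture noisily isid gvkey fyear

* List one example observation from each group of duplicates
duplicates examples gvkey fyear
```

If -isid- completes silently for a data set, gvkey and fyear uniquely identify its observations, so that data set can sit on the "1" side of a merge.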

Once you have seen them, you have to figure out how to get rid of them. Look them over to consider several possibilities:
1. They are pure duplicates: they agree on all variables, not just gvkey and fyear. And assuming there is no reason there should actually be purely duplicate observations in the data, you could just drop those observations and move on. I don't recommend that, but you could. Better is to investigate the data management that created this data set and find out where those duplicate observations crept in. There's a code error there, and where there is one code error, others may lurk as well. So the best solution is to fix any errors you find while investigating this and re-generate a corrected data set.

2. They agree on gvkey and fyear but disagree on some other variables. In this case, you can't even just delete them because you need to figure out which ones are correct (if any). Sometimes all of them are wrong and what you're supposed to have is some combination of them all. Anyway, this is the most complicated situation, and really mandates a thorough review and repair of the data management that led to this data set. You can't really move forward until you have fixed that and created a corrected data set.

3. The least likely situation is that the duplicate observations (even if they are pure duplicates) are actually supposed to be there. There could be some other variable that distinguishes the duplicates, e.g. if the data set is actually quarterly and there is a quarter variable. Or if different divisions of the same firm appear, distinguished by some other variable. In situations like this, the data is correct, but the code is not. This would not be a 1:1 merge. Instead you need to do -merge m:1 gvkey fyear using the_other_dataset-. Or, if both data sets are quarterly, then you need -merge 1:1 gvkey fyear quarter using the_other_dataset-.
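
The first and last possibilities above can be sketched roughly as follows. This is illustrative only: the_other_dataset is the placeholder name used above, the variable names are assumed, and the two parts are alternatives rather than steps to run blindly in sequence.

```stata
* Sketch only; assumes the master data set is in memory.

* --- Pure duplicates (agree on ALL variables) ---
* With no varlist, -duplicates- compares every variable
duplicates report
* Drops all but the first copy of each fully identical observation.
* Use only after investigating where the duplicates came from.
duplicates drop

* --- Duplicates are legitimate in the master ---
* m:1 requires gvkey fyear to be unique in the using data set only
merge m:1 gvkey fyear using the_other_dataset
* Inspect how many observations matched
tabulate _merge
```

The second possibility, where the duplicates disagree on other variables, has no mechanical fix to sketch; it calls for the review of the upstream data management described above.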

If you are unable to work this out, do post back. When doing that, please show the exact -merge- command you are using, and some example data that exhibits the problem at hand. And, instead of screen shots, which are not helpful for testing code, show the example data using the -dataex- command. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
Hirindu Kawshala

Join Date: Nov 2022

Posts: 37
#3

07 Dec 2022, 14:23

Thank you so much for your detailed answer. I'm trying to resolve this based on your explanation.