  • HELP NEEDED: Merging of data

    Dear Stata experts,

    I have been working on a Stata problem for the last two days now and I just cannot figure out what I am doing wrong. Maybe you have the solution to my problem:

    Overall I am trying to merge four data sets into one.
    The first set of data is called "Kogan" and contains "permno" and "fyear" and some other data that interests us.
    The second set of data is called "LPERMNO & Standard and Poors" and contains "permno" and "gvkey" data. I merged this data with the first set in order to get the Kogan data with "gvkey" as an identifier. This worked, and we created a new dataset with "gvkey" and "fyear".
    The third set of data is called "Variablen" and the last one "MA ability". And this is where the problem starts: we cannot manage to merge these two datasets into the already merged dataset. I already dropped duplicates, renamed "_merge" and did some other things, as you can see in the code. However, I always get an error message at step 2.3 (the code before 2.3 runs fine, though). The error message says "Variables GVKEY fyear do not uniquely identify observations in the master data", which is weird, since both data sets that I want to merge definitely contain "GVKEY" and "fyear".
    If you have any idea how to solve this problem, PLEASE, PLEASE tell me. Thank you so much in advance!!

    1) Prepare datasets for Merge -------------------------------------------

    *rename Standard and Poor's identifier in gvkey-Form

    *1.1) Prepare datasets for merge (Kogan)
    use "Kogan Patent data.dta", clear

    duplicates tag permno fyear, gen(dup)
    drop if dup==1
    save "Kogan_v05.dta"
    *0 observations deleted

    *1.2) Prepare datasets for merge (LPERMNO & Standard and Poors)
    use "LPERMNO & Standard and Poors Merging Dataset.dta", clear

    duplicates tag LPERMNO fyear, gen(dup)
    drop if dup==1

    *rename variable LPERMNO to permno
    rename LPERMNO permno

    save "LPERMNO_v05.dta", replace
    *14 observations dropped

    *1.3) Prepare datasets for merge (Variablen)
    use "Variablen.dta", clear

    duplicates tag GVKEY fyear, gen(dup)
    drop if dup==1
    save "Variablen_v05.dta", replace
    *0 observations dropped

    *1.4) Prepare datasets for merge (MA ability)
    use "ma_score_edited.dta", clear

    duplicates tag gvkey fyear, gen(dup)
    drop if dup==1
    save "ma_score_v05.dta", replace
    *400 observations dropped


    *2) Merge Datasets--------------------------------------------------------------

    *2.1) Merge Kogan Dataset with LPERMNO & Standard and Poors

    *2.1.1) Open prepared Kogan dataset
    use "Kogan_v05.dta", clear

    *2.1.2) Merge with LPERMNO & Standard and Poors datasets
    merge 1:1 permno fyear using "LPERMNO_v05.dta"
    rename _merge execmatch01
    save "LPERMNO&Kogan_v05.dta"

    *2.2) Open LPERMNO & Standard and Poors datasets
    use "LPERMNO&Kogan_v05.dta", clear

    duplicates tag GVKEY fyear, gen(dupl)
    drop if dupl==1
    save "LPERMNO&Kogan_v05a.dta", replace
    *4922 observations dropped



    *2.3) Merge with prepared Variablen Dataset

    use "LPERMNO&Kogan_v05a.dta", clear

    merge 1:1 GVKEY fyear using "Variablen_v05.dta"
    rename _merge execmatch02
    rename GVKEY gvkey

    *error message: Variables GVKEY fyear do not uniquely identify observations in the master data*


    *2.4) Merge with MA ability dataset

    merge 1:1 gvkey fyear using "ma_score_v05.dta", keep(match)
    rename _merge execmatch03
    save "final_Merged_Data_v04.dta"






  • #2
    Well, as you have not provided example data, one can only speculate about what might be going wrong. But I would urge you to focus on the exact wording of the error message: "Variables GVKEY fyear do not uniquely identify observations in the master data".

    So the problem is with the LPERMNO & Kogan data set: it contains multiple observations with the same values of GVKEY and fyear. Now, you have attempted to prevent that from happening by preparing that data set with the code
    Code:
    duplicates tag GVKEY fyear, gen(dupl)
    drop if dupl==1
    but that is not how -duplicates tag- works. The variable it creates, dupl, is not a 0/1 variable. The value of dupl will be the number of surplus observations. So if a given combination of GVKEY and fyear is unique, as you desire, dupl will be 0. And if there are two observations for a given GVKEY and fyear, then dupl will be 1. So far so good. But if a combination of GVKEY and fyear appears 3 times, dupl will be 2, so those observations will not be dropped by your code. You need -drop if dupl > 0-. Or, actually, just -drop if dupl- will do the trick.
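
    For instance, a minimal sketch of the corrected preparation step, re-using the file names from your own code:
    Code:
    use "LPERMNO&Kogan_v05.dta", clear
    duplicates tag GVKEY fyear, gen(dupl)
    drop if dupl > 0    // dupl counts surplus copies; it is not a 0/1 indicator
    drop dupl
    save "LPERMNO&Kogan_v05a.dta", replace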

    That said, the approach you are taking to duplicate observations appears rather reckless. If in a data set a given GVKEY and fyear combination appears more than once, are the observations complete duplicates? That is, do they agree on all the other variables? If so, then it is safe to drop the surplus copies with no loss of information. But if they don't agree on the other variables, the approach you are taking is going to select one at random to keep. If you run the same code again, it may well keep different ones, and down the line you will get different results from whatever calculations you do. If you have multiple observations of the same GVKEY and fyear, this is a problem to be understood and solved, not swept under the rug with random dropping of observations.
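
    A minimal sketch of how you might check this, assuming the Variablen file from your code (-duplicates drop- with no varlist removes only observations that are identical on every variable, so nothing substantive is lost):
    Code:
    use "Variablen.dta", clear
    duplicates report GVKEY fyear   // how many GVKEY-fyear combinations repeat?
    duplicates drop                 // drops rows that are identical on ALL variables
    duplicates report GVKEY fyear   // any that remain disagree on some other variable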

    So the first question you need to ask is why there are duplicate observations with the same GVKEY and fyear. How did they get there? Did somebody mangle the data management that created those data sets? Maybe some of them are data errors that need to be identified and fixed (they actually belong to a different GVKEY or a different year). Or are they supposed to be there? Perhaps they are distinguished in some way, such as by a narrower time period (quarter or month or something like that), or they reflect different subdivisions of the entity specified by the GVKEY. There may be a principled reason for preferring one of them for your analyses. If so, the code should be written so as to select the correct one. Or perhaps for your purposes you need to average the values of the other variables or something like that. The point is that randomly dropping them is not an appropriate solution: you should turn the lemons of duplicate observations into the lemonade of appropriate single observations.
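
    If, for example, averaging the other variables turned out to be appropriate, a sketch might look like this (var1 and var2 are hypothetical placeholders, not names from your data sets, and the output file name is likewise just illustrative):
    Code:
    use "Variablen.dta", clear
    * one observation per GVKEY-fyear combination, averaging the numeric variables
    collapse (mean) var1 var2, by(GVKEY fyear)
    save "Variablen_collapsed.dta", replace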

    Yet another possibility is that to proceed with correct analysis you need to keep all of them, calling for an m:1 -merge-. Or perhaps you really need to pair up every observation of a given GVKEY-fyear combination from the Variablen data with every observation in LPERMNO & Kogan that matches it on GVKEY and fyear. In that case, -merge- is the wrong command. That is done with -joinby-.
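
    For illustration, a sketch of the -joinby- alternative, again with the file names from your code; it forms every pairwise combination of master and using observations that share GVKEY and fyear:
    Code:
    use "LPERMNO&Kogan_v05.dta", clear
    * each master observation is paired with each using observation that has the
    * same GVKEY and fyear; observations without a match are dropped by default
    joinby GVKEY fyear using "Variablen_v05.dta"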

    Anyway, as I said, without seeing example data, I can't give you more precise advice. But please don't just discard random observations to suppress error messages. Error messages are there to save you from the embarrassment and hassle that will come when your results are later discovered to be wrong. And, as often as not, they don't reflect syntax errors but are alerting you to the fact that your data are not what you think they are. When that happens, you need to take a thoughtful approach to fixing the data or reconsidering its suitability for what you are trying to do with it.
    Last edited by Clyde Schechter; 18 Dec 2020, 15:41.



    • #3
      Hey Clyde,

      thank you so much for your quick and explaining answer! It really helped a lot!

      You are right, I first have to think about HOW the duplicates arose. As my corrector told me that I can just assume duplicates occur because firms reported twice, for example, I don't have to pay too much attention to the origin of the duplicates. Anyway, I was wondering whether I should then use >0 or >1. The Stata FAQ (https://www.stata.com/support/faqs/d...bservations/) says:

      Case 2: Dropping duplicates based on a subset of variables
      Picking up where case 1 left off, if you want to drop all duplicate observations but keep the first occurrence, type
      . drop if dup>1
      To drop all duplicate observations, including the first occurrence, type
      . drop if dup>0


      Our goal is to delete only the second/third... copy of the data and not all observations that have duplicates; that is why we used ">1". E.g., exactly the same information appears three times: the company with gvkey 001234 in fyear 1990 reports sales of 5,000. Now I want to delete this piece of information twice but keep one observation that contains it. To achieve that I used the following code:

      *1.1) Prepare datasets for merge (Kogan)
      use "Kogan Patent data.dta", clear

      recast double permno
      recast double fyear

      sort permno fyear
      quietly by permno fyear: gen dup = cond(_N==1,0,_n)
      drop if dup>1
      save "Kogan_edited.dta"
      *0 observations deleted

      *1.2) Prepare datasets for merge (LPERMNO & Standard and Poors)
      use "LPERMNO & Standard and Poors Merging Dataset.dta", clear

      sort LPERMNO fyear
      quietly by LPERMNO fyear: gen dup = cond(_N==1,0,_n)
      drop if dup>1

      *rename variable LPERMNO to permno
      rename LPERMNO permno

      save "LPERMNO_edited.dta", replace
      *7 observations deleted

      *1.3) Prepare datasets for merge (Variablen)
      use "Variablen.dta", clear

      sort GVKEY fyear
      quietly by GVKEY fyear: gen dup = cond(_N==1,0,_n)
      drop if dup>1

      save "Variablen_edited.dta", replace
      *0 observations dropped

      *1.4) Prepare datasets for merge (MA ability)
      use "ma_score_edited.dta", clear

      recast double fyear

      sort gvkey fyear
      quietly by gvkey fyear: gen dup = cond(_N==1,0,_n)
      drop if dup>1

      save "ma_score__new_edited.dta", replace
      *210 observations dropped


      Could you please tell me whether I did this correctly, or what I have to adjust to get it right, e.g. changing "dup>1" to "dup>0"?

      Thank you so much in advance!!



      • #4
        -dup > 1- is correct for your purpose.

