Merging two datasets

Juan Gonzalex

Join Date: May 2022
Posts: 59

Merging two datasets

25 May 2022, 13:46

Hi,

I'm currently writing my thesis and I'm new to Stata. I want to combine a dataset from Compustat and a dataset from Factset, both in yearly terms. I have the following datasets:

Compustat

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input long gvkey str12 isin str10 datadate int year float lvchange
1166 "NL0000334118" "31/12/2002" 2002  .006297663
1166 "NL0000334118" "31/12/2003" 2003     .105081
1166 "NL0000334118" "31/12/2004" 2004   .02746665
1166 "NL0000334118" "31/12/2005" 2005  -.01903224
1166 "NL0000334118" "31/12/2006" 2006  -.02919521
1166 "NL0000334118" "31/12/2007" 2007  -.05277575
1166 "NL0000334118" "31/12/2008" 2008  -.05401874
1166 "NL0000334118" "31/12/2009" 2009     .123919
1166 "NL0000334118" "31/12/2010" 2010  -.07997546
1166 "NL0000334118" "31/12/2011" 2011 -.072212696
1166 "NL0000334118" "31/12/2012" 2012  -.09492537
1166 "NL0000334118" "31/12/2013" 2013  -.23273465
1166 "NL0000334118" "31/12/2014" 2014  .007800169
1166 "NL0000334118" "31/12/2015" 2015 -.013378848
1166 "NL0000334118" "31/12/2016" 2016  .000170374
1932 "GB0002875804" "31/12/1995" 1995           .
1932 "GB0002875804" "31/12/1996" 1996  .007074844
1932 "GB0002875804" "31/12/1997" 1997   .02022037
1932 "GB0002875804" "31/12/1998" 1998   .07025388
1932 "GB0002875804" "31/12/1999" 1999  -.27306455
end

FactSet

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str12 isin str61 name str12 ri int year byte month
"AT0000730007" "ANDRITZ AG" "11.403597"  2002 12
"AT0000730007" "ANDRITZ AG" "72.014046"  2003 12
"AT0000730007" "ANDRITZ AG" "51.732697"  2004 12
"AT0000730007" "ANDRITZ AG" "68.74873"   2005 12
"AT0000730007" "ANDRITZ AG" "79.90005"   2006 12
"AT0000730007" "ANDRITZ AG" "2.558291"   2007 12
"AT0000730007" "ANDRITZ AG" "-54.92813"  2008 12
"AT0000730007" "ANDRITZ AG" "133.72536"  2009 12
"AT0000730007" "ANDRITZ AG" "73.54651"   2010 12
"AT0000730007" "ANDRITZ AG" "-4.4100466" 2011 12
"AT0000730007" "ANDRITZ AG" "56.09789"   2012 12
"AT0000730007" "ANDRITZ AG" "-3.874004"  2013 12
"AT0000730007" "ANDRITZ AG" "1.3872743"  2014 12
"AT0000730007" "ANDRITZ AG" "0.37008524" 2015 12
"AT0000730007" "ANDRITZ AG" "8.931768"   2016 12
"AT0000743059" "OMV AG"     "3.7600994"  2002 12
"AT0000743059" "OMV AG"     "30.54434"   2003 12
"AT0000743059" "OMV AG"     "92.7689"    2004 12
"AT0000743059" "OMV AG"     "127.135254" 2005 12
"AT0000743059" "OMV AG"     "-11.443013" 2006 12
end

I would like to match both the isin and the year, so I get the lv change of each company for each year.

To merge the, I tried the code "merge m:m isin year using fact, , keepusing(lvchange), but it gave me the error
"variables isin year do not uniquely identify observations in the master data".

Any help on how to fix this would be appreciated.

Thanks

Tags: combining data set, merge

Clyde Schechter

Join Date: Apr 2014

Posts: 30357
#2

25 May 2022, 14:39

I have two comments about "merge m:m isin year using fact, , keepusing(lvchange)" The first is that it would never give you the error message "variables isin year do not uniquely identify observations in the master data." That's because an m:m merge would not care at all whether anything uniquely identifies observations in any data set. More important, you should not be using -merge m:m-. It produces a useless data salad. If you think you need to use -merge m:m- then one or more of three things is true: 1) you don't understand what -merge- does, 2) you don't understand your data, or 3) your data sets are messed up.

In any case, based on the example data you show, you can properly combine these two data sets with:

Code:

use compustat_data_set, clear merge 1:1 isin year using factset_data_set

because in your example data the combination of isin and year does uniquely identify observations in both data sets.

That said, I see that the computstat data set also includes gvkey, and I suspect that in a fuller example from that data set, we would see multiple observations having the same values of isin and year, with different values of gvkey. So what I recommend you try is:

Code:

use compustat_data_set, clear merge m:1 isin year using factset_data_set

Now, if you try that and you get an error message saying that isin and year do not uniquely identify observations in the using data set, then you need to figure out what is actually going on. The first question is whether the factset data set should have multiple observations with the same values of isin and year.

Case 1: It should not. Then you need to find out why it does.

Code:

duplicates tag isin year, gen(flag) browse if flag

will show them to you and then maybe you can figure out how those duplicates got there, and arrive at a way to correctly retain only the right ones, and fix whatever errors in data management led to the inclusion of the spurious observations.

Case 2: It should. In that case, these data sets may not be mergeable at all. In order for them to be mergeable there must be some third variable (or perhaps more than one) common to both data sets that will, in combination with isin and year, uniquely identify observations in at least one of the two data sets. For example, if the factset data set also contains a gvkey variable, then including that in the -merge- command might resolve the problem. If there are, however, no such additional variables, then your data sets cannot be merged. Think about it. Suppose you have several observations in both data sets all with isin = AB123435678 and year = 2012. How would you decide on the proper way to pair them up? Which of these observations in the compustat data set goes with which one in the factset data set? If the answer is to create all possible pairings of such observations, -merge- is the wrong command: see -help joinby-. Or perhaps when looking more carefully at the data you can arrive at a rule for selecting specific pairs. In that case you might have to weed out unmatchable observations from one or both data sets so that you are then able to successfully use -merge m:1 isin year-. Or perhaps it is possible to create a new variable in each of the data sets that does correctly select those matches and you can -merge 1:1 isin year new_variable-.
2 likes
Comment
Juan Gonzalex

Join Date: May 2022

Posts: 59
#3

26 May 2022, 04:00

Hi,

Thanks to you I found the problem, which was that some of the companies changed their annual reporting date. After that I used the code:

Code:

bysort gvkey (flag): drop if flag[_N]==1

217 observations were deleted. I thought the problem was solved, but when I used both these codes I still got the error message "variables isin year do not uniquely identify observations in the using data":

Code:

merge 1:1 isin year using fact

Code:

merge m:1 isin year using fact

By the way, the dataset should not have multiple observations with the same values of isin and year anymore.

Last edited by Juan Gonzalex; 26 May 2022, 04:11.
Comment

Announcement

Merging two datasets

Comment

Comment