Merging Datasets With Unequal Observations

Faris Malla

Join Date: Jun 2021

Posts: 4
#1

Merging Datasets With Unequal Observations

16 Jun 2021, 14:42

Dear All,

First and foremost, this is my first post here so hello everyone and look forward to being a part of the community!

To my problem:

I am currently working with two datasets, one comprising of data on executive compensation (option awards) and another with various financial data on those CEOs respective firm's between the years 2010 – 2020. The executive compensation dataset includes the firm's ticker symbol, a unique Executive/Company ID, and all the options awarded for all CEOs of a given company for each year, among others. It is important to mention that each CEO may have received multiple option awards in a given year which each represent a unique observation in the dataset. An example extract is shown below:

Code:

* Example generated by -dataex-. For more info, type help dataex clear input str58 FullName str5 Ticker long ExecutiveCompanyCombination int FiscalYear "Aldo Charles Zucaro" "ORI" 10945 2010 "James M. Loree" "SWK" 18636 2010 "James Joseph Murren" "MGM" 23752 2010 "Ralph Izzo, Ph.D." "PEG" 29526 2012 "William J. Walljasper" "CASY" 29638 2011 "Richard A. Gottscho" "LRCX" 35287 2017 "Matthew Glenn Manders" "CI" 45083 2014 "John N. Roberts, III" "JBHT" 45173 2013 "Michael Patrick Lyons" "PNC" 46737 2014 "Anne L. Bramman" "JWN" 58822 2018 end

The second dataset includes the same firms as the exec comp dataset, their respective ticker symbols, cash flows, number of M&A deals completed etc. for each year in the range. Unlike the compensation dataset, however, there only exists one observation for each firm in each year. An extract is show below:

Code:

* Example generated by -dataex-. For more info, type help dataex clear input str50 Name int Year double CashFlow "Old Republic International Corporation" 2016 496.5 "Stanley Black & Decker, Inc." 2020 4365.1 "MGM Resorts International" 2010 13750.376 "Public Service Enterprise Group Inc" 2011 20391 "Casey's General Stores, Inc." 2015 2649.394 "Lam Research Corporation" 2016 1844.685 "Cigna Corporation" 2017 3915 "J.B. Hunt Transport Services, Inc." 2016 3612.391 "PNC Financial Services Group, Inc." 2012 9405 "Nordstrom, Inc." 2012 3743 end

Now, my aim:

I want to combine the two datasets such that every option observation for a given CEO is in line with the relevant financial information for their firm in the given year.
My initial attempts involved using m:m merge (ghastly, I know) which, surprisingly, yielded results very close to what I wanted. However, when the number of matching observations between the two datasets is not equal, m:m merge replaces all additional observations with the subsequent last observation, ruining the year to year matching.

Counter to this, I would like that each option after the first instance within a specific year simply repeats the respective financial information. For a little more clarification, below is an extract of one CEO’s option awards for 2010 whereby each observation should be accompanied by the financial information for ‘HES’ in the year 2010:

Code:

* Example generated by -dataex-. For more info, type help dataex clear input str58 FullName str5 Ticker long ExecutiveCompanyCombination int FiscalYear double(OptionExercisePrice OptionMoneynessMIN) "John B. Hess" "HES" 79 2010 . . "John B. Hess" "HES" 79 2010 81.85 0 "John B. Hess" "HES" 79 2010 56.43 .35637072479177756 "John B. Hess" "HES" 79 2010 60.07 .2741801231896122

I would be extremely grateful for some guidance on this issue and thank you in advance for any help you might be able to offer.

I appreciate this is a rather longwinded question and I may have fallen short in explaining it clearly so please do not hesitate to ask for further explanations.

Thank you!

Last edited by Faris Malla; 16 Jun 2021, 14:44.
Tags: Finance, join, merge
Clyde Schechter

Join Date: Apr 2014

Posts: 30147
#2

16 Jun 2021, 15:54

Unlike the compensation dataset, however, there only exists one observation for each firm in each year.

That is the essential observation here.

Now, the example data you show does not match your description of them. In the executive compensation data, the firm is represented by Ticker and there is no firm name, but in the financial data only the firm name appears, no ticker. So the files cannot be merged until this is fixed. For that you will need to find a crosswalk between the firm names and the tickers. That would be a file with two variables: Ticker and Name which can be merged 1:1 with the financial data set you already have in hand. I don't work in finance, so I don't know, but I imagine that such a file can be easily found somewhere on the internet. That way there will be compatible identifying information in both files. You have a smaller problem, also, in that the compensation file has a variable called FiscalYear, whereas in the financial information file, the year variable is called Year. But that's easy to work around.

So assuming you have merged in the crosswalk so that Ticker appears in both files now:

Code:

use executive_compensation_data, clear clonevar Year = FiscalYear merge m:1 Ticker Year using financial_data

and you should be good to go.

Thank you for using -dataex- on your first post. And thank you for realizing that -merge m:m- was wrong. I shudder to think how much invalid analysis is sitting around out there based on the data salad produced by -merge m:m-.

Last edited by Clyde Schechter; 16 Jun 2021, 15:56.
2 likes
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#3

16 Jun 2021, 15:57

My initial attempts involved using m:m merge (ghastly, I know)

Perhaps you have not read the documentation.The following is copied word-for-word from the documentation of the merge command in the Stata Data Management Reference Manual PDF included in the Stata installation and accessible from Stata's Help menu.

m:m merges

m:m specifies a many-to-many merge and is a bad idea. In an m:m merge, observations are matched within equal values of the key variable(s), with the first observation being matched to the first; the second, to the second; and so on. If the master and using have an unequal number of observations within the group, then the last observation of the shorter group is used repeatedly to match with subsequent observations of the longer group. Thus m:m merges are dependent on the current sort order—something which should never happen.

Because m:m merges are such a bad idea, we are not going to show you an example. If you think that you need an m:m merge, then you probably need to work with your data so that you can use a 1:m or m:1 merge. Tips for this are given in Troubleshooting m:m merges below.
So, not just ghastly, completely ill advised and certainly the wrong thing to do, as you found out from your experience, and can read in the documentation.

If you are thinking about using merge m:m, it is a near certainty that at least one of the following is true:

1. Your merge key actually does uniquely identify the observations in one of your data sets, so you can use merge 1:m or merge m:1 or maybe even merge 1:1.

2. You are failing to take account of one or more additional variables in your data set that, combined with the variables you are trying to use as the merge key, uniquely identify the observations in one or both of your data sets, so you can use merge 1:m or merge m:1 or merge 1:1 with the expanded merge key.

3. You are really trying to accomplish what joinby, a different command, does: create in the output dataset every possible combination of an observation from the first dataset and an observation from the second dataset, both having the same key. (SQL users take note! I fell for this myself: to Stata m-to-m does not mean m-by-m.)

4. You actually need to append your datasets rather than merge them.

5. The data sets you are trying to merge are incorrectly configured or contain data errors that need to be fixed.

For your problem, option 1 is perhaps what you need, starting with something like the following.

Code:

use firstdataset, clear merge m:1 Ticker year using seconddataset

Added in edit: crossed with Clyde's much more tactful post.
1 like
Comment
Faris Malla

Join Date: Jun 2021

Posts: 4
#4

17 Jun 2021, 05:00

Dear Clyde and William,

Thank you very much for your guidance and insights, the solutions worked exactly as I wanted! I now also appreciate how my interpretation of the problem and its relation to the merge function required some reinterpreting - a good lesson for the future.

Again, thankful for your help and wishing you all the best.
Comment

Announcement

Merging Datasets With Unequal Observations

Comment

Comment

Comment