Derive S&P1500 sample

Samuel daSilva

Join Date: Mar 2017

Posts: 16
#1

Derive S&P1500 sample

15 May 2017, 07:00

Dear Forum,

I have a question on how to create a sample of S&P 1500 firms.

What I need:
I need to collect a sample of all firms in the S&P1500 from 1990 to 2010. I use WRDS and Stata and I need information on accounting and stock data.

More specifically:

If a firm was in the S&P1500 for at least one year (= t0) I would need data on this firm in the range from t-3 to t4.

Example 1: Firm A appears in the S&P1500 in the year 2000 (any month). Than, I would need data on this firm from 1997 to 2004.
Example 2: Firm B appears in the S&P1500 in the years 1995, 1996 and 1997. Than, I would need data on this firm from 1992 to 2001.

What I have done so far:

As I need accounting and stock data I turned to the CRSP-Compustat-Merged database, searched for all firms in the database and merged it with CRSP.

In WRDS, I used the „Compustat Monthly Updates - Index Constituents“ and searched for all firms from Jan 1990 to December 2010, using the „i0020“-ticker (which is for the S&P1500).

As a result, I get a list of all companies with a „from“ and „thru“ date constituting the entry and exit date in/from the S&P1500. The variables „from“ and „thru“ are in the format: „14feb2010“

My idea on how to solve this

- transform the two variables „from“ and „thru“ to a „long“ format on a yearly base

Current format:
gvkey From Thru Index

001 10sep1999 03aug2001 S&P1500

002 04jan1996 24mar1997 S&P 1500

Target format:
gvkey year index (changed to dummy var)

001 1999 1

001 2000 1

001 2001 1

002 1996 1

002 1997 1

- use (or create a) an unique identifier for the S&P1500 firms
- merge the index data with my CRSP/Compustat merged database on gvkey and year (fyear = year)
- question: the crsp-compustat-merged and compustat data which I merged before are merged on permno. is it a problem if I use gvkey for the
S&P1500-„merge“?

My questions

—> does this approach make sense?

—> if so, how could I do the transformation part. I have tried several ways but was not able to do it

Any guidance on how to do this is highly appreciated!

Thanks a lot,

Samuel
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30116
#2

15 May 2017, 09:27

I'm not sure I completely understand what you're trying to do. But it sounds a great deal like a job for -rangejoin-. I suppose you have a data set, let's call it master_data, with an identifying variable, gvkey, and variables from and thru that identify the lower and upper bounds of a range of dates for which you want to bring in data on that firm from another data set. I assume that this second data set, call it secondary_data, contains variables gvkey and year. Then you would do this with:

Code:

use master_data, clear rangejoin year from thru using secondary_data, by(gvkey)

Note that -rangejoin- is not part of official Stata. It was written by Robert Picard and is available by running -ssc install rangejoin-.
Comment
Robson Glasscock

Join Date: Apr 2014

Posts: 25
#3

15 May 2017, 11:07

question: the crsp-compustat-merged and compustat data which I merged before are merged on permno. is it a problem if I use gvkey for the
S&P1500-„merge“?

I have never been affiliated with a university that has had a subscription to the CRSP/COMPUSTAT merged data, but I have worked with both of these individual databases and merged them manually. There is no problem with merging based on GVKEY if it appears in both datasets, but if a firm has multiple security issuances in the CRSP/COMPUSTAT merged data and if this results in multiple GVKEY's for a given time period because of the multiple issuances, then your merge will need to account for this structure (e.g., GVKEY and time may not be a unique identifier in one dataset but will be a unique identifier in the other.). I don't think you need to create your own unique identifier for the index data.

I would caution you about using COMPUSTAT's FYEAR for the time dimension of the merge. Maybe it is fine, but look at your data and consider how COMPUSTAT defines this variable.The FYEAR is defined by COMPUSTAT based on the number of months a firm's fiscal year is within a given calendar year. A firm that files on May 31, 2003 will have a fiscal year of 2002 since seven months of the firm's operations were in 2002. In your case this may not be a problem since you already have the COMPUSTAT/CRSP merged database, but it is something to consider. There are cases where merges based on FYEAR can inadvertently slide accounting data one year off from data in other databases if the research isn't aware of the FYEAR definition.
1 like
Comment
Samuel daSilva

Join Date: Mar 2017

Posts: 16
#4

16 May 2017, 09:03

Dear Clyde,

thanks for your reply and your hint on - rangejoin - that worked out perfectly!

As you assumed correctly, I have a data set called „master_data“ with an identifying variable "gvkey" and variables "from" and "thru" that define lower and upper bounds of a range of dates. For this set of data I want to bring in data from another data set called „secondary_data“ which contains the variables gvkey and year as well as further variables of interest that I need for subsequent analysis (e. g. accounting data).

I have a further issue:
I would also need to bring in data from the „secondary_data“ set on the three years prior to the lower bound and four years after the upper bound for each firm.

Example 1: Firm A appears in the S&P1500 in the year 2000. Than, I would need to data on this firm from 1997 to 2004.

Example 2: Firm B appears in the S&P1500 in the years 1995, 1996 and 1997 (t. i. lower bound = 1995 and upper bound = 1997). Than, I would need data on this firm from 1992 to 2001.

Example 3: Firm C appears in the S&P1500 for the whole period I am looking at (1990 to 2010). Than, I would need data on this firm from 1987 to 2014.

Is there a way to modificate the - rangejoin - command? Or my master_data?

Thanks a lot!

Best,
Samuel
Comment
Robson Glasscock

Join Date: Apr 2014

Posts: 25
#5

16 May 2017, 09:54

One way to do this is to pull your secondary data for the prior and subsequent periods you need, use -xtset- to identify the data as panel data, and then generate variables equal to the lags and leads you need with Stata's time-series operators. This will give you one row of merged data that will contain all of the lags and leads for each firm-year, and then you can program accordingly.

For your example, "secondary_data" you pulled from COMPUSTAT has accounting variables from 1987- 2014. Perform your merge, use -xtset-, L. and F. to create these new variables, and then keep observations where _merge==1. The merged observations will contain one row with all current year variables, the three lags, and the four leads for each firm-year.
Comment
Samuel daSilva

Join Date: Mar 2017

Posts: 16
#6

17 May 2017, 05:08

Dear Robson,

thank you very much for your help!

With regard to your first message:

I was not quiet correct on my first post.
I have two datasets:

1. CRSP/Compustat/Merged (CCM) data (which contains basically data from Compustat)
2. CRSP Data on stock returns

Similar to CRSP, CCM has PERMNO as identifier which the „normal“ Compustat does not contain. This makes it easy to merge CCM with CRSP.

With regard to the time dimension:

I thought about that aspect too but was not sure on how to deal with it.

Although I understand the logic of Compustat in assigning the fiscal year, I do not exactly get the mechanism you described where the accounting data would be sliced one year off.

Is it not that in all cases where calendar year and fiscal year are synchronized, the data is „off“ at least partially?

Could you specify on that? Is there a routine on how to deal with this issue? Or maybe another post in this forum?

Your comment made me aware of another issue (thank you for that!!):

What I do with the (monthly) CRSP data is the following: I look at monthly returns, calculate equity betas based on stock returns of the 60 prior months. Than, I take the average of the betas for one (calendar) year to transform the betas to a yearly base.

In the next step, I merge this data with my CCM data. This data is annually and I use fiscal year as „reference point“.

However, I guess it would be a lot more precise to align the beta calculations with the fiscal year end?

Example:

Fiscal Year goes from March 1990 to March 1991. Compustat assigns this fiscal year to 1990.

If I calculate the betas in CRSP, I use data from January 1990 to December 1990. In this case, only 6 of the calculated betas which I use to calculate the average beta for that year would align with the Compustat data.

On the other hand, each of these monthly betas is based on 60 data points (I use monthly returns over the prior 5 years), so that in the end, the deviance might not be that much.

For now I would not know how to efficiently solve this (and I do not expect anyone in this forum to find out for me! :D ). I was just elaborating on that because you made me aware of this issue. If I get the time I might try to calculate the beta in both ways to compare results and than I would post them here in case someone is interested.

With regard to your second message:

I thought about that too and I like the idea because it is pretty straightforward.

However, I have at least 25 variables, so 25 x 8 (lagged, current, forward) would result in a pretty messy table..
Comment
Robson Glasscock

Join Date: Apr 2014

Posts: 25
#7

17 May 2017, 07:38

Samuel,

With regard to the time dimension:

I think the easiest way for you to see the issue with FYEAR is by downloading some data, keeping some firms with non-December 31st fiscal year ends, and then looking at the values assigned to FYEAR by COMPUSTAT. What you will see for firms, for example a firm that files on May 31, 2003, is an FYEAR equal to 2002. However, DATADATE will be May 31, 2003 and this firm's balance sheet date will be 2003 and not 2002 if you pull the 10-K from EDGAR. This won't be a problem for all firms with non-December 31st fiscal year ends. Firms that file on or after June will have FYEAR equal to the year of the DATADATE variable, but firms that file between January and May on a given year will have FYEAR equal to DATADATE year minus 1.

If you then deem FYEAR to be the time dimension to merge data from CRSP, you could inadvertently match up 2003 returns with 2002 accounting data. This can happen with other datasets, as well. Audit Analytics is another dataset where using FYEAR from COMPUSTAT can misalign auditor data one year off from the accounting data for the same firms (e.g., Jan- May filers). Typically, I handle this by merging around either the year of the DATADATE or the year from RQD depending on what my research question is. DATADATE is the period-end of the financial statements (e.g., balance sheet date) and RDQ is often the date of the Q4 earnings announcement which will also often have some annual financial information disclosed. Once could also look at FILEDATE, the date the 10-K is filed, depending on the research question.

My opinion is that FYEAR is a COMMPUSTAT convention to eliminate duplicate times within the panel for firms that change fiscal year-ends, but I do not believe this variable is useful for merging with other databases that are not COMPUSTAT. I could be wrong here, but I haven't come across anything to the contradict this opinion in any of my research.

The routine you are asking about is as simple as using one of these other dates, or perhaps generating a variable equal to the year of one of these other dates, renaming the variable in the dataset to be the same as the CRSP date variable (or date variable in Audit Analytics, etc.), and doing the merge based on this.

Your comment made me aware of another issue (thank you for that!!):

Thanks for the kind words. In my opinion, this continues from the first. Even you see that "results are not that different," this could be the one time out of 10, 100, 1000 where they aren't that different. There's no real way to know. I think it is better to line everything up from the start and iron out all known issues rather than relying on results not being materially impacted in the isolated context of this one paper.

It's starting to get difficult for me to answer here just based on descriptions, but this is what I think I would do:

Pull CRSP data for the entire panel of years you need and actually augment the query to go back 5 years further than your final estimation sample (more on this in the next sentence or two). The periodicity of this data is monthly. Estimate beta based on 60-month previous monthly returns. This gives a beta next to each PERMNO-month that you have provided you pull data for periods further back than your actual research question dictates (i.e., 5 years) so that the first few values of beta for each firm-month aren't missing due to the data not being there to estimate beta that far back. Next, average beta based on the prior 12 months. This gives you an annual beta next to each PERMNO-month. So, each firm-month has an annual beta based on the preceding 12 months.

Now we have to think about how to get that lined up with accounting data. Depending on your research question that's where you have to decide which date variable in COMPUSTAT is the one to use from those previously discussed. I think you should work with some function of year and month, though. If you don't, and you calculate annual beta measures the way you described where there is one per year, then I think you could run into an issue with lining up the right annual beta measures with the accounting data for the firms that are non-December 31 filers. I think what you want here is to line up a firm with, say, a March 31st DATADATE with the annual average beta based on a series of rolling, 60-month estimation windows over the annual period from March 31st year x to March 31st year x-1. Same thing for a firm that files at the end of May. All those averaged annual betas will be based on the correct preceding 12 months for any given month.

The way you described it, to me it sounds like your approach would line up calendar CRSP returns with COMPUSTAT firms, and if this is done on FYEAR there are two issues: 1) firms between Jan and May have returns from year t-1 matched with accounting data in year t and 2) I don't see how what you described takes into account various fiscal year-ends in COMPUSTAT. I could easily be misinterpreting some of what you said here, and I hope what I'm writing is clear enough that you can decide if this is a problem for your research design or not.

However, I have at least 25 variables, so 25 x 8 (lagged, current, forward) would result in a pretty messy table..

Again, I could be missing something here, but I don't see why that's a problem. The tables that will be published will be of summary statistics. Having a large underlying dataset shouldn't be an issue, but maybe I'm missing something. What I suggested allows you to run a variety of summary statistics, or program these lags and leads into whatever other variables you need, but I could be missing something here.
Comment
Samuel daSilva

Join Date: Mar 2017

Posts: 16
#8

17 May 2017, 11:26

Robson,

first of all, thank you for your detailed answers!

I think you are right with the linkage between issue 1 („time dimension“) and issue 2 („your comment made me aware of another issue“).

After thinking a while about this issue and your comments both issues could be solved by addressing "issue 2"

However, I will start with some thoughts on issue 1:

Lets consider a firm that has has its fiscal year ending on March 31st 2000.

If I assign the year, based on DATADATE, I would chose the year 2000. In this case:
- Compustat´s data on that firm would come from April 1st 1999 to March 31st. 2000.
- CRSP data would come from January 1st to December 31st 2000.

—> I would get 3 „matching“ months (from January 1st to March 31st 2000)

If I assign the year based on FYEAR, I would chose the year 1999. In this case:
- Compustat´s data on that firm would come from April 1st 1999 to March 31st. 2000.
- CRSP data would come from January 1st to December 31st 1999.

—> I would get 9 „matching“ months (from April 1st to Dezember 31st 1999).

So in all cases where DATADATE is not equal to December 31st, my yearly data will be off. However, it seems to me that going with FYEAR kind of „minimizes“ this issue:
Consider the extreme example of a fiscal year ending in January 2000. If I merge based on the year of DATADATE I wold have only one month where the data would „line up“.

Maybe I get something wrong, but I think that FYEAR is the better option if I chose between FYEAR and the year based on DATADATE, although it is still unprecise!

If my reasoning is right, than the solution to merge the data accurately would be to follow the path that you described in the second section of your post:

1. calculate beta for each month based on prior 60 months.
2. calculate the average of the 12 prior betas for each monthly observation.
3. generate a variable MONTH in my CRSP data for each observation, then name it FYR (similar to „fiscal year end“ in CCM data).

—> If I am thinking correctly, I should have twelve observations per calendar year in my CRSP data. Each observation would contain the correct equity beta for a particular month based on ints prior 12 months.
But only one of these observations should match with my CCM data, because in this set I would have only one FYR per PERMNO per YEAR.

Thus:
4., merge CCM and CRSP on PERMNO, YEAR, FYR
5. drop those observations from CRSP that did not match. This should be 11 monthly observations per firm per year.

I will try this and than I will report my results.

Thank you for your hints and the discussion!
Comment
Robson Glasscock

Join Date: Apr 2014

Posts: 25
#9

17 May 2017, 11:37

I disagree that FYEAR minimizes the problem. I believe that FYEAR causes it. I may not be explaining my point clearly, but I believe that merging based on some function of DATADATE, FILEDATE, RDQ, etc. will allow you to get the market's returns closer to the actual period-end of the accounting data or when the market becomes aware of the accounting data.

Consider the extreme example of a fiscal year ending in January 2000. If I merge based on the year of DATADATE I wold have only one month where the data would „line up“.

I disagree with the above statement. You could have the right aggregated returns depending on how returns are structured in the dataset. I think we just have to agree to disagree and move on. I appreciate the dialogue we have had.
Comment
Samuel daSilva

Join Date: Mar 2017

Posts: 16
#10

21 May 2017, 05:11

Robson,

I re-thought the issue and I agree on your comments on fyear and datadate.
Thank you for making me aware, your patience and the discussion!
Comment
Robson Glasscock

Join Date: Apr 2014

Posts: 25
#11

21 May 2017, 07:14

Samuel,
Thanks for posting and letting me know about your conclusion. I hope your project goes well.
Comment

gvkey	From	Thru	Index
001	10sep1999	03aug2001	S&P1500
002	04jan1996	24mar1997	S&P 1500

gvkey	year	index (changed to dummy var)
001	1999	1
001	2000	1
001	2001	1
002	1996	1
002	1997	1

Announcement

Derive S&P1500 sample

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment