Merge two data sets by a common variable and a date range

Rochelle Zhang

Join Date: Jul 2025

Posts: 0
#1

Merge two data sets by a common variable and a date range

03 Oct 2014, 23:13

Dear statalister,
I have the following two dataset that I need to merge:

(I am only showing a few records to illustrate data structure)

My goal is to add cusip information from Cusip_permno to my permno dataset .

Permno is a company ‘s permanent identifier – never changes. For each permno, it could correspond to more than one cusip if a company issued different types of stocks and received newly assigned cusip . the begin date and end date suggests during which time period and cusip applies. Hence, my merge is to add cusip to permno data based on the effective period cusip applies.

Data Cusip_permno
Permno Cusip Begindate Enddate

10000 002b0102 19900102 19931231

10001 0039b012 19900102 19920102

10001 0029b011 19920103 19941203

Permno year xvar

10000 1990 1

10000 1991 2

10001 1990 19

10001 1992 92

10001 1993 13

I am stuck with how to use the date to match with the year and permno.

Thanks for considering my question,

Rochelle
Tags: None

1 like
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#2

04 Oct 2014, 07:58

We need to clarify something here. For Permno 10001, Cusip 0029b011 covers most, but not all of 1992, all of 1993, and most, but not all of 1994. The beginning of 1992 is covered by Cusip 0039b012. So when you match up Permno 10001 and 1992, which Cusip do you want to match to? Or do you want to have two records, one for each Cusip? Please provide a general answer to the issue of multiple Cusips in a single year, not just this particular instance.

Last edited by Clyde Schechter; 04 Oct 2014, 08:18.
Comment

Evelyn Ersanilli

Join Date: May 2014
Posts: 43

04 Oct 2014, 08:39

It is not fully clear to me what you want to do.
I don't understand why in the example of your permno dataset above there are multiple years for each company, but these do not cover all years.

Would you like your dataset to look like
A. with Cusip connected to the start year

10000	1990	1	002b0102
10000	1991	2
10001	1990	19	0039b012
10001	1992	92	0029b011
10001	1993	13

B. Multiplying the observations in your permno database so that there is an observation for each year a cusip runs?

10000	1990	1	002b0102
10000	1991	2	002b0102
10000	1992	.	002b0102
10000	1993	.	002b0102
10001	1990	19	0039b012
10001	1991	.	0039b012
10001	1992	92	0039b012
10001	1992	92	0029b011
10001	1993	13	0029b011
10001	1994	.	0029b011

From your description I'm guessing it's option B.

You need to check what the format of your year variable in cusip is.
For whatever reason I couldn't extract year using the year function from a variable formatted %tdCCYY-NN-DD so I decided to turn it to string and extract the year from that.

Code:

use "cusip.dta", clear
tostring begindate, replace
tostring enddate, replace
gen beginyear=substr(begindate,1,4)
gen endyear=substr(enddate,1,4)
destring beginyear endyear, replace
gen duration=endyear-beginyear
expand duration+1
bysort cusip: gen year=beginyear+_n-1
keep cusip year
save "cusip expanded.dta"

use "permno.dta", clear
joinby Permno year using "C:\Users\Evelyn\Downloads\cusip expanded.dta", unmatched(both)

Comment

Rochelle Zhang

Join Date: Jul 2025

Posts: 0
#4

04 Oct 2014, 09:32

@ Clyde: Thank you ! I looked into the raw data. Yes, I do need to decide what to do when the year overlaps. Majority of my sample does not have this overlap issue. suppose I isolate these observations that have year overlaps and not merge them at this point (waiting for further investigation on what year to merge).

could you please comment on merging the two data sets below (I edit my data to show there is no overlaps in year)

Permno Cusip Begindate Enddate

10000 002b0102 19900102 19931231

10002 0032b016 19900101 19941231

10002 0032b016 19950102 19991231

Permno year xvar

10000 1990 1

10000 1991 2

10002 1990 2

10002 1995 9

10002 1996 3
Comment
Rochelle Zhang

Join Date: Jul 2025

Posts: 0
#5

04 Oct 2014, 09:40

@Evelyn, thank you for your detailed response .

Re: your first question. permno dataset has years that are not continuous (gaps inbetween ). this could happen when the data collection was incomplete. I am using secondary data from some commercial databases.

Re: my raw datasets are both sas datasets. the date format should work after importing into stata. your suggested solution should be useful if I have those overlapping years as Clyde mentioned in his post. Yes, your option B is better.

thanks again,
Comment

Clyde Schechter

Join Date: Apr 2014
Posts: 30117

04 Oct 2014, 09:48

So, here's what I would do:

Code:

use cusip_data_set, clear
isid Permno Cusip, sort // IF THIS FAILS, CODE BELOW WOULDN'T WORK

// GENERATE YEAR VARIABLES AND EXPAND IT TO FILL IN
// PERIODS THAT COVER MORE THAN ONE YEAR
gen long first_date = date(Begindate, "YMD")
gen long last_date = date(Enddate, "YMD")
gen int first_year = year(first_date)
gen int last_year = year(last_date)
expand last_year-first_year+1
by Permno Cusip, sort: gen int year = first_year + _n - 1
drop first_date last_date first_year last_year

// WE NOW HAVE ONE OBSERVATION FOR EACH CUSIP
// FOR EACH YEAR IT WAS IN EFFECT
tempfile cusips
save `cusips'

// NOW MERGE WITH THE PERMNO DATASET
use permno_dataset, clear
isid Permno year, sort // IF THIS FAILS, CODE BELOW WOULDN'T WORK
merge 1:m Permno year using `cusips', keepusing(master match)

The data now are matched up. If there are multiple Cusips that apply to a given Permno in a given year, they are all there and you will have to decide which one(s) to keep. If there are data in the Permno data set with no corresponding Cusips, they are characterized by _merge == 1, and you will also have to decide what to do about those.

Comment

Rochelle Zhang

Join Date: Jul 2025

Posts: 0
#7

04 Oct 2014, 21:13

Thanks so much
Comment
michael joe

Join Date: May 2015

Posts: 50
#8

17 Jul 2015, 09:24

Hi, I have a similar question. What would happen if the year variable were a date in instead of what is shown here? ie. I wish to see if a certain date for each permo/year observation is within the beginning or end date of the first dataset?
Comment

Permno	Cusip	Begindate	Enddate
10000	002b0102	19900102	19931231
10001	0039b012	19900102	19920102
10001	0029b011	19920103	19941203

Permno	Cusip	Begindate	Enddate
10000	002b0102	19900102	19931231
10002	0032b016	19900101	19941231
10002	0032b016	19950102	19991231

Announcement