I have a 2 id panel data set and I want to fill down/expand observations with respect to a time variable.

Chihhao Shao

Join Date: Jun 2016
Posts: 2

I have a 2 id panel data set and I want to fill down/expand observations with respect to a time variable.

20 Jun 2016, 15:56

Hello everyone,

I encounter a problem and couldn’t find the solution from the past threads, so I decided to post a new thread seeking for advice.

I have a panel data set regarding the holding information of an institutional investor at a certain time point. The sample period spans from 2002Q1 to 2007Q2. For each stock held by investor i, I want to fill in the time gaps. Secondly, if the last period of a stock is not 2007Q2, then I want to expand one extra period for that stock held by investor i. The variables used are: manager number (mgrno), CUSIP, date, and shares (shares held at time t).

For example:

mgrno	cusip	date	shares
110	00184A10	2002m3	49825
110	00184A10	2002m6	56325
110	00184A10	2002m12	56625
110	00184A10	2003m3	56625
110	00206R10	2005m12	28111
110	00206R10	2006m3	27711
110	00206R10	2006m12	17691
110	00206R10	2007m3	23423
500	26101810	2003m6	158060
500	26101810	2003m9	57760
500	26101810	2003m12	18710
500	26101810	2004m3	18310
500	26101810	2004m6	21210
500	26157010	2007m3	3700
500	26157010	2007m6	3700

For stock 00184A10 held by investor 110, the holding period begins from 2002m3 to 2003m3. I want to fill the time gap between 2002m6 and 2002m12, which is 2002m9. Also, I want to add an extra period after 2003m3, since it doesn’t meet the limitation of the sample period.

The expected result (partial) will be:

mgrno	cusip	date	shares
110	00184A10	2002m3	49825
110	00184A10	2002m6	56325
110	00184A10	2002m9	0
110	00184A10	2002m12	56625
110	00184A10	2003m3	56625
110	00184A10	2003m6	0

The second example:

500	26157010	2007m3	3700
500	26157010	2007m6	3700

Since there is no time gap between the 2 observations and the last period is 2007m6, there is no need to do anything to this stock held by investor 500.

I have tried the tsfill command but I couldn’t define the dataset as a penal dataset. The reason is that at time t, stock x can be held by numerous investors. There are several observations for a certain stock at time t. It is required that mgrno and cusip are combined to generate a composite categorical variable in order to uniquely identify an observation. I also tried the command: egen both = group(mgrno cusip), label. However, there are too many observations in my dataset (13,148,727 observations), so the software couldn’t generate the result I want. I have already searched for potential materials for a while, but still didn’t find useful resources perhaps due to my capability. I hope someone can generously offer some suggestions to my problem. Thank you.

Kind regards,

Chihhao

References
1. tsfill
2. composite categorical variables

Tags: None

Clyde Schechter

Join Date: Apr 2014

Posts: 30147
#2

21 Jun 2016, 06:57

So, in a situation where a pair of variables (mgrno and cusip) identify the panel, you need to create a new variable that encodes those pairs. -help egen, (group)-. Then you can avail yourself of the usual time-series and panel commands. Adding an additional observation at the end can be done with -expand-. Filling in the missing values created by -tsfill- or overwriting the incorrect ones created by -expand- is a matter of the "usual" commands with -by-, -_n-, and -_N-.

In addition, the use of monthly dates in your data makes life more complicated than it need be. If you use quarterly dates instead, then what you want are just consecutive integers in the quarterly date variable, which helps -tsfill- work the way you need it to. In the code below, I have added a quarterly date variable to the data set. But, unless you have some particular reason to keep the monthly date variable you started with, I suggest you -drop- it.

So here's the code:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input int mgrno str8 cusip float date long shares 110 "00184A10" 506 49825 110 "00184A10" 509 56325 110 "00184A10" 515 56625 110 "00184A10" 518 56625 110 "00206R10" 551 2811 110 "00206R10" 554 27711 110 "00206R10" 563 17691 110 "00206R10" 566 23423 500 "26101810" 521 158060 500 "26101810" 524 57760 500 "26101810" 527 18710 500 "26101810" 530 18310 500 "26101810" 533 21210 500 "26157010" 566 3700 500 "26157010" 569 3700 end format %tm date // FILL IN GAPS // FIRST CONVERT MONTHLY DATE TO QUARTERLY DATE gen int quarter = qofd(dofm(date)) format quarter %tq // GENERATE IDENTIFIER FOR MGRNO-CUSIP PAIR egen id = group(mgrno cusip) xtset id quarter tsfill // FILL IN MISSING VALUES GENERATED replace shares = 0 if missing(shares) replace date = mofd(dofq(quarter)) by id (mgrno), sort: replace mgrno = mgrno[1] by id (cusip), sort: replace cusip = cusip[_N] // NOW ADD ADDITIONAL OBSERVATION AT END OF SERIES // IF FINAL QUARTER IS NOT 2007Q2 by id (quarter), sort: gen byte ex = 1+ (_n == _N & quarter != tq(2007q2)) expand ex by id (quarter), sort: replace quarter = quarter[_N-1] + 1 if _n == _N by id (quarter), sort: replace shares = 0 if _n == _N & ex == 2

Finally, your question is difficult enough to be interesting but not so difficult as to be off-putting. So why did you get no response for over 12 hours? I think it's because you really made the example data as difficult as possible to work with. This kind of code requires some experimentation and testing, which in turn requires a sample of data to work with. The way you posted it made it extremely cumbersome to get into Stata. In fact, it took me twice as long to do that as it took me to solve your problem once the data was in. Please read the FAQ, especially #12 which has advice on the ways to post data and code and output in ways that make it as easy as possible for others to work with. In particular, data examples should be posted using the output of the -dataex- command (as I have done above). Doing that will likely get you quicker responses to your questions in the future.
Comment
Chihhao Shao

Join Date: Jun 2016

Posts: 2
#3

21 Jun 2016, 10:30

Thank you Clyde. Your advice really helps! I will follow the commands you post to proceed with the work. Also, I'll read the FAQ before I ask questions in the future. Thanks!
Comment

Announcement

I have a 2 id panel data set and I want to fill down/expand observations with respect to a time variable.

Comment

Comment