Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • I have a 2 id panel data set and I want to fill down/expand observations with respect to a time variable.

    Hello everyone,

    I encounter a problem and couldn’t find the solution from the past threads, so I decided to post a new thread seeking for advice.

    I have a panel data set regarding the holding information of an institutional investor at a certain time point. The sample period spans from 2002Q1 to 2007Q2. For each stock held by investor i, I want to fill in the time gaps. Secondly, if the last period of a stock is not 2007Q2, then I want to expand one extra period for that stock held by investor i. The variables used are: manager number (mgrno), CUSIP, date, and shares (shares held at time t).

    For example:
    mgrno cusip date shares
    110 00184A10 2002m3 49825
    110 00184A10 2002m6 56325
    110 00184A10 2002m12 56625
    110 00184A10 2003m3 56625
    110 00206R10 2005m12 28111
    110 00206R10 2006m3 27711
    110 00206R10 2006m12 17691
    110 00206R10 2007m3 23423
    500 26101810 2003m6 158060
    500 26101810 2003m9 57760
    500 26101810 2003m12 18710
    500 26101810 2004m3 18310
    500 26101810 2004m6 21210
    500 26157010 2007m3 3700
    500 26157010 2007m6 3700
    For stock 00184A10 held by investor 110, the holding period begins from 2002m3 to 2003m3. I want to fill the time gap between 2002m6 and 2002m12, which is 2002m9. Also, I want to add an extra period after 2003m3, since it doesn’t meet the limitation of the sample period.

    The expected result (partial) will be:
    mgrno cusip date shares
    110 00184A10 2002m3 49825
    110 00184A10 2002m6 56325
    110 00184A10 2002m9 0
    110 00184A10 2002m12 56625
    110 00184A10 2003m3 56625
    110 00184A10 2003m6 0
    The second example:
    500 26157010 2007m3 3700
    500 26157010 2007m6 3700
    Since there is no time gap between the 2 observations and the last period is 2007m6, there is no need to do anything to this stock held by investor 500.

    I have tried the tsfill command but I couldn’t define the dataset as a penal dataset. The reason is that at time t, stock x can be held by numerous investors. There are several observations for a certain stock at time t. It is required that mgrno and cusip are combined to generate a composite categorical variable in order to uniquely identify an observation. I also tried the command: egen both = group(mgrno cusip), label. However, there are too many observations in my dataset (13,148,727 observations), so the software couldn’t generate the result I want. I have already searched for potential materials for a while, but still didn’t find useful resources perhaps due to my capability. I hope someone can generously offer some suggestions to my problem. Thank you.

    Kind regards,

    Chihhao

    References
    1. tsfill
    2. composite categorical variables

  • #2
    So, in a situation where a pair of variables (mgrno and cusip) identify the panel, you need to create a new variable that encodes those pairs. -help egen, (group)-. Then you can avail yourself of the usual time-series and panel commands. Adding an additional observation at the end can be done with -expand-. Filling in the missing values created by -tsfill- or overwriting the incorrect ones created by -expand- is a matter of the "usual" commands with -by-, -_n-, and -_N-.

    In addition, the use of monthly dates in your data makes life more complicated than it need be. If you use quarterly dates instead, then what you want are just consecutive integers in the quarterly date variable, which helps -tsfill- work the way you need it to. In the code below, I have added a quarterly date variable to the data set. But, unless you have some particular reason to keep the monthly date variable you started with, I suggest you -drop- it.

    So here's the code:
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input int mgrno str8 cusip float date long shares
    110 "00184A10" 506  49825
    110 "00184A10" 509  56325
    110 "00184A10" 515  56625
    110 "00184A10" 518  56625
    110 "00206R10" 551   2811
    110 "00206R10" 554  27711
    110 "00206R10" 563  17691
    110 "00206R10" 566  23423
    500 "26101810" 521 158060
    500 "26101810" 524  57760
    500 "26101810" 527  18710
    500 "26101810" 530  18310
    500 "26101810" 533  21210
    500 "26157010" 566   3700
    500 "26157010" 569   3700
    end
    format %tm date
    
    // FILL IN GAPS
    //    FIRST CONVERT MONTHLY DATE TO QUARTERLY DATE
    gen int quarter = qofd(dofm(date))
    format quarter %tq
    //    GENERATE IDENTIFIER FOR MGRNO-CUSIP PAIR
    egen id = group(mgrno cusip)
    xtset id quarter
    tsfill
    
    //    FILL IN MISSING VALUES GENERATED
    replace shares = 0 if missing(shares)
    replace date = mofd(dofq(quarter))
    by id (mgrno), sort: replace mgrno = mgrno[1]
    by id (cusip), sort: replace cusip = cusip[_N]
    
    //    NOW ADD ADDITIONAL OBSERVATION AT END OF SERIES
    //    IF FINAL QUARTER IS NOT 2007Q2
    by id (quarter), sort: gen byte ex = 1+ (_n == _N & quarter != tq(2007q2))
    expand ex
    by id (quarter), sort: replace quarter = quarter[_N-1] + 1 if _n == _N
    by id (quarter), sort: replace shares = 0 if _n == _N & ex == 2
    Finally, your question is difficult enough to be interesting but not so difficult as to be off-putting. So why did you get no response for over 12 hours? I think it's because you really made the example data as difficult as possible to work with. This kind of code requires some experimentation and testing, which in turn requires a sample of data to work with. The way you posted it made it extremely cumbersome to get into Stata. In fact, it took me twice as long to do that as it took me to solve your problem once the data was in. Please read the FAQ, especially #12 which has advice on the ways to post data and code and output in ways that make it as easy as possible for others to work with. In particular, data examples should be posted using the output of the -dataex- command (as I have done above). Doing that will likely get you quicker responses to your questions in the future.

    Comment


    • #3
      Thank you Clyde. Your advice really helps! I will follow the commands you post to proceed with the work. Also, I'll read the FAQ before I ask questions in the future. Thanks!

      Comment

      Working...
      X