Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Generating data to balance a panel dataset

    Hi,

    I have a panel from Compustat like the below in Table 1 (table 1 is a subset of the data to show 3 different example issues), where gvkey is the firm-specific identifier, fyear is the reporting year, emp is employment, and dlrsn is the reason the firm dropped out of the dataset.

    Table 1
    gvkey fyear emp dlrsn
    001 1996 2 02
    001 1997 3 02
    001 1998 2 02
    001 1999 1 02
    002 1996 4 06
    002 1997 5 06
    002 1998 4 06
    002 1999 3 06
    002 2000 3 06
    002 2001 3 06
    002 2002 3 06
    002 2003 3 06
    003 1996 7 .
    003 1997 8 .
    . . . .
    . . . .
    . . . .
    003 2016 14 .

    I need employment data for each firm all the way up to and including 2016 (as shown in Table 2). However, many firms drop out of the dataset (e.g., because of bankruptcy). For such firms, I want to generate employment numbers for all years from the last date they reported, going up to and including 2016 using the following methodoloy:
    1. If dlrsn is 02 or 03, set employment number to zero from the first year after the last reporting year going up to and including 2016. For example, in Table 2, firm 001 reports up to 1999, I would like to generate data that has fyears 2000-2016 and employment set at 0 because dlrsn is 02.
    2. If dlrsn is 01,04,05,06,07,09,10,20, use the last reported employment number for all years after the last reported year. For example, in Table 2, firm 002 reports up 2003, I would like to generate data that has fyear 2004-2016 and employment is set equal to the last available employment number (i.e.,4) because dlrsn is 06
    3. If the firm does not drop out of the dataset, nothing should change.
    Table 2
    gvkey fyear emp dlrsn
    001 1996 2 02
    001 1997 3 02
    001 1998 2 02
    001 1999 1 02
    001* 2000* 0* 02*
    001* 2001* 0* 02*
    001* 2002* 0* 02*
    . . . .
    . . . .
    . . . .
    001 2016 0* 02
    002 1996 4 06
    002 1997 5 06
    002 1998 4 06
    002 1999 3 06
    002 2000 3 06
    002 2001 2 06
    002 2002 3 06
    002 2003 4 06
    002* 2004* 4* 06*
    002* 2005* 4* 06*
    . . . .
    . . . .
    . . . .
    002* 2016* 4* 06*
    003 1996 7 .
    003 1997 8 .
    . . . .
    . . . .
    . . . .
    003 2016 14 .
    Essentially, I want to get from Table 1 to Table 2 and would very much appreciate any advice (note there are thousands of firms where different dates and the above is just a example). I tried the following code but got a bit stuck:

    gen dldteyear=year(dldte)
    bysort gvkey: egen lastdate=max(fyear)
    expand yeardiff if fyear==lastdate


    The code was able to duplicate the last reported date in Table 1 the correct number of times but then I was a bit stuck with how to do the next step, I was thinking replace years because they would have to be made consecutive and then replace employment but this started to get a bit messy. I am sure there is probably a better approach then the one I am taking which seems rather mechanical.

    Thanks in advance for all support.


    Best,
    Ali


  • #2
    Welcome to the Stata Forum / Statalist.

    Please read the FAQ, particularly the topic about sharing command/data/output.


    I didn't read the post in detail and I didn't get the reason to produce a balanced data. But I gather you may use - fillin - for the matter:

    Code:
    help fillin
    Best regards,

    Marcos

    Comment


    • #3
      Let me note that filling in observations for firms that no longer exist is not likely to be a good idea. It is one thing to have a missing observation, but when the true value truly should not exist (e.g., firm has disappeared), then filling in by any arbitrary rule is questionable. Most Stata panel estimators do not require balanced panels.

      Comment

      Working...
      X