Eliminate non-consecutive years in pannel data

Nuno Goncalves

Join Date: Nov 2020

Posts: 12
#1

Eliminate non-consecutive years in pannel data

26 Oct 2023, 03:34

Greetings,

I´m working in a dataset that goes from 2010-2022. I have a panel with firm and year. I'm cleaning the data that contains some dependent variable missing values, and I used the following code to select the firms with at least 5 consecutive years:

gen run = .
by id: replace run = cond(L.run == ., 1, L.run + 1)
by id: egen maxrun = max(run)
by id: drop if maxrun <5

Now, some firms have breaks in the years of the dependent variable. For example, 6 consecutive years (2010-2015) and 2 consecutive years (2018-2019).
I want to eliminate the break with less than 5 consecutive years per company, but I can't find any code to run ( in the example, above I just want to drop the second break from 2018-2019).

Any suggestion,
Thank you
Nuno
Tags: None

Nick Cox

Join Date: Mar 2014
Posts: 35724

26 Oct 2023, 06:13

You're evidently drawing upon https://www.stata.com/support/faqs/d...-observations/ -- which also indicates a way of answering this question -- which is to use tsspell from SSC.

However, I am guessing that the real question is different.

That is because the code doesn't do what you say it does.

The code doesn't pay any attention to missing values in the dependent variable, which isn't even mentioned. run is a variable created by the code. The code looks for gaps in tsset or xtset data, namely for years that could be present but are in fact absent from (not included in) the dataset.

There is much looseness in researcher talk, but to Stata a missing value is a value of a variable in an observation in the dataset that is missing, namely an empty string for a string variable or system missing . or an extended missing value .a to .z for a numeric variable.

Observations that might be in the dataset but aren't are perhaps missing to a researcher but Stata knows nothing about them, except indirectly, notably by being able to detect gaps in a time variable. As just mentioned, they are best regarded as absent, not missing.

This example is a guess at technique for the real problem. One panel is complete; another panel is provided only for certain years. If you fire up tsspell, it is a simple calculation to get the length of each run or spell, which you can keep or drop as you wish. There are other ways to do this too.

Code:

. clear 

. set obs 21 

. gen firmid = cond(_n <= 13, 1, 2)

. gen year = cond(_n <= 13, 2009 + _n, cond(_n <= 19, 1996 + _n, cond(_n ==  20, 2018, 2019))) 

. tsset firmid year

Panel variable: firmid (unbalanced)
 Time variable: year, 2010 to 2022, but with a gap
         Delta: 1 unit

. ssc install tsspell 

. 
. tsspell, fcond(L.year == .)
warning: data contain gaps; see help on tsspell

. 
. bysort firmid _spell (year) : egen _length = max(_seq)

. 
. list, sepby(firmid _spell)

     +------------------------------------------------+
     | firmid   year   _spell   _seq   _end   _length |
     |------------------------------------------------|
  1. |      1   2010        1      1      0        13 |
  2. |      1   2011        1      2      0        13 |
  3. |      1   2012        1      3      0        13 |
  4. |      1   2013        1      4      0        13 |
  5. |      1   2014        1      5      0        13 |
  6. |      1   2015        1      6      0        13 |
  7. |      1   2016        1      7      0        13 |
  8. |      1   2017        1      8      0        13 |
  9. |      1   2018        1      9      0        13 |
 10. |      1   2019        1     10      0        13 |
 11. |      1   2020        1     11      0        13 |
 12. |      1   2021        1     12      0        13 |
 13. |      1   2022        1     13      1        13 |
     |------------------------------------------------|
 14. |      2   2010        1      1      0         6 |
 15. |      2   2011        1      2      0         6 |
 16. |      2   2012        1      3      0         6 |
 17. |      2   2013        1      4      0         6 |
 18. |      2   2014        1      5      0         6 |
 19. |      2   2015        1      6      1         6 |
     |------------------------------------------------|
 20. |      2   2018        2      1      0         2 |
 21. |      2   2019        2      2      1         2 |
     +------------------------------------------------+

If your problem is instead or as well really with missing values of the dependent variable, then you need different code, and if you need a suggestion, then the best way to get there quickly is to give us an explicit example of problematic data.

Comment

Nuno Goncalves

Join Date: Nov 2020

Posts: 12
#3

26 Oct 2023, 07:49

Thank you Nick,

Your suggestion is very helpful, it's always an valuable input.
I think I found a less time consuming path. I leave here the code I used for those who may have similar situations and who may be useful:

by id (year), sort: gen spell = sum(year != year[_n-1]+1)
by id spell (year), sort: gen spell_size = _N
drop if spell_size <5

thankfully,
Nuno
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35724
#4

26 Oct 2023, 08:08

Yes; that would work and is good Stataish code.
Comment

Announcement