Deleting observations in panel data.

Laurens Veenman

Join Date: Dec 2015

Posts: 5
#1

Deleting observations in panel data.

16 Dec 2015, 11:42

Hi there,

I'm writing my thesis and using STATA 13. I'm using panel data for the years 2004-2014. However due the fact that some companies aren't listed for the whole period, I also have companies in my data that for example only have data for 2005-2009. I would like to remove all companies which don't have observations for the whole period(2004-2014).

Thanks in advance and greetings,

Laurens
Tags: None

1 like
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17700
#2

16 Dec 2015, 11:47

Laurens:
your idea does not sound that good, as in that way you would end up with a selected sample of companies, with a subsequent bias in your inference.
Please consider that Stata can easily handle unbalanced panels.

Last edited by Carlo Lazzaro; 16 Dec 2015, 12:15.

Kind regards,
Carlo
(Stata 19.0)
2 likes
Comment
Karim Syed

Join Date: Jan 2016

Posts: 87
#3

21 Jan 2016, 12:46

Originally posted by Carlo Lazzaro View Post

Laurens:
your idea does not sound that good, as in that way you would end up with a selected sample of companies, with a subsequent bias in your inference.
Please consider that Stata can easily handle unbalanced panels.

Dear Carlo Lazzaro, I have a question of same sort, and you reasoning sounds good, however, the nature of my study requires such treatment. Let me explain a bit. I have an initial data set of 6500 firms for 15 years, with approximately 12 variables. My variable of interest is dividend per share and I want to retain only those firms which are dividend payers and which has at least 5 years of information avaliable on dividend per share variable for my final data set. Most of the firms fail on this test which I am trying to exclude from my data set. Can you kindly suggest the stata code using which I can delete all those firms which have inadequate information (less than 5 years) on my variable of interest. My year id is year and panel id is code in my data set.
I have used the following code which didn't work in my case returning message (0 observation deleted).

. xtdes
. local maxobs=r(max)
. by code: generate obs2=_N
. drop if obs2<5

I also applied by sort option as follows

. by code (year), sort: keep if _N >=5

Please help.

Regards
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35613

21 Jan 2016, 12:55

Karim: In principle, you have the right idea. See this:

Code:

. input code year 

          code       year
  1. 1  2001
  2. 1  2002
  3. 1  2003
  4. 1  2004
  5. 1  2005
  6. 2  2001
  7. 2  2002
  8. end 

. list, sepby(code)

     +-------------+
     | code   year |
     |-------------|
  1. |    1   2001 |
  2. |    1   2002 |
  3. |    1   2003 |
  4. |    1   2004 |
  5. |    1   2005 |
     |-------------|
  6. |    2   2001 |
  7. |    2   2002 |
     +-------------+

. bysort code: drop if _N < 5 
(2 observations deleted)

. list, sepby(code)

     +-------------+
     | code   year |
     |-------------|
  1. |    1   2001 |
  2. |    1   2002 |
  3. |    1   2003 |
  4. |    1   2004 |
  5. |    1   2005 |
     +-------------+

So, what is different about your variable? Show us the result of xtdes

Comment

Karim Syed

Join Date: Jan 2016

Posts: 87
#5

21 Jan 2016, 13:46

Thanks Nick for the code.

But I think, my case is bit different. In your example, firm 2 enters for only 2 years (i.e. 2001 and 2002) in the initial data set, which is removed in final data set after the code you suggest.

However, in my case the data is so entered in stata that all cases have all years. See the structure of data below. You will find that from firm 1 to 5, all have all years (2000-2015) entered in the data set. The same applies to the rest of the firms. However, you will also find that the information availability on dps (dividend per share) for each case varies. So, firm 1 has 10 year information on dps, while firm 2 has only 7 years and so on. I have 6500 such firms, many of them have either no or inadequate information on dps, although they are present in the data set for all the 16 years (2000-2015). This is due to the fact that the data set was downloaded from a database for the period 2000 to 2015, so the data base created a balanced panel on its own even though it contained many missing values as shown below.

Now, I want to remove all those firms which fail to pass the test of minimum 5 years of data availability on my variable of interest dps (divided per share).

code year dps
1 2000
1 2001
1 2002
1 2003
1 2004 0
1 2005 0
1 2006 0
1 2007 0
1 2008 0
1 2009 0
1 2010
1 2011 0
1 2012 0
1 2013 0
1 2014 0
1 2015
2 2000
2 2001
2 2002
2 2003
2 2004
2 2005
2 2006
2 2007 0
2 2008 0
2 2009 0
2 2010 0
2 2011
2 2012 0
2 2013 0
2 2014 0
2 2015
3 2000
3 2001 0
3 2002 0
3 2003 0
3 2004
3 2005 0
3 2006 0
3 2007 0
3 2008 0
3 2009 0
3 2010
3 2011 0
3 2012 0
3 2013 0
3 2014 0
3 2015
4 2000
4 2001
4 2002
4 2003
4 2004
4 2005 0
4 2006 0
4 2007 0
4 2008 0
4 2009 0
4 2010 0
4 2011 0
4 2012
4 2013 0
4 2014 0
4 2015
5 2000
5 2001
5 2002
5 2003
5 2004
5 2005
5 2006 0
5 2007 0
5 2008 0
5 2009 0
5 2010 0
5 2011
5 2012
5 2013 0
5 2014 0
5 2015

As for the result of xtdes, this is what I get after I run the code. I think it is also because the data set has all firms for all years even though with missing values on dps and other variables used in my analysis.

. xtdes

code: 1, 2, ..., 6448 n = 6448
year: 2000, 2001, ..., 2015 T = 16
Delta(year) = 1 unit
Span(year) = 16 periods
(code*year uniquely identifies each observation)

Distribution of T_i: min 5% 25% 50% 75% 95% max
16 16 16 16 16 16 16

Freq. Percent Cum. | Pattern
---------------------------+------------------
6448 100.00 100.00 | 1111111111111111
---------------------------+------------------
6448 100.00 | XXXXXXXXXXXXXXXX

.
. local maxobs=r(max)

. by code: generate obs2=_N

. drop if obs2<5
(0 observations deleted)

Thanks.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30063
#6

21 Jan 2016, 14:13

So, I think you want this:

Code:

egen dps_availability = total(!missing(dps)), by(code) drop if dps_availability < 5
1 like
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35613
#7

21 Jan 2016, 14:18

I agree with Clyde. As your code made no mention of dps, it did not do what you want. Note that count(dps) within an egen call is another way to approach this.
1 like
Comment
Karim Syed

Join Date: Jan 2016

Posts: 87
#8

21 Jan 2016, 16:24

Many thanks to both of you. The code worked, Clyde. Thanks.
Comment
Karim Syed

Join Date: Jan 2016

Posts: 87
#9

13 Feb 2016, 05:19

Hi Dear Clyde and Nick

Please refer to the code provided above in #6 comment of this thread. The code drops firms with less than 5 observations on a variable (dps in this case). Suppose instead of dps alone I want firms to have at least 5 observations on dps eps cps and tps (where dps eps cps and tps are all variables), how can I modify the code to achieve this objective.
Please help.
Regards
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35613
#10

13 Feb 2016, 08:14

Presuming that by "observations" you mean "non-missing values" then just extend to

Code:

!missing(dps, eps, cps, tps)

which follows from the definition and syntax of missing(): see help for missing()

In Stata an observation is a row (record) in the dataset, not any particular value.
Comment
Muhammad Rashid

Join Date: Aug 2018

Posts: 38
#11

17 Aug 2018, 17:19

Dear All
I am new to this forum and will be greatful if you can help.
I have a panel dataset over 8 year. I have a categorical variable (school) in the dataset which denotes 334 schools in a state ( coded from 1 to 334). I need to drop low-frequency schools in a particular year. for example, I want to drop school 1 in the year 2007 if it had less than 50 students. similarly, drop any school which had less than 50 admissions in 2008 and so on... so essentially I need to drop any school which had less than 50 admissions in a particular year during the study period.

Thank you
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30063
#12

17 Aug 2018, 17:28

When asking for help with code, it is wise to show example data. Please use the -dataex- command to do so. If you are running version 15.1 or a fully updated version 14.2, it is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

If we assume that each observation in your data set has a variable, school, which identifies the school, and a variable, year, which identifies the year, and that each observation is an admission, and if you want to drop a school's observations only in those years when it has fewer than 50 admissions (so if school 1 has 55 admissions in 2009 and 45 admissions in 2010, you will drop school 1's 2010 observations but retain its 2009 observations) then it would be like this:

Code:

by school year, sort: drop if _N < 50

Note: because there was no data example given, this code is not tested. It may contain errors, or it may be correct but unsuitable for your data.

If that doesn't do what you want, post back with a clearer explanation and example data using -dataex-.
Comment
Muhammad Rashid

Join Date: Aug 2018

Posts: 38
#13

18 Aug 2018, 05:02

Originally posted by Clyde Schechter View Post

When asking for help with code, it is wise to show example data. Please use the -dataex- command to do so. If you are running version 15.1 or a fully updated version 14.2, it is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

If we assume that each observation in your data set has a variable, school, which identifies the school, and a variable, year, which identifies the year, and that each observation is an admission, and if you want to drop a school's observations only in those years when it has fewer than 50 admissions (so if school 1 has 55 admissions in 2009 and 45 admissions in 2010, you will drop school 1's 2010 observations but retain its 2009 observations) then it would be like this:

Code:

by school year, sort: drop if _N < 50

Note: because there was no data example given, this code is not tested. It may contain errors, or it may be correct but unsuitable for your data.

If that doesn't do what you want, post back with a clearer explanation and example data using -dataex-.

Dear Clyde

Thank you for the reply and helping, it works perfect, I am new to this forum so was not aware of -dataex- , will use this for future, thank you once again
Comment
Carlos Conde

Join Date: Jun 2019

Posts: 1
#14

07 Jun 2019, 07:31

Hi there,
I am cleaning my data. And I would like to drop all observation of household variable if at least one of the observations in age variable is greater than 17 (more than 17 years of age). Your kind help will be greatly appreciated.

Best
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30063
#15

07 Jun 2019, 15:38

So, on the assumption that your data is in long layout, and that you have a numeric variable called age, and the household variable is called household, it will look like this:

Code:

by household: egen max_age = max(age) drop if max_age > 17

In the future, whenever you ask for help with coding, show example data. The above code may have been a complete waste of both of our times if your data do not conform to what I described above. To avoid getting answers that aren't helpful, show example data, and do that using the -dataex- command. If you are running version 15.1 or a fully updated version 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
1 like
Comment

Announcement

Deleting observations in panel data.

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment