Improving approach for keeping the first two observations for each group

Marc Pelow

Join Date: Jul 2021
Posts: 85


16 Nov 2023, 23:22

Hi,

consider the following data structure:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input double FirmID long ExecutiveID int Year
4295899290 18101 2010
4295899290 18101 2010
4295899290 18101 2010
4295899290 35166 2011
4295899290 35166 2012
4295899290 35166 2012
4295899290 35166 2012
4295899290 35166 2012
4295899290 35166 2013
4295899290 35166 2013
4295899290 35166 2013
4295899290 35166 2013
4295899290 35166 2014
4295899290 35166 2014
4295899290 35166 2014
4295899290 35166 2014
4295899290 35166 2015
4295899290 35166 2015
4295899290 35166 2015
4295899290 35166 2015
end

For every firm-executive combination, I want to keep only one observation for each of the first two distinct years.
That is, the firm-executive combination 4295899290-18101 should have only one entry, for the year 2010, as all of its observations are linked to 2010.
4295899290-35166 should have two entries, one for 2011 and another for 2012.

My approach does work (I guess), but it seems a bit complex, so I was wondering whether there is a more elegant way to obtain the same result.

Code:

sort FirmID Year ExecutiveID 
egen tag = tag(FirmID ExecutiveID Year)
gen fyind = Year if tag != 0

drop if missing(fyind)
bysort FirmID ExecutiveID: gen abs= _n
keep if abs<=2

Thank you!


Nick Cox

Join Date: Mar 2014
Posts: 35757

17 Nov 2023, 04:23

Code:

bysort FirmID ExecutiveID (Year) : keep if sum(Year != Year[_n-1]) <= 2 
duplicates drop
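
To see why the running sum works, it may help to store it in a variable and list it. This is only a sketch: nyears is a hypothetical helper name, not part of the answer above, and the example data from the first post are assumed to be in memory. Within each by-group, Year[_n-1] is missing for the first observation, so the comparison is true there and the sum starts at 1; it then goes up by 1 each time Year changes.

```stata
* Sketch only: nyears is a hypothetical helper variable for inspection
bysort FirmID ExecutiveID (Year) : gen nyears = sum(Year != Year[_n-1])

* nyears counts distinct years so far within each firm-executive group
list FirmID ExecutiveID Year nyears, sepby(FirmID ExecutiveID)

drop nyears
```

Observations with nyears of 1 or 2 are those belonging to the first two distinct years, which is what the keep condition selects.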


Marc Pelow

Join Date: Jul 2021

Posts: 85
#3

17 Nov 2023, 04:38

That's way cleaner. Thank you, Nick.
Nick Cox

Join Date: Mar 2014

Posts: 35757
#4

17 Nov 2023, 04:56

This is a bit simpler yet:

Code:

duplicates drop
bysort FirmID ExecutiveID (Year) : keep if _n <= 2
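
A quick way to confirm that the two solutions agree is to run both and compare the results with cf. The following is a sketch meant to be run as one do-file, assuming the example data from the first post are in memory and that the dataset holds only these three variables (with other variables present, duplicates drop without a varlist would behave differently). The tempfile name first is hypothetical.

```stata
* Sketch: check that the two approaches leave the same observations
preserve
    bysort FirmID ExecutiveID (Year) : keep if sum(Year != Year[_n-1]) <= 2
    duplicates drop
    sort FirmID ExecutiveID Year
    * first is a hypothetical name for the saved result
    tempfile first
    save `first'
restore

duplicates drop
bysort FirmID ExecutiveID (Year) : keep if _n <= 2
sort FirmID ExecutiveID Year

* cf exits with an error if the two datasets differ
cf _all using `first'
```

Both results are sorted on all three variables before the comparison because cf compares observation by observation.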

EDIT:

If we go back to your code, it is very close in spirit, but we can simplify it:

Code:

sort FirmID Year ExecutiveID
egen tag = tag(FirmID ExecutiveID Year)
gen fyind = Year if tag != 0
drop if missing(fyind)

That could be

Code:

egen tag = tag(FirmID ExecutiveID Year)
keep if tag
drop tag

and that is equivalent to using duplicates.

Code:

bysort FirmID ExecutiveID: gen abs= _n
keep if abs<=2

That is equivalent to my second command in this post.
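
One possible caveat on that equivalence, offered as a reading of the code rather than anything stated in the thread: the quoted bysort line does not name Year, and Stata's sort does not promise to preserve the earlier ordering among observations that tie on the sort keys unless the stable option is used. The first two observations in each group are therefore not guaranteed to be the two earliest years. A sketch that makes the ordering explicit, assuming the example data are in memory:

```stata
* Sketch: keep one observation per firm-executive-year
egen tag = tag(FirmID ExecutiveID Year)
keep if tag
drop tag

* (Year) sorts on Year within each FirmID-ExecutiveID group
* without making Year part of the group definition
bysort FirmID ExecutiveID (Year) : keep if _n <= 2
```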

The style lesson is not to generate variables you don't really need.
Thus fyind is based on Year, but you can use other variables for your purpose.
abs is based on observation number, but use observation number directly.

There is a Dolly Parton-type saying lurking here, something like "It can take a lot of time to be this concise". Pascal was there earlier, saying that he would have written a shorter letter if he had had more time.

More seriously, tag() was written to allow idioms like if tag because it never produces missing values. There has been occasional flak about that, but as the original author I stand by the intent.
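
The point about missing values matters because Stata treats any nonzero value, including missing, as true in an if condition. A minimal sketch of the idiom, assuming the example data as first posted are in memory:

```stata
* Sketch: tag() marks one observation per distinct group with 1, all others with 0
egen tag = tag(FirmID ExecutiveID Year)

* tag is never missing, so it only ever takes the values 0 and 1
assert inlist(tag, 0, 1)

* hence -if tag- selects exactly the tagged observations
count if tag

drop tag
```

On the 20 example observations the count should be 6: one firm-executive-year for executive 18101 and five for executive 35166.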

Last edited by Nick Cox; 17 Nov 2023, 05:16.
