  • Improving approach for keeping the first two observations for each group

    Hi,

    consider the following data structure:

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input double FirmID long ExecutiveID int Year
    4295899290 18101 2010
    4295899290 18101 2010
    4295899290 18101 2010
    4295899290 35166 2011
    4295899290 35166 2012
    4295899290 35166 2012
    4295899290 35166 2012
    4295899290 35166 2012
    4295899290 35166 2013
    4295899290 35166 2013
    4295899290 35166 2013
    4295899290 35166 2013
    4295899290 35166 2014
    4295899290 35166 2014
    4295899290 35166 2014
    4295899290 35166 2014
    4295899290 35166 2015
    4295899290 35166 2015
    4295899290 35166 2015
    4295899290 35166 2015
    end

    For every firm-executive combination, I want to keep only the observations for its first two distinct years, with one observation per year.
    That is, the firm-executive combination 4295899290-18101 should have only one entry, for 2010, because all of its observations are for 2010.
    4295899290-35166 should have two entries, one for 2011 and one for 2012.

    My approach does work (I guess) but seems a bit complex, so I was wondering whether there is a more elegant way to obtain the same result.

    Code:
    sort FirmID Year ExecutiveID 
    egen tag = tag(FirmID ExecutiveID Year)
    gen fyind = Year if tag != 0
    
    drop if missing(fyind)
    bysort FirmID ExecutiveID: gen abs= _n
    keep if abs<=2
    Thank you!

  • #2
    Code:
    bysort FirmID ExecutiveID (Year) : keep if sum(Year != Year[_n-1]) <= 2 
    duplicates drop
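
    To see why the first line works, it may help to display the running count it relies on. The sketch below assumes the example data from #1 are in memory; counter is a hypothetical helper variable, created only for inspection and not needed for the solution itself.

    ```stata
    * Assumes the example data from #1 are in memory.
    * counter is a hypothetical name used only to show what sum() returns.
    * Under by:, Year[_n-1] is missing in the first observation of each group,
    * so the comparison is true there and the count starts at 1.
    * It then goes up by 1 each time Year changes within the group.
    bysort FirmID ExecutiveID (Year) : gen counter = sum(Year != Year[_n-1])
    list FirmID ExecutiveID Year counter, sepby(FirmID ExecutiveID)
    drop counter
    ```

    Observations with a count of 1 or 2 belong to the first two distinct years of their group, which is what the keep if condition selects before duplicates drop removes the repeats.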


    • #3
      That's way cleaner. Thank you, Nick.


      • #4
        This is a bit simpler still:

        Code:
        duplicates drop
        bysort FirmID ExecutiveID (Year) : keep if _n <= 2

        EDIT:

        If we go back to your code, it is very close in spirit, but we can simplify it:

        Code:
        sort FirmID Year ExecutiveID
        egen tag = tag(FirmID ExecutiveID Year)
        gen fyind = Year if tag != 0
        drop if missing(fyind)
        That could be

        Code:
        egen tag = tag(FirmID ExecutiveID Year)
        keep if tag
        drop tag
        and that is equivalent to using duplicates.

        Code:
        bysort FirmID ExecutiveID: gen abs= _n
        keep if abs<=2
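        That is equivalent to my second command in this post, except that my command sorts on Year within each group explicitly, which your bysort does not guarantee.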

        The style lesson is not to generate variables you don't really need.
        Thus fyind is based on Year, but you can use other variables for your purpose.
        abs is based on observation number, but use observation number directly.
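
        The equivalence claimed above can be checked directly. This is a minimal sketch, assuming the example data from #1 are in memory and that the dataset holds only FirmID, ExecutiveID and Year (duplicates drop without a variable list compares all variables, so with extra variables the two routes could differ). The tempfile name direct is arbitrary, and the lines need to be run together from a do-file so that the local macro survives.

        ```stata
        * Route 1: duplicates drop, then keep the first two years per group.
        preserve
        duplicates drop
        bysort FirmID ExecutiveID (Year) : keep if _n <= 2
        tempfile direct
        save `direct'
        restore

        * Route 2: tag() instead of duplicates, then the same keep.
        egen tag = tag(FirmID ExecutiveID Year)
        keep if tag
        drop tag
        bysort FirmID ExecutiveID (Year) : keep if _n <= 2

        * cf exits with an error if the two datasets differ in any value.
        cf _all using `direct'
        ```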

        There is a Dolly Parton-type saying lurking here, something like "It can take a lot of time to be this concise". Pascal got there earlier, saying that he would have written a shorter letter if he had had more time.

        More seriously, tag() was written to allow idioms like if tag because it never produces missing values. There has been occasional flak about that, but as the original author I stand by the intent.
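
        The point matters because Stata treats any nonzero value, including missing, as true in an if condition, so an indicator that could be missing would be unsafe in that idiom. A small sketch of the behaviour, assuming the example data from #1 are in memory; the if Year >= 2012 restriction is there only to create excluded observations:

        ```stata
        * Assumes the example data from #1 are in memory.
        * Observations excluded by the if qualifier get 0, not missing.
        egen tag = tag(FirmID ExecutiveID Year) if Year >= 2012
        assert !missing(tag)
        tabulate tag
        * Because tag is only ever 0 or 1, "if tag" selects exactly the tagged rows.
        count if tag
        drop tag
        ```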
        Last edited by Nick Cox; 17 Nov 2023, 05:16.
