  • Dropping disappeared firms

    Dear Statalist,

    I want to drop firms that are not observed for the entire period. There are 2 variables:
    i. year: 2010-2021
    ii. firm ID: NPC_FIC
    During 2010-2021 some firms exit the market and do not stay for the whole period. I need to keep only firms that stay in the market for all 12 years. The data is massive, so I cannot tell by inspection which firms left. Thank you for your ideas.
    Code:
    clear
    input double NPC_FIC int year
    500979723 2010
    500982050 2010
    501615311 2010
    501615918 2010
    500990290 2010
    500992141 2010
    502481463 2010
    501616580 2010
    end
    Listed 100 out of 4207538 observations
    I have included only a few observations as examples.


    Cheers,
    Paris

  • #2
    If there really are only two variables in your dataset:
    Code:
    duplicates drop
    bysort NPC_FIC: egen years = _N
    drop if years<12
    But if there are other variables and, as in your other example datasets, the data were collected at a lower-than-firm level with multiple observations per firm/year:
    Code:
    egen firmyear = tag(NPC_FIC year)
    bysort NPC_FIC: egen years = total(firmyear)
    drop if years<12
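
    The tag-and-total approach above can be checked on a small made-up dataset. This is only a sketch: the firm IDs 1 and 2 and the variable workers are hypothetical stand-ins, and the cutoff of 12 assumes the panel covers exactly 2010-2021.

```stata
* Hypothetical toy data: firm 1 has two observations in each of the
* 12 years 2010-2021; firm 2 has one observation in each of 2010-2015.
clear
set obs 30
generate double NPC_FIC = cond(_n <= 24, 1, 2)
generate int year = cond(_n <= 24, 2010 + mod(_n - 1, 12), 2010 + (_n - 25))
* Stand-in for the other variables in the real data
generate workers = _n

* firmyear is 1 on exactly one observation per firm-year and 0 elsewhere
egen firmyear = tag(NPC_FIC year)
* Summing the tags within firm counts the distinct years for that firm
bysort NPC_FIC: egen years = total(firmyear)
drop if years < 12

* Only firm 1 should remain
tabulate NPC_FIC
```

    Because the tag is 1 only once per firm/year, repeated observations within a year do not inflate the count.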


    • #3
      Code:
       duplicates drop
      
      Duplicates in terms of all variables
      
      (0 observations are duplicates)
      
      . 
      . bysort NPC_FIC: egen years = _N
      unknown egen function _N()
      r(133);
      Actually, there are other variables like workers, so I tried the second code:
      Code:
      egen firmyear = tag(NPC_FIC year)
      
      . 
      . bysort NPC_FIC: egen years = total(firmyear)
      
      . drop if years<12
      (6,822,153 observations deleted)
      Thank you so much, Prof. William.


      • #4
        For those who find this topic at a later date, the first code in post #2 should have read
        Code:
        duplicates drop
        bysort NPC_FIC: generate years = _N
        drop if years<12
        But that would not have improved its performance in this case, because as noted in post #3, the actual data contained more than just the two variables shown in the example data.
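
        A small sketch of why counting with _N is not enough when there are several observations per firm/year. The data here are made up: firm ID 3 and the variable workers are hypothetical, and the names years_N and years_tag are chosen only for this illustration.

```stata
* Hypothetical firm observed in only 6 years (2010-2015),
* but with two observations per year, so 12 observations in total.
clear
set obs 12
generate double NPC_FIC = 3
generate int year = 2010 + mod(_n - 1, 6)
* workers differs on every row, so no row is a full duplicate
generate workers = _n

duplicates drop

* _N counts observations per firm, not distinct years
bysort NPC_FIC: generate years_N = _N

* tag() marks one observation per firm-year; total() counts those marks
egen firmyear = tag(NPC_FIC year)
bysort NPC_FIC: egen years_tag = total(firmyear)

list NPC_FIC year years_N years_tag in 1/3
```

        Here years_N is 12 although the firm appears in only 6 distinct years, so a rule like drop if years<12 based on _N would wrongly keep this firm, while the tag-based count would drop it.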
