Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Delete duplicate firm year observations

    Hello everyone,

    I am a new stata user and I have a problem with my data set. Maybe it's quite easy for you experts or it has already been discussed in this forum, but I've been searching for a solution almost the whole day without any success. I have several thousand annual firm year data and the problem that there are many year duplicates. Usually, the duplicate has only missings or at least more missings for the numeric variables. The duplicate is not always the first observation, so that e.g. sort year ; duplicates drop year, force does not work. Furthermore, I want to delete only year duplicates within a specific firm id, but the by prefix does not work for duplicates drop.

    The following table shows an example of my data set where I e.g. would like to delete observation 4.

    Observation Firm_ID Year Sales
    1 912 2006 3,320
    2 917 2005 1,123
    3 917 2006 1,341
    4 917 2006 .

    Do you know any smart solution for this problem?

    Thank you!
    TM

  • #2
    Welcome to Statalist!

    Here is some (incomplete) technique that shows how to keep only the observation with the least number of missing values for each given firm/year combination.
    Code:
    clear
    input Firm_ID Year Sales Foo
    912 2006 3320 1111
    917 2005 1123 2222
    917 2006 1341 .
    917 2006 .    .
    end
    egen nmiss = rowmiss(Sales Foo)
    bysort Firm_ID Year (nmiss): keep if _n==1
    list, clean
    Code:
           Firm_ID   Year   Sales    Foo   nmiss  
      1.       912   2006    3320   1111       0  
      2.       917   2005    1123   2222       0  
      3.       917   2006    1341      .       1
    This has problems if you have ties for the lowest number of duplicates. Without a rule for ensuring ties don't happen, you're at the mercy of the sort as to which one comes first, and that need not be the same from run to run of this code. You need to give some thought to that issue, but this at least points you in the right direction. That's about as much as I'm able to contribute at the moment; if you have further problems, post them here and that may well lead to further comment. Or maybe someone will see this suggestion and tell us a better way to do it.

    Comment


    • #3
      To William's good advice I will add a comment on duplicates, an official Stata command to which I contributed (see manual entry). The (or at least my) reason that by: is not supported is that it is quite unnecessary.

      Code:
      bysort firm : duplicates <whatever> year


      could only mean, so far as I can see, exactly the same as

      Code:
      duplicates <whatever> firm year
      duplicates supports searching for duplicates in any set of variables, including all of them, so that there is no distinction between (e.g.) identifiers and any other variables so far as it is concerned.

      All that said, the key question for you is why such duplicates occur and to document a rational way to remove them. In addition to William's suggestion, it may be collapse is what you need.

      Last edited by Nick Cox; 10 Jan 2016, 05:07.

      Comment

      Working...
      X