Delete duplicate firm year observations

tim malz

Join Date: Jan 2016

Posts: 19
#1

Delete duplicate firm year observations

09 Jan 2016, 16:05

Hello everyone,

I am a new stata user and I have a problem with my data set. Maybe it's quite easy for you experts or it has already been discussed in this forum, but I've been searching for a solution almost the whole day without any success. I have several thousand annual firm year data and the problem that there are many year duplicates. Usually, the duplicate has only missings or at least more missings for the numeric variables. The duplicate is not always the first observation, so that e.g. sort year ; duplicates drop year, force does not work. Furthermore, I want to delete only year duplicates within a specific firm id, but the by prefix does not work for duplicates drop.

The following table shows an example of my data set where I e.g. would like to delete observation 4.

Observation Firm_ID Year Sales
1 912 2006 3,320
2 917 2005 1,123
3 917 2006 1,341
4 917 2006 .

Do you know any smart solution for this problem?

Thank you!
TM
Tags: None
William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

09 Jan 2016, 20:50

Welcome to Statalist!

Here is some (incomplete) technique that shows how to keep only the observation with the least number of missing values for each given firm/year combination.

Code:

clear input Firm_ID Year Sales Foo 912 2006 3320 1111 917 2005 1123 2222 917 2006 1341 . 917 2006 . . end egen nmiss = rowmiss(Sales Foo) bysort Firm_ID Year (nmiss): keep if _n==1 list, clean

Code:

Firm_ID Year Sales Foo nmiss 1. 912 2006 3320 1111 0 2. 917 2005 1123 2222 0 3. 917 2006 1341 . 1

This has problems if you have ties for the lowest number of duplicates. Without a rule for ensuring ties don't happen, you're at the mercy of the sort as to which one comes first, and that need not be the same from run to run of this code. You need to give some thought to that issue, but this at least points you in the right direction. That's about as much as I'm able to contribute at the moment; if you have further problems, post them here and that may well lead to further comment. Or maybe someone will see this suggestion and tell us a better way to do it.
1 like
Comment
Nick Cox

Join Date: Mar 2014

Posts: 36053
#3

10 Jan 2016, 04:44

To William's good advice I will add a comment on duplicates, an official Stata command to which I contributed (see manual entry). The (or at least my) reason that by: is not supported is that it is quite unnecessary.

Code:

bysort firm : duplicates <whatever> year

could only mean, so far as I can see, exactly the same as

Code:

duplicates <whatever> firm year

duplicates supports searching for duplicates in any set of variables, including all of them, so that there is no distinction between (e.g.) identifiers and any other variables so far as it is concerned.

All that said, the key question for you is why such duplicates occur and to document a rational way to remove them. In addition to William's suggestion, it may be collapse is what you need.

Last edited by Nick Cox; 10 Jan 2016, 05:07.
Comment

Announcement

Delete duplicate firm year observations

Comment

Comment