How to delete duplicate records

Rochelle Zhang

Join Date: Jul 2025

Posts: 0
#1

How to delete duplicate records

30 Jan 2016, 17:08

Dear statalisters,
I have a dataset that has records that may be duplicates by two variables , below I show you a small set of observations to make my question
clear.

Code:

clear input int(firmid fyear) byte csale 1000 1980 19 1000 1980 . 1001 1980 2 end

when observations have the same firmid and fyear, I want to keep the records that have non-missing csale values. I all records that share the same firmid and fyear have missing csale, I want to keep at least 1 record.

I thought of using tag, but not sure if tag can be organized by csale value.

Thanks,
Rochelle
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30121
#2

30 Jan 2016, 17:32

Well, the basic task is fairly simple:

Code:

by firmid fyear (csale), sort: keep if _n == 1

This works because any non-missing value sorts before missing.

But this is an unsafe thing to do without further investigation. You need to be sure that there aren't two or more records with a non-missing value of csale. If there are, the above code would always choose the one with the smallest value, which might not be correct. So you need to first verify that this condition is met:

Code:

// DO THIS BEFORE THE CODE ABOVE gen byte csale_missing = missing(csale) by firmid fyear csale_missing (csale), sort: assert csale[1] == csale[_N] if !csale_missing

This will assure you that if there is more than one observation for a combination of firmid fyear with non-missing csale, the csale values are all the same. If the above assertion fails, then you need to look at the cases where it fails to figure out why you have inconsistent non-missing values of csale and how to resolve those inconsistencies.

There is one additional risk in using even those conditions: what about other variables in your data set? In your example, these are the only ones, so no problem. But if your real data has others, you need to verify that those variables all agree on the same value before you remove them (and figure out what's going on and how to fix it if they don't). Without knowing more about the possible other variables, or even if there are any, I won't write code here.
Comment
Rochelle Zhang

Join Date: Jul 2025

Posts: 0
#3

31 Jan 2016, 11:30

Thank you Clyde very much !!!!

Your comments are very useful. My real data is rather complex and the assertion failed. I will make further investigation to my data before I proceed.

Regards,
Rochelle
Comment
Attaullah Shah

Join Date: Aug 2014

Posts: 1669
#4

31 Jan 2016, 12:41

what about

Code:

duplicates drop firmid fyear, force

Regards
--------------------------------------------------
Attaullah Shah, PhD.
Professor of Finance, Institute of Management Sciences Peshawar, Pakistan
FinTechProfessor.com
https://asdocx.com
Check out my asdoc program, which sends outputs to MS Word.
For more flexibility, consider using asdocx which can send Stata outputs to MS Word, Excel, LaTeX, or HTML.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30121
#5

31 Jan 2016, 13:47

One should always think twice, three time, and four or more times before using the -force- option in any command. Particularly with -duplicates drop-. This will, indeed, remove duplicates and leave you with just one observation for each combination of firmid and fyear. But the whole point raised in #2 and #3 is that the duplicate observations contain conflicting information about the other variables. Using -force- simply obscures the problem without solving it. So before just arbitrarily selecting one and using it, good statistical practice requires first studying those duplicate observations to determine, a) why they even exist (if they shouldn't), and b) which one is the right one to keep, or, perhaps whether they need to be combined (e.g. by averaging, medians, etc.) into a single observation. It is more important to have correct data than to use a "quick and dirty" workaround so that you can rapidly arrive at meaningless and useless results.
Comment
Andrew Musau

Join Date: Oct 2014

Posts: 10223
#6

31 Jan 2016, 18:28

In the replies above, I think that there is a presumption that non-missing duplicates of csale are unwanted. I read the original problem differently.

when observations have the same firmid and fyear, I want to keep the records that have non-missing csale values. If all records that share the same firmid and fyear have missing csale, I want to keep at least 1 record.

This suggests to me that observations may have the same firmid and fyear, but different csale values. Indeed, if I suppose "csale" refers to "customer sale", then there is nothing inherently wrong with the same firm in the same year making two or more sales (hence multiple csales). The only remark that I would make is that "at least 1" is ambiguous because it may include anything between one to the maximum number of missing observations. That said, I will provide code that does the following:

1. If there are multiple non-missing csale values for the same firmid and fyear, keep all of them.
2. If there is a non-missing csale value for a given firmid and fyear, delete the observations with missing csale values.
3. For a group of observations with the same firmid and fyear, if all values for csale are missing, keep only one observation (you may change to clarify the "atleast one" ambiguity).

The following expanded example illustrates:

Code:

clear input int(firmid fyear) byte csale 1000 1980 19 1000 1980 . 1001 1980 2 1001 1980 3 1002 1980 . 1002 1980 . 1002 1980 . 1003 1980 . 1003 1980 7 1003 1980 . 1004 1980 . end

In the following, firmid 1002 has all missing values, so we keep one, whereas firmid 1001 has 2 non-missing csale values, so we keep both.

Code:

. list, sepby(firmid) +------------------------+ | firmid fyear csale | |------------------------| 1. | 1000 1980 19 | 2. | 1000 1980 . | |------------------------| 3. | 1001 1980 2 | 4. | 1001 1980 3 | |------------------------| 5. | 1002 1980 . | 6. | 1002 1980 . | 7. | 1002 1980 . | |------------------------| 8. | 1003 1980 . | 9. | 1003 1980 7 | 10. | 1003 1980 . | |------------------------| 11. | 1004 1980 . | +------------------------+

Here is one way to proceed:

Code:

gen byte tag = missing(csale) bys firmid fyear: egen tag2= mean(tag) drop if tag==1 & tag2!=1 bys firmid fyear: replace tag2= sum(tag2) drop if tag==1 & tag2!=1 drop tag tag2 list, sepby(firmid)

Code:

. list, sepby(firmid) +------------------------+ | firmid fyear csale | |------------------------| 1. | 1000 1980 19 | |------------------------| 2. | 1001 1980 2 | 3. | 1001 1980 3 | |------------------------| 4. | 1002 1980 . | |------------------------| 5. | 1003 1980 7 | |------------------------| 6. | 1004 1980 . | +------------------------+
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30121
#7

31 Jan 2016, 18:35

Andrew may well be right. I understood the original post to imply that only one value of csale was to be retained, but on re-reading the post it does not actually say that. Here is another way to do what he has proposed that is a bit shorter, and to me, at least, a bit clearer:

Code:

gen byte missing_csale = missing(csale) by firmid fyear (missing_csale csale), sort: keep if _n == 1 | !missing_csale

The logic works because if there are non-missing observations of csale will all be retained, and the missing observation(s) all sort at the end, so they are not retained by either part of the -if- clause. When, all observations of csale are missing, only the first is kept.

Last edited by Clyde Schechter; 31 Jan 2016, 18:43.
Comment
Andrew Musau

Join Date: Oct 2014

Posts: 10223
#8

31 Jan 2016, 18:57

Thanks Clyde for a more efficient solution!
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30121
#9

01 Feb 2016, 10:11

Actually, overnight I realized that it can be reduced to a single line:

Code:

by firmid fyear (csale), sort: keep if _n == 1 | !missing_csale

(The variable missing_csale that I used in #7 is not actually needed because the missing values automatically sort to the end anyhow.)
Comment
bishal bharadwaj

Join Date: Jul 2014

Posts: 3
#10

31 Mar 2019, 23:15

Thank you Clyde. Your suggestions are really helpful. Your post provides great help to me like student who are looking for ways to resolve issues. Thanks once again.
Comment

Announcement

How to delete duplicate records

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment