Dropping duplicates takes more than 24 hr.

Jesse Freund

Join Date: Mar 2021

Posts: 8
#1

03 Mar 2021, 23:10

Dear Stata forum,

I am quite new to Stata and I ran into an issue. I have panel data with quarterly observations from 1965 until 2020 for all US firms included in Compustat.
I want to drop the duplicates by using the following command:

duplicates drop gvkey Year_Quarter, force

However, Stata has been running for 24 hours now and nothing has happened. Is there a different, faster way to find all the duplicates and remove them from the sample?

Best regards
Joro Kolev

Join Date: Aug 2018

Posts: 3050
#2

04 Mar 2021, 01:27

I would first test on a smaller subsample of your data whether the command does what you want at all. If you are satisfied, let the computer run for a day, three days, a week, or however long is necessary; you need to do this only once.
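A minimal sketch of such a trial run, assuming the data are already in memory. The cutoff of 100,000 observations is an arbitrary choice for illustration, and if the data are unsorted the first rows may contain few duplicate pairs, so treat the timing as a rough guide only:

```stata
* Work on a copy so the full dataset is restored afterwards
preserve

* Keep the first 100,000 observations (or all of them, if there are fewer)
keep in 1/`=min(_N, 100000)'

* Time the command on the subsample
timer clear 1
timer on 1
duplicates drop gvkey Year_Quarter, force
timer off 1
timer list 1

* Bring back the full dataset
restore
```

If the subsample finishes in seconds, a run of many hours on the full data points to a problem other than the command itself.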

I do not know of a faster version of -duplicates-.

You can look at this thread as well.
https://www.statalist.org/forums/for...icient-options
Andrew Musau

Join Date: Oct 2014

Posts: 10218
#3

04 Mar 2021, 05:51

duplicates drop gvkey Year_Quarter, force

If you are sure that you are not losing any information by keeping only one instance of each gvkey and Year_Quarter combination, then the duplicates drop command is equivalent to either

Code:

bys gvkey Year_Quarter: keep if _n==1

or

Code:

bys gvkey Year_Quarter: drop if _n>1

Generally, sorting can be time-consuming in large datasets.
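Before dropping anything, it may be worth checking how many duplicates there are and whether the copies really are identical. The following is only a sketch using the built-in duplicates subcommands; the variable name dup is an assumption chosen for illustration:

```stata
* Count how many observations share each gvkey/Year_Quarter pair
duplicates report gvkey Year_Quarter

* Flag surplus copies: dup is 0 for unique pairs, 1 or more otherwise
duplicates tag gvkey Year_Quarter, generate(dup)
count if dup > 0

* Among the flagged rows, check whether copies agree on ALL variables
duplicates report if dup > 0
```

If the flagged copies differ on other variables, note that the order of observations within each gvkey and Year_Quarter group is not fixed after sorting, so which copy survives `keep if _n==1` is arbitrary.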
Jesse Freund

Join Date: Mar 2021

Posts: 8
#4

10 Mar 2021, 00:09

Thanks for your answers guys!
Daniel Feenberg

Join Date: Oct 2014

Posts: 323
#5

10 Mar 2021, 06:35

Something has gone wrong. I don't believe your computer is still working on the problem. An ordinary desktop should be able to drop duplicate observations from a 10 million record file in less than a minute. Using Musau's -bys- method would be twice as fast. Is it possible you are out of memory and the system is paging to disk? That would slow things down so much that the job might never complete.
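One way to check this, sketched here under the assumption that the dataset still loads, is to look at its size and at Stata's memory usage, and then shrink the storage types where possible:

```stata
* Number of observations, number of variables, and dataset size
describe, short

* Report how much memory Stata is currently using
memory

* Store each variable in the smallest type that holds its values
compress
```

Comparing the reported size with the machine's physical RAM should show whether paging is a plausible explanation.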
Jesse Freund

Join Date: Mar 2021

Posts: 8
#6

12 Mar 2021, 02:37

Dear Daniel Feenberg,

I solved the issue by dropping the duplicates on the subsamples before merging, thanks for your answer.
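For anyone with the same problem, the approach looks roughly like the sketch below. The file names part_a.dta and part_b.dta are hypothetical placeholders, and a 1:1 merge is assumed to be the appropriate merge type:

```stata
* Hypothetical file names; replace with your own
use "part_a.dta", clear
bysort gvkey Year_Quarter: keep if _n == 1
tempfile a_unique
save `a_unique'

use "part_b.dta", clear
bysort gvkey Year_Quarter: keep if _n == 1

* merge 1:1 stops with an error if either side still has duplicate keys
merge 1:1 gvkey Year_Quarter using `a_unique'
```

Deduplicating each piece first keeps every sort small, and the 1:1 merge doubles as a check that the keys are unique.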