
  • Long processing time. More efficient options?

    I am working with a dataset with 24 variables and ~44 million observations. For unknown reasons, the raw data contains about 5,000 duplicates (found with duplicates tag, gen()). I am trying to drop the duplicates right now with:

    Code:
    duplicates drop permno date, force

    My computer has been calculating for six hours now and doesn't seem to come to an end. It is not a huge workstation: it has an i7 processor, a 256 GB SSD and 8 GB of RAM. Right now Stata is using 7,840 MB of RAM.

    Does anyone have similar experience and know a more efficient way to drop the duplicates?

    P.S.: An estimated processing time in the data section in the bottom right corner would be a nice additional feature.

    Best regards,
    Felix

  • #2
    If you have already tagged the duplicates, couldn't you just drop all observations with the right tag?
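
    For reference, a minimal sketch of what that tag looks like, assuming it was created with duplicates tag and is called dup (the name is hypothetical): the tag counts the other observations with the same permno-date key, so it is 0 for unique observations and at least 1 for every member of a duplicated group, the "original" included, which is why a plain drop on the tag alone is not quite enough.

    Code:
    * hypothetical tag variable, created earlier with something like
    * duplicates tag permno date, generate(dup)
    * dup == 0 : unique permno-date combination
    * dup >= 1 : every observation of a duplicated group, "original" included,
    *            so "drop if dup >= 1" on its own would drop too much
    tabulate dup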



    • #3
      The "original" and the duplicate observations are both tagged with a 1. How can I make sure that only the duplicate observations get deleted?



      • #4
        Like so:

        Code:
        * toy example: create a duplicate company-year pair in the grunfeld data
        webuse grunfeld, clear
        replace year = 1935 if company==1 & year==1936
        * tag observations that are duplicated in terms of company and year
        duplicates tag company year, generate(dup)
        * within each company-year group, keep only the first observation
        bys company year dup: drop if _n!=1

        Although I have to admit I have no idea whether this is any faster.

        Also note that you are dropping 'random' duplicates, in the sense that you are not checking whether the other variable values are also duplicates: you keep the single observation that happens to come first in whatever sort order you have. This is the same thing the force option does in your duplicates drop command.
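
        As a hedged sketch of how to check whether force actually matters here (permno and date as in the original post): duplicates report without a variable list looks at complete records, so if it shows the same roughly 5,000 duplicates as the permno-date report, the duplicated rows are full copies and duplicates drop without a variable list removes them without needing force.

        Code:
        * duplicates in terms of the key only
        duplicates report permno date
        * duplicates in terms of all variables
        duplicates report
        * if the two reports agree, the duplicated rows are complete copies
        * and can be dropped without the force option
        duplicates drop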



        • #5
          duplicates isn't really written to be fast. It's written to bundle together various operations under one heading and to try to make it more difficult for you to maltreat your data. But the extra stuff that implies shouldn't be what slows things down.
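
          One way to check that empirically, as a rough sketch (run on a copy of the data via preserve/restore, using the variables from the original post):

          Code:
          * rough timing comparison of the two approaches
          timer clear
          preserve
          timer on 1
          duplicates drop permno date, force
          timer off 1
          restore
          preserve
          timer on 2
          duplicates tag permno date, generate(dup)
          bysort permno date: drop if _n != 1
          timer off 2
          restore
          timer list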



          • #6
            Thanks for sharing your knowledge, Jorrit! While I was writing this, my PC finally produced a first result with the old command. For my other dataset I will come back to your method!

            Edit: Nick, thanks for the note. I hadn't refreshed the browser before.
            Last edited by Felix Schrock; 14 Mar 2018, 10:55.



            • #7
              While you can program this to be more efficient, if you have something that only needs to run once, running it overnight or over the weekend is often easier. Many of the things that might make this faster would require sorting the data, which is generally a slow process.

              If you have 8 GB of RAM and Stata is using 7,840 MB of it, there is a chance that you're being slowed down by a lack of RAM. Stata is much faster when it can load the entire dataset into RAM. It works fine if it can't, but it is often much slower.
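
              On that note, a small sketch that may help the data fit into RAM (it assumes nothing about the variable names): compress recasts each variable to the smallest storage type that can hold its values, and memory shows how much space the data actually take up.

              Code:
              * how much memory the dataset currently uses
              memory
              * recast each variable to the smallest storage type that can
              * hold its values; with 24 variables and 44 million observations
              * this can shrink the dataset considerably
              compress
              memory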
