
  • Long processing time. More efficient options?

    I am working with a dataset with 24 variables and ~44 million observations. For unknown reasons, the raw data contains about 5,000 duplicates (found with duplicates tag, gen()). I am trying to drop the duplicates right now with:

    Code:
    duplicates drop permno date, force

    My computer has been calculating for six hours now and doesn't seem to come to an end. It is not a huge workstation: it has an i7 processor, a 256 GB SSD and 8 GB of RAM. Right now Stata is using 7,840 MB of RAM.

    Does anyone have similar experience and know a more efficient way to drop the duplicates?

    P.S.: An estimated processing time in the data section in the bottom right corner would be a nice additional feature.

    Best regards,
    Felix

  • #2
    If you have already tagged the duplicates, couldn't you just drop all observations with the right tag?
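
    For reference, a minimal sketch of what that tag looks like, assuming it was created with duplicates tag and is called dup (the name is hypothetical): the tag counts the other observations with the same permno-date key, so it is 0 for unique observations and at least 1 for every member of a duplicated group, the "original" included, which is why a plain drop on the tag alone is not quite enough.

    Code:
    * hypothetical tag variable, created earlier with something like
    * duplicates tag permno date, generate(dup)
    * dup == 0 : unique permno-date combination
    * dup >= 1 : every observation of a duplicated group, "original" included,
    *            so "drop if dup >= 1" on its own would drop too much
    tabulate dup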



    • #3
      The "original" and the duplicate observations are both tagged with a 1. How can I make sure that only the duplicate observations get deleted?



      • #4
        Like so:

        Code:
        * toy example: create a duplicate company-year pair in the grunfeld data
        webuse grunfeld, clear
        replace year = 1935 if company==1 & year==1936
        * tag observations that are duplicated in terms of company and year
        duplicates tag company year, generate(dup)
        * within each company-year group, keep only the first observation
        bys company year dup: drop if _n!=1

        Although I have to admit I have no idea whether this is any faster.

        Also note that you are dropping 'random' duplicates, in the sense that you are not checking whether the other variable values are also duplicates: you keep the single observation that happens to come first in whatever sort order you have. This is the same thing the force option does in your duplicates drop command.
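
        As a hedged sketch of how to check whether force actually matters here (permno and date as in the original post): duplicates report without a variable list looks at complete records, so if it shows the same roughly 5,000 duplicates as the permno-date report, the duplicated rows are full copies and duplicates drop without a variable list removes them without needing force.

        Code:
        * duplicates in terms of the key only
        duplicates report permno date
        * duplicates in terms of all variables
        duplicates report
        * if the two reports agree, the duplicated rows are complete copies
        * and can be dropped without the force option
        duplicates drop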



        • #5
          duplicates isn't really written to be fast. It's written to bundle together various operations under one heading and to try to make it more difficult for you to maltreat your data. But the extra stuff that implies shouldn't be what slows things down.
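
          One way to check that empirically, as a rough sketch (run on a copy of the data via preserve/restore, using the variables from the original post):

          Code:
          * rough timing comparison of the two approaches
          timer clear
          preserve
          timer on 1
          duplicates drop permno date, force
          timer off 1
          restore
          preserve
          timer on 2
          duplicates tag permno date, generate(dup)
          bysort permno date: drop if _n != 1
          timer off 2
          restore
          timer list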



          • #6
            Thanks for sharing your knowledge, Jorrit! While I was writing this, my PC finally produced a first result with the old command. For my other dataset I will come back to your method!

            Edit: Nick, thanks for the note. I hadn't refreshed the browser before.
            Last edited by Felix Schrock; 14 Mar 2018, 10:55.



            • #7
              While you can program this to be more efficient, if you have something that only needs to run once, running it overnight or over the weekend is often easier. Many of the things that might make this faster would require sorting the data, which is generally a slow process.

              If you have 8 GB of RAM and Stata is using 7,840 MB of it, there is a chance that you're being slowed down by a lack of RAM. Stata is much faster when it can load the entire dataset into RAM. It works fine if it can't, but it is often much slower.
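
              On that note, a small sketch that may help the data fit into RAM (it assumes nothing about the variable names): compress recasts each variable to the smallest storage type that can hold its values, and memory shows how much space the data actually take up.

              Code:
              * how much memory the dataset currently uses
              memory
              * recast each variable to the smallest storage type that can
              * hold its values; with 24 variables and 44 million observations
              * this can shrink the dataset considerably
              compress
              memory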
