  • Dropping duplicates takes more than 24 hr.

    Dear Stata forum,

    I am quite new to Stata and I ran into an issue. I have panel data with quarterly observations from 1965 until 2020 for all US firms included in Compustat.
    I want to drop the duplicates by using the following command:

    duplicates drop gvkey Year_Quarter, force

    However, Stata has now been running for 24 hours and nothing has happened. Is there a different, faster way to find all the duplicates and remove them from the sample?

    best regards

  • #2
    I would first test on a smaller subsample of your data whether the command does what you want at all. If you are satisfied, let the computer run for a day, three days, a week, or however long is necessary; you only need to do this once.

    I do not know of a faster version of -duplicates-.

    You can look at this thread as well.
    https://www.statalist.org/forums/for...icient-options


    • #3
      duplicates drop gvkey Year_Quarter, force
      If you are sure that you are not losing any information by keeping only one observation per combination of gvkey and Year_Quarter, then the duplicates drop command is equivalent to either

      Code:
      bys gvkey Year_Quarter: keep if _n==1
      or

      Code:
      bys gvkey Year_Quarter: drop if _n>1

      Generally, sorting can be time consuming in large datasets.
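
A minimal sketch of the -bys- approach on a toy panel, in case it helps to try it on a small sample first. The data values, the sales variable, and the storage types of gvkey and Year_Quarter are assumptions made up for illustration; they are not taken from the original poster's file.

```stata
* Toy panel with one duplicated firm-quarter (made-up values, hypothetical variable sales)
clear
input long gvkey str6 Year_Quarter double sales
1001 "2019Q1" 10
1001 "2019Q1" 10
1001 "2019Q2" 12
1002 "2019Q1" 7
end

* Report how many surplus copies exist before dropping anything
duplicates report gvkey Year_Quarter

* Keep the first observation within each gvkey-quarter group
bysort gvkey Year_Quarter: keep if _n == 1

* isid exits with an error if the key is still not unique
isid gvkey Year_Quarter
```

Which copy survives within a group is arbitrary here. If the duplicates differ in other variables, adding a sort variable in parentheses, as in bysort gvkey Year_Quarter (sales), makes the choice depend on that variable instead.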


      • #4
        Thanks for your answers, guys!


        • #5
          Something has gone wrong. I don't believe your computer is still working on the problem. An ordinary desktop should be able to drop duplicate observations from a 10-million-record file in less than a minute. Using Musau's -bys- method would be twice as fast. Is it possible that you are out of memory and the system is paging to disk? That would slow things down so much that the job might never complete.
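
One rough way to check the memory hypothesis from inside Stata is sketched below. It assumes Stata 12 or later, where -memory- prints a usage report, and it uses the shipped auto dataset only as a stand-in for the real file.

```stata
* Stand-in dataset; replace with the actual Compustat extract
sysuse auto, clear

* Number of observations and variables, plus the width of one observation
describe, short

* Report how much memory Stata is using for the data (Stata 12 or later)
memory

* Shrink variables to the smallest storage type that holds their values
compress
```

If the reported data size is close to the machine's physical RAM, paging to disk becomes a plausible explanation for the slowdown, and -compress- or dropping unneeded variables before sorting may help.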


          • #6
            Dear [email protected],

            I solved the issue by dropping the duplicates in the subsamples before merging. Thanks for your answer.
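
The poster did not share the code, so the following is only a guess at what deduplicating before merging might look like. The two toy subsamples, the tempfile name fundamentals, and the variables sales and price are all hypothetical.

```stata
* First subsample: remove duplicate firm-quarters, then save to a temporary file
clear
input long gvkey str6 Year_Quarter double sales
1001 "2019Q1" 10
1001 "2019Q1" 10
1002 "2019Q1" 7
end
bysort gvkey Year_Quarter: keep if _n == 1
tempfile fundamentals
save `fundamentals'

* Second subsample: remove duplicates in the same way
clear
input long gvkey str6 Year_Quarter double price
1001 "2019Q1" 25
1002 "2019Q1" 31
1002 "2019Q1" 31
end
bysort gvkey Year_Quarter: keep if _n == 1

* merge 1:1 stops with an error if either side still has repeated keys
merge 1:1 gvkey Year_Quarter using `fundamentals'
```

Sorting several smaller files is usually cheaper than sorting one large merged file, and the 1:1 merge doubles as a check that no duplicate keys remain.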
