Dropping duplicates randomly

Ludovic Van Cau

Join Date: Jul 2019

Posts: 23
#1

Dropping duplicates randomly

21 Jul 2019, 16:33

Hi

I'd like to drop duplicates at random, rather than always keeping the first observation among the duplicates.
A snapshot of my data set:

Each patent-invt_id pair has several co_invt_id values. I want to keep only one co_invt_id per pair, picked at random.

I found the following code on the predecessor of Statalist:

Code:

bys varnames : gen rnd = uniform()
bys varnames (rnd) : keep if _n == 1

Does it make sense? (I'm not very familiar with Stata syntax.) I can run it on my data set, but because I have over 1 million observations it is quite difficult to see whether the duplicates were indeed dropped randomly. Any feedback would be welcome.
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

21 Jul 2019, 20:01

Yes, that is the general approach. More specifically, for your data, it would be:

Code:

set seed 1234 // OR YOUR FAVORITE RANDOM NUMBER SEED
gen double shuffle1 = runiform()
gen double shuffle2 = runiform()
by patent invt_id (shuffle1 shuffle2): keep if _n == 1
drop shuffle*

The use of two double-precision random numbers to select the random observations to keep is necessary because you have a large number of observations in your data set. A smaller storage type, or a single double precision random number, would have an appreciable probability of producing duplicate random numbers, which might then make the resulting selection indeterminate and irreproducible.
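As a rough illustration of the collision risk described above, the birthday-problem approximation 1 - exp(-N(N-1)/(2M)) gives the chance that at least two of N draws coincide when each draw takes one of M equally likely values. The sketch below assumes M = 2^k and tries a few values of k. Those values are illustrative assumptions, not a statement about the resolution of runiform() in any particular Stata version.

```stata
* Birthday-problem approximation for N draws from 2^k equally likely values.
* N matches the "over 1 million observations" in this thread; the k values
* are hypothetical resolutions chosen only for illustration.
local N = 1000000
foreach k in 24 32 53 {
    display "k = `k': " 1 - exp(-`N' * (`N' - 1) / (2 * 2^`k'))
}
```

The second random variable acts as a tie-breaker: two observations share the same sort key only if they tie on both shuffle1 and shuffle2, which is far less likely than a tie on either one alone.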

The current name of the uniform random number generating function is runiform(), not uniform().

Now that you have been using Statalist for a while, I would like you to improve your posts. Please read the Forum FAQ in its entirety. Pay particular attention to #12, where you will learn about the more and less useful ways to show example data. There, among other things, you will learn that screenshots are among the less helpful ways, and that the preferred method is to use the -dataex- command. Instructions for accessing and using -dataex- will be found there as well.

Also, it is the norm in this community to use our real given names and surnames as our usernames. You cannot change your username by editing your profile, but you can accomplish this by clicking on "CONTACT US" in the lower right corner of this page and sending a message to the system administrator requesting a change. Please help us maintain the collegiality and professionalism of the Forum by doing this. Thank you in advance.
Ludovic Van Cau

Join Date: Jul 2019

Posts: 23
#3

22 Jul 2019, 05:54

Thank you Clyde.

I sent an email to change my username and I'll take a look at the Forum FAQ.
Ludovic Van Cau

Join Date: Jul 2019

Posts: 23
#4

22 Jul 2019, 06:28

Originally posted by Clyde Schechter

Yes, that is the general approach. More specifically, for your data, it would be:

Code:

set seed 1234 // OR YOUR FAVORITE RANDOM NUMBER SEED
gen double shuffle1 = runiform()
gen double shuffle2 = runiform()
by patent invt_id (shuffle1 shuffle2): keep if _n == 1
drop shuffle*

I get the following error when executing your code:

Code:

by patent invt_id (shuffle1 shuffle2): keep if _n == 1
not sorted
r(5);
William Lisowski

Join Date: Dec 2014

Posts: 10150
#5

22 Jul 2019, 07:34

With errors like this, a good place to start is by reviewing the documentation for the command. The output of help by suggests that adding the sort option, as below, will take care of Clyde's oversight.

Code:

by patent invt_id (shuffle1 shuffle2), sort: keep if _n == 1
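Putting the pieces of this thread together, the full sequence might look like the sketch below, meant to be run from a do-file. The two isid lines are an addition not taken from the posts above: they only confirm that the sort key has no ties and that exactly one observation per patent-invt_id pair remains. They do not test whether the selection was random. If patent or invt_id can contain missing values, isid may need its missok option.

```stata
set seed 1234                        // any fixed seed makes the selection reproducible
gen double shuffle1 = runiform()
gen double shuffle2 = runiform()
isid shuffle1 shuffle2               // stops with an error if the sort key has ties
by patent invt_id (shuffle1 shuffle2), sort: keep if _n == 1
drop shuffle*
isid patent invt_id                  // stops with an error if any pair still has duplicates
```

Because the seed is fixed and the sort key is unique, rerunning the same do-file on the same data in the same order should keep the same observations each time.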
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#6

22 Jul 2019, 11:48

Sorry for the error, and thanks to William Lisowski for correcting it.