  • Dropping duplicates randomly

    Hi


    I'd like to drop duplicates randomly instead of just the first duplicate observation.
    A snapshot of my data set:
    [Screenshot: Schermafbeelding 2019-07-22 om 00.30.55.png]



    Each patent-invt_id has several co_invt_id values. I want to keep only one co_invt_id per patent-invt_id, picked at random.

    I found the following code on the predecessor of Statalist:
    Code:
    bys varnames : gen rnd = uniform()
    bys varnames (rnd) : keep if _n == 1
    Does it make sense? (I'm not very familiar with Stata syntax.) I can run it on my dataset, but because I have over 1 million observations, it's quite difficult to see whether duplicates were indeed dropped randomly. Any feedback would be welcome.
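
    One way to check the behavior on something small enough to inspect by eye is a toy dataset. The sketch below is only an illustration: the 12 made-up observations and the seed are assumptions, the variable names mirror the ones described above, and it uses runiform() rather than the older uniform().
    Code:
    clear
    set seed 2019 // arbitrary seed, chosen only for reproducibility
    set obs 12
    gen patent = ceil(_n/6) // 2 hypothetical patents
    gen invt_id = ceil(_n/3) // 4 hypothetical inventors, 3 co-inventors each
    gen co_invt_id = _n
    gen double rnd = runiform()
    bysort patent invt_id (rnd): keep if _n == 1
    assert _N == 4 // one observation left per patent-invt_id group
    list patent invt_id co_invt_id, sepby(patent)
    Rerunning with a different seed should generally keep a different co_invt_id within each group, which is a quick sign that the choice is random rather than always the first observation.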

  • #2
    Yes, that is the general approach. More specifically, for your data, it would be:
    Code:
    set seed 1234 // OR YOUR FAVORITE RANDOM NUMBER SEED
    gen double shuffle1 = runiform()
    gen double shuffle2 = runiform()
    by patent invt_id (shuffle1 shuffle2): keep if _n == 1
    drop shuffle*
    The use of two double-precision random numbers to select the random observations to keep is necessary because you have a large number of observations in your data set. A smaller storage type, or a single double-precision random number, would have an appreciable probability of producing duplicate random numbers, which might then make the resulting selection indeterminate and irreproducible.
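
    As a rough sketch of how one might check for such ties directly (the isid and duplicates report lines are an addition for illustration, and the seed is arbitrary):
    Code:
    set seed 1234
    gen double shuffle1 = runiform()
    gen double shuffle2 = runiform()
    // isid exits with an error unless the pair identifies every observation uniquely
    isid shuffle1 shuffle2
    // reports how many observations would be tied on a single draw alone
    duplicates report shuffle1
    If isid runs silently, the sort order on the two shuffle variables has no ties, so the same seed should give the same selection on every run.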

    The current name of the uniform random number generating function is runiform(), not uniform().

    Now that you have been using Statalist for a while, I would like you to improve your posts. Please read the Forum FAQ in its entirety. Pay particular attention to #12, where you will learn about the more and less useful ways to show example data. There, among other things, you will learn that screenshots are among the less helpful ways, and that the preferred method is to use the -dataex- command. Instructions for accessing and using -dataex- will be found there as well.

    Also, it is the norm in this community to use our real given and surnames as our username. You cannot change your username by editing your profile, but you can accomplish this by clicking on "CONTACT US" in the lower right corner of this page and sending a message to the system administrator requesting a change. Please help us maintain the collegiality and professionalism of the Forum by doing this. Thank you in advance.




    • #3
      Thank you Clyde.

      I sent an email to change my username, and I'll look at the Forum FAQ.



      • #4
        Originally posted by Clyde Schechter
        Yes, that is the general approach. More specifically, for your data, it would be:
        Code:
        set seed 1234 // OR YOUR FAVORITE RANDOM NUMBER SEED
        gen double shuffle1 = runiform()
        gen double shuffle2 = runiform()
        by patent invt_id (shuffle1 shuffle2): keep if _n == 1
        drop shuffle*
        I get the following error when executing your code:
        Code:
         by patent invt_id (shuffle1 shuffle2): keep if _n == 1
        not sorted
        r(5);



        • #5
          With errors like this, a good place to start is by reviewing the documentation for the command. The output of help by suggests that this will take care of Clyde's oversight.
          Code:
          by patent invt_id (shuffle1 shuffle2), sort: keep if _n == 1
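
          Putting that correction together with the earlier steps, the whole sequence might look like the sketch below. The final isid line is an addition for illustration, not part of the code posted above; it simply confirms that one observation remains per patent-invt_id pair. The seed is arbitrary.
          Code:
          set seed 1234
          gen double shuffle1 = runiform()
          gen double shuffle2 = runiform()
          // the sort option orders the data by group and shuffle values before by runs
          by patent invt_id (shuffle1 shuffle2), sort: keep if _n == 1
          drop shuffle*
          // exits with an error if any patent-invt_id pair still appears more than once
          isid patent invt_id
          The by line could equally be written with the bysort prefix, which is shorthand for by with the sort option.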



          • #6
            Sorry for the error, and thanks to William Lisowski for correcting it.
