  • How to keep duplicates and drop non-duplicates?

    Hi all,

    I am currently working on data quality and I would like to create a dataset that keeps the duplicate observations and drops the non-duplicates.
    Is there any code that can help me to achieve this goal?

    I have shared the dataset with the link below. I'm trying to keep the duplicates based on these variables: facilityid pickupdate uniqueno description.
    Based on the duplicates report, I should end up with a dataset of 500 observations:


    duplicates report /*Duplicate is not case-sensitive
    --------------------------------------
       Copies | Observations       Surplus
    ----------+---------------------------
            1 |        69763             0
            2 |          916           458
            3 |           63            42
    --------------------------------------

    Is it possible to keep the duplicates only?

    Thank you very much!
    https://drive.google.com/file/d/1x8I...ew?usp=sharing
    Last edited by Marianne Sie; 03 Nov 2021, 11:05.

  • #2
    Code:
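    * flag = number of other copies of each observation (0 if unique)
    duplicates tag, gen(flag)
    * keep every observation that has at least one other copy
    keep if flag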
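
For anyone following along, here is a minimal sketch of what these two commands do on a made-up dataset. The variables id and grp are hypothetical and are not from the shared file. With no variable list, duplicates tag compares all variables, so flag is 0 for the unique row, 1 for each row of the pair, and 2 for each row of the triple.

```stata
* Toy data: id 1 is unique, id 2 appears twice, id 3 appears three times
clear
input id str1 grp
1 "a"
2 "b"
2 "b"
3 "c"
3 "c"
3 "c"
end

* flag holds the number of other copies of each observation
duplicates tag, gen(flag)
list

* Keeps every copy of every duplicated observation (2 + 3 = 5 rows here)
keep if flag
count
```

Note that this keeps all copies, including the first one in each group, which is a different set from the Surplus column of duplicates report.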



    • #3
      Hi Clyde Schechter, many thanks for your prompt reply.
      I tried this code before, but it does not produce the final dataset that I would like to keep. Let me explain.

      I would like to keep these 500 (458+42) observations in the column Surplus (see below) because these are the observations that will be dropped if I drop the duplicates.
      I don't want to drop these 500 observations but rather keep them. Is it possible to do so?
      duplicates report
      --------------------------------------
         Copies | Observations       Surplus
      ----------+---------------------------
              1 |        69763             0
              2 |          916           458
              3 |           63            42
      --------------------------------------

      When I run the command "keep if flag", I'm left with the observations (916 + 63) but that's not what I need. I would like to keep the 500 surplus only.
      I hope that my explanation is clear.

      keep if flag
      duplicates report

      --------------------------------------
         Copies | Observations       Surplus
      ----------+---------------------------
              2 |          916           458
              3 |           63            42
      --------------------------------------



      Thanks for your help
      Last edited by Marianne Sie; 03 Nov 2021, 19:30.



      • #4
        Oh, I see what you want.

        Code:
        by _all, sort: drop if _n == 1
        will leave only the surplus observations.
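
A minimal sketch of why this leaves only the surplus, again on made-up data (id and grp are hypothetical names): within each group of fully identical rows, the first copy is dropped and every later copy stays.

```stata
* Toy data: id 1 is unique, id 2 appears twice, id 3 appears three times
clear
input id str1 grp
1 "a"
2 "b"
2 "b"
3 "c"
3 "c"
3 "c"
end

* Sort on all variables, then drop the first row of each identical group
by _all, sort: drop if _n == 1

* Remaining rows are the surplus: one row for id 2 and two rows for id 3
list
```

If duplicates should be judged on the key variables from post #1 only, the same idea would presumably be written as bysort facilityid pickupdate uniqueno description: drop if _n == 1. In that case, which copy counts as the first is arbitrary whenever the other variables differ within a group, so it may be worth checking that before relying on it.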



        • #5
          Hi Clyde Schechter, this code worked perfectly! Thanks a ton!

          I have another question: is it possible to keep the duplicates (500 observations) plus the observations with missing values in the variable pickupdate?
          Is there code that can keep or drop observations based on two conditions at once?

          Thank you!!



          • #6
            Code:
            by _all, sort: keep if _n > 1 | missing(pickupdate)
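
A minimal sketch of the combined condition on made-up data. Here id is a hypothetical variable, and pickupdate is entered as a plain number purely for illustration; the real variable may be stored differently.

```stata
* Toy data: id 1 unique, id 2 twice, id 3 unique with missing date, id 4 three times
clear
input id pickupdate
1 22000
2 22001
2 22001
3 .
4 22002
4 22002
4 22002
end

* Keep surplus copies, plus any row whose pickupdate is missing
by _all, sort: keep if _n > 1 | missing(pickupdate)

* Remaining: one row for id 2, the missing-date row for id 3, two rows for id 4
list
```

Because the missing() condition is also true for the first row of a group, duplicated observations with a missing pickupdate are kept in full, first copy included.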



            • #7
              Thank you very much, Clyde.
              These codes worked perfectly.
