  • Remove Duplicate Observations

    Greetings,

    I have a large dataset in which some observations have up to 17 variations, meaning the ID is the same but the operational codes differ. For a project, I classified more than 50 operational codes into three categories (MH = 0, 1, and 3). I wrote the following command to drop any duplicates (same ID) with dup greater than 1 and MH codes 0 or 3. But when I tabulate the dup variable after dropping, I still have many duplicate observations that I don't need. I would appreciate your advice.

    drop if (dup > 1 & (MH == 0 | MH == 3))

    Thanks,
    Eliot

  • #2
    Well, first of all, you should still be left with duplicates where MH == 1, because the command never drops those. So before you conclude that something is wrong, check whether the surviving duplicates have MH == 1.

    If not, then probably you did something wrong in calculating the variable dup, so that dup > 1 doesn't correctly capture all and only surplus observations. You don't show how you did that, so it's anybody's guess what (if anything) went wrong and how it might be fixed.
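
    A minimal sketch of that check, assuming MH is numeric and the variables are named ID, dup, and MH as in the original post. The toy values are hypothetical and stand in for the data after the drop; dup is typed in directly because its construction has not been shown yet.

    ```stata
    * Toy data with hypothetical values (not the poster's dataset)
    clear
    input long ID byte dup byte MH
    1111 0 3
    2222 1 1
    2222 2 1
    3333 1 0
    3333 2 1
    end

    * Cross-tabulate the duplicate counter against the MH category
    tabulate dup MH, missing

    * List any observations that the drop condition should have removed
    list ID dup MH if dup > 1 & (MH == 0 | MH == 3), sepby(ID)
    ```

    If the list command shows nothing, every surviving observation with dup > 1 has MH == 1.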



    • #3
      Thanks, Clyde. I have checked, and the number of duplicates with MH == 1 is correct. Here is the code:

      sort ID
      quietly by ID: gen dup = cond(_N==1,0,_n)
      order dup, after(ID)
      tabulate dup

      drop if (dup > 1 & (MH == 0 | MH == 3))

      I still have IDs with MH == 3 that I don't need.

      For instance, I have the following after the drop command above:

      ID     dup   MH
      1234   1     1
      1234   2     3

      5678   2     3
      5678   3     3
      Last edited by Eliot Assoudeh; 06 Aug 2024, 09:22.



      • #4
        The core problem here, I believe, is that you are checking for dup > 1 instead of dup > 0. For observations that are unique, dup is zero; dup = 1 means there is one duplicate (i.e., two observations are identical), and so on. This is the standard way in which the variable is produced by duplicates tag.

        Code:
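        clear
        set obs 3
        * x is distinct in every observation
        gen x = _n
        * y is identical in the first two observations
        gen y = (_n < 3) * 5
        * z is identical in all three observations
        gen z = 10

        * each new variable counts the other observations sharing the same value
        duplicates tag x, gen(dupx)
        duplicates tag y, gen(dupy)
        duplicates tag z, gen(dupz)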
        which yields:

        Code:
        . list, noobs
        
          +---------------------------------+
          | x   y    z   dupx   dupy   dupz |
          |---------------------------------|
          | 1   5   10      0      1      2 |
          | 2   5   10      0      1      2 |
          | 3   0   10      0      0      2 |
          +---------------------------------+
        Last edited by Hemanshu Kumar; 07 Aug 2024, 01:27.



        • #5
          Sorry, I missed your post in #3 where you show how you generate your dup variable, which looks a bit different from the one produced by duplicates tag. So ignore my response in #4.
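
          To make the difference concrete, here is a small sketch on toy data (hypothetical IDs, not the poster's dataset) that builds both variables side by side. The name tag for the duplicates tag result is just an illustrative choice.

          ```stata
          * Toy data: ID 1 is unique, ID 2 appears twice, ID 3 three times
          clear
          input long ID
          1
          2
          2
          3
          3
          3
          end

          * Construction from #3: 0 for unique IDs, otherwise a running
          * counter 1, 2, 3, ... within each ID
          sort ID
          by ID: gen dup = cond(_N == 1, 0, _n)

          * duplicates tag: the same value for every copy, equal to the
          * number of other observations sharing that ID
          duplicates tag ID, gen(tag)

          list, sepby(ID)
          ```

          With the construction from #3, dup > 1 picks out every copy after the first within an ID, whereas with duplicates tag a value above 0 flags all copies, including the first.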

          Before we go any further, it should be impossible to have any observations with dup greater than 1 and also MH equal to 3 if you were literally doing what you said:

          Code:
          drop if (dup > 1 & (MH == 0 | MH == 3))
          So can you confirm this is actually what you use in your code, or did you show us toy code to reproduce the problem, so that this is not exactly what you ran?
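
          One way to check the claim that nothing with dup greater than 1 and MH equal to 0 or 3 can survive that drop is sketched below on toy data. The values are hypothetical, and MH is assumed to be numeric.

          ```stata
          * Toy data (hypothetical values, not the poster's dataset)
          clear
          input long ID byte MH
          1234 1
          1234 3
          5678 3
          5678 3
          5678 0
          end

          * Same construction of dup as in #3
          sort ID
          quietly by ID: gen dup = cond(_N == 1, 0, _n)

          * The drop command exactly as posted
          drop if (dup > 1 & (MH == 0 | MH == 3))

          * Stops with an error if any observation still meets the condition
          assert !(dup > 1 & (MH == 0 | MH == 3))

          list, sepby(ID)
          ```

          Running the same assert on the real data right after the drop, before dup is touched again, would show whether the command behaves as expected there.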
          Last edited by Hemanshu Kumar; 07 Aug 2024, 03:52.



          • #6
            Thanks, Hemanshu, for your note. That is exactly the code that I used, and there are multiple instances where dup is greater than 1 and MH == 3. Someone suggested that I use -sortby- rather than -sort-, and that improved the removal of duplicates, but not completely.
