  • Duplicates being tagged without being duplicates

    Hi, I have encountered a problem that I do not understand. Somehow, Stata is tagging duplicates that, at first sight, are not duplicates. I have used the following command:
    Code:
    duplicates tag month year cusip, gen(flag)
    As the following sample shows, it tags these observations even though they have different dates (same cusip):

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long reprisk_id float(month year) str6 cusip byte flag
    101829  9 2015 "466112" 1
    101829  1 2019 "466112" 1
    101829 12 2020 "466112" 1
    101829  8 2019 "466112" 1
    101829  9 2019 "466112" 1
    101829  2 2020 "466112" 1
    101829 12 2019 "466112" 1
    101829  4 2018 "466112" 1
    101829  3 2018 "466112" 1
    101829 11 2019 "466112" 1
    end
    Why could this be the case?

    Thanks.

  • #2
    You are showing us results from some larger dataset, but the example data themselves do not show the problem.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input long reprisk_id float(month year) str6 cusip byte flag
    101829  9 2015 "466112" 1
    101829  1 2019 "466112" 1
    101829 12 2020 "466112" 1
    101829  8 2019 "466112" 1
    101829  9 2019 "466112" 1
    101829  2 2020 "466112" 1
    101829 12 2019 "466112" 1
    101829  4 2018 "466112" 1
    101829  3 2018 "466112" 1
    101829 11 2019 "466112" 1
    end
    
    duplicates tag month year cusip, gen(tag)
    list
         +-----------------------------------------------+
         | repris~d   month   year    cusip   flag   tag |
         |-----------------------------------------------|
      1. |   101829       9   2015   466112      1     0 |
      2. |   101829       1   2019   466112      1     0 |
      3. |   101829      12   2020   466112      1     0 |
      4. |   101829       8   2019   466112      1     0 |
      5. |   101829       9   2019   466112      1     0 |
         |-----------------------------------------------|
      6. |   101829       2   2020   466112      1     0 |
      7. |   101829      12   2019   466112      1     0 |
      8. |   101829       4   2018   466112      1     0 |
      9. |   101829       3   2018   466112      1     0 |
     10. |   101829      11   2019   466112      1     0 |
         +-----------------------------------------------+
    My guess is that you're not seeing the duplicates because the data are jumbled, so the observations that match each other are not listed next to each other. So check out, for example,


    Code:
    list if cusip == "466112" & year == 2015 & month == 9
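
    A more general way to run the same check is sketched below. It assumes the full dataset is in memory with the variables from the first post (reprisk_id, month, year, cusip, flag); it is one possible approach, not the only one. Sorting on the key variables puts the tagged observations side by side, and listing reprisk_id alongside them shows whether the repeats differ in some other variable.

    ```stata
    * Sketch only: assumes the full dataset is loaded and that flag was
    * created by -duplicates tag month year cusip, gen(flag)-.

    * Summarise how many observations share each month-year-cusip combination
    duplicates report month year cusip

    * Put the tagged observations next to each other and list them,
    * with a separator line between each month-year-cusip group
    sort cusip year month
    list reprisk_id month year cusip flag if flag > 0, sepby(cusip year month)

    * Alternatively, let -duplicates- list every duplicated observation itself
    duplicates list month year cusip
    ```

    A flag value of 0 means the combination is unique; a positive value is the number of other observations with the same month, year and cusip.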

    • #3
      You are totally right, thank you so much!
