  • Duplicates Drop Force Yields Different Results for Same Dataset

    Dear Statalisters,

    I have a dataset with three (quite long) string variables and other (numeric and string) variables, roughly 100,000 observations in total. I want to drop duplicates in terms of the three string variables and disregard the other variables in this selection. Thus, I write -duplicates drop var1 var2 var3, force-.

    However, the number of observations in the resulting dataset varies from run to run. For example, I run the code and get 39,367 observations. Then I immediately rerun the code and get 39,394 observations.

    I know it is hard to tell from far away what's going on. But does anyone have an insight into the problem, or has anyone encountered it before? Does -duplicates drop- get confused when there are a lot of variables? Or is there another problem?
    Thank you very much!

    All the best
    Leon
    Last edited by Leon Schmidt; 07 Sep 2019, 08:06. Reason: Duplicates, Drop, Force

  • #2
    Can you find a small subset of your data that reproduces this behavior and post it here (using -dataex-)? What you describe should not happen. After running -duplicates drop var1 var2 var3, force- the number of observations should be exactly equal to the number of distinct combinations of values of var1, var2, and var3.

    That said, if you are looking for a workaround, instead of -duplicates drop- you can run:

    Code:
    by var1 var2 var3, sort: keep if _n == 1
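
    To check the expected count directly, one possible sketch is to number the distinct combinations with -egen, group()- and compare that number with _N after dropping. The names combo and n_distinct are illustrative and not from the thread; the -missing- option is used on the assumption that empty strings should count as a value of their own, as they do for -duplicates-.

    ```stata
    * Number the distinct (var1, var2, var3) combinations; combo is a hypothetical name
    egen long combo = group(var1 var2 var3), missing
    quietly summarize combo, meanonly
    local n_distinct = r(max)
    drop combo

    * After dropping, _N should equal the number of distinct combinations
    duplicates drop var1 var2 var3, force
    assert _N == `n_distinct'
    ```

    If the -assert- fails, the data in memory changed between the two steps; if it passes but the count differs across runs, the input data themselves differ from run to run.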



    • #3
      Good advice from Clyde, of course. If this doesn't get explained here, I'd encourage you to take this up with Stata Tech Support. If this is a bug, I'd call it a significant one.
      Proceeding on the bug hypothesis: How long are these strings? Given that -duplicates- is an old command, I wonder if there is some problem with comparisons using modern longer strings. Do you get the same anomalous results when using the -report- or -tag- subcommands of -duplicates-?
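
      A minimal sketch of that cross-check, using the placeholder variable names from the first post. The new variable names dup and first are illustrative only.

      ```stata
      * Tabulate how many copies exist of each (var1, var2, var3) combination
      duplicates report var1 var2 var3

      * dup = number of surplus copies of each observation's combination (0 if unique)
      duplicates tag var1 var2 var3, generate(dup)
      tabulate dup

      * Count the observations that a drop would keep: one per combination
      bysort var1 var2 var3: generate byte first = (_n == 1)
      count if first
      ```

      If these counts are stable across runs while the count after -duplicates drop- is not, that would point at the command; if they also vary, the data differ before the drop.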



      • #4
        Dear Clyde and Mike,

        Thank you very much for your answers and support! In the meantime I figured out what caused this behavior and it is indeed not a bug. Essentially, what happened was that somewhere before the duplicates drop I merged and collapsed the data with other datasets. However, this procedure did not result in the same sorting each time I ran the code. This is why the duplicates drop then delivered different results. I am sorry for the confusion! Maybe this can serve as a reminder to always have the correct sorting!
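
        A toy sketch of how row order alone can change the count. Everything here is hypothetical: the three-row dataset, the order_a and order_b variables, and the fill step, which stands in for any step that reads a neighbouring row. Note also that -sort- orders tied observations arbitrarily unless the sort key is unique.

        ```stata
        * Toy data: the same three rows, with two possible orderings
        clear
        input str1 var1 str1 var2 order_a order_b
        "a" "x" 1 2
        "a" ""  2 1
        "a" "y" 3 3
        end

        foreach o in order_a order_b {
            preserve
            sort `o'
            * Order-dependent step: fill an empty var2 from the previous row
            replace var2 = var2[_n-1] if var2 == "" & var1 == var1[_n-1]
            duplicates drop var1 var2, force
            display "`o': " _N " observations remain"
            restore
        }
        ```

        Under order_a the empty value is filled with "x" and becomes a duplicate, so 2 observations remain; under order_b the empty row comes first, stays empty, and 3 observations remain.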

        All the best
        Leon



        • #5
          Thanks for the closure.



          • #6
            Dear Leon,

            I am facing a similar problem. I merge 3 different datasets and end up with 10,300 +/- 10 observations each time I run the code. How were you able to deal with the sorting problem?

            Thank you in advance!

            Regards,
            Michael



            • #7
              *Edit: The number of observations differs after dropping duplicate entries in terms of 4 variables.



              • #8
                Dear Michael,

                I don't remember the issue completely. As far as I recall, I had three string variables and wanted to drop duplicates in terms of these. In some observations some strings were missing, and I generated the third variable before the -duplicates drop-. If the data were not sorted the same way after the merges, the creation of the third variable was not consistent, and thus the -duplicates drop- gave different results (as in your case, the datasets differed by about +/- 10 observations). I just used the -sort- command on the three variables to get a consistent sort order before creating the third variable and running -duplicates drop-. Then it worked. I checked with the -cf- command that the resulting datasets were indeed the same.
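
                Roughly, the workflow looked like the sketch below. The file name run1 is made up, and the comment lines stand in for the merge and variable-creation steps, which I cannot reconstruct. If ties on the sort variables matter for your order-dependent step, you may need to add further variables to the -sort- so that the order is unique.

                ```stata
                * Run 1: fix the order before any step that depends on row order
                sort var1 var2 var3
                * ... order-dependent steps go here ...
                duplicates drop var1 var2 var3, force
                sort var1 var2 var3
                save run1, replace

                * Run 2: rebuild the data from scratch with the same steps, then
                * compare the key variables observation by observation
                sort var1 var2 var3
                * ... order-dependent steps go here ...
                duplicates drop var1 var2 var3, force
                sort var1 var2 var3
                cf var1 var2 var3 using run1
                ```

                -cf- exits with an error if the two datasets differ in the compared variables or in the number of observations, so a silent pass means the runs agree on those variables.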

                Hope that helps.

                All the best
                Leon



                • #9
                  Hey Leon,

                  Thanks a lot for the answer! I will try that as well and hope it works.

