  • Duplicates list problem

    Dear guys,

    This is my first post on this forum, so sorry if the location of my post is wrong.

    So I am working on an assignment in a merged CRSP-T1 dataset.

    I am puzzled by the following:

    When I run the duplicates list command with two variables (duplicates list var1 var2), Stata doesn't find any duplicates.
    However, if I browse through the data, there are many duplicates between the two variables. Both variables are numeric.

    Any insights would be highly appreciated.


    Best regards,
    Bram van Heiningen



  • #2
    This may ease your puzzlement:

    Code:
    . clear
    
    . input str4 beast score
    
             beast      score
      1. "frog"  42
      2. "toad"  666
      3. "toad"  42
      4. "frog"  666
      5. end
    
    .
    . duplicates list beast score
    
    Duplicates in terms of beast score
    
    (0 observations are duplicates)
    
    .
    . duplicates list beast
    
    Duplicates in terms of beast
    
      +-----------------------+
      | group:   obs:   beast |
      |-----------------------|
      |      1      1    frog |
      |      1      4    frog |
      |      2      2    toad |
      |      2      3    toad |
      +-----------------------+
    
    .
    . duplicates list score
    
    Duplicates in terms of score
    
      +-----------------------+
      | group:   obs:   score |
      |-----------------------|
      |      1      1      42 |
      |      1      3      42 |
      |      2      2     666 |
      |      2      4     666 |
      +-----------------------+

    With two or more variables, duplicates looks for observations that are identical on all the variables named, taken jointly. It does not pool the values that are duplicated on each of the variables named separately.

    Does that answer the question?
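
    A minimal sketch of that distinction, assuming the same four-observation toy data as above (the variable names n_joint and n_beast are made up for illustration). Counting observations within groups defined by both variables mirrors what duplicates does with a two-variable list, while counting within groups of a single variable answers a different question:

    ```stata
    clear
    input str4 beast score
    "frog" 42
    "toad" 666
    "toad" 42
    "frog" 666
    end

    * Number of observations sharing the same (beast, score) pair
    bysort beast score: generate n_joint = _N

    * Number of observations sharing the same beast, whatever the score
    bysort beast: generate n_beast = _N

    list, sepby(beast)

    * Summary of joint duplicates without listing them
    duplicates report beast score
    ```

    With these data n_joint should be 1 in every observation, because no beast-score pair occurs twice, while n_beast should be 2.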



    • #3
      Thank you for your reply Nick. I understand your points.

      However, in your example there are no duplicate observations in terms of both variables.

      My data:
      1. Frog 42
      2. Frog 42

      Running duplicates list beast score on my data reports 0 duplicate observations.



      • #4
        Still don't get why duplicates list var1 var2 doesn't seem to work.

        We managed to work around our issue by generating a dummy variable with value of 1 if var1==var2.



        • #5
          Here "doesn't seem to work" seems to mean "doesn't do what we expect".

          It sounds as if your definition of duplicates is that two (or more) variables are the same in the same observation.

          That's not the idea of duplicates at all. It is, as the title of the help explains, designed to report, tag, or drop duplicate observations.
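
          A minimal sketch of the two different questions, using made-up data and hypothetical variable names (var1, var2, same, dup). The first new variable compares two variables within each observation, which is what the workaround in #4 does. The second uses duplicates tag to compare observations with each other:

          ```stata
          clear
          input var1 var2
          1 1
          2 3
          3 2
          2 3
          end

          * Within each observation: are the two variables equal?
          * (missing values are excluded, since . == . is true in Stata)
          generate byte same = (var1 == var2) if !missing(var1, var2)

          * Across observations: how many other observations share this pair?
          duplicates tag var1 var2, generate(dup)

          list
          ```

          Here only observation 1 has var1 equal to var2, while observations 2 and 4 are duplicates of each other, so the two indicators need not agree.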

          EDIT: I only just noticed #3. But that works as intended.

          Code:
           
          . clear 
          
          . input str4 beast score 
          
                   beast      score
            1. "frog"  42 
            2. "frog"  42 
            3. "toad"  666 
            4. "toad"  42 
            5. "frog"  666
            6. end 
          
          . 
          . duplicates list beast score 
          
          Duplicates in terms of beast score
          
            +----------------------+
            | obs:   beast   score |
            |----------------------|
            |    1    frog      42 |
            |    2    frog      42 |
            +----------------------+
          Last edited by Nick Cox; 28 Jan 2016, 11:32.



          • #6
            Thank you again for your reply. Sorry for the misunderstanding. I do mean between different observations.

            Your last example matches my issue almost exactly, except that in my case both variables are numeric. Unfortunately the output says 0 duplicate observations, whilst there obviously are some (obs 1 and obs 2 in your example). My question is: how is it possible that duplicates list doesn't give the same kind of output as in your example?

            That is not what I expected, since there are duplicates between observations in terms of two numeric variables.



            • #7
              Please use dataex (SSC) to show your data and CODE delimiters to show your code, so that we can see exactly the example that, so far as you can see, doesn't conform.



              • #8
                I agree with Nick 100%! Here are two examples of how this could happen. In both cases, dataex would immediately show the problem. The first is related to precision and display format:

                Code:
                . clear 
                
                . input str4 beast score 
                
                         beast      score
                  1. "frog"  42 
                  2. "frog"  42 
                  3. "toad"  666 
                  4. "toad"  42 
                  5. "frog"  666
                  6. end 
                
                . duplicates list beast score
                
                Duplicates in terms of beast score
                
                  +----------------------+
                  | obs:   beast   score |
                  |----------------------|
                  |    1    frog      42 |
                  |    2    frog      42 |
                  +----------------------+
                
                . 
                . 
                . replace score = score + .00001 in 1
                (1 real change made)
                
                . format %10.0f score
                
                . duplicates list beast score
                
                Duplicates in terms of beast score
                
                (0 observations are duplicates)
                
                . list
                
                     +---------------+
                     | beast   score |
                     |---------------|
                  1. |  frog      42 |
                  2. |  frog      42 |
                  3. |  toad     666 |
                  4. |  toad      42 |
                  5. |  frog     666 |
                     +---------------+
                
                . dataex
                
                ----------------------- copy starting from the next line -----------------------
                [CODE]
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input str4 beast float score
                "frog" 42.00001
                "frog"       42
                "toad"      666
                "toad"       42
                "frog"      666
                end
                [/CODE]
                ------------------ copy up to and including the previous line ------------------
                The second shows that leading spaces can throw off encode:
                Code:
                . clear 
                
                . input str4 sbeast score 
                
                        sbeast      score
                  1. "frog"  42 
                  2. "frog"  42 
                  3. "toad"  666 
                  4. "toad"  42 
                  5. "frog"  666
                  6. end 
                
                . duplicates list sbeast score
                
                Duplicates in terms of sbeast score
                
                  +-----------------------+
                  | obs:   sbeast   score |
                  |-----------------------|
                  |    1     frog      42 |
                  |    2     frog      42 |
                  +-----------------------+
                
                . 
                . 
                . replace sbeast = " " + sbeast in 1
                variable sbeast was str4 now str5
                (1 real change made)
                
                . encode sbeast, gen(beast)
                
                . duplicates list beast score
                
                Duplicates in terms of beast score
                
                (0 observations are duplicates)
                
                . list
                
                     +------------------------+
                     | sbeast   score   beast |
                     |------------------------|
                  1. |   frog      42    frog |
                  2. |   frog      42    frog |
                  3. |   toad     666    toad |
                  4. |   toad      42    toad |
                  5. |   frog     666    frog |
                     +------------------------+
                
                . dataex
                
                ----------------------- copy starting from the next line -----------------------
                
                [CODE]
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input str5 sbeast float score long beast
                " frog"  42 1
                "frog"   42 2
                "toad"  666 3
                "toad"   42 3
                "frog"  666 2
                end
                label values beast beast
                label def beast 1 " frog", modify
                label def beast 2 "frog", modify
                label def beast 3 "toad", modify
                [/CODE]
                
                ------------------ copy up to and including the previous line ------------------
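
                One possible way to check for and repair the second problem, sketched under the assumption that stray leading or trailing blanks are the only difference between the strings. It reuses the dataex listing above, and strtrim() assumes Stata 14 or later (trim() does the same job in older versions):

                ```stata
                clear
                input str5 sbeast float score
                " frog"  42
                "frog"   42
                "toad"  666
                "toad"   42
                "frog"  666
                end

                * Show observations whose string carries leading or trailing blanks
                list if sbeast != strtrim(sbeast)

                * Strip the blanks before encoding, so "frog" gets a single code
                replace sbeast = strtrim(sbeast)
                encode sbeast, generate(beast)

                duplicates list beast score
                ```

                After trimming, observations 1 and 2 should again be reported as duplicates.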



                • #9
                  In addition to the Picardesque examples, the principle is that looking similar is not proof of identity. Display formats can make numeric values look the same, but that can be an illusion.
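
                  A minimal sketch of how such an illusion might be exposed, reusing the data from the dataex listing in #8. The variable name score_int is made up, and rounding to integers is only appropriate if the values are meant to be whole numbers, which is an assumption here:

                  ```stata
                  clear
                  input str4 beast float score
                  "frog" 42.00001
                  "frog"       42
                  "toad"      666
                  "toad"       42
                  "frog"      666
                  end
                  format %10.0f score

                  * Switch to a general format so that stored fractions become visible
                  format %12.0g score

                  * Flag values that are not whole numbers, whatever the display says
                  list beast score if score != round(score)

                  * If the values should be integers (an assumption), compare rounded copies
                  generate long score_int = round(score)
                  duplicates list beast score_int
                  ```

                  Comparing on the rounded copy should bring back observations 1 and 2 as duplicates, but whether rounding is legitimate depends on what the variable measures.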



                  • #10
                    Ah great, thank you all a lot. It was indeed a format issue.
