Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • comparison of records

    Hi,

    We double enter some of our data, and use the "proc compare" command in SAS to compare entries. This command flags discrepancies between records, that we then investigate further. I am interested in flagging discrepancies between observations (ie. if for the same ID, one record has age = 15, the other record has age = 19 - this would be flagged.)

    We use Stata for most of our analyses so it would be great if we could use it in this case as well. However, I am unaware of anything similar to proc compare. Does anyone know of anything that might work?

    Thanks,
    Robin

  • #2
    See

    Code:
    help duplicates
    for a command which checks for duplicates. Here you want everything to be duplicated (strict sense) and are on the lookout for singletons and other patterns. But note that

    Code:
    bysort a b c : gen bad = _N != 2
    list a b c if bad 
    is going to find observations that don't pair off on the variables named. So although duplicates is a convenience command, and one I quite like, you can get a long way just working from first principles.

    Comment


    • #3
      Check out -obsdiff- from ssc.
      Eric A. Booth | Senior Director of Research | Far Harbor | Austin TX

      Comment


      • #4
        Hi Nick and Eric,

        Thanks for the replies - sorry for the very slow response.

        In our dataset, we should have two records for each ID, for multiple IDs. We want to compare many other variables, within IDs. Is there a way to compare these other variables, by ID, with either of these approaches?

        Thanks again,
        Robin

        Comment


        • #5
          The code in #2 shows the principle you're asking for. If records should pair by identifier and by various other variables, then you can check for whether they define groups of 2 or not.

          Comment


          • #6
            Thanks Nick. I see that now.

            Since we are comparing many variables between records, this isn't quite as helpful as it could be, since it lists all variables that were compared (most of which are hopefully identical).

            Is there a way to set this up that I'm not seeing, so that only ID and any mismatched variables are listed?

            Comment


            • #7
              Hard to suggest good strategy without any data example or even a sketch of what kinds of problems you have.

              Perhaps this may help. Imagine an identifier id and a host of other variables a b c d e f g

              Code:
              gen bad = 0
              
              foreach v in a b c e f g {
                  qui  bysort id : replace bad = _N == 2 & `v'[1] != `v'[2]
                  list id `v' if `bad', sepby(id)
              }

              Comment


              • #8
                Thanks Nick. Sorry for the minimal description. We are double entering data from various forms, so that we end up with two entries per ID. We want to compare the double entries, so that any discrepancies are flagged and can be investigated. I would like a list of the discrepant variables, plus ID. I don't want all variables listed when there is a single discrepancy between two records, because it would be too much to sort through all of those variables to find the discrepancy.

                Let's use this is an example (we are working with more records and more variables in reality; some are numeric and some are strings):
                id a b c d e f g
                1001 1 18 0 cat 0 1 -8
                1001 1 18 1 cat 0 1 -8
                1002 1 55 0 dog 0 1 -9
                1002 1 55 0 dog 0 1 -9
                1003 0 42 0 dog 0 0 0
                1003 0 42 0 dog 0 0 0
                1004 1 36 0 cat 0 1 0
                1004 0 36 0 dog 0 1 0
                I can't quite get your code to work; I get an invalid syntax error.

                My Stata is not quite advanced enough - I'm not sure how the "highlighted" pieces are meant to work (see attached).
                Attached Files
                Last edited by Robin Fatch; 26 Apr 2018, 14:53.

                Comment


                • #9
                  Sorry, but I won't read Word documents (see FAQ Advice) and I can't comment on what's wrong with your code because you don't show it (see FAQ Advice).

                  Comment


                  • #10
                    I see. I had pasted the code (your suggested code from your earlier post) in the word document with parts highlighted; I couldn't figure out how to do that in the editor. Here it is below.

                    I get an "invalid syntax" message when I use it on the sample data. In particular in the attachment, I had highlighted "replace bad = _N == 2 &" and " `bad' " as I'm not sure how those pieces were meant to work.

                    Thanks,
                    Robin


                    Code:
                    gen bad = 0
                     
                    foreach v in a b c e f g {
                        qui  bysort id : replace bad = _N == 2 & `v'[1] != `v'[2]
                        list id `v' if `bad', sepby(id)
                    }
                    id a b c d e f g
                    1001 1 18 0 cat 0 1 -8
                    1001 1 18 1 cat 0 1 -8
                    1002 1 55 0 dog 0 1 -9
                    1002 1 55 0 dog 0 1 -9
                    1003 0 42 0 dog 0 0 0
                    1003 0 42 0 dog 0 0 0
                    1004 1 36 0 cat 0 1 0
                    1004 0 36 0 dog 0 1 0

                    Comment


                    • #11
                      Sorry, my bad typo. This works:

                      Code:
                      clear 
                      input id    a    b    c    str3 d    e    f    g
                      1001    1    18    0    cat    0    1    -8
                      1001    1    18    1    cat    0    1    -8
                      1002    1    55    0    dog    0    1    -9
                      1002    1    55    0    dog    0    1    -9
                      1003    0    42    0    dog    0    0    0
                      1003    0    42    0    dog    0    0    0
                      1004    1    36    0    cat    0    1    0
                      1004    0    36    0    dog    0    1    0
                      end 
                      
                      gen bad = 0
                       
                      foreach v in a b c d e f g {
                          qui  bysort id : replace bad = _N == 2 & `v'[1] != `v'[2]
                          list id `v' if bad, sepby(id)
                      }

                      Comment


                      • #12
                        Great, thanks Nick. That works.

                        Can you explain what this portion of your code does?


                        Code:
                         
                         replace bad = _N == 2 & `v'[1] != `v'[2]
                        I would like to understand it better so I can modify, if needed. I understand that the second half of the statement (this part: `v'[1] != `v'[2]) is highlighting records where the value of `v' in record 1 does not equal the value in record 2. But how does the first part of the replace statement fit in/how does it work?

                        Comment


                        • #13
                          We're ignoring singletons (_N == 1) and indeed "duplicates" that occur 3 or more times. They may be of interest or concern otherwise.

                          Comment


                          • #14
                            How does this work, if we're not saying to replace bad IF something else? Sorry, I feel as though I am missing something key here...
                            Last edited by Robin Fatch; 30 Apr 2018, 15:21.

                            Comment


                            • #15
                              Sorry, but I don't understand what you're asking here. As you loop over variables, you want to check each new variable. Previous values are all irrelevant.

                              Comment

                              Working...
                              X