  • #16
    Thanks Nick. I think it's because I'm having a hard time phrasing the question.

    Basically, I'm not sure how your code works. I would expect something like this to work:

    Code:
    clear
    input id    a    b    c    str3 d    e    f    g
    1001    1    18    0    cat    0    1    -8
    1001    1    18    1    cat    0    1    -8
    1002    1    55    0    dog    0    1    -9
    1002    1    55    0    dog    0    1    -9
    1003    0    42    0    dog    0    0    0
    1003    0    42    0    dog    0    0    0
    1004    1    36    0    cat    0    1    0
    1004    0    36    0    dog    0    1    0
    end
    
    gen bad = 0
     
    foreach v in a b c d e f g {
        qui  bysort id : replace bad = 1 if (_N == 2) & (`v'[1] != `v'[2])
        list id `v' if bad ==1, sepby(id)
    }
    You can see it's slightly different, in the "replace" line and in "bad == 1", and I do get different results. With this code, once an id has a discrepant value between obs 1 and 2 on one variable, it is listed for every later variable too, even if that variable itself isn't discrepant.

    I'm just having trouble figuring out how your code works/how Stata interprets it. If you are able to somehow put that into words, that would be great.



    • #17
      You're asking, I think, why when you change my code it no longer acts as intended.

      When you changed the code, bad is now changed to 1 if and only if there is a mismatch on the current variable in the loop. So, if there is no such mismatch, then bad is 1 and a pair of observations on a variable is listed if and only if there was a mismatch on a previous variable.

      This is what happens with your code. Follow my comments.

      Code:
      * 1004 on a is listed because there is a mismatch: intended
           +----------+
           |   id   a |
           |----------|
        7. | 1004   1 |
        8. | 1004   0 |
           +----------+
      
      * 1004 on b is listed because bad was not changed: not intended behaviour
           +-----------+
           |   id    b |
           |-----------|
        7. | 1004   36 |
        8. | 1004   36 |
           +-----------+
      
      * 1001 on c is listed because there is a mismatch: intended behaviour
      * 1004 on c is still listed because (again) bad was not changed: not intended behaviour
      
           +----------+
           |   id   c |
           |----------|
        1. | 1001   0 |
        2. | 1001   1 |
           |----------|
        7. | 1004   0 |
        8. | 1004   0 |
           +----------+
      
      * 1004 on d is listed because there is a mismatch: intended behaviour
      * yet again, the listing of identical values (1001 on d) arises because bad was not changed
      
           +------------+
           |   id     d |
           |------------|
        1. | 1001   cat |
        2. | 1001   cat |
           |------------|
        7. | 1004   cat |
        8. | 1004   dog |
           +------------+
      
      * and so on
           +----------+
           |   id   e |
           |----------|
        1. | 1001   0 |
        2. | 1001   0 |
           |----------|
        7. | 1004   0 |
        8. | 1004   0 |
           +----------+
      
      * and so on
           +----------+
           |   id   f |
           |----------|
        1. | 1001   1 |
        2. | 1001   1 |
           |----------|
        7. | 1004   1 |
        8. | 1004   1 |
           +----------+
      
      * and so on
           +-----------+
           |   id    g |
           |-----------|
        1. | 1001   -8 |
        2. | 1001   -8 |
           |-----------|
        7. | 1004    0 |
        8. | 1004    0 |
           +-----------+
      Here for completeness is my intended code and the results.

      Code:
      clear
      input id    a    b    c    str3 d    e    f    g
      1001    1    18    0    cat    0    1    -8
      1001    1    18    1    cat    0    1    -8
      1002    1    55    0    dog    0    1    -9
      1002    1    55    0    dog    0    1    -9
      1003    0    42    0    dog    0    0    0
      1003    0    42    0    dog    0    0    0
      1004    1    36    0    cat    0    1    0
      1004    0    36    0    dog    0    1    0
      end
      
      gen bad = 0
       
      foreach v in a b c d e f g {
          qui  bysort id : replace bad = _N == 2 & `v'[1] != `v'[2]
          list id `v' if bad, sepby(id)
      }
      
           +----------+
           |   id   a |
           |----------|
        7. | 1004   1 |
        8. | 1004   0 |
           +----------+
      
           +----------+
           |   id   c |
           |----------|
        1. | 1001   0 |
        2. | 1001   1 |
           +----------+
      
           +------------+
           |   id     d |
           |------------|
        7. | 1004   cat |
        8. | 1004   dog |
           +------------+
      It's vital to forget about previous variables when looking at each variable in turn.
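To make that "forgetting" explicit in the `replace ... if` style from #16, one option is to reset bad at the top of each pass through the loop. This is only a minimal sketch on the thread's example data; the extra `replace bad = 0` line is an addition for illustration and is not part of the code posted above.

```stata
clear
input id    a    b    c    str3 d    e    f    g
1001    1    18    0    cat    0    1    -8
1001    1    18    1    cat    0    1    -8
1002    1    55    0    dog    0    1    -9
1002    1    55    0    dog    0    1    -9
1003    0    42    0    dog    0    0    0
1003    0    42    0    dog    0    0    0
1004    1    36    0    cat    0    1    0
1004    0    36    0    dog    0    1    0
end

gen bad = 0

foreach v in a b c d e f g {
    * added for illustration: forget whatever was flagged for the previous variable
    qui replace bad = 0
    * flag both observations of an id when they disagree on the current variable
    qui bysort id : replace bad = 1 if (_N == 2) & (`v'[1] != `v'[2])
    list id `v' if bad == 1, sepby(id)
}
```

With the reset in place, each pass starts from bad == 0 everywhere, so only mismatches on the current variable can be listed. The one-line form `replace bad = _N == 2 & ...` does the same job without the separate reset.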



      • #18
        Hi Nick,

        Thanks. I did see that pattern, and that my code lists all observations, if they had a *previous* mismatch ("unintended behavior", as you noted). The syntax of my code was simply meant to show what I am more accustomed to - even though it clearly does not give the same results. I'm still not sure how your code works, though - I just don't understand parts of the syntax. That's where I am getting hung up.

        Your code, inserted here again:

        Code:
        clear
        input id    a    b    c    str3 d    e    f    g
        1001    1    18    0    cat    0    1    -8
        1001    1    18    1    cat    0    1    -8
        1002    1    55    0    dog    0    1    -9
        1002    1    55    0    dog    0    1    -9
        1003    0    42    0    dog    0    0    0
        1003    0    42    0    dog    0    0    0
        1004    1    36    0    cat    0    1    0
        1004    0    36    0    dog    0    1    0
        end
        
        gen bad = 0
         
        foreach v in a b c d e f g {
            qui  bysort id : replace bad = _N == 2 & `v'[1] != `v'[2]
            list id `v' if bad, sepby(id)
        }
        So, for each variable, you are asking Stata (within the same ID) to "replace bad = _N == 2 and the first observation of the variable does not equal the second observation of that variable". How does the first part of this work? I know that you said it is ignoring those with only one observation (_N == 1) or more than two observations (_N > 2) (i.e. those for whom _N ~= 2). But what value is the code setting bad to? I am wondering if this is somehow a temporary replacement; otherwise I think it would behave like my code, listing all of the variables after one mismatch is found.

        Similarly, when you say "list id `v' if bad, sepby(id)" - don't you have to specify what bad is equal to, here, to tell Stata what to list?

        Thanks...



        • #19
          I trust that https://www.stata.com/support/faqs/d...rue-and-false/ will answer both questions.
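As a small illustration of the point behind both questions in #18: in Stata a comparison evaluates to 1 when true and 0 when false, and `if` treats any nonzero value as true. The variable names x and big below are hypothetical and chosen only for this sketch.

```stata
* a true comparison evaluates to 1, a false one to 0
display 2 == 2
display 2 == 3

* so the result of a comparison can be stored directly in a variable
clear
set obs 3
gen x = _n
gen big = x > 1
list

* "if big" selects observations where big is not zero,
* so for a 0/1 variable it selects the same observations as "if big == 1"
count if big
count if big == 1
```

So `replace bad = _N == 2 & ...` stores 1 or 0 in every observation, and `list ... if bad` lists the observations where bad is 1. One caution: missing values are also nonzero, so `if bad` would select them too. That does not arise here because bad only ever holds 0 or 1.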



          • #20
            Thanks Nick. That does help clarify some of my confusion.

            I'm still having trouble, though, understanding why this:

            Code:
            qui  bysort id : replace bad = _N == 2 & `v'[1] != `v'[2]
            is different than this:
            Code:
            qui  bysort id : replace bad = 1 if (_N == 2) & (`v'[1] != `v'[2])

            You said "When you changed the code, bad is now changed to 1 if and only if there is a mismatch on the current variable in the loop. So, if there is no such mismatch, then bad is 1 and a pair of observations on a variable is listed if and only if there was a mismatch on a previous variable."


            In both instances, though, isn't bad changed if there is a mismatch on the current variable? Why are all variables listed after code #2 (i.e. bad stays = 1 after the first bad pair), but only the mismatched variables listed after code #1? Does it have to do with the loop somehow?


            For example, if I drop the part limiting us to those with 2 observations each (for now), this seems to work:
            Code:
            clear
            input id    a    b    c    str3 d    e    f    g
            1001    1    18    0    cat    0    1    -8
            1001    1    18    1    cat    0    1    -8
            1002    1    55    0    dog    0    1    -9
            1002    1    55    0    dog    0    1    -9
            1003    0    42    0    dog    0    0    0
            1003    0    42    0    dog    0    0    0
            1004    1    36    0    cat    0    1    0
            1004    0    36    0    dog    0    1    0
            end
            
            gen bad = 0
             
            foreach v in a b c d e f g {
                qui  bysort id : replace bad = `v'[1] != `v'[2]
                list id `v' if bad, sepby(id)
            }

            But changing it in the same way to this does not (same unintended behavior as before):

            Code:
            clear
            input id    a    b    c    str3 d    e    f    g
            1001    1    18    0    cat    0    1    -8
            1001    1    18    1    cat    0    1    -8
            1002    1    55    0    dog    0    1    -9
            1002    1    55    0    dog    0    1    -9
            1003    0    42    0    dog    0    0    0
            1003    0    42    0    dog    0    0    0
            1004    1    36    0    cat    0    1    0
            1004    0    36    0    dog    0    1    0
            end
            
            gen bad = 0
             
            foreach v in a b c d e f g {
                qui  bysort id : replace bad = 1 if `v'[1] != `v'[2]
                list id `v' if bad, sepby(id)
            }
            I was getting caught up on the _N == 2 piece before, but now I am thinking it is the loop behavior that I am misunderstanding (maybe?).



            • #21
              _N == 2 is always satisfied for your example data. I put it in the code because you need different code if each identifier occurs just once (_N == 1) or more than twice (_N > 2). We haven't addressed that, but experience with large complicated datasets shows that even if each record "should" appear twice there can be exceptions.
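The case of identifiers that occur once or more than twice is not addressed in this thread, so what follows is only a hedged sketch of one way to approach it. It first tabulates how many observations each id has, then sorts each id's values and compares the first with the last, which flags any disagreement regardless of group size. The variable names n_obs and differs are hypothetical.

```stata
clear
input id    a    b    c    str3 d    e    f    g
1001    1    18    0    cat    0    1    -8
1001    1    18    1    cat    0    1    -8
1002    1    55    0    dog    0    1    -9
1002    1    55    0    dog    0    1    -9
1003    0    42    0    dog    0    0    0
1003    0    42    0    dog    0    0    0
1004    1    36    0    cat    0    1    0
1004    0    36    0    dog    0    1    0
end

* how many observations does each id have?
bysort id : gen n_obs = _N
tabulate n_obs

* assumption for illustration: within each id, sort on the current variable
* and compare the first value with the last; they differ if any values differ
foreach v in a b c d e f g {
    qui bysort id (`v') : gen byte differs = `v'[1] != `v'[_N]
    list id `v' if differs, sepby(id)
    drop differs
}
```

For an id with a single observation the first and last values coincide, so nothing is flagged. Note that this approach re-sorts the data within each id, so observation numbers in the listings need not match the original order, and a missing value counts as differing from a nonmissing one.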

              Otherwise sorry, but I can't see a new question here and can't see a new way to give the same explanation more clearly. Using if on the replace will only change values if there is a mismatch on the current variable. If the values of bad don't change they will usually reflect a previous change. Naturally being in a loop is crucial here, as it's essential that evaluation refers to the present variable in focus each time around the loop.

              Nothing (except preferences of style and some concern for efficiency) stops anyone rewriting the loop

              Code:
              gen bad = 0  
              foreach v in a b c d e f g {    
                  qui  bysort id : replace bad = `v'[1] != `v'[2]    
                  list id `v' if bad, sepby(id)
              }
              as

              Code:
              foreach v in a b c d e f g {
                  qui bysort id : gen bad = `v'[1] != `v'[2]
                  list id `v' if bad, sepby(id)
                  drop bad
              }
              or even

              Code:
              foreach v in a b c d e f g {    
                  qui  bysort id : gen bad`v' = `v'[1] != `v'[2]    
                  list id `v' if bad`v', sepby(id)
              }
              Both of those different versions of the code make it more obvious that each comparison is utterly independent of the others.
              Last edited by Nick Cox; 03 May 2018, 02:19.



              • #22
                Thanks Nick. I agree that the _N == 2 is important to include. I was just trying to simplify to make my question easier.

                I think my issue is that I don't understand why, with your code, the value of bad "resets" after every variable, and thus only lists the particular variables that are mismatched. (with mine, it clearly stays = 1 after the first mismatch in a pair is found - thus listing variables that are not mismatched, but are being compared after the first mismatch)

                Anyways, thanks for all of your help trying to explain this. I do see that your code will work well for my purposes, so that is ultimately very helpful.

                Thanks again,
                Robin
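On the "reset" asked about in #22, a minimal trace on id 1004 alone may help. It is cut down from the thread's data to variables a and b; bad1 and bad2 are hypothetical names used to run the two forms side by side.

```stata
clear
input id a b
1004 1 36
1004 0 36
end

* Form 1: replace without "if" assigns a new value to every observation
gen bad1 = 0
* mismatch on a: the expression is 1, so bad1 becomes 1
bysort id : replace bad1 = a[1] != a[2]
* no mismatch on b: the expression is 0, so bad1 is overwritten with 0
bysort id : replace bad1 = b[1] != b[2]

* Form 2: "replace ... if" only touches observations that meet the condition
gen bad2 = 0
* mismatch on a: both observations qualify, so bad2 becomes 1
bysort id : replace bad2 = 1 if a[1] != a[2]
* no mismatch on b: no observation qualifies, so bad2 keeps its old value of 1
bysort id : replace bad2 = 1 if b[1] != b[2]

list
```

Nothing special "resets" bad in the first form. The assignment simply runs for every observation, and where there is no mismatch the value it writes is 0. In the second form the `if` excludes those observations, so nothing is ever written back to 0.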

