comparison of records

Robin Fatch

Join Date: Oct 2015

Posts: 36
#16

01 May 2018, 20:20

Thanks Nick. I think it's because I'm having a hard time phrasing the question.

Basically, I'm not sure how your code works. I would expect something like this to work:

Code:

clear
input id    a    b    c    str3 d    e    f    g
1001    1    18    0    cat    0    1    -8
1001    1    18    1    cat    0    1    -8
1002    1    55    0    dog    0    1    -9
1002    1    55    0    dog    0    1    -9
1003    0    42    0    dog    0    0    0
1003    0    42    0    dog    0    0    0
1004    1    36    0    cat    0    1    0
1004    0    36    0    dog    0    1    0
end

gen bad = 0

foreach v in a b c d e f g {
    qui bysort id : replace bad = 1 if (_N == 2) & (`v'[1] != `v'[2])
    list id `v' if bad == 1, sepby(id)
}

You can see it's slightly different in the "replace" line and in "bad == 1", and I do get different results. With this code, once a discrepant value has been found between obs 1 and 2 of an id, that id is listed for every later variable, even if that variable itself isn't discrepant.

I'm just having trouble figuring out how your code works/how Stata interprets it. If you are able to somehow put that into words, that would be great.

Nick Cox

Join Date: Mar 2014
Posts: 35637

#17

02 May 2018, 02:03

You're asking, I think, why when you change my code it no longer acts as intended.

When you changed the code, bad is now changed to 1 if and only if there is a mismatch on the current variable in the loop. So, if there is no such mismatch, then bad is 1 and a pair of observations on a variable is listed if and only if there was a mismatch on a previous variable.

This is what happens with your code. Follow my comments.

Code:

* 1004 on a is listed because there is a mismatch: intended
     +----------+
     |   id   a |
     |----------|
  7. | 1004   1 |
  8. | 1004   0 |
     +----------+

* 1004 on b is listed because bad was not changed: not intended behaviour
     +-----------+
     |   id    b |
     |-----------|
  7. | 1004   36 |
  8. | 1004   36 |
     +-----------+

* 1001 on c is listed because there is a mismatch: intended behaviour
* 1004 on c is still listed because (again) bad was not changed: not intended behaviour

     +----------+
     |   id   c |
     |----------|
  1. | 1001   0 |
  2. | 1001   1 |
     |----------|
  7. | 1004   0 |
  8. | 1004   0 |
     +----------+

* yet again, the listing of identical values arises because bad was not changed

     +------------+
     |   id     d |
     |------------|
  1. | 1001   cat |
  2. | 1001   cat |
     |------------|
  7. | 1004   cat |
  8. | 1004   dog |
     +------------+

* and so on
     +----------+
     |   id   e |
     |----------|
  1. | 1001   0 |
  2. | 1001   0 |
     |----------|
  7. | 1004   0 |
  8. | 1004   0 |
     +----------+

* and so on
     +----------+
     |   id   f |
     |----------|
  1. | 1001   1 |
  2. | 1001   1 |
     |----------|
  7. | 1004   1 |
  8. | 1004   1 |
     +----------+

* and so on
     +-----------+
     |   id    g |
     |-----------|
  1. | 1001   -8 |
  2. | 1001   -8 |
     |-----------|
  7. | 1004    0 |
  8. | 1004    0 |
     +-----------+

Here for completeness is my intended code and the results.

Code:

clear
input id    a    b    c    str3 d    e    f    g
1001    1    18    0    cat    0    1    -8
1001    1    18    1    cat    0    1    -8
1002    1    55    0    dog    0    1    -9
1002    1    55    0    dog    0    1    -9
1003    0    42    0    dog    0    0    0
1003    0    42    0    dog    0    0    0
1004    1    36    0    cat    0    1    0
1004    0    36    0    dog    0    1    0
end

gen bad = 0
 
foreach v in a b c d e f g {
    qui  bysort id : replace bad = _N == 2 & `v'[1] != `v'[2]
    list id `v' if bad, sepby(id)
}

     +----------+
     |   id   a |
     |----------|
  7. | 1004   1 |
  8. | 1004   0 |
     +----------+

     +----------+
     |   id   c |
     |----------|
  1. | 1001   0 |
  2. | 1001   1 |
     +----------+

     +------------+
     |   id     d |
     |------------|
  7. | 1004   cat |
  8. | 1004   dog |
     +------------+

It's vital to forget about previous variables when looking at each variable in turn.
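
A minimal sketch of the same point from the other direction, assuming you want to keep the "replace bad = 1 if ..." style: it gives the intended listings only if bad is explicitly reset to 0 at the top of each pass through the loop. The cut-down dataset (one id, two variables) is an assumption made for brevity; it is not the full example above.

```stata
clear
input id    a    b
1004    1    36
1004    0    36
end

gen bad = 0

foreach v in a b {
    * forget the result for the previous variable before testing this one
    qui replace bad = 0
    qui bysort id : replace bad = 1 if (_N == 2) & (`v'[1] != `v'[2])
    list id `v' if bad == 1, sepby(id)
}
```

With the reset in place, 1004 should be listed for a but not for b. The single-statement form shown above does the reset and the test in one step.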


Robin Fatch

Join Date: Oct 2015

Posts: 36
#18

02 May 2018, 09:19

Hi Nick,

Thanks. I did see that pattern, and that my code lists all observations, if they had a *previous* mismatch ("unintended behavior", as you noted). The syntax of my code was simply meant to show what I am more accustomed to - even though it clearly does not give the same results. I'm still not sure how your code works, though - I just don't understand parts of the syntax. That's where I am getting hung up.

Your code, inserted here again:

Code:

clear
input id    a    b    c    str3 d    e    f    g
1001    1    18    0    cat    0    1    -8
1001    1    18    1    cat    0    1    -8
1002    1    55    0    dog    0    1    -9
1002    1    55    0    dog    0    1    -9
1003    0    42    0    dog    0    0    0
1003    0    42    0    dog    0    0    0
1004    1    36    0    cat    0    1    0
1004    0    36    0    dog    0    1    0
end

gen bad = 0

foreach v in a b c d e f g {
    qui bysort id : replace bad = _N == 2 & `v'[1] != `v'[2]
    list id `v' if bad, sepby(id)
}

So, for each variable, you are asking Stata to (within the same ID), "replace bad = _N == 2 and if the first observation of the variable does not equal the second observation of that variable". How does the first part of this work? I know that you said it is ignoring those with one observation only (_N == 1) or more than two observations (_N > 2) (ie. those for whom _N ~= 2). But what is the code replacing bad to equal? I am wondering if this is somehow a temporary replacement, otherwise I think it would be similar to my code - listing all of the variables, after one mismatch is found.

Similarly, when you say "list id `v' if bad, sepby(id)" - don't you have to specify what bad is equal to, here, to tell Stata what to list?

Thanks...
Nick Cox

Join Date: Mar 2014

Posts: 35637
#19

02 May 2018, 09:28

I trust that https://www.stata.com/support/faqs/d...rue-and-false/ will answer both questions.
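
As a rough illustration of the two points at issue (a sketch only, using a cut-down dataset that is an assumption made for brevity): a logical expression in Stata evaluates to 1 when true and 0 when false, so assigning it gives every observation a value, and "if bad" is satisfied by any nonzero value of bad.

```stata
clear
input id    a
1003    0
1003    0
1004    1
1004    0
end

* the right-hand side is evaluated for every observation:
* 1 where the pair differs, 0 where it does not
bysort id : gen bad = (_N == 2) & (a[1] != a[2])
list id a bad, sepby(id)

* these two counts should agree, because "if bad" means "if bad != 0"
count if bad != 0
count if bad
```

One caveat: missing values are also nonzero, so "if bad" would select them too. That does not arise here because the expression always returns 0 or 1.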

Robin Fatch

Join Date: Oct 2015
Posts: 36

#20

02 May 2018, 15:04

Thanks Nick. That does help clarify some of my confusion.

I'm still having trouble, though, understanding why this:

Code:

qui  bysort id : replace bad = _N == 2 & `v'[1] != `v'[2]

is different than this:

Code:

qui  bysort id : replace bad = 1 if (_N == 2) & (`v'[1] != `v'[2])

You said "When you changed the code, bad is now changed to 1 if and only if there is a mismatch on the current variable in the loop. So, if there is no such mismatch, then bad is 1 and a pair of observations on a variable is listed if and only if there was a mismatch on a previous variable."

In both instances, though, isn't bad changed if there is a mismatch on the current variable? Why are all variables listed after code #2 (ie. bad stays = 1 after the first bad pair), but only the mismatched variables listed after code #1? Does it have to do with the loop somehow?

For example, if I drop the part limiting us to those with 2 observations each (for now), it seems that this works:

Code:

clear
input id    a    b    c    str3 d    e    f    g
1001    1    18    0    cat    0    1    -8
1001    1    18    1    cat    0    1    -8
1002    1    55    0    dog    0    1    -9
1002    1    55    0    dog    0    1    -9
1003    0    42    0    dog    0    0    0
1003    0    42    0    dog    0    0    0
1004    1    36    0    cat    0    1    0
1004    0    36    0    dog    0    1    0
end

gen bad = 0
 
foreach v in a b c d e f g {
    qui  bysort id : replace bad = `v'[1] != `v'[2]
    list id `v' if bad, sepby(id)
}

But changing it similarly to this does not work (same unintended behavior as before):

Code:

clear
input id    a    b    c    str3 d    e    f    g
1001    1    18    0    cat    0    1    -8
1001    1    18    1    cat    0    1    -8
1002    1    55    0    dog    0    1    -9
1002    1    55    0    dog    0    1    -9
1003    0    42    0    dog    0    0    0
1003    0    42    0    dog    0    0    0
1004    1    36    0    cat    0    1    0
1004    0    36    0    dog    0    1    0
end

gen bad = 0
 
foreach v in a b c d e f g {
    qui  bysort id : replace bad = 1 if `v'[1] != `v'[2]
    list id `v' if bad, sepby(id)
}

I was getting caught up on the _N == 2 piece before, but now I am thinking it is the loop behavior that I am misunderstanding (maybe?).


Nick Cox

Join Date: Mar 2014

Posts: 35637
#21

03 May 2018, 01:51

_N == 2 is always satisfied for your example data. I put it in the code because you need different code if each identifier occurs just once (_N == 1) or more than twice (_N > 2). We haven't addressed that, but experience with large complicated datasets shows that even if each record "should" appear twice there can be exceptions.
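
One possible way to cover those cases, offered as a sketch rather than as the code used in this thread: first report identifiers that do not occur exactly twice, then sort on the variable within identifier so that the first and last values differ exactly when the values are not all the same. The data and the variable name nrec are hypothetical.

```stata
clear
input id    a
1001    1
1001    1
1002    1
1003    0
1003    0
1003    1
end

* number of records per id; anything other than 2 deserves a closer look
bysort id : gen nrec = _N
list id nrec if nrec != 2, sepby(id)

* after sorting on a within id, the smallest value is first and the
* largest is last, so they differ only if the id has differing values
bysort id (a) : gen bad = a[1] != a[_N]
list id a if bad, sepby(id)
```

Here 1002 (one record) and 1003 (three records) should appear in the first listing, and only 1003 in the second.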

Otherwise sorry, but I can't see a new question here and can't see a new way to give the same explanation more clearly. Using if on the replace will only change values if there is a mismatch on the current variable. If the values of bad don't change they will usually reflect a previous change. Naturally being in a loop is crucial here, as it's essential that evaluation refers to the present variable in focus each time around the loop.
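
To watch the difference directly, here is a small sketch that runs both forms side by side. The variable names bad_if and bad_expr and the cut-down dataset are assumptions made for illustration.

```stata
clear
input id    a    b
1004    1    36
1004    0    36
end

gen bad_if = 0
gen bad_expr = 0

foreach v in a b {
    * with if: observations where the condition is false are left untouched
    qui bysort id : replace bad_if = 1 if `v'[1] != `v'[2]
    * without if: every observation is overwritten with 1 (true) or 0 (false)
    qui bysort id : replace bad_expr = `v'[1] != `v'[2]
    list id `v' bad_if bad_expr, sepby(id)
}
```

After the pass for a, both variables are 1. After the pass for b, bad_expr has been overwritten with 0, while bad_if is still 1 because nothing satisfied the if condition and so nothing was replaced.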

Nothing (except preferences of style and some concern for efficiency) stops anyone rewriting the loop

Code:

gen bad = 0

foreach v in a b c d e f g {
    qui bysort id : replace bad = `v'[1] != `v'[2]
    list id `v' if bad, sepby(id)
}

as

Code:

foreach v in a b c d e f g {
    qui bysort id : gen bad = `v'[1] != `v'[2]
    list id `v' if bad, sepby(id)
    drop bad
}

or even

Code:

foreach v in a b c d e f g {
    qui bysort id : gen bad`v' = `v'[1] != `v'[2]
    list id `v' if bad`v', sepby(id)
}

Both of those different versions of the code make it more obvious that each comparison is utterly independent of the others.

Last edited by Nick Cox; 03 May 2018, 02:19.
Robin Fatch

Join Date: Oct 2015

Posts: 36
#22

03 May 2018, 18:38

Thanks Nick. I agree that the _N == 2 is important to include. I was just trying to simplify to make my question easier.

I think my issue is that I don't understand why, with your code, the value of bad "resets" after every variable, and thus only lists the particular variables that are mismatched. (with mine, it clearly stays = 1 after the first mismatch in a pair is found - thus listing variables that are not mismatched, but are being compared after the first mismatch)

Anyways, thanks for all of your help trying to explain this. I do see that your code will work well for my purposes, so that is ultimately very helpful.

Thanks again,
Robin