Check if two out of three values match and ignore missing values

Clemens Lehner

Join Date: Oct 2018

Posts: 7
#1

Check if two out of three values match and ignore missing values

16 Oct 2018, 03:03

Hi Statalist,

I have a dataset where I am comparing the same numerical variables from three different sources. I want to see if they are identical (and ignore any missing values) and currently achieve this with egens rowmax and rowmin functions.

egen rowmin=rowmin(VarA VarB VarC)
egen rowmax=rowmax(VarA VarB VarC)
gen Difference = rowmax!=rowmin

However, I want to introduce a new rule that as long as any two out of three variables match, they should not be marked as different (missing values should still be ignored). It is important to note that any of the variables can have missing values and that I am checking them individually against each other (so none of them is a "correct" variable that I am comparing against). The result should look like this:

VarA VarB VarC Difference

10 10 10 0

30 30 40 0

50 50 . 0

. . 60 0

70 . 80 1

I figured there might be a way to do this with anymatch, however, I did not suceed in doing so. Any suggestions wouldd be greatly appreciated!
Tags: None

Nick Cox

Join Date: Mar 2014
Posts: 35724

16 Oct 2018, 06:08

Your rules can be restated:

If all values are missing, nothing you can do.

If one value is present, use it.

If two values are present, use their value if identical.

If three values are present, use a mode (shared by two or three values) if present.

These can all be united as a rule that for bad values, neither the minimum nor the maximum is equal to the median. That's because if you have two different values, neither is equal to the median. If you have three values but at least two agree (either a, a, b or a, b, b in order) then either the minimum or the maximum will equal the median.

Code:

clear 
input VarA    VarB    VarC    Difference
10    10    10    0
30    30    40    0
50    50    .    0
.    .    60    0
70    .    80    1
end 

egen median = rowmedian(Var?) 

gen Bad = (median != min(VarA, VarB, VarC)) & (median != max(VarA, VarB, VarC))

list 

    | VarA   VarB   VarC   Differ~e   median   Bad |
     |----------------------------------------------|
  1. |   10     10     10          0       10     0 |
  2. |   30     30     40          0       30     0 |
  3. |   50     50      .          0       50     0 |
  4. |    .      .     60          0       60     0 |
  5. |   70      .     80          1       75     1 |
     +----------------------------------------------+

Comment

Romalpa Akzo

Join Date: Oct 2017

Posts: 369
#3

16 Oct 2018, 06:22

Other solution: Logically, any two out of three variables match is just a contrast to no single match couple among 3 Vars (missings included)

Code:

gen Wanted = (VarA != VarB)*(VarB != VarC)*(VarC != VarA)
2 likes
Comment
Clemens Lehner

Join Date: Oct 2018

Posts: 7
#4

21 Dec 2018, 04:13

As an update to my original question, would there also be a similar method that works with strings? Both answers worked great for numerical values, but now I face the same Problem with a string variable (an iso 3 Country Code like USA, AUT, etc.)
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35724
#5

21 Dec 2018, 04:19

Please give a data example, realistic if not real, so that people who answer don't have to work at that too.
Comment
Clemens Lehner

Join Date: Oct 2018

Posts: 7
#6

21 Dec 2018, 05:07

Sorry for not providing an example!

Code:

clear input str3 VarA str3 VarB str3 VarC Match "USA" "USA" "USA" 1 "USA" "USA" "" 1 "" "USA" "USA" 1 "USA" "" "" 0 "USA" "CAN" "" 0 "CAN" "USA" "MEX" 0 end list

The goal is the same as in my original question, check if at least two out of the three strings match and ignore missing values.
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35724

21 Dec 2018, 05:24

Thanks for the example. It seems to me that the rules stated in #2 remain good. Nothing in that formulation rules out strings. However, to implement them a sorting of values helps too.

Code:

clear
input str3 VarA str3 VarB str3 VarC Match
"USA" "USA" "USA" 1
"USA" "USA" "" 1
"" "USA" "USA" 1
"USA" "" "" 0
"USA" "CAN" "" 0
"CAN" "USA" "MEX" 0
end

* install from Stata Journal 
rowsort Var?, gen(New1 New2 New3) highmissing 

*1 If all values are missing, nothing you can do. 

*2 If one value is present, use it. 

*3 If two values are present, use their value if identical. 

*4 If three values are present, use a mode (shared by two or three values) if present. 

* rules 1 and 2 
gen wanted = New1 if missing(New2) 

* rule 3 
replace wanted = New1 if missing(wanted) & missing(New3) & (New1 == New2) 

* rule 4 
replace wanted = cond(New1 == New2, New1, cond(New2 == New3, New2, "")) if missing(wanted) 

list , sep(0) 

     +----------------------------------------------------------+
     | VarA   VarB   VarC   Match   New1   New2   New3   wanted |
     |----------------------------------------------------------|
  1. |  USA    USA    USA       1    USA    USA    USA      USA |
  2. |  USA    USA              1    USA    USA             USA |
  3. |         USA    USA       1    USA    USA             USA |
  4. |  USA                     0    USA                    USA |
  5. |  USA    CAN              0    CAN    USA                 |
  6. |  CAN    USA    MEX       0    CAN    MEX    USA          |
     +----------------------------------------------------------+

Comment

Clemens Lehner

Join Date: Oct 2018

Posts: 7
#8

21 Dec 2018, 06:16

Hi Nick, thanks a lot for the helpful answer! I did not succeed in convincing your -rowsort- command to accept string variables, but I used -sortrows- instead since it essentially does the same thing but also works with strings.

Code:

sortrows Var?, gen(New1 New2 New3) missing
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35724
#9

21 Dec 2018, 07:15

Note that I said rowsort from the Stata Journal. That comes up first with a search rowsort:

SJ-9-1 pr0046 . . . . . . . . . . . . . . . . . . . Speaking Stata: Rowwise
(help rowsort, rowranks if installed) . . . . . . . . . . . N. J. Cox
Q1/09 SJ 9(1):137--157
shows how to exploit functions, egen functions, and Mata
for working rowwise; rowsort and rowranks are introduced

My guess is that you installed rowsort from SSC. If so, you should see this result in Stata

Code:

.which rowsort *! NJC 1.2.0 22 November 2005
1 like
Comment
Clemens Lehner

Join Date: Oct 2018

Posts: 7
#10

21 Dec 2018, 09:08

You are right, thanks for the hint!
Comment

Announcement