Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Check if two out of three values match and ignore missing values

    Hi Statalist,

    I have a dataset where I am comparing the same numerical variables from three different sources. I want to see if they are identical (and ignore any missing values) and currently achieve this with egens rowmax and rowmin functions.

    egen rowmin=rowmin(VarA VarB VarC)
    egen rowmax=rowmax(VarA VarB VarC)
    gen Difference = rowmax!=rowmin

    However, I want to introduce a new rule that as long as any two out of three variables match, they should not be marked as different (missing values should still be ignored). It is important to note that any of the variables can have missing values and that I am checking them individually against each other (so none of them is a "correct" variable that I am comparing against). The result should look like this:

    VarA VarB VarC Difference
    10 10 10 0
    30 30 40 0
    50 50 . 0
    . . 60 0
    70 . 80 1

    I figured there might be a way to do this with anymatch, however, I did not suceed in doing so. Any suggestions wouldd be greatly appreciated!

  • #2
    Your rules can be restated:

    If all values are missing, nothing you can do.

    If one value is present, use it.

    If two values are present, use their value if identical.

    If three values are present, use a mode (shared by two or three values) if present.

    These can all be united as a rule that for bad values, neither the minimum nor the maximum is equal to the median. That's because if you have two different values, neither is equal to the median. If you have three values but at least two agree (either a, a, b or a, b, b in order) then either the minimum or the maximum will equal the median.

    Code:
    clear 
    input VarA    VarB    VarC    Difference
    10    10    10    0
    30    30    40    0
    50    50    .    0
    .    .    60    0
    70    .    80    1
    end 
    
    egen median = rowmedian(Var?) 
    
    gen Bad = (median != min(VarA, VarB, VarC)) & (median != max(VarA, VarB, VarC))
    
    list 
    
        | VarA   VarB   VarC   Differ~e   median   Bad |
         |----------------------------------------------|
      1. |   10     10     10          0       10     0 |
      2. |   30     30     40          0       30     0 |
      3. |   50     50      .          0       50     0 |
      4. |    .      .     60          0       60     0 |
      5. |   70      .     80          1       75     1 |
         +----------------------------------------------+

    Comment


    • #3
      Other solution: Logically, any two out of three variables match is just a contrast to no single match couple among 3 Vars (missings included)

      Code:
      gen Wanted = (VarA != VarB)*(VarB != VarC)*(VarC != VarA)

      Comment


      • #4
        As an update to my original question, would there also be a similar method that works with strings? Both answers worked great for numerical values, but now I face the same Problem with a string variable (an iso 3 Country Code like USA, AUT, etc.)

        Comment


        • #5
          Please give a data example, realistic if not real, so that people who answer don't have to work at that too.

          Comment


          • #6
            Sorry for not providing an example!

            Code:
            clear
            input str3 VarA str3 VarB str3 VarC Match
            "USA" "USA" "USA" 1
            "USA" "USA" "" 1
            "" "USA" "USA" 1
            "USA" "" "" 0
            "USA" "CAN" "" 0
            "CAN" "USA" "MEX" 0
            end
            
            list
            The goal is the same as in my original question, check if at least two out of the three strings match and ignore missing values.

            Comment


            • #7
              Thanks for the example. It seems to me that the rules stated in #2 remain good. Nothing in that formulation rules out strings. However, to implement them a sorting of values helps too.

              Code:
              clear
              input str3 VarA str3 VarB str3 VarC Match
              "USA" "USA" "USA" 1
              "USA" "USA" "" 1
              "" "USA" "USA" 1
              "USA" "" "" 0
              "USA" "CAN" "" 0
              "CAN" "USA" "MEX" 0
              end
              
              * install from Stata Journal 
              rowsort Var?, gen(New1 New2 New3) highmissing 
              
              *1 If all values are missing, nothing you can do. 
              
              *2 If one value is present, use it. 
              
              *3 If two values are present, use their value if identical. 
              
              *4 If three values are present, use a mode (shared by two or three values) if present. 
              
              * rules 1 and 2 
              gen wanted = New1 if missing(New2) 
              
              * rule 3 
              replace wanted = New1 if missing(wanted) & missing(New3) & (New1 == New2) 
              
              * rule 4 
              replace wanted = cond(New1 == New2, New1, cond(New2 == New3, New2, "")) if missing(wanted) 
              
              list , sep(0) 
              
                   +----------------------------------------------------------+
                   | VarA   VarB   VarC   Match   New1   New2   New3   wanted |
                   |----------------------------------------------------------|
                1. |  USA    USA    USA       1    USA    USA    USA      USA |
                2. |  USA    USA              1    USA    USA             USA |
                3. |         USA    USA       1    USA    USA             USA |
                4. |  USA                     0    USA                    USA |
                5. |  USA    CAN              0    CAN    USA                 |
                6. |  CAN    USA    MEX       0    CAN    MEX    USA          |
                   +----------------------------------------------------------+

              Comment


              • #8
                Hi Nick, thanks a lot for the helpful answer! I did not succeed in convincing your -rowsort- command to accept string variables, but I used -sortrows- instead since it essentially does the same thing but also works with strings.

                Code:
                sortrows Var?, gen(New1 New2 New3) missing

                Comment


                • #9
                  Note that I said rowsort from the Stata Journal. That comes up first with a search rowsort:


                  SJ-9-1 pr0046 . . . . . . . . . . . . . . . . . . . Speaking Stata: Rowwise
                  (help rowsort, rowranks if installed) . . . . . . . . . . . N. J. Cox
                  Q1/09 SJ 9(1):137--157
                  shows how to exploit functions, egen functions, and Mata
                  for working rowwise; rowsort and rowranks are introduced

                  My guess is that you installed rowsort from SSC. If so, you should see this result in Stata

                  Code:
                  .which rowsort
                  *! NJC 1.2.0 22 November 2005



                  Comment


                  • #10
                    You are right, thanks for the hint!

                    Comment

                    Working...
                    X