Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Check values from one variable against another variable

    Hello,

    I'd like to write a code which checks if the values in one variable appear atleast once in another variable.

    Code:
    clear
    input float(var1 var2)
    683      547
    547      .
    546      028
    548      .
    546      .
    694      .
    042      042
    040      .
    042      .
    042      .
    751      041
    In the example above, 547 is a value for var1. It is also a value for var2 but for a different observation. I want to check if any of the values in var1 are repeated in var2 for any observation.

    Could someone please suggest how to go ahead? The values are repeated in var1 but not in var2. There are missing values in var2.

    Thank you.

  • #2
    Minnie: This is pretty straightforward to do using Mata. This code will define a variable c in your Stata dataset that will take on the value "1" if the value of var1 for that observation appears anywhere in var2, otherwise "0".
    Code:
    putmata var1 var2, replace
    mata
    c=J(rows(var1),1,.)
    for (j=1;j<=rows(var1);j++) {
     c[j]=anyof(var2,var1[j])
    }
    end
    getmata c, force
    list var1 var2 c

    Comment


    • #3
      You have a couple of options here:
      1) use macro list commands:
      Code:
      levelsof var1, local(v1)
      levelsof var2, local(v2)
      local isnotinv1: list v2 - v1
      noi di "`isnotinv1'"
      local isinv1: list v1 & v2
      noi di "`isinv1'"
      But note that there are limits for macro lengths that may or may not bite you. See help limits where you see :
      (2) The maximum length of the contents of a macro are fixed in Stata/IC and settable in Stata/SE and Stata/MP. The currently set maximum length is recorded in c(macrolen); type display c(macrolen). The maximum length can be changed with set maxvar. If you set maxvar to a larger value, the maximum length increases; if you set maxvar to a smaller value, the maximum length decreases. The relationship between them is maximum_length = 33*maxvar + 200.
      2) use merge
      Code:
      save c:\temp\org.dta, replace  //save original dataset
      clonevar check=var2
      keep check
      drop if mi(check)
      save c:\temp\check.dta, replace
      clear
      use c:\temp\org.dta
      clonevar check=var1
      merge m:m check using c:\temp\check.dta
      
      **values in var2 that are also in var1
      tab check if _merge==3
      
      **values in var1 that are not in var2
      tab check if _merge==1
      
      **values in var2 that are not in var1
      tab check if _merge==2
      Does this give you what you want?
      Stata/MP 14.1 (64-bit x86-64)
      Revision 19 May 2016
      Win 8.1

      Comment


      • #4
        Is this not even simpler to do like this:

        Code:
        . bysort var1 var2:  gen numcooccur = sum(var1==var2)
        
        . list, clean
        
               var1   var2   numcoo~r  
          1.     40      .          0  
          2.     42     42          1  
          3.     42      .          0  
          4.     42      .          0  
          5.    546     28          0  
          6.    546      .          0  
          7.    547      .          0  
          8.    548      .          0  
          9.    683    547          0  
         10.    694      .          0  
         11.    751     41          0

        Comment


        • #5
          Originally posted by Joro Kolev View Post
          Is this not even simpler to do like this:
          ...
          Unfortunately, this approach does not, in general, yield the desired result. The approach fails if the same values are not sorted into the same observation, as is the case for value 547 in the given example.

          Best
          Daniel

          Comment


          • #6
            Originally posted by daniel klein View Post

            Unfortunately, this approach does not, in general, yield the desired result. The approach fails if the same values are not sorted into the same observation, as is the case for value 547 in the given example.
            True... Thank you for pointing out that very generally the code I proposed is a non-sense and does not yield the desired result.

            Comment


            • #7
              Let me try another shot at redemption.

              Does the following not yield a simpler solution:

              Code:
              . levelsof var2, local(second)  separate(,)
              28,41,42,547
              
              . gen coincide = inlist(var1,`second')
              
              . list , clean
              
                     var1   var2   coincide  
                1.    683    547          0  
                2.    547      .          1  
                3.    546     28          0  
                4.    548      .          0  
                5.    546      .          0  
                6.    694      .          0  
                7.     42     42          1  
                8.     40      .          0  
                9.     42      .          1  
               10.     42      .          1  
               11.    751     41          0

              Comment


              • #8
                My only concern with the inlist solution is hitting the maximum of 255 reals. Even with the macro list solution, there is a danger of hitting the limits if there are a lot of values. No clue how many unique values are in Minnie's dataset.
                Stata/MP 14.1 (64-bit x86-64)
                Revision 19 May 2016
                Win 8.1

                Comment


                • #9
                  Carole pointed out was I was going to write. Also, there might be precision issues with non-integer values.

                  Best
                  Daniel
                  Last edited by daniel klein; 21 Jan 2019, 09:08.

                  Comment


                  • #10
                    Thank you for all your replies! This is what I attempted - it isn't an elegant solution and does take a lot of time since my dataset has around 1000 observations.

                    Code:
                    local id1=_N
                    local id2=_N
                    gen match=.
                    
                    forval i=1/`id1'{
                    local var1=var1[`i']
                        forval x=1/`id2'{
                        replace match=1 if var1==var2[`x']
                        }
                    }
                    It does seem to be working. I will try the mata solution as well.

                    Comment


                    • #11
                      Originally posted by Minnie Smith View Post
                      Thank you for all your replies! This is what I attempted - it isn't an elegant solution and does take a lot of time since my dataset has around 1000 observations.

                      Code:
                      local id1=_N
                      local id2=_N
                      gen match=.
                      
                      forval i=1/`id1'{
                      local var1=var1[`i']
                      forval x=1/`id2'{
                      replace match=1 if var1==var2[`x']
                      }
                      }
                      It does seem to be working. I will try the mata solution as well.
                      As far as I see this double loop is pointless. If you want to do it with a loop, it should be done like this:

                      Code:
                      gen match2=.
                          forval x=1/`=_N' {
                          replace match2=1 if var1==var2[`x']
                          }

                      Comment


                      • #12
                        Code:
                        clear
                        input float(var1 var2)
                        683      547
                        547      .
                        546      028
                        548      .
                        546      .
                        694      .
                        042      042
                        040      .
                        042      .
                        042      .
                        751      041
                        end
                        
                        
                        fillin var1 var2
                        list if var1 == var2

                        Comment

                        Working...
                        X