Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Keep rows if two consecutive columns have the same value

    I am trying to filter rows based on a condition where two consequtive columns have a particular value. Say I have the following data consisting of an "id" and indicators if they appear in a series of interviews ("inw8-inw12"). I am trying to keep only those ids that have appeared in two consecutive interviews.

    The sample data is as follows,
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long id byte(inw8 inw9 inw10 inw11 inw12)
    1010 1 0 1 1 0
    3020 0 0 0 0 0
    3030 1 0 1 0 1
    4010 1 1 0 0 0
    4040 0 0 1 1 1
    3010 1 0 0 0 1
    3040 1 1 1 1 1
    The result of my sample data:
    Code:
    1010 1 0 1 1 0
    4010 1 1 0 0 0
    4040 0 0 1 1 1
    3040 1 1 1 1 1

  • #2
    You might want to wait until somebody comes up with a smartest solution. But this below (while not very smart) seems to do the trick:

    Code:
    . egen in8and9 = rowtotal(inw8 inw9)
    
    . egen in9and10 = rowtotal(inw9 inw10)
    
    . egen in10and11 = rowtotal(inw10 inw11)
    
    . egen in11and12 = rowtotal(inw11 inw12)
    
    . keep if in8and9==2 | in9and10==2 | in10and11==2 | in11and12==2
    (3 observations deleted)
    
    . list, clean
    
             id   inw8   inw9   inw10   inw11   inw12   in8and9   in9and10   in10a~11   in11a~12  
      1.   1010      1      0       1       1       0         1          1          2          1  
      2.   4010      1      1       0       0       0         2          1          0          0  
      3.   4040      0      0       1       1       1         0          1          2          2  
      4.   3040      1      1       1       1       1         2          2          2          2  
    
    .

    Comment


    • #3
      This is a simpler way:

      Code:
      clear
      input long id byte(inw8 inw9 inw10 inw11 inw12)
      1010 1 0 1 1 0
      3020 0 0 0 0 0
      3030 1 0 1 0 1
      4010 1 1 0 0 0
      4040 0 0 1 1 1
      3010 1 0 0 0 1
      3040 1 1 1 1 1
      end
      
      * This variable holds the sample
      gen insample=0
      * Initial variable to compare
      loc lx = "inw8"
      * Loop over the remaining variables
      foreach x in inw9 inw10 inw11 inw12 {
              * Keep in sample if variable equals 1 and it's the same as previous variable
              replace insample = 1 if `x'==`lx' & `x'==1
              * Update variable to compare
              loc lx = "`x'"
      }
      Jorge Eduardo Pérez Pérez
      www.jorgeperezperez.com

      Comment


      • #4
        Code:
        egen all_inw = concat(inw*)
        keep if strpos(all_inw, "11")

        Comment


        • #5
          This code provides similar results (inv12 was dropped, but I suspect it won't in larger datasets:

          Code:
          reshape long inw, i(id) j(interview)
          by id, sort: gen myseq = inw == inw[_n+1] & inw != 0
          keep if myseq ==1
          drop myseq
          reshape wide
          mvencode _all, mv(0) override
          list
          Best regards,

          Marcos

          Comment


          • #6
            Romalpa Akzo Rather dexterous and clever, your code!
            Best regards,

            Marcos

            Comment


            • #7
              Code:
              keep if inlist( 2, inw8+inw9, inw9+inw10, inw10+inw11, inw11+inw12 )

              Comment


              • #8
                Thank you all for the great solutions. @Romalpa Akzo That was a great solution and works perfectly.

                Comment


                • #9
                  Marcos Almeida, thanks for your encouragement. Indeed, the solution is originally credited to Nick Cox at this thread. Overthere, you might also notice the brilliant suggestion by Nick on utilizing - strmatch- as a sharp and effective tool to deal with a similar but broader issue. For me, such suggestion is quite valuable and really enjoyable.

                  Comment


                  • #10
                    Kudos to Nick Cox as well, that's a really great strategy to tackle this issue. I'm glad to learn it.
                    Best regards,

                    Marcos

                    Comment


                    • #11
                      The very fine string based matching in #4 can be used for matching much more complicated patterns than this case. For this pattern match the solution using -inlist()- in #7 will be most computational efficient, in both running time and space used, which can matter for larger data and many repetitions. But, specifying arguments to inlist() might be too much work. To build the arguments a little code generating utility is drafted below.

                      Syntax to reproduce the example, matching two consecutive interviews:
                      Code:
                      utilseqvars inw*, ns(2) display
                      keep if inlist(2, `s(args)')
                      In the examples below -keep- is replaced with -display-

                      * Read example data
                      Code:
                      * Example generated by -dataex-. To install: ssc install dataex
                      clear
                      input long id byte(inw8 inw9 inw10 inw11 inw12)
                      1010 1 0 1 1 0
                      3020 0 0 0 0 0
                      3030 1 0 1 0 1
                      4010 1 1 0 0 0
                      4040 0 0 1 1 1
                      3010 1 0 0 0 1
                      3040 1 1 1 1 1
                      end
                      * Define utility to build arguments for inlist()
                      Code:
                      ********************************************************************************
                      prog def utilseqvars , sclass
                      
                      syntax varlist , ns(integer) [display]
                      
                      local nvars : list sizeof varlist
                      
                      if ( `ns' > `nvars' ) {
                      
                          di as err "ns `ns' > `nvars' variables"
                          exit
                      } 
                      
                      foreach v of varlist `varlist' {
                      
                          local c = `c' + 1 
                          local s = "`s'" + " " + "`v'"
                          local sep = cond( "`args'" != "", ", " ,"" ) 
                      
                          if ( `c' > `ns' ) gettoken first s : s
                      
                          if ( `: list sizeof s '  == `ns' ) {
                      
                              local args = "`args'" + "`sep'" + subinstr( trim("`s'"), " ", "+", .) 
                          }
                      }
                      
                      sreturn local args = "`args'"
                      
                      if ( "`display'" != "" ) {
                      
                          di as text "`args'"
                      }
                      
                      end /* utilseqvars */
                      ********************************************************************************
                      * Run example:
                      Code:
                      utilseqvars inw*, ns(2) display
                      list if inlist(2, `s(args)')
                      
                      utilseqvars inw* , ns(3) display
                      list if inlist(3, `s(args)')
                      * Results:
                      Code:
                      . utilseqvars inw*, ns(2) display
                      inw8+inw9, inw9+inw10, inw10+inw11, inw11+inw12
                      
                      . list if inlist(2, `s(args)')
                      
                           +--------------------------------------------+
                           |   id   inw8   inw9   inw10   inw11   inw12 |
                           |--------------------------------------------|
                        1. | 1010      1      0       1       1       0 |
                        4. | 4010      1      1       0       0       0 |
                        5. | 4040      0      0       1       1       1 |
                        7. | 3040      1      1       1       1       1 |
                           +--------------------------------------------+
                      
                      . 
                      . utilseqvars inw* , ns(3) display
                      inw8+inw9+inw10, inw9+inw10+inw11, inw10+inw11+inw12
                      
                      . list if inlist(3, `s(args)')
                      
                           +--------------------------------------------+
                           |   id   inw8   inw9   inw10   inw11   inw12 |
                           |--------------------------------------------|
                        5. | 4040      0      0       1       1       1 |
                        7. | 3040      1      1       1       1       1 |
                           +--------------------------------------------+

                      Comment


                      • #12
                        Bjarte Aagnes’s solutions in #7 and #11 would be making sense when computational efficiency is a big concern. For the same purpose, below code might be a little bit simpler while not less efficient.
                        Code:
                        keep if inw8*inw9 + inw9*inw10 + inw10*inw11 + inw11*inw12

                        Comment

                        Working...
                        X