  • Positive lookahead assertion regex query

    Hello all:

    This is related to an earlier post of mine on 'negative lookahead assertion' that you answered. I am trying to match the numbers 1 through 22, one for each chromosome. But the regex is also picking up chromosome 2 for 22, and 1 and 6 for 16, which is incorrect. How can I restrict the match to the appropriate chromosome?

    For obs 5 below, it picks 2qe and 6qe as matches, both of which are incorrect. I am separating individual chromosome arm losses.

    Code:
    forvalue num = 1/22{
        gen cnaloss`num'p = ustrregexs(0) if ustrregexm(cnaloss, "(`num'p.{0,1})") 
        gen cnaloss`num'q = ustrregexs(0) if ustrregexm(cnaloss, "(`num'q.{0,1})") 
    }

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input int obs str64 cnaloss
      1 ""                                                                
      2 ""                                                                
      3 ""                                                                
      4 "12pe"                                                            
      5 "5qe, 12qe, 16qe"                                                 
      6 ""                                                                
      7 "7pe, 7qe"                                                        
      8 "5qe, 7qe, 12pe"                                                  
      9 ""                                                                
     10 "5qe, 16qe, 17pe"                                                 
     11 "3qe, 5qe, 7pe, 7qe"                                              
     12 ""                                                                
     13 "2qe, 3pe, 3qe, 5qe, 12pe, 18pe, 18qe"                            
     14 "5qe, 7qe, 12pe, 17qe"                                            
     15 "5qe, 7qe, 9pe, 11qe, 13pe, 16pe, 16qe"                           
     16 "5qc, 16pe"                                                       
     17 ""                                                                
     18 ""                                                                
     19 "3pe, 5qe, 11qe, 16qe"                                            
     20 ""                                                                
     21 ""                                                                
     22 "5qc, 7qe, 12pc, 14pc, 16qe, 17pe, 18qe, 20qe"                    
     23 "3qe, 5qe, 8pe, 13pe, 14qe, 15pe, 16qe, 17pe, 18qe"               
     24 "5qe, 6qe, 19pe"                                                  
     25 ""                                                                
     26 ""                                                                
     27 ""                                                                
     28 ""                                                                
     29 "9qe, 18pc, 18qc"                                                 
     30 ""                                                                
     31 "7pc, 7qc, 11pe, 17pe"                                            
     32 ""                                                                
     33 ""                                                                
     34 ""                                                                
     35 "9qe, 21qe"                                                       
     36 "3pe, 3qe, 5qe, 7pe, 7qe, 8pe"                                    
     37 ""                                                                
     38 ""                                                                
     39 ""                                                                
     40 "5qc, 15pe, 16qc"                                                 
     41 "2qe, 3pe, 5qe, 10pe, 12pe, 20pe"                                 
     42 "3pe, 5qc, 7qc"                                                   
     43 ""                                                                
     44 "3pc, 5qe, 11pc"                                                  
     45 "3pe, 5qe, 7qe, 16pe, 16qe, 20qe"                                 
     46 "5qc, 7pc, 7qc, 16pe, 17pe, 18pc, 19pe"                           
     47 ""                                                                
     48 "4qe, 5qe, 17pe, 18pe"                                            
     49 "5qe"                                                             
     50 ""                                                                
     51 "7pe, 7qe"                                                        
     52 "5qc"                                                             
     53 "3qe, 4qe, 5qe, 7pe, 13pe, 16qe, 17pe, 19pe "                     
     54 "3pe, 5qe, 7qe, 13pe"                                             
     55 ""                                                                
     56 "4pe, 4qe, 5pe, 5qe, 7pe, 7qe, 12pe, 12qe, 13pe, 16qe, 17pe, 17qe"
     57 "2pe, 5qe, 7qe, 16qe, 17pe"                                       
     58 "3pe, 4qe, 7qe, 9qe, 11pc"                                        
     59 "5qc, 11pe"                                                       
     60 "4qe, 5qe, 7qe, 11pe, 12pe, 15pe, 16qe, 20qe"                     
     61 ""                                                                
     62 "3pe, 5qe, 7pe, 16pe, 16qe, 17pe, 18pe, 18qe, 20qe"               
     63 "5qe, 6qe, 16qe"                                                  
     64 "2pe, 2qe, 3pe, 7pe, 7qe, 12pe, 17pe, 20qe"                       
     65 "5qe, 7pe, 7qe, 17pe, 18pe, 18qe, 20qe"                           
     66 "3pe, 5qe, 7pe, 7qe, 10pe, 12pe"                                  
     67 ""                                                                
     68 ""                                                                
     69 "6pe, 6qe, 17pe, 17qe, 21pe, 21qe"                                
     70 ""                                                                
     71 "5qe, 7pe, 12pe, 16qe, 20pe"                                      
     72 ""                                                                
     73 "5qe, 7pe"                                                        
     74 ""                                                                
     75 ""                                                                
     76 "5qe, 7qe, 12qe, 16pe, 17pe"                                      
     77 ""                                                                
     78 ""                                                                
     79 ""                                                                
     80 "5qe, 7pe, 7qe"                                                   
     81 "7pe, 7qe"                                                        
     82 ""                                                                
     83 ""                                                                
     84 "2qe, 5qe, 16pe"                                                  
     85 ""                                                                
     86 ""                                                                
     87 "5qe, 7qe, 16pe, 17pe, 20qe"                                      
     88 ""                                                                
     89 "5qc, 7qc, 8pe, 18pc, 20qc"                                       
     90 "5qe, 7pe, 7qe, 11pe, 17pe"                                       
     91 ""                                                                
     92 "5qe, 8qe, 18pe"                                                  
     93 ""                                                                
     94 ""                                                                
     95 ""                                                                
     96 ""                                                                
     97 ""                                                                
     98 ""                                                                
     99 "5qc, 11qe, 17pc, 20qc"                                           
    101 "5qe, 7qe, 15pe"                                                  
    end

  • #2
    Here is some technique that ignores regular expressions

    Code:
    gen work = cnaloss 
    
    quietly foreach x in q c e p {
        replace work = subinstr(work, "`x'", "", .)
    }
    
    split work, destring parse(,)
    local parts `r(varlist)'
    
    forval i = 1/22 { 
        gen is`i' = 0 
        
        quietly foreach v of local parts { 
            replace is`i' = 1 if `v' == `i' 
        }
    }
    
    su is*
    and some technique that doesn't

    Code:
    * ssc install moss 
    moss cnaloss, match("([0-9]+)") regex 
    
    * proceed as above

    • #3
      This is quite useful, Nick Cox. I should have been clearer earlier. I need the match to include the p or q after the number, for example a loss of 12p (short arm of chromosome 12) or 17q (long arm of chromosome 17). How do I adapt the above code to do that? I tried the code below and it still picks 6q as a match for 16q. The c and e are extra suffixes for partial or equivocal losses.

      Code:
      forvalue num = 1/22{
          gen cnaloss`num'p = ustrregexm(cnaloss, "`num'(?![0-9]+)p.{0,1}") if !missing(cnaloss)
          gen cnaloss`num'q = ustrregexm(cnaloss, "`num'(?![0-9]+)q.{0,1}") if !missing(cnaloss)
          gen cnagain`num'p = ustrregexm(cnagain, "`num'(?![0-9]+)p.{0,1}") if !missing(cnaloss) 
          gen cnagain`num'q = ustrregexm(cnagain, "`num'(?![0-9]+)q.{0,1}") if !missing(cnaloss) 
          }
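
A note on why the lookahead in #3 does not help: `(?![0-9]+)` inspects what follows the number, but in 16q the unwanted digit comes before the 6. One possible fix is a negative lookbehind. This is only a minimal sketch, assuming Stata 14 or later (where the `ustrregex` functions use the ICU engine, which supports lookbehind). The two rows are obs 5 and obs 24 from the data example, and the variable names `ahead6q` and `behind6q` are made up for illustration.

```stata
clear
input str20 cnaloss
"5qe, 12qe, 16qe"
"5qe, 6qe, 19pe"
end

* lookahead as tried in #3: only checks that no digit FOLLOWS the 6,
* so the 6 inside 16qe still matches
gen byte ahead6q  = ustrregexm(cnaloss, "6(?![0-9]+)q.{0,1}")

* negative lookbehind: the character BEFORE the 6 must not be a digit
gen byte behind6q = ustrregexm(cnaloss, "(?<![0-9])6q.{0,1}")

list
```

In the first row only `ahead6q` should be 1 (the false match on 16qe); in the second row both should be 1.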

        • #5
          There may be a way to get this right with regular expressions, but I don't see it. (Of course, my regular expressions skills are not the best, so take that with a pinch of salt.) But here's a simple way to get what you want:
          Code:
          preserve
          split cnaloss, parse(", ") gen(token)
          reshape long token, i(obs)
          drop if missing(token)
          drop _j
          levelsof token, local(tokens)
          
          restore
          foreach t of local tokens {
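              * wrap the list and the token in delimiters so that, e.g., 6qe does not match inside 16qe
              gen byte cnaloss_`t' = strpos(", " + strtrim(cnaloss) + ", ", ", `t', ") > 0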
          }
          The key here is not to focus on the numbers, since the numbers themselves lack enough information to determine the handling you want. I think you have to just do it whole token by whole token.

          Added: This is very similar to what Nick proposed in #2 as a non-regex approach. The difference is that this one does not look separately at the numbers and letters: it treats the number-letter combination (token) as a whole.
          Last edited by Clyde Schechter; 03 Sep 2023, 18:28.

          • #6
            Looks like you have a very specific pattern. Just specify it explicitly.

            Code:
            forval i=1/22{
                gen isp`i'= ustrregexm(cnaloss, "\b`i'p[a-z]\b")
                gen isq`i'= ustrregexm(cnaloss, "\b`i'q[a-z]\b")
            }
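
On why `\b` works here: a word boundary sits between a word character (letter, digit or underscore) and a non-word character such as a comma or space, or at the start or end of the string. There is no boundary between the 1 and the 6 in 16qe, so the bounded pattern cannot start there. A small sketch to illustrate, using one row from the data example plus two made-up rows (no space after the comma, and a single token). The variable names are assumptions for illustration only.

```stata
clear
input str20 cnaloss
"5qe, 12qe, 16qe"
"6qe,16pe"
"6qe"
end

* no anchoring: matches the 6qe inside 16qe in row 1
gen byte unanchored = ustrregexm(cnaloss, "6q[a-z]")

* word boundaries on both sides: the token must stand alone
gen byte bounded    = ustrregexm(cnaloss, "\b6q[a-z]\b")

list
```

Row 1 should show `unanchored` = 1 and `bounded` = 0; rows 2 and 3 should show both equal to 1, since a comma or the string edge is enough for a boundary.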

            • #7
              I overlooked the mention of p and q in #1 but

              Code:
              moss cnaloss, match("([0-9]+[pq])") regex
              finds the trailing p or q as desired, after which a set of indicators can be created.
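
One way the indicators might then be built is sketched below. It assumes `moss` (from SSC) is installed and that it leaves its matches in variables named `_match1`, `_match2`, and so on, which to my knowledge is its default. The character class is written `[pq]` here, since `|` inside brackets is a literal character. The indicator names `isp#` and `isq#` are borrowed from #6, and the two rows are obs 5 and obs 16 from the data example.

```stata
clear
input str20 cnaloss
"5qe, 12qe, 16qe"
"5qc, 16pe"
end

* ssc install moss
moss cnaloss, match("([0-9]+[pq])") regex

* one indicator per chromosome arm: 1 if any extracted match equals
* the whole number-plus-arm string, so 6q cannot match 16q
forval i = 1/22 {
    foreach arm in p q {
        gen byte is`arm'`i' = 0
        foreach v of varlist _match* {
            quietly replace is`arm'`i' = 1 if `v' == "`i'`arm'"
        }
    }
}

list cnaloss isq5 isq6 isq12 isq16 isp16
```

Because each extracted match is compared as a whole string, no word boundaries or lookarounds are needed at this stage.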

              • #8
                Thanks Clyde Schechter, Nick Cox and Andrew Musau for all your approaches. All of them worked for the issue at hand. I adapted Andrew Musau's version and added a bit to separate equivocal ("-e" suffix) and complete ("-c" suffix) losses, coded 1 and 2 respectively for each chromosome loss. Good to know that two commas may also serve as word boundaries. Somehow, I always thought word boundaries in regex required at least one space.

                My final version:
                Code:
                forval i=1/22{
                    gen isp`i'= ustrregexm(cnaloss, "\b`i'p[a-z]\b")
                    replace isp`i'= 2 if ustrregexm(cnaloss, "\b`i'pc\b")
                    gen isq`i'= ustrregexm(cnaloss, "\b`i'q[a-z]\b")
                    replace isp`i'= 2 if ustrregexm(cnaloss, "\b`i'qc\b")
                }

                • #9
                  Sorry, my final code has a copy/paste error in p and q.

                  This was the final version.
                  Code:
                  forval i=1/22{
                      gen isp`i'= ustrregexm(cnaloss, "\b`i'p[a-z]\b") 
                      replace isp`i'= 2 if ustrregexm(cnaloss, "\b`i'pc\b")
                      replace isp`i' = . if missing(cnaloss)
                      gen isq`i'= ustrregexm(cnaloss, "\b`i'q[a-z]\b") 
                      replace isq`i'= 2 if ustrregexm(cnaloss, "\b`i'qc\b") 
                      replace isq`i' = . if missing(cnaloss)
                  }
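
A quick way to check the final coding against the problem case from #1 is sketched below. It uses only three rows (obs 5, 16 and 17 from the data example) and only the q arm of chromosomes 5, 6 and 16, purely to keep it short. The expected values in the `assert` lines follow from the coding described in #8 (1 for any suffix, 2 for the "c" suffix, missing when `cnaloss` is empty).

```stata
clear
input int obs str20 cnaloss
 5 "5qe, 12qe, 16qe"
16 "5qc, 16pe"
17 ""
end

foreach i of numlist 5 6 16 {
    gen byte isq`i' = ustrregexm(cnaloss, "\b`i'q[a-z]\b")
    replace isq`i' = 2 if ustrregexm(cnaloss, "\b`i'qc\b")
    replace isq`i' = . if missing(cnaloss)
}

* obs 5: 16q is flagged, 6q is not
assert isq5 == 1 & isq6 == 0 & isq16 == 1 if obs == 5
* obs 16: 5qc is coded 2
assert isq5 == 2 if obs == 16
* obs 17: empty string gives missing
assert missing(isq5) if obs == 17
```

If all three `assert` lines pass silently, the false match on 6q for 16q is gone.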
