Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Text splitting with more than one repeating value (I am using moss but appreciate other possible solutions)

    Hi,

    I have the following data

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str200 var1
    "='General Inputs'!AC107"                                                                                                                                                                                 
    "='NC_Opex'!AL37"                                                                                                                                                                                         
    "='Model - Revenue Impact'!N$41"                                                                                                                                                                          
    "=SUMIFS('S8.Billed Quantities+Revenues'!$V,'S8.Billed Quantities+Revenues'!$D,$B364,'S8.Billed Quantities+Revenues'!$E,$C364,'S8.Billed Quantities+Revenues'!$F,$D364,'S8.Billed Quantities+Revenues'!$G"
    end
    I want to extract any information between ' '.

    I am using the following formula


    Code:
    moss var1, match("(\'.+\')") regex max(1)
    But my problem is that I only need the First instance where we have ' ',

    The ending solution i.e., var2 should look like the following for the last observation S8.Billed Quantities+Revenues.


    Any help would be greatly appreciated.

    Thanks

  • #2
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str200 var1
    "='General Inputs'!AC107"                                                                                                                                                                                 
    "='NC_Opex'!AL37"                                                                                                                                                                                         
    "='Model - Revenue Impact'!N$41"                                                                                                                                                                          
    "=SUMIFS('S8.Billed Quantities+Revenues'!$V,'S8.Billed Quantities+Revenues'!$D,$B364,'S8.Billed Quantities+Revenues'!$E,$C364,'S8.Billed Quantities+Revenues'!$F,$D364,'S8.Billed Quantities+Revenues'!$G"
    end
    
    g wanted= ustrregexra(var1, ".*\'(.*)\'.*","$1")
    Res.:

    Code:
    . l wanted
    
         +-------------------------------+
         |                        wanted |
         |-------------------------------|
      1. |                General Inputs |
      2. |                       NC_Opex |
      3. |        Model - Revenue Impact |
      4. | S8.Billed Quantities+Revenues |
         +-------------------------------+

    Comment


    • #3
      Code:
      loc from substr(var1, strpos(var1, char(39))+1, .)
      gen var2 = substr(`from', 1, strpos(`from', char(39))-1)
      

      Comment


      • #4
        A reg-ex based solution which run close to the speed of #3:
        Code:
         gen wanted3 = regexs(1) if regexm(var1, "'([^']*)" )

        Comment


        • #5
          Thanks Both,

          Bjarte Aagnes both of your solutions work great.

          Comment


          • #6
            Originally posted by Bjarte Aagnes View Post
            A reg-ex based solution which run close to the speed of #3:
            Code:
             gen wanted3 = regexs(1) if regexm(var1, "'([^']*)" )
            A secondary question. What if we want to find the 2nd instance of words between ' '. How would we change the formula?

            Thanks

            Comment


            • #7
              gen m2 = regexs(2) if regexm(var1, "'([^']*)'[^']*'([^']*)" )
              Last edited by Bjarte Aagnes; 27 Jul 2022, 02:20.

              Comment


              • #8
                ref #7. You might next want to find 3 or more instances: First, using regexm() up to nine capture groups can be defined. Second, we may either have data with a known number of elements to match (in each observation), or a unknown number (1-9) of elements. The following example code, which should be copied to the do file editor and run from start to end, show code for up to nine matches. First for a unknown number of matches, next for a known number of (9) matches.

                Some example data:
                Code:
                * Example generated by -dataex-. For more info, type help dataex
                clear
                input str135 var1
                "noise'AAA'noice"                                                                                                                        
                "noise'AAA'noicenoise'BBB'noice"                                                                                                        
                "noise'AAA'noicenoise'BBB'noicenoise'CCC'noice"                                                                                          
                "noise'AAA'noicenoise'BBB'noicenoise'CCC'noicenoise'DDD'noice"                                                                          
                "noise'AAA'noicenoise'BBB'noicenoise'CCC'noicenoise'DDD'noicenoise'EEE'noice"                                                            
                "noise'AAA'noicenoise'BBB'noicenoise'CCC'noicenoise'DDD'noicenoise'EEE'noicenoise'FFF'noice"                                            
                "noise'AAA'noicenoise'BBB'noicenoise'CCC'noicenoise'DDD'noicenoise'EEE'noicenoise'FFF'noicenoise'GGG'noice"                              
                "noise'AAA'noicenoise'BBB'noicenoise'CCC'noicenoise'DDD'noicenoise'EEE'noicenoise'FFF'noicenoise'GGG'noicenoise'HHH'noice"              
                "noise'AAA'noicenoise'BBB'noicenoise'CCC'noicenoise'DDD'noicenoise'EEE'noicenoise'FFF'noicenoise'GGG'noicenoise'HHH'noicenoise'III'noice"
                end
                
                expand 1000
                If a unknown number of matches:
                Code:
                local cpt '([^']*)'
                local oth [^']*
                
                gen len = strlen(var1) - strlen(subinstr(var1, char(39), "", .))
                qui su len , meanonly
                local np = r(max)/2
                
                forvalues i = 1/`np' {
                
                   local rx = `i' * "`cpt'`oth'"
                   gen match_`i' = regexs(`i') if regexm(var1, "`rx'" )
                }
                If all observations have the same known number of matches a faster alternative is
                Code:
                keep if  len==18  // exactly 9 pairs for 9 regex capture groups
                
                local rx = 9 * "`cpt'`oth'" // build regex
                
                tempvar match
                
                gen `match' =  ///
                  regexs(1) + "||" ///
                + regexs(2) + "||" ///
                + regexs(3) + "||" ///
                + regexs(4) + "||" ///
                + regexs(5) + "||" ///
                + regexs(6) + "||" ///
                + regexs(7) + "||" ///
                + regexs(8) + "||" ///
                + regexs(9) + "||" ///
                     if regexm(var1, "`rx'" )  // matches exactly 9 pairs
                
                forvalues i=1/9 {
                
                    gen m_`i' = substr(`match', 1, strpos(`match',"||")-1)
                    replace `match' = substr(`match', strpos(`match',"||")+2,.)
                }
                

                Comment


                • #9
                  Thanks for going an extra mile and showing me the loop for finding more than one instances. Worked like a charm. Much appreciated.

                  Comment


                  • #10
                    Thanks, #8 covered the regexm() solution. If running time matters, for large data and/or many observations/repetitions, or more than 9 pairs (max # capture groups) a extension of #3 is:
                    Code:
                    tempvar match 
                    gen `match' = var1
                        
                    forvalues i=1/9 {
                    
                        loc from substr(`match', strpos(`match', char(39))+1, .)    
                        gen m_`i' = substr(`from', 1, strpos(`from', char(39))-1)    
                        replace `match' = subinstr(`match', char(39) + m_`i' + char(39), "", 1)
                    }

                    Comment


                    • #11
                      I have a very allied query. So I am asking in this thread instead of reposting.

                      I want to get the block of text matching from "ER" to the first instance of the options as below. PR would be the most constant end of block string.

                      I could get as far as:

                      replace ertest = ustrregexs(0) if ustrregexm(Diagnosis_Text, "ER.*(nuclei|expression|PR|%|-|pending)")

                      Code:
                      * Example generated by -dataex-. For more info, type help dataex
                      clear
                      input str611 ertest
                      "ER: 34%"                                                                                                                                                
                      "ER: Strong nuclear staining in >90% of invasive tumor nuclei"                                                                                           
                      "ER: Strong expression in >90% of invasive tumor nuclei"                                                                                                 
                      "ER: strong expression in >90% of invasive nuclei"                                                                                                       
                      "ER: Strong expression in >90% of invasive nuclei"                                                                                                       
                      "ER: Strong expression in > 90% of invasive tumor nuclei"                                                                                                
                      "ER: Negative"                                                                                                                                           
                      "ER: strong expression in >90% of invasive nuclei"                                                                                                       
                      "ER: Strong expression in >90% of invasive nuclei"                                                                                                       
                      "ER: Strong expression in 90% of invasive tumor nuclei"                                                                                                  
                      "ER: strong expression in 90%"                                                                                                                           
                      "ER: Weak expression in 50% of invasive nuclei"                                                                                                          
                      "ER (>95%, strong staining) and PR (>95%"                                                                                                                
                      "ER (>95%"                                                                                                                                               
                      "ER: expressed in 95% of invasive tumor nuclei"                                                                                                          
                      "ER:  Strong staining in >90% of invasive nuclei (in-"                                                                                                   
                      "ER: Strong expression in >90 % of invasive nuclei"                                                                                                      
                      "ER: No expression"                                                                                                                                      
                      "ER: Strong expression in >90% of invasive tumor nuclei"                                                                                                 
                      ""                                                                                                                                                       
                      "ER: Faint expression in 5% of invasive tumor nuclei"                                                                                                    
                      "ER:  Strong staining in 98% of tumor nuclei"                                                                                                            
                      "ER: Strong expression in >95% of invasive nuclei"                                                                                                       
                      "ER: Strong expression in >95% of invasive nuclei"                                                                                                       
                      "ER: moderate expression in 90% of invasive tumor nuclei"                                                                                                
                      "ER: Moderate expression in 90% of invasive tumor nuclei"                                                                                                
                      "ER: Weak expression in 25% of invasive tumor nuclei"                                                                                                    
                      ""                                                                                                                                                       
                      "ER: strong expression in >90% of invasive nuclei"                                                                                                       
                      ""                                                                                                                                                       
                      "ER: Strong expression in >90% of invasive tumor nuclei"                                                                                                 
                      "ER: strong expression in >95% of invasive nuclei"                                                                                                       
                      "ER: strong expression in >95% of invasive nuclei"                                                                                                       
                      "ER: Strong expression in >95% of invasive tumor nuclei"                                                                                                 
                      "ER: Moderate to strong expression in 80% invasive nuclei"                                                                                               
                      "ER: Strong expression in >90% of invasive nuclei"                                                                                                       
                      "ER: strong expression in 95% of invasive nuclei"                                                                                                        
                      "ER: Strong expression in 95% of invasive nuclei"                                                                                                        
                      "ER: No expression of invasive tumor nuclei (0%"                                                                                                         
                      "ER: Moderate and weak expression in 30% of tumor nuclei"                                                                                                
                      "ER: strong to moderate expression in >90% of invasive nuclei"                                                                                           
                      "ER: Strong to moderate expression in >90% of invasive nuclei"                                                                                           
                      "ER: Strong expression in 90% of invasive tumor nuclei"                                                                                                  
                      ""                                                                                                                                                       
                      "ER: Moderate expression in 50% of invasive nuclei"                                                                                                      
                      ""                                                                                                                                                       
                      "ER: Strong expression in >90% of invasive tumor nuclei"                                                                                                 
                      "ER: No expression in invasive nuclei"                                                                                                                   
                      "ER: No expression"                                                                                                                                      
                      "ER: Strong expression in >90% of invasive nuclei"                                                                                                       
                      "ER: Strong expression in 90% of invasive nuclei"                                                                                                        
                      "ER: strong expression in 90% of invasive nuclei"                                                                                                        
                      "ER:  Strong staining in >95% of invasive nuclei"                                                                                                        
                      "ER: No expression in invasive tumor nuclei"                                                                                                             
                      "ER: no staining"                                                                                                                                        
                      "ER (no staining identified), PR (no staining identified), and HER2 (score 0). Ki67 marks approximately 20%"                                             
                      "ER: strong staining in 100% of invasive tumor nuclei"                                                                                                   
                      "ER: Strong expression in 95% of invasive tumor nuclei"                                                                                                  
                      "ER: Strongly positive in 80% of tumor cell nuclei"                                                                                                      
                      "ER: strong to moderate expression in >90% of invasive nuclei"                                                                                           
                      "ER: Moderate to strong expression in >90% of invasive nuclei"                                                                                           
                      "ER: Strong expression in >95% of in situ nuclei"                                                                                                        
                      "ER: Strong expression in >95 % of DCIS nuclei"                                                                                                          
                      "ER: Strong expression in >95% of invasive tumor nuclei"                                                                                                 
                      "ER (>95%, strong staining) and PR (>90%, moderate to strong staining). HER2 via IHC is negative (score 0). Ki67 marks <10% of invasive carcinoma nuclei"
                      "ER: strong to moderate expression in 90% of invasive nuclei"                                                                                            
                      "ER: Strong to moderate expression in 90% of invasive nuclei"                                                                                            
                      "ER: Weak expression in 1% of invasive nuclei"                                                                                                           
                      "ER: Positive, 100%"                                                                                                                                     
                      "ER: Strong expression in >90% of invasive nuclei"                                                                                                       
                      "ER: Moderately positive in >51%"                                                                                                                        
                      "ER:  Strong staining in 90%"                                                                                                                            
                      ""                                                                                                                                                       
                      "ER: Strong expression in >90% of invasive nuclei"                                                                                                       
                      ""                                                                                                                                                       
                      "ER:  100% of invasive nuclei"                                                                                                                           
                      "ER: Weak to faint expression in 1% of LCIS nuclei"                                                                                                      
                      "ER: Strong expression in >90% of invasive nuclei"                                                                                                       
                      "ER: Strong expression in 100% of invasive carcinoma nuclei, per report.     - PR: Moderate expression in 80% of invasive carcinoma nuclei"         
                      "ER: Strong expression in 80-90% of tumor nuclei"                                                                                                        
                      ""                                                                                                                                                       
                      "ER: Strong expression in >90% of invasive nuclei"                                                                                                       
                      ""                                                                                                                                                       
                      "ER: Negative, per report"                                                                                                                               
                      "ER: weak to faint expression in 30% of tumor nuclei"                                                                                                    
                      "ER: Negative (0%"                                                                                                                                       
                      "ER: No expression"                                                                                                                                      
                      ""                                                                                                                                                       
                      "ER: Strong expression in 99% of invasive tumor nuclei"                                                                                                  
                      ""                                                                                                                                                       
                      "ER: Strong expression in 99% of invasive tumor nuclei"                                                                                                  
                      "ER: Weak expression in 20% of DCIS nuclei"                                                                                                              
                      "ER: Expression in 82% of invasive tumor nuclei"                                                                                                         
                      "ER: weak expression in rare cells (<1%"                                                                                                                 
                      "ER: Strong expression in >95% of DCIS nuclei"                                                                                                           
                      "ER: Strong expression in 90% of invasive nuclei"                                                                                                        
                      "ER: Strong expression in >90% of invasive tumor nuclei"                                                                                                 
                      "ER: Strong expression in >90% of in situ nuclei"                                                                                                        
                      "ER: strong expression in >90% of insitu nuclei"                                                                                                         
                      "ER: strong expression in >95% of invasive nuclei"                                                                                                       
                      end

                      Comment

                      Working...
                      X