  • regexm vs strpos for excluding part of a string

    Hi,

    I am learning to use regular expressions, and I am curious about the approach to take when trying to exclude a specific portion of text in a string with regexm.

    A simplification of my problem: I have a variable called rainbow, which can be: "pinkyellow redyellow yellowgreen yellow purple" and I want to generate a variable colours = 1 when rainbow contains yellow but not red.

    I can achieve this with strpos as follows: replace colours = 1 if strpos(rainbow,"yellow") > 0 & strpos(rainbow,"red") == 0

    How can I achieve the same result with regexm? That is, how do you exclude a specific string such as "red" using regexm?

    Thanks a lot!


  • #2
    Hi, learning regex is also learning when not to use a regex. In general, if you can use standard string functions, they will be faster. What you ask for is not where regular expressions are most useful. I doubt there is any single-pattern solution using regexm(), but ustrregexm() supports more operators, including negative look-behind and negative look-ahead. Below is an example, with timings, which shows that strpos() is much faster. The expressions tested are:
    Code:
    replace c1 = strpos(rainbow,"yellow") > 0 & strpos(rainbow,"red") == 0
    replace c2  = ustrregexm(rainbow,"(?<!red ?+)yellow(?! ?+red)")
    Code:
    clear
    input str20 rainbow
    "pinkyellow"
    "redyellow" 
    "yellowgreen" 
    "yellow purple"
    "yellowred"
    "yellow red"
    "red yellow"   
    end
    
    expand 10000 , gen(expanded)
    
    gen c1 = .
    gen c2 = .
    
    forvalues i = 1/100 {
    
        timer on 1
        replace c1 = strpos(rainbow,"yellow") > 0 & strpos(rainbow,"red") == 0
        timer off 1
    
        timer on 2
        replace c2  = ustrregexm(rainbow,"(?<!red ?+)yellow(?! ?+red)")
        timer off 2
    } 
    assert c1 == c2
    
    timer list
    timer clear
    
    drop if expanded
    drop expanded
    list
    Code:
    . timer list
       1:      1.36 /      100 =       0.0136
       2:     11.73 /      100 =       0.1173
    
    . timer clear
    
    . 
    . drop if expanded
    (69,993 observations deleted)
    
    . drop expanded
    
    . list
    
         +-------------------------+
         |       rainbow   c1   c2 |
         |-------------------------|
      1. |    pinkyellow    1    1 |
      2. |     redyellow    0    0 |
      3. |   yellowgreen    1    1 |
      4. | yellow purple    1    1 |
      5. |     yellowred    0    0 |
         |-------------------------|
      6. |    yellow red    0    0 |
      7. |    red yellow    0    0 |
         +-------------------------+
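
    A note on generality, offered as a sketch rather than a tested recommendation: as far as I can tell, the look-around pattern above only rejects "red" when it sits directly next to "yellow" (optionally separated by one space), so it agrees with strpos() on these seven values but need not agree on others. The extra value "red green yellow" and the variable name c3 below are my own illustrative assumptions, not part of the original example. A negative look-ahead anchored at the start of the string is one way to reject "red" anywhere:

```stata
clear
input str20 rainbow
"pinkyellow"
"redyellow"
"yellow purple"
"yellow red"
"red green yellow"
end

* reference result: contains "yellow" and does not contain "red" anywhere
gen c1 = strpos(rainbow, "yellow") > 0 & strpos(rainbow, "red") == 0

* ^(?!.*red) fails at the start if "red" occurs anywhere in the string;
* .*yellow then requires "yellow" somewhere in the string
gen c3 = ustrregexm(rainbow, "^(?!.*red).*yellow")

* stops with an error if the two approaches ever disagree
assert c1 == c3

list
```

    The strpos() version remains easier to read and, given the timings above, is likely the faster choice.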



    • #3
      You may try this (I added "blue" just for the sake of expanding the exercise):

      Code:
      . list
      
           +-------------+
           |        var1 |
           |-------------|
        1. |  pinkyellow |
        2. |   redyellow |
        3. | yellowgreen |
        4. |      yellow |
        5. |      purple |
           |-------------|
        6. |        blue |
           +-------------+
      
      . * you shall start by typing the commands below
      
      . gen color =0
      
      . replace color = 1 if regexm(var1, "yellow")
      (4 real changes made)
      
      . * that's done. Below, you may check it out
      
      . list
      
           +---------------------+
           |        var1   color |
           |---------------------|
        1. |  pinkyellow       1 |
        2. |   redyellow       1 |
        3. | yellowgreen       1 |
        4. |      yellow       1 |
        5. |      purple       0 |
           |---------------------|
        6. |        blue       0 |
           +---------------------+
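
      The listing above flags every value containing "yellow", including "redyellow". To also exclude "red", as the original question asks, one option is to combine two regexm() calls. This is a minimal sketch that reuses the variable names var1 and color from the listing; it has not been timed against strpos():

```stata
clear
input str20 var1
"pinkyellow"
"redyellow"
"yellowgreen"
"yellow"
"purple"
"blue"
end

* 1 if var1 matches "yellow" and does not match "red", 0 otherwise
gen color = regexm(var1, "yellow") & !regexm(var1, "red")

list
```

      With this version "redyellow" should be flagged 0, while the other values containing "yellow" stay at 1.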
      Hopefully that helps.
      Last edited by Marcos Almeida; 14 Jun 2018, 16:23.
      Best regards,

      Marcos



      • #4
        My view on regex functions appears identical to that of Bjarte Aagnes. They can be invaluable, but I've seen much time wasted by people struggling to find regular expression solutions to problems that are easily soluble with more mundane string functions. But in turn you don't become fluent with either group of functions without lots of practice.

        These are just some footnotes to earlier points:

        1. If you care just about whether a match exists, you can exploit the fact that a positive result for strpos() means true and a zero result means false. In this sense strpos() within logical expressions (those with results 0 or 1) can be thought of privately as indicating "contains" or "includes". It follows that you can often omit detail such as > 0 or == 0.

        2. strmatch() can also be useful if a minimal step towards regex is congenial.


        Code:
        clear
        input str20 rainbow
        "pinkyellow"
        "redyellow"
        "yellowgreen"
        "yellow purple"
        "yellowred"
        "yellow red"
        "red yellow"  
        end
        
        gen c1 = strpos(rainbow, "yellow") & !strpos(rainbow, "red")
        
        gen c2 = strmatch(rainbow, "*yellow*") & !strmatch(rainbow, "*red*")
        
        . list, sep(0)
        
             +-------------------------+
             |       rainbow   c1   c2 |
             |-------------------------|
          1. |    pinkyellow    1    1 |
          2. |     redyellow    0    0 |
          3. |   yellowgreen    1    1 |
          4. | yellow purple    1    1 |
          5. |     yellowred    0    0 |
          6. |    yellow red    0    0 |
          7. |    red yellow    0    0 |
             +-------------------------+
