
  • regex expression, extract the number from string

    Hi, dear all
    I want to extract all the numbers from the string, using lookahead and lookbehind zero-length assertions (https://bedigit.com/blog/regex-how-t...cular-pattern/).

    Code:
    clear
    input str64 x
    "math:96;chinese:85; english:92; physical:90;"
    "math:91;chinese:82; english:88; physical:98;"
    "math:86;chinese:85; english:81; physical:90;"
    "math:93;chinese:85; english:88; physical:90;"
    "math:70;chinese:85; english:83; physical:91;"
    "math:80;chinese:85; english:81; physical:92;"
    end
    
    gen grade1 = ustrregexs(1) if ustrregexm(x, "(?<=\:)([0-9]{2})")
    list
         +-----------------------------------------------------------+
         |                                                x   grade1 |
         |-----------------------------------------------------------|
      1. | math:96;chinese:85; english:92; physical:90;       96 |
      2. | math:91;chinese:82; english:88; physical:98;       91 |
      3. | math:86;chinese:85; english:81; physical:90;       86 |
      4. | math:93;chinese:85; english:88; physical:90;       93 |
      5. | math:70;chinese:85; english:83; physical:91;       70 |
         |-----------------------------------------------------------|
      6. | math:80;chinese:85; english:81; physical:92;       80 |
         +-----------------------------------------------------------+
    However, why is only the first number extracted?

    Thanks very much!

    Bests,
    wanhai

  • #2
    Because your regular expression has only one pair of capturing parentheses in it.

    Perhaps the following example will point you in a useful direction. I define a local macro to contain the regular expression to make the code more readable; the macro is not necessary. I note that what appears to be a colon in all the data above is actually the Unicode "fullwidth colon" character U+FF1A. If you try to copy-and-paste it you'll see that the apparent space following the colon is actually part of the Unicode character. This explains why the column headings do not properly align with the data below them when displayed on Statalist in a CODE block.
    Code:
    local regex `"(?<=\:)([0-9]{2})[^UFF1A]*(?<=\:)([0-9]{2})[^UFF1A]*(?<=\:)([0-9]{2})[^UFF1A]*(?<=\:)([0-9]{2})"'
    gen grade1 = ustrregexs(1) if ustrregexm(x, `"`regex'"')
    gen grade2 = ustrregexs(2) if ustrregexm(x, `"`regex'"')
    gen grade3 = ustrregexs(3) if ustrregexm(x, `"`regex'"')
    gen grade4 = ustrregexs(4) if ustrregexm(x, `"`regex'"')
    list, clean noobs
    Code:
    . list, clean noobs
    
                                                       x   grade1   grade2   grade3   grade4  
        math:96;chinese:85; english:92; physical:90;       96       85       92       90  
        math:91;chinese:82; english:88; physical:98;       91       82       88       98  
        math:86;chinese:85; english:81; physical:90;       86       85       81       90  
        math:93;chinese:85; english:88; physical:90;       93       85       88       90  
        math:70;chinese:85; english:83; physical:91;       70       85       83       91  
        math:80;chinese:85; english:81; physical:92;       80       85       81       92
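    A side note on the character class, offered as a hedged sketch rather than a correction of the result: in ICU syntax, [^UFF1A] is a negated set of the four literal characters U, F, 1 and A, not the code point U+FF1A. The pattern above still matches these data because none of those characters occurs between two grades. The checks below use uchar(65306) to build the fullwidth colon (0xFF1A = 65306). The \uFF1A escape in the last line is ICU notation; that it reaches the engine unchanged from a Stata string is an assumption worth verifying on your own installation.

    ```stata
    * Sketch: what [^UFF1A] does and does not exclude
    * displays 0: "A" is one of the excluded literal characters
    display ustrregexm("A", "[^UFF1A]")
    * displays 1: the fullwidth colon is not excluded by this class
    display ustrregexm(uchar(65306), "[^UFF1A]")
    * assumed alternative: ICU's \uhhhh escape names the code point itself,
    * so this class should reject the fullwidth colon if the escape is honored
    display ustrregexm(uchar(65306), "[^\uFF1A]")
    ```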
    And the following example demonstrates a different approach utilizing Stata's tools for splitting text strings, and converting the grades from strings to numbers in the process.
    Code:
    split x, parse(; :) destring
    rename (x2 x4 x6 x8) (grade#), addnumber
    drop x?
    list, clean noobs
    Code:
    . split x, parse(; :) destring
    variables born as string:
    x1  x2  x3  x4  x5  x6  x7  x8
    x1: contains nonnumeric characters; no replace
    x2: all characters numeric; replaced as byte
    x3: contains nonnumeric characters; no replace
    x4: all characters numeric; replaced as byte
    x5: contains nonnumeric characters; no replace
    x6: all characters numeric; replaced as byte
    x7: contains nonnumeric characters; no replace
    x8: all characters numeric; replaced as byte
    
    . rename (x2 x4 x6 x8) (grade#), addnumber
    
    . drop x?
    
    . list, clean noobs
    
                                                       x   grade1   grade2   grade3   grade4  
        math:96;chinese:85; english:92; physical:90;       96       85       92       90  
        math:91;chinese:82; english:88; physical:98;       91       82       88       98  
        math:86;chinese:85; english:81; physical:90;       86       85       81       90  
        math:93;chinese:85; english:88; physical:90;       93       85       88       90  
        math:70;chinese:85; english:83; physical:91;       70       85       83       91  
        math:80;chinese:85; english:81; physical:92;       80       85       81       92
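
    When the number of grades per observation is not fixed, a four-group pattern is awkward. A minimal sketch of another route, assuming at most four numbers per string: take the first run of digits with ustrregexs(0), delete that run with ustrregexrf(), and repeat. The bound of 4 and the names rest and num# are illustrative and not from the thread.

    ```stata
    * Sketch: extract successive numbers by consuming one match per pass
    * Assumes the string variable x is still in memory
    generate rest = x
    forvalues i = 1/4 {
        * ustrregexs(0) is the whole match, here the first run of digits left in rest
        generate num`i' = real(ustrregexs(0)) if ustrregexm(rest, "[0-9]+")
        * delete that first run so the next pass finds the following number
        replace rest = ustrregexrf(rest, "[0-9]+", "")
    }
    drop rest
    list num*, clean noobs
    ```

    Observations with fewer than four numbers simply get missing values in the later num# variables.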
    Last edited by William Lisowski; 17 Mar 2019, 08:34.

    • #3
      As long as I'm at it, this example might be more useful in some situations, especially if different observations have different sets of subjects or the subjects can appear in a different order.
      Code:
      clear
      input str64 x
      "math:96;chinese:85; english:92; physical:90;"
      "math:91;chinese:82; english:88; physical:98;"
      "math:86;chinese:85; english:81; physical:90;"
      "math:93;chinese:85; english:88; physical:90;"
      "math:70;chinese:85; english:83; physical:91;"
      "math:80;chinese:85; english:81; physical:92;"
      "math:80;chinese:85; french:81; physical:92;"
      end
      
      generate id = _n
      split x, parse(; :) destring
      drop x
      ds x*, has(type numeric)
      rename (`r(varlist)') (grade#), addnumber
      rename (x*) (subject#), addnumber
      list, clean noobs
      
      reshape long grade subject, i(id) j(j) 
      replace subject = trim(subject)
      drop j
      reshape wide grade, i(id) j(subject) string
      rename (grade*) (*)
      list, clean noobs
      Code:
      . list, clean noobs
      
          id   subject1   grade1   subject2   grade2   subject3   grade3    subject4   grade4  
           1       math       96    chinese       85    english       92    physical       90  
           2       math       91    chinese       82    english       88    physical       98  
           3       math       86    chinese       85    english       81    physical       90  
           4       math       93    chinese       85    english       88    physical       90  
           5       math       70    chinese       85    english       83    physical       91  
           6       math       80    chinese       85    english       81    physical       92  
           7       math       80    chinese       85     french       81    physical       92
      Code:
      . list, clean noobs
      
          id   chinese   english   french   math   physical  
           1        85        92        .     96         90  
           2        82        88        .     91         98  
           3        85        81        .     86         90  
           4        85        88        .     93         90  
           5        85        83        .     70         91  
           6        85        81        .     80         92  
           7        85         .       81     80         92
      Last edited by William Lisowski; 17 Mar 2019, 09:28.

      • #4
        An alternative simpler regex:
        Code:
        local re = "(\d\d)\D+" * 4  
        
        forvalues i = 1/4 {
        
            gen byte g`i' = real(ustrregexs(`i')) if ustrregexm(x,"`re'")
        }

        • #5
          Presumably grades can vary from 0 to 100.
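
          A small sketch of why this matters, using made-up rows with ASCII colons, one three-digit grade and one one-digit grade (both rows are hypothetical, not from the thread). With the two-digit pattern from #4, the engine cannot take "100" as a whole, so the leftmost match begins at its last two digits; with a one-digit grade there is no match at all.

          ```stata
          clear
          input str64 x
          "math:100;chinese:85;english:92;physical:90;"
          "math:7;chinese:85;english:92;physical:90;"
          end

          local re2  = "(\d\d)\D+" * 4
          local re13 = "(\d{1,3})\D+" * 4

          * two-digit pattern: row 1 captures "00" (so g1 is 0), row 2 fails to match (g1 missing)
          generate byte g1 = real(ustrregexs(1)) if ustrregexm(x, "`re2'")
          * one-to-three-digit pattern: captures 100 and 7
          generate byte h1 = real(ustrregexs(1)) if ustrregexm(x, "`re13'")
          list g1 h1, clean noobs
          ```

          The first case is the more dangerous one, because it produces a plausible-looking but wrong value instead of a missing value.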

          • #6
            You could also use moss (from SSC) to target subject and grades:

            Code:
            . moss x, match("([0-9]+|[a-z]+)") regex
            
            . list _match*
            
                 +--------------------------------------------------------------------------------+
                 | _match1   _match2   _match3   _match4   _match5   _match6    _match7   _match8 |
                 |--------------------------------------------------------------------------------|
              1. |    math        96   chinese        85   english        92   physical        90 |
              2. |    math        91   chinese        82   english        88   physical        98 |
              3. |    math        86   chinese        85   english        81   physical        90 |
              4. |    math        93   chinese        85   english        88   physical        90 |
              5. |    math        70   chinese        85   english        83   physical        91 |
                 |--------------------------------------------------------------------------------|
              6. |    math        80   chinese        85   english        81   physical        92 |
                 +--------------------------------------------------------------------------------+

            • #7
              In reply to William Lisowski (#2):
              Thanks for your excellent answer! I see now. I find that different software has different rules for this process.
              In R, this step is needed only once. Thanks again.

              Bests,
              wanhai

              • #8
                In reply to Bjarte Aagnes (#4):
                Thanks for your great help! This is the first time I have seen code like this (* 4). It looks so nice!
                So the following code might be right:
                Code:
                clear
                input str64 x
                "math:96;chinese:85; english:92; physical:90;"
                "math:91;chinese:82; english:88; physical:98;"
                "math:86;chinese:85; english:81; physical:90;"
                "math:93;chinese:85; english:88; physical:90;"
                "math:70;chinese:85; english:83; physical:91;"
                "math:80;chinese:85; english:81; physical:92;"
                end
                
                local regex="(?<=\:)([0-9]{2})[^UFF1A]*" * 4
                gen grade1 = ustrregexs(1) if ustrregexm(x, "`regex'")
                gen grade2 = ustrregexs(2) if ustrregexm(x, "`regex'")
                gen grade3 = ustrregexs(3) if ustrregexm(x, "`regex'")
                gen grade4 = ustrregexs(4) if ustrregexm(x, "`regex'")
                list, clean noobs

                Bests,
                wanhai

                • #9
                  Originally posted by Nick Cox
                  Presumably grades can vary from 0 to 100.
                  Wow, thanks for the reminder, Nick! That should be the case.

                  Bests,
                  wanhai

                  • #10
                    In reply to Robert Picard (#6):
                    Thanks very much for the concise code. 'moss' is powerful. Thanks for your contribution, @Nick @Picard!

                    Bests,
                    wanhai

                    • #11
                      In reply to William Lisowski (#2):
                      Hi, dear William,
                      I have now entered the colons as ordinary ASCII (English) colons. Why doesn't the following code work?
                      Code:
                      clear
                      input str64 x
                      "math:96;chinese:85;english:92;physical:90;"
                      "math:91;chinese:82;english:88;physical:98;"
                      "math:86;chinese:85;english:81;physical:90;"
                      "math:93;chinese:85;english:88;physical:90;"
                      "math:70;chinese:85;english:83;physical:91;"
                      "math:80;chinese:85;english:81;physical:92;"
                      end
                      
                      local regex `"(?<=\:)([0-9]{2})*(?<=\:)([0-9]{2})*(?<=\:)([0-9]{2})*(?<=\:)([0-9]{2})"'
                      gen grade1 = ustrregexs(1) if ustrregexm(x, `"`regex'"')
                      gen grade2 = ustrregexs(2) if ustrregexm(x, `"`regex'"')
                      gen grade3 = ustrregexs(3) if ustrregexm(x, `"`regex'"')
                      gen grade4 = ustrregexs(4) if ustrregexm(x, `"`regex'"')
                      list, clean noobs
                      That is to say, when does it make sense to drop [^UFF1A]? Could you give me an example, please?

                      Thanks again!

                      Bests,
                      wanhai

                      • #12
                        The following regular expression - which adds a period (match any single character) before the asterisk (match what comes before as often as possible) - works as you expect.
                        Code:
                        local regex `"(?<=\:)([0-9]{2}).*(?<=\:)([0-9]{2}).*(?<=\:)([0-9]{2}).*(?<=\:)([0-9]{2})"'
                        In case it helps, Stata's unicode regular expression parser is the ICU regular expression engine documented at http://userguide.icu-project.org/strings/regexp which is my go-to source for regex syntax all on a single page.
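
                        To see what the pattern in #11 was doing, here is a hedged sketch on one row of the ASCII-colon data. Without the period, each asterisk applies to the capture group before it, so groups 1 to 3 may repeat zero times; the engine can then satisfy all four lookbehinds at the same position, and only the last group captures a number. The names bad# and ok# are illustrative.

                        ```stata
                        clear
                        input str64 x
                        "math:96;chinese:85;english:92;physical:90;"
                        end

                        local bad `"(?<=\:)([0-9]{2})*(?<=\:)([0-9]{2})*(?<=\:)([0-9]{2})*(?<=\:)([0-9]{2})"'
                        local ok  `"(?<=\:)([0-9]{2}).*(?<=\:)([0-9]{2}).*(?<=\:)([0-9]{2}).*(?<=\:)([0-9]{2})"'

                        foreach i in 1 4 {
                            * "*" directly after a group: zero or more repeats of that group
                            generate bad`i' = ustrregexs(`i') if ustrregexm(x, `"`bad'"')
                            * ".*": zero or more of any character, which can span the text between grades
                            generate ok`i'  = ustrregexs(`i') if ustrregexm(x, `"`ok'"')
                        }
                        * under these semantics bad4 holds "96" while bad1 captured nothing;
                        * ok1 holds "96" and ok4 holds "90"
                        list bad1 bad4 ok1 ok4, clean noobs
                        ```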

                        • #13

                          wanhai, using lookarounds has a cost. Expressions can be compared using the regex debugger at https://regex101.com/
                          Code:
                          201 steps "(?<=\:)([0-9]{2}).*(?<=\:)([0-9]{2}).*(?<=\:)([0-9]{2}).*(?<=\:)([0-9]{2})"
                           29 steps "(\d{1,3})\D+(\d{1,3})\D+(\d{1,3})\D+(\d{1,3})\D+"
                           27 steps "(\d+)\D+(\d+)\D+(\d+)\D+(\d+)\D+" 
                          
                          Test string : "math:96;chinese:85;english:92;physical:90;"
                          For your example, lookarounds are not necessary. Simpler and faster alternatives are:
                          Code:
                          local re = "(\d+)\D+" * 4 
                          
                          forvalues i = 1/4 {
                          
                              gen byte g`i' = real(ustrregexs(`i')) if ustrregexm(x,"`re'")
                          }
                          or, if you want to restrict the number of digits (and not make a local macro):
                          Code:
                          forvalues i = 1/4 {
                          
                              gen byte g`i' = real(ustrregexs(`i')) if ustrregexm(x,"(\d{1,3})\D+" * 4)
                          }
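
                          The step counts above come from regex101, whose engines are not ICU, so they are indicative only. A rough way to compare the two patterns inside Stata itself is to time them on replicated data. This is a sketch: the replication count of 200000 and the variable names t_look and t_plain are arbitrary choices, and timings will vary by machine.

                          ```stata
                          clear
                          input str64 x
                          "math:96;chinese:85;english:92;physical:90;"
                          end
                          * replicate the single row so that any timing difference becomes measurable
                          expand 200000

                          local look `"(?<=\:)([0-9]{2}).*(?<=\:)([0-9]{2}).*(?<=\:)([0-9]{2}).*(?<=\:)([0-9]{2})"'
                          local plain = "(\d+)\D+" * 4

                          timer clear
                          * timer 1: lookbehind version
                          timer on 1
                          generate byte t_look = real(ustrregexs(1)) if ustrregexm(x, `"`look'"')
                          timer off 1
                          * timer 2: plain digit/non-digit version
                          timer on 2
                          generate byte t_plain = real(ustrregexs(1)) if ustrregexm(x, "`plain'")
                          timer off 2
                          timer list
                          ```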

                          • #14
                            In reply to William Lisowski (#12):
                            Thanks for your excellent answer! Yes, the revised code works. Also, thank you for the information.

                            Bests,
                            wanhai


                            • #15
                              In reply to Bjarte Aagnes (#13):
                              I've recently been learning about lookarounds. However, I didn't know that they are inefficient.
                              Thanks for the warning.

                              Bests,
                              wanhai
