Regular expression: extract the numbers from a string

wanhaiyou

Join Date: May 2014
Posts: 130

17 Mar 2019, 07:51

Hi all,
I want to extract all the numbers from a string, and I would like to use lookahead and lookbehind zero-length assertions to do it (https://bedigit.com/blog/regex-how-t...cular-pattern/).

Code:

clear
input str64 x
"math：96;chinese：85; english：92; physical：90;"
"math：91;chinese：82; english：88; physical：98;"
"math：86;chinese：85; english：81; physical：90;"
"math：93;chinese：85; english：88; physical：90;"
"math：70;chinese：85; english：83; physical：91;"
"math：80;chinese：85; english：81; physical：92;"
end

gen grade1 = ustrregexs(1) if ustrregexm(x, "(?<=\：)([0-9]{2})")
list
 +-----------------------------------------------------------+
     |                                                x   grade1 |
     |-----------------------------------------------------------|
  1. | math：96;chinese：85; english：92; physical：90;       96 |
  2. | math：91;chinese：82; english：88; physical：98;       91 |
  3. | math：86;chinese：85; english：81; physical：90;       86 |
  4. | math：93;chinese：85; english：88; physical：90;       93 |
  5. | math：70;chinese：85; english：83; physical：91;       70 |
     |-----------------------------------------------------------|
  6. | math：80;chinese：85; english：81; physical：92;       80 |
     +-----------------------------------------------------------+

However, why is only the first number extracted?

Thanks very much!

Bests,
wanhai

William Lisowski

Join Date: Dec 2014
Posts: 10150

17 Mar 2019, 08:22

Because ustrregexm() stops at the first match, and your regular expression has only one pair of capturing parentheses in it, so ustrregexs(1) can only ever return the first number.

Perhaps the following example will point you in a useful direction. I define a local macro to hold the regular expression to make the code more readable; it is not necessary. Note that what appears to be a colon in all the data above is actually the Unicode "fullwidth colon" character U+FF1A. If you try to copy and paste it, you'll see that the apparent space following the colon is actually part of the Unicode character. This explains why the column headings do not align properly with the data below them when displayed on Statalist in a CODE block.

Code:

local regex `"(?<=\：)([0-9]{2})[^UFF1A]*(?<=\：)([0-9]{2})[^UFF1A]*(?<=\：)([0-9]{2})[^UFF1A]*(?<=\：)([0-9]{2})"'
gen grade1 = ustrregexs(1) if ustrregexm(x, `"`regex'"')
gen grade2 = ustrregexs(2) if ustrregexm(x, `"`regex'"')
gen grade3 = ustrregexs(3) if ustrregexm(x, `"`regex'"')
gen grade4 = ustrregexs(4) if ustrregexm(x, `"`regex'"')
list, clean noobs

Code:

. list, clean noobs

                                                   x   grade1   grade2   grade3   grade4  
    math：96;chinese：85; english：92; physical：90;       96       85       92       90  
    math：91;chinese：82; english：88; physical：98;       91       82       88       98  
    math：86;chinese：85; english：81; physical：90;       86       85       81       90  
    math：93;chinese：85; english：88; physical：90;       93       85       88       90  
    math：70;chinese：85; english：83; physical：91;       70       85       83       91  
    math：80;chinese：85; english：81; physical：92;       80       85       81       92
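
A note on the character class in the pattern above, offered as a sketch rather than a change to the results: inside square brackets, UFF1A is read as the four literal characters U, F, 1 and A, so [^UFF1A]* succeeds here only because none of those characters occurs between the grades. The ICU escape \uFF1A names the fullwidth colon itself. The version below uses two rows of the data from the question; the variable names g1 to g4 are made up for illustration, and it assumes ustrregexm() honors \uhhhh escapes as the ICU documentation describes.

```stata
clear
input str64 x
"math：96;chinese：85; english：92; physical：90;"
"math：91;chinese：82; english：88; physical：98;"
end

* The fullwidth colon is one Unicode character but three bytes in UTF-8
display ustrlen("：")
display strlen("：")

* [^\uFF1A]* means "any run of characters other than U+FF1A"
local regex = "(?<=\uFF1A)([0-9]{2})[^\uFF1A]*" * 4
forvalues i = 1/4 {
    generate g`i' = ustrregexs(`i') if ustrregexm(x, "`regex'")
}
list g1-g4, clean noobs
```

With data that use an ordinary colon, the same idea would be written with [^:]* instead.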

And the following example demonstrates a different approach utilizing Stata's tools for splitting text strings, and converting the grades from strings to numbers in the process.

Code:

split x, parse(; ：) destring
rename (x2 x4 x6 x8) (grade#), addnumber
drop x?
list, clean noobs

Code:

. split x, parse(; ：) destring
variables born as string:
x1  x2  x3  x4  x5  x6  x7  x8
x1: contains nonnumeric characters; no replace
x2: all characters numeric; replaced as byte
x3: contains nonnumeric characters; no replace
x4: all characters numeric; replaced as byte
x5: contains nonnumeric characters; no replace
x6: all characters numeric; replaced as byte
x7: contains nonnumeric characters; no replace
x8: all characters numeric; replaced as byte

. rename (x2 x4 x6 x8) (grade#), addnumber

. drop x?

. list, clean noobs

                                                   x   grade1   grade2   grade3   grade4  
    math：96;chinese：85; english：92; physical：90;       96       85       92       90  
    math：91;chinese：82; english：88; physical：98;       91       82       88       98  
    math：86;chinese：85; english：81; physical：90;       86       85       81       90  
    math：93;chinese：85; english：88; physical：90;       93       85       88       90  
    math：70;chinese：85; english：83; physical：91;       70       85       83       91  
    math：80;chinese：85; english：81; physical：92;       80       85       81       92

Last edited by William Lisowski; 17 Mar 2019, 08:34.


William Lisowski

Join Date: Dec 2014
Posts: 10150

17 Mar 2019, 09:20

As long as I'm at it, this example might be more useful in some situations, especially if different observations have different sets of subjects or the subjects can appear in a different order.

Code:

clear
input str64 x
"math：96;chinese：85; english：92; physical：90;"
"math：91;chinese：82; english：88; physical：98;"
"math：86;chinese：85; english：81; physical：90;"
"math：93;chinese：85; english：88; physical：90;"
"math：70;chinese：85; english：83; physical：91;"
"math：80;chinese：85; english：81; physical：92;"
"math：80;chinese：85; french：81; physical：92;"
end

generate id = _n
split x, parse(; ：) destring
drop x
ds x*, has(type numeric)
rename (`r(varlist)') (grade#), addnumber
rename (x*) (subject#), addnumber
list, clean noobs

reshape long grade subject, i(id) j(j) 
replace subject = trim(subject)
drop j
reshape wide grade, i(id) j(subject) string
rename (grade*) (*)
list, clean noobs

Code:

. list, clean noobs

    id   subject1   grade1   subject2   grade2   subject3   grade3    subject4   grade4  
     1       math       96    chinese       85    english       92    physical       90  
     2       math       91    chinese       82    english       88    physical       98  
     3       math       86    chinese       85    english       81    physical       90  
     4       math       93    chinese       85    english       88    physical       90  
     5       math       70    chinese       85    english       83    physical       91  
     6       math       80    chinese       85    english       81    physical       92  
     7       math       80    chinese       85     french       81    physical       92

Code:

. list, clean noobs

    id   chinese   english   french   math   physical  
     1        85        92        .     96         90  
     2        82        88        .     91         98  
     3        85        81        .     86         90  
     4        85        88        .     93         90  
     5        85        83        .     70         91  
     6        85        81        .     80         92  
     7        85         .       81     80         92

Last edited by William Lisowski; 17 Mar 2019, 09:28.


Bjarte Aagnes

Join Date: Apr 2014
Posts: 783

17 Mar 2019, 12:36

An alternative simpler regex:

Code:

local re = "(\d\d)\D+" * 4  

forvalues i = 1/4 {

    gen byte g`i' = real(ustrregexs(`i')) if ustrregexm(x,"`re'")
}
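
For readers who have not met it before: applied to a string and a number, * is Stata's string-duplication operator, so the local macro above holds four copies of the pattern, one capturing group per grade. A minimal illustration, reusing the macro name re from the code above:

```stata
* "s" * n repeats the string s n times
local re = "(\d\d)\D+" * 4

* Shows four copies: (\d\d)\D+(\d\d)\D+(\d\d)\D+(\d\d)\D+
display "`re'"
```

Because each copy ends in \D+, every number, including the last, must be followed by at least one non-digit; the trailing semicolon in the data satisfies that.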


Nick Cox

Join Date: Mar 2014

Posts: 35673
#5

17 Mar 2019, 12:59

Presumably grades can vary from 0 to 100.
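
To make the point concrete, here is a small sketch on made-up data (the grades 100 and 7 and the variable names two and upto3 are hypothetical): a pattern fixed at two digits cuts 100 down to 10 and finds nothing for a single-digit grade, while {1,3} accepts one to three digits.

```stata
clear
input str20 x
"math:100;"
"math:7;"
end

* Exactly two digits: 100 is truncated, 7 is not matched at all
generate two   = ustrregexs(1) if ustrregexm(x, "(?<=:)([0-9]{2})")

* One to three digits: both grades are captured in full
generate upto3 = ustrregexs(1) if ustrregexm(x, "(?<=:)([0-9]{1,3})")

list, clean noobs
```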

Robert Picard

Join Date: Mar 2014
Posts: 1536

17 Mar 2019, 15:25

You could also use moss (from SSC) to target subject and grades:

Code:

. moss x, match("([0-9]+|[a-z]+)") regex

. list _match*

     +--------------------------------------------------------------------------------+
     | _match1   _match2   _match3   _match4   _match5   _match6    _match7   _match8 |
     |--------------------------------------------------------------------------------|
  1. |    math        96   chinese        85   english        92   physical        90 |
  2. |    math        91   chinese        82   english        88   physical        98 |
  3. |    math        86   chinese        85   english        81   physical        90 |
  4. |    math        93   chinese        85   english        88   physical        90 |
  5. |    math        70   chinese        85   english        83   physical        91 |
     |--------------------------------------------------------------------------------|
  6. |    math        80   chinese        85   english        81   physical        92 |
     +--------------------------------------------------------------------------------+
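
moss is a community-contributed command, so it has to be installed once before the call above will run. As a hedged sketch (assuming only the grades are wanted, and using two rows of the data from the question), the pattern can be narrowed to runs of digits and the resulting string variables converted with destring:

```stata
* One-time installation from SSC
ssc install moss

clear
input str64 x
"math：96;chinese：85; english：92; physical：90;"
"math：91;chinese：82; english：88; physical：98;"
end

* Capture only the runs of digits, then convert the matches to numbers
moss x, match("([0-9]+)") regex
destring _match*, replace
list _match*, clean noobs
```

Unlike the fixed four-group patterns, this does not assume a particular number of grades per observation.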


wanhaiyou

Join Date: May 2014
Posts: 130

17 Mar 2019, 18:33

In reply to William Lisowski:

Thanks for your excellent answer! I see now. I find that different software has different rules for this process;
in R, this step is needed only once. Thanks again.

Bests,
wanhai


wanhaiyou

Join Date: May 2014
Posts: 130

17 Mar 2019, 18:40

In reply to Bjarte Aagnes:

Thanks for the great help! This is the first time I have seen code like this (* 4). It looks so nice!
So the following code should also work:

Code:

clear
input str64 x
"math：96;chinese：85; english：92; physical：90;"
"math：91;chinese：82; english：88; physical：98;"
"math：86;chinese：85; english：81; physical：90;"
"math：93;chinese：85; english：88; physical：90;"
"math：70;chinese：85; english：83; physical：91;"
"math：80;chinese：85; english：81; physical：92;"
end

local regex="(?<=\：)([0-9]{2})[^UFF1A]*" * 4
gen grade1 = ustrregexs(1) if ustrregexm(x, "`regex'")
gen grade2 = ustrregexs(2) if ustrregexm(x, "`regex'")
gen grade3 = ustrregexs(3) if ustrregexm(x, "`regex'")
gen grade4 = ustrregexs(4) if ustrregexm(x, "`regex'")
list, clean noobs

Bests,
wanhai


wanhaiyou

Join Date: May 2014

Posts: 130
#9

17 Mar 2019, 18:42

Originally posted by Nick Cox:

Presumably grades can vary from 0 to 100.

Wow, thanks for the reminder, Nick! That should be the case.

Bests,
wanhai

wanhaiyou

Join Date: May 2014
Posts: 130

#10

17 Mar 2019, 18:48

In reply to Robert Picard:

Thanks very much for the concise code. 'moss' is powerful. Thanks for your contribution, @Nick @Picard!

Bests,
wanhai


wanhaiyou

Join Date: May 2014
Posts: 130

#11

17 Mar 2019, 21:11

In reply to William Lisowski:

Hi William,
I have now entered the data with the ordinary (ASCII) colon. Why doesn't the following code work?

Code:

clear
input str64 x
"math:96;chinese:85;english:92;physical:90;"
"math:91;chinese:82;english:88;physical:98;"
"math:86;chinese:85;english:81;physical:90;"
"math:93;chinese:85;english:88;physical:90;"
"math:70;chinese:85;english:83;physical:91;"
"math:80;chinese:85;english:81;physical:92;"
end

local regex `"(?<=\:)([0-9]{2})*(?<=\:)([0-9]{2})*(?<=\:)([0-9]{2})*(?<=\:)([0-9]{2})"'
gen grade1 = ustrregexs(1) if ustrregexm(x, `"`regex'"')
gen grade2 = ustrregexs(2) if ustrregexm(x, `"`regex'"')
gen grade3 = ustrregexs(3) if ustrregexm(x, `"`regex'"')
gen grade4 = ustrregexs(4) if ustrregexm(x, `"`regex'"')
list, clean noobs

That is to say, when does it make sense to move away from [^UFF1A]? Could you give me an example, please?

Thanks again!

Bests,
wanhai


William Lisowski

Join Date: Dec 2014

Posts: 10150
#12

18 Mar 2019, 06:05

The following regular expression - which adds a period (match any single character) before the asterisk (match what comes before as often as possible) - works as you expect.

Code:

local regex `"(?<=\:)([0-9]{2}).*(?<=\:)([0-9]{2}).*(?<=\:)([0-9]{2}).*(?<=\:)([0-9]{2})"'

In case it helps, Stata's unicode regular expression parser is the ICU regular expression engine documented at http://userguide.icu-project.org/strings/regexp which is my go-to source for regex syntax all on a single page.
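
To see why the version without the period goes wrong: ([0-9]{2})* asks for zero or more repetitions of the two-digit group itself, and the lookbehind that follows then demands a colon immediately before the current position. Right after a grade the preceding character is a digit, so each starred group can only match zero times, and the pattern as a whole succeeds with just the last group capturing the first grade. A sketch on one row of the data (the variable names bad1, bad4, ok1 and ok4 are hypothetical; the unmatched group is expected to come back empty, which has not been checked on every Stata version):

```stata
clear
input str64 x
"math:96;chinese:85;english:92;physical:90;"
end

local bad `"(?<=:)([0-9]{2})*(?<=:)([0-9]{2})*(?<=:)([0-9]{2})*(?<=:)([0-9]{2})"'
local ok  `"(?<=:)([0-9]{2}).*(?<=:)([0-9]{2}).*(?<=:)([0-9]{2}).*(?<=:)([0-9]{2})"'

* Without the period: groups 1 to 3 match nothing, group 4 holds the first grade
generate bad1 = ustrregexs(1) if ustrregexm(x, `"`bad'"')
generate bad4 = ustrregexs(4) if ustrregexm(x, `"`bad'"')

* With the period: group 1 is the first grade, group 4 the last
generate ok1  = ustrregexs(1) if ustrregexm(x, `"`ok'"')
generate ok4  = ustrregexs(4) if ustrregexm(x, `"`ok'"')

list bad1 bad4 ok1 ok4, clean noobs
```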

Bjarte Aagnes

Join Date: Apr 2014
Posts: 783

#13

18 Mar 2019, 07:24

wanhai, using lookarounds has a cost. Expressions can be compared using the regex debugger at https://regex101.com/

Code:

201 steps "(?<=\:)([0-9]{2}).*(?<=\:)([0-9]{2}).*(?<=\:)([0-9]{2}).*(?<=\:)([0-9]{2})"
 29 steps "(\d{1,3})\D+(\d{1,3})\D+(\d{1,3})\D+(\d{1,3})\D+"
 27 steps "(\d+)\D+(\d+)\D+(\d+)\D+(\d+)\D+" 

Test string : "math:96;chinese:85;english:92;physical:90;"

For your example, lookarounds are not necessary. Simpler and faster alternatives are:

Code:

local re = "(\d+)\D+" * 4 

forvalues i = 1/4 {

    gen byte g`i' = real(ustrregexs(`i')) if ustrregexm(x,"`re'")
}

or, if you want to restrict the number of digits (and not create a local macro):

Code:

forvalues i = 1/4 {

    gen byte g`i' = real(ustrregexs(`i')) if ustrregexm(x,"(\d{1,3})\D+" * 4)
}


wanhaiyou

Join Date: May 2014

Posts: 130
#14

18 Mar 2019, 09:08

In reply to William Lisowski:

Thanks for your excellent answer! Yes, the revised code works. Also, thank you for the information.

Bests,
wanhai

wanhaiyou

Join Date: May 2014

Posts: 130
#15

18 Mar 2019, 09:15

In reply to Bjarte Aagnes:

I've recently been learning about lookarounds, but I didn't know they were less efficient.
Thanks for the warning.

Bests,
wanhai