Generating regular expression that would replace a specific character located in a specific place

Konrad Zdeb

Join Date: Apr 2014

Posts: 496
#1

29 Jul 2015, 08:04

Hi,

I have a data set corresponding to the extract below.

Code:

clear
input str9 var1
"A11OAB"
"B22OBD"
"CC21OAA "
end

I'm looking for a regular expression that would replace O with 0 in the observations that meet all of the following criteria:
Are 6 characters long
Start with a letter
End with two letters
Counting from the left, O is the fourth character

Kind regards,
Konrad
Version: Stata/IC 13.1
Tags: regex, regular expression, string
daniel klein

Join Date: Mar 2014

Posts: 3845
#2

29 Jul 2015, 08:29

Code:

generate byte flag = ///
    regexm(var1, "^[a-zA-Z]..[O][a-zA-Z][a-zA-Z]$")
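For anyone reading along, here is a minimal sketch of how the pattern lines up with the four criteria from post #1. It rebuilds the sample data so it can be run on its own from a do-file. The comments are my reading of the pattern, not official documentation.

```stata
* Sketch only: rebuild the sample data from post #1.
clear
input str9 var1
"A11OAB"
"B22OBD"
"CC21OAA "
end

* The anchors at both ends tie the match to the whole string,
* so only 6-character values can pass:
*   [a-zA-Z]          first character is a letter
*   ..                any two characters
*   [O]               capital O in fourth position
*   [a-zA-Z][a-zA-Z]  last two characters are letters
generate byte flag = regexm(var1, "^[a-zA-Z]..[O][a-zA-Z][a-zA-Z]$")

* The first two observations should be flagged; the third should not,
* because its fourth character is not O.
assert flag == 1 in 1/2
assert flag == 0 in 3
list var1 flag
```

Note that the flag only identifies the observations. The replacement itself still has to be done in a second step.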

Best
Daniel
Konrad Zdeb

Join Date: Apr 2014

Posts: 496
#3

29 Jul 2015, 08:31

Thank you very much.

Kind regards,
Konrad
Version: Stata/IC 13.1
Christophe Kolodziejczyk

Join Date: Mar 2014

Posts: 377
#4

31 Jul 2015, 08:22

How do you replace the O with 0? Like this?

Code:

gen newvar1 = substr(var1,1,3)+"0"+substr(var1,5,2) if flag

In SAS you could write

Code:

substr(var1,4,1) = "0"

Is there a similar solution in Stata? I know that I could use subinstr(), but it could replace other O's coming before and after the fourth position.
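To illustrate the concern about subinstr(), here is a small sketch using a made-up value, "AO1OAB", which is not in the posted data. It assumes the lines are run from a do-file so that the trailing // comments are accepted.

```stata
* subinstr() counts occurrences, not positions.
display subinstr("AO1OAB", "O", "0", 1)    // first O replaced: A01OAB
display subinstr("AO1OAB", "O", "0", .)    // every O replaced: A010AB

* Splicing with substr() changes the fourth character only.
display substr("AO1OAB", 1, 3) + "0" + substr("AO1OAB", 5, .)    // AO10AB
```

Neither subinstr() call touches only the fourth character when an earlier O is present, which is why the substr() splice (or a regex with capture groups) looks like the safer route.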

Moreover, it does not seem possible to request repeated patterns with Stata's regex functions. I would like to write something like

Code:

generate byte flag = ///
    regexm(var1, "^[a-zA-Z]..[O][a-zA-Z]{2}$")

It should be possible if Stata follows the POSIX.2 standard, shouldn't it?
daniel klein

Join Date: Mar 2014

Posts: 3845
#5

31 Jul 2015, 08:54

Your first syntax is what I had in mind. I am not aware of an official Stata command or function that matches your SAS code.

Regarding the last point, the apparent lack of proper documentation of Stata's regex machinery has been discussed on the list repeatedly. It seems there is simply no complete answer. With Stata 14's Unicode features, what you want seems to be possible; at least there is an example in the respective help file.

Best
Daniel

Robert Picard

Join Date: Mar 2014
Posts: 1536

31 Jul 2015, 09:17

I don't like the new names of the Unicode regex functions. Here's an example that shows a repeating pattern:

Code:

clear
input str9 var1
"A11OAB"
"B22OBD"
"CC21OAA "
end

clonevar vcopy = var1
replace var1 = ustrregexs(1) + "0" + ustrregexs(2) if ///
    ustrregexm(var1,"^([a-zA-Z]..)O([a-zA-Z]{2})$")
    
list if vcopy != var1
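
For readers on Stata 13, like the original poster, the ustrregex functions are not available. A possible sketch of the same capture-group idea with the older regexm() and regexs() follows. It assumes, as discussed earlier in the thread, that the {2} repetition cannot be used there, so the two trailing letters are written out.

```stata
clear
input str9 var1
"A11OAB"
"B22OBD"
"CC21OAA "
end

clonevar vcopy = var1
* Group 1 holds the first three characters, group 2 the last two letters;
* the O between them is dropped and a zero is put in its place.
replace var1 = regexs(1) + "0" + regexs(2) if ///
    regexm(var1, "^([a-zA-Z]..)O([a-zA-Z][a-zA-Z])$")

list if vcopy != var1
```

With the sample data this should change the first two observations to A110AB and B220BD and leave the third untouched.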
