regex strings of question marks

Bryant Be

Join Date: Nov 2015

Posts: 14
#1

07 Nov 2019, 12:51

I have a long string variable which contains sequences of question marks of variable lengths, like this:

Code:

some words???? words words?another word no????spaces more question marks?? yes? why not????????

Without knowing the maximum length of the longest sequence of question marks in any row, is it possible to strip out all such long sequences using regexr?

I know that I can do

Code:

subinstr( text_variable, "?", "", .)

to remove all of the question marks, but that's going to generate a typo: "words?another" above will become "wordsanother". I want to avoid that.

If possible, it would suffice for my problem to replace all sequences of ?'s with a single question mark:

Code:

replace text_variable = regexr(text_variable, <<identify sequences of \? 2 or longer >>, "?")

then use regex or subinstr to replace the remaining "?"'s with spaces depending on where they appear relative to a space (to avoid generating a typo above).

I have tried

Code:

replace text_variable = regexr(text_variable, "\?(\?)+\?", "?")

which I thought meant "match one or more of the characters \? in between \? and \?" -> getting everything like "???" or longer and

Code:

replace text_variable = regexr(text_variable, "\?([\?])+\?", "?")

which I thought meant "match one or more of the single allowable character \? between \? and \?"

but both of these are defeated by "word???another word" -> the case where the sequence "???" appears between two words without spaces.

I would love to know what I am misunderstanding about the syntax. Thanks!
William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

07 Nov 2019, 13:25

Assuming you are using a relatively recent version of Stata, the so-called Unicode regular expression functions include a much more powerful regular expression engine. (Since Unicode contains the 128 characters of ASCII as a proper subset, they work with ASCII strings as well.) Is what I show below the sort of result you have in mind: removing every sequence of two or more question marks, but leaving single question marks alone?

Code:

. clear

. input str30 wha

                                wha
  1. "some words????"
  2. "more??? words??"
  3. "words words?another word"
  4. "no????spaces"
  5. "more question marks??"
  6. "yes? why not????????"
  7. end

. 
. generate huh = ustrregexra(wha,"\?{2,}", "")

. list, clean noobs

                         wha                        huh  
              some words????                 some words  
             more??? words??                 more words  
    words words?another word   words words?another word  
                no????spaces                   nospaces  
       more question marks??        more question marks  
        yes? why not????????               yes? why not  

.

To the best of my knowledge, only in the Statalist post linked here is it documented that Stata's new regular expression parser is the ICU regular expression engine documented at http://userguide.icu-project.org/strings/regexp.
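
If collapsing each run to a single question mark is preferable to deleting it, which the original question says would suffice, the same pattern can probably be reused with "?" as the replacement text. A minimal sketch, assuming the variable wha from the example above; the new variable name huh2 is hypothetical:

```stata
* Collapse every run of two or more question marks into a single "?".
* Single question marks are left untouched because the pattern needs at least two.
generate huh2 = ustrregexra(wha, "\?{2,}", "?")
list wha huh2, clean noobs
```

Any remaining single question marks can then be handled separately, for example with subinstr(), depending on whether they sit next to a space.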

William Lisowski

Join Date: Dec 2014

Posts: 10150
#3

07 Nov 2019, 14:45

For those interested in regular expressions, it seems to me the main problem with the standard regexr() function in this context is that it will do at most one such replacement per observation.

Code:

. clear

. input str30 wha

                                wha
  1. "some words????"
  2. "more??? words??"
  3. "words words?another word" 
  4. "no????spaces"
  5. "more question marks??"
  6. "yes? why not????????"
  7. end

. generate huh = regexr(wha,"\?\?+", "")

. list, clean noobs

                         wha                        huh  
              some words????                 some words  
             more??? words??               more words??  
    words words?another word   words words?another word  
                no????spaces                   nospaces  
       more question marks??        more question marks  
        yes? why not????????               yes? why not  

. replace huh = regexr(huh,"\?\?+", "")
(1 real change made)

. list, clean noobs

                         wha                        huh  
              some words????                 some words  
             more??? words??                 more words  
    words words?another word   words words?another word  
                no????spaces                   nospaces  
       more question marks??        more question marks  
        yes? why not????????               yes? why not  

.
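
Rather than repeating the replace by hand, the pass can be wrapped in a loop that stops once no observation still contains a run of two or more question marks. This is only a sketch, assuming the variables wha and huh created in the example above:

```stata
* Count observations that still contain a run of 2+ question marks.
count if regexm(huh, "\?\?+")

* regexr() removes only the first run in each observation per pass,
* so keep passing over the data until nothing is left to remove.
while r(N) > 0 {
    replace huh = regexr(huh, "\?\?+", "")
    count if regexm(huh, "\?\?+")
}
```

The loop needs as many passes as the largest number of runs in any single observation, so it should finish quickly on data like the example.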


Bryant Be

Join Date: Nov 2015

Posts: 14
#4

08 Nov 2019, 07:32

Thanks. That's very useful. I am using Stata 16, but would not have thought to use the Unicode regex functions. Your solution will work well for my case. Thanks!

EDIT: Thanks too for the linked page. That is a much, much more capable set of functions! Maybe I'll put in a request for documentation of this feature in Stata 17.