  • regex strings of question marks

    I have a long string variable which contains sequences of question marks of variable lengths, like this:

    Code:
    some words???? words words?another word no????spaces more question marks?? yes? why not????????
    Without knowing the maximum length of the longest sequence of question marks in any row, is it possible to strip out all such long sequences using regexr?

    I know that I can do

    Code:
    subinstr( text_variable, "?", "", .)
    to remove all of the question marks, but that's going to generate a typo: "words?another" above will become "wordsanother". I want to avoid that.

    If possible, it would suffice for my problem to replace all sequences of ?'s with a single question mark:
    Code:
    replace text_variable = regexr(text_variable, <<identify sequences of \? 2 or longer >>, "?")
    then use regex or subinstr to replace the remaining "?"'s with spaces depending on where they appear relative to a space (to avoid generating a typo above).

    I have tried

    Code:
    replace text_variable = regexr(text_variable, "\?(\?)+\?", "?")
    which I thought meant "match one or more of the characters \? in between \? and \?" -> getting everything like "???" or longer and

    Code:
    replace text_variable = regexr(text_variable, "\?([\?])+\?", "?")
    which I thought meant "match one or more of the single allowable character \? between \? and \?"

    but both of these are defeated by "word???another word" -> the case where the sequence "???" appears between two words without spaces.

    I would love to know what I am misunderstanding about the syntax. Thanks!

  • #2
    Assuming you are using a relatively recent version of Stata, the so-called Unicode regular expression functions provide a much more powerful regular expression engine. (Since Unicode contains the 128 characters of ASCII as a proper subset, they work with ASCII strings as well.) Is what I show below the sort of result you have in mind: removing every sequence of two or more question marks, but leaving single question marks alone?
    Code:
    . clear
    
    . input str30 wha
    
                                    wha
      1. "some words????"
      2. "more??? words??"
      3. "words words?another word" 
      4. "no????spaces"
      5. "more question marks??"
      6. "yes? why not????????"
      7. end
    
    . 
    . generate huh = ustrregexra(wha,"\?{2,}", "")
    
    . list, clean noobs
    
                             wha                        huh  
                  some words????                 some words  
                 more??? words??                 more words  
        words words?another word   words words?another word  
                    no????spaces                   nospaces  
           more question marks??        more question marks  
            yes? why not????????               yes? why not  
    
    .
    To the best of my knowledge, only in the Statalist post linked here is it documented that Stata's new regular expression parser is the ICU regular expression engine documented at http://userguide.icu-project.org/strings/regexp.
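
    If collapsing each run to a single question mark is preferred over removing it (the alternative raised in the original question), the same pattern should work with "?" as the replacement. This is only a minimal sketch on the same example data; the variable name single is made up for illustration.

    ```stata
    clear
    input str30 wha
    "some words????"
    "more??? words??"
    "words words?another word"
    "no????spaces"
    "more question marks??"
    "yes? why not????????"
    end

    * ustrregexra() replaces every match in the string, so each run of
    * two or more question marks collapses to exactly one "?"
    generate single = ustrregexra(wha, "\?{2,}", "?")

    list, clean noobs
    ```

    With this, "no????spaces" should become "no?spaces" and "words words?another word" should be left unchanged, so the remaining single question marks can then be handled case by case.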


    • #3
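      For those interested in regular expressions, it seems to me the main problem with the standard regexr() function in this context is that it replaces only the first match in each observation. In the log below, observation 2 contains two runs of question marks, so it needs a second call to regexr().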
      Code:
      . clear
      
      . input str30 wha
      
                                      wha
        1. "some words????"
        2. "more??? words??"
        3. "words words?another word" 
        4. "no????spaces"
        5. "more question marks??"
        6. "yes? why not????????"
        7. end
      
      . generate huh = regexr(wha,"\?\?+", "")
      
      . list, clean noobs
      
                               wha                        huh  
                    some words????                 some words  
                   more??? words??               more words??  
          words words?another word   words words?another word  
                      no????spaces                   nospaces  
             more question marks??        more question marks  
              yes? why not????????               yes? why not  
      
      . replace huh = regexr(huh,"\?\?+", "")
      (1 real change made)
      
      . list, clean noobs
      
                               wha                        huh  
                    some words????                 some words  
                   more??? words??                 more words  
          words words?another word   words words?another word  
                      no????spaces                   nospaces  
             more question marks??        more question marks  
              yes? why not????????               yes? why not  
      
      .
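
      When the number of runs per observation is not known in advance, one option is to repeat the regexr() call until nothing matches any more. The following is a rough sketch of that idea on the same example data, not a tested recipe; the local macro name remaining is made up for illustration.

      ```stata
      clear
      input str30 wha
      "some words????"
      "more??? words??"
      "words words?another word"
      "no????spaces"
      "more question marks??"
      "yes? why not????????"
      end

      generate huh = wha

      * count the observations that still contain a run of 2+ question marks
      quietly count if regexm(huh, "\?\?+")
      local remaining = r(N)

      * each pass removes the first remaining run in every matching observation
      while `remaining' > 0 {
          quietly replace huh = regexr(huh, "\?\?+", "") if regexm(huh, "\?\?+")
          quietly count if regexm(huh, "\?\?+")
          local remaining = r(N)
      }

      list, clean noobs
      ```

      Each pass shortens every string that still matches, so the loop stops after as many passes as the largest number of runs in any single observation (two for this data).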


      • #4
        Thanks. That's very useful. I am using Stata 16, but would not have thought to use the Unicode regex functions. Your solution will work well for my case. Thanks!

        EDIT: Thanks too for the linked page. That is a much, much more capable set of functions! Maybe I'll put in a request for documentation of this feature in Stata 17.
