
  • regexr - Regular Expression in Stata

    Hi everyone! I am working on a regular expression problem. My data look something like this.

    input str20 name
    "ABC- A - D"
    "A -ABD - D"
    "A-A - D"
    "A-b - D"
    "b-A - D"
    end


    I would like to clean the data by removing any hyphen that sits between two letters, or between a letter and a space. In other words, I want to turn my data into

    input str20 name
    "ABC A - D"
    "A ABD - D"
    "A A - D"
    "A b - D"
    "b A - D"
    end

    That is, replace the hyphen with a space while preserving the surrounding characters.

    I have tried something like
    gen help1 = regexs(1) if regexm(name, "[a-zA-Z]-[a-zA-Z]")
    gen help2 = regexr(name, "-", "")
    But it does not work. Please help me. Thanks in advance!


  • #2
    This works for your data example, but if you have several hyphens to replace within a single observation, you'll have to run it multiple times until no more changes are made.
    Code:
    clear
    input str20 name
    "ABC- A - D"
    "A -ABD - D"
    "A-A - D"
    "A-b - D"
    "b-A - D"
    end
    
    clonevar wanted = name
    tempvar hyph nohyph
    gen `hyph' = regexs(0) if regexm(name, "[a-zA-Z]- | -[a-zA-Z]|[a-zA-Z]-[a-zA-Z]")
    gen `nohyph' = subinstr(`hyph', "-", " ", 1)
    replace `nohyph' = subinstr(`nohyph', "  ", " ", 1)
    replace wanted = subinstr(wanted, `hyph', `nohyph', 1)
    list
    
    
         +----------------------------------------------+
         |       name      wanted   __000000   __000001 |
         |----------------------------------------------|
      1. | ABC- A - D   ABC A - D        C-         C   |
      2. | A -ABD - D   A ABD - D         -A          A |
      3. |    A-A - D     A A - D        A-A        A A |
      4. |    A-b - D     A b - D        A-b        A b |
      5. |    b-A - D     b A - D        b-A        b A |
         +----------------------------------------------+
    Last edited by Wouter Wakker; 09 Nov 2020, 01:32.
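
    A minimal sketch of the "run it until nothing changes" idea from #2, assuming the data example from #1 is already in memory. The helper variables hyph and nohyph and the local macro more are names chosen here for illustration only. The pattern is matched against wanted rather than name, so that each pass sees the result of the previous one.
    Code:
    clonevar wanted = name
    gen str3 hyph = ""
    gen str3 nohyph = ""
    local more = 1
    while `more' {
        * find the first hyphen (with its neighbours) still to be replaced
        quietly replace hyph = ""
        quietly replace hyph = regexs(0) if regexm(wanted, "[a-zA-Z]- | -[a-zA-Z]|[a-zA-Z]-[a-zA-Z]")
        * stop once no observation has a match left
        quietly count if hyph != ""
        local more = r(N) > 0
        * turn the hyphen into a space and collapse a resulting double space
        quietly replace nohyph = subinstr(subinstr(hyph, "-", " ", 1), "  ", " ", 1)
        quietly replace wanted = subinstr(wanted, hyph, nohyph, 1) if hyph != ""
    }
    drop hyph nohyph
    list name wanted
    Each pass removes one hyphen from every observation that still has a match, so the loop ends after as many passes as the largest number of such hyphens in any single observation, plus one final pass that finds nothing.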



    • #3
      Code:
      clear
      input str20 name
      "ABC- A - D"
      "A -ABD - D"
      "A-A - D"
      "A-b - D"
      "b-A - D"
      end
      
      gen wanted= ustrregexra(ustrregexra(name, "(?<![\s])(-)(?<![\s])", " "), "((-)(?=[\w]))", " ")
      Result:

      Code:
      . l
      
           +-------------------------+
           |       name       wanted |
           |-------------------------|
        1. | ABC- A - D   ABC  A - D |
        2. | A -ABD - D   A  ABD - D |
        3. |    A-A - D      A A - D |
        4. |    A-b - D      A b - D |
        5. |    b-A - D      b A - D |
           +-------------------------+

      See the ICU Regular Expression User Guide: http://userguide.icu-project.org/strings/regexp
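
      Note that in the listing above, rows 1 and 2 end up with two spaces where the hyphen was next to a space, whereas #1 asked for a single space. One possible variation, offered as a sketch rather than a definitive fix, is to let the pattern consume the adjacent space together with the hyphen. The variable name wanted2 is hypothetical, and the pattern assumes that only unaccented ASCII letters count as "characters".
      Code:
      * a hyphen preceded by a letter (plus one optional following space), or
      * a hyphen followed by a letter (plus one optional preceding space)
      gen wanted2 = ustrregexra(name, "(?<=[a-zA-Z])- ?| ?-(?=[a-zA-Z])", " ")
      list name wanted2
      Because ustrregexra() replaces every match in one call, no loop is needed. A hyphen with a space on both sides, as in " - D", matches neither alternative and is left alone, so for the five example rows this should give the values requested in #1.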



      • #4
        Thank you very much for all your help!
