
  • Replacing string using regexm/regexs

    Hi,

    My data consists of a list of viral mutations, separated by a comma. Here is some dummy data:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str33 nrti
    "K65R,Y115F,M184V"           
    "D67N,K70R,M184MV,K219E"      
    "D67N,K70E,M184V"            
    "D67DN,K70R,M184V,T215I,K219E"
    "D67DN,K70E,M184V,K219KR"    
    "K70Q,M184V"                 
    "M184V"                      
    end
    Each mutation (separated by a comma, no space) should consist of one capital letter, 1-3 digits, and one more capital letter (e.g., K65R). However, sometimes there are two letters at the end (e.g., K65KR). I want to remove the first of those two trailing letters (e.g., K65KR -> K65R).

    I am trying to achieve this using the regexm/regexs string functions. I can identify the problem entries by using regexr to replace them with different text (repeating the code to catch cells that contain more than one problem mutation).

    Code:
    gen dup = nrti
    replace dup = regexr(dup, "[A-Z][0-9]+[A-Z][A-Z]","issue")
    But this isn't exactly what I want to do. I am trying various iterations using regexs but can't quite seem to get there. Does anyone have any advice on how I could achieve this?

    I really appreciate any help on this.

    Bryony

  • #2
    Here is one way

    Code:
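    * Split on commas, then rebuild each mutation as
    * leading letter + digits (trailing letters stripped) + last letter
    generate nrti_clean = ""
    split nrti , parse(",")
    foreach var in `r(varlist)' {
        replace nrti_clean = nrti_clean                    ///
                           + regexr(`var', "[A-Z]+$", "")  ///
                           + substr(`var', -1, .)          ///
                           + cond(mi(`var'), "", " ")
    }
    * Turn the separating spaces back into commas
    replace nrti_clean = subinstr(strtrim(nrti_clean), " ", ",", .)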
    Best
    Daniel



    • #3
      You can think of this as eliminating any letter that is between a number and a letter. This suggests a simple one-line solution.

      Code:
      gen wanted = ustrregexra(nrti, "(?<=[0-9])([A-Z])(?=[A-Z])", "")
      Result:

      Code:
      . l, sep(10)
      
           +------------------------------------------------------------+
           |                         nrti                        wanted |
           |------------------------------------------------------------|
        1. |             K65R,Y115F,M184V              K65R,Y115F,M184V |
        2. |       D67N,K70R,M184MV,K219E         D67N,K70R,M184V,K219E |
        3. |              D67N,K70E,M184V               D67N,K70E,M184V |
        4. | D67DN,K70R,M184V,T215I,K219E   D67N,K70R,M184V,T215I,K219E |
        5. |      D67DN,K70E,M184V,K219KR         D67N,K70E,M184V,K219R |
        6. |                   K70Q,M184V                    K70Q,M184V |
        7. |                        M184V                         M184V |
           +------------------------------------------------------------+
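
      As a sanity check on the result above, one could confirm that every entry in wanted now matches the stated format (one capital letter, 1-3 digits, one capital letter, separated by commas). This is only a sketch: the variable name ok is made up for illustration, and it assumes the wanted variable created above and a Stata version that provides the Unicode regex functions.

      ```stata
      * ok = 1 if the whole string is a comma-separated list of
      * well-formed mutations (letter, 1-3 digits, letter)
      gen byte ok = ustrregexm(wanted, "^[A-Z][0-9]{1,3}[A-Z](,[A-Z][0-9]{1,3}[A-Z])*$")

      * show any rows that still deviate from the expected format
      list nrti wanted if !ok
      ```

      Any rows listed here would point to entries the one-line replacement does not cover, for example three or more trailing letters.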



      • #4
        fantastic, thank you both so much!
