Replacing string using regexm/regexs

Bryony Simmons

24 Mar 2020, 12:51

Hi,

My data consists of a list of viral mutations, separated by a comma. Here is some dummy data:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str33 nrti
"K65R,Y115F,M184V"
"D67N,K70R,M184MV,K219E"
"D67N,K70E,M184V"
"D67DN,K70R,M184V,T215I,K219E"
"D67DN,K70E,M184V,K219KR"
"K70Q,M184V"
"M184V"
end

Each mutation (separated by a comma, no space) should have the format: one capital letter, 1-3 digits, then one capital letter (e.g., K65R). However, sometimes there are two letters at the end (e.g., K65KR). I want to remove the first of those two trailing letters (e.g., K65KR -> K65R).

I am trying to achieve this using the regexm/regexs string functions. I can flag the problem mutations by using regexr to replace them with placeholder text (repeating the command to catch cells that contain more than one problem mutation).

Code:

gen dup = nrti
replace dup = regexr(dup, "[A-Z][0-9]+[A-Z][A-Z]","issue")

But this isn't exactly what I want to do. I am trying various iterations using regexs but can't quite seem to get there. Does anyone have any advice on how I could achieve this?

I would really appreciate any help with this.

Bryony
Tags: regex string extract, regexm, regexs, replace, string

daniel klein

24 Mar 2020, 15:47

Here is one way

Code:

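generate nrti_clean = ""
* one variable per mutation: nrti1, nrti2, ...
split nrti , parse(",")
foreach var in `r(varlist)' {
    * strip the trailing letter(s), which keeps the leading letter and all
    * 1-3 digits, then append the last letter; a space marks the join
    replace nrti_clean = nrti_clean                     ///
                       + regexr(`var', "[A-Z]+$", "")   ///
                       + substr(`var', -1, .)           ///
                       + cond(mi(`var'), "", " ")
}
* turn the space separators back into commas
replace nrti_clean = subinstr(strtrim(nrti_clean), " ", ",", .)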

Best
Daniel

Andrew Musau

24 Mar 2020, 15:59

You can think of this as deleting any letter that sits between a digit and another letter. This suggests a simple one-line solution.

Code:

gen wanted= ustrregexra(nrti, "(?<=[0-9])([A-Z])(?=[A-Z])", "")
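
In the pattern above, (?<=[0-9]) is a lookbehind and (?=[A-Z]) is a lookahead. Neither consumes characters, so only the middle letter is matched and deleted. The Unicode regex functions need Stata 14 or later. As a quick sanity check, and only as a sketch that assumes the variables nrti and wanted from this thread, something like the following could confirm that nothing else changed:

Code:

* no digit should still be followed by two capital letters
assert !ustrregexm(wanted, "[0-9][A-Z][A-Z]")
* the number of commas, and hence of mutations, should be unchanged
assert strlen(wanted) - strlen(subinstr(wanted, ",", "", .)) ///
    == strlen(nrti) - strlen(subinstr(nrti, ",", "", .))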

Result:

Code:

. l, sep(10)

     +------------------------------------------------------------+
     |                         nrti                        wanted |
     |------------------------------------------------------------|
  1. |             K65R,Y115F,M184V              K65R,Y115F,M184V |
  2. |       D67N,K70R,M184MV,K219E         D67N,K70R,M184V,K219E |
  3. |              D67N,K70E,M184V               D67N,K70E,M184V |
  4. | D67DN,K70R,M184V,T215I,K219E   D67N,K70R,M184V,T215I,K219E |
  5. |      D67DN,K70E,M184V,K219KR         D67N,K70E,M184V,K219R |
  6. |                   K70Q,M184V                    K70Q,M184V |
  7. |                        M184V                         M184V |
     +------------------------------------------------------------+
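
Since the question asked about regexm/regexs specifically, here is a rough sketch of how the same result might be reached with those functions. Each pass removes only one extra letter per observation, so the replacement is repeated until no observation matches any more. The variable name fixed and the locals pat and more are made up for illustration.

Code:

generate fixed = nrti
* group 1: everything up to and including a digit
* then the unwanted letter (not captured)
* group 2: the letter to keep plus the rest of the string
local pat "^(.*[0-9])[A-Z]([A-Z].*)"
local more = 1
while `more' {
    * regexs() returns the groups from the regexm() call in the if qualifier
    replace fixed = regexs(1) + regexs(2) if regexm(fixed, "`pat'")
    quietly count if regexm(fixed, "`pat'")
    local more = r(N) > 0
}

Because .* is greedy, each pass fixes the last offending mutation in the cell first, so a cell with two problem mutations, such as D67DN,K70E,M184V,K219KR, needs two passes.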

Bryony Simmons

25 Mar 2020, 13:37

fantastic, thank you both so much!