Hi,
My data consist of lists of viral mutations, separated by commas. Dummy data are given in the dataex example below.
Each mutation (separated by a comma, no space) should be one capital letter, 1-3 digits, then one capital letter (e.g., K65R). However, sometimes there are two capital letters at the end (e.g., K65KR). I want to remove the first of the two trailing letters (e.g., K65KR -> K65R).
I am trying to achieve this using the regexm/regexs string functions. I can identify the issue by using regexr to replace the errors with different text (repeating the code to catch cases where there is more than one problem mutation in a cell).
But this isn't exactly what I want to do. I am trying various iterations using regexs but can't quite seem to get there. Does anyone have any advice on how I could achieve this?
I would really appreciate any help on this.
Bryony
Code:
* Example generated by -dataex-. To install: ssc install dataex
clear
input str33 nrti
"K65R,Y115F,M184V"
"D67N,K70R,M184MV,K219E"
"D67N,K70E,M184V"
"D67DN,K70R,M184V,T215I,K219E"
"D67DN,K70E,M184V,K219KR"
"K70Q,M184V"
"M184V"
end
Code:
gen dup = nrti
replace dup = regexr(dup, "[A-Z][0-9]+[A-Z][A-Z]","issue")
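For reference, here is a minimal sketch of one possible way to do the replacement itself rather than only flagging it. It assumes Stata 14 or later, because it swaps the regexm/regexs/regexr functions for the Unicode function ustrregexra(), which replaces every match in a string in a single pass. The variable name fixed is hypothetical, and the pattern assumes a problem mutation never has more than two trailing capital letters.

```stata
* Sketch only: assumes Stata 14+ (Unicode regular-expression functions).
clear
input str33 nrti
"K65R,Y115F,M184V"
"D67N,K70R,M184MV,K219E"
"D67N,K70E,M184V"
"D67DN,K70R,M184V,T215I,K219E"
"D67DN,K70E,M184V,K219KR"
"K70Q,M184V"
"M184V"
end

* Remove any capital letter that sits directly after a digit and
* directly before another capital letter, i.e. the first of two
* trailing letters. The lookbehind (?<=[0-9]) and lookahead (?=[A-Z])
* test the neighbouring characters without consuming them, so only
* the unwanted letter is replaced with the empty string.
* "fixed" is a hypothetical variable name.
gen fixed = ustrregexra(nrti, "(?<=[0-9])[A-Z](?=[A-Z])", "")

list nrti fixed, noobs
```

Because ustrregexra() replaces all matches, a cell with several problem mutations is handled without repeating the command. Well-formed mutations are left untouched, since their final letter is followed by a comma or the end of the string rather than another capital letter.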