Help with "Split"- splitting a string variable that has varying numbers, text and symbols

Monica Larissa

Join Date: Sep 2021

Posts: 6
#1


06 Sep 2021, 19:59

Hi all,

RE: split using Stata version 16.0.

As part of cleaning my data sets, I need to generate a new variable that holds the final result for an existing variable. The variable it is derived from (sero) includes preliminary results. A given dataset has multiple "sero" variables (sero1, sero2, sero3), so I would like an automated way to do this.

sero is a string variable of varying length (Example A). It may be purely numeric (Example B) or may contain symbols (Example C). I would like to generate a new variable that keeps the value as listed, except when it contains square brackets ([ ]), in which case the final value is whatever is inside the brackets. For example, 13/14/12B[14] should become 14. The "split" command works well for this when the dataset is large, because the variable will then typically include a mix of all types of values (e.g. Example D). However, when there are few observations I run into problems, because the variable does not always contain square brackets (e.g. Examples A and B).

Example A
sero
3A
3
1-like

Example B-- Note: this is still a string variable
sero
1
3
5

Example C
sero
13/14/12B[14]
15A/F[15F]
2-like

Example D
sero
13/14/12B[14]
15A/F[15F]
2-like
3A
1
2
4
4-like
32A/12F/19A[32A]

For a single variable, a simple example is listed below. This works perfectly if there are square brackets, but does not work if there are no brackets.
split sero1, p("[")
split sero12, p("]")
gen sero1_new = sero11 if sero121==""
replace sero1_new = sero121 if sero121!=""
drop sero11 sero12 sero121

For the data cleaning I need this automated so that it works for several sero variables (i.e. sero1, sero2, sero3 etc.), regardless of whether they contain square brackets. The code I am using is shown further below.
Note: max_sero is the maximum number of sero variables in the data set (i.e. global max_sero 3, if there are 3 variables). As with the simplified example, this works if there are square brackets but not if there aren't.

Is there a way I can use the same code regardless of whether there are brackets or not? I have a workaround, but it is a bit messy (it adjusts only the variables that contain brackets).

forval x=1/$max_sero {
split sero`x', p("[")
split sero`x'2, p("]")
gen sero`x'_clean = sero`x'1 if sero`x'21=="" //generate sero*_clean variable, use the value present if no brackets
replace sero`x'_clean = sero`x'21 if sero`x'21!="" //updating to final serotype call given inside the bracket for closely related calls
generate sero`x'_clean_test = 1 if sero`x'_clean!= sero`x'
}

I also tried to approach this issue by using ustrregexra but could only work out how to get rid of the data inside of the brackets, not keep it (e.g. replace sero1=ustrregexra(sero1, "\[.*?\]" , "" ) if strpos(sero1,"[") ).

Thank you for your help!

Last edited by Monica Larissa; 06 Sep 2021, 20:24.

Ali Atia

Join Date: May 2020
Posts: 737

06 Sep 2021, 20:17

Not 100% sure this is what you're looking for, but this code replicates sero if there are no square brackets, and extracts what's inside of the square brackets if they are there:

Code:

gen sero_new = ustrregexs(0) if ustrregexm(sero1,"((?<=\[).*?(?=\]))|(^[^\[\]]*$)")

. list, sep(0)

     +--------------------------+
     |         sero1   sero_new |
     |--------------------------|
  1. | 13/14/12B[14]         14 |
  2. |    15A/F[15F]        15F |
  3. |        2-like     2-like |
  4. |            3A         3A |
  5. |             1          1 |
  6. |             2          2 |
  7. |             4          4 |
  8. |        4-like     4-like |
     +--------------------------+

It can easily be inserted into a loop to deal with many seros.
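As a rough sketch of what that loop might look like (not tested against the real data): the toy values below are taken from the examples in post #1, the global max_sero follows the convention described there, and the names sero`x'_clean and sero`x'_changed are only illustrative.

```stata
* Toy data: two sero variables, one with brackets and one without
clear
input str20 sero1 str20 sero2
"13/14/12B[14]" "3A"
"15A/F[15F]"    "3"
"2-like"        "1-like"
end

* Assumption: max_sero holds the number of sero variables, as in post #1
global max_sero 2

forvalues x = 1/$max_sero {
    * Keep the bracketed value if present, otherwise the whole string
    generate sero`x'_clean = ustrregexs(0) if ustrregexm(sero`x', "((?<=\[).*?(?=\]))|(^[^\[\]]*$)")
    * Hypothetical flag: 1 where the cleaned value differs from the original
    generate byte sero`x'_changed = sero`x'_clean != sero`x'
}

list, sep(0)
```

Because the second half of the regular expression handles strings without brackets, the same loop should run whether or not a given sero variable contains any bracketed values.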

PS: In future, a better way to show data examples is to use the dataex command, more information on which is contained in this forum's FAQ.

Last edited by Ali Atia; 06 Sep 2021, 20:25.


Monica Larissa

Join Date: Sep 2021

Posts: 6
#3

06 Sep 2021, 20:27


Thank you, this is exactly what I would like to do. I will use dataex next time. Are you able to explain what each part of the code means?

Last edited by Monica Larissa; 06 Sep 2021, 20:40.
Ali Atia

Join Date: May 2020

Posts: 737
#4

06 Sep 2021, 20:44

The ustrregexs(n) function returns the nth subexpression from the preceding ustrregexm() regular expression match. If n=0, the entire string that satisfied the regular expression is returned.

The ustrregexm() function performs a match of a regular expression and evaluates to 1 if the regular expression is satisfied, and 0 otherwise.

See -help ustrregexm()- for more information.
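To make the difference between n=0 and n=1 concrete, here is a small sketch. The pattern \[(.*?)\] is a simpler one chosen purely for illustration (it is not the expression from my earlier post), and the variable names whole, inner and matched are made up for this example.

```stata
* Two toy values from the examples in post #1
clear
input str20 s
"15A/F[15F]"
"2-like"
end

* 1 if the pattern is found somewhere in s, 0 otherwise
generate byte matched = ustrregexm(s, "\[(.*?)\]")

* n=0: the entire matched text, brackets included
generate whole = ustrregexs(0) if ustrregexm(s, "\[(.*?)\]")

* n=1: only the first parenthesised subexpression
generate inner = ustrregexs(1) if ustrregexm(s, "\[(.*?)\]")

list, sep(0)
```

For the first observation, whole should hold [15F] and inner should hold 15F; the second observation has no brackets, so matched is 0 and both new variables are left empty.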

As to the regular expression itself:

Code:

((?<=\[).*?(?=\]))|(^[^\[\]]*$)

It can be broken down into two components, one dealing with strings which contain square brackets and one for those which don't, separated by the alternation operator (|), which acts as a logical OR.

The first component:

Code:

((?<=\[).*?(?=\]))

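This component contains two positive lookaround assertions: a lookbehind, (?<=\[), and a lookahead, (?=\]). Together they match the character(s) sitting between an opening and a closing square bracket, and the lazy quantifier .*? stops at the first closing bracket it finds. Because lookarounds do not consume characters, the regex matches what is contained between the square brackets without matching the square brackets themselves.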
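A quick sketch of this component on its own, comparing the lazy .*? with a greedy .* for contrast. The second value, a[1]b[2], is a made-up string with two bracket pairs (nothing like it appears in the posted data), and the variable names lazy and greedy are illustrative only.

```stata
clear
input str20 s
"15A/F[15F]"
"a[1]b[2]"
end

* Lazy: stop at the first closing bracket after the opening one
generate lazy   = ustrregexs(0) if ustrregexm(s, "(?<=\[).*?(?=\])")

* Greedy: run on to the last closing bracket in the string
generate greedy = ustrregexs(0) if ustrregexm(s, "(?<=\[).*(?=\])")

list, sep(0)
```

With a single pair of brackets both versions return 15F. They differ only on the made-up second value, where the lazy version returns 1 and the greedy version returns 1]b[2.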

The second component:

Code:

(^[^\[\]]*$)

Matches a string only if it contains no square brackets anywhere from start (^) to finish ($). The [^...] syntax means: match any single character except those listed inside the brackets. So [^\[\]]* matches any run of characters other than square brackets. If the string does contain a square bracket, this component does not match.
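As a small illustration of the second component by itself, the sketch below flags which toy values it accepts. The variable name no_brackets is made up for this example.

```stata
clear
input str20 s
"2-like"
"3A"
"15A/F[15F]"
end

* 1 only when the whole string is free of square brackets
generate byte no_brackets = ustrregexm(s, "^[^\[\]]*$")

list, sep(0)
```

The first two values should be flagged 1 and the bracketed value 0, which is why the combined expression falls back on the first component for strings like 15A/F[15F].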

Take a look at this link: https://regex101.com/r/4NiHvZ/1, where you can experiment with the regular expression and read comprehensive explanations of each part.

Last edited by Ali Atia; 06 Sep 2021, 20:47.
Monica Larissa

Join Date: Sep 2021

Posts: 6
#5

06 Sep 2021, 21:17


Thank you so much. This is incredibly helpful. I started looking up the ICU regular expressions, but could not work out the last section with the ^ so your description is very helpful.