Using regexp to extract variables from string

David Smerdon

Join Date: Jul 2019
Posts: 24

14 Jan 2021, 18:34

I have a string variable People that gives a sentence about the number and type of people on board boats. I want to convert this variable into three variables: N_all for the total number of passengers, N_crew for the number of crew, and N_children for the number of children. The text is inconsistent in that it doesn't mention crew or children if there are none, e.g.:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str29 People
"49 (2 crew, 17 children) "
"50 (1 crew, 8 children) " 
"40 (2 crew, 4 children) " 
"47 (2 crew, 13 children) "
"27 (2 crew, 4 children) " 
"58 (2, crew, 2 children) "
"38 (2 crew, 3 children) " 
"28 (2 crew, 2 children) " 
"20 (2 crew) "             
"3 (1 crew) "              
"3 (2 crew) "              
"41 (1 crew, 9 children) " 
"10 (3 crew) "             
"37 (6 children) "         
"3 (2 crew) "              
"4 "                       
end

I created N_all via:

Code:

gen         N_all = regexs(0) if regexm(People, "^[0-9]+")

But I have not been able to successfully extract the crew or children using regular expressions. For example,

Code:

gen         N_crew = regexs(0) if regexm(People, "(\d+)[^\d]+?(?=crew)")

gives the error "regexp: nested *?+". What am I doing wrong?

Tags: regex, Regular expressions

William Lisowski

Join Date: Dec 2014
Posts: 10150

14 Jan 2021, 19:42

This seems to accomplish what you seek, including converting the results from string to numeric values, at least on your example data.

Code:

gen N_all   = real(regexs(1)) if regexm(People, "^([0-9]+)")
gen N_crew  = real(regexs(1)) if regexm(People, "([0-9]+) crew")
gen N_child = real(regexs(1)) if regexm(People, "([0-9]+) children")

Code:

. list, clean

                          People   N_all   N_crew   N_child  
  1.   49 (2 crew, 17 children)       49        2        17  
  2.    50 (1 crew, 8 children)       50        1         8  
  3.    40 (2 crew, 4 children)       40        2         4  
  4.   47 (2 crew, 13 children)       47        2        13  
  5.    27 (2 crew, 4 children)       27        2         4  
  6.   58 (2, crew, 2 children)       58        .         2  
  7.    38 (2 crew, 3 children)       38        2         3  
  8.    28 (2 crew, 2 children)       28        2         2  
  9.                20 (2 crew)       20        2         .  
 10.                 3 (1 crew)        3        1         .  
 11.                 3 (2 crew)        3        2         .  
 12.    41 (1 crew, 9 children)       41        1         9  
 13.                10 (3 crew)       10        3         .  
 14.            37 (6 children)       37        .         6  
 15.                 3 (2 crew)        3        2         .  
 16.                          4        4        .         .
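Note that observation 6 has a missing N_crew even though the text mentions crew, because the source string reads "2, crew" with a stray comma. A quick check along the following lines, offered only as a sketch and assuming the variable names created above, would list any observation where a keyword appears in People but no number was extracted:

```stata
* Sketch: flag rows where the keyword is present but extraction returned missing.
* strpos() returns 0 when the substring is not found, so a nonzero value is "true".
list People N_crew  if strpos(People, "crew")     & missing(N_crew)
list People N_child if strpos(People, "children") & missing(N_child)
```

On the example data the first command should list observation 6 and the second should list nothing; on the full dataset it may turn up other irregular entries that need a more tolerant pattern.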

With that said, I would be more likely to use Stata's Unicode regular expression functions, introduced in Stata 14. They support a much more powerful definition of regular expressions than the non-Unicode functions. To the best of my knowledge, only in the Statalist post linked here is it documented that Stata's Unicode regular expression parser is the ICU regular expression engine documented at http://userguide.icu-project.org/strings/regexp. A comprehensive discussion of regular expressions can be found at https://www.regular-expressions.info/unicode.html.

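My suspicion is that your error message arose because the non-Unicode regular expression parser does not support some of the syntax in your pattern. The lazy quantifier +? looks like the likely trigger for "nested *?+", and I would not expect \d or the lookahead (?=crew) to be recognized by that parser either.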
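For comparison, here is a minimal sketch of the same extraction using the Unicode functions ustrregexm() and ustrregexs(), assuming Stata 14 or later. The _u and _la suffixes are hypothetical names chosen only to avoid clashing with the variables created above, and the optional comma in the crew pattern is an assumption meant to tolerate entries like "2, crew". The last line is a close variant of the pattern from the original question, with \D in place of [^\d] and the digits taken from capture group 1 rather than group 0.

```stata
* Sketch only: Unicode (ICU) regular expression versions of the three extractions.
gen N_all_u   = real(ustrregexs(1)) if ustrregexm(People, "^(\d+)")
gen N_crew_u  = real(ustrregexs(1)) if ustrregexm(People, "(\d+),? crew")
gen N_child_u = real(ustrregexs(1)) if ustrregexm(People, "(\d+) children")

* Variant of the original pattern: digits, then as few non-digits as possible,
* then a lookahead for "crew". Group 1 holds the number of crew.
gen N_crew_la = real(ustrregexs(1)) if ustrregexm(People, "(\d+)\D+?(?=crew)")

list People N_crew_u N_crew_la, clean
```

With these patterns, observation 6 should yield 2 for both crew variables instead of a missing value, while observations that do not mention crew should remain missing.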

Last edited by William Lisowski; 14 Jan 2021, 20:09. Reason: Forgot to update the code posted when I changed it to include real() to convert the string values to numeric values.
