
  • Using regexp to extract variables from string

    I have a string variable, People, that describes the number and type of people on board boats. I want to convert this variable into three variables: N_all for the total number of passengers, N_crew for the number of crew, and N_children for the number of children. The text is inconsistent in that it doesn't mention crew or children if there are none, e.g.:
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str29 People
    "49 (2 crew, 17 children) "
    "50 (1 crew, 8 children) " 
    "40 (2 crew, 4 children) " 
    "47 (2 crew, 13 children) "
    "27 (2 crew, 4 children) " 
    "58 (2, crew, 2 children) "
    "38 (2 crew, 3 children) " 
    "28 (2 crew, 2 children) " 
    "20 (2 crew) "             
    "3 (1 crew) "              
    "3 (2 crew) "              
    "41 (1 crew, 9 children) " 
    "10 (3 crew) "             
    "37 (6 children) "         
    "3 (2 crew) "              
    "4 "                       
    end
    I created N_all via:
    Code:
    gen         N_all = regexs(0) if regexm(People, "^[0-9]+")
    But I have not been able to successfully extract the crew or children using regular expressions. For example,
    Code:
    gen         N_crew = regexs(0) if regexm(People, "(\d+)[^\d]+?(?=crew)")
    gives the error "regexp: nested *?+". What am I doing wrong?

  • #2
    This seems to accomplish what you seek, including converting the results from string to numeric values, at least on your example data.
    Code:
    gen N_all   = real(regexs(1)) if regexm(People, "^([0-9]+)")
    gen N_crew  = real(regexs(1)) if regexm(People, "([0-9]+) crew")
    gen N_child = real(regexs(1)) if regexm(People, "([0-9]+) children")
    Code:
    . list, clean
    
                              People   N_all   N_crew   N_child  
      1.   49 (2 crew, 17 children)       49        2        17  
      2.    50 (1 crew, 8 children)       50        1         8  
      3.    40 (2 crew, 4 children)       40        2         4  
      4.   47 (2 crew, 13 children)       47        2        13  
      5.    27 (2 crew, 4 children)       27        2         4  
      6.   58 (2, crew, 2 children)       58        .         2  
      7.    38 (2 crew, 3 children)       38        2         3  
      8.    28 (2 crew, 2 children)       28        2         2  
      9.                20 (2 crew)       20        2         .  
     10.                 3 (1 crew)        3        1         .  
     11.                 3 (2 crew)        3        2         .  
     12.    41 (1 crew, 9 children)       41        1         9  
     13.                10 (3 crew)       10        3         .  
     14.            37 (6 children)       37        .         6  
     15.                 3 (2 crew)        3        2         .  
     16.                          4        4        .         .
    With that said, I would be more likely to use Stata's Unicode regular expression functions introduced in Stata 14. They have a much more powerful definition of regular expressions than the non-Unicode functions. To the best of my knowledge, only in the Statalist post linked here is it documented that Stata's Unicode regular expression parser is the ICU regular expression engine documented at http://userguide.icu-project.org/strings/regexp. A comprehensive discussion of regular expressions can be found at https://www.regular-expressions.info/unicode.html.
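
    As a minimal sketch of that approach, tested only against the kinds of rows shown here, the three extractions might look like this with ustrregexm() and ustrregexs(). The _u variable names are hypothetical, chosen only to avoid clashing with the variables created above. The optional comma in the crew pattern is an added assumption meant to pick up the irregular "2, crew" entry in observation 6, which the non-Unicode pattern above leaves missing.

    ```stata
    * Small subset of the example data, so the sketch runs on its own
    clear
    input str29 People
    "49 (2 crew, 17 children) "
    "58 (2, crew, 2 children) "
    "37 (6 children) "
    "4 "
    end

    * Unicode (ICU) regex functions, Stata 14+; \d is accepted here
    gen N_all_u   = real(ustrregexs(1)) if ustrregexm(People, "^(\d+)")
    * ",?" tolerates the stray comma in "2, crew" (an assumption about the data)
    gen N_crew_u  = real(ustrregexs(1)) if ustrregexm(People, "(\d+),? crew")
    gen N_child_u = real(ustrregexs(1)) if ustrregexm(People, "(\d+) children")

    list, clean
    ```

    Observations with no mention of crew or children are left missing, as in the non-Unicode version.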

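    My suspicion is that your error message arose because the non-Unicode regular expression parser does not support some of the syntax in your pattern (the \d shorthand, the lazy quantifier +?, and the lookahead (?=...)). The message "regexp: nested *?+" is consistent with the parser reading +? as two consecutive quantifiers rather than as a single lazy quantifier.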
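
    One way to check that suspicion, sketched here under the assumption that Stata 14 or later is available, is to pass the original pattern unchanged to ustrregexm(), whose ICU engine accepts \d, the lazy quantifier +?, and the lookahead (?=crew). The variable name N_crew_icu is hypothetical. Because subexpression 0 would return the whole match, including the text between the number and "crew", the sketch reads capture group 1 instead.

    ```stata
    * Small subset of the example data, so the sketch runs on its own
    clear
    input str29 People
    "49 (2 crew, 17 children) "
    "58 (2, crew, 2 children) "
    "37 (6 children) "
    "4 "
    end

    * Original pattern from the question, run through the ICU engine
    * Group 1 holds the digits; the lookahead itself consumes no text
    gen N_crew_icu = real(ustrregexs(1)) if ustrregexm(People, "(\d+)[^\d]+?(?=crew)")

    list, clean
    ```

    Note that this pattern takes the number nearest before "crew" with only non-digits in between, so it relies on crew being listed before children, as it is throughout the example data.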
    Last edited by William Lisowski; 14 Jan 2021, 20:09. Reason: Forgot to update the code posted when I changed it to include real() to convert the string values to numeric values.
