Separating a string into substrings (by upper case letters)

Lynde Yn

Join Date: Apr 2019

Posts: 6
#1


04 May 2019, 03:47

Dear all,

I have to "cut a string into pieces", with each upper case letter acting as the "separator".

e.g. var name "AbrahamAndyGregor" should be divided into var name1 "Abraham", var name2 "Andy" and var name3 "Gregor".

There are about 180 "names" combined in different ways. I do have a list of names.

What I tried was to set the first letter to lower case -->

gen name1=(lower(substr(name, 1, 1))) + (substr(name, 2, .)) ---> name will be abrahamAndyGregor.

Thus I could extract the first name ...
gen name1 = regexs(0) if regexm(name, "[a-zäöüß ]+")

(Those freaky letters äöüß are just because they appear in the names - also blanks).

Then I tried to get the second name - and failed ;-)!
I tried different combinations like ...
gen name2 = regexs(1) if regexm(name, "(([a-zäöüß ]+)([A-ZÄÖÜ][a-zäöüß ]+))")
gen name2 = regexs(0) if regexm(name, "([A-ZÄÖÜ][a-zäöüß ]+)")

Asking the forum kindly for help - best regards,
Lynde

Bjarte Aagnes

Join Date: Apr 2014
Posts: 783

04 May 2019, 06:08

Running

Code:

tempvar names nu    // "nu" to hold number of upper case letters
gen `names' = names
gen `nu' = ustrlen(ustrregexra(names, "[^\p{Lu}]" ,""))
qui su `nu'

local ulmatch "ustrregexm(`names', "(\p{Lu}\p{Ll}+)" )"

forvalues i = 1/`r(max)' {

    gen name`i' = ustrregexs(1) if `ulmatch' 
    replace `names' = subinstr(`names', ustrregexs(1), "", 1) if `ulmatch'
}

applied to

Code:

    input str50 names
    "AnnaBusterCecilDavid"
    "AnnaBusterCecil"
    "AnnaBuster"
    "Anna"
    "AnnaBuster"
    "AnnaBusterCecil"
    "AnnaBusterCecilDavid"
    end

returns

Code:

list , clean

                      names   name1    name2   name3   name4  
  1.   AnnaBusterCecilDavid    Anna   Buster   Cecil   David  
  2.        AnnaBusterCecil    Anna   Buster   Cecil          
  3.             AnnaBuster    Anna   Buster                  
  4.                   Anna    Anna                           
  5.             AnnaBuster    Anna   Buster                  
  6.        AnnaBusterCecil    Anna   Buster   Cecil          
  7.   AnnaBusterCecilDavid    Anna   Buster   Cecil   David


Lynde Yn

Join Date: Apr 2019

Posts: 6
#3

04 May 2019, 13:42

Dear Aagnes,

many thanks for the prompt reply!

Tried the code but Stata returned: unknown function ustrlen()

I guess this is because I still use Stata 13, which does not seem to know the functions ustrlen and ustrregexs.

If there was any other way, which is okay in v13 - this would be of great help!
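
For anyone hitting the same error: the ustr*() functions were introduced with Stata 14, so a quick look at the running version can confirm the diagnosis. A minimal sketch, assuming nothing beyond the built-in c(stata_version) value:

```stata
// show the version of the Stata executable that is running
display c(stata_version)

// ustrlen(), ustrregexm(), ustrregexs() etc. need Stata 14 or later
if c(stata_version) < 14 {
    display "Unicode string functions unavailable; use regexm()/regexs() instead"
}
else {
    display "Unicode string functions available"
}
```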

Lynde

Bjarte Aagnes

Join Date: Apr 2014
Posts: 783

04 May 2019, 13:59

Please follow the advice given in the FAQ

The current version of Stata is 15.1. Please specify if you are using an earlier version; otherwise, the answer to your question may refer to commands or features unavailable to you.

The following is an adaptation that also allows spaces (double names):

Code:

tempvar names
gen `names' = names

local rx [A-Z][a-z ]+

local ulmatch regexm(`names', "(`rx')")

forvalues i = 1/100 {
    
    gen name`i' = regexs(1) if `ulmatch'
    replace `names' = subinstr(`names', regexs(1), "", 1) if `ulmatch'
    
    capture assert mi(`names')
    if ( _rc == 0 ) continue, break
}

test data:

Code:

input str50 names    
"AnnaBusterCecilDavidThe prince of names "
"AnnaBusterCecil"
"AnnaBuster"
"Anna"
"AnnaBuster"
"AnnaBusterCecil"
"AnnaBusterCecilDavid"
"AnnaBusterCecilDavidAnna"
end

result:

Code:

     +--------------------------------------------------------------------------------------------------+
     |                                    names   name1    name2   name3   name4                  name5 |
     |--------------------------------------------------------------------------------------------------|
  1. | AnnaBusterCecilDavidThe prince of names     Anna   Buster   Cecil   David   The prince of names  |
  2. |                          AnnaBusterCecil    Anna   Buster   Cecil                                |
  3. |                               AnnaBuster    Anna   Buster                                        |
  4. |                                     Anna    Anna                                                 |
  5. |                               AnnaBuster    Anna   Buster                                        |
     |--------------------------------------------------------------------------------------------------|
  6. |                          AnnaBusterCecil    Anna   Buster   Cecil                                |
  7. |                     AnnaBusterCecilDavid    Anna   Buster   Cecil   David                        |
  8. |                 AnnaBusterCecilDavidAnna    Anna   Buster   Cecil   David                   Anna |
     +--------------------------------------------------------------------------------------------------+

Last edited by Bjarte Aagnes; 04 May 2019, 14:01.


daniel klein

Join Date: Mar 2014
Posts: 3843

04 May 2019, 14:14

moss (SSC) can do this.

Code:

// create toy data
clear
input str50 names
    "AnnaBusterCecilDavid"
    "AnnaBusterCecil"
    "AnnaBuster"
    "Anna"
    "AnnaBuster"
    "AnnaBusterCecil"
    "AnnaBusterCecilDavid"
end

// start here
moss names , match("([A-Z])") regex
summarize _count , meanonly
local max = r(max)
generate _pos`=`max'+1' = .
forvalues i = 1/`max' {
    generate name`i' = substr(names, _pos`i', _pos`=`i'+1'-_pos`i')
}

// result
list

yields

Code:

. // start here
. moss names , match("([A-Z])") regex

. summarize _count , meanonly

. local max = r(max)

. generate _pos`=`max'+1' = .
(7 missing values generated)

. forvalues i = 1/`max' {
  2.     generate name`i' = substr(names, _pos`i', _pos`=`i'+1'-_pos`i')
  3. }
(1 missing value generated)
(3 missing values generated)
(5 missing values generated)

.
. // result
. list

     +------------------------------------------------------------------------------------------------------------------------------------------------+
     |                names   _count   _match1   _pos1   _match2   _pos2   _match3   _pos3   _match4   _pos4   _pos5   name1    name2   name3   name4 |
     |------------------------------------------------------------------------------------------------------------------------------------------------|
  1. | AnnaBusterCecilDavid        4         A       1         B       5         C      11         D      16       .    Anna   Buster   Cecil   David |
  2. |      AnnaBusterCecil        3         A       1         B       5         C      11                 .       .    Anna   Buster   Cecil         |
  3. |           AnnaBuster        2         A       1         B       5                 .                 .       .    Anna   Buster                 |
  4. |                 Anna        1         A       1                 .                 .                 .       .    Anna                          |
  5. |           AnnaBuster        2         A       1         B       5                 .                 .       .    Anna   Buster                 |
     |------------------------------------------------------------------------------------------------------------------------------------------------|
  6. |      AnnaBusterCecil        3         A       1         B       5         C      11                 .       .    Anna   Buster   Cecil         |
  7. | AnnaBusterCecilDavid        4         A       1         B       5         C      11         D      16       .    Anna   Buster   Cecil   David |
     +------------------------------------------------------------------------------------------------------------------------------------------------+

Best
Daniel


Lynde Yn

Join Date: Apr 2019

Posts: 6
#6

04 May 2019, 16:51

Dear Aagnes,

many thanks for the v13 alternative - worked out fine (with blanks, and the ÄÖÜ which I added!)
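
The change described here presumably amounts to widening the two character classes in Bjarte's v13 pattern. A minimal sketch, assuming a string variable called names and a single-byte encoding shared by the data and the do-file (as is typical in Stata 13); the tempvar name rest is only illustrative:

```stata
// Sketch only: the v13 loop with ÄÖÜ / äöüß added to the character classes
tempvar rest
gen `rest' = names

// one upper case letter followed by lower case letters or blanks
local rx [A-ZÄÖÜ][a-zäöüß ]+
local ulmatch regexm(`rest', "(`rx')")

forvalues i = 1/100 {

    // peel off the first remaining name, then delete it from the working copy
    gen name`i' = regexs(1) if `ulmatch'
    replace `rest' = subinstr(`rest', regexs(1), "", 1) if `ulmatch'

    // stop once every observation has been fully consumed
    capture assert mi(`rest')
    if ( _rc == 0 ) continue, break
}
```

On Stata 14 or later, regexm() works on bytes rather than Unicode characters, so the \p{Lu}\p{Ll} version with ustrregexm() shown earlier in the thread is likely the safer choice there.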

Best regards, Lynde

Also thanks to Daniel!!

Bjarte Aagnes

Join Date: Apr 2014

Posts: 783
#7

05 May 2019, 07:08

Another -moss- alternative:

Code:

moss names , match("([A-Z][a-z ]+)") regex
rename _match* name*
keep name*

Ref.: a similar problem: Trouble identifying lowercase-uppercase regular expression

daniel klein

Join Date: Mar 2014

Posts: 3843
#8

05 May 2019, 07:33

Bjarte Aagnes wrote:

Another -moss- alternative:

Code:

moss names , match("([A-Z][a-z ]+)") regex
rename _match* name*
keep name*

Much better than mine. How did I not see this?

Best
Daniel

Marcos Almeida

Join Date: Apr 2014
Posts: 4047

05 May 2019, 08:08

This may sound awkward but, if you have 3 or 4 names, you can do it without any need to apply a loop:

Code:

. gen var1 = regexs(1) if regexm(names, "((^[A-Z]+)[ ]*([a-z]+))")

. gen var2 = subinstr(names, var1, "", 1)
(1 missing value generated)

. gen var3 = regexs(1) if regexm(var2, "((^[A-Z]+)[ ]*([a-z]+))")
(1 missing value generated)

. gen var4 = subinstr(var2, var3, "", 1)
(3 missing values generated)

. gen var5 = regexs(1) if regexm(var4, "((^[A-Z]+)[ ]*([a-z]+))")
(3 missing values generated)

. gen var6 = subinstr(var4, var5, "", 1)
(5 missing values generated)

. list

     +--------------------------------------------------------------------------------------+
     |                names   var1               var2     var3         var4    var5    var6 |
     |--------------------------------------------------------------------------------------|
  1. | AnnaBusterCecilDavid   Anna   BusterCecilDavid   Buster   CecilDavid   Cecil   David |
  2. |      AnnaBusterCecil   Anna        BusterCecil   Buster        Cecil   Cecil         |
  3. |           AnnaBuster   Anna             Buster   Buster                              |
  4. |                 Anna   Anna                                                          |
  5. |           AnnaBuster   Anna             Buster   Buster                              |
     |--------------------------------------------------------------------------------------|
  6. |      AnnaBusterCecil   Anna        BusterCecil   Buster        Cecil   Cecil         |
  7. | AnnaBusterCecilDavid   Anna   BusterCecilDavid   Buster   CecilDavid   Cecil   David |
     +--------------------------------------------------------------------------------------+

. drop var2 var4

. list

     +------------------------------------------------------+
     |                names   var1     var3    var5    var6 |
     |------------------------------------------------------|
  1. | AnnaBusterCecilDavid   Anna   Buster   Cecil   David |
  2. |      AnnaBusterCecil   Anna   Buster   Cecil         |
  3. |           AnnaBuster   Anna   Buster                 |
  4. |                 Anna   Anna                          |
  5. |           AnnaBuster   Anna   Buster                 |
     |------------------------------------------------------|
  6. |      AnnaBusterCecil   Anna   Buster   Cecil         |
  7. | AnnaBusterCecilDavid   Anna   Buster   Cecil   David |
     +------------------------------------------------------+

Hopefully that helps

Best regards,

Marcos


Lynde Yn

Join Date: Apr 2019

Posts: 6
#10

18 May 2019, 07:47

Dear Marcos, this sounds like a very practical solution! Thanks! Lynde

Sonia Chen

Join Date: Feb 2017

Posts: 18
#11

14 Jul 2021, 18:36

Thank you Bjarte and Marcos! I've been looking for a solution for a long time.