  • Separating a string into substrings (by upper case letters)

    Dear all,

    I have to "cut a string into pieces", with each upper case letter acting as the "separator".

    E.g. var name "AbrahamAndyGregor" should be divided into var name1 "Abraham", var name2 "Andy" and var name3 "Gregor".

    There are about 180 "names" combined in different ways. I do have a list of names.

    What I tried was to set the first letter to lower case:

    gen name1 = lower(substr(name, 1, 1)) + substr(name, 2, .)

    so that the string becomes abrahamAndyGregor.

    Thus I could extract the first name ...
    gen name1 = regexs(0) if regexm(name, "[a-zäöüß ]+")

    (The letters äöüß are in the pattern only because they appear in the names, as do blanks.)

    Then I tried to get the second name - and failed ;-)!
    I tried different combinations like ...
    gen name2 = regexs(1) if regexm(name, "(([a-zäöüß ]+)([A-ZÄÖÜ][a-zäöüß ]+))")
    gen name2 = regexs(0) if regexm(name, "([A-ZÄÖÜ][a-zäöüß ]+)")

    Asking the forum kindly for help - best regards,
    Lynde

  • #2
    Running
    Code:
    tempvar names nu    // "nu" to hold number of upper case letters
    gen `names' = names
    gen `nu' = ustrlen(ustrregexra(names, "[^\p{Lu}]" ,""))
    qui su `nu'
    
    local ulmatch "ustrregexm(`names', "(\p{Lu}\p{Ll}+)" )"
    
    forvalues i = 1/`r(max)' {
    
        gen name`i' = ustrregexs(1) if `ulmatch' 
        replace `names' = subinstr(`names', ustrregexs(1), "", 1) if `ulmatch'
    }
    applied to
    Code:
        input str50 names
        "AnnaBusterCecilDavid"
        "AnnaBusterCecil"
        "AnnaBuster"
        "Anna"
        "AnnaBuster"
        "AnnaBusterCecil"
        "AnnaBusterCecilDavid"
        end
    returns
    Code:
    list , clean
    
                          names   name1    name2   name3   name4  
      1.   AnnaBusterCecilDavid    Anna   Buster   Cecil   David  
      2.        AnnaBusterCecil    Anna   Buster   Cecil          
      3.             AnnaBuster    Anna   Buster                  
      4.                   Anna    Anna                           
      5.             AnnaBuster    Anna   Buster                  
      6.        AnnaBusterCecil    Anna   Buster   Cecil          
      7.   AnnaBusterCecilDavid    Anna   Buster   Cecil   David
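
    To see what the counting step in the loop setup does, here is a minimal sketch on a single literal string, assuming Stata 14 or later (where the Unicode regex functions exist). Removing every character that is not an upper case letter leaves one character per name, so the length of what remains is the number of names. The example string is made up for illustration.

    ```stata
    * strip everything except upper case letters, then count what is left
    display ustrregexra("AnnaBusterCecil", "[^\p{Lu}]", "")
    display ustrlen(ustrregexra("AnnaBusterCecil", "[^\p{Lu}]", ""))
    ```

    The first line should display ABC and the second 3, which is why r(max) of that count can serve as the upper limit of the forvalues loop.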


    • #3
      Dear Aagnes,

      many thanks for the prompt reply!

      I tried the code, but Stata returned "unknown function ustrlen()".

      I guess this is because I still use Stata 13, which does not seem to know the functions ustrlen() and ustrregexs().

      If there is another way that works in v13, that would be a great help!

      Lynde


      • #4
        Please follow the advice given in the FAQ
        The current version of Stata is 15.1. Please specify if you are using an earlier version; otherwise, the answer to your question may refer to commands or features unavailable to you.
        The following is an adaptation for Stata 13 that also allows spaces (double names):
        Code:
        tempvar names
        gen `names' = names
        
        local rx [A-Z][a-z ]+
        
        local ulmatch regexm(`names', "(`rx')")
        
        forvalues i = 1/100 {
            
            gen name`i' = regexs(1) if `ulmatch'
            replace `names' = subinstr(`names', regexs(1), "", 1) if `ulmatch'
            
            capture assert mi(`names')
            if ( _rc == 0 ) continue, break
        }
        test data:
        Code:
        input str50 names    
        "AnnaBusterCecilDavidThe prince of names "
        "AnnaBusterCecil"
        "AnnaBuster"
        "Anna"
        "AnnaBuster"
        "AnnaBusterCecil"
        "AnnaBusterCecilDavid"
        "AnnaBusterCecilDavidAnna"
        end
        result:
        Code:
             +--------------------------------------------------------------------------------------------------+
             |                                    names   name1    name2   name3   name4                  name5 |
             |--------------------------------------------------------------------------------------------------|
          1. | AnnaBusterCecilDavidThe prince of names     Anna   Buster   Cecil   David   The prince of names  |
          2. |                          AnnaBusterCecil    Anna   Buster   Cecil                                |
          3. |                               AnnaBuster    Anna   Buster                                        |
          4. |                                     Anna    Anna                                                 |
          5. |                               AnnaBuster    Anna   Buster                                        |
             |--------------------------------------------------------------------------------------------------|
          6. |                          AnnaBusterCecil    Anna   Buster   Cecil                                |
          7. |                     AnnaBusterCecilDavid    Anna   Buster   Cecil   David                        |
          8. |                 AnnaBusterCecilDavidAnna    Anna   Buster   Cecil   David                   Anna |
             +--------------------------------------------------------------------------------------------------+
        Last edited by Bjarte Aagnes; 04 May 2019, 14:01.
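
        The peeling idea behind the loop can be followed on one string held in a local macro. This is only a sketch using the same functions as the post (regexm(), regexs(), subinstr()), so it should also run in Stata 13; the macro names s, i and piece are made up for illustration. Each pass takes the first match from what is left and then removes it, until nothing matches any more.

        ```stata
        * peel names off a single string, one per pass
        local s "AnnaBusterCecil"
        local i = 0
        while regexm("`s'", "([A-Z][a-z ]+)") {
            local ++i
            * first match in what is left of the string
            local piece = regexs(1)
            display "name`i' = `piece'"
            * remove that match before the next pass
            local s = subinstr("`s'", "`piece'", "", 1)
        }
        ```

        This should display name1 = Anna, name2 = Buster and name3 = Cecil, and leave s empty. Because the pattern allows blanks, a trailing blank is captured as part of a name (as in "The prince of names " in the test data), so applying trim() to the results may be worthwhile.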


        • #5
          moss (SSC) can do this.

          Code:
          // create toy data
          clear
          input str50 names
              "AnnaBusterCecilDavid"
              "AnnaBusterCecil"
              "AnnaBuster"
              "Anna"
              "AnnaBuster"
              "AnnaBusterCecil"
              "AnnaBusterCecilDavid"
          end
          
          // start here
          moss names , match("([A-Z])") regex
          summarize _count , meanonly
          local max = r(max)
          generate _pos`=`max'+1' = .
          forvalues i = 1/`max' {
              generate name`i' = substr(names, _pos`i', _pos`=`i'+1'-_pos`i')
          }
          
          // result
          list
          yields

          Code:
          . // start here
          . moss names , match("([A-Z])") regex
          
          . summarize _count , meanonly
          
          . local max = r(max)
          
          . generate _pos`=`max'+1' = .
          (7 missing values generated)
          
          . forvalues i = 1/`max' {
            2.     generate name`i' = substr(names, _pos`i', _pos`=`i'+1'-_pos`i')
            3. }
          (1 missing value generated)
          (3 missing values generated)
          (5 missing values generated)
          
          .
          . // result
          . list
          
               +------------------------------------------------------------------------------------------------------------------------------------------------+
               |                names   _count   _match1   _pos1   _match2   _pos2   _match3   _pos3   _match4   _pos4   _pos5   name1    name2   name3   name4 |
               |------------------------------------------------------------------------------------------------------------------------------------------------|
            1. | AnnaBusterCecilDavid        4         A       1         B       5         C      11         D      16       .    Anna   Buster   Cecil   David |
            2. |      AnnaBusterCecil        3         A       1         B       5         C      11                 .       .    Anna   Buster   Cecil         |
            3. |           AnnaBuster        2         A       1         B       5                 .                 .       .    Anna   Buster                 |
            4. |                 Anna        1         A       1                 .                 .                 .       .    Anna                          |
            5. |           AnnaBuster        2         A       1         B       5                 .                 .       .    Anna   Buster                 |
               |------------------------------------------------------------------------------------------------------------------------------------------------|
            6. |      AnnaBusterCecil        3         A       1         B       5         C      11                 .       .    Anna   Buster   Cecil         |
            7. | AnnaBusterCecilDavid        4         A       1         B       5         C      11         D      16       .    Anna   Buster   Cecil   David |
               +------------------------------------------------------------------------------------------------------------------------------------------------+
          Best
          Daniel
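
          The substr() arithmetic in the loop can be checked by hand. A minimal sketch, using the positions that moss reports for the first observation in the listing above (1, 5, 11 and 16): each name starts at one upper case position and its length is the distance to the next one. For the last name the next position is missing, and substr() with a missing length returns the rest of the string.

          ```stata
          * second name: starts at position 5, next upper case letter is at 11
          display substr("AnnaBusterCecilDavid", 5, 11 - 5)
          * last name: starts at position 16, length is missing
          display substr("AnnaBusterCecilDavid", 16, .)
          ```

          These should display Buster and David, matching name2 and name4 in the listing.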


          • #6
            Dear Aagnes,

            many thanks for the v13 alternative - worked out fine (with blanks, and the ÄÖÜ which I added!)

            Best regards, Lynde

            Also thanks to Daniel!!


            • #7

              Another -moss- alternative:
              Code:
              moss names , match("([A-Z][a-z ]+)") regex   
              
              rename _match* name* 
              keep name*
              Ref.: a similar problem: Trouble identifying lowercase-uppercase regular expression


              • #8
                Originally posted by Bjarte Aagnes:
                Another -moss- alternative:
                Code:
                moss names , match("([A-Z][a-z ]+)") regex
                
                rename _match* name*
                keep name*
                Much better than mine. How did I not see this?

                Best
                Daniel


                • #9
                  This may sound awkward, but if you have only 3 or 4 names, you can do it without a loop:

                  Code:
                  . gen var1 = regexs(1) if regexm(names, "((^[A-Z]+)[ ]*([a-z]+))")
                  
                  . gen var2 = subinstr(names, var1, "", 1)
                  (1 missing value generated)
                  
                  . gen var3 = regexs(1) if regexm(var2, "((^[A-Z]+)[ ]*([a-z]+))")
                  (1 missing value generated)
                  
                  . gen var4 = subinstr(var2, var3, "", 1)
                  (3 missing values generated)
                  
                  . gen var5 = regexs(1) if regexm(var4, "((^[A-Z]+)[ ]*([a-z]+))")
                  (3 missing values generated)
                  
                  . gen var6 = subinstr(var4, var5, "", 1)
                  (5 missing values generated)
                  
                  . list
                  
                       +--------------------------------------------------------------------------------------+
                       |                names   var1               var2     var3         var4    var5    var6 |
                       |--------------------------------------------------------------------------------------|
                    1. | AnnaBusterCecilDavid   Anna   BusterCecilDavid   Buster   CecilDavid   Cecil   David |
                    2. |      AnnaBusterCecil   Anna        BusterCecil   Buster        Cecil   Cecil         |
                    3. |           AnnaBuster   Anna             Buster   Buster                              |
                    4. |                 Anna   Anna                                                          |
                    5. |           AnnaBuster   Anna             Buster   Buster                              |
                       |--------------------------------------------------------------------------------------|
                    6. |      AnnaBusterCecil   Anna        BusterCecil   Buster        Cecil   Cecil         |
                    7. | AnnaBusterCecilDavid   Anna   BusterCecilDavid   Buster   CecilDavid   Cecil   David |
                       +--------------------------------------------------------------------------------------+
                  
                  . drop var2 var4
                  
                  . list
                  
                       +------------------------------------------------------+
                       |                names   var1     var3    var5    var6 |
                       |------------------------------------------------------|
                    1. | AnnaBusterCecilDavid   Anna   Buster   Cecil   David |
                    2. |      AnnaBusterCecil   Anna   Buster   Cecil         |
                    3. |           AnnaBuster   Anna   Buster                 |
                    4. |                 Anna   Anna                          |
                    5. |           AnnaBuster   Anna   Buster                 |
                       |------------------------------------------------------|
                    6. |      AnnaBusterCecil   Anna   Buster   Cecil         |
                    7. | AnnaBusterCecilDavid   Anna   Buster   Cecil   David |
                       +------------------------------------------------------+
                  Hopefully that helps
                  Best regards,

                  Marcos


                  • #10
                    Dear Marcos, this sounds like a very practical solution! Thanks! Lynde


                    • #11
                      Thank you Bjarte and Marcos! I've been looking for solutions for a long time.
