  • Stata command to split string variable that does not have spaces

    I have a string variable h416 with values such as 123B. I would like to split it into four variables, Var1, Var2, Var3, and Var4, holding the values 1, 2, 3, and B.

  • #2
    Perhaps something like this:
    Code:
    forvalues i = 1/4 {
        gen var`i' = substr(h416, `i', 1)
    }
    Otherwise it would help if you provide an example of your data.
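
    If the strings are not all four characters long, a minimal sketch along the same lines might look like the following. It assumes h416 contains only plain ASCII characters (for Unicode text, ustrlen() and usubstr() would be the analogous functions), and maxlen is a hypothetical helper variable name, not something from the original question.
    Code:
    * sketch only: maxlen is a hypothetical helper variable
    generate maxlen = strlen(h416)
    * store the longest length in r(max)
    summarize maxlen, meanonly
    * one new variable per character position; shorter strings yield ""
    forvalues i = 1/`r(max)' {
        generate var`i' = substr(h416, `i', 1)
    }
    drop maxlen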



    • #3
      I would add to Wouter's response by noting that Stata has a comprehensive set of functions for handling strings. Judging from the number of questions about manipulating strings that appear on Statalist, not many new users are aware of this, so I'd recommend that you and others see -help string functions-. No one, including me, remembers all those functions, but it's good to be aware that they exist.
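
      As a quick illustration, a few of those functions can be tried directly from the command line. This sketch uses literal strings, so nothing in it depends on your data; the particular functions shown are just my selection.
      Code:
      * open the help file listing the string functions
      help string functions
      * length of a string
      display strlen("123B")
      * the single character at position 4
      display substr("123B", 4, 1)
      * convert a one-character substring to a number and do arithmetic on it
      display real(substr("123B", 1, 1)) + 1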



      • #4
        Here is a possibly amusing alternative approach that uses Stata's Unicode regular expression replacement function to insert a comma after each character in the string, after which the split command can work its magic.
        Code:
        . generate h416a = ustrregexra(h416,"(.)","$1,")
        
        . split h416a, generate(Var) parse(",")
        variables created as string: 
        Var1  Var2  Var3  Var4
        
        . list, clean
        
               h416      h416a   Var1   Var2   Var3   Var4  
          1.   123B   1,2,3,B,      1      2      3      B
        The advantage of the solution in post #2 is that it uses truly basic Stata commands that everyone learns quickly as they learn Stata, and I expect it took Wouter less time to get a solution than it took me to look up the syntax of the two commands in the code. Were it not that I saw my first regular expression too many years ago, I would have spent serious time trying to get the match and replacement expressions right. Indeed, I was shocked that it ran perfectly on my first attempt.

        I post this only for the benefit of those experienced with regular expressions who may come across this post as the result of a search. I will add, as I always do when discussing regular expressions, that the real benefit of the Unicode regular expression functions is their much more powerful definition of regular expressions. To the best of my knowledge, only in the Statalist post linked here is it documented that Stata's new regular expression parser is the ICU regular expression engine documented at http://userguide.icu-project.org/strings/regexp.
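
        As one small illustration of that extra power, here is a hedged variant of the code above. It assumes the ICU lookahead construct (?=.) is accepted by ustrregexra(), so that a comma is inserted only after a character that is followed by another character, leaving no comma after the final one. The names h416b and Part are hypothetical, chosen only to avoid clashing with the variables created earlier.
        Code:
        * sketch: (.) captures one character, (?=.) requires another to follow
        generate h416b = ustrregexra(h416, "(.)(?=.)", "$1,")
        * split on the inserted commas as before
        split h416b, generate(Part) parse(",")
        list h416 h416b Part*, clean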



        • #5
          When I originally wrote split I thought of including this kind of problem, but decided rightly or wrongly that it was a different problem -- because there are no separators -- and at best would require complicating the syntax when there are usually direct solutions. When the command was folded into official Stata the company went along with that. But I (and they) included a strong hint in the manual entry:

          If your problem is not defined by splitting on separators, you will probably want to use substr()
          directly. Suppose that you have a string variable, date, containing dates in the form "21011952" so
          that the last four characters define a year. This string contains no separators. To extract the year, you
          would use substr(date,-4,4). Again suppose that each woman’s obstetric history over the last 12
          months was recorded by a str12 variable containing values such as "nppppppppbnn", where p, b,
          and n denote months of pregnancy, birth, and nonpregnancy. Once more, there are no separators, so
          you would use substr() to subdivide the string.
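
          The two cases in that manual passage could be sketched roughly as follows. The variable name date is from the manual's example; history is a hypothetical name for the str12 variable, and year and month1 to month12 are likewise my own names for the results.
          Code:
          * last four characters of a string such as "21011952", converted to a number
          generate year = real(substr(date, -4, 4))
          * one variable per month of the 12-character history, e.g. "nppppppppbnn"
          forvalues m = 1/12 {
              generate month`m' = substr(history, `m', 1)
          }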



          • #6
            Originally posted by William Lisowski
            This was helpful. It did the magic!! Thanks!
