Stata command to split string variable that does not have spaces

Ko Chavula

Join Date: Mar 2017

Posts: 2
#1

15 Jul 2020, 14:45

I have a string variable h416 with values such as 123B. I would like to split this variable into 4 parts, Var1, Var2, Var3, and Var4, holding the values 1, 2, 3, and B.
Wouter Wakker

Join Date: Nov 2018

Posts: 621
#2

15 Jul 2020, 14:55

Perhaps something like this:

Code:

forvalues i = 1/4 {
    gen var`i' = substr(h416, `i', 1)
}

Otherwise it would help if you provide an example of your data.
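
If the values are not all four characters long, the same loop can be driven by the longest string in the data. The following is only a sketch under assumptions: the two example values and the helper variable len are made up for illustration, and the output names follow the Var1-Var4 naming asked for in #1.

```stata
* Sketch: one new variable per character, without hard-coding the length.
* The example data and the helper variable len are illustrative assumptions.
clear
input str4 h416
"123B"
"45C"
end

* find the longest value so the loop knows how many variables to create
generate len = strlen(h416)
summarize len, meanonly

forvalues i = 1/`r(max)' {
    * substr() returns "" when position `i' is past the end of the string
    generate Var`i' = substr(h416, `i', 1)
}
drop len
list, clean
```

Shorter values simply get empty strings in the trailing variables. For strings containing non-ASCII characters, ustrlen() and usubstr() would be the natural substitutes.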
Mike Lacy

Join Date: Apr 2014

Posts: 2416
#3

15 Jul 2020, 15:01

I would add to Wouter's response by noting that Stata has a comprehensive set of functions for handling strings. Judging from the number of questions about manipulating strings that appear on Statalist, not many new users are aware of this, so I'd recommend that you and others see -help string functions-. No one, including me, remembers all those functions, but it's good to be aware that they exist.
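
As a small taste of what is listed there, here is a sketch using four of the built-in string functions on the value from #1. The choice of functions is only an illustration, not a recommendation of these over the others.

```stata
* A few built-in string functions applied to the literal "123B"
display strlen("123B")               // number of characters
display substr("123B", 4, 1)         // one character starting at position 4
display strpos("123B", "B")          // position of the first "B"
display real(substr("123B", 1, 3))   // first three characters as a number
```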
William Lisowski

Join Date: Dec 2014

Posts: 10150
#4

15 Jul 2020, 19:30

Here is a possibly amusing alternative approach that uses Stata's Unicode regular expression replacement function to insert a comma after each character in the string, after which the split command can work its magic.

Code:

. generate h416a = ustrregexra(h416,"(.)","$1,")

. split h416a, generate(Var) parse(",")
variables created as string:
Var1  Var2  Var3  Var4

. list, clean

       h416      h416a   Var1   Var2   Var3   Var4
  1.   123B   1,2,3,B,      1      2      3      B

The advantage to the solution in post #2 is that it uses truly basic Stata commands that everyone learns quickly as they learn Stata, and I expect it took Wouter less time to get a solution than it took me to look up the syntax of the two commands in the code - and were it not that I saw my first regular expression too many years ago, I would have spent serious time trying to get the match and replacement expression right. Indeed, I was shocked that it ran perfectly on my first attempt.

I post this only for the benefit of those experienced with regular expressions who may come across this post as the result of a search. I will add, as I always do when discussing regular expressions, that the real benefit of the Unicode regular expression functions is their much more powerful definition of regular expressions. To the best of my knowledge, only in the Statalist post linked here is it documented that Stata's new regular expression parser is the ICU regular expression engine documented at http://userguide.icu-project.org/strings/regexp.
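
As one illustration of that richer syntax, the pattern can be given a lookahead so that a comma is inserted only after characters that are followed by another character, which avoids the trailing comma. This is a sketch, assuming ICU's (?=...) lookahead behaves in ustrregexra() as the ICU guide describes; the names h416b and Part are made up for the example.

```stata
* Sketch: insert a comma between characters only, not after the last one.
* h416b and the Part stub are hypothetical names used for illustration.
clear
input str4 h416
"123B"
end

* (.)(?=.) matches a character only when another character follows it
generate h416b = ustrregexra(h416, "(.)(?=.)", "$1,")
split h416b, generate(Part) parse(",")
list, clean
```

For the original problem the trailing comma is harmless, so this is a matter of taste rather than a fix.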
Nick Cox

Join Date: Mar 2014

Posts: 35699
#5

16 Jul 2020, 01:17

When I originally wrote split I thought of including this kind of problem, but decided rightly or wrongly that it was a different problem -- because there are no separators -- and at best would require complicating the syntax when there are usually direct solutions. When the command was folded into official Stata the company went along with that. But I (and they) included a strong hint in the manual entry:

If your problem is not defined by splitting on separators, you will probably want to use substr()
directly. Suppose that you have a string variable, date, containing dates in the form "21011952" so
that the last four characters define a year. This string contains no separators. To extract the year, you
would use substr(date,-4,4). Again suppose that each woman’s obstetric history over the last 12
months was recorded by a str12 variable containing values such as "nppppppppbnn", where p, b,
and n denote months of pregnancy, birth, and nonpregnancy. Once more, there are no separators, so
you would use substr() to subdivide the string.
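
The two cases in that manual passage might be coded roughly as follows. This is a sketch only: the variable names history, year, and month1-month12 are assumptions made for the example, and the data are the two values quoted in the manual.

```stata
* Sketch of the two manual examples; variable names are illustrative.
clear
input str8 date str12 history
"21011952" "nppppppppbnn"
end

* the last four characters of date define the year
generate year = real(substr(date, -4, 4))

* one variable per month of the 12-month obstetric history
forvalues m = 1/12 {
    generate month`m' = substr(history, `m', 1)
}
list date year month1 month10 month12, clean
```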
Ko Chavula

Join Date: Mar 2017

Posts: 2
#6

30 Jul 2020, 01:15

In reply to William Lisowski's post #4:

This was helpful. It did the magic!! Thanks!