How to Splitting a string variable, which without commas or space, into two parts

Mei Mei Liu

Join Date: May 2015

Posts: 19
#1

How to Splitting a string variable, which without commas or space, into two parts

22 May 2016, 22:50

Dear Stata Users,

I encounter a data management problem, which is to split a variable into two parts.
I know I can use split command to seperate SexAge and Race. e.g. split SexAge_Race, parse(_)
But I don't know what command to seperate Sex and Age. There is no space or comma between Sex and Age..

My original variables look like this
subject_ID SexAge_Race

1 MALE41.2_White

2 FEMALE42.9_White

3 FEMALE38.5_White

4 FEMALE35.6_Hispanic

5 FEMALE48.5_White

I want my variables look like as follows:

subject_ID Sex Age Race
1 MALE 41.2 White
2 FEMALE 42.9 White
3 FEMALE 38.5 White
4 FEMALE 35.6 Hispanic
5 FEMALE 48.5 White

********************************************
Please give me some advice ! Thank You!!

Last edited by Mei Mei Liu; 22 May 2016, 23:40.
Tags: None

Nick Cox

Join Date: Mar 2014
Posts: 35703

22 May 2016, 23:46

Code:

clear 
input subect_ID str19 SexAge_Race
1    "MALE41.2_White"
2    "FEMALE42.9_White"
3    "FEMALE38.5_White"
4    "FEMALE35.6_Hispanic"
5    "FEMALE48.5_White"
end 

gen Sex = "FEMALE" if substr(SexA, 1, 6) == "FEMALE"
replace Sex = "MALE" if substr(SexA, 1, 4) == "MALE" 
gen work = subinstr(SexA, Sex, "",  .) 
split work, parse(_) destring 
rename (work1 work2) (Age Race) 
list 
edit if missing(Sex)

Comment

Nick Cox

Join Date: Mar 2014
Posts: 35703

23 May 2016, 01:14

Another way to do it is to extract each feature with a regular expression:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input float subject_ID str19 SexAge_Race
1 "MALE41.2_White"    
2 "FEMALE42.9_White"  
3 "FEMALE38.5_White"  
4 "FEMALE35.6_Hispanic"
5 "FEMALE48.5_White"  
end
gen Sex = regexs(1) if regexm(SexA, "(.*MALE)")
gen Age = regexs(1) if regexm(SexA, "([0-9]+\.?[0-9]*)")
destring Age, replace 
gen Race = regexs(1) if regexm(SexA, "_(.*)")
list

     +-----------------------------------------------------------+
     | subjec~D           SexAge_Race      Sex    Age       Race |
     |-----------------------------------------------------------|
  1. |        1        MALE41.2_White     MALE   41.2      White |
  2. |        2      FEMALE42.9_White   FEMALE   42.9      White |
  3. |        3      FEMALE38.5_White   FEMALE   38.5      White |
  4. |        4   FEMALE35.6_Hispanic   FEMALE   35.6   Hispanic |
  5. |        5      FEMALE48.5_White   FEMALE   48.5      White |
     +-----------------------------------------------------------+

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35703
#4

23 May 2016, 08:53

Just a further comment: split is an official command but I can speak on the original design. It's based on the premise that there are separators such as spaces, commas or other punctuation characters. Naturally there are also problems in which there aren't separators, as here.

I spent a while thinking about extra syntax for these extra cases, but decided not to complicate split too much. As the problem here shows, there is usually scope when split doesn't apply to part or all of the problem for one or both of two kinds of solutions, one hinging on functions such as strpos() substr() subinstr() and the other hinging on regular expression machinery.
Comment
Mei Mei Liu

Join Date: May 2015

Posts: 19
#5

24 May 2016, 15:45

Dear Nick,

Thank you so much for taking time to share this suggestion with me! It works and it's a huge help to me. Thank You!
Comment

Announcement

How to Splitting a string variable, which without commas or space, into two parts

Comment

Comment

Comment

Comment