How to use substring command

Thong Nguyen

Join Date: Oct 2015

Posts: 236
#1


14 Jul 2016, 00:32

Dear all,
I have a dataset that contains an id number with display format %6.3f. I would like to use the substring command to create a new variable holding the number before the decimal point '.'.
For example, suppose I have two subjects whose ids are

Code:

1. 23.149
2. 24.001

I want to create a variable cluster that equals 23 for the first subject and 24 for the second. The problem is that the integer part of the id ranges from 1 to 32, so it may have one or two digits; therefore I cannot use the following command

Code:

gen cluster=substr(id,1,2)

I read about and used the following command instead.

Code:

gen cluster=substr(id,-3,2)

I know that is the wrong code, but how do I fix it?

Thank you all in advance.

Last edited by Thong Nguyen; 14 Jul 2016, 00:41.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17712
#2

14 Jul 2016, 00:48

Thong:
you may want to try:

Code:

. set obs 2
. g string="23.149" in 1
. replace string="24.001" in 2
. split string, p(.)
. destring string1, g(cluster)
. drop string2

Kind regards,
Carlo
(Stata 19.0)
Thong Nguyen

Join Date: Oct 2015

Posts: 236
#3

14 Jul 2016, 02:05

I figured it out by myself.
tostring id, gen(x)
gen x2=x if nhom==1
split x2, p(.)
drop x22
ren x21 cluster
destring cluster, replace
Thong Nguyen

Join Date: Oct 2015

Posts: 236
#4

14 Jul 2016, 02:06

Dear Carlo,
Now I see your code. Thank you very much for your help.
Nick Cox

Join Date: Mar 2014

Posts: 35720
#5

14 Jul 2016, 02:38

Various small confusions here.

First, substr() is a function, not a command. Commands and functions are disjoint in Stata. See e.g. http://www.stata-journal.com/sjpdf.h...iclenum=dm0058 for a tutorial making that point and others.

As you have a numeric variable, e.g.

Code:

clear
input id
23.149
24.001
end
format id %6.3f

you cannot apply substr() directly, as you realised.
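If substr() is wanted all the same, one route is to work on a string copy of the identifier. The following is only a sketch under stated assumptions: the variable names sid and cluster_s are made up for illustration, and the third observation (1.500) is a hypothetical one-digit case added to show why a fixed length of 2 would not do.

```stata
clear
input id
23.149
24.001
 1.500
end
format id %6.3f

* substr(id, 1, 2) fails with a type mismatch, because id is numeric.
* Make a string copy using an explicit format, trimming any padding.
gen sid = strtrim(string(id, "%6.3f"))

* Take everything before the decimal point, however many digits that is.
gen cluster_s = substr(sid, 1, strpos(sid, ".") - 1)

list
```

Here strpos() finds the position of the decimal point, so the length passed to substr() adapts to one-digit and two-digit integer parts alike. The result is a string variable; wrap the expression in real() if a numeric result is needed.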

But there are easier solutions than posted. An integer identifier could be stored as such. Or you could hold it as string. For identifiers such as in your examples, there is not much in the choice. For long numeric identifiers, you need to be more careful.
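To make the remark about long identifiers concrete, here is a small sketch. The 10-digit value and the variable names (bigid, asfloat, aslong, strdefault, strfmt) are hypothetical and chosen only for illustration.

```stata
clear
set obs 1

* A hypothetical 10-digit identifier, held exactly as a double
gen double bigid = 1234567890

* A float cannot hold this many digits exactly, so the value is rounded
gen float asfloat = bigid

* A long holds integers of this size exactly
gen long aslong = bigid

* string() with its default format may not show every digit;
* an explicit format keeps them all
gen strdefault = string(bigid)
gen strfmt     = string(bigid, "%12.0f")

format bigid asfloat aslong %12.0f
list
```

Comparing asfloat with aslong, and strdefault with strfmt, shows where digits can be lost if the storage type or the format is left to defaults.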

Consider this direct approach:

Code:

gen numid = floor(id)
gen strid = string(floor(id))
list

     +------------------------+
     |     id   numid   strid |
     |------------------------|
  1. | 23.149      23      23 |
  2. | 24.001      24      24 |
     +------------------------+

For this kind of problem, destring and tostring are just over-elaborate. I have nothing against those commands (see the manual entry) but their main points are

1. Convenience. You can apply them to several variables at once, even all the variables in the dataset. (In each case, the command will ignore what is irrelevant.)

2. Security. Each command has extra bells and whistles to try to ensure that you don't lose information, unless you say you don't care.

Those points are not pertinent here. You **know** you want to ignore the digits after the decimal point. You **know** you want just a single new variable.
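For contrast, a situation where destring and tostring do earn their keep might look like the sketch below. The toy data and the variable names a, b and c are assumptions made purely for illustration.

```stata
clear
input str6 a str6 b str6 c
"23" "1.5" "x1"
"24" "2.5" "x2"
end

* One call handles several variables. c contains nonnumeric characters,
* so destring leaves it as a string rather than losing information.
destring a b c, replace

* The reverse direction, again for more than one variable at once
tostring a b, replace

describe
```

That is the convenience and security mentioned above: several variables at a time, with checks against silent loss of information. Neither is needed to pull the integer part out of one numeric variable.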
Thong Nguyen

Join Date: Oct 2015

Posts: 236
#6

14 Jul 2016, 02:46

Dear Nick,
Thank you for your help. You have shown me a better solution for this kind of problem, and I learned a lot.
Thong Nguyen

Join Date: Oct 2015

Posts: 236
#7

14 Jul 2016, 11:11

Dear Nick,
Thank you for the document you shared. It is really interesting, and I need to learn and master those functions.
