Split string variable

Linh mt

Join Date: May 2017

Posts: 33
#1

12 Jun 2019, 03:13

Hi everyone,

I have a string variable and I would like to extract the last word of each value. Say the variable contains the data below:
name
adam smith
julia jig
beffy mark jabcos
william adam beg tiffy

I want the last word of each name, that is, "smith", "jig", "jabcos" and "tiffy". The wanted result is:
name                      newname
adam smith                smith
julia jig                 jig
beffy mark jabcos         jabcos
william adam beg tiffy    tiffy

I have tried to use the command as below but it does not work:
-- egen newvar=ends(name) trim[last]

Could anyone help me solve this issue? Sorry if this question is a basic one. I've searched on Google but I have not found a solution yet.

Thank you a lot
Kind regards
Linh

(Edited: added more examples.)

Last edited by Linh mt; 12 Jun 2019, 03:33.
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17724
#2

12 Jun 2019, 03:19

Linh:
I do hope that the following toy-example will be useful:

Code:

. set obs 1
number of observations (_N) was 0, now 1

. g name="Stan Smith"

. split name
variables created as string:
name1  name2

. list

     +----------------------------+
     |       name   name1   name2 |
     |----------------------------|
  1. | Stan Smith    Stan   Smith |
     +----------------------------+

PS: Despite being a(n) (health) economist, actually, my education owes more to Stan Smith (https://en.wikipedia.org/wiki/Stan_Smith) than to Adam Smith (https://en.wikipedia.org/wiki/Adam_Smith)!

Last edited by Carlo Lazzaro; 12 Jun 2019, 03:23.

Kind regards,
Carlo
(Stata 19.0)
Nick Cox

Join Date: Mar 2014

Posts: 35754
#3

12 Jun 2019, 03:22

With the example given,

Code:

gen wanted = word(name, 2)

would also work.
Linh mt

Join Date: May 2017

Posts: 33
#4

12 Jun 2019, 03:36

Originally posted by Carlo Lazzaro

Linh:
I do hope that the following toy-example will be useful:

Code:

. set obs 1
number of observations (_N) was 0, now 1

. g name="Stan Smith"

. split name
variables created as string:
name1  name2

. list

     +----------------------------+
     |       name   name1   name2 |
     |----------------------------|
  1. | Stan Smith    Stan   Smith |
     +----------------------------+

PS: Despite being a(n) (health) economist, actually, my education owes more to Stan Smith (https://en.wikipedia.org/wiki/Stan_Smith) than to Adam Smith (https://en.wikipedia.org/wiki/Adam_Smith)!

Hi Carlo,

Thank you for your reply. Your advice is correct. However, I would like the last word of every name in the dataset to go into a single new variable. For more detail, please see the example data I have added to my first post. I am sorry for changing the example; I did so to avoid any misunderstanding.

Best regards
Linh
Linh mt

Join Date: May 2017

Posts: 33
#5

12 Jun 2019, 03:42

Originally posted by Nick Cox

With the example given,

Code:

gen wanted = word(name, 2)

would also work.

Dear Nick,

Thank you for your reply. My apologies for posting an example that did not cover other cases. I have just edited the example so that name contains 3, 4 or more words. In this case, what command should I use so that the last word of every name ends up in a newly generated variable? Could you please review my updated example in the first post?

Thank you very much
Kind regards
Linh
Nick Cox

Join Date: Mar 2014

Posts: 35754
#6

12 Jun 2019, 04:15

Code:

help word()

tells you about the function I used in #3.

word(s,n)
Description: the nth word in s; missing ("") if n is missing

Positive numbers count words from the beginning of s, and negative numbers count words
from the end of s. (1 is the first word in s, and -1 is the last word in s.) A word is a
set of characters that start and terminate with spaces. This is different from a Unicode
word, which is a language unit based on either a set of word-boundary rules or
dictionaries for several languages (Chinese, Japanese, and Thai).
Domain s: strings
Domain n: integers

Hence

Code:

gen wanted = word(name, -1)

is a more general suggestion.
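
A minimal sketch of this suggestion on the example data from #1. The data are typed in by hand here, and `wanted` simply follows the variable name used above:

```stata
clear
input str40 name
"adam smith"
"julia jig"
"beffy mark jabcos"
"william adam beg tiffy"
end

* a negative second argument makes word() count from the end of the string
gen wanted = word(name, -1)

list
```

With these four names, `wanted` should hold smith, jig, jabcos and tiffy, however many words precede the last one.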
Linh mt

Join Date: May 2017

Posts: 33
#7

12 Jun 2019, 04:43

Originally posted by Nick Cox

Code:

help word()

tells you about the function I used in #3.

Hence

Code:

gen wanted = word(name, -1)

is a more general suggestion.

Dear Nick,

It works now. My issue has been solved.
However, I had previously tried another syntax, "egen name1=ends(ten), last punct(" ")". Some observations come out correct but some are blank, for example:

name                            newname
adam smith                      smith
julia jig                       jig
beffy mark jabcos
william adam beg tiffy tiffy    tiffy

I do not know why this happens: the results are correct for most observations but not for some. Could you help me find where the problem is, please?

P.S. I don't know how to post an example in Stata format, so I just typed it manually. If you do not mind, could you tell me how to do this? I really appreciate your time and enthusiasm.

Thank you so much
Regards
Linh

Last edited by Linh mt; 12 Jun 2019, 04:47.

Marcos Almeida

Join Date: Apr 2014
Posts: 4047
#8

12 Jun 2019, 04:47

If I understood right, you wish something like:

Code:

. egen lastpart = ends(name), last

. list

     +-----------------------------------+
     |                   name   lastpart |
     |-----------------------------------|
  1. |             adam smith      smith |
  2. |              julia jig        jig |
  3. |      beffy mark jabcos     jabcos |
  4. | william adam beg tiffy      tiffy |
     +-----------------------------------+

Hopefully that helps.

Best regards,

Marcos


Nick Cox

Join Date: Mar 2014

Posts: 35754
#9

12 Jun 2019, 05:00

Please read the FAQ Advice at https://www.statalist.org/forums/help That gives the details you seek.
Linh mt

Join Date: May 2017

Posts: 33
#10

12 Jun 2019, 05:18

Originally posted by Marcos Almeida

If I understood right, you wish something like:

Code:

. egen lastpart = ends(name), last

. list

     +-----------------------------------+
     |                   name   lastpart |
     |-----------------------------------|
  1. |             adam smith      smith |
  2. |              julia jig        jig |
  3. |      beffy mark jabcos     jabcos |
  4. | william adam beg tiffy      tiffy |
     +-----------------------------------+

Hopefully that helps.

Hi Marcos,
You understood correctly. However, when I tried that syntax, some observations of lastpart were blank, whereas the syntax (gen lastpart=word(name,-1)) works perfectly. I don't know what is wrong with the syntax (egen lastpart = ends(name), last).

Thank you for your reply
Regards
Linh
Carlo Lazzaro

Join Date: Apr 2014

Posts: 17724
#11

12 Jun 2019, 05:21

Linh:
probably you have a leading/trailing blanks issue with some of your observations.
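
One way to check this diagnosis, as a rough sketch: flag the observations whose value changes when leading and trailing blanks are stripped. The variable name `has_blanks` is made up for illustration, and `name` stands for whatever the string variable is called in your data:

```stata
* 1 if stripping leading/trailing blanks changes the value, 0 otherwise
gen byte has_blanks = name != trim(name)

* how many observations are affected, and which ones
count if has_blanks
list name if has_blanks
```

If the count is zero, ordinary leading or trailing blanks are not the cause and the problem lies elsewhere.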

Kind regards,
Carlo
(Stata 19.0)

Marcos Almeida

Join Date: Apr 2014
Posts: 4047

#12

12 Jun 2019, 07:56

As Carlo pointed out, you must have blank spaces. Therefore, you need to trim before applying the code above.

Look at the example below, before and after trimming:

Code:

. input str40 name

                                         name
  1. "adam smith"
  2. "julia jig"
  3. "beffy mark jabcos"
  4. "william adam beg tiffy"
  5. "william adam beg tiffy   "
  6. end

. egen lastpartnotrim = ends(name), last
(1 missing value generated)

. list

     +--------------------------------------+
     |                      name   lastpa~m |
     |--------------------------------------|
  1. |                adam smith      smith |
  2. |                 julia jig        jig |
  3. |         beffy mark jabcos     jabcos |
  4. |    william adam beg tiffy      tiffy |
  5. | william adam beg tiffy               |
     +--------------------------------------+

. gen name2 = trim(name)

. egen lastparttrimmed1 = ends(name2), last

. list

     +--------------------------------------------------------------------------+
     |                      name   lastpa~m                    name2   lastpa~1 |
     |--------------------------------------------------------------------------|
  1. |                adam smith      smith               adam smith      smith |
  2. |                 julia jig        jig                julia jig        jig |
  3. |         beffy mark jabcos     jabcos        beffy mark jabcos     jabcos |
  4. |    william adam beg tiffy      tiffy   william adam beg tiffy      tiffy |
  5. | william adam beg tiffy                 william adam beg tiffy      tiffy |
     +--------------------------------------------------------------------------+
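
A possible shortcut, offered as a sketch rather than a tested recipe for Linh's data: the ends() function of -egen- is documented with a trim option that strips leading and trailing blanks before the string is parsed, so the extra variable may not be needed. Please check -help egen- for your version. The name `lastpart` is only illustrative:

```stata
* trim removes leading and trailing blanks before the last word is extracted
egen lastpart = ends(name), last trim

list name lastpart
```

This leaves `name` itself unchanged, which may matter if the original values need to be kept exactly as entered.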

Hopefully that helps.

Best regards,

Marcos


Linh mt

Join Date: May 2017

Posts: 33
#13

12 Jun 2019, 23:06

Originally posted by Marcos Almeida

As Carlo pointed out, you must have blank spaces. Therefore, you need to trim before applying the code above.

Look at the example below, before and after trimming:

Code:

. input str40 name

                                         name
  1. "adam smith"
  2. "julia jig"
  3. "beffy mark jabcos"
  4. "william adam beg tiffy"
  5. "william adam beg tiffy   "
  6. end

. egen lastpartnotrim = ends(name), last
(1 missing value generated)

. list

     +--------------------------------------+
     |                      name   lastpa~m |
     |--------------------------------------|
  1. |                adam smith      smith |
  2. |                 julia jig        jig |
  3. |         beffy mark jabcos     jabcos |
  4. |    william adam beg tiffy      tiffy |
  5. | william adam beg tiffy               |
     +--------------------------------------+

. gen name2 = trim(name)

. egen lastparttrimmed1 = ends(name2), last

. list

     +--------------------------------------------------------------------------+
     |                      name   lastpa~m                    name2   lastpa~1 |
     |--------------------------------------------------------------------------|
  1. |                adam smith      smith               adam smith      smith |
  2. |                 julia jig        jig                julia jig        jig |
  3. |         beffy mark jabcos     jabcos        beffy mark jabcos     jabcos |
  4. |    william adam beg tiffy      tiffy   william adam beg tiffy      tiffy |
  5. | william adam beg tiffy                 william adam beg tiffy      tiffy |
     +--------------------------------------------------------------------------+

Hopefully that helps.

Hi Marcos and Carlo,

Yes, it is exactly as you pointed out. My problem is solved. However, after trimming, name2 looks no different from name in my dataset; please see the listing below. Anyway, my aim is achieved.

PHP Code:

ti   h   xa   hoso   name                name2
 2  30  919    200   Triệu Văn Huyện     Triệu Văn Huyện
 2  30  919    200   Triệu Văn Hồng      Triệu Văn Hồng
 2  30  919    201   Nguyễn Như Thế      Nguyễn Như Thế
 2  30  919    201   Tấn Thị Hoa         Tấn Thị Hoa
 2  30  919    201   Nguyễn Văn Ngọc     Nguyễn Văn Ngọc
 2  30  919    201   Nguyễn Thị Phương   Nguyễn Thị Phương

Thank you all
Linh

Last edited by Linh mt; 12 Jun 2019, 23:10.

Carlo Lazzaro

Join Date: Apr 2014
Posts: 17724

#14

13 Jun 2019, 02:13

Linh:
exploiting Marcos' helpful code, what if you -trim- -name- itself before applying -egen-, rather than creating -name2-?:

Code:

. input str40 name

                                         name
  1.   "adam smith"
  2.  "julia jig"
  3. "beffy mark jabcos"
  4. "william adam beg tiffy"
  5.  "william adam beg tiffy   "
  6.  end

. replace name = trim(name)
(1 real change made)

. egen lastparttrimmed1 = ends(name), last

. list

     +-----------------------------------+
     |                   name   lastpa~1 |
     |-----------------------------------|
  1. |             adam smith      smith |
  2. |              julia jig        jig |
  3. |      beffy mark jabcos     jabcos |
  4. | william adam beg tiffy      tiffy |
  5. | william adam beg tiffy      tiffy |
     +-----------------------------------+

Kind regards,
Carlo
(Stata 19.0)


Marcos Almeida

Join Date: Apr 2014

Posts: 4047
#15

13 Jun 2019, 06:17

Carlo already clarified the issue.

That said, when you say that there is no difference between name and name2, I believe you meant the names. But if you observe attentively, you will see differences concerning blank spaces throughout the variables: before, after and in-between.
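
Since blanks are hard to spot by eye in a listing, a direct comparison may help. This is only a sketch, assuming the variables are called `name` and `name2` as in #13; `len_name` and `len_name2` are made-up names:

```stata
* number of observations where trimming actually changed the value
count if name != name2

* trimmed values are shorter, so the lengths expose the difference
gen len_name  = strlen(name)
gen len_name2 = strlen(name2)
list name len_name name2 len_name2 if len_name != len_name2
```

Note that strlen() counts bytes rather than characters, so accented names report larger lengths than their visible width; the comparison between the two variables is unaffected.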

Best regards,

Marcos
