split

Taqi Nabizada

Join Date: Mar 2020

Posts: 12
#1

24 Apr 2025, 00:46

I have a variable in which each observation holds a different combination of options, often more than two. I want to split this variable into as many variables as there are options in each observation. There is no separator between the options, and some observations are empty. I used this command (split Q45, parse("")gen (sim_)), but it creates only one variable, identical to the original. A sample of the variable is below. Can anyone please help me?
Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str24 Q45
"abcd "
"acde "
""
"ace "
""
"c "
"ce "
"abcde "
"bcd "
"cd "
""
""
"c "
"ace "
""
"cd "
"acde "
"c "
"c "
"abcd "
end


Andrew Musau

Join Date: Oct 2014
Posts: 10098

24 Apr 2025, 05:12

Your post is not clear, but I will assume that a letter represents an option. In this case, creating indicators may be preferable to splitting the options.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str24 Q45
"abcd                    "
"acde                    "
""                        
"ace                     "
""                        
"c                       "
"ce                      "
"abcde                   "
"bcd                     "
"cd                      "
""                        
""                        
"c                       "
"ace                     "
""                        
"cd                      "
"acde                    "
"c                       "
"c                       "
"abcd                    "
end

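* Loop over the option letters (assumes every option is one of a-e)
foreach opt in a b c d e {
    * 0/1 indicator: 1 if the letter appears anywhere in Q45
    gen `opt' = regexm(Q45, "`opt'")
    * String version: the letter itself if present, empty otherwise
    gen sim_`opt' = "`opt'" if regexm(Q45, "`opt'")
}
* Put Q45 first, then the one-letter indicators, then the sim_ variables
order Q45 ? sim_*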

Res.:

Code:

. l, sep(0)

     +--------------------------------------------------------------------------------------+
     |                      Q45   a   b   c   d   e   sim_a   sim_b   sim_c   sim_d   sim_e |
     |--------------------------------------------------------------------------------------|
  1. | abcd                       1   1   1   1   0       a       b       c       d         |
  2. | acde                       1   0   1   1   1       a               c       d       e |
  3. |                            0   0   0   0   0                                         |
  4. | ace                        1   0   1   0   1       a               c               e |
  5. |                            0   0   0   0   0                                         |
  6. | c                          0   0   1   0   0                       c                 |
  7. | ce                         0   0   1   0   1                       c               e |
  8. | abcde                      1   1   1   1   1       a       b       c       d       e |
  9. | bcd                        0   1   1   1   0               b       c       d         |
 10. | cd                         0   0   1   1   0                       c       d         |
 11. |                            0   0   0   0   0                                         |
 12. |                            0   0   0   0   0                                         |
 13. | c                          0   0   1   0   0                       c                 |
 14. | ace                        1   0   1   0   1       a               c               e |
 15. |                            0   0   0   0   0                                         |
 16. | cd                         0   0   1   1   0                       c       d         |
 17. | acde                       1   0   1   1   1       a               c       d       e |
 18. | c                          0   0   1   0   0                       c                 |
 19. | c                          0   0   1   0   0                       c                 |
 20. | abcd                       1   1   1   1   0       a       b       c       d         |
     +--------------------------------------------------------------------------------------+

.

Taqi Nabizada

Join Date: Mar 2020

Posts: 12
#3

28 Apr 2025, 01:12

Thank you Andrew Musau a million. Please accept my apologies for being ambiguous. You perceived my aim very well and your command worked well. Thank you again.
Nick Cox

Join Date: Mar 2014

Posts: 35458
#4

28 Apr 2025, 02:29

As documented in its manual entry, split is an official command based on code I wrote (which in turn was based on code written jointly with Michael Blasnik).

So, I can speak on the original intent and broad implementation here, on which the original community-contributed version and the current official version do not differ.

As the manual entry explains

split is used to split a string variable into two or more component parts, for example, “words”. You
might need to correct a mistake, or the string variable might be a genuine composite that you wish to
subdivide before doing more analysis.

The basic steps applied by split are, given one or more separators, to find those separators within
the string and then to generate one or more new string variables, each containing a part of the original.
The separators could be, for example, spaces or other punctuation symbols, but they can in turn be strings
containing several characters. The default separator is a space.

The key string functions for subdividing string variables and, indeed, strings in general, are strpos(),
which finds the position of separators, and substr(), which extracts parts of the string. (See [FN] String
functions.) split is based on the use of those functions.

If your problem is not defined by splitting on separators, you will probably want to use substr()
directly. Suppose that you have a string variable, date, containing dates in the form “21011952” so that
the last four characters define a year. This string contains no separators. To extract the year, you would
use substr(date,-4,4). Again suppose that each woman’s obstetric history over the last 12 months
was recorded by a str12 variable containing values such as “nppppppppbnn”, where p, b, and n denote
months of pregnancy, birth, and nonpregnancy. Once more, there are no separators, so you would use
substr() to subdivide the string.

split discards the separators, because it presumes that they are irrelevant to further analysis or that
you could restore them at will. If this is not what you want, you might use substr() (and possibly
strpos()).
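
The manual's date example can be sketched as follows. This is only an illustration with made-up values; the variable names year and year_num are assumptions, not part of the manual text.

```stata
clear
input str8 date
"21011952"
"03071968"
end

* The last four characters define the year; there is no separator to split on
gen year = substr(date, -4, 4)

* Optional: a numeric copy of the year for further analysis
destring year, gen(year_num)

list
```
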

As explained there, and more tersely in the help, split by default parses on spaces. You tried to specify an
empty string as a separator, but split has a hard job of distinguishing between specifying nothing as a separator and
not specifying anything as a separator, which it overrides by deciding that spaces are to be separators.
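
A minimal sketch of that default, using made-up data. The variable names s, t, part and tpart are hypothetical and chosen only for illustration.

```stata
clear
input str12 s
"a b c"
"d e"
"f"
end

* No parse() option: split parses on spaces and creates part1, part2, part3
split s, gen(part)

* An explicit separator: turn the spaces into commas, then parse on commas
gen t = subinstr(s, " ", ",", .)
split t, parse(,) gen(tpart)

list
```

In both calls the separators themselves are discarded, so only the pieces between them end up in the new variables.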

Had your syntax worked, it would not have been helpful, I guess, as you would have ended up with something like this:

Code:

clear
input str24 Q45
"abcd "
"acde "
""
"ace "
""
"c "
"ce "
"abcde "
"bcd "
"cd "
""
""
"c "
"ace "
""
"cd "
"acde "
"c "
"c "
"abcd "
end

forval j = 1/24 {
    gen wanted`j' = substr(Q45, `j', 1)
}

ds wanted*

list wanted1-wanted5

     +-------------------------------------------------+
     | wanted1   wanted2   wanted3   wanted4   wanted5 |
     |-------------------------------------------------|
  1. |       a         b         c         d           |
  2. |       a         c         d         e           |
  3. |                                                 |
  4. |       a         c         e                     |
  5. |                                                 |
     |-------------------------------------------------|
  6. |       c                                         |
  7. |       c         e                               |
  8. |       a         b         c         d         e |
  9. |       b         c         d                     |
 10. |       c         d                               |
     |-------------------------------------------------|
 11. |                                                 |
 12. |                                                 |
 13. |       c                                         |
 14. |       a         c         e                     |
 15. |                                                 |
     |-------------------------------------------------|
 16. |       c         d                               |
 17. |       a         c         d         e           |
 18. |       c                                         |
 19. |       c                                         |
 20. |       a         b         c         d           |
     +-------------------------------------------------+

I haven't listed the last 19 variables created, which seem useless.

(Conversely, had the order within the data been other than alphabetical, that syntax might have been useful.)

It's been a while (split was made official and was off my hands in Stata 8), but I do remember clearly from writing it twenty and more years ago that there was a case for extending split to cover splitting strings without separators, and a case against it, as that implies much more complicated syntax. I went with not including it, and StataCorp didn't change that. In my experience problems with no separators usually call for use of substr(), for date functions, for regular expression syntax or for direct creation of indicator variables. Andrew Musau's code suggestions are an excellent example.

This would have been another way to get (0, 1) indicators:

Code:

foreach w in a b c d e {
    gen `w' = strpos(Q45, "`w'") > 0
}

Last edited by Nick Cox; 28 Apr 2025, 03:20.