  • split

    I have a variable in which each observation holds a different combination of options, often more than two. I want to split this variable into as many variables as there are options in an observation. There is no separator between options, and some observations are empty. I used this command (split Q45, parse("")gen (sim_)), but it creates only one variable, identical to the original. A sample of the variable is below. Can anyone please help?
    ----------------------- copy starting from the next line -----------------------
    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str24 Q45
    "abcd                    "
    "acde                    "
    ""                        
    "ace                     "
    ""                        
    "c                       "
    "ce                      "
    "abcde                   "
    "bcd                     "
    "cd                      "
    ""                        
    ""                        
    "c                       "
    "ace                     "
    ""                        
    "cd                      "
    "acde                    "
    "c                       "
    "c                       "
    "abcd                    "
    end
    ------------------ copy up to and including the previous line ------------------


  • #2
    Your post is not clear, but I will assume that a letter represents an option. In this case, creating indicators may be preferable to splitting the options.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input str24 Q45
    "abcd                    "
    "acde                    "
    ""                        
    "ace                     "
    ""                        
    "c                       "
    "ce                      "
    "abcde                   "
    "bcd                     "
    "cd                      "
    ""                        
    ""                        
    "c                       "
    "ace                     "
    ""                        
    "cd                      "
    "acde                    "
    "c                       "
    "c                       "
    "abcd                    "
    end
    
    foreach opt in a b c d e {
        gen `opt'= regexm(Q45, "`opt'")
        gen sim_`opt'= "`opt'" if regexm(Q45, "`opt'")
    }
    order Q45 ? sim_*
    Results:

    Code:
    . l, sep(0)
    
         +--------------------------------------------------------------------------------------+
         |                      Q45   a   b   c   d   e   sim_a   sim_b   sim_c   sim_d   sim_e |
         |--------------------------------------------------------------------------------------|
      1. | abcd                       1   1   1   1   0       a       b       c       d         |
      2. | acde                       1   0   1   1   1       a               c       d       e |
      3. |                            0   0   0   0   0                                         |
      4. | ace                        1   0   1   0   1       a               c               e |
      5. |                            0   0   0   0   0                                         |
      6. | c                          0   0   1   0   0                       c                 |
      7. | ce                         0   0   1   0   1                       c               e |
      8. | abcde                      1   1   1   1   1       a       b       c       d       e |
      9. | bcd                        0   1   1   1   0               b       c       d         |
     10. | cd                         0   0   1   1   0                       c       d         |
     11. |                            0   0   0   0   0                                         |
     12. |                            0   0   0   0   0                                         |
     13. | c                          0   0   1   0   0                       c                 |
     14. | ace                        1   0   1   0   1       a               c               e |
     15. |                            0   0   0   0   0                                         |
     16. | cd                         0   0   1   1   0                       c       d         |
     17. | acde                       1   0   1   1   1       a               c       d       e |
     18. | c                          0   0   1   0   0                       c                 |
     19. | c                          0   0   1   0   0                       c                 |
     20. | abcd                       1   1   1   1   0       a       b       c       d         |
         +--------------------------------------------------------------------------------------+
    
    .



    • #3
      Thanks a million, Andrew Musau. I apologize for being ambiguous. You understood my aim perfectly, and your code worked well. Thank you again.



      • #4
        As documented in its manual entry, split is an official command based on code I wrote (which in turn was based on code written jointly with Michael Blasnik).

        So, I can speak on the original intent and broad implementation here, on which the original community-contributed version and the current official version do not differ.

        As the manual entry explains:

        split is used to split a string variable into two or more component parts, for example, “words”. You
        might need to correct a mistake, or the string variable might be a genuine composite that you wish to
        subdivide before doing more analysis.

        The basic steps applied by split are, given one or more separators, to find those separators within
        the string and then to generate one or more new string variables, each containing a part of the original.
        The separators could be, for example, spaces or other punctuation symbols, but they can in turn be strings
        containing several characters. The default separator is a space.

        The key string functions for subdividing string variables and, indeed, strings in general, are strpos(),
        which finds the position of separators, and substr(), which extracts parts of the string. (See [FN] String
        functions.) split is based on the use of those functions.

        If your problem is not defined by splitting on separators, you will probably want to use substr()
        directly. Suppose that you have a string variable, date, containing dates in the form “21011952” so that
        the last four characters define a year. This string contains no separators. To extract the year, you would
        use substr(date,-4,4). Again suppose that each woman’s obstetric history over the last 12 months
        was recorded by a str12 variable containing values such as “nppppppppbnn”, where p, b, and n denote
        months of pregnancy, birth, and nonpregnancy. Once more, there are no separators, so you would use
        substr() to subdivide the string.

        split discards the separators, because it presumes that they are irrelevant to further analysis or that
        you could restore them at will. If this is not what you want, you might use substr() (and possibly
        strpos()).

        As explained there, and more tersely in the help, split by default parses on spaces. You tried to specify an
        empty string as a separator, but split has a hard time distinguishing between specifying an empty string as a separator and
        not specifying a separator at all; it resolves this by deciding that spaces are to be the separators.
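
        A minimal sketch of that default behaviour, using two made-up variables (withsep and nosep are hypothetical names, not part of the original data): with a real separator, split should create one variable per part, while a value with no spaces inside it should come back as a single new variable.

        ```stata
        clear
        input str10 withsep str10 nosep
        "a,c,e" "ace"
        "b,d"   "bd"
        end

        * commas are real separators: expect withsep1, withsep2, withsep3
        split withsep, parse(,)

        * no parse() given, so spaces are the separators; the values contain
        * none, so expect a single new variable, sim_1, that copies nosep
        split nosep, generate(sim_)

        describe withsep* sim_*
        ```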

        Had your syntax worked, it would not have been helpful, I guess, as you would have ended up with something like this:

        Code:
        clear
        input str24 Q45
        "abcd                    "
        "acde                    "
        ""                        
        "ace                     "
        ""                        
        "c                       "
        "ce                      "
        "abcde                   "
        "bcd                     "
        "cd                      "
        ""                        
        ""                        
        "c                       "
        "ace                     "
        ""                        
        "cd                      "
        "acde                    "
        "c                       "
        "c                       "
        "abcd                    "
        end
        
        forval j = 1/24 {
            gen wanted`j' = substr(Q45, `j', 1)
        }
        
        ds wanted*
        
        list wanted1-wanted5
        
             +-------------------------------------------------+
             | wanted1   wanted2   wanted3   wanted4   wanted5 |
             |-------------------------------------------------|
          1. |       a         b         c         d           |
          2. |       a         c         d         e           |
          3. |                                                 |
          4. |       a         c         e                     |
          5. |                                                 |
             |-------------------------------------------------|
          6. |       c                                         |
          7. |       c         e                               |
          8. |       a         b         c         d         e |
          9. |       b         c         d                     |
         10. |       c         d                               |
             |-------------------------------------------------|
         11. |                                                 |
         12. |                                                 |
         13. |       c                                         |
         14. |       a         c         e                     |
         15. |                                                 |
             |-------------------------------------------------|
         16. |       c         d                               |
         17. |       a         c         d         e           |
         18. |       c                                         |
         19. |       c                                         |
         20. |       a         b         c         d           |
             +-------------------------------------------------+
        I haven't listed the last 19 variables created, which seem useless.

        (Conversely, had the letters within each value been in some order other than alphabetical, that syntax might have been useful.)

        It's been a while (split was made official and was off my hands in Stata 8), but I do remember clearly from writing it twenty and more years ago that there was a case for extending split to cover splitting strings without separators, and a case against it, as that implies much more complicated syntax. I went with not including it, and StataCorp didn't change that. In my experience, problems with no separators usually call for use of substr(), for date functions, for regular expression syntax, or for direct creation of indicator variables. Andrew Musau's code suggestions are an excellent example.

        This would have been another way to get (0, 1) indicators:


        Code:
        foreach w in a b c d e {
            gen `w' = strpos(Q45, "`w'") > 0
        }
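
        A small sketch, assuming that a through e are the only option letters, that cross-checks the strpos() approach against the regexm() test used in #2 and then counts the options chosen in each observation. The variable name n_chosen is hypothetical, and only a few of the original observations are used.

        ```stata
        clear
        input str24 Q45
        "abcd"
        ""
        "ce"
        "c"
        end

        foreach w in a b c d e {
            * 1 if the letter occurs anywhere in Q45, 0 otherwise
            gen byte `w' = strpos(Q45, "`w'") > 0
            * regexm() also returns 1 or 0, so the two tests should agree
            assert `w' == regexm(Q45, "`w'")
        }

        * number of options selected per observation (0 for empty strings)
        egen n_chosen = rowtotal(a b c d e)
        list, sep(0)
        ```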
        Last edited by Nick Cox; 28 Apr 2025, 03:20.
