Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to Splitting a string variable, which without commas or space, into two parts

    Dear Stata Users,

    I encounter a data management problem, which is to split a variable into two parts.
    I know I can use split command to seperate SexAge and Race. e.g. split SexAge_Race, parse(_)
    But I don't know what command to seperate Sex and Age. There is no space or comma between Sex and Age..

    My original variables look like this
    subject_ID SexAge_Race
    1 MALE41.2_White
    2 FEMALE42.9_White
    3 FEMALE38.5_White
    4 FEMALE35.6_Hispanic
    5 FEMALE48.5_White
    I want my variables look like as follows:

    subject_ID Sex Age Race
    1 MALE 41.2 White
    2 FEMALE 42.9 White
    3 FEMALE 38.5 White
    4 FEMALE 35.6 Hispanic
    5 FEMALE 48.5 White

    ********************************************
    Please give me some advice ! Thank You!!
    Last edited by Mei Mei Liu; 22 May 2016, 23:40.

  • #2
    Code:
    clear 
    input subect_ID str19 SexAge_Race
    1    "MALE41.2_White"
    2    "FEMALE42.9_White"
    3    "FEMALE38.5_White"
    4    "FEMALE35.6_Hispanic"
    5    "FEMALE48.5_White"
    end 
    
    gen Sex = "FEMALE" if substr(SexA, 1, 6) == "FEMALE"
    replace Sex = "MALE" if substr(SexA, 1, 4) == "MALE" 
    gen work = subinstr(SexA, Sex, "",  .) 
    split work, parse(_) destring 
    rename (work1 work2) (Age Race) 
    list 
    edit if missing(Sex)

    Comment


    • #3
      Another way to do it is to extract each feature with a regular expression:

      Code:
      * Example generated by -dataex-. To install: ssc install dataex
      clear
      input float subject_ID str19 SexAge_Race
      1 "MALE41.2_White"    
      2 "FEMALE42.9_White"  
      3 "FEMALE38.5_White"  
      4 "FEMALE35.6_Hispanic"
      5 "FEMALE48.5_White"  
      end
      gen Sex = regexs(1) if regexm(SexA, "(.*MALE)")
      gen Age = regexs(1) if regexm(SexA, "([0-9]+\.?[0-9]*)")
      destring Age, replace 
      gen Race = regexs(1) if regexm(SexA, "_(.*)")
      list
      
           +-----------------------------------------------------------+
           | subjec~D           SexAge_Race      Sex    Age       Race |
           |-----------------------------------------------------------|
        1. |        1        MALE41.2_White     MALE   41.2      White |
        2. |        2      FEMALE42.9_White   FEMALE   42.9      White |
        3. |        3      FEMALE38.5_White   FEMALE   38.5      White |
        4. |        4   FEMALE35.6_Hispanic   FEMALE   35.6   Hispanic |
        5. |        5      FEMALE48.5_White   FEMALE   48.5      White |
           +-----------------------------------------------------------+

      Comment


      • #4
        Just a further comment: split is an official command but I can speak on the original design. It's based on the premise that there are separators such as spaces, commas or other punctuation characters. Naturally there are also problems in which there aren't separators, as here.

        I spent a while thinking about extra syntax for these extra cases, but decided not to complicate split too much. As the problem here shows, there is usually scope when split doesn't apply to part or all of the problem for one or both of two kinds of solutions, one hinging on functions such as strpos() substr() subinstr() and the other hinging on regular expression machinery.

        Comment


        • #5
          Dear Nick,

          Thank you so much for taking time to share this suggestion with me! It works and it's a huge help to me. Thank You!

          Comment

          Working...
          X