Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • extract piece of a string variable which is always in the same position

    Dear STATA users
    I am interested in a specific part of a string, which is always in the same position; I have to extract it and input it in a new string.
    In particular, I am interested in the psycotherapists' title which is highlighted in red

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str92 Title
    "namesurname|||Counselor,||MA,||LPCC-S,|||NCC,||RPT-S,||TCADC||||"          
    "name-surname|||Counselor,||LCPC|||"                                               
    "name&surn,ame|||ClinicalSocialWork/Therapist,||MSW,||LCSW,||ADS||||"
    "name,surn-ame|||LicensedProfessionalCounselor,|||MA,||LPC,||NCC|||"                         
    "name.s;urname|||LicensedProfessionalCounselor,|||MA,||LPC|||"                          
    end
    Note that within the part of the string which occupied by name and surname, there might be punctuation or other sybmbols--& (this is because name and surname might be the name of the therapists' practice and not exclusively their own name); however, there is never a blank space.

    Based on some useful online examples I came up with the following:

    Code:
    gen TitleClean = regexs(1) if regexm(Title, "$.*[|||](.*)[,||].*")
    In my mind, the regexm function should take on the first combination of any character and length (from the left--this is why there is $ at the beginning), that:
    1. comes after any combination of characters ".*" followed by "|||"
    2. comes before a comma and two bars ",||" , which is followed by any combination of charracters ".*"
    Evidently, what I have written is wrong and returns a fully blank string (i.e. the new TitleClean variable is an empty string variable for each observation in the dataset)

    Differently, the new string should look like follows

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str92 TitleClean
    "Counselor"          
    "Counselor"                                               
    "ClinicalSocialWork/Therapist"
    "LicensedProfessionalCounselor"                         
    "LicensedProfessionalCounselor"                          
    end

    Any idea on how to build TitleClean?

  • #2
    This gets you some way with your problem for your example:


    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str92 Title
    "namesurname|||Counselor,||MA,||LPCC-S,|||NCC,||RPT-S,||TCADC||||"          
    "name-surname|||Counselor,||LCPC|||"                                               
    "name&surn,ame|||ClinicalSocialWork/Therapist,||MSW,||LCSW,||ADS||||"
    "name,surn-ame|||LicensedProfessionalCounselor,|||MA,||LPC,||NCC|||"                         
    "name.s;urname|||LicensedProfessionalCounselor,|||MA,||LPC|||"                          
    end
    
    gen work = subinstr(Title, ",", "", .) 
    replace work = subinstr(work, "|||", "|", .) 
    replace work = subinstr(work, "||", "|", .) 
    split work, parse(|) 
    
    list work? 
    
         +----------------------------------------------------------------------------------------+
         |         work1                           work2   work3    work4   work5   work6   work7 |
         |----------------------------------------------------------------------------------------|
      1. |   namesurname                       Counselor      MA   LPCC-S     NCC   RPT-S   TCADC |
      2. |  name-surname                       Counselor    LCPC                                  |
      3. |  name&surname    ClinicalSocialWork/Therapist     MSW     LCSW     ADS                 |
      4. |  namesurn-ame   LicensedProfessionalCounselor      MA      LPC     NCC                 |
      5. | name.s;urname   LicensedProfessionalCounselor      MA      LPC                         |
         +----------------------------------------------------------------------------------------+

    Comment


    • #3
      Code:
      gen TitleClean = substr(Title, strpos(Title, "|||")+3, .)
      replace TitleClean = substr(TitleClean, 1, strpos(TitleClean, "|")-2)
      This works on your data sample but it relies on the title always coming after the first occurrence of "|||" and ending with ",|". I don't know if this is the case for your whole dataset.

      Comment


      • #4
        Thank you Nick Cox and Wouter Wakker .

        Based on this suggestion I came up with an additional alternative solution.
        Code:
        split Title, parse(",||")                      
        gen TitleClean = regexs(1) if regexm(Title1, ".*[|||](.*)$")
        However, of the three alternative solutions posted here, I prefer Nick's as in that way I get also clean data to build a possible "degree/certificate" variable later on.
        Last edited by FLuca; 10 Oct 2019, 11:28.

        Comment

        Working...
        X