extract piece of a string variable which is always in the same position

FLuca

Join Date: May 2014

Posts: 35
#1

extract piece of a string variable which is always in the same position

10 Oct 2019, 10:25

Dear STATA users
I am interested in a specific part of a string, which is always in the same position; I have to extract it and input it in a new string.
In particular, I am interested in the psycotherapists' title which is highlighted in red

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input str92 Title "namesurname|||Counselor,||MA,||LPCC-S,|||NCC,||RPT-S,||TCADC||||" "name-surname|||Counselor,||LCPC|||" "name&surn,ame|||ClinicalSocialWork/Therapist,||MSW,||LCSW,||ADS||||" "name,surn-ame|||LicensedProfessionalCounselor,|||MA,||LPC,||NCC|||" "name.s;urname|||LicensedProfessionalCounselor,|||MA,||LPC|||" end

Note that within the part of the string which occupied by name and surname, there might be punctuation or other sybmbols--& (this is because name and surname might be the name of the therapists' practice and not exclusively their own name); however, there is never a blank space.

Based on some useful online examples I came up with the following:

Code:

gen TitleClean = regexs(1) if regexm(Title, "$.*[|||](.*)[,||].*")

In my mind, the regexm function should take on the first combination of any character and length (from the left--this is why there is $ at the beginning), that:
comes after any combination of characters ".*" followed by "|||"

comes before a comma and two bars ",||" , which is followed by any combination of charracters ".*"

Evidently, what I have written is wrong and returns a fully blank string (i.e. the new TitleClean variable is an empty string variable for each observation in the dataset)

Differently, the new string should look like follows

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input str92 TitleClean "Counselor" "Counselor" "ClinicalSocialWork/Therapist" "LicensedProfessionalCounselor" "LicensedProfessionalCounselor" end

Any idea on how to build TitleClean?
Tags: None

Nick Cox

Join Date: Mar 2014
Posts: 35724

10 Oct 2019, 10:54

This gets you some way with your problem for your example:

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input str92 Title
"namesurname|||Counselor,||MA,||LPCC-S,|||NCC,||RPT-S,||TCADC||||"          
"name-surname|||Counselor,||LCPC|||"                                               
"name&surn,ame|||ClinicalSocialWork/Therapist,||MSW,||LCSW,||ADS||||"
"name,surn-ame|||LicensedProfessionalCounselor,|||MA,||LPC,||NCC|||"                         
"name.s;urname|||LicensedProfessionalCounselor,|||MA,||LPC|||"                          
end

gen work = subinstr(Title, ",", "", .) 
replace work = subinstr(work, "|||", "|", .) 
replace work = subinstr(work, "||", "|", .) 
split work, parse(|) 

list work? 

     +----------------------------------------------------------------------------------------+
     |         work1                           work2   work3    work4   work5   work6   work7 |
     |----------------------------------------------------------------------------------------|
  1. |   namesurname                       Counselor      MA   LPCC-S     NCC   RPT-S   TCADC |
  2. |  name-surname                       Counselor    LCPC                                  |
  3. |  name&surname    ClinicalSocialWork/Therapist     MSW     LCSW     ADS                 |
  4. |  namesurn-ame   LicensedProfessionalCounselor      MA      LPC     NCC                 |
  5. | name.s;urname   LicensedProfessionalCounselor      MA      LPC                         |
     +----------------------------------------------------------------------------------------+

Comment

Wouter Wakker

Join Date: Nov 2018

Posts: 621
#3

10 Oct 2019, 10:55

Code:

gen TitleClean = substr(Title, strpos(Title, "|||")+3, .) replace TitleClean = substr(TitleClean, 1, strpos(TitleClean, "|")-2)

This works on your data sample but it relies on the title always coming after the first occurrence of "|||" and ending with ",|". I don't know if this is the case for your whole dataset.
2 likes
Comment
FLuca

Join Date: May 2014

Posts: 35
#4

10 Oct 2019, 11:24

Thank you Nick Cox and Wouter Wakker .

Based on this suggestion I came up with an additional alternative solution.

Code:

split Title, parse(",||") gen TitleClean = regexs(1) if regexm(Title1, ".*[|||](.*)$")

However, of the three alternative solutions posted here, I prefer Nick's as in that way I get also clean data to build a possible "degree/certificate" variable later on.

Last edited by FLuca; 10 Oct 2019, 11:28.
1 like
Comment

Announcement

extract piece of a string variable which is always in the same position

Comment

Comment

Comment