  • String Var: Extract specific number of characters before and after a single word

    Hello Stata Forum,

    I have a series of doctors' notes that range from 10,000 to 30,000 characters in length, and I am hoping to identify specific medications in these notes along with the medication instructions. To do this, I was planning to use Stata string functions to search the notes for a specific keyword (for example, lisinopril) and then extract a set number of characters immediately preceding and following the keyword (to start, I was planning to extract 100 characters before and 100 characters after).

    Two examples
    1. Out of a 20,000 character note a fragment might say "Patient on lisinopril 10 MG daily".
    2. "Patient instructed to stop taking lisinopril due to side effects"

    My goal is not to extract exact sentences, but simply enough characters around the keyword to provide a text fragment that gives context without having to read a full 20,000-character note.

    I read the following posts, which touch on a similar issue and provide guidance on handling misspellings and on using regexm to search for words beforehand, but neither quite explains how to do character-based extraction. I was hoping for suggestions on how to specify the number of characters to be extracted to a new variable both before and after the key term.

    https://www.statalist.org/forums/forum/general-stata-discussion/general/1328155-string-var-extract-sentence-based-on-a-single-word
    https://stats.idre.ucla.edu/stata/faq/how-can-i-extract-a-portion-of-a-string-variable-using-regular-expressions/
    Best,
    Tim

  • #2
    Here is some example code that extracts 15 characters on either side of the target text. It should start you in a useful direction.
    Code:
    cls
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str64 text
    "Patient on lisinopril 10 MG daily"                               
    "Patient instructed to stop taking lisinopril due to side effects"
    "nothing here"
    end
    
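    * Keyword to search for, and a lowercase copy of the text for case-insensitive matching
    generate target = "lisinopril"
    generate lctext = strlower(text)
    
    generate tgtlen = length(target)
    * Position of the first occurrence of the keyword (0 if absent)
    generate hit = strpos(lctext,target)
    * Start 15 characters before the hit, but never before position 1
    generate c1 = max(1,hit-15)
    * Keep everything from c1 through 15 characters past the end of the keyword
    generate extract = substr(text,c1,(hit-c1)+tgtlen+15) if hit>0
    list hit extract, clean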
    Code:
    . list hit extract, clean
    
           hit                                    extract  
      1.    12          Patient on lisinopril 10 MG daily  
      2.    35   to stop taking lisinopril due to side ef  
      3.     0
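
    To get closer to the 100-character window asked about in the original post, the same approach can be parameterized. The following is only a sketch under assumptions: the local macro names `target` and `width` are made up for illustration, the notes are assumed to sit in a string variable called text, and only the first occurrence of the keyword in each note is captured.

```stata
clear
input str64 text
"Patient on lisinopril 10 MG daily"
"Patient instructed to stop taking lisinopril due to side effects"
"nothing here"
end

* Hypothetical settings: keyword (lowercase) and characters kept on each side
local target "lisinopril"
local width  100

* Case-insensitive position of the first occurrence (0 if absent)
generate hit = strpos(strlower(text), "`target'")

* Window start, clamped so it never falls before position 1
generate c1 = max(1, hit - `width')

* Length = characters before the hit + keyword + `width' characters after
generate extract = substr(text, c1, (hit - c1) + strlen("`target'") + `width') if hit > 0

list hit extract, clean
```

    With these short example strings a 100-character window simply returns each matching note in full; on real notes of 10,000 or more characters it should return at most 100 characters on each side of the keyword. If the notes contain non-ASCII characters, the Unicode-aware counterparts (ustrpos, usubstr, ustrlen) may be the safer choice, since strpos and substr count bytes.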
