
  • Regular Expressions in Stata: Regexs

    I am trying to extract the string between two other strings.

    INDICATION: (string i want capture) DESCRIPTION
    The "string i want capture" part has a high degree of variability:
    in some observations it is 1 word, in others 2 words, and in others 3 words.
    I am not having problems capturing the first string, but capturing 2 and 3 words seems to be a problem.

    I am capturing the first string with

    Code:
    gen indication = regexs(1) if regexm(notes,"INDICATION: ([a-zA-z+])[ ]*DESCRIPTION ")

  • #2
    It's not clear to me what your strings look like, but how about you start with:

    Code:
    clear
    set more off
    
    set obs 1
    
    gen notes = "INDICATION: (string i want capture) DESCRIPTION"
    gen indic = regexs(1) if regexm(notes, "INDICATION: (.*) DESCRIPTION")
    
    list
    You should:

    1. Read the FAQ carefully.

    2. "Say exactly what you typed and exactly what Stata typed (or did) in response. N.B. exactly!"

    3. Describe your dataset. Use list to list data when you are doing so. Use input to type in your own dataset fragment that others can experiment with.

    4. Use the advanced editing options to appropriately format quotes, data, code and Stata output. The advanced options can be toggled on/off using the A button in the top right corner of the text editor.



    • #3
      It's not a solution using regular expressions, but you could use something like the following to get what you want. Start at the "Begin here" comment.

      . version 14.0

      . clear *

      . set more off

      . input str50 input_text

                                                   input_text
        1. "stuff INDICATION: 1 word DESCRIPTION"
        2. "and INDICATION: 2 words DESCRIPTION"
        3. "nonsense INDICATION: 3 words DESCRIPTION"
        4. "INDICATION: (string i want capture) DESCRIPTION"
        5. end

      . *
      . * Begin here
      . *
      . local i INDICATION:

      . local i strpos(input_text, "`i'") + strlen("`i'")+1

      . local d DESCRIPTION

      . local d strpos(input_text, "`d'")

      . generate str indication = substr(input_text, `i', `d' - (`i'))

      . list indication, noobs

        +--------------------------+
        |               indication |
        |--------------------------|
        |                  1 word  |
        |                 2 words  |
        |                 3 words  |
        | (string i want capture)  |
        +--------------------------+

      . exit

      end of do-file
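
      A variation on the substring approach above, as a minimal sketch rather than a tested solution: it trims the stray blanks and leaves the result empty when either marker is missing, because strpos() returns 0 in that case. The sample strings and the helper variables start and stop are made up for illustration, and the /// continuation assumes the code is run from a do-file.

      ```stata
      clear
      input str60 input_text
      "stuff INDICATION: 1 word DESCRIPTION"
      "no markers here"
      end

      * position just past "INDICATION:" and position where "DESCRIPTION" begins
      generate start = strpos(input_text, "INDICATION:") + strlen("INDICATION:")
      generate stop  = strpos(input_text, "DESCRIPTION")

      * extract only when both markers are present and in the expected order
      generate str indication = trim(substr(input_text, start, stop - start)) ///
          if strpos(input_text, "INDICATION:") > 0 & stop > start

      list input_text indication, noobs
      ```

      The second observation has neither marker, so indication should stay empty for it instead of picking up an arbitrary piece of the text.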



      • #4
        Well here you go:

        The variable "notes" contains a medical report in each field. The string I want to capture is the text that follows INDICATION: (here, "bph with luts"). There are 526 such observations. The report below is one observation. The regex captures only "bph". "bph with luts" is one of the strings I want to capture, but any number of different strings can follow INDICATION:, so it is not just "bph with luts".


        DATE: 23/56/1987 PATIENT: CVFGB DATE OF BIRTH: 5/4/1876 SURGEON: John Rambo
        INDICATION: bph with luts
        DESCRIPTION OF PROCEDURE: A 7-French urethral catheter and a 9-French rectal manometer were atraumaticallly placed. Normal saline was instilled via the urethral catheter at a rate of 40-50mL per minute. Direct pressure transduction was used to record both the abdominal and vesicle pressures. The detrusor pressures were derived from the above tracings. Continuous EMG monitoring was performed via electrodes to the perineum.
        Filling Stage:
        Presence of stress urinary incontinence: No
        Presence of detrusor overactivity:No
        Bladder Compliance: Normal
        Maximum Capacity: 520mL
        First Sensation: 161mL
        First Desire: 250mL
        Strong Desire: 375mL
        Urgency: 448mL

        Voiding Stage:
        Destrusor Contraction: acontractile
        Voided Volume: 236
        Qmax: 9mL/S
        Qavg: 4mL/S
        PVR: 300mL
        Detrusor maximum pressure: 97cmH2O
        Vesical Maximum Pressure: 104cmH2O
        Detrusor Pressure at Qmax: 12cmH2O
        Vesical Pressure at Qmax: 40cmH2O
        Presence of abdominal straining: Yes

        EMG: Tracing quality and capture was good. There was appropriate quieting noted during voiding.




        • #5
          Your post at #4 indicates that the regular expression approach suggested in post #2 did not work. Did the substring approach in post #3 also not work?

          I suspect that the difficulty is that your variable notes has multiple lines. That is, it has within it "invisible" linefeed (newline) characters or return characters. I do not have time to test at the moment, but I think that something like
          Code:
          clonevar note2 = notes
          replace note2 = subinstr(note2,char(10)," ",.)
          replace note2 = subinstr(note2,char(13)," ",.)
          will create a new variable note2 with all the possible line break characters replaced by spaces. Then one of the two techniques should work.
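
          Put together, the idea can be sketched as follows. This is an untested illustration on a made-up two-line note, not the poster's data; the variable name indication and the sample text are assumptions.

          ```stata
          clear
          set obs 1
          * made-up two-line note; char(10) is the linefeed between the lines
          gen notes = "INDICATION: bph with luts" + char(10) + "DESCRIPTION OF PROCEDURE: text"

          * replace return and linefeed characters with blanks in a copy
          clonevar note2 = notes
          replace note2 = subinstr(note2, char(13), " ", .)
          replace note2 = subinstr(note2, char(10), " ", .)

          * with the line break gone, the pattern from post #2 can match
          gen indication = trim(regexs(1)) if regexm(note2, "INDICATION: (.*) DESCRIPTION")
          list indication
          ```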

          If you have no luck, could you post a .dta file containing your example from post #4: 1 variable, 1 observation? I don't want to try to recreate it in Stata and not have it match exactly the particular characteristics of your data.
          Last edited by William Lisowski; 07 Aug 2015, 07:48.



          • #6
            Thank you for the suggestions. I met up with a programmer who resolved the regular expression problem. Here is the code he suggested.
            Code:
            gen indication1 = regexs(1) if regexm(notes,"INDICATION:\s*(.*)\s*DESCRIPTION")
            gen indication = trim(indication1)



            • #7
              I assume that since you are matching past the current line to the next line, this text is not read with each line being stored in different observations. I copied the text as is from the sample in #4 to a text file that I called "obs.txt" and read the whole file into a strL.

              If your programmer's suggestion works, it's totally by chance and is probably not what you want. That's because the \s whitespace character class is not supported by regexm(), even in Stata 14. The new unicode versions of regex functions do support it. Here's the difference:

              Code:
              clear
              set obs 1
              gen id = 1
              generate strL notes = fileread("obs.txt")
              gen indication = trim(regexs(1)) if regexm(notes,"INDICATION:\s*(.*)\s*DESCRIPTION")
              gen indication2 = ustrregexs(1) if ustrregexm(notes,"INDICATION:\s*(.*?)\s*DESCRIPTION")
              
              dis "|`=indication[1]'|"
              dis "|`=indication2[1]'|"
              
              dis length(indication)
              dis strlen(indication2)
              dis strlen(indication2)
              dis ustrlen(indication2)
              
              assert indication == indication2
              And here are the results after the variables are generated:

              Code:
              . dis "|`=indication[1]'|"
              |bph with luts
              |
              
              . dis "|`=indication2[1]'|"
              |bph with luts|
              
              . 
              . dis length(indication)
              14
              
              . dis strlen(indication2)
              13
              
              . dis strlen(indication2)
              13
              
              . dis ustrlen(indication2)
              13
              
              . 
              . assert indication == indication2
              assertion is false
              r(9);
              Note that the new unicode versions support lazy quantifiers (i.e. the question mark in the "(.*?)" subpattern). Without it, the trailing white space is absorbed by ".*" and therefore included in what ustrregexs(1) returns.
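
              A minimal sketch of the greedy versus lazy difference, using a made-up one-line string with three blanks before DESCRIPTION. The string and the variable names greedy and lazy are assumptions for illustration, and ustrregexm() needs Stata 14 or later.

              ```stata
              clear
              set obs 1
              * made-up example: three blanks between "luts" and "DESCRIPTION"
              gen notes = "INDICATION: bph with luts   DESCRIPTION OF PROCEDURE:"

              * greedy: (.*) takes the trailing blanks, since \s* may match nothing
              gen greedy = ustrregexs(1) if ustrregexm(notes, "INDICATION:\s*(.*)\s*DESCRIPTION")

              * lazy: (.*?) stops as early as possible, leaving the blanks to \s*
              gen lazy = ustrregexs(1) if ustrregexm(notes, "INDICATION:\s*(.*?)\s*DESCRIPTION")

              dis "|`=greedy[1]'|"
              dis "|`=lazy[1]'|"
              dis strlen(greedy[1]) " vs " strlen(lazy[1])
              ```

              The greedy capture should keep the three trailing blanks, so its length should exceed the lazy one by three.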



              • #8
                Note please that the original statement of the problem was

                INDICATION: (string i want capture) DESCRIPTION
                while the later display of the actual data was

                INDICATION: bph with luts
                DESCRIPTION OF PROCEDURE:
                spread across two lines: this is a significant difference, and it meant that two readers solved a problem different from the one that you needed solved. That is why the Statalist FAQ linked to from the top of each page advises

                Say exactly what you typed and exactly what Stata typed (or did) in response. N.B. exactly! If you can, reproduce the error with one of Stata's provided datasets, a small fragment of your dataset, or a simple concocted dataset that you include in your posting.



                • #9
                  I guess Roberto Ferrer solved my problem, and I specifically wanted a regex solution. The actual data and the original problem both exist in the database. The variables are not uniform, which is why I mentioned that any number of possibilities can follow INDICATION. Thanks

