
  • Extracting parts of a string

    Dear all,

    I'd like to extract a part of a string from a string variable, to be specific: the indication in a radiology report.

    I created the following loop to extract a part of text starting with "indication"/"indications"/"history" AND ending with "view"/"views"

    However, it looks for the last time "view"/"views" is used in the report, while it should pick up the string between "indication"/"indications"/"history" AND the first time "view"/"views" is used.

    foreach x in indication indications history{
    foreach y in view views {
    replace indication = regexs(2) if regexm(lower(report_text),"(`x': )(.*)(`y')") & indication == ""
    }
    }

    Thank you,

    Stein

  • #2
    Stein,

    I'm not sure I entirely understand your question, but I have a feeling I know what's wrong. The (.*) in the middle of your regular expression matches as much text as it can, so it runs through to the last "view" in the string rather than stopping at the first. In regular-expression-speak this is known as "greedy" matching. Most standard implementations of regular expressions have a way of making this expression non-greedy, but Stata's does not. StataCorp is aware of this limitation and has promised to fix it someday.

    In the meantime, there are probably some alternatives, but it would help if you could give some examples of the kinds of strings you have and what you want to extract from them. That way, we can test possible solutions against actual data.

    Regards,
    Joe
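
    To make the greedy behaviour concrete, here is a minimal sketch on a made-up report string (the text and the variable name greedy are illustrative, not Stein's data). With regexm(), the (.*) should run up to the last "view", which here is the one inside "viewed":

    ```stata
    * Illustrative data only: one short, invented report
    clear
    set obs 1
    gen report_text = "indication: pain after trauma : 2 elbow view. images viewed by xxx"

    * Greedy (.*): the capture extends to the LAST "view" in the string,
    * so it still contains the first "view." and the text after it
    gen greedy = regexs(2) if regexm(report_text, "(indication: )(.*)(view)")
    list greedy
    ```

    The extracted text should therefore still include "view. images", which is the symptom described in the first post.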



    • #3
      Stein,

      Here is one method that assumes that view/views is at the very end of the string:

      Code:
      clear
      input strL report_text
      "indication I want this 1 view"
      "indication I want this 2 views"
      "indications I want this 3 view"
      "indications I want this 4 views"
      "history I want this 5 view"
      "history I want this 6 views"
      "xxx I dont want this yyy"
      "indication I dont want this either yyy"
      "xxx nor this view"
      end
      
      gen indication=regexs(2) if regexm(lower(report_text),"^(indications|indication|history)(.*)(views|view)$")
      
      list

      The "$" at the end anchors view/views to the end of the string, so the greedy (.*) cannot run past it. If the assumption that view/views is at the end of the string is not valid, then perhaps you can add a preliminary step that creates a string that conforms to that assumption.

      Note also that the order of (indications|indication|history) is important. If you put (indication|indications|history) and the string starts with "indications" it will match "indication" and leave the "s" behind.

      Regards.
      Joe
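
      The point about ordering can be checked with a small sketch. The variable names shortfirst and longfirst are made up for illustration, and the comments describe what the note above predicts rather than output I have verified on every Stata version:

      ```stata
      * Illustrative data only
      clear
      input str60 report_text
      "indications I want this 3 view"
      end

      * Shorter alternative listed first: per the note above, "indication"
      * is matched and the trailing "s" is expected to end up in group 2
      gen shortfirst = regexs(2) if regexm(lower(report_text), "^(indication|indications|history)(.*)(views|view)$")

      * Longer alternative listed first: "indications" is matched as a whole
      gen longfirst = regexs(2) if regexm(lower(report_text), "^(indications|indication|history)(.*)(views|view)$")

      list shortfirst longfirst
      ```

      Comparing the two new variables side by side shows whether the stray "s" appears.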



      • #4
        Thanks for your reply. To simplify it a bit:

        a string variable called report_text contains the following text:

        "patient 656294 elbow radiograph: indication: pain in the elbow after trauma : 2 elbow view. The imaging demonstrated xx and no fracture. end of report and images viewed by xxx"

        I'd like to extract this part: " pain in the elbow after trauma : 2 elbow"

        However, using this command: gen indication = regexs(2) if regexm(lower(report_text),"(indication: )(.*)(view)")

        extracts this: "pain in the elbow after trauma : 2 elbow view. The imaging demonstrated xx and no fracture. end of report and images"

        I'd like it to stop at the first occurrence of "view", not the last.




        • #5
          Stein,

          Here is an alternative that doesn't assume that view/views is at the end of the string:

          Code:
          clear
          
          input strL report_text
          "indication I want this 1 view xxx"
          "indication I want this 2 views yyy"
          "indications I want this 3 view zzz"
          "indications I want this 4 views aaa"
          "history I want this 5 view bbb"
          "history I want this 6 views ccc"
          "xxx I dont want this yyy"
          "indication I dont want this either yyy"
          "xxx nor this view"
          end
          
          gen subtext=substr(report_text,1,strpos(report_text,"view")-1)
          gen indication=regexs(2) if regexm(subtext,"(indications|indication|history)(.*)")
          
          list
          Let us know if neither of these methods works with your data.

          Regards,
          Joe



          • #6
            Great example, thanks.

            But how would I handle this one?

            "indication I want this 1 view but there are multiple view"

            I only want the part "I want this 1".


            • #7
              I think my second example will work for that, since it throws out everything after (and including) the first "view".
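
              As a quick check, here is that second method applied to the string from #6. This is only a sketch: the trim() call is an addition of mine to drop the blanks around the extracted text, and the data line is the single example quoted above.

              ```stata
              clear
              input str80 report_text
              "indication I want this 1 view but there are multiple view"
              end

              * Keep only the text before the FIRST "view"
              gen subtext = substr(report_text, 1, strpos(report_text, "view") - 1)

              * Then capture whatever follows the keyword; greedy (.*) is harmless
              * here because everything after the first "view" is already gone
              gen indication = trim(regexs(2)) if regexm(subtext, "(indications|indication|history)(.*)")

              list indication
              ```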



              • #8
                Brilliant. Great solution; I hadn't thought of this. Thank you very much.



                • #9
                  Note that with Stata 14, the new Unicode versions of the regex functions do support non-greedy quantifiers.

                  Code:
                  clear
                  set obs 1
                  gen report_text = "patient 656294 elbow radiograph: indication: pain in the elbow after trauma : 2 elbow view. The imaging demonstrated xx and no fracture. end of report and images viewed by xxx"
                  gen indication = ustrregexs(1) if ustrregexm(report_text,"indication: (.+?) view")
                  list
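
                  For comparison, here is a small sketch that puts the greedy and non-greedy quantifiers side by side (Stata 14 or later assumed). The shortened report text and the variable names greedy and lazy are made up for illustration:

                  ```stata
                  clear
                  set obs 1
                  gen report_text = "indication: pain : 2 elbow view. images viewed by xxx"

                  * Greedy (.+): runs to the last " view" (the one in " viewed")
                  gen greedy = ustrregexs(1) if ustrregexm(report_text, "indication: (.+) view")

                  * Non-greedy (.+?): stops at the first " view"
                  gen lazy = ustrregexs(1) if ustrregexm(report_text, "indication: (.+?) view")

                  list greedy lazy
                  ```

                  The only difference between the two patterns is the "?" after the "+".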



                  • #10
                    Robert,

                    Thanks for pointing that out. Is this documented somewhere or did you (or someone else) discover it by accident? I'm surprised StataCorp didn't mention this when I brought up Stata's regular expression limitations at the Stata Conference.

                    Regards,
                    Joe



                    • #11
                      I haven't seen any documentation about the extra functionality of the new unicode versions. Credit goes to Dimitriy V. Masterov who noted the support for character classes in this post on Stack Overflow.



                      • #12
                        Incredible guys, thank you so much!



                        • #13
                          Dear all,
                          I have the following example text and I want to extract "Max 8".

                          doser "1 tablet when needed . Max 8 tablet per day"
                          I tried the following, but Stata said the doser max is an invalid name:

                          gen max = ustrregexs(1) if ustrregexm(lower(doser,"max [0-9]"))

                          I also tried the following, but it produced all missing values. For information, max can appear in various forms such as max, MAX and Max:
                          gen max = ustrregexs(1) if ustrregexm(doser,"(Max|max|MAX)[0-9]$")


                          I'd appreciate any suggestions.
                          Thanks



                          • #14
                            I guess you have a parenthesis too late. Try

                            Code:
                            lower(doser)
                            and cut the misplaced character.



                            • #15
                              Hi Nick,
                              Thanks. I tried the following, but it generated all missing values. I am not sure which character is misplaced. I'd appreciate your help.

                              gen max = ustrregexs(1) if ustrregexm(lower(doser),"max [0-9]")

                              Thanks

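
                              A likely reason for the missing values in the last attempt is that the pattern has no parenthesized group, so ustrregexs(1) has nothing to return; ustrregexs(0) would return the whole match instead. The earlier attempt with "(Max|max|MAX)[0-9]$" probably fails for different reasons: it allows no space before the digit and requires the digit to be at the very end of the string. Here is a hedged sketch on the one example given, using the optional third argument of ustrregexm() for a case-insensitive match so that the original capitalization is kept. The variable names max_text and max_dose are made up for illustration:

                              ```stata
                              clear
                              input str60 doser
                              "1 tablet when needed . Max 8 tablet per day"
                              end

                              * Group 1 wraps the whole "max <digits>" piece; the third
                              * argument (1) asks for a case-insensitive match
                              gen max_text = ustrregexs(1) if ustrregexm(doser, "(max [0-9]+)", 1)

                              * Alternatively, capture only the digits and convert to a number
                              gen max_dose = real(ustrregexs(1)) if ustrregexm(doser, "max ([0-9]+)", 1)

                              list
                              ```

                              If other strings have no space or several spaces after "max", the pattern would need adjusting, for example "max *([0-9]+)".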
