  • Free text processing of medical notes


    I am attempting to search pathology reports, which are stored as strings, to identify only those with a positive result for a bacterium, H. pylori. The difficulty is that the results are not reported in a uniform manner, there are misspellings, and the term "H.pylori" is often present whenever it is tested for, not just when the result is positive.

    What I have tried so far includes removing spaces, converting everything to lower case, and using regexm in a number of iterations: replace hp_present = 0 if regexm(`path_report', "h.pylori is not seen"), replace hp_present = 0 if regexm(`path_report', "no h.pylori"), replace hp_present = 0 if regexm(`path_report', "negative for h.pylori"), and so on.

    Then replace hp_present = 1 if regexm(`path_report', "h.pylori is seen"), replace hp_present = 1 if regexm(`path_report', "h.pylori positive"), and so on.


    The problem is that, on internal validation, sensitivity was only 50% (I missed many of the reports that were positive for H pylori). I am wondering if anyone has advice on how to approach this problem, where spelling errors, non-uniformity, and context all need to be taken into account.

    Examples of strings of the path_report are given below.

    Much appreciated.


    Code:
    
    nal diagnosis                        1. "Duodenum biopsies:        histologically unremarkable duodenal mucsa with slight vascular congestion.
     
    2. Stomach biopsies: Achtive chronic H. pylori gastritis.
     
    3. Tranverse colon polyp" polypectomy: TUbular adenoma fragmented.
     
    4. Sessile cecum polyp polpyectomy: Hyperplastic poylpfragments.
     
    es PATHOLOGIST,MD
    Date Jun 08 2009
     
     
     
     
     
     
     
    BRIEF CLINICAL HISTORY: GERD, hx of H pylori chronic gastritis OPERATIVE FINDINGS: POSTOPERATIVE DIAGNOSIS: Surgeon: Surgeon MD
    GROSS DESCRIPTIO: Specimen is submitted in formalin and labeled biopsies gastric antrum. The one fragment shows mucosal lymphoid aggregate with herminal center. Giemsa             stains show organisms the morphology of which is consistent with H.       pylori.
     
    Gastric antrum biopsies:            Active chronic gastritis associated with H. pylori.
     
     
     
     
     
     
    BRIEF CLINICAL HISTORY: GERD, hx of H pylori chronic gastritis OPERATIVE FINDINGS: POSTOPERATIVE DIAGNOSIS: Surgeon: Surgeon MD
    GROSS DESCRIPTIO: Specimen is submitted in formalin and labeled biopsies gastric antrum. The one fragment shows mucosal lymphoid aggregate with herminal center. Warthin-Starry             stains pending.
     
    Gastric antrum biopsies:            Active chronic gastritis not seen, no H. pylori.
     
     
     
    BRIEF CLINICAL HISTORY: GERD, hx of H pylori chronic gastritis OPERATIVE FINDINGS: POSTOPERATIVE DIAGNOSIS: Surgeon: Surgeon MD
    GROSS DESCRIPTIO: Specimen is submitted in formalin and labeled biopsies gastric antrum. The one fragment shows mucosal lymphoid aggregate with herminal center. Warthin-Starry             stains pending.
     
    Gastric antrum biopsies:            Active chronic gastritis not seen, no H. pylori.
     ADDENDUM: POSITIVE HPYLORI
    
     
     
    MEDICAL RECORD: 78907689
    SURGEONPHYSICIAN: Surgeon MD
    PREOPERATIVE DX: r/o Helicobacter pylori
     
    Final diagnosis: rare comma shaped organisms seen consistent with H pylori
     
     
     
     
     
     
     
    MEDICAL RECORD: 78907689
    SURGEONPHYSICIAN: Surgeon MD
    PREOPERATIVE DX: r/o Helicobacter pylori
     
    Final diagnosis: no comma shaped organisms seen consistent with H pylori




  • #2
    While Stata does text processing, there are packages specifically written for text processing that might be better.

    The only thing I can think of is to take your examples and write specific recognition procedures for each of the cases your current program didn't find.
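
    A minimal sketch of what such recognition rules might look like, assuming Stata 14 or later for the Unicode regex functions. The names txt and hp_flag, and every pattern below, are illustrative assumptions based only on the sample reports in the first post; this is not a validated classifier. The idea is to delete phrases that mention the organism without reporting it (negations, history, rule-out) and then check whether any mention survives:

    Code:
    * Sketch only: txt, hp_flag and all patterns are assumptions, not validated rules.
    * Tolerant pattern for "h pylori", "h. pylori", "hpylori", "helicobacter pylori".
    local hp "\bh[a-z]*[ .]*p[yi]l+or[yi]"

    * Normalise: lower case, then collapse runs of whitespace (including line breaks).
    generate txt = ustrlower(path_report)
    replace txt = ustrregexra(txt, "\s+", " ")

    * Remove negated mentions, e.g. "no h. pylori" or "no ... consistent with h pylori".
    replace txt = ustrregexra(txt, "\b(no|not|negative for|without)\b[^.;:]{0,60}?`hp'", " ")
    * Remove mentions followed by a negation, e.g. "h pylori is not seen".
    replace txt = ustrregexra(txt, "`hp'[^.;:]{0,30}?\b(not seen|not identified|negative|absent)\b", " ")
    * Remove history and rule-out mentions, e.g. "hx of h pylori", "r/o helicobacter pylori".
    replace txt = ustrregexra(txt, "(\bhx of|\bhistory of|\br/o|\brule out)[^.;:]{0,20}?`hp'", " ")

    * Flag reports in which at least one mention survives the removals.
    generate byte hp_flag = ustrregexm(txt, "`hp'")

    The character limits (60, 30, 20) and the cue words are guesses and would need tuning. Whatever rules are used, the flag should be cross-tabulated against the hand-reviewed validation sample, and the misclassified reports read to find the next rule to add.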



    • #3
      How many reports do you have? Of those, what fraction do you think include some variant of "H. pylori" in the text?

      I just don't see how you're going to build something smart enough to distinguish "no comma shaped organisms seen consistent with H pylori" from a positive report in a general sense - the "no" is so far from the "consistent with H pylori" and it could have been "no indications seen that were not consistent with H pylori" - a double-negative.

      My thought is, can you focus your work on building an initial filter to restrict the reports to those that mention H pylori (in all the many spellings and misspellings) and then visually review that subset and identify those reporting a test with a positive result? Farm it out to Amazon's "Mechanical Turk" to pay a small amount for each report reviewed?

      https://www.mturk.com
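
      A rough sketch of the initial filter described above, assuming Stata 14 or later. The names hp_text and hp_mention, the file name hp_review, and the pattern itself are hypothetical; the pattern is deliberately loose so that it errs on the side of keeping a report for review:

      Code:
      * Sketch only: hp_text, hp_mention and the file name are assumptions.
      * Normalise: lower case, then collapse runs of whitespace.
      generate hp_text = ustrlower(path_report)
      replace hp_text = ustrregexra(hp_text, "\s+", " ")

      * Loose match for "h pylori", "h. pylori", "hpylori", "helicobacter pylori".
      generate byte hp_mention = ustrregexm(hp_text, "\bh[a-z]*[ .]*p[yi]l+or[yi]")

      * See how large the review subset would be.
      count if hp_mention

      * Save only the reports that mention the organism, for manual review.
      preserve
      keep if hp_mention
      save hp_review, replace
      restore

      Reading a sample of the reports with hp_mention == 0 would show which spellings the pattern still misses.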

      I agree with Phil that this task does not play to Stata's strengths.



      • #4
        I agree with the other posters, but as a long shot, you might check out Word Scores by Michael Laver, Kenneth Benoit, and John Garry.

        https://www.tcd.ie/Political_Science.../software.html
        David Radwin
        Senior Researcher, California Competes
        californiacompetes.org
        Pronouns: He/Him



        • #5
          Here's a Stata text analysis command whose description impressed me when it first came out. I'd *really* like to hear a report of whether it's useful in the current case, which seems like a good test:
          Code:
           ssc describe txttool
          
          ------------------------------------------------------------------------------------------------------------------------------------------------------
          package txttool from http://fmwww.bc.edu/repec/bocode/t
          ------------------------------------------------------------------------------------------------------------------------------------------------------
          
          TITLE
                'TXTTOOL': module providing utilities for text analysis
          
          DESCRIPTION/AUTHOR(S)
                
                 txttool provides a set of tools for managing and analyzing
                free-form text. The program integrates    several built-in Stata
                functions with new text capabilities, including a utility to
                create a    bag-of-words representation of text and an
                implementation of Porter's word stemming algorithm.
          It's documented in Stata Journal, volume 14, number 4.
