Free text processing of medical notes

AKS Kumar

Join Date: Mar 2019
Posts: 1

28 Mar 2019, 05:47

I am attempting to search pathology reports, which are stored as strings, to identify only those with a positive result for the bacterium H. pylori. The difficulty is that results are not reported in a uniform manner, there are misspellings, and the term "H. pylori" often appears whenever it is tested for, not just when the result is positive.

What I have tried so far includes removing spaces, converting everything to lower case, and using regexm in a number of iterations: replace hp_present = 0 if regexm(`path_report', "h.pylori is not seen"), replace hp_present = 0 if regexm(`path_report', "no h.pylori"), replace hp_present = 0 if regexm(`path_report', "negative for h.pylori"), and so on.

Then replace hp_present = 1 if regexm(`path_report', "h.pylori is seen"), replace hp_present = 1 if regexm(`path_report', "h.pylori positive"), and so on.

The problem is that in an internal validation, sensitivity was only 50% (I missed many of the reports with H. pylori). Does anyone have advice on how to approach this, where spelling errors, non-uniformity, and content all need to be taken into account?

Examples of strings of the path_report are given below.

Much appreciated.

Code:

nal diagnosis 1. "Duodenum biopsies: histologically unremarkable duodenal mucsa with slight vascular congestion.

2. Stomach biopsies: Achtive chronic H. pylori gastritis.

3. Tranverse colon polyp" polypectomy: TUbular adenoma fragmented.

4. Sessile cecum polyp polpyectomy: Hyperplastic poylpfragments.

es PATHOLOGIST,MD
Date Jun 08 2009

BRIEF CLINICAL HISTORY: GERD, hx of H pylori chronic gastritis OPERATIVE FINDINGS: POSTOPERATIVE DIAGNOSIS: Surgeon: Surgeon MD
GROSS DESCRIPTIO: Specimen is submitted in formalin and labeled biopsies gastric antrum. The one fragment shows mucosal lymphoid aggregate with herminal center. Giemsa stains show organisms the morphology of which is consistent with H. pylori.

Gastric antrum biopsies: Active chronic gastritis associated with H. pylori.

Gastric antrum biopsies: Active chronic gastritis not seen, no H. pylori.

Gastric antrum biopsies: Active chronic gastritis not seen, no H. pylori.
ADDENDUM: POSITIVE HPYLORI

MEDICAL RECORD: 78907689
SURGEONPHYSICIAN: Surgeon MD
PREOPERATIVE DX: r/o Helicobacter pylori

Final diagnosis: rare comma shaped organisms seen consistent with H pylori

MEDICAL RECORD: 78907689
SURGEONPHYSICIAN: Surgeon MD
PREOPERATIVE DX: r/o Helicobacter pylori

Final diagnosis: no comma shaped organisms seen consistent with H pylori

Phil Bromiley

Join Date: Apr 2014

Posts: 4348
#2

29 Mar 2019, 11:34

While Stata does text processing, there are packages specifically written for text processing that might be better.

The only thing I can think of is to take your examples and write specific recognition procedures for each of the cases your current program didn't find.
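As a rough illustration of that workflow, here is a minimal sketch in Python (not Stata) that lists the validated positives the program missed, so each one can be read and turned into a new rule. The labels and predictions are hypothetical placeholders, and `missed_positives` and `sensitivity` are names made up for this example.

```python
def missed_positives(reports, truth, predicted):
    """Return reports that are truly positive (1) but were predicted negative (0)."""
    return [r for r, t, p in zip(reports, truth, predicted) if t == 1 and p == 0]


def sensitivity(truth, predicted):
    """True positives divided by (true positives + false negatives)."""
    tp = sum(1 for t, p in zip(truth, predicted) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(truth, predicted) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else float("nan")


# Report strings taken from the examples in the thread.
reports = [
    "Stomach biopsies: Achtive chronic H. pylori gastritis.",
    "ADDENDUM: POSITIVE HPYLORI",
    "Gastric antrum biopsies: Active chronic gastritis not seen, no H. pylori.",
    "Final diagnosis: rare comma shaped organisms seen consistent with H pylori",
]
truth = [1, 1, 0, 1]      # hypothetical labels from manual review
predicted = [0, 1, 0, 0]  # hypothetical output of the current rules

for report in missed_positives(reports, truth, predicted):
    print("MISSED:", report)
print("sensitivity:", sensitivity(truth, predicted))
```

Reading the missed reports side by side usually shows which phrasings the existing patterns do not cover.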
William Lisowski

Join Date: Dec 2014

Posts: 10150
#3

29 Mar 2019, 12:41

How many reports do you have? Of those, what fraction do you think include some variant of "H. pylori" in the text?

I just don't see how you're going to build something smart enough to distinguish "no comma shaped organisms seen consistent with H pylori" from a positive report in a general sense - the "no" is so far from the "consistent with H pylori" and it could have been "no indications seen that were not consistent with H pylori" - a double-negative.
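To make the difficulty concrete, here is a minimal Python sketch of a sentence-level negation rule: a mention counts as negated if any negation cue appears in the same sentence. This is a simple cue-word heuristic chosen for illustration, not something proposed in this thread, and the cue list, the mention pattern, and the name `sentence_statuses` are all assumptions.

```python
import re

# Loose pattern for the organism name: "H. pylori", "H pylori", "HPYLORI",
# "Helicobacter pylori". Illustrative only; it does not catch misspellings.
MENTION = re.compile(r"\bh(?:elicobacter)?[\s.\-]*pylori\b", re.IGNORECASE)

# Hypothetical list of negation cues; a real list would need to be longer.
NEGATION = re.compile(r"\b(?:no|not|negative|without|absent)\b", re.IGNORECASE)

PLACEHOLDER = "HPYLORI_TOKEN"


def sentence_statuses(report):
    """Return 'negated' or 'affirmed' for each sentence that mentions H. pylori."""
    # Replace mentions first so the period in "H. pylori" is not read as a sentence end.
    text = MENTION.sub(PLACEHOLDER, report)
    statuses = []
    for sentence in re.split(r"[.;\n]+", text):
        if PLACEHOLDER not in sentence:
            continue
        statuses.append("negated" if NEGATION.search(sentence) else "affirmed")
    return statuses


print(sentence_statuses(
    "Final diagnosis: rare comma shaped organisms seen consistent with H pylori"))
print(sentence_statuses(
    "Final diagnosis: no comma shaped organisms seen consistent with H pylori"))
print(sentence_statuses(
    "Gastric antrum biopsies: Active chronic gastritis not seen, no H. pylori.\n"
    "ADDENDUM: POSITIVE HPYLORI"))
# The double negative is labeled "negated" as well, which is the weakness noted above.
print(sentence_statuses("no indications seen that were not consistent with H pylori"))
```

A rule like this handles the distant "no", but it cannot tell a double negative from a single one, and it treats lines such as "r/o Helicobacter pylori" or "hx of H pylori" as affirmed unless cues for those are added too.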

My thought is, can you focus your work on building an initial filter to restrict the reports to those that mention H pylori (in all the many spellings and misspellings) and then visually review that subset and identify those reporting a test with a positive result? Farm it out to Amazon's "Mechanical Turk" to pay a small amount for each report reviewed?

https://www.mturk.com

I agree with Phil that this task does not play to Stata's strengths.
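For what it's worth, the initial filter could be sketched outside Stata along these lines. This is a minimal Python example, assuming fuzzy token matching with the standard library's `difflib` is acceptable; the target spellings, the 0.8 threshold, and the name `mentions_h_pylori` are illustrative choices, not tested recommendations.

```python
import re
from difflib import SequenceMatcher

# Spellings to compare each word against (assumed list for illustration).
TARGETS = ("pylori", "hpylori", "helicobacter")


def mentions_h_pylori(report, threshold=0.8):
    """Return True if any word in the report closely resembles a target spelling."""
    tokens = re.findall(r"[a-z]+", report.lower())
    return any(
        SequenceMatcher(None, token, target).ratio() >= threshold
        for token in tokens
        for target in TARGETS
    )


examples = [
    "Stomach biopsies: Achtive chronic H. pylori gastritis.",
    "ADDENDUM: POSITIVE HPYLORI",
    "PREOPERATIVE DX: r/o Helicobacter pylori",
    "Sessile cecum polyp polpyectomy: Hyperplastic poylpfragments.",
]
for text in examples:
    print(mentions_h_pylori(text), "|", text)
```

The filter is deliberately over-inclusive: anatomical words such as "pyloric" would also pass at this threshold, which is acceptable if the flagged subset is reviewed by eye afterwards.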
David Radwin

Join Date: Mar 2014

Posts: 369
#4

29 Mar 2019, 13:50

I agree with the other posters, but as a long shot, you might check out Word Scores by Michael Laver, Kenneth Benoit, and John Garry.

https://www.tcd.ie/Political_Science.../software.html

David Radwin
Senior Researcher, California Competes
californiacompetes.org
Pronouns: He/Him
Mike Lacy

Join Date: Apr 2014
Posts: 2425

29 Mar 2019, 14:01

Here's a Stata text analysis command, whose description impressed me when it first came out. I'd *really* like to hear a report of whether it's useful in the current case, which seems like a good test:
Code:

 ssc describe txttool

------------------------------------------------------------------------------------------------------------------------------------------------------
package txttool from http://fmwww.bc.edu/repec/bocode/t
------------------------------------------------------------------------------------------------------------------------------------------------------

TITLE
      'TXTTOOL': module providing utilities for text analysis

DESCRIPTION/AUTHOR(S)
      
      txttool provides a set of tools for managing and analyzing
      free-form text. The program integrates several built-in Stata
      functions with new text capabilities, including a utility to
      create a bag-of-words representation of text and an
      implementation of Porter's word stemming algorithm.

It's documented in Stata Journal, volume 14, number 4.
