Information Extraction from Semistructured Patient Records presentation

About This Presentation

Title:

Information Extraction from Semistructured Patient Records

Description:

Automatically extract information from semi-structured patient records. ... Each patient is either current smoker, former smoker, or nonsmoker. Texts ' ... –

Slides: 22

Provided by: NanZ1

Transcript and Presenter's Notes

1
Information Extraction from Semi-structured
Patient Records

Davis Zhou
College of Information Science and Technology
Drexel University

2
Agenda

Problem Addressed
Methods
Approach to numeric values
Approach to medical terms
Approach to categorical values
Implementation
Evaluation
Future Work

3
Problem Addressed

Descriptions
Automatically extract information from
semi-structured patient records.
Three types of information
Numbers: blood pressure, weight, pulse, etc.
Medical terms: past medical history
Classification: smoking behavior, alcohol use,
appearance, etc.
Each record consists of multiple sections
beginning with fixed strings. Each section is
written in natural language.

4
Problem Addressed (cont.)

Examples

5
Problem Addressed (cont.)

Examples

6
Approach to Numeric Values (1)

Number Identification
Tokenization
Named Entity Recognition
Concept Identification
String Match
Synonym Expansion
Association
Pattern-based approach
Linkage-based approach (our approach)

7
Approach to Numeric Values (2)

Pattern Approach
Examples
CONCEPT is NUMBER
CONCEPT of NUMBER
CONCEPT, NUMBER
CONCEPT NUMBER
Very simple, but generalizes poorly to unseen phrasings.
Linkage-based Association Approach
Convert linkage diagram to graph
Calculate the shortest distance between each
concept-number pair in a sentence.
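The four surface patterns listed above can be sketched with regular expressions. This is a minimal illustration, not the presenter's implementation: the concept lexicon `CONCEPTS`, the `NUMBER` expression, and the function name `extract_by_pattern` are all assumptions made for the example.

```python
import re

# Hypothetical concept lexicon with synonyms (not taken from the slides).
CONCEPTS = {
    "blood pressure": ["blood pressure", "bp"],
    "weight": ["weight"],
    "pulse": ["pulse", "heart rate"],
}

# Assumed number shape: 72, 98.6, or a ratio such as 120/80.
NUMBER = r"(\d+(?:\.\d+)?(?:/\d+)?)"


def extract_by_pattern(sentence):
    """Return {concept: number_string} using the four patterns on the slide."""
    found = {}
    for concept, names in CONCEPTS.items():
        alternatives = "|".join(re.escape(name) for name in names)
        # CONCEPT is NUMBER | CONCEPT of NUMBER | CONCEPT, NUMBER | CONCEPT NUMBER
        pattern = rf"\b(?:{alternatives})(?:\s+is\s+|\s+of\s+|\s*,\s*|\s+){NUMBER}"
        match = re.search(pattern, sentence, flags=re.IGNORECASE)
        if match:
            found[concept] = match.group(1)
    return found


print(extract_by_pattern("Blood pressure is 120/80, pulse 72, weight of 150 pounds."))
# The generalization problem: a phrasing outside the four patterns is missed.
print(extract_by_pattern("Pulse was measured at 72."))
```

The second call returns an empty dictionary, which is the kind of miss that motivates the linkage-based approach.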

8
Approach to Numeric Values (3)

Link Grammar Parser
Convert each word to a node and each link to a (weighted) edge
Assume that if a number is the value of a certain
concept, the number's shortest distance from that
concept must be less than from any other concept
in the sentence.
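The nearest-concept assumption can be sketched as a shortest-path search over the word graph. The example below does not call the Link Grammar Parser; `EXAMPLE_EDGES` is a hand-written, hypothetical set of links for the sentence "Pulse is 72 and weight is 150", and the unit edge weights are an assumption (the slides do not say how links are weighted).

```python
import heapq
from collections import defaultdict


def shortest_distances(edges, source):
    """Dijkstra over an undirected weighted graph given as (a, b, weight) triples."""
    graph = defaultdict(list)
    for a, b, weight in edges:
        graph[a].append((b, weight))
        graph[b].append((a, weight))
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, weight in graph[node]:
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist


def associate(edges, concepts, numbers):
    """Assign each number to the concept at the smallest graph distance."""
    result = {}
    for number in numbers:
        dist = shortest_distances(edges, number)
        reachable = [c for c in concepts if c in dist]
        if reachable:
            result[number] = min(reachable, key=lambda c: dist[c])
    return result


# Hypothetical links for "Pulse is 72 and weight is 150" (unit weights assumed).
EXAMPLE_EDGES = [
    ("pulse", "is#1", 1), ("is#1", "72", 1), ("is#1", "and", 1),
    ("and", "is#2", 1), ("weight", "is#2", 1), ("is#2", "150", 1),
]

print(associate(EXAMPLE_EDGES, ["pulse", "weight"], ["72", "150"]))
```

In this toy graph, "72" is two links from "pulse" and four from "weight", so it is attached to "pulse".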

9
Approach to Medical Terms (1)

State of the Art
Current NER algorithms do not work well for
medical term identification.
An ontology is important for achieving high accuracy in
medical term extraction.
Searching the ontology for every possible word
sequence in a sentence is not efficient.
Solution
POS-based Ordered Patterns Search

10
Approach to Medical Terms (2)

Flow
Part of speech tagging
Ordered pattern matching, for example:
JJ NN NN
NN NN
JJ NN
NN
Normalization of the candidate term.
Search candidate term through Ontology (e.g.
UMLS).
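The flow above can be sketched as follows. The input is assumed to be already POS-tagged, `ONTOLOGY` is a tiny hypothetical stand-in for a UMLS lookup, and trying the patterns longest-first at each token position is one plausible reading of "ordered"; none of these details are confirmed by the slides.

```python
# Patterns in the order given on the slide (longest first).
PATTERNS = [("JJ", "NN", "NN"), ("NN", "NN"), ("JJ", "NN"), ("NN",)]

# Hypothetical toy ontology standing in for a UMLS lookup.
ONTOLOGY = {"congestive heart failure", "heart failure", "diabetes"}


def normalize(words):
    """Assumed normalization: lowercase and join with single spaces."""
    return " ".join(word.lower() for word in words)


def extract_terms(tagged):
    """tagged: list of (word, pos_tag) pairs. Returns ontology terms in order."""
    terms = []
    i = 0
    while i < len(tagged):
        for pattern in PATTERNS:
            window = tagged[i:i + len(pattern)]
            if tuple(tag for _, tag in window) == pattern:
                candidate = normalize(word for word, _ in window)
                if candidate in ONTOLOGY:
                    terms.append(candidate)
                    i += len(pattern)  # skip past the matched term
                    break
        else:
            i += 1  # no pattern produced an ontology hit at this position
    return terms


EXAMPLE = [
    ("History", "NN"), ("of", "IN"), ("congestive", "JJ"), ("heart", "NN"),
    ("failure", "NN"), ("and", "CC"), ("diabetes", "NN"),
]
print(extract_terms(EXAMPLE))
```

Because the three-tag pattern is tried first, "congestive heart failure" is returned rather than the shorter "heart failure", and only a handful of candidate strings are looked up instead of every word sequence.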

11
Approach to Categorical Values (1)

Available Methods
Analytic approach
Machine learning
Decision tree is frequently used in natural
language understanding
Examples
Each patient is either current smoker, former
smoker, or nonsmoker.
Texts
She quitted smoking five years ago (former)
She is currently a smoker (current)
None (never)

12
Approach to Categorical Values (2)

Word-based Boolean Feature Extraction
Choose one or more parts of speech: verb,
noun, adjective, and adverb.
Choose one or more sentence constituents:
subject, verb, object, and supplement.
Head noun or head adjective only. If this option
is enabled, only the head word of a noun phrase or
adjective phrase is extracted.
Use the lemma (uninflected form) of each word. If this
option is enabled, "denies", "denied", and "deny"
will be treated as the same feature.
ID3-based Decision Tree
The criterion for feature selection is maximum
information gain (mutual information).
ID3 yields fewer features than other algorithms.
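The selection criterion can be illustrated with a small information-gain computation over Boolean word features. This shows only the root-split step of ID3, not a full tree builder, and the six lemma sets in `SAMPLES` are invented for illustration rather than drawn from the patient records.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(samples, labels, feature):
    """Gain from splitting on whether `feature` is present in each sample."""
    present = [label for s, label in zip(samples, labels) if feature in s]
    absent = [label for s, label in zip(samples, labels) if feature not in s]
    remainder = sum(
        len(part) / len(labels) * entropy(part) for part in (present, absent) if part
    )
    return entropy(labels) - remainder


def best_feature(samples, labels):
    """Feature with maximum information gain (ties broken alphabetically)."""
    features = sorted(set().union(*samples))
    return max(features, key=lambda f: information_gain(samples, labels, f))


# Hypothetical lemma sets; each set holds the Boolean features that are true.
SAMPLES = [
    {"quit", "smoke", "year"}, {"smoke", "currently"}, {"deny", "smoke"},
    {"quit", "tobacco"}, {"smoke", "pack", "day"}, {"deny", "tobacco"},
]
LABELS = ["former", "current", "never", "former", "current", "never"]

print(best_feature(SAMPLES, LABELS))
print(round(information_gain(SAMPLES, LABELS, "smoke"), 3))
```

In this toy data "smoke" appears in every class and so carries little information, while "quit" and "deny" each isolate one class cleanly.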

13
Approach to Categorical Values (3)

Example ID3-based Decision Tree for
Classification of Smoking Behavior.

14
Implementation

15
Evaluation

50 semi-structured patient records
The goal is to extract 24 attributes (18 fields):
4 medical terms, 8 numbers, and 12 categorical
attributes.
Measures
Precision is the proportion of extracted
instances that are correct.
Recall is the proportion of all true instances
that are correctly extracted.
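As a small worked example of these two measures, the sketch below compares a set of extracted instances with a gold-standard set. The tuple layout (record, attribute, value) and the sample values are hypothetical.

```python
def precision_recall(extracted, gold):
    """Precision and recall of extracted instances against gold instances."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical (record, attribute, value) instances.
gold = {("r1", "pulse", "72"), ("r1", "weight", "155"),
        ("r2", "pulse", "80"), ("r2", "weight", "160")}
extracted = {("r1", "pulse", "72"), ("r1", "weight", "150"), ("r2", "pulse", "80")}

print(precision_recall(extracted, gold))
```

Here two of the three extracted instances are correct (precision 2/3) and two of the four true instances are found (recall 1/2).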

16
Evaluation of Numeric Attributes

Precision and recall are 100% for all eight
numeric attributes.
By examining all 50 records manually, we find
that the extremely high precision is due in part
to the very consistent writing style.
If the data set grows and more diverse
writing styles are introduced, performance
may degrade.

17
Evaluation of Smoking Behavior

45 cases: 5 former smokers, 12 current smokers,
and 28 nonsmokers.
5-fold cross-validation
Run experiments for 10 rounds. (For each round, the
data set is randomly shuffled.)
Average precision (recall) is 92.2%.
The number of features used ranges from 4 to 7.
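The evaluation protocol (shuffle, split into five folds, repeat for ten rounds, average) can be sketched as below. The `train` and `predict` callables are placeholders for the ID3 classifier, the score is plain accuracy (an assumption, since the slide reports a single precision/recall figure), and the majority-class baseline is included only so the sketch runs end to end.

```python
import random
from collections import Counter


def k_fold_indices(n, k, rng):
    """Shuffle indices 0..n-1 and deal them into k folds."""
    indices = list(range(n))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]


def repeated_cross_validation(samples, labels, train, predict, k=5, rounds=10, seed=0):
    """Mean held-out accuracy over `rounds` reshuffled k-fold runs."""
    rng = random.Random(seed)
    accuracies = []
    for _ in range(rounds):
        for test_idx in k_fold_indices(len(samples), k, rng):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(samples)) if i not in held_out]
            model = train([samples[i] for i in train_idx], [labels[i] for i in train_idx])
            correct = sum(predict(model, samples[i]) == labels[i] for i in test_idx)
            accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)


# Hypothetical stand-in classifier: always predict the most common training label.
def train_majority(train_x, train_y):
    return Counter(train_y).most_common(1)[0][0]


def predict_majority(model, sample):
    return model


# Class counts from the slide; the samples themselves are dummy placeholders.
labels = ["former"] * 5 + ["current"] * 12 + ["nonsmoker"] * 28
samples = list(range(len(labels)))
print(repeated_cross_validation(samples, labels, train_majority, predict_majority))
```

With these class counts, always guessing "nonsmoker" is right for 28 of the 45 cases, which gives a floor against which the reported 92.2% can be read.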

18
Evaluation of Medical Terms (1)

Each attribute can have multiple values (medical
terms).
Precision and recall are computed from the following
per-subject counts:
ETrue_i: number of extracted true terms in the i-th
subject.
ETotal_i: number of extracted terms in the i-th
subject.
TInst_i: number of total true terms in the i-th
subject.
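The formula itself did not survive in the transcript. One plausible reading of the three counts, assuming they are pooled over all subjects before dividing (micro-averaging), is:

```latex
\[
\text{Precision} = \frac{\sum_i \mathrm{ETrue}_i}{\sum_i \mathrm{ETotal}_i},
\qquad
\text{Recall} = \frac{\sum_i \mathrm{ETrue}_i}{\sum_i \mathrm{TInst}_i}
\]
```

An alternative reading averages the per-subject ratios instead; the transcript does not indicate which was used.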

19
Evaluation of Medical Terms (2)

Falsely extracted terms and missed true terms
are mainly caused by the incompleteness of the domain
ontology.
The low recall for predefined past surgical
history and the low precision for other past surgical
history both stem from failing to recognize
synonyms of predefined surgical terms, which are
then wrongly recorded as other surgical
terms.

20
Future Work