Title: Information Extraction from Semistructured Patient Records
1Information Extraction from Semi-structured
Patient Records
- Davis Zhou
- College of Information Science and Technology
Drexel University
2Agenda
- Problem Addressed
- Methods
- Approach to numeric values
- Approach to medical terms
- Approach to categorical values
- Implementation
- Evaluation
- Future Work
3Problem Addressed
- Descriptions
- Automatically extract information from
semi-structured patient records.
- Three types of information
- Numbers: blood pressure, weight, pulse, etc.
- Medical terms: past medical history
- Classifications: smoking behavior, alcohol use,
appearance, etc.
- Each record consists of multiple sections
beginning with fixed strings. Each section is
written in natural language.
4Problem Addressed (cont.)
5Problem Addressed (cont.)
6Approach to Numeric Values (1)
- Number Identification
- Tokenization
- Named Entity Recognition
- Concept Identification
- String Match
- Synonym Expansion
- Association
- Pattern-based approach
- Linkage-based approach (our approach)
7Approach to Numeric Values (2)
- Pattern Approach
- Examples
- CONCEPT is NUMBER
- CONCEPT of NUMBER
- CONCEPT, NUMBER
- CONCEPT NUMBER
- Very simple, but generalizes poorly.
- Linkage-based Association Approach
- Convert the linkage diagram to a graph
- Calculate the shortest distance between every
concept-number pair in a sentence.
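The pattern approach listed on this slide can be sketched with regular expressions. This is a minimal illustration, not the presented implementation: the concept list, the number format, and the function name `extract_by_pattern` are assumptions made for the example.

```python
import re

# Hypothetical concept vocabulary; the slides do not list the real one.
CONCEPTS = ["blood pressure", "weight", "pulse"]

# Assumed number format: integer, decimal, or a ratio such as 120/80.
NUMBER = r"(?P<number>\d+(?:\.\d+)?(?:/\d+)?)"
CONCEPT = r"(?P<concept>" + "|".join(re.escape(c) for c in CONCEPTS) + r")"

# One regex per surface pattern from the slide, tried in this order.
PATTERNS = [
    re.compile(CONCEPT + r"\s+is\s+" + NUMBER, re.IGNORECASE),  # CONCEPT is NUMBER
    re.compile(CONCEPT + r"\s+of\s+" + NUMBER, re.IGNORECASE),  # CONCEPT of NUMBER
    re.compile(CONCEPT + r"\s*,\s*" + NUMBER, re.IGNORECASE),   # CONCEPT, NUMBER
    re.compile(CONCEPT + r"\s+" + NUMBER, re.IGNORECASE),       # CONCEPT NUMBER
]

def extract_by_pattern(sentence):
    """Return {concept: number_string} for every pattern hit in the sentence."""
    found = {}
    for pattern in PATTERNS:
        for match in pattern.finditer(sentence):
            # Keep the first value found for each concept.
            found.setdefault(match.group("concept").lower(), match.group("number"))
    return found

result = extract_by_pattern("Blood pressure is 120/80, pulse 72, weight of 150.")
missed = extract_by_pattern("Her weight was measured at 150 pounds.")
```

The second call returns an empty dictionary because no listed pattern covers "was measured at", which illustrates the generalization problem noted above.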
8Approach to Numeric Values (3)
- Link Grammar Parser
- Convert each word to a node and each link to a (weighted) edge
- Assume that if a number is the value of a certain
concept, the number's shortest distance from that
concept must be less than from any other concept
in the sentence.
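The shortest-distance assumption can be sketched as follows. The edges are hand-written for illustration and are not real Link Grammar Parser output; the unit weights and the helper names (`build_graph`, `shortest_distances`, `associate`) are likewise assumptions.

```python
import heapq

def build_graph(edges):
    """Build an undirected weighted graph from (node_a, node_b, weight) triples."""
    graph = {}
    for a, b, weight in edges:
        graph.setdefault(a, {})[b] = weight
        graph.setdefault(b, {})[a] = weight
    return graph

def shortest_distances(graph, source):
    """Dijkstra's algorithm: distance from source to every reachable node."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for neighbor, weight in graph.get(node, {}).items():
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return dist

def associate(graph, concepts, numbers):
    """Assign each number to the concept at the smallest graph distance."""
    assignment = {}
    for number in numbers:
        dist = shortest_distances(graph, number)
        reachable = [c for c in concepts if c in dist]
        if reachable:
            assignment[number] = min(reachable, key=lambda c: dist[c])
    return assignment

# Illustrative links for "weight is 150 and pulse is 72" (assumed, unit weights).
edges = [
    ("weight", "is_1", 1), ("is_1", "150", 1),
    ("is_1", "and", 1), ("and", "is_2", 1),
    ("pulse", "is_2", 1), ("is_2", "72", 1),
]
graph = build_graph(edges)
pairs = associate(graph, ["weight", "pulse"], ["150", "72"])
```

In this sketch a tie between two concepts is broken by list order; the slide's rule requires a strictly smaller distance, so a real system would need to decide how to handle ties.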
9Approach to Medical Terms (1)
- State of the Art
- Current NER algorithms don't work well for
medical term identification.
- An ontology is important for achieving high accuracy in
medical term extraction.
- Searching the ontology for every combination of word
sequences in a sentence is not efficient.
- Solution
- POS-based Ordered Patterns Search
10Approach to Medical Terms (2)
- Flow
- Part of speech tagging
- Ordered Patterns Matching, for example
- JJ NN NN
- NN NN
- JJ NN
- NN
- Normalization of the candidate term.
- Search candidate term through Ontology (e.g.
UMLS).
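The flow above can be sketched on pre-tagged input. This is an illustration under stated assumptions: the tags are supplied by hand rather than by a tagger, normalization is reduced to lowercasing, and `TOY_ONTOLOGY` is a made-up stand-in for a real ontology lookup such as UMLS.

```python
# Ordered patterns from the slide, longest first.
PATTERNS = [("JJ", "NN", "NN"), ("NN", "NN"), ("JJ", "NN"), ("NN",)]

# Hypothetical stand-in for an ontology lookup.
TOY_ONTOLOGY = {"congestive heart failure", "hypertension", "diabetes"}

def normalize(words):
    """Assumed normalization: lowercase and join with single spaces."""
    return " ".join(word.lower() for word in words)

def extract_terms(tagged, ontology=TOY_ONTOLOGY):
    """Scan left to right, trying each POS pattern in order at every position."""
    terms = []
    i = 0
    while i < len(tagged):
        matched = False
        for pattern in PATTERNS:
            window = tagged[i:i + len(pattern)]
            if tuple(tag for _, tag in window) != pattern:
                continue
            candidate = normalize(word for word, _ in window)
            if candidate in ontology:
                terms.append(candidate)
                i += len(pattern)  # skip past the matched term
                matched = True
                break
        if not matched:
            i += 1
    return terms

tagged = [
    ("History", "NN"), ("of", "IN"), ("congestive", "JJ"), ("heart", "NN"),
    ("failure", "NN"), ("and", "CC"), ("Hypertension", "NN"),
]
terms = extract_terms(tagged)
```

Only word sequences whose tags fit one of the patterns are looked up, which avoids searching every possible subsequence of the sentence.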
11Approach to Categorical Values (1)
- Available Methods
- Analytic approach
- Machine learning
- Decision trees are frequently used in natural
language understanding.
- Examples
- Each patient is either a current smoker, a former
smoker, or a nonsmoker.
- Texts
- "She quit smoking five years ago" (former)
- "She is currently a smoker" (current)
- "None" (never)
12Approach to Categorical Values (2)
- Word-based Boolean Feature Extraction
- Choose one or multiple parts of speech: verb,
noun, adjective, and adverb.
- Choose one or multiple sentence constituents:
subject, verb, object, and supplement.
- Head noun or head adjective only. If this option
is enabled, only the head word of a noun phrase or
adjective phrase is extracted.
- Use the lemma (uninflected form) of each word. If this
option is enabled, "denies", "denied", and "deny"
are treated as the same feature.
- ID3-based Decision Tree
- The criterion for feature selection is maximum
information gain (mutual information).
- ID3 yields fewer features than other algorithms.
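The information-gain criterion over word-based Boolean features can be sketched as below. The lemma sets are written by hand (no parser or lemmatizer is run), the fourth to sixth samples are invented for illustration, and the function names are assumptions. Only the selection of a single split is shown, not a full ID3 tree.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(samples, labels, feature):
    """Gain from splitting on the Boolean feature 'lemma is present'."""
    present = [y for x, y in zip(samples, labels) if feature in x]
    absent = [y for x, y in zip(samples, labels) if feature not in x]
    remainder = 0.0
    for subset in (present, absent):
        if subset:
            remainder += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - remainder

def best_feature(samples, labels):
    """Pick the lemma with maximum information gain (ties go to the first in sorted order)."""
    vocabulary = sorted(set().union(*samples))
    return max(vocabulary, key=lambda f: information_gain(samples, labels, f))

# Each sample is the set of lemmas extracted from one record's text.
samples = [
    {"quit", "smoke", "year"},      # "She quit smoking five years ago"
    {"be", "currently", "smoker"},  # "She is currently a smoker"
    {"none"},                       # "None"
    {"quit", "smoke"},              # invented example
    {"smoker"},                     # invented example
    {"deny", "smoke"},              # invented example
]
labels = ["former", "current", "never", "former", "current", "never"]
root = best_feature(samples, labels)
```

ID3 would apply the same selection recursively to each branch until the labels in a branch are pure or no features remain.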
13Approach to Categorical Values (3)
- Example ID3-based Decision Tree for
Classification of Smoking Behavior.
14Implementation
15Evaluation
- 50 semi-structured patient records
- The goal is to extract 24 attributes (18 fields):
4 medical terms, 8 numbers, and 12 categorical
attributes.
- Measures
- Precision is the proportion of extracted instances
that are correct.
- Recall is the proportion of all true instances
that are correctly extracted.
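For a single-valued attribute, the two measures can be computed as in this small sketch. The record ids, values, and the function name `precision_recall` are made up for illustration.

```python
def precision_recall(extracted, gold):
    """Both arguments map record id -> value; a missing key means no value."""
    correct = sum(1 for rid, value in extracted.items() if gold.get(rid) == value)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Illustrative data: four records with a true weight, three extractions.
gold = {"r1": "150", "r2": "162", "r3": "140", "r4": "155"}
extracted = {"r1": "150", "r2": "160", "r3": "140"}
p, r = precision_recall(extracted, gold)
```

Here two of three extractions are correct (precision 2/3) and two of four true instances are recovered (recall 1/2).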
16Evaluation of Numeric Attributes
- Precision and recall for all eight numeric
attributes are 100%.
- By examining all 50 records manually, we find
that the extremely high precision is in part
attributable to the very consistent writing style.
- If the data set grows and more diverse
writing styles are introduced, performance
may degrade.
17Evaluation of Smoking Behavior
- 45 cases: 5 former smokers, 12 current smokers,
and 28 nonsmokers.
- 5-fold cross-validation
- Run experiments for 10 rounds. (For each round, the
data set is randomly shuffled.)
- Average precision (recall) is 92.2%.
- The number of features used ranges from 4 to 7.
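The evaluation protocol (shuffle, split into 5 folds, repeat for 10 rounds) can be sketched as below. How scores were averaged is not stated on the slide, so the plain mean over all folds and rounds is an assumption, and `evaluate` is a hypothetical callback that would train on one index list and score on the other.

```python
import random

def k_fold_indices(n_samples, k, rng):
    """Shuffle the indices, then yield (train, test) index lists for each fold."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def repeated_cv(n_samples, k, rounds, evaluate, seed=0):
    """Average evaluate(train, test) over every fold of every round."""
    rng = random.Random(seed)
    scores = []
    for _ in range(rounds):  # the data is reshuffled in each round
        for train, test in k_fold_indices(n_samples, k, rng):
            scores.append(evaluate(train, test))
    return sum(scores) / len(scores)

# With 45 cases and 5 folds, every test fold holds 9 cases.
splits = list(k_fold_indices(45, 5, random.Random(0)))
mean_test_size = repeated_cv(45, 5, 10, lambda train, test: len(test))
```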
18Evaluation of Medical Terms (1)
- Each attribute can have multiple values (medical
terms).
- Where
- ETrue_i: number of extracted true terms in the i-th
subject.
- ETotal_i: number of extracted terms in the i-th
subject.
- TInst_i: number of total true terms in the i-th
subject.
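The formulas that use these three per-subject counts are not preserved in this text. One plausible reading, shown here purely as an assumption, is a micro-average that sums each count over all subjects before dividing; a per-subject (macro) average would be an equally possible alternative. The function name and the sample counts are illustrative.

```python
def micro_precision_recall(per_subject):
    """per_subject: list of (e_true, e_total, t_inst) tuples, one per subject.

    e_true  = extracted true terms, e_total = all extracted terms,
    t_inst  = all true terms for that subject.
    Assumed aggregation: sum over subjects, then divide.
    """
    sum_true = sum(e_true for e_true, _, _ in per_subject)
    sum_extracted = sum(e_total for _, e_total, _ in per_subject)
    sum_instances = sum(t_inst for _, _, t_inst in per_subject)
    precision = sum_true / sum_extracted if sum_extracted else 0.0
    recall = sum_true / sum_instances if sum_instances else 0.0
    return precision, recall

# Illustrative counts for three subjects; the third has no terms at all.
counts = [(2, 3, 2), (1, 1, 3), (0, 0, 0)]
precision, recall = micro_precision_recall(counts)
```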
19Evaluation of Medical Terms (2)
- Extracted false terms and unextracted true terms
are mainly caused by the incompleteness of the domain
ontology.
- The low recall for predefined past surgical
history and the low precision for other past surgical
history are due to failure to recognize
synonyms of predefined surgical terms, which are then
improperly recognized as other surgical
terms.
20Future Work
- Test our work on a larger data set
- Medical Terms Extraction
- Ontology selection
- Use of synonyms
- Text Classification
- How to deal with categories containing numeric
threshold information
21Questions