Medical Document Categorization - PowerPoint PPT Presentation

1

Medical Document Categorization
Using a Priori Knowledge

L. Itert (1,2), W. Duch (2,3), J. Pestian (1)
(1) Department of Biomedical Informatics, Children's Hospital Research Foundation, Cincinnati, OH, USA
(2) Department of Informatics, Nicolaus Copernicus University, Torun, Poland
(3) School of Computer Engineering, Nanyang Technological University, Singapore
ICANN 2005, Warsaw, 10-14 Sept. 2005
2
Outline

Goals & questions
Medical data
Data preparation
Model of similarity
Computational experiments and results

3
Goals & Questions

What are the key clinical descriptors for a given disease?
In what sense are the records describing patients with the same disease similar?
Can we capture experts' intuition in evaluating document similarity and diversity?
Include a priori knowledge in document categorization; this is especially important for rare diseases.
Use the UMLS ontology and NLM lexical tools.

4
Example of a clinical discharge summary

Jane is a 13yo WF who presented with CF
bronchopneumonia. She has noticed increasing
cough, greenish sputum production, and fatique
since prior to 12/8/03. She had 2 febrile
epsiodes, but denied any nausea, vomiting,
diarrhea, or change in appetite. Upon admission
she had no history of diabetic or liver
complications. Her FEV1 was 73% 12/8 and she was
treated with 2 z-paks, and on 12/29 FEV1 was 72%
at which time she was started on Cipro. She noted
no clinical improvement and was admitted for a 2
week IV treatment of Tobramycin and Meropenem.

5
Unified Medical Language System (UMLS)

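Semantic types are linked by semantic relations, e.g.
"Virus" causes "Disease or Syndrome"
("Virus" and "Disease or Syndrome" are semantic types; "causes" is a semantic relation)
Other relations: interacts with, contains, consists of, result of, related to, ...
Other types: Body location or region, Injury or Poisoning, Diagnostic procedure, ...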

6
UMLS Example (keyword virus)

Metathesaurus
Concept: Virus; CUI: C0042776; Semantic Type: Virus
Definition (1 of 3): Group of minute infectious agents characterized by a lack of independent metabolism and by the ability to replicate only within living host cells; have capsid, may have DNA or RNA (not both). (CRISP Thesaurus)
Synonyms: Virus, Vira, Viridae
Semantic Network
"Virus" causes "Disease or Syndrome"

7
Data
                   Clinical data                          Reference data
Disease name       No. of records   Avg. size (bytes)     Size (bytes)
Pneumonia          609              1451                  23583
Asthma             865              1282                  36720
Epilepsy           638              1598                  19418
Anemia             544              2849                  14282
UTI                298              1587                  13430
JRA                41               1816                  27024
Cystic fibrosis    283              1790                  7958
Cerebral palsy     177              1597                  35348
Otitis media       493              1420                  32416
Gastroenteritis    586              1375                  9906
JRA: Juvenile Rheumatoid Arthritis; UTI: Urinary tract infection
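As a quick consistency check, the record counts in the table above can be summed and compared with the figures quoted on the statistics slide (4534 vectors, majority baseline for the asthma class). The sketch below uses only the counts from the table; the variable names are illustrative.

```python
# Record counts per disease, copied from the data table.
records = {
    "Pneumonia": 609, "Asthma": 865, "Epilepsy": 638, "Anemia": 544,
    "UTI": 298, "JRA": 41, "Cystic fibrosis": 283, "Cerebral palsy": 177,
    "Otitis media": 493, "Gastroenteritis": 586,
}

total = sum(records.values())
majority_class = max(records, key=records.get)

# Accuracy (in %) of always predicting the most frequent class.
majority_share = 100.0 * records[majority_class] / total

print(total)                     # 4534
print(majority_class)            # Asthma
print(round(majority_share, 1))  # 19.1
```

The total and the majority share agree with the values reported under "Data - statistics".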
8
Data processing/preparation
MMTx discovers UMLS concepts in text.
Reference texts -> MMTx -> UMLS concepts (feature prototypes) -> filtering (focus on 26 semantic types) -> features (UMLS concept IDs)
Clinical documents -> MMTx -> UMLS concepts -> filtering using the existing feature space -> final data
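The two processing paths on this slide can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: MMTx itself is not called, its output is modelled as hypothetical (concept ID, semantic type) pairs, the set of allowed types stands in for the 26 selected semantic types, and every concept ID except C0042776 (Virus) is a made-up placeholder.

```python
from collections import Counter

# Stand-in for the 26 selected semantic types (assumption: only two are listed here).
ALLOWED_TYPES = {"Virus", "Disease or Syndrome"}

def build_feature_space(concepts, allowed_types):
    """Reference path: keep concepts of allowed types, map each ID to a column index."""
    cuis = sorted({cui for cui, sem_type in concepts if sem_type in allowed_types})
    return {cui: idx for idx, cui in enumerate(cuis)}

def vectorize(doc_concepts, feature_index):
    """Clinical path: sparse tf vector {column: count}, restricted to the existing space."""
    counts = Counter(cui for cui, _ in doc_concepts if cui in feature_index)
    return {feature_index[cui]: n for cui, n in counts.items()}

# Hypothetical concepts extracted from reference texts.
reference_concepts = [
    ("C0042776", "Virus"),
    ("C_PLACEHOLDER_1", "Disease or Syndrome"),
    ("C_PLACEHOLDER_2", "Other (excluded)"),
]
# Hypothetical concepts extracted from one clinical document.
document_concepts = [
    ("C0042776", "Virus"),
    ("C0042776", "Virus"),
    ("C_PLACEHOLDER_1", "Disease or Syndrome"),
    ("C_PLACEHOLDER_3", "Disease or Syndrome"),  # not in the reference space, dropped
]

feature_index = build_feature_space(reference_concepts, ALLOWED_TYPES)
doc_vector = vectorize(document_concepts, feature_index)
print(feature_index)  # {'C0042776': 0, 'C_PLACEHOLDER_1': 1}
print(doc_vector)     # {0: 2, 1: 1}
```

Storing only the non-zero counts keeps the representation compact when most concepts are absent from a given document.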
9
Semantic types used
Values indicate the actual numbers of concepts found in (I) clinical texts and (II) reference texts.
10
Data - statistics

10 classes
4534 vectors
807 features (out of 1097 found in reference texts)
Baseline:
Majority: 19.1% (asthma class)
Content-based: 34.6% (frequency of class name in text)
Remarks:
Very sparse vectors
Feature values represent term frequency (tf), i.e. the number of occurrences of a particular concept in the text

11
Model of similarity I

Intuitions
Initial distance between document D and the reference vectors $R_k$ should be proportional to $d_{0k} = \|D - R_k\| \propto 1/p(C_k) - 1$.
If a term i appears in $R_k$ with frequency $R_{ik} > 0$ but does not appear in D, the distance $d(D,R_k)$ should increase by $\Delta_{ik} = a_1 R_{ik}$.
If a term i does not appear in $R_k$ but has non-zero frequency $D_i$, the distance $d(D,R_k)$ should increase by $\Delta_{ik} = a_2 D_i$.
If a term i appears with frequency $R_{ik} > D_i > 0$ in both vectors, the distance $d(D,R_k)$ should decrease by $\Delta_{ik} = -a_3 D_i$.
If a term i appears with frequency $0 < R_{ik} \le D_i$ in both vectors, the distance $d(D,R_k)$ should decrease by $\Delta_{ik} = -a_4 R_{ik}$.
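The intuitions on this slide can be turned into a small distance function. The sketch below is one possible reading, not the authors' code: it assumes the per-term contributions are simply added to the initial distance, that the proportionality constant in the initial distance is 1, and that the last case covers document frequencies greater than or equal to the reference frequency. The term names, frequencies and parameter values are made up.

```python
def term_contribution(r, d, a1, a2, a3, a4):
    """Change in d(D, R_k) caused by one term with reference freq r and document freq d."""
    if r > 0 and d == 0:
        return a1 * r    # expected for the class but missing in the document
    if r == 0 and d > 0:
        return a2 * d    # present in the document but not typical for the class
    if r > d > 0:
        return -a3 * d   # present in both, reference frequency is larger
    if 0 < r <= d:
        return -a4 * r   # present in both, document frequency is at least as large
    return 0.0           # absent from both vectors

def distance(doc, ref, prior, params):
    """Initial distance 1/p(C_k) - 1 plus the sum of per-term contributions."""
    d0 = 1.0 / prior - 1.0
    terms = set(doc) | set(ref)
    return d0 + sum(term_contribution(ref.get(t, 0), doc.get(t, 0), *params) for t in terms)

# Toy sparse vectors {term: frequency} and illustrative parameters (a1, a2, a3, a4).
ref = {"t1": 2, "t2": 1, "t3": 3}
doc = {"t2": 4, "t3": 1, "t4": 2}
params = (0.5, 0.25, 1.0, 1.0)

result = distance(doc, ref, prior=0.5, params=params)
print(result)  # 0.5
```

Here t1 adds 1.0, t4 adds 0.5, and t2 and t3 each subtract 1.0, so the initial distance of 1.0 drops to 0.5.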

12
Model of Similarity II
Given the document D, a reference vector $R_k$ and probability $p(i|C_k)$, the probability that the class of D is $C_k$ should be proportional to
[formula not preserved in the transcript]
where $\Delta_{ik}$ depends on adaptive parameters $a_1, \ldots, a_4$, which may be specific for each class. A linear programming technique can be used to estimate $a_i$ by maximizing similarity between documents and reference vectors
[formula not preserved in the transcript]
with the constraints
[formula not preserved in the transcript]
where k indicates the correct class.
13
Results
Method            M0          M1          M2          M3          M4           M5
kNN               48.9        50.2        51.0        51.4        49.5         49.5
SSV               39.5        40.6        31.0        39.5        39.5         42.3
MLP (300 neur.)   66.0        56.5        60.7        63.2        72.3         71.0
SVM (C opt.)      59.3 (1.0)  60.4 (0.1)  60.9 (0.1)  60.5 (0.1)  59.8 (0.01)  60.0 (0.01)
10 Ref. vectors   71.6        -           71.4        71.3        70.7         70.1
10-fold crossvalidation accuracies in % for different feature weightings. M0: tf frequencies; M1: binary data.
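To make the notion of feature weighting concrete, the sketch below applies three weightings to toy sparse vectors: raw tf (M0), binary (M1), and a tf x idf variant of the kind mentioned in the conclusions. The transcript does not define M2 to M5, so no claim is made about which column, if any, corresponds to tf x idf. The idf formula log(N/df) is one common choice and is an assumption here, as are the concept names and counts.

```python
import math

# Toy corpus of sparse tf vectors {concept: count}; names and counts are made up.
docs = [
    {"c1": 3, "c2": 1},
    {"c1": 1, "c3": 2},
    {"c1": 2, "c2": 4},
    {"c4": 1},
]

def binary(vec):
    """M1-style weighting: 1 if the concept occurs at all."""
    return {term: 1 for term in vec}

def idf_table(corpus):
    """Inverse document frequency log(N / df) for every concept in the corpus."""
    n_docs = len(corpus)
    df = {}
    for vec in corpus:
        for term in vec:
            df[term] = df.get(term, 0) + 1
    return {term: math.log(n_docs / count) for term, count in df.items()}

def tf_idf(vec, idf):
    """Scale each raw count by the idf of its concept."""
    return {term: tf * idf[term] for term, tf in vec.items()}

idf = idf_table(docs)
print(docs[0])               # raw tf (M0)
print(binary(docs[0]))       # binary (M1)
print(tf_idf(docs[0], idf))  # tf x idf (assumed variant)
```

A concept that occurs in most documents, such as c1 here, receives a small idf and is therefore down-weighted relative to rarer concepts.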
14
Conclusions
Medical texts contain a large number of rare, specific concepts.
Vector representation using standard tf x idf weighting leads to poor results.
A priori knowledge was introduced using a single reference vector (this certainly needs improvement).
Expert intuitions were formalized in a model to measure similarity of texts, with only 4 parameters per class.
Linear programming has been used to optimize the parameters.
Results are quite encouraging.
Finding the best set of reference vectors and similarity measures for medical documents is an interesting challenge.
