Title: Literature Based Discovery
1Literature Based Discovery
- Dimitar Hristovski dimitar.hristovski_at_mf.uni-lj.si
- Institute of Biomedical Informatics, Faculty of Medicine, University of Ljubljana, Ljubljana, Slovenia
2Let me introduce myself: Research and Development
- BS Biomedicina Slovenica database
- Research Evaluation Decision Support System
- Medical Information Systems
- Surgical clinics
- Genetic laboratory
- Biochemical laboratory
- Web User Behaviour Analysis
- Data warehousing and OLAP
3Motivation
- Overspecialization
- Information overload
- Large databases
- For many diseases the chromosomal region known,
but not the exact gene
4Background
- Literature-based discovery (Swanson)
New Relation?
Concept X(Disease)
Concepts Y (Pathological or Cell Function, ...)
Concepts Z(Genes)
5Biomedical Discovery Support System (BITOLA)
- Goal
- discover potentially new relations (knowledge) between biomedical concepts
- to be used as a research idea generator and/or as an alternative way to search Medline
- System user (researcher or intermediary)
- interactively guides the discovery process
- evaluates the proposed relations
6Extending and Enhancing Literature Based
Discovery
- Goal
- Make literature based discovery more suitable for disease candidate gene discovery
- Decrease the number of candidate relations
- Method
- Integrate background knowledge
- Chromosomal location of diseases and genes
- Gene expression location
- Disease manifestation location
7Usage Scenarios
- For a disease with known chromosomal location, find a candidate gene
- For a gene, find a disease that might be influenced
- For a disease and gene found to be related by linkage study, find the mechanism of the relation (intermediate concepts should help)
8System Overview
Knowledge Base
Concepts
Background Knowledge (Chromosomal Locations, ...)
Discovery Algorithm
Association Rules
User Interface
Knowledge Extraction
Databases (Medline, LocusLink, HUGO, OMIM, ...)
9Databases
- Medline: source of known relationships between biomedical concepts
- Set of concepts
- MeSH (Medical Subject Headings): controlled dictionary and thesaurus used for indexing and searching the Medline database
- HUGO: official gene symbols, names and aliases
- LocusLink: gene symbols, aliases and chromosomal locations
- OMIM: genetic diseases
- UMLS (Unified Medical Language System)
- Entrez: used to search PubMed, GenBank, ...
- UniGene: gene expression
10Knowledge Extraction
- Build master set of concepts (MeSH terms and gene symbols)
- Extract occurrence of concepts from each Medline record (MeSH terms from MH field, gene symbols from Title and Abstract)
- Association rule mining (concept co-occurrence)
- Chromosomal location extraction (from LocusLink and HUGO)
- Load into knowledge base
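The extraction and co-occurrence steps listed on this slide can be sketched roughly as follows. This is only an illustrative sketch, not the BITOLA implementation: the record layout (a dict with MH, TI and AB keys), the whole-token matching of gene symbols, and the symbols GENE1 and GENE2 are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

def extract_concepts(record, mesh_terms, gene_symbols):
    """Return the master-set concepts that occur in one record."""
    # MeSH terms are taken from the MH field.
    found = {term for term in record.get("MH", []) if term in mesh_terms}
    # Gene symbols are matched as whole tokens in title and abstract (assumption).
    text = record.get("TI", "") + " " + record.get("AB", "")
    tokens = {tok.strip(".,;:()[]") for tok in text.split()}
    return found | (tokens & gene_symbols)

def count_cooccurrences(records, mesh_terms, gene_symbols):
    """Count per-concept document frequency and per-pair co-occurrence frequency."""
    concept_freq = Counter()
    pair_freq = Counter()
    for record in records:
        concepts = extract_concepts(record, mesh_terms, gene_symbols)
        concept_freq.update(concepts)
        # Each unordered pair of concepts in the same record co-occurs once.
        pair_freq.update(combinations(sorted(concepts), 2))
    return concept_freq, pair_freq

# Toy records with hypothetical gene symbols, for illustration only.
mesh_terms = {"Epilepsy", "Cell Movement"}
gene_symbols = {"GENE1", "GENE2"}
records = [
    {"MH": ["Epilepsy"], "TI": "GENE1 mutations in patients", "AB": "We studied GENE1."},
    {"MH": ["Epilepsy", "Cell Movement"], "TI": "Neuronal migration", "AB": ""},
]
concept_freq, pair_freq = count_cooccurrences(records, mesh_terms, gene_symbols)
```

Plain token matching like this is exactly what produces the ambiguous symbols (CT, MR, CO2) reported later in the talk, which is why a disambiguation step is needed.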
11Terminology Problems during Knowledge Extraction
- Gene names
- Gene symbols
- MeSH and genetic diseases
12Detected Gene Symbols by Frequency
- type 666548
- II 552584
- III 201776
- component 179643
- CT 175973
- AT 151337
- ATP 147357
- IV 123429
- CD4 99657
- p53 89357
- MR 88682
- SD 85889
- GH 84797
- LPS 68982
- 59 67272
- E2 64616
- 82 63521
- AMP 61862
- TNF 59343
- RA 58818
- CD8 57324
- O2 56847
- ACTH 54933
- CO2 53171
- PKC 51057
- EGF 50483
- T3 49632
- MS 46813
- A2 44896
- ER 43212
- upstream 41820
- PRL 41599
13Gene Symbol Disambiguation
- Find MEDLINE docs in which we can expect to find gene symbols
- JD indexing (Susanne Humphrey) as possible solution
- Identifies the semantic context of docs
- If semantic context not genetic, then gene symbol probably false positive
- Example of false positive
- Ethics in a twist "Life Support", BBC1. BMJ 1999 Aug 7;319(7206):390
- breast basic conserved 1 (BBC1) gene vs. BBC1 television station featuring new drama series Life Support
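The filtering rule on this slide (drop gene symbols found in documents whose semantic context is not genetic) might look like the sketch below. The set of context labels treated as genetic and the function name are assumptions for illustration; in practice the context labels would come from JD indexing.

```python
# Context labels treated as "genetic" (illustrative assumption, not an official list).
GENETIC_CONTEXTS = {"Genetics", "Genetics, Medical", "Cytogenetics"}

def accept_gene_symbols(symbols, doc_contexts, genetic_contexts=GENETIC_CONTEXTS):
    """Keep detected gene symbols only if the document has a genetic context."""
    if genetic_contexts & set(doc_contexts):
        return set(symbols)
    # Non-genetic context: the detected symbols are probably false positives.
    return set()

# The BBC1 example: a television review versus a genetics paper (contexts are hypothetical).
tv_review = accept_gene_symbols({"BBC1"}, ["Ethics", "Mass Media"])
genetics_paper = accept_gene_symbols({"BBC1"}, ["Genetics, Medical"])
```

An all-or-nothing rule per document is the simplest choice; a score threshold per symbol would be a natural refinement.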
14JD Indexing
- JDs are 127 Journal Descriptors (e.g., JDs for journal Hum Mol Genet: Cytogenetics; Genetics, Medical)
- Training set docs (435,000) inherit JDs from journals
- Training set provides co-occurrence data between inherited JDs and
- indexing terms assigned to docs directly
- words in docs
- Docs having indexing terms/words occurring often with genetics JDs in training set assumed to have genetics context
- Extended to indexing by 134 UMLS semantic types (e.g. Gene or Genome, Gene Function, ...)
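The word-to-JD co-occurrence idea can be illustrated with a small sketch. It assumes a simple scoring scheme (the fraction of training docs containing a word that carry a given JD, averaged over the words of a new doc); the actual JD indexing method may differ in its details, and the toy training documents and the "Ethics" label are invented for the example.

```python
from collections import Counter, defaultdict

def train_word_jd_scores(training_docs):
    """training_docs: iterable of (words, jds); jds are inherited from the journal."""
    word_docs = Counter()            # number of training docs containing each word
    word_jd = defaultdict(Counter)   # number of those docs carrying each JD
    for words, jds in training_docs:
        for word in set(words):
            word_docs[word] += 1
            for jd in jds:
                word_jd[word][jd] += 1
    return {w: {jd: c / word_docs[w] for jd, c in word_jd[w].items()}
            for w in word_docs}

def score_document(words, scores, jd):
    """Average the word-level scores for one JD over the known words of a doc."""
    known = [w for w in words if w in scores]
    if not known:
        return 0.0
    return sum(scores[w].get(jd, 0.0) for w in known) / len(known)

# Toy training set (hypothetical documents).
training = [
    (["gene", "mutation", "linkage"], ["Genetics, Medical"]),
    (["gene", "expression"], ["Genetics, Medical"]),
    (["television", "drama", "ethics"], ["Ethics"]),
    (["ethics", "gene"], ["Ethics"]),
]
scores = train_word_jd_scores(training)
genetic_doc = score_document(["gene", "mutation"], scores, "Genetics, Medical")
tv_doc = score_document(["television", "drama"], scores, "Genetics, Medical")
```

A document would then be assumed to have a genetics context when its score for the genetics JDs exceeds some chosen threshold.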
15System Overview
Knowledge Base
Concepts
Background Knowledge (Chromosomal Locations, ...)
Discovery Algorithm
Association Rules
User Interface
Knowledge Extraction
Databases (Medline, LocusLink, HUGO, OMIM, ...)
16Binary Association Rules
- X → Y (confidence, support)
- If X Then Y (confidence, support)
- Confidence: % of docs containing Y within the X docs
- Support: number (or %) of docs containing both X and Y
- The nature of the relation between X and Y is not known.
- Examples
- Multiple Sclerosis → Optic Neuritis (2.02, 117)
- Multiple Sclerosis → Interferon-beta (5.17, 300)
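As a minimal sketch, confidence and support for a binary rule X → Y can be computed from the sets of document identifiers in which each concept occurs. The function name and the toy identifier sets are assumptions for illustration; confidence is expressed here as a percentage of the X docs, and support as a document count.

```python
def association_rule(docs_x, docs_y):
    """Return (confidence in %, support) for the rule X -> Y."""
    both = docs_x & docs_y
    support = len(both)  # docs containing both X and Y
    confidence = 100.0 * support / len(docs_x) if docs_x else 0.0
    return confidence, support

# Toy example: X occurs in 200 docs, 10 of which also contain Y.
docs_x = set(range(200))
docs_y = set(range(190, 260))
rule = association_rule(docs_x, docs_y)
```

Note that the rule is directional: Y → X uses the same support but divides by the number of Y docs, so its confidence generally differs.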
17Discovery Algorithm
Candidate Gene?
Concepts Y (Pathological or Cell Function, ...)
Concept X(Disease)
Concepts Z(Genes)
Chromosomal Region
Chromosomal Location
Match
Manifestation Location
Expression Location
Match
18Discovery Algorithm
- Let X be starting concept of interest.
- Find all Y for which X → Y.
- Find all Z for which Y → Z.
- Eliminate those Z for which X → Z already exists.
- Eliminate those Z that do not match the chromosomal region of X
- Eliminate those Z that do not match the expression location of X
- Remaining Z are candidates for new relation between X and Z.
- In general
- X → Y1 → ... → Yn → Z, but not X → Z
- Example
- X disease
- Y (patho)physiology of X
- Z (de)regulators of Y (drugs, proteins, genes)
- New relation example: Z is candidate gene for disease X
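The steps of the algorithm can be sketched for the single-intermediate case (X, Y, Z) as follows. This is a simplified illustration under stated assumptions, not the system's code: rules are held as a dict from a concept to the set of concepts it is associated with, the two background-knowledge checks are passed in as predicates, and all concept names and locations in the toy data are hypothetical.

```python
def find_candidates(x, rules, region_ok=lambda z: True, expression_ok=lambda z: True):
    """Return {Z: set of linking Y} for concepts Z not yet related to X."""
    known = rules.get(x, set())          # all Y with X -> Y
    candidates = {}
    for y in known:
        for z in rules.get(y, set()):    # all Z with Y -> Z
            if z == x or z in known:     # X -> Z already exists
                continue
            if not region_ok(z) or not expression_ok(z):
                continue                 # background knowledge does not match
            candidates.setdefault(z, set()).add(y)
    return candidates

# Toy knowledge base (hypothetical concepts, genes and locations).
rules = {
    "Disease X": {"Function Y1", "Function Y2", "GENE_K"},
    "Function Y1": {"GENE_A", "GENE_B", "GENE_K"},
    "Function Y2": {"GENE_A", "GENE_C"},
}
chrom = {"GENE_A": "Xq28", "GENE_B": "Xq28", "GENE_C": "1p36", "GENE_K": "Xq28"}
expressed_in = {"GENE_A": {"brain"}, "GENE_B": {"liver"},
                "GENE_C": {"brain"}, "GENE_K": {"brain"}}

result = find_candidates(
    "Disease X",
    rules,
    region_ok=lambda z: chrom.get(z) == "Xq28",
    expression_ok=lambda z: "brain" in expressed_in.get(z, set()),
)
```

Keeping the linking Y concepts for each surviving Z lets the user inspect why a candidate was proposed, which matches the interactive evaluation role described earlier in the talk.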
19Ranking Concepts Z
20Results: Concepts in Medline
- Full Medline (end 2001) analyzed (11,226,520 recs)
- Looking for 19,781 MeSH terms and 22,252 human genes (14,659 from HUGO and 7,593 from LocusLink). 24,613 alias gene symbols added
- Gene symbols found in 2,689,958 Medline recs. Most frequent: ambiguous symbols (CT, MR, CO2, ...) or format errors
21Results: Co-occurring Concepts in Medline
- 29,851,448 distinct pairs of co-occurring concepts
- In 7,106,099 at least one gene symbol appeared
- In 679,159 pairs both concepts are gene symbols
- Total co-occurrence frequency 798,366,684
- 59,702,986 association rules calculated and
stored
22Bilateral Perisylvian Polymicrogyria - BPP (OMIM 300388)
- Polymicrogyria of the cerebral cortex is a developmental abnormality characterized by excessive surface convolution
- Clinical characteristics
- Mental retardation
- Epilepsy
- Pseudobulbar palsy (paralysis of the face, throat, tongue and the chewing process)
- X linked dominant inheritance
24BPP - pathogenesis
- It is considered a disorder of neuronal migration
(unlayered type) or a consequence of intrauterine
ischemia (layered type)
25Finding Candidate Genes for Polymicrogyria, bilateral perisylvian
26 237 genes in Xq28
- Relation between semantic types Cell Movement and Gene or Gene Products: 18 gene candidates
- Sublocalisation in the Xq28: 15 gene candidates
- Tissue specific expression: 2 gene candidates, L1CAM and FLNA
28User Interface: cgi-bin version
29Automatically search for supporting Medline
Citations
30Cleft Palate: Predicting Candidate Genes
33Summary and Conclusions
- We extend and enhance an existing discovery support system (BITOLA)
- The system can be used as
- Research idea generator, or
- Alternative method of searching Medline
- Genetic knowledge about the chromosomal locations of diseases and genes included to make BITOLA more suitable for disease candidate gene discovery
34Further Work
- Increase the number of concepts
- Gene symbol disambiguation
- Semantic relations extraction
- System evaluation
- Improve the Web version of the system
35System Availability
- URL: www.mf.uni-lj.si/bitola/
36Related work: SemGen (Tom Rindflesch et al.)
- Extract semantic predications on genetic basis of disease
- Deletions of INK4 occur in malignant tumors
- INK4 ASSOCIATED_WITH Malignant Tumors
- Evaluation and visualization of SemGen output
37Semantic Structures
<genetic phenomenon>
ETIOLOGY_OF
<disorder>
CAUSE
PREDISPOSE
ASSOCIATED_WITH
39Statistical Evaluation
- Assoc. rule base divided into 2 segments: older (1990-1995) and newer (1996-1999)
- The system predicts new relations based on the older segment
- Predictions compared with actual new relations in the newer segment
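The time-split evaluation described on this slide can be sketched as below. The representation of relations as sorted concept pairs, the function name, and the precision and recall measures are assumptions chosen for illustration; the slides do not state which measures were actually computed beyond the comparison with random predictions.

```python
def evaluate_predictions(predicted, old_pairs, new_pairs):
    """Compare predictions made from the older segment with the newer segment."""
    # Relations that appear only in the newer segment are the "actual new relations".
    truly_new = new_pairs - old_pairs
    hits = predicted & truly_new
    precision = len(hits) / len(predicted) if predicted else 0.0  # correct / all predictions
    recall = len(hits) / len(truly_new) if truly_new else 0.0     # share of new relations found
    return precision, recall

# Toy data: each relation is a sorted pair of hypothetical concepts.
old_pairs = {("A", "B"), ("B", "C")}
new_pairs = {("A", "B"), ("A", "C"), ("B", "D")}
predicted = {("A", "C"), ("A", "D")}
precision, recall = evaluate_predictions(predicted, old_pairs, new_pairs)
```

In this toy case one of the two predictions is confirmed by the newer segment, and one of the two new relations is found.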
40Summary Statistical Evaluation Results
41Statistical Evaluation Results
- With no assoc. rules constraints
- predicts almost all new relations, but too many candidate relations
- With constraints
- predicts new relations 6.9 times better than random predictions
- the tighter the constraints, the better the (correct / all predictions) ratio (6.5)