Title: Vladimir Bajic
1. Deeper insights from text-mining: Dragon Exploratory System
Vladimir Bajic, InCoB 2009, Singapore
2. Information exploratory systems: Motivation
3. Search for complete information is tedious
Biomedical field
Multitude of information types
Multiple structures of information records
Tools for information search and retrieval need improvements
Exploratory systems
Extreme volume of information
Distributed Information repositories
Variety of modes to access information
4. Information exploratory systems: Characteristics
5. Can generate reports
Biomedical field
Integrate Information from various resources
Allow for exploring information from various viewpoints
Links with other information repositories
Exploratory systems
Navigate easily through information retrieved
Convenient graphical representation
Convenient tabular representation
6. Knowledge extraction: Information exploratory systems
Knowledge extraction
Exploratory systems
Computers
7. Knowledge Extraction: Why?
- In the biomedical field the number of published scientific reports increases dramatically each year (PubMed holds 19 million documents and grows by more than 0.5 million per year)
- The volume of experimentally generated and computationally derived data is tremendous
- Still, the amount of facts related to a specific topic and provided in a summarized format is very small
- For this reason databases of curated information on specialized topics are still easily publishable
8. Knowledge Extraction: Why? (2)
- Summarized collections of accurate information/facts on a specialized topic are rare
- These are difficult to compile
- require curation
- slow
- costly
- What can we do about it?
- We can automate or semi-automate the work through computerized extraction of KNOWLEDGE from various resources, most frequently textual ones
9. Knowledge Extraction: How?
- We define what we mean by knowledge in a way suitable for computer analysis
- By KNOWLEDGE we will consider the information that accurately links two concepts A and B
- A Knowledge Edge is the pair (A, B)
- A relates to B in a specific way under specific conditions
10. Knowledge Edge
[Diagram: concept A linked to concept B by a relation, under specific conditions]
11. Knowledge Edge: Examples
A: gene (EntrezGene ID 9821); B: 2h after stimulation; Relation: highly expressed
Other examples: A binds to B; A blocks activity of B; etc.
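A knowledge edge as defined above can be held in a small record type. The following is only a sketch: the field names and the placeholder concepts (TF_X, GENE_Y, DRUG_Z, PROTEIN_W) are assumptions made for illustration and do not come from the Dragon system or from any database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KnowledgeEdge:
    """Concept A relates to concept B in a specific way under specific conditions."""
    concept_a: str
    relation: str
    concept_b: str
    conditions: str = ""  # empty string means no condition was recorded

    def describe(self) -> str:
        text = f"{self.concept_a} {self.relation} {self.concept_b}"
        return f"{text} ({self.conditions})" if self.conditions else text


# Hypothetical placeholder edges; frozen dataclasses are hashable,
# so identical edges collapse when collected in a set.
edges = {
    KnowledgeEdge("TF_X", "binds to", "promoter of GENE_Y"),
    KnowledgeEdge("TF_X", "binds to", "promoter of GENE_Y"),
    KnowledgeEdge("DRUG_Z", "blocks activity of", "PROTEIN_W", "in vitro"),
}

for edge in sorted(edges, key=KnowledgeEdge.describe):
    print(edge.describe())
```

Making the record immutable is a design choice of this sketch: it lets sets of edges act directly as the building blocks for larger collections.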
12. Knowledge Extraction: Methods
- Most rewarding: text analysis
[Diagram components: textual data; dictionaries of concepts of interest; find concepts of interest in textual data; determine how concepts are linked; organize results for convenient utilization]
13. Knowledge Extraction: Methods (2)
[Diagram components: data resources; other data resources; text-mining/data-mining tools; preliminary results; AI system for refining results; curators; final results]
14. Knowledge Extraction
How is the relation between concepts A and B derived (in text mining)?
- Location: document set, document, part of document, sentence
- Form/rule: defined template, patterns
- Artificial intelligence: trained system
- Possibly curator assessment
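The simplest of these options, location at the sentence level, can be sketched as follows. The dictionaries, the naive sentence splitter and all function names are assumptions made for illustration; they are not the Dragon system's implementation.

```python
import re
from itertools import combinations

# Toy dictionaries (assumed for illustration): concept type -> surface terms.
DICTIONARIES = {
    "gene_or_protein": {"pea3", "mmp-13"},
    "interaction": {"binds", "activates"},
}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_concepts(sentence):
    """Return the sorted dictionary terms that occur in the sentence as whole tokens."""
    tokens = set(re.findall(r"[a-z0-9-]+", sentence.lower()))
    return sorted(
        term for terms in DICTIONARIES.values() for term in terms if term in tokens
    )


def sentence_cooccurrences(text):
    """Yield (term_a, term_b, sentence) for every pair of terms sharing a sentence."""
    for sentence in split_sentences(text):
        for a, b in combinations(find_concepts(sentence), 2):
            yield a, b, sentence


text = (
    "Activated PEA3 binds to MMP-13 promoter and activates its expression. "
    "Stress is associated with migraine."
)
pairs = list(sentence_cooccurrences(text))
for a, b, _ in pairs:
    print(a, "<->", b)
```

Co-occurrence only suggests a potential link; templates, trained classifiers or curators are what narrow such pairs down to accurate edges.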
15. Knowledge Extraction
- These provide different levels of accuracy of the derived Knowledge Edges, from exploratory (i.e. potential) to accurate
- Sets of knowledge edges can be used as building blocks in the generation of knowledge networks
- This supports various fields of life sciences
16. High-accuracy knowledge edge sets
Knowledge base for links to other resources
Biomedical field
Knowledge networks generation
Knowledge extraction systems
Automatic generation of biological databases
Hypotheses generation
Curator Support systems
17. How does it look in practice?
18. Bookshelf of US Surgeon General Joseph Lovell
- In 1818 Joseph Lovell became the Surgeon General of the US Army Medical Department.
- Occasionally he purchased medical journals and books for his office bookshelf.
19. General Thomas Lawson: the first catalog
- Lovell's successor, Surgeon General Thomas Lawson, continued and expanded the office collection.
- In 1840 he wrote a small catalog containing 134 titles, calling the collection the "Library of the Surgeon General's Office."
20. The times of John Shaw Billings
- Early in the 1870s Billings started a card file.
- Years later, in 1895, he completed a sixteen-volume index catalog.
- This was the most comprehensive guide to the medical literature of the 19th century.
21. MEDLARS
- The catalog evolved from a printed form, then a photographic one in the 1950s, to the computerized Medical Literature Analysis and Retrieval System (MEDLARS) in the 1960s.
- For each requested abstract the entire set of magnetic tapes had to be searched sequentially, which in 1964 took about 40 minutes, and a summary was mailed back to the library member.
- In 1971 the library started providing online access, and in 1993 access through the website.
22. National Library of Medicine
- Today, the National Library of Medicine (NLM) is the biggest repository of biomedical information worldwide.
- Via the National Center for Biotechnology Information (NCBI), NLM houses and provides access not only to bibliography but also to a number of biomedical databases.
23. Entrez
- Access to biomedical literature is provided by NCBI's PubMed service.
- PubMed uses the cross-database search and retrieval system Entrez.
- The Entrez query system is also used for services including the NLM catalog, nucleotide and protein databases, genome sequences, and many others.
- Information from published documents alone is insufficient
http://www.ncbi.nlm.nih.gov/sites
24. PubMed
- The main PubMed component is the Medical Literature Analysis and Retrieval System Online (MEDLINE).
- MEDLINE is a bibliographic database that contains citations and abstracts of journal articles in life sciences with a focus on biomedicine.
- In addition, PubMed contains citations and links to full-text articles from other biomedicine-related life science journals.
- In the year 2000 a new service, PubMed Central (PMC), was established as a digital archive of full-text journal articles.
- Today PubMed contains more than 19,000,000 citations.
25. Text Mining System Components
26. Text Mining with the Dragon Exploration System
- Six design principles
- Submit
- Annotate
- Explore
- Visualize
- Hypothesize
- Mine
30. Text Mining with the Dragon Exploration System
Tobacco and tobacco smoke (inhibits monoamine oxidase):
- Controls aggression in obsessive-compulsive disorder?
- Beneficial for Parkinson's disease?
- Beneficial for bipolar depression?
- Link to borderline mental retardation and idiopathic epilepsy?
32. Or how we did it?
33. Main resources
34.
- Set of curated dictionaries on various topics
- Pre-indexed PubMed
- Pre-indexed EntrezGene, Reactome
- Pre-indexed UniProt (SwissProt, TrEMBL)
- Internal annotated human promoter database (200,000 TSS, 2 sources of experimental evidence, 300 TFs, association of TSSs with expression libraries)
- Pre-indexed and annotated human regulatory regions for SNPs and affected TFBSs
- PPI data
- Hypotheses generation module
- Rule-based knowledge extraction module
35. Text mining characteristics
36.
- Synonym resolution module
- Acronym resolution module
- Free selection of index dictionaries
- Text is re-indexed on the fly
- Color-coded index presentation in text
- Link to the source
37. List of associations per concept
38. Search for an explicit term
39. Text Mining with the Dragon Exploration System
Frequency of entities table.
40. Tabular presentation of all terms found, with frequencies
41. Find documents with a pair of concepts
42. Text Mining with the Dragon Exploration System
Frequency of pairs table.
43. Text Mining with the Dragon Exploration System
Recommended reading: top documents in annotated and original PubMed format.
44. Text Mining with the Dragon Exploration System
Links to external databases, e.g. GO
45. Text Mining with the Dragon Exploration System
Links to external databases, e.g. Reactome
46. Text Mining with the Dragon Exploration System
Links to external databases: genes, proteins, pathways
47. Text Mining with the Dragon Exploration System
Entities are linked to external databases, e.g. genes, proteins, pathways
48. Text Mining with the Dragon Exploration System
Document clustering: weight of the entity in the cluster = frequency / number of documents.
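The weight formula can be illustrated with a few lines of code. This sketch assumes that "frequency" means the total number of mentions of the entity across the cluster and that "number of documents" is the size of the cluster; the slide does not define either term precisely, and the function name is hypothetical.

```python
from collections import Counter


def entity_weights(cluster_docs):
    """Weight of each entity = mentions in the cluster / documents in the cluster.

    cluster_docs is a list of documents, each given as a list of entity mentions.
    """
    if not cluster_docs:
        return {}
    frequency = Counter(entity for doc in cluster_docs for entity in doc)
    n_documents = len(cluster_docs)
    return {entity: count / n_documents for entity, count in frequency.items()}


# Toy cluster of four documents (assumed data).
cluster = [
    ["magnesium", "migraine", "magnesium"],
    ["migraine", "stress"],
    ["migraine"],
    ["magnesium"],
]
weights = entity_weights(cluster)
print(weights)
```

Under this reading a weight can exceed 1 when an entity is mentioned several times per document; counting each document at most once would cap it at 1.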
49. Hypotheses generation
50. Enter the Dragon: Swanson ABC model
- Sets of publications A and C have no articles in common, but they are linked through intermediate articles.
- This structure may contain unnoticed information that can be obtained by combining pairs of intersections AB_i and B_iC.
Magnesium literature | Migraine literature
Mg is a calcium channel blocker. | Calcium channel blockers can prevent migraine attacks.
Stress can lead to loss of Mg. | Stress is associated with migraine.
Mg has anti-inflammatory properties. | Migraine may involve inflammation of the cerebral blood vessels.
51. Enter the Dragon: Open and closed discovery process
Open discovery process
The researcher is searching for interesting concepts (B) that link the researched topic (A) with a concept (C). A: Magnesium literature. B: Stress. C: Migraine.
Closed discovery process
The search starts in both directions, resulting in overlapping concepts. A: Magnesium literature. C: Migraine literature. B: Stress.
52. Open discovery process with intermediate links
The researcher is searching for interesting concepts (D) that are linked to the researched topic (A) via intermediate links (B, C).
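The ABC model and its two search modes can be sketched over a toy concept graph. The links below are taken from the magnesium/migraine example on the slides; the data layout and function names are assumptions for illustration and not the Dragon system's code.

```python
from collections import defaultdict

# Concept pairs that co-occur in some literature (from the slide example).
PAIRS = [
    ("magnesium", "calcium channel blockers"),
    ("magnesium", "stress"),
    ("magnesium", "inflammation"),
    ("calcium channel blockers", "migraine"),
    ("stress", "migraine"),
    ("inflammation", "migraine"),
]


def build_links(pairs):
    """Build a symmetric map: concept -> set of directly linked concepts."""
    links = defaultdict(set)
    for x, y in pairs:
        links[x].add(y)
        links[y].add(x)
    return links


def closed_discovery(links, a, c):
    """A and C are both given: return the intermediate concepts B linked to both."""
    return sorted(links[a] & links[c])


def open_discovery(links, a):
    """Only A is given: return C concepts reachable via some B but not linked to A."""
    found = defaultdict(set)
    for b in links[a]:
        for c in links[b]:
            if c != a and c not in links[a]:
                found[c].add(b)
    return {c: sorted(bs) for c, bs in found.items()}


links = build_links(PAIRS)
print(closed_discovery(links, "magnesium", "migraine"))
print(open_discovery(links, "magnesium"))
```

Extending the open search to intermediate chains (A to D via B and C) would mean repeating the expansion step once more from each C.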
53. Hypotheses generation
54. Rule-based extraction of knowledge
55. Enter the Dragon: Concept-based knowledge discovery
Real world (problem description): search for transcription factor binding to a gene's promoter
Concept world (concept description): protein/gene, interaction, protein/gene, promoter
56. Text Mining with the Dragon Exploration System
Knowledge finder: set a template and mine for knowledge.
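Setting a template and mining for it can be sketched as an ordered match of concept types within a sentence. The toy dictionary, the template encoding and the greedy matching rule are assumptions for illustration; the slides do not describe how the knowledge finder matches templates internally.

```python
import re

# Toy dictionary (assumed): token -> concept type.
CONCEPT_TYPES = {
    "pea3": "pr/gn",
    "mmp-13": "pr/gn",
    "binds": "interaction",
    "activates": "interaction",
    "promoter": "promoter",
}

# Template: protein/gene, interaction, protein/gene, promoter.
TEMPLATE = ["pr/gn", "interaction", "pr/gn", "promoter"]


def tag(sentence):
    """Return (token, concept type) pairs for dictionary tokens, in sentence order."""
    tokens = re.findall(r"[a-z0-9-]+", sentence.lower())
    return [(t, CONCEPT_TYPES[t]) for t in tokens if t in CONCEPT_TYPES]


def match_template(sentence, template=TEMPLATE):
    """Return the tokens filling the template slots in order, or None.

    Greedy left-to-right subsequence match: other words may sit between slots.
    """
    filled = []
    for token, concept_type in tag(sentence):
        if len(filled) < len(template) and concept_type == template[len(filled)]:
            filled.append(token)
    return filled if len(filled) == len(template) else None


hit = match_template(
    "Activated PEA3 binds to MMP-13 promoter and activates its expression."
)
miss = match_template("Stress is associated with migraine.")
print(hit, miss)
```

A match only says that the sentence has the right shape; whether it really states a binding event is what the later classification step or a curator decides.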
57. Transcription regulation
- Depending on how TFs interact, different PEs participate in the transcription initiation process
- Some combinations of events will initiate transcription, and some will not
- These show the combinatorial nature of transcription activation and the promoter's ability to address different requirements for timing, tissue specificity, and transcription rate and levels
58. [Figure labels: TF-1; distant unrelated pathways]
59.
- The key information for dealing with the issue of how transcription is controlled is to find edges of the type
[Diagram: TF binds promoter of gene]
Human: 2,000 TFs; 30,000 genes; 200,000 promoters; fewer than 8,000 known TF-(promoter)-gene edges
60. Question: Find TFs that bind to the promoter of a gene
61. Find TFs that bind to the promoter of a gene
Knowledge Extraction
62. Find TFs that bind to the promoter of a gene
Knowledge Extraction
- In one day a curator can verify 200 associations of the type
[Diagram: TF binds promoter of gene]
- In this way we can quickly build repositories of curated (thus accurate) information on specific types of knowledge edges
63. Enter the Dragon: Machine learning
- Supply the ML algorithm with classified examples.
- The ML algorithm learns a prediction rule.
- Classify a new instance of an unknown class by using the rule.
Transcription factor binding to gene's promoter? Class
Activated PEA3 binds to MMP-13 promoter and activates its expression. Yes
A HNF-1 binding site was identified in the NNMT basal promoter region. Yes
Runx2 directly binds to the OSE2 elements and transactivates the human NELL-1 promoter. No
64. Enter the Dragon: ML learning examples
145,168 abstracts produced 1,049,949 sentences
Example: Activated PEA3 binds to MMP-13 promoter and activates its expression.
3,321 sentences matched the concept pr/gn, interaction, pr/gn, promoter
Example: Activated PEA3 binds to MMP-13 promoter and activates its expression.
Content evaluation
Expert classification
Transcription factor binding to gene's promoter? Class
Activated PEA3 binds to MMP-13 promoter and activates its expression. Yes
A HNF-1 binding site was identified in the NNMT basal promoter region. Yes
Runx2 directly binds to the OSE2 elements and transactivates the human NELL-1 promoter. No
65. From real world to abstract feature vector space
Transcription factor binding to gene's promoter? Class
Activated PEA3 binds to MMP-13 promoter and activates its expression. Yes
A HNF-1 binding site was identified in the NNMT basal promoter region. Yes
Runx2 directly binds to the OSE2 elements and transactivates the human NELL-1 promoter. No
66. The best performing feature
Levenshtein distance
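For reference, the Levenshtein (edit) distance between two strings can be computed with the standard dynamic-programming recurrence. This is a generic sketch; the slides do not say which strings were compared or how the distance was turned into a feature, so nothing below should be read as the authors' feature definition.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions
    needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute or keep
                )
            )
        previous = current
    return previous[-1]


print(levenshtein("promoter", "promoters"))
```

Only two rows of the table are kept at a time, so memory use grows with the shorter string rather than with the product of both lengths.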
67. ML algorithms comparison
Algorithm      Precision  Recall  F-measure  AUC
K              0.709      0.710   0.710      0.741
MLP-NN         0.713      0.712   0.713      0.787
J-48           0.735      0.739   0.735      0.775
Naïve Bayes    0.786      0.788   0.785      0.841
Random Forest  0.795      0.796   0.795      0.837
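The first three columns follow the usual definitions of precision, recall and F-measure, sketched below for a single class. How the reported values were averaged over the Yes and No classes is not stated on the slide, so this function is a reminder of the definitions rather than a way to reproduce the table; the counts in the example are made up.

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall and F-measure from true-positive, false-positive
    and false-negative counts for one class."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Made-up counts: 3 correct "Yes" predictions, 1 wrong "Yes", 3 missed "Yes".
p, r, f = precision_recall_f(tp=3, fp=1, fn=3)
print(p, r, f)
```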
68. K with Synthetic Minority Oversampling
Experiment              Precision  Recall  F-measure  AUC
7 features              0.899      0.892   0.893      0.962
57 features             0.912      0.908   0.909      0.970
68 features             0.963      0.961   0.961      0.993
117 features            0.960      0.959   0.959      0.988
Chowdhary et al., 2009  0.920      0.710   0.740      0.870
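The idea behind synthetic minority oversampling can be sketched in a few lines: new minority-class points are interpolated between an existing minority point and one of its nearest minority neighbours. This is a simplified illustration with assumed parameter names (k, seed) and toy data; it is not the implementation used for the results above.

```python
import math
import random


def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority sample
    and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy minority class with two features per sample (assumed data).
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
new_points = smote(minority, n_new=5)
print(new_points)
```

Because every synthetic point lies on a segment between two real minority samples, the oversampled class stays within the region the original samples span.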
69. Knowledge discovery
70. Creation of biology-related databases
- For a set of documents and for a selected set of dictionaries
- Also, for template-related sentence types
Some publications (different aspects of DES):
- Essack M et al. DDEC: Dragon database of genes implicated in esophageal cancer, BMC Cancer, 2009
- Kaur M et al. Database for exploration of functional context of genes implicated in ovarian cancer, Nucleic Acids Research, 2009
- Sagar S et al. DDESC: Dragon Database for Exploration of Sodium Channels in Human, BMC Genomics, 2008
71. Text Mining with the Dragon Exploration System
Create a database by filtering sentences by dictionary and selected keywords.
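A minimal sketch of this step: keep only sentences that contain at least one dictionary term and at least one selected keyword, and store each hit in a small database. The table layout, the in-memory SQLite store and the function name are assumptions for illustration, not the Dragon system's schema.

```python
import re
import sqlite3


def build_database(sentences, dictionary, keywords):
    """Store (term, sentence) rows for sentences that contain at least one
    dictionary term and at least one keyword. Terms are lowercase single tokens."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE hits (term TEXT, sentence TEXT)")
    for sentence in sentences:
        tokens = set(re.findall(r"[a-z0-9-]+", sentence.lower()))
        if not tokens & keywords:
            continue  # no selected keyword in this sentence
        for term in sorted(tokens & dictionary):
            db.execute("INSERT INTO hits VALUES (?, ?)", (term, sentence))
    db.commit()
    return db


sentences = [
    "Activated PEA3 binds to MMP-13 promoter and activates its expression.",
    "Stress is associated with migraine.",
    "PEA3 was measured.",
]
db = build_database(sentences, dictionary={"pea3", "mmp-13"}, keywords={"promoter"})
rows = db.execute("SELECT term, sentence FROM hits ORDER BY term").fetchall()
for term, sentence in rows:
    print(term, "|", sentence)
```

Storing one row per term keeps the table easy to query by gene or protein name, at the cost of repeating the sentence text.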