Title: Real-time Text Mining for the Biomedical Literature
1Real-time Text Mining for the Biomedical Literature: a collaboration between Discovery Net and myGrid
Rob Gaizauskas, Department of Computer Science, University of Sheffield
Moustafa M. Ghanem, Department of Computing, Imperial College London
2Outline
- Context
- Workflows, Services and Text Mining
- Discovery Net and myGrid
- Aims and Objectives of New Project
- Architecture of New System
- Integration of Existing Components
- Approach to Text Mining
- Data Resources and Evaluation
- Techniques for GO Tagging
- Interface and Results Presentation
- Lessons Learnt So far, Conclusions and Broader
Applicability of Work
3Workflows, Web Services and Text Mining for Bioinformatics
- Workflows
- useful computational models for processes that require repeated execution of a series of complex analytical tasks
- e.g. a biologist researching the genetic basis of a disease repeatedly:
- maps a reactive spot in microarray data to a gene sequence
- uses a sequence alignment tool to find proteins/DNA of similar structure
- mines info about these homologues from remote DBs
- annotates the unknown gene sequence with this discovered info
4Workflows, Web Services and Text Mining for
Bioinformatics
- Web services
- Processing resources that are
- available via the Internet
- use standardised messaging formats, such as XML
- enable communication between applications without being tied to a particular operating system/programming language
- Useful for bioinformatics, where data used in research is
- heterogeneous in nature: DB records, numerical results, NL texts
- distributed across the internet in research institutions around the world
- available on a variety of platforms and via non-uniform interfaces
5Workflows, Web Services and Text Mining for
Bioinformatics
- Text mining
- any process of revealing information regularities, patterns or trends in textual data
- includes more established research areas such as information extraction (IE), information retrieval (IR), natural language processing (NLP), knowledge discovery from databases (KDD) and traditional data mining (DM)
- relevant to bioinformatics because of
- explosive growth of biomedical literature
- availability of some information in textual form only, e.g. clinical records
6Workflows, Web Services and Text Mining for
Bioinformatics
7Discovery Net and myGrid
- Discovery Net: An e-Science testbed for High Throughput Informatics
- £2.2M EPSRC Pilot Project
- Started Oct 2001, ended March 2005
- Service-based infrastructure/workflow model for Life Sciences, Environmental Modelling and Geo-hazard Modelling
- Infrastructure for mixed data mining / text mining
- Machine learning methods for text mining
- myGrid: Directly Supporting the e-Scientist
- £3.5M EPSRC Pilot Project
- Started Oct 2001, ends June 2005
- Service-based infrastructure/workflow model for Life Sciences
- Infrastructure for Text Collection Server, Text Services Workflow Server and Interface/Browsing Client
- Service-based Terminology Servers
8myGrid
- Overall aim: develop an e-biologist's workbench, a platform allowing biologists to execute, analyze and repeat multi-stage in silico experiments involving distributed data, code and processing resources
- Workflow model for composing/executing processing components
- Web services for distribution
- Problem: how to integrate text mining into a biological workflow?
- Most text mining runs off-line and supports interactive browsing of results
- Most workflows run end to end with no user intervention
- What are the inputs to text mining to be?
- Solution: tap off the result of a workflow step and treat it as an implicit query
9A myGrid example: studying the Genetic Basis of Disease
- Graves' Disease
- an autoimmune condition affecting tissues in the thyroid and orbit
- being investigated using micro-array methods
- micro-array shows which genes are differentially expressed in normal patients vs patients with the disease: the candidate genes
- sequence alignment search (e.g. BLAST) finds genes/proteins with similar structure
- function of these homologues may suggest function of candidate gene
- key step for text mining follows BLAST search
- for homologous proteins, the BLAST report contains references to proteins in the SWISSPROT protein database
- SWISSPROT records contain ids of abstracts describing the protein in the Medline abstract database
- abstracts can be mined directly or used as "seed" documents to assemble a set of related abstracts
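The chain from BLAST hits to seed abstracts can be sketched in a few lines. This is only an illustration under stated assumptions: the report layout (hit lines starting with `sp|<accession>|`), the two lookup tables and every identifier in them are invented stand-ins for the remote SWISSPROT and Medline services, not the myGrid implementation.

```python
import re

# Hypothetical stand-ins for remote services: a SWISSPROT lookup mapping an
# accession to the Medline ids cited in its record, and a Medline lookup
# mapping an id to abstract text. All values are invented for illustration.
SWISSPROT_REFS = {
    "P00001": ["1001", "1002"],
    "P00002": ["1002", "1003"],
}
MEDLINE_ABSTRACTS = {
    "1001": "Abstract about protein A.",
    "1002": "Abstract about proteins A and B.",
    "1003": "Abstract about protein B.",
}

# Assumed report layout: each hit line starts with "sp|<accession>|".
ACCESSION_PATTERN = re.compile(r"^sp\|([A-Z0-9]+)\|", re.MULTILINE)


def accessions_from_blast(report: str) -> list[str]:
    """Tap off the accessions in a BLAST report, in order, without repeats."""
    seen = []
    for accession in ACCESSION_PATTERN.findall(report):
        if accession not in seen:
            seen.append(accession)
    return seen


def seed_abstracts(report: str) -> dict[str, str]:
    """Follow BLAST hits -> SWISSPROT records -> Medline ids -> abstracts."""
    abstracts = {}
    for accession in accessions_from_blast(report):
        for pmid in SWISSPROT_REFS.get(accession, []):
            if pmid in MEDLINE_ABSTRACTS:
                abstracts[pmid] = MEDLINE_ABSTRACTS[pmid]
    return abstracts


report = "sp|P00001|PROT_A  score 250\nsp|P00002|PROT_B  score 180\n"
seeds = seed_abstracts(report)
print(sorted(seeds))
```

The workflow output (the BLAST report) acts as the implicit query: no user types a search string, and the resulting abstracts can be mined directly or used as seeds for a wider retrieval step.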
10myGrid Text Services Architecture
11myGrid Text Services Architecture
- 3-way division of labour: a sensible way to deliver distributed text mining services
- Providers of e-archives, such as Medline, will make archives available via a web-services interface
- Cannot offer tailored services for every application
- Will provide core, common services
- Specialist workflow designers will add value to basic services from the archive to meet their organization's needs
- Users will prefer to execute predefined workflows via standard light clients such as a browser
- Architecture appropriate for many research areas, not just bioinformatics
12myGrid Interface/Browsing Client
13Discovery Net: Adding text mining to e-Science workflows
- DNet Workflow server executes DPML workflow and uses Discovery Net's InfoGrid data access and integration wrappers and web services
14Text Mining in e-Science workflows
- Problem: how to develop new distributed text mining applications using a workflow?
- Most text mining applications require the integration of a mixture of components (services) for text processing tasks (e.g. parsing and cleaning), natural language processing (e.g. named entity recognition), statistics and data mining (e.g. classification, clustering).
- There are many design alternatives, and end users may want to prototype and compare alternative implementations.
- Once an application is developed, most workflows run end to end with no user intervention
- Solution: extend the service infrastructure to allow composition of text mining services.
15Building text mining applications from workflows
Using workflow technologies to build text mining applications and services using finer grain components/services
Text Mining Pipelines
- Retrieve and organize relevant documents
- Pre-process documents to enhance the ease of feature extraction
- Features are summarized into vector forms which are suitable for data mining
- Results can be document characterization or hidden relationship extraction
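The pipeline stages named on this slide (retrieve, pre-process, summarize features as vectors) can be sketched as small composable functions. The corpus, the function names and the simple term-count vectors are all assumptions made for illustration; in the architecture described here each stage would be a separate service in a workflow.

```python
import re
from collections import Counter

# Tiny invented corpus standing in for a remote document collection.
CORPUS = {
    "d1": "Thyroid <b>autoimmune</b> disease and thyroid hormone.",
    "d2": "Sequence alignment of yeast proteins.",
    "d3": "Autoimmune response in the thyroid gland.",
}


def retrieve(query: str) -> dict[str, str]:
    """Stage 1: keep documents that mention the query term."""
    return {doc_id: text for doc_id, text in CORPUS.items()
            if query.lower() in text.lower()}


def preprocess(text: str) -> list[str]:
    """Stage 2: strip markup, lower-case, and split into word tokens."""
    without_tags = re.sub(r"<[^>]+>", " ", text)
    return re.findall(r"[a-z]+", without_tags.lower())


def to_vector(tokens: list[str], vocabulary: list[str]) -> list[int]:
    """Stage 3: summarize tokens as term counts over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def run_pipeline(query: str) -> dict[str, list[int]]:
    """Compose the stages; the vectors are what a mining step would consume."""
    tokenized = {doc_id: preprocess(text)
                 for doc_id, text in retrieve(query).items()}
    vocabulary = sorted({tok for toks in tokenized.values() for tok in toks})
    return {doc_id: to_vector(toks, vocabulary)
            for doc_id, toks in tokenized.items()}


vectors = run_pipeline("thyroid")
print(vectors)
```

Keeping each stage behind a plain function boundary mirrors the finer-grained components idea: any one stage can be swapped for an alternative implementation without touching the others.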
16Simplified Document Classification Workflow
Predictive accuracy of relevance prediction, using Support Vector Machine classification:
- Overall accuracy: 84.5%
- Precision: 78.11%
- Recall: 73.40%
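For reference, accuracy, precision and recall for a two-class relevance prediction are computed from the confusion counts as shown in this short sketch. The counts in the example are invented to show the arithmetic; they are not the data behind the figures on this slide.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Accuracy, precision and recall (as percentages) from confusion counts."""
    total = tp + fp + fn + tn
    if total == 0:
        raise ValueError("at least one prediction is required")
    return {
        # Share of all predictions that were correct.
        "accuracy": 100.0 * (tp + tn) / total,
        # Share of documents predicted relevant that really were relevant.
        "precision": 100.0 * tp / (tp + fp) if tp + fp else 0.0,
        # Share of truly relevant documents that were predicted relevant.
        "recall": 100.0 * tp / (tp + fn) if tp + fn else 0.0,
    }


# Invented counts, chosen only to illustrate the calculation.
metrics = classification_metrics(tp=30, fp=10, fn=20, tn=140)
print(metrics)
```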
17Text Meta Data Model
Build Classifier: training phase using workflow co-ordinating distributed services
Build Prediction: phase using workflow co-ordinating distributed services
Metadata Model: service interfaces only tell you how to invoke a remote service, but it is up to you to decide what information flows between services!
18Aims and Objectives of New Project
- Aim: to develop a unified real-time e-Science text-mining infrastructure that leverages the technologies and methods developed by both Discovery Net and myGrid
- Software engineering challenge: integrate complementary service-based text mining capabilities with different metadata models into a single framework
- Application challenge: annotate biomedical abstracts with semantic categories from the Gene Ontology
- Deliverables
- D1: A GO Annotation Service
- D2: A Generic Shared Infrastructure for Grid-enabled Biomedical Document Categorization
- D3: Infrastructure for Semantic Document Annotation
- D4: A Detailed Case Study (analysing/evaluating the GO annotator)
- D5: Developing a common framework for representing and exchanging information about
- 1. Data: biomedical documents/doc collections metadata, biomedical dictionaries
- 2. Intermediate data: document indexes and document feature vectors
- 3. Text analysis results
19GO TAG: A Novel Application
- The GO TAG Application: Automatic Assignment of GO (Gene Ontology) Codes to Medline Documents
20A Machine Learning Approach
21Run-time System
22GO Annotator Version 1
- Version 1a
- Direct search for GO Annotation descriptions and synonyms in document text
- If description is found, document is labelled with this GO Annotation
- Description is also marked up in document
- Version 1b
- As 1a, plus search for gene names extracted from the yeast genome DB
- If gene name found, document labelled with GO annotation(s) associated with gene in DB
- Gene name also marked up in document
- Termino web-service, hosted at Sheffield, provides lookup capability
- This is wrapped in a Discovery Net workflow to include PubMed query, results visualization and performance calculations
- Workflow is deployed as a web application for end users, which includes an applet to interactively browse results
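The lookup-based labelling of Versions 1a and 1b can be sketched with plain string matching. This is not the Termino service: the term table, the gene table, the placeholder identifiers and the bracket mark-up format are all assumptions made for the example.

```python
import re

# Hypothetical term table: GO identifier -> description and synonyms.
# Identifiers and phrases are placeholders, not real GO entries.
GO_TERMS = {
    "GO:A": ["dna repair"],
    "GO:B": ["protein folding", "chaperone activity"],
}
# Hypothetical gene table for the 1b variant: gene name -> GO identifiers.
GENE_TO_GO = {"GENE1": ["GO:A"], "GENE2": ["GO:B"]}


def annotate(text: str) -> tuple[set[str], str]:
    """Return the GO labels found in `text` and a marked-up copy of it."""
    labels = set()
    marked = text
    # Version 1a: look for each description or synonym, ignoring case.
    for go_id, phrases in GO_TERMS.items():
        for phrase in phrases:
            pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
            if pattern.search(marked):
                labels.add(go_id)
                marked = pattern.sub(
                    lambda m, g=go_id: f"[{m.group(0)}|{g}]", marked)
    # Version 1b: look for gene names and inherit their GO labels.
    for gene, go_ids in GENE_TO_GO.items():
        pattern = re.compile(r"\b" + re.escape(gene) + r"\b")
        if pattern.search(marked):
            labels.update(go_ids)
            marked = pattern.sub(lambda m: f"[{m.group(0)}|gene]", marked)
    return labels, marked


labels, marked = annotate("GENE2 is involved in DNA repair.")
print(sorted(labels))
print(marked)
```

Exact phrase matching of this kind is cheap and needs no training data, but it only fires when a description, synonym or gene name appears verbatim in the text.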
23GO Annotator Version 1: Underlying Discovery Net Workflow
24GO Annotator Version 1: Underlying Discovery Net Workflow
Enter query and retrieve abstracts from PubMed.
25GO Annotator Version 1: Underlying Discovery Net Workflow
Use Termino to mark up abstracts with GO Annotations when a match for a GO Annotation description is found.
26GO Annotator Version 1: Underlying Discovery Net Workflow
Tabulate GO Annotations by PMID.
27GO Annotator Version 1: Underlying Discovery Net Workflow
Join PMIDs and matching GO Annotations with abstracts and titles.
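The last two workflow steps, tabulating matches by PMID and joining them back to titles and abstracts, amount to a group-by followed by an inner join. A minimal sketch, assuming the mark-up step emits (PMID, GO identifier) pairs; all identifiers and records below are invented.

```python
from collections import defaultdict

# Invented rows: (PMID, GO identifier) pairs as a mark-up step might emit them.
matches = [("111", "GO:A"), ("111", "GO:B"), ("222", "GO:A"), ("111", "GO:A")]
# Invented PubMed records keyed by PMID.
records = {
    "111": {"title": "Title one", "abstract": "Abstract one."},
    "222": {"title": "Title two", "abstract": "Abstract two."},
    "333": {"title": "Title three", "abstract": "Abstract three."},
}


def tabulate_by_pmid(pairs):
    """Group GO identifiers by PMID, dropping duplicate matches."""
    table = defaultdict(set)
    for pmid, go_id in pairs:
        table[pmid].add(go_id)
    return {pmid: sorted(go_ids) for pmid, go_ids in table.items()}


def join_with_records(table, records):
    """Inner join: keep only PMIDs that have both annotations and a record."""
    return [
        {"pmid": pmid, "go": go_ids, **records[pmid]}
        for pmid, go_ids in sorted(table.items())
        if pmid in records
    ]


rows = join_with_records(tabulate_by_pmid(matches), records)
for row in rows:
    print(row)
```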
28Workflow Deployment
29GO Annotator Version 2
- Use Saccharomyces (Yeast) Genome Database as source of papers expertly curated with GO Annotations
- Train classifier using these papers
- Hierarchical classification
- Training data sufficient to classify over 2000 GO Annotations
- Classifier is then applied to assign GO Annotations to unseen papers
- Main Issues
- Choice of features to be extracted from the training documents
- Choice of feature reduction methods to produce accurate classification
- Choice of classification algorithm to be used
30GO Annotator Version 2: Underlying Discovery Net Workflow
31GO Annotator Version 2: Underlying Discovery Net Workflow
Papers expertly curated with GO Annotations from SGD database.
32GO Annotator Version 2: Underlying Discovery Net Workflow
Generate vector of features (frequent phrases) for each paper. This is used to train the classifier.
33GO Annotator Version 2: Underlying Discovery Net Workflow
Generate a Naïve Bayesian classification model.
34GO Annotator Version 2: Underlying Discovery Net Workflow
Generate vector of features (frequent phrases) for each paper in the test data set. This is used to test the classifier.
35GO Annotator Version 2: Underlying Discovery Net Workflow
Apply classification model to test data to evaluate classification accuracy.
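The train, model and test steps of this workflow can be illustrated with a small multinomial Naive Bayes classifier over phrase features. This is a flat, single-label sketch with add-one smoothing, written from the textbook definition; it does not reproduce the hierarchical classifier or the Discovery Net components, and the phrases and GO labels are invented.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Fit a multinomial Naive Bayes model from (features, label) pairs."""
    label_counts = Counter()
    feature_counts = defaultdict(Counter)
    vocabulary = set()
    for features, label in examples:
        label_counts[label] += 1
        feature_counts[label].update(features)
        vocabulary.update(features)
    return label_counts, feature_counts, vocabulary


def predict(model, features):
    """Return the label with the highest log posterior (add-one smoothing)."""
    label_counts, feature_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total_docs)
        denominator = sum(feature_counts[label].values()) + len(vocabulary)
        for feature in features:
            if feature in vocabulary:  # ignore phrases never seen in training
                score += math.log(
                    (feature_counts[label][feature] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented training data: phrase features paired with placeholder GO labels.
training = [
    (["dna repair", "double strand break"], "GO:A"),
    (["dna repair", "damage response"], "GO:A"),
    (["protein folding", "heat shock"], "GO:B"),
    (["protein folding", "chaperone"], "GO:B"),
]
test_set = [(["dna repair", "chaperone"], "GO:A"), (["heat shock"], "GO:B")]

model = train(training)
predictions = [predict(model, features) for features, _ in test_set]
accuracy = sum(p == label
               for p, (_, label) in zip(predictions, test_set)) / len(test_set)
print(predictions, accuracy)
```

A real GO annotator has to assign several labels per paper from thousands of candidates, so a single-label model like this one would need to be extended, for example with one classifier per GO term or per level of the hierarchy.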
36Interface and Results Presentation
37Achievements to date
- Infrastructure Interoperability
- More than just remote web service invocation: interoperable metadata models
- Mark 1 System Implemented
- Annotation based on terminology lookups
- 15% Recall, 5% Precision (exact matches for 18,000 GO terms)
- Measures inadequate due to incompleteness of gold standard
- In process of finalising Training Data Sets and Evaluation Metrics
- 4,922 papers referencing 2,455 GO Terms
- Mark 2 Systems in Progress
- Naïve Bayesian Approach
- 41% Recall and 27% Precision
- User Interfaces
- Mark 3, 4, Systems and Evaluation
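Because each paper can carry several GO terms, precision and recall for the annotator are computed over sets of labels rather than single predictions. The sketch below uses micro-averaging with exact identifier matches; that choice is an assumption for illustration, since the slides do not state which averaging scheme was used, and the data are invented.

```python
def micro_precision_recall(predicted, gold):
    """Micro-averaged precision and recall over per-document label sets.

    `predicted` and `gold` map a document id to a set of GO identifiers.
    Only exact identifier matches count as correct.
    """
    true_positives = sum(len(predicted.get(doc, set()) & labels)
                         for doc, labels in gold.items())
    n_predicted = sum(len(labels) for labels in predicted.values())
    n_gold = sum(len(labels) for labels in gold.values())
    precision = true_positives / n_predicted if n_predicted else 0.0
    recall = true_positives / n_gold if n_gold else 0.0
    return precision, recall


# Invented gold standard and system output with placeholder identifiers.
gold = {"p1": {"GO:A", "GO:B"}, "p2": {"GO:C"}}
predicted = {"p1": {"GO:A", "GO:D"}, "p2": {"GO:C", "GO:E"}}

precision, recall = micro_precision_recall(predicted, gold)
print(precision, recall)
```

Under this scheme a predicted term that is actually appropriate but absent from the curated annotations is counted as an error, which is one way an incomplete gold standard depresses measured precision.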
38Implementation Options
- Feature Vector Options
- Bag of words
- Frequent Phrases
- Key Phrases (Gene Names, Protein Names, MeSH terms, etc.)
- Classifier Options
- Bayesian Classifiers
- Support Vector Machines
- Drag Push (a novel centroid based method)
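The first two feature vector options can be contrasted in code. The sketch assumes a simple definition of "frequent phrase", namely a word n-gram that occurs in at least a minimum number of documents; the definition actually used in the project is not given in the slides, and the function names and example texts are invented.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bag_of_words(text):
    """Bag of words: each token with its count, word order discarded."""
    return Counter(tokenize(text))


def frequent_phrases(documents, n=2, min_docs=2):
    """Word n-grams that occur in at least `min_docs` documents."""
    doc_frequency = Counter()
    for text in documents:
        tokens = tokenize(text)
        ngrams = {" ".join(tokens[i:i + n])
                  for i in range(len(tokens) - n + 1)}
        doc_frequency.update(ngrams)  # a set, so each document counts once
    return sorted(p for p, df in doc_frequency.items() if df >= min_docs)


documents = [
    "DNA repair in yeast cells",
    "Yeast cells and DNA repair",
    "Protein folding in yeast",
]
bag = bag_of_words("DNA repair and DNA damage")
phrases = frequent_phrases(documents)
print(bag)
print(phrases)
```

Bag-of-words vectors are large and lose word order, while phrase features are fewer and keep local context such as "dna repair"; the feature reduction question raised earlier is essentially a choice of threshold like `min_docs`.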
39Lessons Learnt and Challenges to Face
- Infrastructure
- Interoperability Issues
- Performance Issues
- Communication vs Persistence of remote server
- Off-line vs on-line feature extraction
- Text Mining
- Usability Issues
- Evaluation Issues