Title: The Research Assistant for Biological Text Mining
1The Research Assistant for Biological Text Mining
- Luc Dehaspe
- Other Members of the BioMinT Consortium
2Text Mining in the biological domain
- Emerging field of research and development
- 40 articles in Bioinformatics 2004
- Dedicated workshops, competitions and interest groups
- Information retrieval and extraction to deal with information overflow
  - 12 million citations in Medline from 4600 journals
  - Many more resources on the web
- Essential link in the semantic integration of the
numerous biological resources.
3Use of text mining for database annotation
- curated protein sequence database
- high level of annotation of proteins
- high level of integration with other databases
Swiss-Prot Entry Creation Flowchart
4Use of database annotations for text mining
- Tools for information retrieval, filtering, classification, extraction rely on
  - Corpora of examples used by machine learning methods
  - Linguistic analysis and controlled vocabularies (ontologies, thesauri, biological dictionaries)
- Databases provide semi-structured information that could be used
  - for corpus elaboration
  - as specific vocabulary resources
5
- 3-year FP5 European Project, started in January 2003
- Official web site: www.biomint.org
- Interdisciplinary consortium
6The goals of BioMinT
- To develop a generic text mining tool that
  - interprets different types of queries
  - retrieves relevant documents from the biological literature
  - extracts the required information
  - outputs the result as a database slot filler or as a structured report
- The tool thus provides two essential research support services
  - Curator's Assistant: accelerate, by partially automating, the annotation and update of databases
  - Researcher's Assistant: generate readable reports in response to queries from biological researchers.
7Curator's Assistant for Swiss-Prot Annotation
8Curator's Assistant for PRINTS annotation
- PRINTS deals with groups of proteins
- Annotation of 3 types of protein fingerprints
Extracted Information
9The Biological Research Assistant
- Overlap with Curator's Assistant
- All biologists occasionally in the curator's seat
- Keep ahead of Swiss-Prot in research area of interest
- Include private (confidential) document collections
10Information retrieval and extraction modules
11Information retrieval and extraction modules
GUI
- IR: Query expansion, PubMed search, Document filtering/ranking, Document organisation
- IE: Sentence extractor, NLP tools, Case frame generator
12Information Retrieval
- A meta-query engine built around PubMed
- Expansion of the initial query with synonyms using a gene/protein synonym database (GPSDB)
  - the goal being to retrieve an exhaustive set of documents containing information on a protein.
- Filtering and ranking of the retrieved documents
- Pre-classification according to information topics.
13GPSDB
- Database for synonym expansion of gene and protein names
- Populated by the main resources on model organisms
- Contains 559294 synonyms referring to 292472 proteins
14GPSDB
- Cross-reference links are used to connect database entries that refer to the same gene/protein entity, thus pointing out the problem of homonymy when it occurs
15GPSDB screenshot
lap2 is a synonym of three separate protein entities:
- Erbin
- HSP 86
- Thymopoietin
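The homonymy check described on the GPSDB slides can be sketched as a reverse index from names to entities. This is only an illustration: the record layout, `RECORDS`, `build_synonym_index` and `homonyms` are hypothetical names, not the GPSDB schema or API. Only the lap2 ambiguity is taken from the slide.

```python
from collections import defaultdict

# Hypothetical toy records: (entity name, list of synonyms).
# Only the lap2 ambiguity comes from the slide; the layout is an assumption.
RECORDS = [
    ("Erbin", ["lap2"]),
    ("HSP 86", ["lap2"]),
    ("Thymopoietin", ["lap2"]),
]

def build_synonym_index(records):
    """Map each lowercased name or synonym to the set of entities it can denote."""
    index = defaultdict(set)
    for entity, synonyms in records:
        index[entity.lower()].add(entity)  # an entity name also denotes itself
        for synonym in synonyms:
            index[synonym.lower()].add(entity)
    return index

def homonyms(index):
    """Return the names that point to more than one entity."""
    return {name: sorted(entities)
            for name, entities in index.items()
            if len(entities) > 1}

index = build_synonym_index(RECORDS)
print(homonyms(index))
# {'lap2': ['Erbin', 'HSP 86', 'Thymopoietin']}
```

A name that maps to a single entity can be expanded safely. A name such as lap2 needs the user, or the surrounding context, to pick the intended entity.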
16GPSDB screenshot
17GPSDB used for query expansion
- Original user query: lap2
- Query expansion based on GPSDB
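One plausible shape for the expanded query is a boolean OR over the original term and the names found in the synonym database. The sketch below assumes that form. `expand_query` and the `SYNONYMS` dictionary are hypothetical, and the slides do not state how the BioMinT prototype actually builds the query it sends to PubMed.

```python
def expand_query(term, synonym_index):
    """Return a boolean OR query over the term and its known synonyms."""
    names = [term] + sorted(synonym_index.get(term.lower(), []))
    # Quote multi-word names so that they are searched as phrases.
    quoted = ['"{}"'.format(name) if " " in name else name for name in names]
    return " OR ".join(quoted)

# Hypothetical in-memory stand-in for a GPSDB lookup.
SYNONYMS = {"lap2": ["Erbin", "HSP 86", "Thymopoietin"]}

print(expand_query("lap2", SYNONYMS))
# lap2 OR Erbin OR "HSP 86" OR Thymopoietin
```

Because lap2 is a homonym, an expansion like this one pulls in documents about three different proteins, which is one reason the retrieved set is filtered and ranked afterwards.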
18Document filtering and ranking
- Interactive modules which permit a flexible selection of relevant documents for the IE process.
- Algorithmic approaches
  - Query dependent
    - Lucene Ranker: Java-based indexing engine giving a ranked output of queried documents
  - Query independent
    - Naive Bayes Ranker: uses pre-trained classification of relevant documents on specific topics
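To illustrate the query-independent approach, here is a minimal multinomial Naive Bayes ranker that orders documents by their log-odds of belonging to a topic. It is a sketch under stated assumptions, not the BioMinT implementation: the training texts are invented toy data, and `train_naive_bayes`, `log_odds` and `rank` are hypothetical names.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def train_naive_bayes(labeled_docs):
    """labeled_docs: list of (text, is_relevant). Returns the model parameters."""
    counts = {True: Counter(), False: Counter()}
    doc_totals = Counter()
    for text, relevant in labeled_docs:
        counts[relevant].update(tokenize(text))
        doc_totals[relevant] += 1
    vocab = set(counts[True]) | set(counts[False])
    return counts, doc_totals, vocab

def log_odds(text, model):
    """Log-odds that the text is relevant to the topic under the model."""
    counts, doc_totals, vocab = model
    score = math.log(doc_totals[True]) - math.log(doc_totals[False])
    for label, sign in ((True, 1.0), (False, -1.0)):
        total = sum(counts[label].values())
        for token in tokenize(text):
            if token in vocab:  # words never seen in training are ignored
                # Laplace (add-one) smoothing avoids zero probabilities.
                p = (counts[label][token] + 1) / (total + len(vocab))
                score += sign * math.log(p)
    return score

def rank(docs, model):
    """Sort documents from most to least likely to be on topic."""
    return sorted(docs, key=lambda doc: log_odds(doc, model), reverse=True)

# Invented toy training set for a "Disease" topic (assumption, not project data).
TRAIN = [
    ("mutations cause muscular dystrophy in patients", True),
    ("the disease is linked to a gene mutation", True),
    ("crystal structure of the protein domain", False),
    ("the protein binds dna in vitro", False),
]
model = train_naive_bayes(TRAIN)
docs = ["structure of the binding domain",
        "patients with the disease carry mutations"]
print(rank(docs, model)[0])
```

The score does not depend on the user's query, only on the pre-trained topic model, so the same ranking can be reused for any query on that topic.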
19Document filtering and ranking
Output of query dependent ranking
20Document filtering and ranking
Output of query independent ranking with respect
to topic Disease
21Information retrieval and extraction modules
GUI
- IR: Query expansion, PubMed search, Document filtering/ranking, Document organisation
- IE: Sentence extractor, NLP tools, Case frame generator
22Sentence extractor
- Goal: extract sentences with information relevant for protein annotation
- Method: machine learning from corpora with manually labeled sentences
- Data representation: bag-of-words approach
- Best results with Support Vector Machines (linear/Radial Basis Function)
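The combination of a bag-of-words representation with a linear SVM can be sketched in a few lines of plain Python. Everything here is an assumption for illustration: the labeled sentences are invented, the trainer is a simple subgradient descent on the hinge loss without a bias term, and the function names are hypothetical. The slides do not say which SVM implementation the project used.

```python
from collections import Counter

def bag_of_words(sentence):
    """Represent a sentence as token counts, ignoring word order."""
    return Counter(sentence.lower().split())

def score(weights, features):
    return sum(weights.get(tok, 0.0) * cnt for tok, cnt in features.items())

def train_linear_svm(examples, epochs=20, learning_rate=0.1, reg=0.01):
    """examples: list of (sentence, label) with label +1 or -1."""
    weights = {}
    for _ in range(epochs):
        for sentence, label in examples:
            features = bag_of_words(sentence)
            margin = label * score(weights, features)
            # L2 regularisation: shrink every weight slightly at each step.
            for tok in weights:
                weights[tok] *= 1.0 - learning_rate * reg
            # Hinge loss: update only when the example is inside the margin.
            if margin < 1.0:
                for tok, cnt in features.items():
                    weights[tok] = weights.get(tok, 0.0) + learning_rate * label * cnt
    return weights

def is_relevant(weights, sentence):
    return score(weights, bag_of_words(sentence)) > 0.0

# Invented toy corpus: +1 = relevant for annotation, -1 = not relevant.
TRAIN = [
    ("the protein binds dna in the nucleus", +1),
    ("samples were centrifuged for ten minutes", -1),
    ("this protein is localized to the inner nuclear membrane", +1),
    ("we thank reviewers for helpful comments", -1),
    ("mutations in the gene cause muscular dystrophy", +1),
    ("cells were grown at 37 degrees overnight", -1),
]
weights = train_linear_svm(TRAIN)
print(is_relevant(weights, "the protein is localized to the membrane"))
print(is_relevant(weights, "samples were grown overnight"))
```

A Radial Basis Function kernel, the other variant named on the slide, would need a kernelised trainer and is not covered by this sketch.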
23Sentence extractor: Sample output
- set of sentences extracted from the top 5 ranked papers
- query-terms are highlighted
- sentences classified according to topics (function, structure, disease)
- sentences linked to the PubMed abstract they originate from
24Case frame generator
A protein containing the N-terminal domain with the first transmembrane segment of MAN1 is retained in the inner nuclear membrane.
TARGETED_TO: X = MAN1, Y = inner nuclear membrane
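A case frame like the one above can be represented as a small record holding a relation name and its two slots. The sketch below fills it with a hand-written regular expression that only covers this one sentence shape. `CaseFrame`, `PATTERN` and `extract_targeted_to` are hypothetical names, and the pattern is a stand-in: in BioMinT the extraction rules are learned, not written by hand.

```python
import re
from dataclasses import dataclass

@dataclass
class CaseFrame:
    relation: str  # e.g. TARGETED_TO
    x: str         # the entity being targeted
    y: str         # the location it is targeted to

# Hypothetical hand-written pattern, used here in place of a learned IE rule.
PATTERN = re.compile(
    r"of (?P<x>[A-Z][A-Za-z0-9]*) is retained in the (?P<y>[a-z ]+)\."
)

def extract_targeted_to(sentence):
    """Return a TARGETED_TO case frame, or None if the pattern does not match."""
    match = PATTERN.search(sentence)
    if match is None:
        return None
    return CaseFrame("TARGETED_TO", match.group("x"), match.group("y"))

sentence = ("A protein containing the N-terminal domain with the first "
            "transmembrane segment of MAN1 is retained in the inner nuclear membrane.")
print(extract_targeted_to(sentence))
# CaseFrame(relation='TARGETED_TO', x='MAN1', y='inner nuclear membrane')
```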
25Case frame generator
- Goal: automatic identification of selected types of entities, relations, or events in free text
- Methods
  - Given a set of pre-labeled sentences, learn IE templates with Inductive Logic Programming (ILP)
- Background knowledge
  - Syntactic and semantic information from the shallow parser
  - Ontologies providing entities in a given domain
- Text analysis tools
  - Shallow Parser (MBSP) based on Machine Learning (TiMBL)
  - Shallow parser adapted to the biomedical field using the Genia corpus
26Case frame generator: Sample output shallow parser
- The mouse lymphoma assay (MLA) utilizing the Tk gene is widely used to identify chemical mutagens.
[Figure: shallow parser output for this sentence, with the chunks "The mouse lymphoma assay", "MLA", "utilizing", "the TK gene", "to identify" and "chemical mutagens", and the entity labels "Cell-line" and "DNA part"]
27Case frame generator: Sample output
- Information extracted by the Case Frame Generator, which applied machine-learned IE rules to the output of the Shallow Parser
28Summary
- The BioMinT prototype is a working unified system for Biological Text Mining
  - Information Retrieval
    - query expansion
    - doc filtering/ranking
  - Information extraction
    - Extraction of sentences on user-specified topics
    - Extraction of relationships between entities (Case frames)
- Based on a variety of resources/technologies/expertise
  - Biological sciences: corpus annotation, database annotation, fingerprints, ontologies, ...
  - Artificial intelligence: IR, machine learning (SVM, ILP, ...), Natural Language Processing (Shallow Parser), Case Frames, ...
  - Software development: databases, web-server, GUI, ...
29Future BioMinT developments
- Integration of BioMinT prototype in the future annotation environment of Swiss-Prot and PRINTS
- Release Q4-2005
  - Free web-based version, with restrictions on
    - Simultaneous users
    - Resources per user (computing, storage)
- Customization services provided by PharmaDM
  - Integration into researcher's IT environment (e-mail alerts, ...)
  - Mining in-house document collections
  - Combination with DMax data analysis software
  - Incorporation of highly specialized background knowledge (ontologies, thesauri, biological dictionaries, etc.)
  - Custom reports and GUI, etc.
30WWW
- BioMinT home page: http://www.biomint.org
- GPSDB synonyms database: http://biomint.oefai.at
- BioMinT prototype Quick Tour
  - http://biomint-server.pharmadm.com:8080/xwiki/bin/view/BioMinT/ProtopQuickTour
31Acknowledgements
Artificial Intelligence
Biological Sciences
Interested? Demo? Leave your card at POSTER 49