The Research Assistant for Biological Text Mining

Description:

Use of database annotations for text mining ... TWIST. H-twist. TWIST1. FACL3 BioMinT - 2005 Knowledge For Growth 3 June 2005. GPSDB screenshot ...

Slides: 32

Provided by: lievede

Transcript and Presenter's Notes

1
The Research Assistant for Biological Text Mining

Luc Dehaspe
Other Members of the BioMinT Consortium

2
Text Mining in the biological domain

Emerging field of research and development
40 articles in Bioinformatics 2004
Dedicated workshops, competitions and interest
groups
Information retrieval and extraction to deal with
information overflow
12 million citations in Medline from 4600
journals
Many more resources on the web
Essential link in the semantic integration of the
numerous biological resources.

3
Use of text mining for database annotation

Swiss-Prot: curated protein sequence database
high level of annotation of proteins
high level of integration with other databases

[Figure: Swiss-Prot Entry Creation Flowchart]

4
Use of database annotations for text mining

Tools for information retrieval, filtering,
classification and extraction rely on:
Corpora of examples used by machine learning
methods
Linguistic analysis and controlled vocabularies
(ontologies, thesauri, biological dictionaries)
Databases provide semi-structured information
that could be used:
for corpus elaboration
as specific vocabulary resources

5
3-year FP5 European project, started in January
2003
Official web site: www.biomint.org
Interdisciplinary consortium

6
The goals of BioMinT

To develop a generic text mining tool that:
interprets different types of queries
retrieves relevant documents from the biological
literature
extracts the required information
outputs the result as a database slot filler or
as a structured report
The tool thus provides two essential research
support services:
Curator's Assistant: accelerates the annotation
and updating of databases by partially automating
them
Researcher's Assistant: generates readable reports
in response to queries from biological
researchers

7
Curator's Assistant for Swiss-Prot Annotation
8
Curator's Assistant for PRINTS Annotation

PRINTS deals with groups of proteins
Annotation of 3 types of protein fingerprints

Extracted Information
9
The Biological Research Assistant

Overlap with the Curator's Assistant
All biologists are occasionally in the curator's
seat
Keep ahead of Swiss-Prot in a research area of
interest
Include private (confidential) document
collections

10
Information retrieval and extraction modules
11
Information retrieval and extraction modules
[Diagram: module overview]
GUI
IR: Query expansion, PubMed search, Document
filtering/ranking, Document organisation
IE: Sentence extractor, NLP tools, Case frame
generator
12
Information Retrieval

A meta-query engine built around PubMed
Expansion of the initial query with synonyms
from a gene/protein synonym database (GPSDB),
the goal being to retrieve an exhaustive set of
documents containing information on a protein
Filtering and ranking of the retrieved documents
Pre-classification according to information
topics
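
The synonym expansion step can be illustrated with a minimal sketch. It assumes a tiny in-memory synonym table in place of GPSDB and builds a boolean OR query of the kind PubMed accepts. The function name `expand_query` and the table layout are hypothetical; the names TWIST, H-twist and TWIST1 are taken from the presentation description, on the assumption that they form one synonym group.

```python
def expand_query(term, synonym_groups):
    """Return a boolean OR query covering every name that shares a group with `term`."""
    wanted = term.lower()
    names = [term]
    for group in synonym_groups:
        # A group is used only if the query term is one of its names.
        if any(name.lower() == wanted for name in group):
            for name in group:
                if name.lower() not in {n.lower() for n in names}:
                    names.append(name)
    # Quote each name so multi-word synonyms stay together in the query.
    return " OR ".join(f'"{name}"' for name in names)


# Hypothetical stand-in for GPSDB: one list of names per gene/protein entity.
SYNONYM_GROUPS = [["TWIST", "H-twist", "TWIST1"]]

expanded = expand_query("TWIST1", SYNONYM_GROUPS)
unchanged = expand_query("FACL3", SYNONYM_GROUPS)  # no group known: query is kept as-is
print(expanded)
```

A term without any known synonyms is passed through unchanged, so the expansion never narrows the original query.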

13
GPSDB

Database for synonym expansion of gene and
protein names
Populated by the main resources on model
organisms
Contains 559,294 synonyms referring to 292,472
proteins

14
GPSDB

Cross-reference links connect database entries
that refer to the same gene/protein entity, which
also exposes homonymy where it occurs

15
GPSDB screenshot
lap2 is a synonym of three separate protein
entities
Erbin
HSP 86
Thymopoietin
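
Homonym detection of this kind can be sketched as a name index. The sketch below is not the GPSDB schema: the `ENTRIES` layout and the function names are assumptions, and only the lap2 example from the slide is used as data.

```python
from collections import defaultdict

# Hypothetical entries: (entity name, list of synonyms). Data from the lap2 slide.
ENTRIES = [
    ("Erbin", ["lap2"]),
    ("HSP 86", ["lap2"]),
    ("Thymopoietin", ["lap2"]),
]


def build_name_index(entries):
    """Map every lower-cased name or synonym to the set of entities it can denote."""
    index = defaultdict(set)
    for entity, synonyms in entries:
        for name in [entity, *synonyms]:
            index[name.lower()].add(entity)
    return index


def find_homonyms(index):
    """Return the names that point to more than one entity."""
    return {name: sorted(entities)
            for name, entities in index.items() if len(entities) > 1}


homonyms = find_homonyms(build_name_index(ENTRIES))
print(homonyms)
```

A name that resolves to several entities is a signal that the expanded query should be disambiguated, or that its results should be grouped per entity.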
16
GPSDB screenshot
17
GPSDB used for query expansion
lap2
Original user query
Query expansion based on GPSDB
18
Document filtering and ranking

Interactive modules that permit flexible
selection of relevant documents for the IE
process.
Algorithmic approaches:
Query dependent: Lucene Ranker, a Java-based
indexing engine that returns a ranked list of the
queried documents
Query independent: Naive Bayes Ranker, using a
classifier pre-trained on documents relevant to
specific topics
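
Query-independent ranking with a Naive Bayes model can be sketched as follows. This is a minimal illustration, not the BioMinT ranker: the training texts are toy examples for a "Disease" topic, all function names are hypothetical, and the score is the log-odds of relevance under a multinomial model with add-one smoothing.

```python
import math
from collections import Counter


def train_naive_bayes(documents, labels):
    """Count words per class. Labels are True (on topic) or False (off topic)."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter(labels)  # both classes must occur at least once
    for text, label in zip(documents, labels):
        word_counts[label].update(text.lower().split())
    vocabulary = set(word_counts[True]) | set(word_counts[False])
    return word_counts, doc_counts, vocabulary


def relevance_score(text, model):
    """Log-odds that `text` is on topic; words outside the vocabulary are ignored."""
    word_counts, doc_counts, vocabulary = model
    score = math.log(doc_counts[True]) - math.log(doc_counts[False])
    for word in text.lower().split():
        if word not in vocabulary:
            continue
        for label, sign in ((True, 1), (False, -1)):
            total = sum(word_counts[label].values())
            # Add-one (Laplace) smoothing avoids zero probabilities.
            prob = (word_counts[label][word] + 1) / (total + len(vocabulary))
            score += sign * math.log(prob)
    return score


def rank(documents, model):
    """Order documents from most to least likely to be on topic."""
    return sorted(documents, key=lambda doc: relevance_score(doc, model), reverse=True)


# Toy training data (assumption): two on-topic and two off-topic texts.
MODEL = train_naive_bayes(
    ["mutation causes disease in patients",
     "patients with the disease carry the mutation",
     "crystal structure of the protein domain",
     "the protein domain binds dna"],
    [True, True, False, False],
)
ON_TOPIC = "a mutation linked to disease"
OFF_TOPIC = "structure of a dna binding domain"
ranked = rank([OFF_TOPIC, ON_TOPIC], MODEL)
```

Because the model is trained once per topic, the same scores can be reused for any query, which is what makes this ranking query independent.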

19
Document filtering and ranking
Output of query dependent ranking
20
Document filtering and ranking
Output of query independent ranking with respect
to topic Disease
21
Information retrieval and extraction modules
[Diagram: module overview]
GUI
IR: Query expansion, PubMed search, Document
filtering/ranking, Document organisation
IE: Sentence extractor, NLP tools, Case frame
generator
22
Sentence extractor

Goal: extract sentences with information relevant
for protein annotation
Method: machine learning from corpora with
manually labeled sentences
Data representation: bag-of-words approach
Best results with Support Vector Machines
(linear/Radial Basis Function)
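
The bag-of-words representation and a linear SVM can be sketched in plain Python. The sketch makes several simplifying assumptions: the four training sentences are toy data (the first is adapted from the case frame example in this talk), the classifier has no bias term, and training is plain subgradient descent on the regularised hinge loss rather than a production SVM solver. All names are hypothetical.

```python
from collections import Counter


def bag_of_words(sentence, vocabulary):
    """Represent a sentence as word counts over a fixed vocabulary."""
    counts = Counter(sentence.lower().split())
    return [counts[word] for word in vocabulary]


def train_linear_svm(vectors, labels, epochs=50, learning_rate=0.1, reg=0.01):
    """Subgradient descent on the L2-regularised hinge loss (no bias term)."""
    weights = [0.0] * len(vectors[0])
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            margin = y * sum(w * xi for w, xi in zip(weights, x))
            # Regularisation step: shrink all weights slightly.
            weights = [w * (1.0 - learning_rate * reg) for w in weights]
            if margin < 1.0:
                # Hinge loss is active: move the weights towards this example.
                weights = [w + learning_rate * y * xi for w, xi in zip(weights, x)]
    return weights


def is_relevant(sentence, vocabulary, weights):
    """Classify a sentence by the sign of the linear decision function."""
    x = bag_of_words(sentence, vocabulary)
    return sum(w * xi for w, xi in zip(weights, x)) > 0.0


# Toy labeled sentences (assumption): +1 relevant for annotation, -1 not relevant.
TRAINING = [
    ("MAN1 is retained in the inner nuclear membrane", 1),
    ("the protein localizes to the nuclear membrane", 1),
    ("we thank our colleagues for helpful comments", -1),
    ("samples were purchased from a commercial supplier", -1),
]
VOCABULARY = sorted({word for text, _ in TRAINING for word in text.lower().split()})
VECTORS = [bag_of_words(text, VOCABULARY) for text, _ in TRAINING]
LABELS = [label for _, label in TRAINING]
WEIGHTS = train_linear_svm(VECTORS, LABELS)
```

A Radial Basis Function kernel, the other variant named on the slide, would replace the dot product with a kernel evaluation against support vectors and is not covered by this sketch.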

23
Sentence extractor: Sample output

set of sentences extracted from the top 5 ranked
papers
query-terms are highlighted
sentences classified according to topics
(function, structure, disease)
sentences linked to the PubMed abstract they
originate from

24
Case frame generator
A protein containing the N-terminal domain with
the first transmembrane segment of MAN1 is
retained in the inner nuclear membrane.
TARGETED_TO: X = MAN1, Y = inner nuclear membrane
25
Case frame generator

Goal: automatic identification of selected types
of entities, relations, or events in free text
Methods:
Given a set of pre-labeled sentences, learn IE
templates with Inductive Logic Programming (ILP)
Background knowledge:
Syntactic and semantic information from the
shallow parser
Ontologies providing entities in a given domain
Text analysis tools:
Shallow Parser (MBSP) based on Machine Learning
(TiMBL)
Shallow parser adapted to the biomedical field
using the Genia corpus
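
What a filled case frame looks like can be shown with a small sketch. It uses a hand-written regular expression as a stand-in for an IE template; in BioMinT the templates are learned with ILP from shallow-parser output, which this sketch does not attempt. The `CaseFrame` class, the pattern and the function name are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseFrame:
    """A relation with two filled slots, e.g. TARGETED_TO(X, Y)."""
    relation: str
    x: str
    y: str


# Hand-written stand-in for a learned template: "<X> is retained in (the) <Y>".
TARGETED_TO_PATTERN = re.compile(
    r"(?P<x>\w+) is retained in (?:the )?(?P<y>[^.]+)"
)


def extract_targeted_to(sentence):
    """Return a TARGETED_TO case frame, or None if the pattern does not match."""
    match = TARGETED_TO_PATTERN.search(sentence)
    if match is None:
        return None
    return CaseFrame("TARGETED_TO", match.group("x"), match.group("y").strip())


# Example sentence from the case frame generator slide.
SENTENCE = (
    "A protein containing the N-terminal domain with the first "
    "transmembrane segment of MAN1 is retained in the inner nuclear membrane."
)
frame = extract_targeted_to(SENTENCE)
print(frame)
```

A surface pattern like this breaks as soon as the wording changes, which is one motivation for learning templates over parsed and ontology-annotated text instead.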

26
Case frame generator: Sample output of the shallow parser

Input sentence: The mouse lymphoma assay (MLA) utilizing the Tk
gene is widely used to identify chemical mutagens.

[Figure: shallow parser output for this sentence.
Entity labels shown: Cell-line, DNA part.
Chunks shown: The mouse lymphoma assay, MLA,
utilizing, the TK gene, to identify, chemical
mutagens]
27
Case frame generator: Sample output

Information extracted by the Case Frame
Generator, which applies machine-learned IE rules
to the output of the Shallow Parser

28
Summary

The BioMinT prototype is a working unified system
for Biological Text Mining
Information retrieval:
query expansion
document filtering/ranking
Information extraction:
Extraction of sentences on user-specified topics
Extraction of relationships between entities
(case frames)
Based on a variety of resources, technologies and
expertise:
Biological sciences: corpus annotation, database
annotation, fingerprints, ontologies, ...
Artificial intelligence: IR, machine learning
(SVM, ILP, ...), Natural Language Processing
(Shallow Parser), case frames, ...
Software development: databases, web server,
GUI, ...

29
Future BioMinT developments

Integration of the BioMinT prototype in the future
annotation environment of Swiss-Prot and PRINTS
Release: Q4 2005
Free web-based version, with restrictions on:
Simultaneous users
Resources per user (computing, storage)
Customization services provided by PharmaDM:
Integration into researchers' IT environment
(e-mail alerts, ...)
Mining in-house document collections
Combination with DMax data analysis software
Incorporation of highly specialized background
knowledge (ontologies, thesauri, biological
dictionaries, etc.)
Custom reports and GUI, etc.

30
WWW

BioMinT home page: http://www.biomint.org
GPSDB synonyms database: http://biomint.oefai.at
BioMinT prototype Quick Tour:
http://biomint-server.pharmadm.com:8080/xwiki/bin/view/BioMinT/ProtopQuickTour

31
Acknowledgements
Artificial Intelligence
Biological Sciences
Interested? Demo? Leave your card at POSTER 49
