DIAMONDS MidTerm - PowerPoint PPT Presentation

Provided by: noamk

Transcript and Presenter's Notes

Title: DIAMONDS MidTerm

1
DIAMONDS Mid-Term
Michal Linial, The Hebrew University of Jerusalem
Brussels, 27 Sep 2006
2
Classification: Knowledge integration
Concepts
Survey: what we have (D.2)
How to reach the experimentalists
Quantitative aspects of knowledge integration
Advances and contribution
3
In quantitative terms
Concepts: Problem definition
Sequence: high quality, large amounts, constant flow, consistent
Structure: high quality, low amounts, sparse, partially consistent
Function: low/medium quality, large amounts, false positives, not consistent
Functional homologues often share low identity (similarity)
Function is an ambiguous term, context dependent
3D structure is a good intermediate, but the information is sparse
Inference limitation: <40% sequence identity
Solution (DIAMONDS)

Knowledge integration from different sources and different methodologies
Incorporating NON-SEQUENCE-based information (protein-protein (P2P) interaction, gene expression, pathways)
Add significance scores and evaluation for any functional inference

4
Survey: The advances of classifications
Increasing quality, improved functional inference, detection of hidden connections
Overcoming the inherent noise in the data, a good scheme for updating data and adding new knowledge
The challenge: defining the accuracy
Annotation must be based on (accurate) classification
Integration
Quality and confidence
Validation
5
Classification for Protein Families: why?

For function prediction and annotation
For an evolutionary view: orthologs and paralogs
For new protein folds
For constructing the protein space scaffold
For reducing the redundancy and determining the real building blocks
For a rich source of evolutionary relationships (e.g., the set of yeast proteins that participate in the cell cycle gives an immediate association with their homologues in human, and vice versa)

6
Avoiding confusion: Motif / Domain / Signature / Profile / Family / Cluster / Seed / Clan / ...

These terms are used interchangeably
They are (too) flexible

Our survey (D2.1): 29 different main protein family systems were compared. Pros and cons and the sources of inconsistency were identified.
7
A reminder: Protein Sequence Classifications
ProtoMap (http://protomap.cs.cornell.edu, Yona et al., 2000a)
ProtoNet (http://www.protonet.cs.huji.ac.il, Sasson et al., 2002)
PIR-ALN (http://www-nrbf.georgetown.edu/pirwww/dbinfo/piraln.html, Srinivasarao et al., 1999)
SYSTERS (http://systers.molgen.mpg.de, Krause et al., 2002)
ProClust (http://promoter.mi.uni-koeln.de/proclust, Pipenbacher et al., 2002)
CluSTr (http://www.ebi.ac.uk/clustr, Kriventseva et al., 2001)
Picasso (http://www.ebi.ac.uk/picasso, Heger and Holm, 2001)
TribeMCL (http://www.ebi.ac.uk/research/cgg/tribe, Enright et al., 2002)
PIR SF, Panther (integrated: iProClass, MetaFam)
8
Protein Sequence Domain Classifications
DOMO ADDA EVEREST InterPro CDD MetaFam Pfam
Blocks
ProSite Profile SBASE TigrFam eMotif SMART PRINTS ProDom
9
Integration: Data Fusion
InterPro: 11,000 entries
Based on UniProt DB
10
DIAMONDS platform associated with alternative
resource of Classification

Methods that are based on
A. Sequence (motif, proteins)
B. Structure
C. Function (annotation)
D. Evolution

The goal: new annotation, new family, family connections (sub/super)
Predicting power (given a new unknown sequence)
Focusing on a functional map (i.e., cell cycle related)
11
Challenges to be addressed: globally and in C-Cycle

Many families are very easy to detect
BLAST search can be used to detect many protein families
A classical 80-20 situation: 80% of families can be identified with 20% of the effort
But: sensitivity is low, remote homologues are lost
Validation tools are missing
Solution (DIAMONDS-HUJI):
High quality classifications and navigation tools
Annotation based integration

12
DIAMONDS platform (2) associated with
alternative resource of Classification

Sequence based (motif, proteins) (HUJI)
ProtoNet (proteins)
EVEREST (including external sources: Pfam, SCOP, CATH)
Annotation based scheme (including structure,
motif, phylogenetic, function) (HUJI)
PANDORA

13
ProtoNet 4.5 (August 2006)
Includes over 1 million proteins, UniProt based

Combining methodologies for best performance (dev)
Condensed to only 27,000 most significant clusters; only 3% are singletons
Built-in quality tools
ProtoNet annotations for families (ProtoName)
Visualization: unique features for experimentalists
AN OPTION to add to the system ANY new sequence that was experimentally discovered
14
For the experimentalist: Visualization tools
Pfam, Prosite, SMART, PRINTS, BLOCKS, ProDom.
15
and more
Automatically added: A NAME for a cluster (when coherent with its annotations)
16
Challenge 2: Quality of prediction

Compare with a knowledge-based DB (e.g., InterPro)
Take a supervised approach
For each family, look for the best match in the clustering
Analyze the correspondence between the cluster and the respective InterPro class (or any other expert-view class)

How much of the functional knowledge is captured by any classification system?

We define a matching score that allows performance comparison
It measures the correspondence between an expert class K and a cluster C
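
The slide does not give the formula for the matching score. As a minimal sketch only, the supervised comparison could look like the following, assuming a Jaccard-style overlap |K ∩ C| / |K ∪ C| as the correspondence measure. The function names, the choice of Jaccard, and the toy protein identifiers are all assumptions made for illustration, not the score used by the authors.

```python
def correspondence(expert_class, cluster):
    """Assumed score: Jaccard overlap between an expert class K and a cluster C."""
    k, c = set(expert_class), set(cluster)
    union = k | c
    return len(k & c) / len(union) if union else 0.0


def best_match(expert_class, clusters):
    """Return (cluster_id, score) for the cluster that best matches the expert class."""
    return max(
        ((cid, correspondence(expert_class, members)) for cid, members in clusters.items()),
        key=lambda pair: pair[1],
    )


# Hypothetical toy data: the protein identifiers are placeholders.
family = {"P1", "P2", "P3", "P4"}
clusters = {
    "c1": {"P1", "P2", "P3"},                    # misses one member
    "c2": {"P4", "P5"},                          # mostly unrelated
    "c3": {"P1", "P2", "P3", "P4", "P6", "P7"},  # complete but diluted
}
cid, score = best_match(family, clusters)
print(cid, score)
```

A score of this kind penalizes both missed family members and unrelated proteins pulled into the cluster, which is what allows different classification systems to be compared on the same expert classes.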

17
InterPro
ProtoNet matching: 83.5% of InterPro
Matching: 85% of ENZYME entries
18
Integration of Annotation
Challenge 3

High-quality annotations reaching the biologists
PANDORA concept

A web-based tool aimed at biological analysis of protein sets.
Biological information is shown through intersection and inclusion.
Goal: provide a biological roadmap of the gene or protein set.

19
Annotations included
20
BASIC SET
InterPro
Number of proteins
Sensitivity = TP/(TP+FN); red: FN, white: TP
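
For reference, the sensitivity shown on this slide, TP/(TP+FN), can be computed per annotation as in the small sketch below. The helper names and the toy sets are hypothetical. Here TP counts the annotated proteins that fall inside the set under study and FN counts the annotated proteins left outside it.

```python
def sensitivity(true_positives, false_negatives):
    """Sensitivity = TP / (TP + FN)."""
    total = true_positives + false_negatives
    if total == 0:
        raise ValueError("sensitivity is undefined when TP + FN == 0")
    return true_positives / total


def annotation_sensitivity(annotated_proteins, protein_set):
    """Fraction of the proteins carrying an annotation that appear in the set."""
    annotated, subset = set(annotated_proteins), set(protein_set)
    tp = len(annotated & subset)   # annotated and inside the set
    fn = len(annotated - subset)   # annotated but outside the set
    return sensitivity(tp, fn)


# Hypothetical example: 3 of the 4 annotated proteins are in the set.
value = annotation_sensitivity({"P1", "P2", "P3", "P4"}, {"P1", "P2", "P3", "P9"})
print(value)
```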
21
Challenge 3: Analyzing a test case
A message to the biologists:
1. Provide a set of genes/proteins originating from any omics technology to PANDORA (without any prior knowledge)
Output:
1. Functional maps
2. A ranked list of functions scored by their significance and a statistical P-value
3. Detection of mistakes and mis-annotations
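
The slide does not say which statistic stands behind the ranked list. One common way to score an annotation against a protein set is a hypergeometric enrichment test, sketched below as an assumption rather than as PANDORA's actual method. The function names, the population size, and the gene identifiers are illustrative placeholders.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, hits):
    """P(X >= hits) for X ~ Hypergeometric(population, annotated, sample)."""
    upper = min(annotated, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / comb(population, sample)


def rank_annotations(protein_set, annotations, population_size):
    """Return (term, hits, p_value) tuples, most significant first."""
    rows = []
    for term, members in annotations.items():
        hits = len(protein_set & members)
        p = enrichment_p_value(population_size, len(members), len(protein_set), hits)
        rows.append((term, hits, p))
    return sorted(rows, key=lambda row: row[2])


# Hypothetical input: 5 query genes drawn from a population of 20.
query = {"g1", "g2", "g3", "g4", "g5"}
annotations = {
    "cell cycle": {"g1", "g2", "g3", "g4", "g6"},
    "transport": {"g5", "g7", "g8", "g9", "g10", "g11", "g12", "g13"},
}
ranked = rank_annotations(query, annotations, population_size=20)
for term, hits, p in ranked:
    print(f"{term}: hits={hits}, p={p:.4g}")
```

In practice a multiple-testing correction would normally be applied when many annotations are tested against the same set.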
22
Beyond the gene/protein set: the added value of the biologist's comments
Input: add your comments and any quantitative properties
Examples:
Gene expression levels
Degradation time
Viability and toxicity levels
Tumorigenicity score
Quality of your RNA in the experiment
A platform in which COMMENTS, EXPERIMENTAL values and personal unformulated observations are incorporated and analyzed!
23
Example: From the biologist's notebook to PANDORA knowledge
Table columns: protein, quantitative, binary, multiple binary, comments
24
Graphical results
25
Current addition: Addressing the one-to-many problem
Many genes are alternatively spliced, resulting in a one-to-many association
Many proteins are multi-domain
26
The Modular Nature of Proteins
27
False Transitivity of Local Alignment
BLAST values
Pairwise similarities better than an E-score of 1e-40
If we cluster these proteins, assuming transitivity of local alignment scores, we will cluster K6A1_MOUSE with MPP3_HUMAN
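
The effect can be reproduced with a small single-linkage sketch. The bridging protein and both E-values below are hypothetical placeholders, since the slide's figure is not part of the transcript. Only the two end proteins and the 1e-40 threshold come from the slide. The point is that two proteins with no significant direct hit still land in one cluster when each aligns locally to a different region of a shared multi-domain protein.

```python
def single_linkage(hits, threshold):
    """Cluster proteins by treating every hit with E-value <= threshold as transitive."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b, evalue in hits:
        root_a, root_b = find(a), find(b)
        if evalue <= threshold and root_a != root_b:
            parent[root_a] = root_b

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    return list(clusters.values())


# Illustrative E-values only; MULTIDOMAIN_HYPOTHETICAL is an invented bridge.
hits = [
    ("K6A1_MOUSE", "MULTIDOMAIN_HYPOTHETICAL", 1e-45),  # shares one domain
    ("MULTIDOMAIN_HYPOTHETICAL", "MPP3_HUMAN", 1e-50),  # shares another domain
]
clusters = single_linkage(hits, threshold=1e-40)
print(clusters)
```

Clustering at the level of domains rather than whole proteins is one way to avoid this kind of chaining.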
28
On the Web
29
Evaluate any reference domain resources
30
Two that became one
Examples: EVEREST detecting new connections
Pfam (OLD): Taurine catabolism dioxygenase TauD, TfdA family
Pfam (NEW): a composed entry, TauD
31
Expanding the DIAMONDS Platform

We provide an automated framework for identification and classification of new protein domains. EVEREST covers almost 3 million proteins in UniProt and all of PDB.
Manual inspection of families scoring low w.r.t. Pfam suggested that many of those are valid families.
Incorporating EVEREST families into the DIAMONDS platform
Providing exhaustive coverage of all structural knowledge and domains in one resource.

32
Summary and conclusions

Enhanced validation is essential
Enhanced visualization / interactivity
Biologist information being analyzed
The limit of automatic methods in classification (sequence, domain, structure, function)
ProtoNet-PANDORA and EVEREST as part of the Platform
Thanks to DIAMONDS and to:
PANDORA team: Noam Kaplan
ProtoNet and EVEREST team: Elon Portugaly, Ori Sasson, Menachem Fromer
Cell Cycle Analysis set: Roy Varshavsky
Evaluation, DBA and Web: Michael Dvorkin, Alex Savanok