Title: DIAMONDS MidTerm
1 DIAMONDS Mid-Term
Michal Linial, The Hebrew University of Jerusalem
Brussels, 27 Sep 2006
2 Classification: Knowledge integration
- Concepts
- Survey: what we have (D.2)
- How to reach the experimentalists
- Quantitative aspects of knowledge integration
- Advances and contribution
3 In quantitative terms
Concepts: Problem definition
- Sequence: high quality, large amounts, constant flow, consistent
- Structure: high quality, low amounts, sparse, partially consistent
- Function: low/medium quality, large amounts, false positives, not consistent
- Functional homologues often share low identity (similarity)
- Function is an ambiguous term, context dependent
- Structure (3D) is a good intermediate, but only sparse information
- Inference limitation: <40% sequence identity
Solution (DIAMONDS)
- Knowledge integration from different sources and different methodologies
- Incorporating NON-SEQUENCE-based information (P2P interaction, gene expression, pathways)
- Add significance scores and evaluation for any functional inference
4 Survey: The advances of Classifications
Increasing quality, improved functional inference, detection of hidden connections, overcoming the inherent noise in the data, good scheme for data updating and adding new knowledge
The challenge
- Defining the accuracy
- Annotation must be based on (accurate) classification
- Integration
- Quality and confidence
- Validation
5 Classification for Protein Families: why?
- For function prediction and annotation
- For an evolutionary view: orthologs and paralogs
- For new protein folds
- For constructing the protein space scaffold
- For reducing the redundancy and determining the real building blocks
- For a rich source of evolutionary relationships (i.e., a set of proteins in yeast that participate in the cell cycle is immediately associated with their homologues in human, and vice versa)
6 Avoiding confusion: Motif / Domain / Signature / Profile / Family / Cluster / Seed / Clan
- These terms are used interchangeably
- They are (too) flexible
Our survey (D2.1): 29 different main protein family systems were compared. Pros and cons and the sources of inconsistency were identified.
7 A reminder: Protein Sequence Classifications
- ProtoMap (http://protomap.cs.cornell.edu, Yona et al., 2000a)
- ProtoNet (http://www.protonet.cs.huji.ac.il, Sasson et al., 2002)
- PIR-ALN (http://www-nrbf.georgetown.edu/pirwww/dbinfo/piraln.html, Srinivasarao et al., 1999)
- SYSTERS (http://systers.molgen.mpg.de, Krause et al., 2002)
- ProClust (http://promoter.mi.uni-koeln.de/proclust, Pipenbacher et al., 2002)
- CluSTr (http://www.ebi.ac.uk/clustr, Kriventseva et al., 2001)
- Picasso (http://www.ebi.ac.uk/picasso, Heger and Holm, 2001)
- TribeMCL (http://www.ebi.ac.uk/research/cgg/tribe, Enright et al., 2002)
- PIR SF, Panther (integrated: iProClass, MetaFam)
8 Protein Sequence Domain Classifications
DOMO, ADDA, EVEREST, InterPro, CDD, MetaFam, Pfam, Blocks, ProSite Profile, SBASE, TigrFam, eMotif, SMART, PRINTS, ProDom
9 Integration: Data Fusion
InterPro: 11,000 entries, based on UniProt DB
10 DIAMONDS platform associated with alternative resources of Classification
- Methods that are based on:
- A. Sequence (motif, proteins)
- B. Structure
- C. Function (annotation)
- D. Evolution
The Goal: New Annotation, New Family, Family connections (sub/super), Predictive power (given a new unknown sequence)
Focusing on Functional Map (i.e., Cell Cycle Related)
11 Challenges to be addressed: Globally and in Cell Cycle
- Many families are very easy to detect
- BLAST search can be used to detect many protein families
- A classical 80-20 situation: 80% of families can be identified with 20% of the effort
- But sensitivity is low; remote homologues are lost
- Validation tools are missing
- Solution (DIAMONDS-HUJI):
- High quality classifications and navigation tools
- Annotation-based integration
12 DIAMONDS platform (2) associated with alternative resources of Classification
- Sequence based (motif, proteins) (HUJI)
- ProtoNet (proteins)
- EVEREST (including external sources: Pfam, SCOP, CATH)
- Annotation based scheme (including structure, motif, phylogenetic, function) (HUJI)
- PANDORA
13 ProtoNet 4.5 (August 2006)
Includes over 1 million proteins, UniProt based
Combining methodologies for best performance (dev)
Condensed to only 27,000 most significant clusters
Only 3 are singletons
Built-in quality tools
ProtoNet annotations for families (ProtoName)
Visualization: unique features for experimentalists
AN OPTION to add to the system ANY new sequence that was experimentally discovered
14 For the experimentalist: Visualization tools
Pfam, Prosite, SMART, PRINTS, BLOCKS, ProDom.
15 and more
Automatically added: A NAME for a cluster (when coherent with its annotations)
16 Challenge 2: Quality of prediction
- Compare with a knowledge-based DB (i.e., InterPro)
- Take a supervised approach
- For each family, look for the best match in the clustering
- Analyze the correspondence between the cluster and the respective InterPro class (or any other expert view class)
How much of the functional knowledge is captured by any classification system?
- We define a matching score that allows performance comparison
- Measures the correspondence between an expert class K and a cluster C
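The slide does not give the formula for the matching score, so the following is only a minimal sketch under an assumption: the correspondence between an expert class K and a cluster C is taken to be the Jaccard-style ratio TP / (TP + FP + FN), and the best match for K is the cluster that maximizes it. The function names and the toy protein identifiers are hypothetical.

```python
def correspondence(expert_class, cluster):
    """Assumed score: |K & C| / |K | C|, i.e. TP / (TP + FP + FN)."""
    k, c = set(expert_class), set(cluster)
    union = k | c
    return len(k & c) / len(union) if union else 0.0


def best_match(expert_class, clusters):
    """Return (cluster_id, score) for the cluster that best matches the expert class."""
    scored = ((cid, correspondence(expert_class, members))
              for cid, members in clusters.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical toy data: one expert (e.g. InterPro-like) family and two clusters.
expert_family = {"P1", "P2", "P3", "P4"}
clusters = {
    "c1": {"P1", "P2", "P3", "P9"},  # 3 shared, 1 extra, 1 missed
    "c2": {"P4", "P5"},              # 1 shared
}

best_id, best_score = best_match(expert_family, clusters)
print(best_id, round(best_score, 2))
```

A score of 1.0 would mean the cluster reproduces the expert class exactly; both over-inclusive and fragmented clusters are penalized.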
17 InterPro
ProtoNet matching: 83.5% of InterPro
matching: 85% of ENZYME entries
18 Integration of Annotation
Challenge 3
- High quality annotations reaching the biologists
- PANDORA concept
- A web-based tool aimed at biological analysis of protein sets.
- Biological information is shown through intersection and inclusion.
- Goal: provide a biological roadmap of the gene or protein set.
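The idea of showing information through intersection and inclusion can be illustrated with plain set operations. This is a sketch only, not PANDORA's implementation: each annotation keyword is mapped to the subset of the input proteins that carry it, and every pair of keywords is labeled as equal, including, included in, or merely intersecting. The keywords and protein names are made up.

```python
from itertools import combinations


def relations(annotated):
    """annotated: dict keyword -> set of proteins.

    Returns a list of (a, b, relation, n_shared) for every pair of
    keywords that shares at least one protein.
    """
    out = []
    for a, b in combinations(sorted(annotated), 2):
        sa, sb = annotated[a], annotated[b]
        shared = sa & sb
        if not shared:
            continue  # disjoint keywords are not connected
        if sa == sb:
            rel = "equal"
        elif sa > sb:
            rel = "includes"      # every protein of b is also in a
        elif sa < sb:
            rel = "included_in"   # every protein of a is also in b
        else:
            rel = "intersects"
        out.append((a, b, rel, len(shared)))
    return out


# Hypothetical annotations over a small protein set.
annotated = {
    "cell cycle": {"A", "B", "C", "D"},
    "kinase": {"A", "B", "C"},
    "membrane": {"D", "E"},
}

for row in relations(annotated):
    print(row)
```

Inclusion relations give a hierarchy (here, every kinase in the set is also a cell cycle protein), while intersections mark partial overlaps worth inspecting.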
19 Annotations included
20 BASIC SET
InterPro
Number of proteins
Sensitivity = TP / (TP + FN); red: FN, white: TP
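As a small worked example of the sensitivity shown on this slide, TP / (TP + FN), the sketch below counts true positives as annotated proteins inside the selected set and false negatives as annotated proteins left outside it. The function name and the toy sets are assumptions for illustration.

```python
def sensitivity(selected, annotated):
    """TP / (TP + FN) for one annotation.

    selected:  proteins in the set under study
    annotated: all proteins carrying the annotation
    """
    tp = len(selected & annotated)   # annotated proteins that were captured
    fn = len(annotated - selected)   # annotated proteins that were missed
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical example: 3 of the 4 annotated proteins are in the set.
selected = {"P1", "P2", "P3", "P7"}
annotated = {"P1", "P2", "P3", "P4"}
value = sensitivity(selected, annotated)
print(value)
```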
21 Challenge 3: Analyzing a test case. A message to the biologists
1. Provide a set of genes / proteins originated by any Omics technology to PANDORA (without any pre-knowledge)
Output:
1. Functional maps
2. A ranked list of functions scored by their significance and a statistical P-value
3. Detecting mistakes and mis-annotations
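The slide does not say which statistic stands behind the ranked list. One common choice for scoring annotations of a protein set is a one-sided hypergeometric enrichment test, and the sketch below assumes it purely for illustration. The background, annotations, and query set are invented.

```python
from math import comb


def hypergeom_pvalue(N, K, n, k):
    """P(X >= k) when n proteins are drawn from N, of which K carry the annotation."""
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / total


def rank_functions(query, annotations, background):
    """Return (annotation, hits, p_value) rows, most significant first."""
    N, n = len(background), len(query)
    rows = []
    for name, members in annotations.items():
        k = len(query & members)
        rows.append((name, k, hypergeom_pvalue(N, len(members), n, k)))
    return sorted(rows, key=lambda row: row[2])


# Hypothetical background of 20 proteins with two annotations.
background = {f"P{i}" for i in range(1, 21)}
annotations = {
    "cell cycle": {"P1", "P2", "P3", "P4", "P5"},
    "transport": {f"P{i}" for i in range(6, 16)},
}
query = {"P1", "P2", "P3", "P4", "P6"}  # e.g. a list coming from an Omics experiment

ranked = rank_functions(query, annotations, background)
for name, hits, p in ranked:
    print(f"{name}: hits={hits}, p={p:.4f}")
```

With many annotations tested at once, a multiple-testing correction would normally be applied to the P-values; that step is left out here to keep the sketch short.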
22 Beyond the gene/protein set: The added value of the biologist's comments
Input: Add your comments and any quantitative properties
Examples:
- Gene expression levels
- Degradation time
- Viability and toxicity levels
- Tumorigenicity score
- Quality of your RNA in the experiment
A platform in which COMMENTS, EXPERIMENTAL values and Personal Unformulated Observations are incorporated and analyzed!
23 Example: From the biologist's notebook to PANDORA knowledge
Columns: Protein, quantitative, binary, multiple binary, comments
24 Graphical results
25 Current addition: Addressing the one-to-many problem
Many genes are alternatively spliced, resulting in one-to-many associations
Many proteins are multi-domain
26 The Modular Nature of Proteins
27 False Transitivity of Local Alignment
BLAST values: pairwise similarities better than 1e-40 E-score
If we cluster these proteins, assuming transitivity of local alignment scores, we will cluster K6A1_MOUSE with MPP3_HUMAN
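The false-transitivity effect can be reproduced with a naive single-linkage clustering over pairwise hits. In this sketch the two protein names come from the slide, but the intermediate multi-domain protein and all E-values are hypothetical; the only thing taken from the slide is the 1e-40 threshold. The intermediate protein shares one domain with each of the other two, so both of its hits pass the threshold even though the two end proteins have no significant direct similarity.

```python
def single_linkage(hits, threshold=1e-40):
    """Cluster proteins by merging every pair whose E-value passes the threshold."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, evalue in hits:
        ra, rb = find(a), find(b)          # registers both proteins
        if evalue <= threshold and ra != rb:
            parent[ra] = rb                # merge the two clusters

    clusters = {}
    for protein in list(parent):
        clusters.setdefault(find(protein), set()).add(protein)
    return list(clusters.values())


# Hypothetical E-values; MULTIDOMAIN_X is an invented intermediate protein.
hits = [
    ("K6A1_MOUSE", "MULTIDOMAIN_X", 1e-60),  # shared domain 1
    ("MULTIDOMAIN_X", "MPP3_HUMAN", 1e-55),  # shared domain 2
    ("K6A1_MOUSE", "MPP3_HUMAN", 5.0),       # no significant direct similarity
]

merged = single_linkage(hits)
direct_only = single_linkage([("K6A1_MOUSE", "MPP3_HUMAN", 5.0)])
print(merged)       # one cluster containing all three proteins
print(direct_only)  # two separate clusters
```

Working at the level of domains rather than whole proteins is one way to avoid this kind of chaining.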
28 On the Web
29 Evaluate any reference domain resources
30 Two that became one. Examples: EVEREST detecting new connections
Pfam (OLD): Taurine catabolism dioxygenase TauD, TfdA family
Pfam (NEW): a composed entry, TauD
31 Expanding DIAMONDS Platform
- We provide an automated framework for identification and classification of new protein domains. EVEREST covers almost 3 million proteins in UniProt and all of PDB
- Manual inspection of families scoring low w.r.t. Pfam suggested that many of those are valid families.
- Incorporating EVEREST families into the DIAMONDS platform
- Exhaustive coverage of all structural knowledge and domains in one resource.
32 Summary and conclusions
- Enhanced validation is essential
- Enhanced visualization / interactivity
- Biologist information being analyzed
- The limit of automatic methods in classification (sequence, domain, structure, function)
- ProtoNet-PANDORA and EVEREST as part of the Platform
- Thanks to DIAMONDS and to:
- PANDORA team: Noam Kaplan
- ProtoNet and EVEREST team: Elon Portugaly, Ori Sasson, Menachem Fromer
- Cell Cycle Analysis set: Roy Varshvsky
- Evaluation, DBA and Web: Michael Dvorkin, Alex Savanok