Title: Proteomics: Analyzing proteins space
1Proteomics Analyzing proteins space
2Protein families
- Why proteins?
- Shift of interest from Genomics to Proteomics
- Classification of proteins into groups/families: what is it good for?
- Explosion in biological sequence data => need to organize!
- Understanding relations/hierarchy of groups is interesting in itself, e.g. in evolutionary research.
- For applied research
- Annotation of new proteins: predicting their function, structure, cellular localization etc.
- Looking for new folds
3Sequence-based classification
- By sequence similarity (domains, motifs or complete proteins): Pfam, PROSITE, SMART, InterPro etc.
- InterPro: synthesizes the data from Pfam, PROSITE, PRINTS, ProDom, and SMART. Considered the best domain-based classification available
4Other kinds of classification
- Global classification
- SYSTERS, ProtoMap, CluSTr
- MetaFam: synthesizes global classification data
- By structure similarity: SCOP etc.
- By function: Albumin, RetNet, TumorGenes etc.
5http://www.protonet.cs.huji.ac.il
- A long-term project at HUJI led by Michal and Nati Linial.
- Provides automatic global classification of the known proteins.
- Performs hierarchical clustering on a sequence-based metric space of proteins.
- Allows placing an external protein into the hierarchy.
6Why clustering?
- We want to refine the similarity notion, compared to e.g. BLAST
- Exploit transitivity to improve grouping
- Can use a low threshold on similarity
- - uses vast information from low similarities
- - allowable because clustering filters noise
7Why hierarchical?
Vertical Perspective
Horizontal Perspective
8ProtoNet Pre-Computation
- All-against-all gapped BLAST using BLOSUM62
- SwissProt release 40.28 database (114,033 proteins)
- BLAST identified 2×10^7 relations between these proteins with relatively high sequence similarity: E-score of 100 or less
- Don't want to lose information => very permissive!
- But still less than 6.5×10^9 (all possible pairs) => infeasible
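The pair counts on this slide can be checked with a few lines of arithmetic. This is a minimal sketch: the database size comes from the slide, the relation count is the rounded figure quoted there, and the variable names are illustrative only.

```python
from math import comb

N_PROTEINS = 114_033        # SwissProt release 40.28 size quoted on the slide
N_RELATIONS = 2 * 10**7     # rounded number of BLAST relations quoted on the slide

# Unordered protein pairs, excluding self-pairs.
all_pairs = comb(N_PROTEINS, 2)

# Share of all pairs for which a BLAST relation was kept.
kept_fraction = N_RELATIONS / all_pairs

print(f"all pairs:     {all_pairs:.2e}")
print(f"kept fraction: {kept_fraction:.3%}")
```

Even with a very permissive E-score threshold, only a small fraction of all pairs carries a stored similarity, which is what makes the pre-computation tractable.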
9Clustering Method
- First, each protein is considered a singleton cluster
10Clustering Method
- Next, we iteratively merge pairs of clusters
- We choose to merge the most similar pair of clusters.
13Clustering Method
- As we progress, the number of singletons drops
14Clustering Method
- The clustering process gradually generates a tree of clusters
- Stop whenever we like
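The merging loop described on the Clustering Method slides can be sketched as follows. This is a naive illustration, not the ProtoNet implementation: the function name `agglomerate`, the `stop_at` parameter, the toy distances, and the use of the arithmetic mean as the cluster-to-cluster score are all assumptions made for the example.

```python
from itertools import combinations

def agglomerate(items, dist, stop_at=1):
    """Naive agglomerative clustering; returns the list of merge events.

    dist maps frozenset({a, b}) -> distance (lower = more similar).
    The score between two clusters is the arithmetic mean over all
    inter-cluster pairs (an assumption for this sketch).
    """
    def score(pair):
        a, b = pair
        vals = [dist[frozenset((x, y))] for x in a for y in b]
        return sum(vals) / len(vals)

    clusters = [frozenset([x]) for x in items]   # start from singletons
    merges = []
    while len(clusters) > stop_at:               # "stop whenever we like"
        a, b = min(combinations(clusters, 2), key=score)
        merges.append((a, b, score((a, b))))     # record one tree node
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges

# Toy distances between four hypothetical proteins.
raw = {("p1", "p2"): 1.0, ("p3", "p4"): 2.0, ("p1", "p3"): 8.0,
       ("p1", "p4"): 9.0, ("p2", "p3"): 7.0, ("p2", "p4"): 10.0}
dist = {frozenset(k): v for k, v in raw.items()}

merges = agglomerate(["p1", "p2", "p3", "p4"], dist)
for a, b, s in merges:
    print(sorted(a), "+", sorted(b), "at score", s)
```

The recorded merge events form the tree; cutting the loop early with a larger `stop_at` leaves a forest instead of a single root.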
15How to merge?
- The potential merging score is calculated for each pair of clusters relevant for merging at each level
- At the bottom equals
- Higher, designed to reflect the similarity of clusters.
- Depends on the inter-cluster similarities of pairs of proteins, each from a different cluster.
16Potential Merging Score of
- Arithmetic Mean ≥ Geometric Mean ≥ Harmonic Mean
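To see how the three averaging rules differ, the sketch below applies them to one set of inter-cluster E-scores. The E-score values are invented for illustration, and treating the merging score as a plain mean over all inter-cluster pairs is an assumption; the slide only names the three means.

```python
from statistics import fmean, geometric_mean, harmonic_mean

# Hypothetical E-scores for the protein pairs spanning two clusters
# (illustrative values only, not ProtoNet data). Lower = more similar.
escores = [1e-30, 1e-5, 10.0]

scores = {
    "arithmetic": fmean(escores),
    "geometric": geometric_mean(escores),
    "harmonic": harmonic_mean(escores),
}
for name, value in scores.items():
    print(f"{name:>10}: {value:.3g}")
```

For positive values the arithmetic mean is never smaller than the geometric mean, which is never smaller than the harmonic mean. With E-scores, the arithmetic mean is dominated by the weakest pair and the harmonic mean by the strongest pair, so the choice of mean changes which clusters look similar enough to merge.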
17Missing Data Treatment
- For a very low similarity pair (outside of the 2×10^7 relations), its length is defined as infinite
- Practically, the merging process should finish when the weight of the infinite lengths in the calculation of the score between new clusters is very large (losing signal)
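One way to picture the missing-data treatment is to look up every inter-cluster pair and substitute an infinite length when no relation was stored. The sketch below is an assumption-laden illustration: the helper names, the toy lengths, and the idea of monitoring the missing fraction as a stopping signal are not taken from the slides.

```python
import math

def inter_cluster_lengths(a, b, lengths, missing=math.inf):
    """Lengths for all pairs (x in a, y in b); absent pairs get `missing`."""
    return [lengths.get(frozenset((x, y)), missing) for x in a for y in b]

def missing_fraction(vals):
    """Share of inter-cluster pairs that had no stored relation."""
    return sum(math.isinf(v) for v in vals) / len(vals)

# Only one of the four inter-cluster pairs has a stored relation (toy data).
lengths = {frozenset(("p1", "p3")): 0.5}
vals = inter_cluster_lengths(["p1", "p2"], ["p3", "p4"], lengths)

print("missing fraction:", missing_fraction(vals))
print("arithmetic mean: ", sum(vals) / len(vals))
```

Near the top of the tree most inter-cluster pairs are missing, so the score carries little signal; a cutoff on the missing fraction (a hypothetical criterion here) is one way to decide when to stop merging.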
18Results: ProtoNet top 20
- Why clustering at all?
- We want to extend the range of similarity, compared to e.g. BLAST
- Exploit transitivity to improve grouping
- Can use a low threshold on similarity
- - uses vast information from low similarities
- - allowable because clustering filters noise
20 largest clusters in the ProtoNet (Arithmetic)
tree at a preselected level
19Problem of result assessment: what is a good cluster?
- Contains all proteins in the family, does not contain proteins not in the family
- But what is a family? Does any keyword define a family?
- Stable as the merging events occur (long life-time)?
20Problem of result assessment: what is a good tree?
- Should we trust the resulting forest?
- Which clustering technique is better? Combined?
- Bootstrap?
- Do the clusters correspond to meaningful families of proteins?
- Validation against InterPro, SCOP etc.
- Lack of will to automatically reconstruct them!!!
- What is the right level/cut to look at the forest?
21InterPro Validation
- InterPro annotation allows systematic validation of the generated clustering
- The geometric method exhibits high cluster purity
- Corresponds to a low false-positive (FP) rate
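Cluster purity against an external annotation can be computed along the following lines. This is a sketch under stated assumptions: purity is taken here as the fraction of annotated members that carry the cluster's most common label, which is one common definition and not necessarily the one used for ProtoNet; the protein names and the labels `fam_X` and `fam_Y` are placeholders, not real InterPro entries.

```python
from collections import Counter

def cluster_purity(members, annotation):
    """Return (best_label, purity) for one cluster.

    annotation maps protein -> set of family labels; unannotated
    proteins are ignored when computing purity.
    """
    annotated = [m for m in members if annotation.get(m)]
    if not annotated:
        return None, 0.0
    counts = Counter(label for m in annotated for label in annotation[m])
    label, hits = counts.most_common(1)[0]
    return label, hits / len(annotated)

# Placeholder annotation for a four-protein cluster.
annotation = {
    "a": {"fam_X"},
    "b": {"fam_X"},
    "c": {"fam_X", "fam_Y"},
    "d": {"fam_Y"},
}
label, purity = cluster_purity(["a", "b", "c", "d"], annotation)
print(label, purity)
```

Under this definition, members lacking the best label count as false positives, so high purity and a low false-positive rate are two views of the same number.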
22The Domain Problem
- Many proteins are composed of several domains
- The sequence similarity tools used are therefore local in nature
- The score for comparing two sequences is the edit distance between their most similar subsequences
- This creates a false similarity problem
23The Modular Nature of Proteins
24False Transitivity of Local Alignment
We ran BLAST using default parameters
All these pairwise similarities have better than 1e-40 E-score
If we cluster these proteins, assuming
transitivity of local alignment scores, we will
cluster K6A1_MOUSE with MPP3_HUMAN
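The effect of assuming transitivity can be illustrated by grouping proteins through chains of strong local hits. In this sketch only the two end names, K6A1_MOUSE and MPP3_HUMAN, come from the slide; the intermediate proteins and the hits between them are hypothetical stand-ins for the chain shown in the figure, and connected components (via union-find) are used as the simplest transitive grouping.

```python
def connected_groups(edges):
    """Group nodes linked by any chain of edges (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)           # union the two groups

    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return list(groups.values())

# Hypothetical chain of strong local hits; each hit may come from a
# different shared domain, so the two ends need not share anything.
hits = [
    ("K6A1_MOUSE", "hypothetical_A"),
    ("hypothetical_A", "hypothetical_B"),
    ("hypothetical_B", "MPP3_HUMAN"),
]
groups = connected_groups(hits)
print(groups)
```

Because each link in the chain can be supported by a different domain, the two end proteins land in one group even when no direct similarity between them exists.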
25Alternative methods
- Different types of clustering
- Non-binary
- Goal-oriented => semi-guided
- Graph theory insights
- Non-clustering ways of exploring the space of proteins
- Why BLAST E-score???
- Enrichment of the metric using structure