Proteomics: Analyzing proteins space - PowerPoint PPT Presentation

Slides: 26
Provided by: 6801

1
Proteomics: Analyzing proteins space
2
Protein families
  • Why proteins?
  • Shift of interest from genomics to proteomics
  • Classification of proteins into groups/families:
    what is it good for?
  • Explosion in biological sequence data => need to
    organize!
  • Understanding the relations/hierarchy of groups is
    interesting in itself, e.g. in evolutionary research.
  • For applied research:
  • Annotation of new proteins: predicting their
    function, structure, cellular localization etc.
  • Looking for new folds

3
Sequence-based classification
  • By sequence similarity (domains, motifs or
    complete proteins): Pfam, PROSITE, SMART,
    InterPro etc.
  • InterPro: synthesizes the data from Pfam,
    PROSITE, Prints, ProDom, and SMART. Considered the
    best domain-based classification available

4
Other kinds of classification
  • Global classification: Systers, Protomap, CLUSTr
  • MetaFam: synthesizes global classification data
  • By structural similarity: SCOP etc.
  • By function: Albumin, RetNet, TumorGenes etc.

5
http://www.protonet.cs.huji.ac.il
  • A long-term project at HUJI led by Michal & Nati
    Linial.
  • Provides automatic global classification of the
    known proteins.
  • Performs hierarchical clustering on a
    sequence-based metric space of proteins.
  • Allows placing an external protein into the
    hierarchy.

6
Why clustering?
  • We want to refine the similarity notion,
    compared to e.g. BLAST
  • Exploit transitivity to improve grouping
  • Can use a low threshold on similarity
  • - uses vast information from low similarities
  • - allowable because clustering filters noise

7
Why hierarchical?
Vertical Perspective
Horizontal Perspective
8
ProtoNet Pre-Computation
  • All-against-all gapped BLAST using BLOSUM62
  • SwissProt release 40.28 database (114,033
    proteins)
  • BLAST identified 2·10^7 relations between these
    proteins with relatively high sequence
    similarity: E-score of 100 or less
  • Don't want to lose information => very
    permissive!
  • But still far fewer than all 6.5·10^9 possible
    pairs, which would be infeasible
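
The scale argument on this slide can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the slide's figures read as 2·10^7 BLAST relations and roughly 6.5·10^9 possible pairs; the variable names are illustrative only.

```python
from math import comb

n_proteins = 114_033                 # SwissProt release 40.28, as quoted on the slide
all_pairs = comb(n_proteins, 2)      # unordered pairs of distinct proteins
blast_relations = 2 * 10**7          # assumed reading of the slide's relation count

# Fraction of all possible pairs that survive the permissive E-score cutoff
fraction_kept = blast_relations / all_pairs

print(f"possible pairs : {all_pairs:.2e}")
print(f"relations kept : {blast_relations:.2e}")
print(f"fraction kept  : {fraction_kept:.2%}")
```

Even with a very permissive cutoff, well under one percent of the possible pairs are kept, which is what makes the pre-computation tractable.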

9
Clustering Method
  • First, each protein is considered a singleton
    cluster

10
Clustering Method
  • Next, we iteratively merge pairs of clusters
  • At each step we choose to merge the most similar
    pair of clusters.

13
Clustering Method
  • As we progress, the number of singletons drops

14
Clustering Method
  • The clustering process gradually generates a tree
    of clusters
  • Stop whenever we like
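
The merging procedure described on these slides can be sketched as a naive agglomerative loop. This is not ProtoNet's implementation: the function name `agglomerate`, the toy distance table, and the use of a plain arithmetic mean as the merging score are all assumptions made for illustration.

```python
from itertools import combinations

def agglomerate(items, dist, stop_at=1):
    """Repeatedly merge the most similar pair of clusters.

    items   -- iterable of protein identifiers
    dist    -- dict mapping frozenset({p, q}) to a distance (lower = more similar)
    stop_at -- number of clusters at which to stop ("stop whenever we like")
    Returns the remaining clusters and the list of merge events (the tree).
    """
    clusters = [frozenset([x]) for x in items]   # start from singletons
    merges = []

    def score(a, b):
        # Arithmetic mean of all inter-cluster pair distances (an assumption)
        vals = [dist[frozenset((p, q))] for p in a for q in b]
        return sum(vals) / len(vals)

    while len(clusters) > stop_at:
        a, b = min(combinations(clusters, 2), key=lambda pair: score(*pair))
        merges.append((a, b, score(a, b)))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters, merges

# Hypothetical distances between four toy proteins
toy = {("A", "B"): 1, ("C", "D"): 2, ("A", "C"): 8,
       ("A", "D"): 9, ("B", "C"): 7, ("B", "D"): 10}
dist = {frozenset(k): v for k, v in toy.items()}

clusters, merges = agglomerate("ABCD", dist)
for a, b, s in merges:
    print(sorted(a), "+", sorted(b), "at score", s)
```

Each entry in `merges` is one internal node of the tree; passing `stop_at=2` cuts the tree one level earlier. The loop rescans all cluster pairs at every step, which is fine for a toy example but would not scale to the full protein set.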

15
How to merge?
  • The potential merging score is calculated for
    each pair of clusters relevant for merging at
    each level
  • At the bottom equals
  • Higher up, it is designed to reflect the
    similarity of clusters.
  • It depends on the inter-cluster similarities of
    pairs of proteins, one from each cluster.

16
Potential Merging Score of
  • Arithmetic Mean ≥ Geometric Mean ≥ Harmonic Mean
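
The three averaging rules can be compared on the same set of inter-cluster scores. This is a minimal sketch using Python's standard `statistics` module; the function name `merging_scores` and the sample values are hypothetical, and the slide's exact score definition is not preserved in the transcript.

```python
from statistics import mean, geometric_mean, harmonic_mean

def merging_scores(pair_scores):
    """Return the three candidate merging scores for one pair of clusters.

    pair_scores -- positive scores, one per protein pair across the two clusters
    """
    return {
        "arithmetic": mean(pair_scores),
        "geometric": geometric_mean(pair_scores),
        "harmonic": harmonic_mean(pair_scores),
    }

# Hypothetical inter-cluster pair scores
scores = merging_scores([1.0, 4.0, 16.0])
for name, value in scores.items():
    print(f"{name:10s} {value:.4f}")
```

For positive inputs the arithmetic mean is never smaller than the geometric mean, which is never smaller than the harmonic mean, so the harmonic rule is pulled most strongly toward the smallest values in the set.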

17
Missing Data Treatment
  • For a very low-similarity pair (outside of the
    2·10^7 relations), its length is defined as
  • Practically, the merging process should finish
    when the weight of the infinite lengths in the
    calculation of the score between new clusters is
    very large (losing signal)
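
One way to watch for the loss of signal described above is to measure how many of the protein pairs between two clusters have no recorded relation. This is a sketch under stated assumptions: the function name `missing_fraction`, the threshold `MAX_MISSING`, and the toy data are hypothetical, and the slide does not say which criterion ProtoNet actually uses.

```python
def missing_fraction(cluster_a, cluster_b, relations):
    """Fraction of inter-cluster protein pairs with no recorded relation.

    relations -- dict keyed by frozenset({p, q}) for pairs that BLAST reported
    """
    pairs = [(p, q) for p in cluster_a for q in cluster_b]
    missing = sum(1 for p, q in pairs if frozenset((p, q)) not in relations)
    return missing / len(pairs)

# Hypothetical stopping threshold (not taken from the slides)
MAX_MISSING = 0.9

# Toy data: only two of the four cross-cluster pairs have a recorded relation
relations = {frozenset(("P1", "P3")): 1e-5, frozenset(("P2", "P3")): 1e-3}
frac = missing_fraction({"P1", "P2"}, {"P3", "P4"}, relations)
should_stop = frac > MAX_MISSING
print(frac, should_stop)
```

When most cross-cluster pairs are missing, the merging score is dominated by the default value assigned to missing pairs rather than by measured similarity.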

18
Results: ProtoNet top 20
  • Why clustering at all?
  • We want to extend the range of similarity,
    compared to e.g. BLAST
  • Exploit transitivity to improve grouping
  • Can use a low threshold on similarity
  • - uses vast information from low similarities
  • - allowable because clustering filters noise

20 largest clusters in the ProtoNet (Arithmetic)
tree at a preselected level
19
Problem of result assessment: what is a good
cluster?
  • Contains all proteins in the family, does not
    contain proteins not in the family
  • But what is a family? Does any keyword define a
    family?
  • Stable as the merging events occur (long
    life-time)?

20
Problem of result assessment: what is a good
tree?
  • Should we trust the resulting forest?
  • Which clustering technique is better? Combined?
  • Bootstrap?
  • Do the clusters correspond to meaningful families
    of proteins?
  • Validation against InterPro, SCOP etc.
  • Lack of will to automatically reconstruct them!!!
  • What is the right level/cut to look at the
    forest?

21
InterPro Validation
  • InterPro annotation allows systematic validation
    of the generated clustering
  • The geometric method exhibits high cluster
    purity
  • Corresponds to a low false-positive (FP) rate
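
Cluster purity against an external annotation can be computed as the share of annotated members that carry the cluster's most common label. This is a sketch only: the function name `cluster_purity` and the family labels are hypothetical, and the slides do not give the exact purity definition used in the validation.

```python
from collections import Counter

def cluster_purity(members, annotation):
    """Return (dominant_label, purity) for one cluster.

    members    -- iterable of protein identifiers in the cluster
    annotation -- dict mapping protein identifier to a family label
    Proteins without an annotation are ignored; returns None if none is annotated.
    """
    labels = [annotation[m] for m in members if m in annotation]
    if not labels:
        return None
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)

# Hypothetical annotation: three members share a family, one does not
annotation = {"p1": "family_A", "p2": "family_A",
              "p3": "family_A", "p4": "family_B"}
label, purity = cluster_purity(["p1", "p2", "p3", "p4", "p5"], annotation)
print(label, purity)
```

Members outside the dominant label count as false positives for that cluster, so a purity near 1 corresponds to few false positives.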

22
The Domain Problem
  • Many proteins are composed of several domains
  • The sequence similarity tools used are therefore
    local in nature
  • The score of comparing two sequences is the edit
    distance between their most similar subsequences
  • This creates a false-similarity problem

23
The Modular Nature of Proteins
24
False Transitivity of Local Alignment
We ran BLAST using default parameters.
All these pairwise similarities have an E-score
better than 1e-40.
If we cluster these proteins, assuming
transitivity of local alignment scores, we will
cluster K6A1_MOUSE with MPP3_HUMAN.
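
The effect can be reproduced on toy data by treating any shared domain as a significant local hit and then grouping proteins transitively. This is an illustrative sketch: the protein names and domain compositions below are hypothetical and are not the proteins from the slide, and the function name `connected_groups` is an assumption.

```python
from itertools import combinations

def connected_groups(domains):
    """Group proteins transitively: link any two that share a domain.

    domains -- dict mapping protein identifier to a set of domain names
    Uses a small union-find structure to collect connected components.
    """
    parent = {p: p for p in domains}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]   # path halving
            p = parent[p]
        return p

    for a, b in combinations(domains, 2):
        if domains[a] & domains[b]:         # shared domain = "significant hit"
            parent[find(a)] = find(b)

    groups = {}
    for p in domains:
        groups.setdefault(find(p), set()).add(p)
    return list(groups.values())

# Hypothetical multi-domain proteins forming a chain of shared domains
domains = {
    "P1": {"kinase"},
    "P2": {"kinase", "SH3"},
    "P3": {"SH3", "PDZ"},
    "P4": {"PDZ"},
}
groups = connected_groups(domains)
print(groups)
print("P1 and P4 share:", domains["P1"] & domains["P4"])
```

P1 and P4 end up in the same group even though they have no domain in common, which is the false transitivity that multi-domain proteins introduce.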
25
Alternative methods
  • Different types of clustering
  • Non-binary
  • Goal-oriented => semi-guided
  • Graph theory insights
  • Non-clustering ways of exploring the space of
    proteins
  • Why BLAST E-score???
  • Enrichment of the metric using structure