Title: ProClust
1ProClust
- Improved clustering of protein sequences with an
extended graph-based approach
Ying Jin, Jonathan Michael Nowacki Nov. 21, 2003
2What is in this presentation
- Papers
- SCOP: a Structural Classification of Proteins database - link
- Clustering protein sequences: structure prediction by transitive homology - link
- Improved clustering of protein sequences with an extended graph-based approach - link
3Part I
- SCOP: a Structural Classification of Proteins database
4The main idea
- A database that provides a detailed and
comprehensive description of all known protein
structures
5The Novelty
- The distinction between evolutionary relationships and those that arise from the physics and chemistry of proteins
- The classification of proteins in SCOP has been constructed by visual inspection and comparison of structures. This is believed to be better than purely automatic methods.
6The organizational Basics
- By three traits
- Family: near evolutionary relationships
- Based on one of two criteria that imply a common evolutionary origin: significant sequence similarity, or functional/structural similarity.
- Superfamily: far evolutionary relationships
- Low sequence identity, but the structures and, in many cases, functional features suggest that a common evolutionary origin is probable, e.g. the variable and constant domains of immunoglobulins.
- Fold: geometrical relationships
- Proteins share a fold if they have the same major secondary structures in the same arrangement and with the same topological connections.
- Other classes: domain, PDB, literature reference
7More on folds
- All-alpha: essentially all alpha
- All-beta: essentially all beta
- Alpha/beta: mix of alpha and beta
- Alpha+beta: helices and strands are segregated
- Multi-domain: no known homologues
8PDB at a Glance
- The PDB structure entries, consisting of a
collection of files having nondescript names,
cannot be easily grasped in a biochemically
meaningful context. Manually organizing the
structures based on the descriptive information
in the files is becoming less and less practical
as the database expands. A chemically or
biologically meaningful context can be provided
by the user in the form of a search keyword (e.g.
hemoglobin), but the range of available contexts
cannot be predetermined from the database
itself--users must know, in general, what they
are looking for. Although searching is an
extremely useful approach for locating specific
PDB entries, the scope of the database is best
ascertained by browsing a set of predetermined
contexts. Useful contexts include molecular
classes (e.g. "cytochrome"), secondary/tertiary
structural classes (e.g. "globin fold"),
functional classes (e.g. "binding protein"),
species of origin, and experimental determination
method. The descriptive information in the PDB
files is distributed between a set of fields
(e.g. "HEADER").
9Other advantages of PDB
- PDB entry viewer links PDB entries to various graphical views, external databases and SCOP itself.
- Links to
- images of structure
- Interactive molecular views
- Atomic co-ordinates
- Data on functional conformational changes
- Sequence data
- Homologues
- MEDLINE abstracts
10Access Methods
- Main url
- http://scop.mrc-lmb.cam.ac.uk/scop/index.html
- Numerous mirrors
- Europe
- East Coast USA
- Japan
- Israel
- Taiwan
- China
- Australia
11The Root Down Method
12Example Pic
13Chime
14Search Engine
153d Search
16In Conclusion
- SCOP is an easy way to access data and images.
- SCOP has a powerful general-purpose interface to the PDB
- Excellent overview of the diversity of protein structures, which can aid researchers and students alike.
17Part II
- Clustering protein sequences: structure prediction by transitive homology
18Main Idea of the Paper
- A graph-based clustering approach using transitivity, handling of multi-domain proteins, and cluster comparison algorithms.
- Determined all pair-wise similarities for the sequences in the SwissProt database using the Smith-Waterman local alignment algorithm
- Transformed the data into a directed graph
- Vertices: protein sequences
- Directed edges: from sequence A to B if
- score(A, B) / score(A, A) > T
- The clustering process uses transitivity
- SCOP was used as an evaluation data set
19Motivation
- Finding the three-dimensional structure of proteins is one of the fundamental problems in molecular biology.
- X-ray diffraction analysis cannot keep up with the ever-increasing speed at which proteins are sequenced.
- Desirable: a method to predict structure from the sequence data.
- The main idea: sequence similarity -> homology -> similar structure -> function
- (Note: same structure or function does not imply a common ancestor)
20Motivation (cont.)
- The relation of sequence similarity is obtained by pair-wise alignment.
- Rule of thumb: 30% identity over the aligned regions (threshold T)
- A widely accepted approach
- Score(A, B) > T implies structural similarity of sequences A and B
- This is a sufficient, but not a necessary condition
- Example
21- Histogram of pair-wise alignment scores for
all pairs from the same super-family in the SCOP1
data set
22- Detecting those distant homologues, bringing light into the so-called twilight zone of low similarity.
- What other criteria can be used to identify remote homologues?
23Graph-based Approach
- A graph-based clustering approach using the transitivity concept.
- Transitivity
- In mathematics: if A = B and B = C, then A = C
- In biology: given three sequences A, B and C, if A and B as well as B and C have a common ancestor, then A and C have a common ancestor
24Use of Transitivity
- The concept of transitivity can be used to detect remote homologues.
- However, it is not fully understood whether transitivity always holds and whether it can be extended ad infinitum.
- Multi-domain problem
25Multi-Domain Problem
If an undirected graph is used, the solid black edges provide a path from 1 to 4. In the directed case, the grey edges avoid this possible problem.
26Algorithm (1)
- Computing pair-wise similarities
- A complete undirected graph G
- Given an edge between sequences P and Q,
- the weight of the edge is raw(P, Q)
- raw(P, Q) is the raw Smith-Waterman local alignment score
- As mentioned above, this approach suffers from the multi-domain problem: unwanted bridges connecting clearly unrelated proteins
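A minimal sketch of how raw(P, Q) could be computed. It is not the presenters' implementation: the scoring here is deliberately simplified (fixed match/mismatch values and a linear gap penalty are assumptions, standing in for a substitution matrix and separate gap opening/extension penalties), and the function names are hypothetical.

```python
def smith_waterman_score(p, q, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score of strings p and q.

    Simplified scoring (assumption): fixed match/mismatch values and a
    linear gap penalty, not a substitution matrix with affine gaps.
    """
    prev = [0] * (len(q) + 1)  # previous row of the DP matrix
    best = 0
    for i in range(1, len(p) + 1):
        curr = [0] * (len(q) + 1)
        for j in range(1, len(q) + 1):
            sub = match if p[i - 1] == q[j - 1] else mismatch
            curr[j] = max(
                0,                  # start a new local alignment here
                prev[j - 1] + sub,  # align p[i-1] with q[j-1]
                prev[j] + gap,      # p[i-1] aligned to a gap
                curr[j - 1] + gap,  # q[j-1] aligned to a gap
            )
            best = max(best, curr[j])
        prev = curr
    return best


def raw_scores(seqs):
    """Map every ordered pair of names (P, Q) to raw(P, Q), self pairs included."""
    return {
        (a, b): smith_waterman_score(sa, sb)
        for a, sa in seqs.items()
        for b, sb in seqs.items()
    }
```

Under this toy scoring the self score of a sequence is exactly twice its length, and a short sequence embedded in a longer one scores as high against the long sequence as against itself. That is the situation in which bridges between unrelated proteins can arise.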
27Algorithm (2)
- Directing the edges
- Aim: to solve the multi-domain problem
- If multi-domain proteins cause a problem, there has to be a difference in length between the sequences.
- (Figure: undirected graph G is converted into directed graph Gd)
- Note: the raw self-similarity score raw(P, P) is approximately proportional to the length of P
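The effect of directing the edges can be sketched with a toy example. It assumes the edge weight from P to Q is raw(P, Q) / raw(P, P), following the score(A, B) / score(A, A) rule from the main-idea slide. The raw scores, the sequence names A, B and AB, and the threshold 0.5 are all hypothetical.

```python
def directed_weights(raw):
    """Turn symmetric raw scores into directed, length-normalised weights.

    raw maps (P, Q) -> raw score and must contain the self scores (P, P).
    The edge P -> Q gets the weight raw(P, Q) / raw(P, P).
    """
    return {
        (p, q): score / raw[(p, p)]
        for (p, q), score in raw.items()
        if p != q
    }


# Hypothetical scores: "AB" is a two-domain protein, "A" and "B" are
# unrelated single-domain proteins that each match one domain of "AB".
raw = {
    ("A", "A"): 100, ("B", "B"): 100, ("AB", "AB"): 200,
    ("A", "AB"): 90, ("AB", "A"): 90,
    ("B", "AB"): 90, ("AB", "B"): 90,
    ("A", "B"): 5, ("B", "A"): 5,
}
w = directed_weights(raw)
T = 0.5  # illustrative threshold, not a value from the slides
kept = sorted(edge for edge, weight in w.items() if weight >= T)
```

In this example only A -> AB and B -> AB remain. The reverse edges fall below the threshold because the longer protein has the larger self score, so AB no longer provides a two-way bridge between A and B.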
28Algorithm (3)
- Clustering in a threshold graph
- Remove all edges from Gd with w(P, Q) < T; the resulting graph is Gd(T)
- Using SCCs as clusters
- Definition 1 (SCC): in a directed graph G, a Strongly Connected Component (SCC) is a maximal set C of nodes of G such that for every pair of nodes p and q in C there is a directed path in G from p to q and one from q to p.
- Complexity: O(n + e), where n is the number of nodes and e the number of edges
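The thresholding and SCC steps can be sketched as follows. The slides do not say which SCC algorithm was used; this sketch swaps in Kosaraju's two-pass algorithm, written iteratively, which meets the stated O(n + e) bound. The function names and the example weights are hypothetical.

```python
from collections import defaultdict


def threshold_graph(weights, t):
    """Keep only the directed edges (p, q) whose weight is at least t."""
    return [(p, q) for (p, q), w in weights.items() if w >= t]


def strongly_connected_components(nodes, edges):
    """Kosaraju's algorithm (iterative); O(n + e) time."""
    fwd, rev = defaultdict(list), defaultdict(list)
    for p, q in edges:
        fwd[p].append(q)
        rev[q].append(p)

    # Pass 1: record nodes in order of DFS completion on the forward graph.
    order, seen = [], set()
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(fwd[start]))]
        while stack:
            node, successors = stack[-1]
            nxt = next((s for s in successors if s not in seen), None)
            if nxt is None:
                order.append(node)
                stack.pop()
            else:
                seen.add(nxt)
                stack.append((nxt, iter(fwd[nxt])))

    # Pass 2: sweep the reversed graph in decreasing completion order;
    # each sweep collects exactly one strongly connected component.
    components, assigned = [], set()
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, stack = [], [start]
        while stack:
            node = stack.pop()
            component.append(node)
            for pred in rev[node]:
                if pred not in assigned:
                    assigned.add(pred)
                    stack.append(pred)
        components.append(sorted(component))
    return components


weights = {
    ("a", "b"): 0.9, ("b", "a"): 0.8,
    ("b", "c"): 0.9, ("c", "b"): 0.2,
    ("c", "d"): 0.7, ("d", "c"): 0.7,
}
clusters = strongly_connected_components(
    ["a", "b", "c", "d", "e"], threshold_graph(weights, 0.5)
)
```

Here b -> c survives the threshold but c -> b does not, so {a, b} and {c, d} stay separate clusters and the unconnected node e forms a cluster of size 1.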
29An example of a SCC in SwissProt
- The grey nodes are not part of the SCC, but are clearly related.
- No edge is present between nodes P03480 and P03475; transitivity is applied.
- Threshold: 32
30Implementation and Evaluation
- The algorithm was implemented in C
- Own implementation of the Smith-Waterman local alignment algorithm for computing sequence similarity.
- Substitution matrix: BLOSUM80
- Gap opening penalty (gop): 90
- Gap extension penalty (gep): 9
31Data (1)
- SwissProt (SP): excluded all sequences with less than 40 amino acids (a.a.), resulting in a set of 86494 protein sequences
- The total running time for the pair-wise Smith-Waterman alignment was on the order of 14000 cpu-days
- The evaluation data: SCOP database
- Three levels are used: family, super-family and fold
32Data (2)
- SCOP1: set of 2692 sequences
- Contains all non-identical sequences from SCOP
- No sequences shorter than 40 a.a.
- No sequences from classes 8, 9, 10
- 65464 pairs of homologous sequences, i.e. pairs where both sequences are in the same super-family, and 3556622 pairs where the sequences are in distinct super-families.
- SCOP1+SP: 85961 sequences
- All sequences from SCOP1 and SwissProt
- SCOP2: 609 randomly chosen sequences from SCOP
- Including sequences shorter than 40 a.a.
- No sequences from classes 8, 9, 10.
33Performance measure
- Sensitivity: the proportion of homologue pairs that are identified
- Specificity: the proportion of correct pairs among the pairs predicted to be homologues (i.e. one minus the proportion of errors)
- Sens = spec = 1 is the most highly desired performance
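A minimal sketch of the two measures on pairs of sequences. It assumes sensitivity = TP / (TP + FN) and specificity = TP / (TP + FP), which is consistent with sens = spec = 1 being the ideal. The function names and the convention of returning 1.0 when a denominator is empty are assumptions.

```python
from itertools import combinations


def pairs_within(groups):
    """Return the set of unordered pairs whose members share a group."""
    return {
        frozenset(pair)
        for members in groups
        for pair in combinations(sorted(members), 2)
    }


def sensitivity_specificity(clusters, superfamilies):
    """Compare predicted clusters with reference super-families, pair by pair."""
    predicted = pairs_within(clusters)
    true = pairs_within(superfamilies)
    tp = len(predicted & true)  # predicted pairs that are real homologues
    sens = tp / len(true) if true else 1.0
    spec = tp / len(predicted) if predicted else 1.0
    return sens, spec


sens, spec = sensitivity_specificity(
    clusters=[["a", "b"], ["c", "d", "e"]],
    superfamilies=[["a", "b", "c"], ["d", "e"]],
)
```

In this toy case there are four true pairs and four predicted pairs, of which two coincide, so both measures come out at 0.5.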
34Discussion (1)
Threshold 32: sens 55.6%, spec 100%, TP due to intermediate linking 8%. Noise floor lifting off at threshold 23.
- Sensitivity, specificity and the percentage of
indirectly linked true positives versus
clustering threshold for the SCOP1 data set
35Discussion (2)
Threshold 32: sens 57.9%, spec 99.8%, indirect TP 11.6%. Absolute increase in sens 2.3 percentage points, relative increase in sens 4.1%, absolute increase in indirect TP 3.6 percentage points. The noise floor is higher.
- Sensitivity, specificity and the percentage of indirectly linked true positives versus clustering threshold for the SCOP1+SP data set
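The quoted increases follow from the threshold-32 figures on the two discussion slides (sens 55.6 and 57.9, indirect TP 8 and 11.6). A small sketch of the arithmetic, with a hypothetical helper name:

```python
def absolute_and_relative_increase(before, after):
    """Return (absolute increase, relative increase in percent)."""
    absolute = after - before
    return absolute, 100.0 * absolute / before


# Sensitivity at threshold 32: 55.6 on SCOP1 versus 57.9 with SwissProt added.
abs_sens, rel_sens = absolute_and_relative_increase(55.6, 57.9)
# Indirectly linked true positives: 8 versus 11.6.
abs_tp, _ = absolute_and_relative_increase(8.0, 11.6)

print(round(abs_sens, 1), round(rel_sens, 1), round(abs_tp, 1))  # prints: 2.3 4.1 3.6
```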
36Discussion (3)
- (Figure panels: SCOP1 and SCOP1+SP; reference: number of SCOP super-families)
- Total number of SCC clusters, and SCC clusters of sizes 1, 2-5, 6-10 etc. for varying thresholds from 25 to 50.
37Discussion (4)
Comparison with the algorithm by Arvestad, which employs only pair-wise sequence comparisons: their approach uses a more involved scoring method, optimized substitution matrices, and gap penalties to achieve a substantial improvement over straightforward pair-wise sequence comparisons. 24% better sensitivity at virtually equal specificity.
- Sensitivity versus specificity for the SCOP2
data set on the fold, super-family and family
level
38Links
- http://promoter.mi.uni-koeln.de/proclust/
39Part III
- Improved clustering of protein sequences with an
extended graph-based approach
40The goal
- To detect structural homology through sequence
similarity, by increasing sensitivity through
transitive homology heuristics.
41Some Alternatives
- Alternative approaches using the concept of transitivity for large-scale analysis of protein sequences
- Iterated BLAST or FASTA searches to compute clusters, which are subsequently merged and processed further. Does not explicitly deal with multi-domain problems.
- Protomap: graph-based approach that uses a combination of BLAST, FASTA and Smith-Waterman E-values to create a hierarchy of clusters. Has problems with multi-domain proteins, which cause cluster splitting.
- All-against-all BLAST search, ignoring all hits below a specified threshold, yielding a (0,1) similarity matrix. Extensive post-processing is required to symmetrize the matrix and to deal with multi-domain proteins.
- Clusters of orthologous groups (COGs), built starting with proteins from seven different species. Tries to compensate for multi-domain proteins with an iterative merging process.
42A new solution
- The extended graph-based approach is designed to provide clustering as an aid in finding remote homologues. The multi-domain problem is directly addressed, although it is not fully solved. Sensitivity is increased without a significant loss of specificity.
43Different Symmetries
- Symmetric similarity
- Does not distinguish between two proteins being globally similar and one protein being similar to an individual domain of a multi-domain protein. Can lead to incorrect links.
- Asymmetric similarity
- Can be employed to distinguish between global and non-global similarity.
44Limiting factors
- Large random similarities can cause super-clusters, which will connect large parts of the sequence space. This can be countered by using more stringent criteria.
- Multi-domain proteins
- Domains are the compact semi-independent structural units of proteins, which often appear highly conserved in a number of multi-domain proteins.
45An example run
- The dataset
- SCOP v1.53
- All sequences with less than 40 amino acids were removed
- Filtered for low complexity regions using seg with the parameters 12, 1.8, 2.0 x
- Sequences containing masked amino acids as well as duplicate sequences were removed.
- SPROT
- Release 39
- Processed analogously to SCOP
46The Filtering by Significance
- Extreme value distribution
- The maximum scores of a large number of alignments between random sequences of equal length tend to have an extreme value distribution. This is used to estimate the maximal scores observable with the Smith-Waterman algorithm for random sequences of given lengths.
- Pruning consists of removing an edge (P,Q) from the graph if the significance of the score w(P,Q) is below the chosen significance threshold.
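A rough sketch of significance-based pruning, under several assumptions not stated on the slide: random-sequence scores are modelled by a Gumbel (type I extreme value) distribution, its parameters are estimated by the method of moments from a sample of scores, and one fit is used for all edges, ignoring the dependence on sequence length mentioned above. All function names and numbers are hypothetical.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329


def fit_gumbel(scores):
    """Method-of-moments estimate of the Gumbel location mu and scale beta."""
    beta = statistics.stdev(scores) * math.sqrt(6) / math.pi
    mu = statistics.mean(scores) - EULER_GAMMA * beta
    return mu, beta


def p_value(score, mu, beta):
    """P(S >= score) when the random score S follows Gumbel(mu, beta)."""
    # 1 - exp(-y) computed as -expm1(-y) to stay accurate for tiny y
    return -math.expm1(-math.exp(-(score - mu) / beta))


def prune(edges, mu, beta, alpha):
    """Keep only edges whose score is significant at level alpha."""
    return {e: s for e, s in edges.items() if p_value(s, mu, beta) <= alpha}


# Hypothetical scores of alignments between random sequences of one length
random_scores = [20, 22, 25, 21, 30, 24, 23, 27]
mu, beta = fit_gumbel(random_scores)
kept = prune({("P", "Q"): 100, ("P", "R"): 25}, mu, beta, alpha=0.01)
```

With these numbers the edge (P, Q) is kept because a score of 100 is far in the tail of the fitted distribution, while (P, R) is removed because a score of 25 is typical for random sequences.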
47The Algorithm
- Compute a complete undirected graph
- Replace each undirected edge with two directed edges
- Proceed to the threshold graph by removing all edges of weight less than the threshold
- Compute all strongly connected components (SCCs)
48Post Processing Merging Clusters
- Clusters with at least 20 sequences were selected
- A multiple alignment was built for each set of sequences with ClustalW.
- Profiles were built and calibrated with the HMMER package using default parameters.
- For each such cluster profile, all sequences not contained in the cluster were scored using the profile and the E-value was recorded.
- If the profile of one cluster resulted in an E-value below the threshold against another cluster, those clusters were merged.
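The final merging step can be sketched with a union-find structure. This assumes the profile searches have already been run and reduced to one best E-value per (profile cluster, other cluster) pair. The data layout, function name and threshold are hypothetical, and the toy data ignores the 20-sequence minimum.

```python
def merge_clusters(clusters, evalues, threshold):
    """Merge clusters linked by a profile hit with E-value below threshold.

    clusters: dict of cluster id -> set of sequence names
    evalues:  dict of (profile cluster id, other cluster id) -> best E-value
    """
    parent = {c: c for c in clusters}

    def find(c):
        # Follow parent links to the representative, halving the path.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for (a, b), e in evalues.items():
        if e < threshold:
            parent[find(a)] = find(b)

    merged = {}
    for c, members in clusters.items():
        merged.setdefault(find(c), set()).update(members)
    return sorted(sorted(m) for m in merged.values())


result = merge_clusters(
    clusters={1: {"p1", "p2"}, 2: {"p3"}, 3: {"p4"}},
    evalues={(1, 2): 1e-10, (1, 3): 5.0, (2, 3): 0.5},
    threshold=1e-3,
)
```

Merging is transitive in this sketch: if cluster 1 hits cluster 2 and cluster 2 hits cluster 3, all three end up together, even without a direct hit between 1 and 3.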
49Complexity
- Using C software on a Compaq ES40 running Tru64 Unix V5.1
- Smith-Waterman computations needed 70 CPU days.
- Clustering needed 30 seconds.
- Cluster merging using HMMs needed 21 CPU days.
50Psi-Blast Flexibility
51Multi-Domain Problem
52Path Length FP. vs. TP.
53Abundance of multi-domain proteins
54Multi-domain Problem
Present at 13.1 threshold but disappears if
threshold is raised above 15.4 since d1m1da1
vanishes
55Larger Multi-Domain Problem
Threshold of 21.3 and no edges between P12715
and P33497
56More Laddering of Proteins
- False positives caused by just the right increase
in length of proteins. None of the edges are
removed when going over to threshold graph.
57Extended Graph vs. PSI-BLAST
58Conclusion
- Sensitivity of 63.5% at 99.0% specificity
- Improvement of 34% upon PSI-BLAST's performance of 47.5% sensitivity at 99.0% specificity.
- Performance is gained at the expense of a much larger computational effort.
- Performance can be further improved by taking the length and position of conserved regions into account.