ProClust:

1
ProClust

Improved clustering of protein sequences with an
extended graph-based approach

Ying Jin, Jonathan Michael Nowacki Nov. 21, 2003
2
What is in this presentation

Papers
SCOP: a Structural Classification of Proteins
database
Clustering protein sequences: structure
prediction by transitive homology
Improved clustering of protein sequences with an
extended graph-based approach

3
Part I

SCOP: a Structural Classification of Proteins
database

4
The main idea

A database that provides a detailed and
comprehensive description of all known protein
structures

5
The Novelty

The distinction between evolutionary
relationships and those that arise from the
physics and chemistry of proteins.
The classification of proteins in SCOP has been
constructed by visual inspection and comparison
of structures. This is believed to be better than
purely automatic methods.

6
The Organizational Basics

By three traits:
Family: near evolutionary relationships
Based on one of two criteria that imply a
common evolutionary origin: significant sequence
similarity, or functional/structural similarity.
Super-family: far evolutionary relationships
Proteins with low sequence identity, but whose
structures and, in many cases, functional features
suggest that a common evolutionary origin is
probable, e.g. variable and constant domains of
immunoglobulins.
Fold: geometrical relationships
Proteins share a fold if they have the same major
secondary structures in the same arrangement and
with the same topological connections.
Others: classes, domain, PDB, literature reference

7
More on folds

All-alpha: essentially all alpha
All-beta: essentially all beta
Alpha/beta: mix of alpha and beta
Alpha+beta: helices and strands are segregated
Multi-domain: no known homologues

8
PDB at a Glance

The PDB structure entries, consisting of a
collection of files having nondescript names,
cannot be easily grasped in a biochemically
meaningful context. Manually organizing the
structures based on the descriptive information
in the files is becoming less and less practical
as the database expands. A chemically or
biologically meaningful context can be provided
by the user in the form of a search keyword (e.g.
hemoglobin), but the range of available contexts
cannot be predetermined from the database
itself--users must know, in general, what they
are looking for. Although searching is an
extremely useful approach for locating specific
PDB entries, the scope of the database is best
ascertained by browsing a set of predetermined
contexts. Useful contexts include molecular
classes (e.g. "cytochrome"), secondary/tertiary
structural classes (e.g. "globin fold"),
functional classes (e.g. "binding protein"),
species of origin, and experimental determination
method. The descriptive information in the PDB
files is distributed between a set of fields
(e.g. "HEADER").

9
Other advantages of PDB

The PDB entry viewer links PDB entries to various
graphical views, external databases and SCOP
itself.
Links to
images of structure
Interactive molecular views
Atomic co-ordinates
Data on functional conformational changes
Sequence data
Homologues
MEDLINE abstracts

10
Access Methods

Main URL:
http://scop.mrc-lmb.cam.ac.uk/scop/index.html
Numerous mirrors:
Europe
East Coast USA
Japan
Israel
Taiwan
China
Australia

11
The Root Down Method
12
Example Pic
13
Chime
14
Search Engine
15
3d Search
16
In Conclusion

SCOP is an easy way to access data and images.
SCOP has a powerful general-purpose interface to
the PDB.
It gives an excellent overview of the diversity of
protein structures, which can aid researchers and
students alike.

17
Part II

Clustering protein sequences: structure
prediction by transitive homology

18
Main Idea of the Paper

A graph-based clustering approach using
transitivity, handling multi-domain proteins, and
cluster comparison algorithms.
- Determined all pair-wise similarities for
the sequences in the SwissProt database using the
Smith-Waterman local alignment algorithm
- Transformed the data into a directed graph
vertices: protein sequences
directed edges: from sequence A to B if
score(A, B) / score(A, A) > T
- The clustering process uses transitivity
SCOP was used as an evaluation data set
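The edge rule above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function name build_directed_graph, the protein names and all score values are invented for the example, and the threshold is written as a fraction rather than a percentage.

```python
def build_directed_graph(scores, threshold):
    """Return {node: set of successors}.

    An edge a -> b is added when score(a, b) / score(a, a) > threshold.
    `scores` maps (a, b) pairs to raw alignment scores and must contain
    the self-score (a, a) for every sequence a.
    """
    nodes = {a for a, _ in scores}
    graph = {node: set() for node in nodes}
    for (a, b), raw in scores.items():
        if a != b and raw / scores[(a, a)] > threshold:
            graph[a].add(b)
    return graph


# Toy scores (invented): "multi" is a long two-domain protein,
# "dom" is a short protein matching only one of its domains.
scores = {
    ("multi", "multi"): 1000,
    ("dom", "dom"): 400,
    ("multi", "dom"): 380,
    ("dom", "multi"): 380,
}
graph = build_directed_graph(scores, threshold=0.5)
print(graph)
```

Because each score is normalised by the self-score of the source sequence, the short protein points at the long one, but not the other way round. That asymmetry is what the directed graph relies on.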

19
Motivation

Finding the three-dimensional structure of
proteins is one of the fundamental problems in
molecular biology.
X-ray diffraction analysis cannot keep up with the
ever-increasing speed at which proteins are
sequenced.
Desirable method: predict structure from the
sequence data. The main idea:
sequence similarity => homology
=> similar structure
=> function
(Note: same structure or function does not imply
a common ancestor)

20
Motivation (cont.)

The relation of sequence similarity is obtained by
pair-wise alignment.
Rule of thumb: 30% identity over aligned
regions (T)
A widely accepted approach:
Score(A, B) > T implies structural similarity of
sequences A and B
This is a sufficient, but not a necessary
condition
Example

Histogram of pair-wise alignment scores for
all pairs from the same super-family in the SCOP1
data set

Detecting those distant homologues, bringing
light into the so-called twilight zone of low
similarity.
What other criteria can be used to identify
remote homologues?

23
Graph-based Approach

A graph-based clustering approach using the
transitivity concept.
Transitivity
In mathematics: if A = B and B = C, then A = C
In biology: for three given sequences A, B and C,
if A and B as well as B and C have a common
ancestor, then A and C have a common ancestor

24
Use of Transitivity

The concept of transitivity can be used to detect
remote homologues.

However:
- It is not fully understood whether transitivity
always holds and whether it can be extended ad
infinitum.
- Multi-domain problem
25
Multi-Domain Problem
If an undirected graph is used, the solid black
edges provide a path from 1 to 4. In the directed
case, the grey edges avoid this possible problem.
26
Algorithm (1)

Computing pair-wise similarities
A complete undirected graph G
Given an edge between sequences P and Q,
the weight of the edge is raw(P, Q)
raw(P, Q) is the raw Smith-Waterman local
alignment score
As mentioned above, this approach suffers from the
multi-domain problem: unwanted bridges
connecting clearly unrelated proteins

27
Algorithm (2)

Directing the edges
Aim: to solve the multi-domain problem
If multi-domain proteins cause a problem, there
has to be a difference in length between the
sequences.

[Figure: the undirected graph G and the directed graph Gd]
Note: the raw self-similarity score raw(P, P) is
approximately proportional to the length of P
28
Algorithm (3)

Clustering in a threshold graph
Remove all edges from Gd with w(P, Q) < T,
resulting in the threshold graph Gd(T)
Using SCCs as clusters
Definition 1 of SCC: in a directed graph G, a
Strongly Connected Component (SCC) is a maximal
set C of nodes of G such that for every pair of
nodes p and q in C there is a directed path in
G from p to q and one from q to p.
Complexity: O(n + e), where n is the number of
nodes and e the number of edges
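A minimal sketch of computing SCCs in linear time is given below. It uses Kosaraju's two-pass algorithm, which is one standard O(n + e) method. The slides do not say which SCC algorithm the authors used, so this choice is an assumption, and the toy graph is invented. The code assumes every successor also appears as a key of the graph dictionary.

```python
def strongly_connected_components(graph):
    """Kosaraju's algorithm. `graph` maps node -> set of successors."""
    # Pass 1: iterative depth-first search, recording finish order.
    order, seen = [], set()
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(graph[start]))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    break
            else:
                # All successors explored: the node is finished.
                order.append(node)
                stack.pop()

    # Build the reversed graph.
    reverse = {node: set() for node in graph}
    for node, successors in graph.items():
        for nxt in successors:
            reverse[nxt].add(node)

    # Pass 2: search the reversed graph in decreasing finish order.
    components, assigned = [], set()
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            component.add(node)
            for prev in reverse[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    stack.append(prev)
        components.append(component)
    return components


# Toy threshold graph (invented): A, B, C form a cycle, D only points in.
example = {"A": {"B"}, "B": {"C"}, "C": {"A"}, "D": {"A"}}
clusters = strongly_connected_components(example)
print(clusters)
```

In the toy graph, D has an edge to A but no path leads back to D, so D stays in its own cluster. This matches the grey nodes in the SwissProt example, which are related to an SCC without being part of it.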

29
An example of a SCC in SwissProt

The grey nodes are not part of the SCC, but are
clearly related.
There is no edge between nodes P03480 and P03475;
they are linked by transitivity.
Threshold: 32

30
Implementation and Evaluation

The algorithm was implemented in C
Own implementation of the Smith-Waterman local
alignment algorithm for computing sequence
similarity.
Substitution matrix: BLOSUM80
Gap opening penalty (gop): 90
Gap extension penalty (gep): 9

31
Data (1)

SwissProt (SP): all sequences with fewer
than 40 amino acids (a.a.) were excluded,
resulting in a set of 86494 protein sequences
The total running time for the pair-wise
Smith-Waterman alignment was on the order of
14000 cpu-days
Evaluation data: the SCOP database
Three levels are used: family, super-family and
fold

32
Data (2)

SCOP1: set of 2692 sequences
Contains all non-identical sequences from SCOP
No sequences shorter than 40 a.a.
No sequences from classes 8, 9, 10
65464 pairs of homologous sequences, i.e. pairs
where both sequences are in the same super-family,
and 3556622 pairs where the sequences are in
distinct super-families.
SCOP1 SP: 85961 sequences
All sequences are from SCOP1 and SwissProt
SCOP2: 609 randomly chosen sequences from SCOP
Including sequences shorter than 40 a.a.
No sequences from classes 8, 9, 10.
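As a quick consistency check on the SCOP1 figures above: the homologous and non-homologous pair counts should add up to the number of unordered pairs among 2692 sequences. The variable names are chosen for this example only.

```python
import math

scop1_sequences = 2692
same_superfamily_pairs = 65464
distinct_superfamily_pairs = 3556622

# Number of unordered pairs that can be formed from the SCOP1 sequences.
total_pairs = math.comb(scop1_sequences, 2)
print(total_pairs)
print(total_pairs == same_superfamily_pairs + distinct_superfamily_pairs)

# Share of all pairs that are homologous at the super-family level.
homologue_fraction = same_superfamily_pairs / total_pairs
print(f"{homologue_fraction:.2%}")
```

The two counts do sum to the total, and fewer than 2% of all pairs are homologous. This is why even a small error rate on the non-homologous pairs can swamp the true positives.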

33
Performance measure

Sensitivity: the proportion of true homologue
pairs that are identified

Specificity: reflects the proportion of errors
among the pairs predicted to be homologues (fewer
errors give a higher specificity)

Sens = spec = 1 is the most desirable
performance
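The two measures can be sketched on pairs of sequences as follows. The formulas on the original slide did not survive, so the definitions used here are assumptions: sensitivity = TP / (TP + FN), and specificity = TP / (TP + FP), i.e. the fraction of predicted pairs that are correct. The helper names and the toy data are invented.

```python
from itertools import combinations


def pair_set(clusters):
    """All unordered pairs of sequences that share a cluster."""
    return {frozenset(p) for c in clusters for p in combinations(sorted(c), 2)}


def sensitivity_specificity(predicted_clusters, superfamily):
    """`superfamily` maps each sequence to its reference super-family label."""
    predicted = pair_set(predicted_clusters)
    true_pairs = {
        frozenset((a, b))
        for a, b in combinations(sorted(superfamily), 2)
        if superfamily[a] == superfamily[b]
    }
    tp = len(predicted & true_pairs)
    sens = tp / len(true_pairs) if true_pairs else 0.0
    # Assumed definition: share of predicted pairs that are true homologues.
    spec = tp / len(predicted) if predicted else 1.0
    return sens, spec


# Toy reference classification and clustering (invented).
superfamily = {"p1": "sf1", "p2": "sf1", "p3": "sf1", "p4": "sf2", "p5": "sf2"}
clusters = [{"p1", "p2"}, {"p3", "p4", "p5"}]
sens, spec = sensitivity_specificity(clusters, superfamily)
print(sens, spec)
```

Here two of the four true pairs are recovered and two of the four predicted pairs are wrong, so both values come out at 0.5.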
34
Discussion (1)
Threshold 32: sens 55.6%, spec 100%, TP
due to intermediate linking 8%
Noise floor lifting off at threshold 23

Sensitivity, specificity and the percentage of
indirectly linked true positives versus
clustering threshold for the SCOP1 data set

35
Discussion (2)
Threshold 32: sens 57.9%, spec 99.8%, indirect
TP 11.6%
Absolute increase in sens: 2.3%
Relative increase in sens: 4.1%
Absolute increase in indirect TP: 3.6%
The noise floor is higher

Sensitivity, specificity and the percentage of
indirectly linked true positives versus
clustering threshold for the SCOP1 SP data set

36
Discussion (3)
[Figure: panels for SCOP1 and SCOP1 SP; label text "of SCOP super-families"]

Total number of SCC clusters, and SCC clusters
of sizes 1, 2-5, 6-10 etc. for varying thresholds
from 25 to 50.

37
Discussion (4)
Comparison with the algorithm by Arvestad, which
employs only pair-wise sequence comparisons. Their
approach uses a more involved scoring method,
optimized substitution matrices, and gap
penalties to achieve a substantial improvement
over straightforward pair-wise sequence
comparisons. 24% better sensitivity at
virtually equal specificity.

Sensitivity versus specificity for the SCOP2
data set on the fold, super-family and family
level

38
Links

http://promoter.mi.uni-koeln.de/proclust/

39
Part III

Improved clustering of protein sequences with an
extended graph-based approach

40
The goal

To detect structural homology through sequence
similarity, by increasing sensitivity through
transitive homology heuristics.

41
Some Alternatives

Alternative approaches using the concept of
transitivity for large-scale analysis of protein
sequences:
Iterated BLAST or FASTA search for computing
clusters, which are subsequently merged and
processed further. Does not explicitly deal with
multi-domain problems.
ProtoMap: graph-based approach, uses a combination
of BLAST, FASTA and Smith-Waterman E-values to
create a hierarchy of clusters. Has problems
with multi-domain proteins, which cause cluster
splitting.
All-against-all BLAST search, ignoring all hits
below a specified threshold and yielding a (0,1)
similarity matrix. Extensive post-processing is
required to symmetrize the matrix and to deal
with multi-domain proteins.
Clusters of orthologous groups (COGs), built
starting with proteins from seven different
species. Tries to compensate for multi-domain
proteins with an iterative merging process.

42
A new solution

The extended graph-based approach is designed to
provide clustering as an aid in finding remote
homologues. The multi-domain problem is directly
addressed, although it is not fully solved.
Sensitivity is increased without a significant
loss of specificity.

43
Different Symmetries

Symmetric similarity
Does not distinguish between two proteins being
globally similar and one protein being similar to
an individual domain of a multi-domain protein.
This can lead to incorrect links.
Asymmetric similarity
Can be employed to distinguish between global and
non-global similarity.

44
Limiting factors

Large random similarities can cause
super-clusters, which will connect large parts of
the sequence space. This can be countered by
using more stringent criteria.
Multi-domain proteins
Domains are the compact semi-independent
structural units of proteins, which often appear
highly conserved in a number of multi-domain
proteins.

45
An example run

The datasets
SCOP v1.53
All sequences with fewer than 40 amino acids were
removed
Filtered for low-complexity regions using seg
with the parameters 12, 1.8, 2.0, x
Sequences containing masked amino acids as well
as duplicate sequences were removed.
SPROT
Release 39
Processed analogously to SCOP

46
The Filtering by Significance

Extreme value distribution
The maximum scores of a large number of
alignments between random sequences of equal
length tend to follow an extreme value
distribution. This is used to estimate the maximal
scores observable with the Smith-Waterman
algorithm for random sequences of given lengths.
Pruning consists of removing an edge (P,Q) from
the graph if the significance of the score w(P,Q)
is below the chosen significance threshold.
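The pruning step can be sketched with a Gumbel (extreme value) tail probability. This is an illustration under stated assumptions, not the authors' procedure: the location mu and scale parameter lam are invented and held constant here, whereas the slides say the distribution depends on the sequence lengths. "Significant" is taken to mean a tail probability no larger than a cut-off alpha, and the function names are hypothetical.

```python
import math


def gumbel_p_value(score, mu, lam):
    """P(S >= score) for an extreme value distribution.

    mu is the location and lam the scale parameter. expm1 keeps the
    result accurate when the probability is very small.
    """
    return -math.expm1(-math.exp(-lam * (score - mu)))


def prune_edges(edges, mu, lam, alpha):
    """Keep only edges whose score is significant at level alpha."""
    return {
        pair: score
        for pair, score in edges.items()
        if gumbel_p_value(score, mu, lam) <= alpha
    }


# Toy edge scores and distribution parameters (all invented).
edges = {("P", "Q"): 80.0, ("P", "R"): 35.0}
kept = prune_edges(edges, mu=30.0, lam=0.25, alpha=1e-3)
print(kept)
```

With these toy parameters, a score of 80 is far in the tail and survives, while a score of 35 is close to what random sequences produce and is pruned.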

47
The Algorithm

Compute a complete undirected graph
Replace each undirected edge with two directed
edges

Proceed to threshold graph by removing all edges
of weight less than the threshold
Compute all strongly connected components (SCCs)

48
Post-Processing: Merging Clusters

Clusters with at least 20 sequences were selected.
A multiple alignment was built for each set of
sequences with ClustalW.
Profiles were built and calibrated with the HMMER
package using default parameters.
For each such cluster profile, all sequences not
contained in the cluster were scored using the
profile and the E-value was recorded.
If the profile of one cluster resulted in an
E-value below the threshold against another
cluster, those clusters were merged.
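The final merging step can be sketched with a union-find structure. It assumes the profile searches have already been run and summarised. The profile_hits table (profile cluster index, target cluster index) -> best E-value, the function name and all values are hypothetical. The 20-sequence minimum is ignored in the toy data.

```python
def merge_clusters(clusters, profile_hits, e_value_threshold):
    """Merge clusters linked by a profile hit below the E-value threshold."""
    parent = list(range(len(clusters)))

    def find(i):
        # Follow parent links to the representative, halving the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for (source, target), e_value in profile_hits.items():
        if e_value < e_value_threshold:
            parent[find(source)] = find(target)

    merged = {}
    for index, cluster in enumerate(clusters):
        merged.setdefault(find(index), set()).update(cluster)
    return list(merged.values())


# Toy clusters and profile hits (invented).
clusters = [{"a", "b"}, {"c"}, {"d", "e"}]
profile_hits = {(0, 1): 1e-8, (0, 2): 0.5}
merged = merge_clusters(clusters, profile_hits, e_value_threshold=1e-3)
print(merged)
```

Union-find makes the merging transitive: if cluster 0 hits cluster 1 and cluster 1 hits cluster 2, all three end up together.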

49
Complexity

Using C software on a Compaq ES40 running Tru64
Unix V5.1
Smith-Waterman computations needed 70 CPU days.
Clustering needed 30 seconds.
Cluster merging using HMMs needed 21 CPU days.

50
Psi-Blast Flexibility
51
Multi-Domain Problem
52
Path Length FP. vs. TP.
53
Abundance of multi-domain proteins
54
Multi-domain Problem
Present at 13.1 threshold but disappears if
threshold is raised above 15.4 since d1m1da1
vanishes
55
Larger Multi-Domain Problem
Threshold of 21.3 and no edges between P12715
and P33497
56
More Laddering of Proteins

False positives caused by just the right increase
in length of proteins. None of the edges are
removed when going over to threshold graph.

57
Extended Graph vs. PSI-BLAST
58
Conclusion

Sensitivity of 63.5% at 99.0% specificity
Improvement of 34% over PSI-BLAST's performance
of 47.5% sensitivity at 99.0% specificity.
Performance is gained at the expense of a much
larger computational effort.
Performance can be further improved by taking
length and position of conserved regions into
account.
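The 34% figure is a relative improvement in sensitivity, which can be checked from the two sensitivities quoted on this slide. The function name is chosen for this example only.

```python
def relative_improvement(new, baseline):
    """Relative change of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


# Sensitivities at 99.0% specificity, as quoted on the slide.
extended_graph_sens = 63.5
psi_blast_sens = 47.5

gain = relative_improvement(extended_graph_sens, psi_blast_sens)
print(round(gain))
```

The absolute gain is 16 percentage points. Dividing by the PSI-BLAST baseline of 47.5 gives roughly 34%.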
