Title: Protein structure introduction
1. Protein structure introduction
Bioinformatics: Genes, Proteins and Computers. Orengo, Jones and Thornton (2003).
2. Secondary structure elements
3. Tertiary structure: the protein fold
The complete 3-dimensional structure.
Why is it interesting? Isn't the sequence enough?
- Structure is more conserved than sequence!
- Detection of distant evolutionary relationships
- A key to understanding protein function
- Structure-based drug design
4. Fold classification
Classification: clustering proteins into structural families.
Motivation?
- profound analysis of evolutionary mechanisms
- constraints on secondary structure packing?
- classification at domain level
5. CATH Protein Structure Classification
- hierarchical classification of protein domain structures in the Brookhaven Protein Data Bank (PDB).
- domains are clustered at four major levels:
- Class
- Architecture
- Topology
- Homologous superfamily
- Sequence family
6. CATH hierarchical classification
- Class: secondary structure content (mainly α, mainly β, α-β, low 2nd structure content).
- Architecture: gross orientation of secondary structures, independent of connectivity.
- Topology (fold): clusters structures according to their topological connections.
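CATH labels each node of the hierarchy with a dotted numeric code, one number per level (Class.Architecture.Topology.Homologous superfamily). The sketch below shows one way to split such a code and compare two domains at the topology level. The `CathCode` type and the `parse_cath_code` and `same_topology` helpers are hypothetical names introduced for illustration, and the codes in the usage note are sample input only.

```python
from dataclasses import dataclass

# Class numbers as used in CATH codes; the labels follow the slide's wording.
CLASS_NAMES = {
    1: "mainly alpha",
    2: "mainly beta",
    3: "alpha-beta",
    4: "low secondary structure content",
}


@dataclass(frozen=True)
class CathCode:
    """One domain's position in the C.A.T.H hierarchy (hypothetical helper type)."""
    cls: int
    architecture: int
    topology: int
    superfamily: int

    def prefix(self, depth: int) -> tuple:
        """Return the first `depth` levels, e.g. depth=3 gives the topology (fold)."""
        return (self.cls, self.architecture, self.topology, self.superfamily)[:depth]


def parse_cath_code(code: str) -> CathCode:
    """Parse a four-level dotted code such as '1.10.490.10'."""
    parts = code.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated levels, got {code!r}")
    return CathCode(*(int(p) for p in parts))


def same_topology(a: CathCode, b: CathCode) -> bool:
    """Two domains share a fold when Class, Architecture and Topology all agree."""
    return a.prefix(3) == b.prefix(3)
```

For example, `same_topology(parse_cath_code("1.10.490.10"), parse_cath_code("1.10.490.20"))` is true, because the two codes differ only at the superfamily level.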
7. CATH architectures
8. CATH architectures (cont.)
9. CATH hierarchical classification
- Homologous superfamily: homologous domains identified by sequence similarity and structure similarity.
- Sequence family: domains clustered in the same sequence families, with sequence identity > 35%.
- Other classification schemes: SCOP, FSSP
  - partial disagreement between them.
10. Growing demand for protein structures!
- PDB contains 20,868 structures.
- X-ray and NMR have limitations.
WE NEED FASTER METHODS!
11. Protein Structure Prediction
- Limited to very short peptides!
12. Can known structures assist prediction?
The number of possible folds seems to be limited!
- CATH inspection: more than 36,000 domains, but...
- only 800 topology groups
13. Template-based prediction (fold recognition)
II) Comparative modeling (homology modeling)
- alignment with a homologous sequence of known structure.
- high sequence identity areas: similar structure
- variable areas must be built
- can't be used if no sequence similarity is found!
III) Threading
- alignment with structure sequences in a fold library
- a sophisticated scoring function finds the most similar fold
- threading aligns the target sequence onto the template structure
14. What are the baselines for protein fold recognition? McGuffin, Bryson and Jones (2001)
Goals:
1. What constitutes a baseline level of success for protein fold recognition methods, above random guesswork?
2. Can simple methods that make use of 2nd structure information assign folds more reliably?
3. How valuable might these methods be in the rapid construction of a useful hierarchical classification?
15. The methods evaluated (ordered by complexity and runtime)
- shorten 2nd structure strings: CCCHHHHCCCEEECCHHHCCC → HCECH.
- pairwise alignment
- scoring function also considers length of elements
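The shortening and alignment steps can be sketched as follows. Collapsing each run of identical states to one letter turns the slide's example into CHCECHC, so to reproduce the slide's HCECH the sketch also trims terminal coil; that trimming is an assumption inferred from the example. The alignment is a plain Needleman-Wunsch global alignment score, swapped in for illustration. The function names and the match, mismatch and gap values are hypothetical, and this sketch does not model the variant that also considers element lengths.

```python
from itertools import groupby


def shorten(ss_string: str) -> str:
    """Collapse runs of identical states, then drop coil ('C') at both ends.

    The terminal trimming is an assumption made to match the slide's example.
    """
    collapsed = "".join(state for state, _run in groupby(ss_string))
    return collapsed.strip("C")


def align_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Global alignment score of two element strings (Needleman-Wunsch).

    Only the previous row of the dynamic-programming table is kept.
    Scoring values are illustrative, not taken from the source.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [i * gap]
        for j, y in enumerate(b, start=1):
            diagonal = prev[j - 1] + (match if x == y else mismatch)
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]
```

Calling `shorten("CCCHHHHCCCEEECCHHHCCC")` yields `"HCECH"`, and two domains can then be compared with `align_score(shorten(s1), shorten(s2))`.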
16. A representative set of protein domains
- a set of 1087 domains representing different Sequence Families was selected from CATH.
- generate an informative file for each domain:
1. >1atx00
2. GAAaLbKSDGPNTRGNSMSGTIWVFGcPSGWNNbEGRAIIGYacKQ
3.  EEE TTS S TTSSEEEEEESS  TT EEE SSSSSEEEE
4. CEEEEEHHECEEEECCCECEEEECCCEECCEECEEECCEECEEEEC
17. First evaluation: true positive percentage
Compare the true positive percentage at a fixed 3% false positive rate.
Run each method on all possible pairs from the 1087 set: (a,b) (a,c) (a,d) ... (g,d) (g,e) ... (k,f) ... (r,s) ... about 590,000 pairs.
STOP! 3 false positives reached. True positives for this method: 2.
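The pair count and the stopping rule can be sketched as below. The count is simply the number of unordered pairs among 1087 domains. For the evaluation, the sketch ranks all pairs by score, walks down the list, and stops once the allowed number of false positives is reached. Treating the threshold as a fraction of all false pairs, and reporting true positives as a percentage of all true pairs, are assumptions about how the percentages were defined; `true_positive_percentage` and its parameters are hypothetical names.

```python
from math import comb

# Number of unordered pairs among 1087 domains (the slide rounds this to 590,000).
N_PAIRS = comb(1087, 2)  # 590241


def true_positive_percentage(scored_pairs, fp_fraction=0.03):
    """Percentage of true pairs found before the false-positive budget is used up.

    scored_pairs: iterable of (score, is_true_pair), higher score = more similar.
    fp_fraction:  allowed false positives as a fraction of all false pairs
                  (an assumption about the definition).
    """
    pairs = sorted(scored_pairs, key=lambda p: p[0], reverse=True)
    total_true = sum(1 for _score, is_true in pairs if is_true)
    total_false = len(pairs) - total_true
    if total_true == 0:
        return 0.0
    fp_limit = max(1, round(fp_fraction * total_false))

    true_pos = false_pos = 0
    for _score, is_true in pairs:
        if is_true:
            true_pos += 1
        else:
            false_pos += 1
            if false_pos >= fp_limit:
                break  # "STOP!": the false-positive budget has been reached
    return 100.0 * true_pos / total_true
```

Each method would be run through the same function on the same pair list, so the resulting percentages are directly comparable.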
18. We need lower and upper controls to compare with
Lower control: intelligent guesswork
1. Randomly assign CATH topology codes according to frequency.
2. Calculate the true positive and false positive percentages.
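A minimal sketch of the lower control, assuming "according to frequency" means each domain receives a topology code drawn with probability proportional to how often that code occurs in the set. The function names and the fixed seed are hypothetical. The second helper gives the chance that one such guess matches a domain's real code, which is the sum of the squared code frequencies.

```python
import random
from collections import Counter


def random_proportional_assignment(true_topologies, seed=0):
    """Draw one topology code per domain, weighted by the observed code frequencies."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    counts = Counter(true_topologies)
    codes = list(counts)
    weights = [counts[code] for code in codes]
    return rng.choices(codes, weights=weights, k=len(true_topologies))


def expected_agreement(true_topologies):
    """Probability that a frequency-weighted guess equals a domain's true code."""
    counts = Counter(true_topologies)
    total = sum(counts.values())
    return sum((n / total) ** 2 for n in counts.values())
```

With two equally common codes, `expected_agreement` returns 0.5; with many rare topologies it drops quickly, which is what makes this a useful floor to compare real methods against.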
19. Optimisation of similarity scoring methods
Class pre-filter
20.
- Partial agreement between classification schemes: FSSP compared with SCOP 61.1%, FSSP compared with CATH 46.7%.
- Most accurate is method number 5, alignment of secondary structure elements without additional scoring, with 27.18% true positives.
- The accuracy ordering of the methods doesn't correspond to their relative complexity.
21. Second evaluation: CASP-like sensitivity
Similarly to CASP, we measure the sensitivity of each method: what is the probability of a method correctly assigning a fold?
Lower control: a random proportional fold assignment.
Upper control: FSSP was used as a scoring method.
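One way to read this measure is as a top-hit test: for each domain, take the best-scoring other domain and check whether it has the same topology. The sketch below implements that reading, which is an assumption about the exact definition; `top_hit_sensitivity`, its arguments, and the toy scoring function in the usage note are hypothetical.

```python
def top_hit_sensitivity(topology, score):
    """Percentage of domains whose best-scoring partner shares their topology.

    topology: dict mapping domain id -> topology code (needs at least two domains).
    score:    function (query_id, other_id) -> float, higher = more similar.
    """
    domains = list(topology)
    if len(domains) < 2:
        raise ValueError("need at least two domains to rank hits")
    correct = 0
    for query in domains:
        others = [d for d in domains if d != query]
        best = max(others, key=lambda d: score(query, d))
        if topology[best] == topology[query]:
            correct += 1
    return 100.0 * correct / len(domains)
```

As a toy usage, give four domains a single numeric feature and score pairs by negative absolute difference; any of the evaluated scoring methods, or a control, could be passed in as `score` instead.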
22. Sensitivity results
- Method 5 wins again: 31.8% sensitivity.
- Other 2nd structure based methods follow with a small gap.
- The sensitivity ordering of the methods matches their true positive percentage ordering.
23. Similarity trees: can we construct a classification?
The best method's similarity scores for all pairs were clustered into a tree.
a. globin-like <> casein kinase
b. immunoglobulin-like <> thrombin subunit H
Whole tree generally disordered.
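Clustering pairwise similarity scores into a tree can be sketched with a simple agglomerative procedure: repeatedly merge the two most similar clusters until one remains. The slides do not say which clustering method was used, so average linkage is an assumption here, and `build_tree` and `leaves` are hypothetical names. The tree is returned as nested tuples of domain names.

```python
from itertools import combinations


def leaves(tree):
    """Flatten a nested-tuple tree into its list of leaf names."""
    if isinstance(tree, tuple):
        return leaves(tree[0]) + leaves(tree[1])
    return [tree]


def build_tree(names, similarity):
    """Greedy average-linkage agglomeration (assumed method, for illustration).

    names:      unique domain names (must not be tuples).
    similarity: function (a, b) -> float, higher = more similar.
    """
    def average_similarity(a, b):
        sims = [similarity(x, y) for x in leaves(a) for y in leaves(b)]
        return sum(sims) / len(sims)

    clusters = list(names)
    while len(clusters) > 1:
        # Pick the pair of current clusters with the highest average similarity.
        i, j = max(
            combinations(range(len(clusters)), 2),
            key=lambda ij: average_similarity(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = (clusters[i], clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0]
```

Comparing the subtrees of such a tree with the CATH hierarchy is one way to see whether unrelated folds end up grouped together.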
24. Conclusions
- Baseline level to be exceeded by fold recognition methods:
  - 27% true positive assignments, allowing 3% false positives
  - sensitivity level of 32%.
- Methods which make use of 2nd structure information seem more accurate and sensitive than those that don't.
- Simple 2nd structure alignments alone cannot construct a reliable classification hierarchy.
- The agreement between the FSSP, SCOP and CATH classification schemes is surprisingly low.