Biological Databases for Protein Sequence Analysis

Slides: 121

Provided by: attw

Transcript and Presenter's Notes

1
Biological Databases for Protein Sequence Analysis

Terri Attwood
School of Biological Sciences
University of Manchester, Oxford Road
Manchester M13 9PT, UK
http://www.bioinf.man.ac.uk/dbbrowser/

2
Overview

Introduction
Web practical, science fact & fiction
the Twilight Zone, the Midnight Zone
Biological databases
primary, secondary pattern, composite, etc.
Pattern recognition
regular expressions, fingerprints, profiles,
etc.
Building a search protocol
combining results, estimating significance

3
The practical - BioActivity

BioActivity is intended to support the lectures
you begin with a DNA sequence fragment
try to find out what protein this codes for, the
family to which it belongs, whether its function
& structure are known, etc.
The practical is entirely Web-based
largely uses local servers, but also links to
external sites
be patient & mindful of traffic - don't waste
time on slow links
Most important of all
please read the instructions!
The Web is constantly evolving...
please report dead links (otherwise they'll stay
dead)!

5
The stuff you have to know

Single- & three-letter amino acid codes
G  Glycine         Gly     P  Proline         Pro
A  Alanine         Ala     V  Valine          Val
L  Leucine         Leu     I  Isoleucine      Ile
M  Methionine      Met     C  Cysteine        Cys
F  Phenylalanine   Phe     Y  Tyrosine        Tyr
W  Tryptophan      Trp     H  Histidine       His
K  Lysine          Lys     R  Arginine        Arg
Q  Glutamine       Gln     N  Asparagine      Asn
E  Glutamic Acid   Glu     D  Aspartic Acid   Asp
S  Serine          Ser     T  Threonine       Thr
Additional codes
B  Asn/Asp         Z  Gln/Glu         X  Any amino acid

7
Basic definitions

Primary structure
the linear sequence of amino acids in a protein
Secondary structure
regions of local regularity
i.e., α-helices, β-strands, β-sheets & β-turns

8
Definitions contd.

Super-secondary structure
the packing of secondary structure elements into
stable units
e.g., β-barrels, βαβ units, Greek keys, etc.

9
Definitions contd.

Tertiary structure
the overall chain fold that results from packing
of secondary structure elements

10
Definitions contd.

Quaternary structure
the arrangement of separate chains within a
protein that has more than one subunit
e.g., haemoglobin

11
Definitions contd.

Quinternary structure
the arrangement of separate molecules, such as in
protein-protein or protein-nucleic acid
interactions

12
Definitions contd.

Bioinformatics
broadly, Information Technology applied to
biology
this can mean anything from AI & robotics to
genome analysis!
boundaries with computational biology now
blurred
originally coined in the '80s to mean
bio-sequence analysis
with increasing availability of protein
structures, the term now also encompasses
structure analysis
but the scale of the problem here is vastly
different.....

13
Importance of sequence analysis

694,000 sequences available in public databases
millions more (including ESTs) in proprietary
databases
these numbers will snowball with completion of more
genomes
so what?
Locked up in sequences is a huge amount of
structural, functional & evolutionary info
they're a highly valuable resource
By contrast, the number of unique protein structures
is 2000
this represents a huge information deficit

14
Sequence-structure deficit

Non-redundant growth of sequences during
1988-1998, & the corresponding growth in
the number of structures.

15
Challenges for bioinformatics

Spurred on by the sequence/structure deficit, the
challenges are to
rationalise the mass of sequence data
derive more efficient means of data storage
design more incisive & reliable analysis tools
The imperative - to convert sequence information
into biochemical & biophysical knowledge
to decipher the structural, functional &
evolutionary clues encoded in the language of
biological sequences

16
The Holy Grail of bioinformatics

...to be able to understand the words in a
sequence sentence that form a particular protein
structure

17
The reality of sequence analysis

...isn't so glamorous....but means we can
recognise words that form characteristic
patterns, even if we don't know the precise
syntax to build complete protein sentences

18
Pattern recognition & prediction

In investigating the meaning of sequences, 2
distinct analytical approaches have emerged
pattern recognition is used to detect similarity
between sequences & hence to infer related
structures & functions
ab initio prediction is used to deduce structure,
& to infer function, directly from sequence
These methods are different & shouldn't be
confused
Sequence- & structure-based pattern recognition
methods demand that some characteristic has been
seen before & housed in a db
Prediction methods remove the need for template
dbs because deductions are made directly from
sequence

19
Science fact & fiction

Sequence pattern recognition is easier to
achieve, & is much more reliable, than fold
recognition
which is 40-50% reliable even in expert hands
Prediction is still not possible
& is unlikely to be so for decades to come (if
ever)
Structural genomics will yield representative
structures for more proteins in future
structures of new sequences will be determined by
modelling
prediction will become an academic exercise
But, to debunk a popular myth, knowing structure
alone does not inherently tell us function

20
A reality check

What is the function of this structure?

What is the function of this sequence?

What is the function of this motif?
the fold provides a scaffold, which can be
decorated in different ways by different
sequences to confer different functions - knowing
the fold & function allows us to rationalise how
the structure effects its function at the
molecular level

21
A test case for structural genomics
Structure-based assignment of the biochemical
function of hypothetical protein mj0577
(Zarembinski et al., PNAS 95 1998)
Although the structure co-crystallised with ATP,
the biochemical function of the protein is
unknown

22
The Twilight Zone

Prediction methods don't work because we don't
fully understand the Folding Problem
we can't read the language sequences use to
create their folds
But, with sequence analysis techniques, we can
try to find similarities between new sequences
& those in dbs
whose structures & functions we hope have been
elucidated
This is straightforward at high levels of
identity, but below 50% it is difficult to
establish relationships reliably
Analyses can be pursued with decreasing certainty
towards the Twilight Zone
(around 20% identity), where results may look
plausible to the eye, but are no longer
statistically significant

23
Application areas of analysis tools

The scale indicates % identity between aligned
sequences
Alignment of 2 random seqs can produce 20%
identity
less than 20% does not constitute a significant
alignment
around this threshold is the Twilight Zone,
where alignments may appear plausible to the eye,
but can't be proved by conventional methods
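
A minimal sketch of how the % identity scale could be computed for a pair of pre-aligned sequences. The function names, the choice to ignore gapped columns, and the exact cut-offs in zone() are assumptions made for illustration; only the 50% and 20% landmarks come from the slides, and other definitions of % identity use a different denominator.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity over aligned columns in which neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    columns = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not columns:
        return 0.0
    identical = sum(1 for x, y in columns if x == y)
    return 100.0 * identical / len(columns)


def zone(identity: float) -> str:
    """Rough label for an identity value (cut-offs are a simplification)."""
    if identity >= 50.0:
        return "straightforward"
    if identity > 20.0:
        return "decreasing certainty"
    return "Twilight Zone or below"


pid = percent_identity("YVTVQHKKLRTPL", "AATMKFKKLRHPL")
print(f"{pid:.1f}% identity: {zone(pid)}")
```

A high identity over a short segment like this one says little by itself; the label is only a reminder of where on the scale a given alignment sits.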

24
Homology & analogy

The term homology is confounded & abused in the
literature!
sequences are homologous if they're related by
divergence from a common ancestor
analogy relates to the acquisition of common
features from unrelated ancestors via convergent
evolution
e.g., β-barrels occur in soluble serine proteases
& integral membrane porins; chymotrypsin &
subtilisin share groups of catalytic residues,
with near identical spatial geometries, but no
other similarities
Homology is not a measure of similarity & is not
quantifiable
it is an absolute statement that sequences have a
divergent rather than a convergent relationship
the phrases "the level of homology is high" or
"the sequences show 50% homology", or any like
them, are strictly meaningless!
This is not just a semantic issue
loose usage muddies thinking about evolutionary
relationships

25
A terminology muddle

In comparing 3D structures, exactly the same
arguments apply
structures may be similar, as denoted by RMS
positional deviation between compared atomic
positions
common evolutionary origin remains a hypothesis,
until supported by other evidence
homology among similar structures is a
hypothesis
This may be correct or mistaken, but their
similarity is a fact, no matter how it is
interpreted
Similarity of sequence or structure is just that
- similarity
Homology connotes a common evolutionary origin
Reeck, G.R., de Haen, C., Teller, D.C.,
Doolittle, R.F., Fitch, W.M., Dickerson, R.E.,
Chambon, P., McLachlan, A.D., Margoliash, E.,
Jukes, T.H. & Zuckerkandl, E. (1987) Homology
in proteins and nucleic acids: a terminology
muddle and a way out of it. Cell, 50, 667.

26
Orthology & paralogy

Among homologous sequences we can distinguish
orthologues - largely perform the same function
in different species
paralogues - perform different but related
functions in one organism
Studying orthologues opens the way to molecular
palaeontology
e.g., using phylogenetic trees to show
cross-species relationships
Paralogues shed light on underlying evolutionary
mechanisms
paralogous proteins are thought to have arisen
from single genes via successive duplication
events
duplicated genes follow separate evolutionary
pathways & new specificities evolve through
variation & adaptation
Such complexity presents real challenges for
sequence analysis

27
Challenges for sequence analysis

Much of the challenge is in getting the biology
right
complicated by orthology vs paralogy
Following a db search, it may be unclear how much
functional annotation can be legitimately
inherited by a query
source of numerous annotation errors in dbs
propagation could lead to an error catastrophe
Further complications result from the modular
nature of proteins
modules are autonomous folding units, used as
protein building blocks - like Lego bricks, they
can confer a variety of functions on the parent
protein, either by multiple combinations of the
same module, or via different modules to form
mosaics
Automatic systems don't distinguish orthologues
from paralogues & don't consider the modular
nature of proteins

29

Monkeys are exploited in different Goldberg
machines, where they perform different functions
- here, we couldn't predict a monkey in that
spot, even with total knowledge of the rest of
the machine
Similarity searches are just like this
identifying the presence of a module tells little
of the function of the complete system
knowing most components of a mosaic, we can't
predict a missing one
modules (monkeys) in different proteins don't
always perform exactly the same function

30
The Midnight Zone

Identifying evolutionary links between sequences
is useful
this often implies a shared function
Arguably, prediction of function from sequence is
of more immediate value than the prediction of
structure
However, between distantly-related proteins,
structure is more conserved than the underlying
sequences
thus, some relationships are only apparent at the
structural level
Such relationships can't be detected by even the
most sensitive sequence comparison methods
the region of identity where sequence comparisons
fail completely to detect structural similarity
is the Midnight Zone - there is thus a
theoretical limit to the effectiveness of
sequence analysis methods

31
Ground rules for bioinformatics

Don't always believe what programs tell you
they're often misleading & sometimes wrong!
Don't always believe what databases tell you
they're often misleading & sometimes wrong!
Don't always believe what lecturers tell you
they're often misleading & sometimes wrong!
In short, don't be a naive user
when computers are applied to biology, it is
vital to understand the difference between
mathematical & biological significance
computers don't do biology
they do sums
quickly!

32
Significance

Appreciating that mathematical & biological
significance are different is crucial
It is especially important in understanding the
limitations of
database search algorithms
multiple sequence alignment algorithms
pattern recognition techniques
functional site & structure prediction tools
Contrary to popular opinion, there is currently
still
no biologically-reliable automatic multiple
alignment algorithm
no infallible pattern-recognition technique
no reliable gene, function or structure
prediction algorithm

35
Biological Databases

Overview
Primary data sources
GenBank, SWISS-PROT & TrEMBL
Composite sequence databases
NRDB, OWL, SPTrEMBL
Secondary pattern databases
PROSITE, PRINTS, Profiles, Pfam, BLOCKS,
IDENTIFY
Composite pattern databases
BLOCKS, InterPro

36
Primary sequence databases

In the '80s, when sequences started to
accumulate, several labs saw advantages to
establishing central repositories
trouble is, many labs thought the same & made
their own
Nucleic acid    Protein
EMBL            SWISS-PROT
GenBank         PIR
DDBJ            MIPS
                TrEMBL
                NRL-3D
The proliferation of dbs causes problems
do they have the same format? Which is the most
accurate? The most up-to-date? The most
comprehensive? Which should we use?

37
Composite sequence databases

A solution to proliferating dbs is to compile a
composite
these render searches very efficient, especially
if non-redundant
Trouble is, there are now several composites,
each with their own format & redundancy criteria
NRDB               OWL           SPTrEMBL
PDB                SWISS-PROT    SWISS-PROT
SWISS-PROT         PIR           TrEMBL
PIR                GenBank
GenPept            NRL-3D
GenPept updates
NRDB & SPTrEMBL are non-identical, not
non-redundant
but which is best? Which the most comprehensive?
The most up-to-date? Which should we use?

38
Secondary pattern databases

As well as 1° resources, there are also many 2°
pattern dbs derived from them
trouble is, they use different 1° sources &
different analysis methods, & all have different
formats!
But it isn't all bad - SWISS-PROT is emerging as
a standard, & most of the 2° dbs use it as their
basis
Database    Source                  Stored information
PROSITE     SWISS-PROT              Regular expressions (patterns)
PRINTS      SWISS-PROT/TrEMBL       Aligned motifs (fingerprints)
Pfam        SWISS-PROT/TrEMBL       Hidden Markov Models (HMMs)
Profiles    SWISS-PROT              Weight matrices (profiles)
BLOCKS      PRINTS/InterPro/Domo    Weighted motifs (blocks)
IDENTIFY    PRINTS/InterPro         Permissive regular expressions

39
Why create pattern databases?

Arise from the need to make more specific
functional diagnoses than are possible by just
searching the 1° dbs
They're built on the principle that homologous
sequences may be gathered into alignments, within
which are regions (motifs) that show little
variation
these usually reflect vital structural or
functional roles
Motifs are exploited in different ways to build
diagnostic patterns for protein families
new sequences can be searched against dbs of such
patterns to see if they can be assigned to known
families
hence they offer a fast track to the inference of
function

40
What's in a sequence?
41
Single motif methods
Fuzzy regex (IDENTIFY)
Exact regex (PROSITE)
Multiple motif methods
Identity matrices (PRINTS)
Weight matrices (BLOCKS)
Full domain alignment methods
Profiles (PROFILE LIBRARY)
HMMs (Pfam)
42
The challenge of family analysis
43
Know your family
44
The problem with domains
45
PROSITE

This was the first pattern database
protein families characterised by single motifs
Sequence information in motifs is reduced to
consensus or regular expressions & the seed
pattern is used to search SWISS-PROT
results are checked by hand to determine true &
false matches
noisy patterns are revised to achieve optimal
results
Some families can't be characterised by single
motifs
here, additional patterns are created & refined
until an optimal set of patterns is achieved that
captures most or all of the family
results are then manually annotated for inclusion
in the db

48
PRINTS

Most protein families are characterised by >1
motif
it is sensible to use them all to build a
diagnostic signature
This is the principle of fingerprints
these offer improved diagnostic reliability by
virtue of the biological context provided by
motif neighbours
Motifs are excised from alignments by hand &
encoded as ungapped, unweighted local alignments
residue information is augmented via iterative
searches
sequences matching all motifs that weren't in the
original alignment are added to the motifs, & the
db is searched again
The process is repeated until convergence
results are manually annotated prior to inclusion
in the db

50
SUMMARY INFORMATION
  37 codes involving 8 elements
   0 codes involving 7 elements
   0 codes involving 6 elements
   0 codes involving 5 elements
   0 codes involving 4 elements
   1 codes involving 3 elements
   0 codes involving 2 elements

COMPOSITE FINGERPRINT INDEX
  8   37  37  37  37  37  37  37  37
  7    0   0   0   0   0   0   0   0
  6    0   0   0   0   0   0   0   0
  5    0   0   0   0   0   0   0   0
  4    0   0   0   0   0   0   0   0
  3    1   0   0   0   1   1   0   0
  2    0   0   0   0   0   0   0   0
 ------------------------------------------
       1   2   3   4   5   6   7   8

True positives..
PRIO_COLGU  PRIO_MACFA  PRIO_CEREL  PRIO_ODOHE  PRIO_GORGO
PRIO_PANTR  PRIO_HUMAN  O46648      PRIO_SHEEP  PRIO_CALJA
PRIO_BOVIN  PRP2_BOVIN  PRIO_ATEPA  PRIO_SAISC  PRIO_PREFR
PRIO_PONPY  O75942      PRIO_CAPHI  PRIO_CEBAP  PRIO_CAMDR
PRIO_FELCA  PRP1_TRAST  PRIO_RABIT  PRP2_TRAST  PRIO_PIG
PRIO_CANFA  PRIO_CRIGR  PRIO_CRIMI  Q15216      PRIO_RAT
PRIO_CERAE  PRIO_MUSPF  PRIO_MUSVI  PRIO_MESAU  PRIO_MOUSE
O46593      PRIO_TRIVU

Subfamily Codes involving 3 elements
Subfamily True positives..
PRIO_CHICK
52
Profiles & Pfam

An alternative to motif-based methods exploits
regions between motifs, which contain valuable
information
the full alignment effectively becomes the
discriminator
A complex scoring scheme allowing for
substitutions & INDELs is used to create
family-specific profiles
These profiles can be used to detect distant
relationships, where only few residues are
conserved
this is the basis of the Profile library
In an extension of this approach, alignments are
encoded as probabilistic models termed HMMs
this is the basis of Pfam

53
BLOCKS & IDENTIFY

There are advantages to storing motifs in a raw
form
no information is lost
different scoring schemes may be used to confer
different diagnostic potentials on the same data
Additional pattern databases have arisen in this
way
BLOCKS - processed PROSITE families automatically
(BLOCKS includes many other sources)
BLOCKS-format PRINTS - PRINTS motifs with BLOCKS
scoring
IDENTIFY - creates fuzzy expressions from PRINTS
& InterPro
These databases are derived fully automatically,
hence offer
no family annotation (they link back to PRINTS
& InterPro)
no further family coverage

54
Composite pattern databases

To simplify sequence analysis, the pattern
databases are being integrated to create a
unified protein family resource - InterPro
this is a central annotation resource (derived
from PRINTS & PROSITE documentation), with
pointers to its satellite databases
release 3.0 contains 3591 entries
current partners are PRINTS, PROSITE, Profiles,
Pfam & ProDom
future partners will include SMART, TigrFam &
hopefully others (BLOCKS, MetaFam, etc.)
lags behind its sources

57
Pattern Recognition

Overview
Pattern recognition methods
regular expressions, fingerprints, blocks,
profiles & HMMs
Which method is best?

58
Pattern recognition methods

These methods classify proteins into families
the basis of the methods is multiple sequence
alignment
They depend on developing a representation of
conserved elements of alignments that may be
diagnostic of structure or function, whether
from
homologous sequence families
sequences that share some structural/functional
domains

59
Determining significance of database matches

When searching a db, the challenge for analysis
methods is to determine if matches are related
(true-positive) or unrelated (true-negative)
At a given scoring threshold, it is likely that
unrelated sequences will be matched erroneously
(false-positives) & some correct matches will be
missed (false-negatives)
The aim is to improve the resolution between the
curves - in the overlap, it is difficult or
impossible to establish if matches are
significant
Different methods tackle this problem in
different ways

60
Regular expressions/patterns

These are derived from single conserved regions,
which are reduced to consensus expressions for db
searches
they are minimal expressions, so sequence
information is lost
the more divergent the sequences used, the more
fuzzy & poorly discriminating the pattern
becomes
Alignment        Pattern
GAVDFIALCDRYF
GPIDFVCFCERFY    G-x-[IV]-[DE]-F-[IVL]-x(2)-C-[DE]-R-[FY](2)
GRVEFLNRCDRYY
Patterns do not tolerate similarity
sequences either match or not, regardless of how
similar they are
matching is a binary on-off event & frequently
misses true matches
single-motif methods are very hit-or-miss - how
do you know if you've encoded the best region?
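
The binary nature of pattern matching can be illustrated with a small sketch that translates a PROSITE-style pattern into a Python regular expression. The helper name prosite_to_regex is hypothetical, and the translation covers only the notation needed here (x, [..], {..} and repeat counts); it is not the PROSITE software itself. The pattern is the one derived from the three-sequence alignment on this slide, written with its brackets.

```python
import re

# One pattern element: a core (x, a residue, [ABC] or {ABC}) and an optional (n) or (n,m).
ELEMENT = re.compile(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Translate a well-formed PROSITE-style pattern into a Python regex."""
    parts = []
    for element in pattern.split("-"):
        core, low, high = ELEMENT.fullmatch(element).groups()
        if core.upper() == "X":
            atom = "."                        # any residue
        elif core.startswith("{") and core.endswith("}"):
            atom = "[^" + core[1:-1] + "]"    # any residue except those listed
        else:
            atom = core                       # single residue or [ABC] class
        if low and high:
            atom += "{%s,%s}" % (low, high)
        elif low:
            atom += "{%s}" % low
        parts.append(atom)
    return "".join(parts)


ALIGNMENT = ["GAVDFIALCDRYF", "GPIDFVCFCERFY", "GRVEFLNRCDRYY"]
regex = prosite_to_regex("G-x-[IV]-[DE]-F-[IVL]-x(2)-C-[DE]-R-[FY](2)")

for seq in ALIGNMENT + ["GAVDFMALCDRYF"]:
    print(seq, bool(re.fullmatch(regex, seq)))
```

The last sequence differs from the first only by a conservative Ile to Met change at one position, yet it fails to match: the pattern has no notion of "nearly".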

61
In the beginning was PROSITE

G_PROTEIN_RECEPTOR PATTERN
PS00237
G-protein coupled receptor signature
[GSTALIVMYWC]-[GSTANCPDE]-{EDPKRH}-x(2)-[LIVMNQGA]-
x(2)-[LIVMFT]-[GSTANC]-[LIVMFYWSTAC]-[DENH]-R
/TOTAL=919(919) /POS=869(869) /FALSE_POS=50(50) /FALSE_NEG=70
/PARTIAL=49 /UNKNOWN=0(0)

This represents an apparent 18% error rate
the actual rate is probably higher
Thus, a match to a pattern is not necessarily
true
a mis-match is not necessarily false!
False-negatives are a fundamental limitation to
this type of pattern matching
if you don't know what you're looking for, you'll
never know you missed it!
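
One reading of the statistics in the PS00237 entry that reproduces the quoted error rate of about 18 percent is to count false positives, false negatives and partial matches together against the total. This is an assumption about how the figure was obtained, not something the slide states.

```python
# Counts taken from the PS00237 entry above.
total = 919
false_pos = 50
false_neg = 70
partial = 49

# Assumed definition: every wrong, missed or partial match counts as an error.
errors = false_pos + false_neg + partial
apparent_error_rate = 100 * errors / total
print(f"{errors} of {total} = {apparent_error_rate:.1f}%")
```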

62
R-Y-x-[DT]-W-x-[LIVM]-[ST]-T-P-[LIVM](3)
64
Regular expressions/rules

Regular expression patterns are most effective
when applied to highly-conserved, family-specific
motifs
It is often possible to identify shorter generic
patterns that are characteristic of common
functional sites
Functional site                     Rule
N-glycosylation                     N-{P}-[ST]-{P}
Protein kinase C phosphorylation    [ST]-x-[RK]
Casein kinase II phosphorylation    [ST]-x(2)-[DE]
Such features result from convergence to a common
property
glycosylation sites, phosphorylation sites, etc.
They cannot be used for family diagnosis & don't
discriminate
they can only be used to suggest whether a
certain functional site might exist (which must
then be tested by experiment)
such patterns are termed rules
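
A short sketch of how such rules might be scanned across a sequence. The three rules are written in PROSITE notation as N-{P}-[ST]-{P}, [ST]-x-[RK] and [ST]-x(2)-[DE] and transcribed by hand into Python regular expressions; the function name scan and the test sequence are made up for illustration. A lookahead is used so that overlapping sites are all reported.

```python
import re

# Hand-transcribed regex equivalents of the three rules.
RULES = {
    "N-glycosylation": "N[^P][ST][^P]",
    "Protein kinase C phosphorylation": "[ST].[RK]",
    "Casein kinase II phosphorylation": "[ST].{2}[DE]",
}


def scan(sequence):
    """Return {rule name: [(0-based start, matched segment), ...]}."""
    hits = {}
    for name, regex in RULES.items():
        # Zero-width lookahead lets overlapping candidate sites be found.
        lookahead = re.compile("(?=(" + regex + "))")
        hits[name] = [(m.start(), m.group(1)) for m in lookahead.finditer(sequence)]
    return hits


# Hypothetical sequence: only the first Asn satisfies the N-glycosylation rule.
for rule, sites in scan("ANGTLNPSQNASP").items():
    print(rule, sites)
```

Every hit is only a candidate site: such short rules match widely by chance, which is why they suggest experiments rather than diagnose families.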

65
Diagnostic limitations of short motifs

Consider the sequence motif Asp-Ala-Val-Ile-Asp
(DAVID)
results of db searching for such a sequence will
differ, depending on whether we search for exact
or permissive (fuzzy) matches
Pattern                           Matches
D-A-V-I-D                         71 (99)
D-A-V-I-[DEQN]                    252
[DEQN]-A-V-I-[DEQN]               925
[DEQN]-A-[VLI]-I-[DEQN]           2,739
[DEQN]-[AG]-[VLI]-[VLI]-[DEQN]    51,506
D-A-V-E                           1,088 (1,493)
(number of matches in OWL29.6 (and OWL31.1))
Use of fuzzy regular expressions has the
potential advantage of being able to recognise
more distant relationships
but the inherent disadvantage is that more matches
will be made by chance, making it difficult to
separate out true matches from noise

66
Residue groups for fuzzy patterns

It is possible to assign residues to groups
corresponding to various biochemical properties -
e.g., charge & size
using such groups to create fuzzy expressions
theoretically ensures that resulting motifs have
sensible biochemical interpretations
small             Ala, Gly
small hydroxyl    Ser, Thr
basic             His, Lys, Arg
aromatic          Phe, Tyr, Trp
aliphatic         Val, Leu, Ile, Met
acidic/amide      Asp, Glu, Asn, Gln
small/polar       Ala, Gly, Ser, Thr, Pro
This is more flexible than exact regular
expression matching
but the inherent permissiveness of the fuzzy
approach brings an inevitable signal-to-noise
trade-off
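
As a rough sketch of the idea, an exact motif can be relaxed by replacing each residue with the group it belongs to. The function name fuzzy_pattern and the rule of taking the first group that contains a residue (in the order listed above) are assumptions for illustration; they are not how IDENTIFY or any other database builds its expressions.

```python
import re

# Residue groups from the slide, as single-letter codes, in slide order.
GROUPS = [
    ("small", "AG"),
    ("small hydroxyl", "ST"),
    ("basic", "HKR"),
    ("aromatic", "FYW"),
    ("aliphatic", "VLIM"),
    ("acidic/amide", "DENQ"),
    ("small/polar", "AGSTP"),
]


def fuzzy_pattern(motif):
    """Replace each residue by the first group containing it (assumed rule)."""
    parts = []
    for residue in motif:
        for _name, members in GROUPS:
            if residue in members:
                parts.append("[" + members + "]")
                break
        else:
            parts.append(residue)  # residue in no group stays exact
    return "".join(parts)


pattern = fuzzy_pattern("DAVID")
print(pattern)
print(bool(re.fullmatch(pattern, "EGLLN")), bool(re.fullmatch(pattern, "DAVKD")))
```

The relaxed pattern accepts EGLLN, which shares no residue with DAVID, which is exactly the signal-to-noise cost described above.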

67
Fingerprints

Fingerprints are groups of motifs excised from
alignments & used for iterative db searching
no weighting scheme is used
searches depend only on residue frequencies
resulting scoring matrices are thus sparse
Each motif trawls the database independently
search results are correlated to determine which
sequences match all the motifs & which match only
partially
no information is thrown away
Iteration refines the fingerprint & increases its
potency
fingerprints are diagnostically more powerful
than regular expressions
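
The correlation step can be sketched as simple bookkeeping over per-motif hit lists. The function name correlate_hits and the sequence identifiers are hypothetical, and real fingerprint searches also take motif order and spacing into account, which this sketch ignores.

```python
from collections import Counter


def correlate_hits(hits_per_motif):
    """Split sequences into full matches (hit by every motif) and partial ones."""
    n_motifs = len(hits_per_motif)
    counts = Counter(seq for ids in hits_per_motif.values() for seq in ids)
    full = sorted(s for s, c in counts.items() if c == n_motifs)
    partial = {s: c for s, c in counts.items() if c < n_motifs}
    return full, partial


# Hypothetical hit lists: identifiers matched by each of three motifs.
hits = {
    "motif1": {"SEQ_A", "SEQ_B", "SEQ_C"},
    "motif2": {"SEQ_A", "SEQ_B"},
    "motif3": {"SEQ_A", "SEQ_B", "SEQ_D"},
}
full, partial = correlate_hits(hits)
print("all motifs:", full)
print("partial   :", partial)
```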

68
TM domain
69
loop region
70
A fingerprinting overview
71

(a)
YVTVQHKKLRTPL
YVTVQHKKLRTPL
YVTVQHKKLRTPL
AATMKFKKLRHPL
AATMKFKKLRHPL
YIFATTKSLRTPA
VATLRYKKLRQPL
YIFGGTKSLRTPA
WVFSAAKSLRTPS
WIFSTSKSLRTPS
YLFSKTKSLQTPA
YLFTKTKSLQTPA

(b)
  T  C  A  G  N  S  P  F  L  Y  H  Q  V  K  D  E  I  W  R  M  B  X  Z
  0  0  2  0  0  0  0  0  0  7  0  0  1  0  0  0  0  2  0  0  0  0  0
  0  0  3  0  0  0  0  0  2  0  0  0  4  0  0  0  3  0  0  0  0  0  0
  6  0  0  0  0  0  0  6  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  1  0  1  1  0  3  0  0  1  0  0  0  3  0  0  0  0  0  0  2  0  0  0
  2  0  1  1  0  0  0  0  0  0  0  3  0  4  0  0  0  0  1  0  0  0  0
  4  0  1  0  0  1  0  2  0  1  3  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0 12  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  6  0  0  0  0  0  0  0  6  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0 12  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  2  0  0  0  0  0  0 10  0  0  0  0
  9  0  0  0  0  0  0  0  0  0  2  1  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0 12  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  4  0  0  2  0  0  6  0  0  0  0  0  0  0  0  0  0  0  0  0  0

(c)
  T  C  A  G  N  S  P  F  L  Y  H  Q  V  K  D  E  I  W  R  M  B  X  Z
  0  0  4  0  0  0  0  8  4 34  0  0 15  0  0  0  1  7  0  0  0  0  0
  0  4 15  0  0  0  0  0  7  0  0  0 37  0  0  0 10  0  0  0  0  0  0
 50  0  0  0  0  3  0 18  0  0  0  0  0  0  0  0  0  0  0  2  0  0  0

Key
(a) motif, with 3 conserved positions
(b) corresponding frequency matrix
(c) same matrix, but after 3 iterations
(d) same matrix, with PAM250 weighting
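
The step from motif (a) to frequency matrix (b) can be sketched in a few lines: count the residues in each column of the motif, then score a candidate segment by summing the counts of its residues. The function names and the sliding-window search are illustrative assumptions; they show the unweighted, residue-frequency idea rather than the actual PRINTS search code.

```python
from collections import Counter

# Motif (a) from the slide: 12 aligned segments, 13 positions.
MOTIF = [
    "YVTVQHKKLRTPL", "YVTVQHKKLRTPL", "YVTVQHKKLRTPL",
    "AATMKFKKLRHPL", "AATMKFKKLRHPL", "YIFATTKSLRTPA",
    "VATLRYKKLRQPL", "YIFGGTKSLRTPA", "WVFSAAKSLRTPS",
    "WIFSTSKSLRTPS", "YLFSKTKSLQTPA", "YLFTKTKSLQTPA",
]


def frequency_matrix(motif):
    """One residue counter per motif position (unobserved residues count 0)."""
    return [Counter(column) for column in zip(*motif)]


def score(segment, matrix):
    """Sum of observed residue frequencies; no substitution weighting."""
    return sum(column[residue] for residue, column in zip(segment, matrix))


def best_window(sequence, matrix):
    """Best-scoring window as (start, score); sequence must be at least motif-length."""
    width = len(matrix)
    windows = range(len(sequence) - width + 1)
    return max(((i, score(sequence[i:i + width], matrix)) for i in windows),
               key=lambda hit: hit[1])


matrix = frequency_matrix(MOTIF)
print(matrix[0])  # first position: Y, A, W and V only
print(best_window("MMMYVTVQHKKLRTPLGGG", matrix))
```

Because any residue not seen in the motif scores zero, the matrix is sparse; iteration fills it in as newly matched sequences are added to the motif.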

72
Fingerprint visualisation

Full potency of fingerprinting is gained from the
mutual context provided by motif neighbours
Important, as it inherently implies a biological
context to motifs matched in the correct order,
with appropriate distances between them
results are thus biologically more meaningful
than those from single motifs
Allows sequence identification even when parts of
the fingerprint are absent
such matches are best visualised graphically

75
Blocks

Blocks are groups of motifs derived automatically
from families identified in PRINTS & InterPro
sequences are aligned automatically & motifs are
automatically identified by searching for spaced
residue triplets (e.g., AxxxVxxC)
a block score is calculated using the BLOSUM62
matrix
validity of blocks is confirmed with a 2nd
motif-finding algorithm
blocks found by both methods are considered
reliable
Sequences within motifs are clustered to reduce
contributions to residue frequencies from sets of
closely-related sequences
each cluster is treated as a single sequence &
given a score that gives a measure of its
relatedness
the higher the weight, the more dissimilar the
segment from others in the block, the most
distant segments being given a score of 100

78
Profiles

Profiles are scoring tables derived from full
alignments
these define which residues are allowed at given
positions
which positions are conserved & which degenerate
which positions, or regions, can tolerate
insertions
the scoring system is intricate, & may include
evolutionary weights, results from structural
studies, & data implicit in the alignment
variable penalties are specified to weight
against INDELs occurring in core 2° structure
elements
Within a profile, the I & M fields contain
position-specific scores for insert & match
positions
in conserved regions, INDELs aren't totally
forbidden, but are strongly impeded by large
penalties defined in the DEFAULT field
these are superseded by more permissive values in
gapped regions
the inherent complexity of profiles renders them
highly potent discriminators, but they are
time-consuming to derive

81
Hidden Markov Models

HMMs are similar in concept to profiles
they are probabilistic models consisting of
inter-connecting states
essentially, linear chains of match, delete or
insert states
Match states are assigned to conserved columns in
an alignment
insert states allow for insertions relative to
match states
delete states allow match positions to be
skipped
thus, building an HMM requires each position in
an alignment to be assigned to match, delete or
insert states
HMMs usually perform well, but can be
over-trained
they may also suffer if created from automatic
iterative processes
if it once accepts a false match, an HMM becomes
corrupt
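
A minimal sketch of the first step in building a profile HMM: deciding which alignment columns become match states, and reading off the state path each sequence follows. The rule used here (a column is a match column when at most half of its entries are gaps) is a common heuristic assumed for illustration, and the toy alignment and function names are made up; no probabilities are estimated.

```python
def assign_columns(alignment, max_gap_fraction=0.5):
    """Label each alignment column 'M' (match) or 'I' (insert)."""
    labels = []
    for column in zip(*alignment):
        gap_fraction = column.count("-") / len(column)
        labels.append("M" if gap_fraction <= max_gap_fraction else "I")
    return "".join(labels)


def state_path(aligned_seq, columns):
    """State path of one aligned sequence: M = match, D = delete, I = insert."""
    path = []
    for residue, label in zip(aligned_seq, columns):
        if label == "M":
            path.append("D" if residue == "-" else "M")
        elif residue != "-":
            path.append("I")  # residue in an insert column
        # a gap in an insert column uses no state at all
    return "".join(path)


# Hypothetical toy alignment, '-' marks a gap.
alignment = ["CL-YE", "CLAWD", "C--WD", "CL-YE"]
columns = assign_columns(alignment)
print(columns)
for seq in alignment:
    print(seq, state_path(seq, columns))
```

Counting how often each state and residue is used along these paths is what would then give the transition and emission probabilities of the model.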

82
An HMM
(figure labels: C L Y E / C L W D)
83
Which method is best?

The range of methods available leads to familiar
problems
which should we use?
which is the most reliable?
which is the most comprehensive?
None of the pattern-recognition techniques is
infallible
each has its optimum area of application
None of the resulting pattern databases is
complete
none is the best
bearing in mind the diagnostic strengths &
weaknesses of the different approaches, & keeping
biological significance in mind, the best
strategy is to use them all

84
Current status of pattern databases

PROSITE (SIB) - 1034 entries
single motifs (regexs) - best with small highly
conserved sites
Profile library (ISREC) - 300 entries
weight matrices - good with divergent domains &
superfamilies
PRINTS (Manchester) - 1500 entries
multiple motifs (fingerprints) - best for
families and sub-families
Pfam (Sanger Centre) - 2727 entries
HMMs - good with divergent domains &
superfamilies
InterPro (EBI) - 3591 entries
derived from PRINTS, PROSITE, Profiles, Pfam,
ProDom, etc.
BLOCKS (FHCRC) - 2433 entries
multiple motifs (derived from PRINTS, InterPro,
Domo etc.)
IDENTIFY (Stanford)
permissive regexs (derived from PRINTS & InterPro)

85
Tools for predicting protein function from
sequence
86
Building a search protocol

Overview
The usual starting point
searching the primary data sources
NRDB, SPTR, etc.
Pattern recognition methods
searching the secondary sources
patterns, profiles, blocks, fingerprints & HMMs
Estimating significance
when do we believe a result?

87
A practical approach

A central goal is to predict protein function
from sequence
Given a newly-determined sequence, we want to
know
what is my protein?
to what family does it belong?
what is its function?
how can we explain its function in structural
terms?
By searching pattern dbs & fold libraries, we may
recognise patterns that allow us to infer
relationships with previously-characterised
families & folds
Given the variety of dbs to search, how do we use
them to build a sensible search protocol?

88
Protein sequence database identity search
e.g., for short fragments, pinpoints identical
matches to probe - may identify correct reading
frame
Protein sequence database similarity search
e.g., nrdb, OWL, SPSPTrEMBL - identifies
homologues to probe
Protein pattern database search
e.g., PROSITE, profiles, PRINTS, BLOCKS, Pfam -
identifies family relationships or pinpoints key
structural or functional sites
Known structure
Structure classification database query
e.g., scop, CATH, FSSP - provides details of
structural class, secondary structure
information, ligand-binding, etc.
No known structure
Protein fold pattern library search
e.g., threading - identifies compatible folds for
the probe sequence

89
Similarity searching

Whether or not an identity search finds a match,
the next step is to look for similar sequences
e.g., you may wish to know if a wider family
exists
The most rapid option is to use BLAST (Basic Local
Alignment Search Tool), flavours of it, or
FASTA
In BLAST output, look for
high scores with low P-values (unlikely to be
random)
clusters of high scores at the top of the hitlist
(a family?)
trends in the type of sequences matched
To ensure a comprehensive search, identity &
similarity searches are best performed on
composite databases
e.g., NRDB, SPSP-TrEMBL

90
Ideal results show high scores & low E-values
91
Why bother with pattern searches?

Primary searches won't always allow outright
diagnosis
BLAST & FASTA are not infallible
often can't assign mathematically significant
scores
results may be complicated by modules, domains or
compositionally-biased regions
annotations of retrieved hits may be incorrect
Pattern databases contain potent descriptors
so, distant relationships missed by BLAST may be
captured by one or more of the family or
functional site distillations

100
Structural & functional interpretation

Running db searches often does little more than
identify a protein family
this only scratches the surface - we still want
to know what our protein does & what it might
look like
The first step is to examine the detailed family
documentations in PROSITE, PRINTS & InterPro
these should help to elucidate the function of
the protein
The next step is to examine the fold
classification & structure summary resources
e.g., SCOP, CATH & PDBsum (assuming the structure
is known)

106
Estimating significance

When do we believe a result?
A real example.....

119
Conclusions

Gene prediction, structure & function prediction
are non-trivial
structure & function prediction tools are, at
best, 70% accurate
What are the lessons for sequence analysis?
when searching for distant homologues, several
dbs should be searched
different methods provide different perspectives
dbs aren't complete & their contents don't fully
overlap
The more dbs searched, the more difficult it can
be to interpret results
The more computers are involved in automating
genome annotation, the greater the need for
collaboration with biologists
The more data we have to handle, the more
rigorous we must be in our thinking (& writing)
if we are to make sense of the complexities
We are still a long way from having reliable
tools for deducing protein function from
sequence
but with the right approach, there is hope