Biological Databases for Protein Sequence Analysis

1
Biological Databases for Protein Sequence Analysis

Teresa K. Attwood
School of Biological Sciences
University of Manchester, Oxford Road
Manchester M13 9PT, UK
http://www.bioinf.man.ac.uk/dbbrowser/

2
Overview

Introduction
Web practical, science fact & fiction, some
reality checks
Biological databases
sequence, family, composite, etc.
Pattern recognition
regular expressions, fingerprints, profiles, etc.
Building a search protocol
a real example

3
Introduction

Single- & three-letter amino acid codes
G Glycine Gly P Proline Pro
A Alanine Ala V Valine Val
L Leucine Leu I Isoleucine Ile
M Methionine Met C Cysteine Cys
F Phenylalanine Phe Y Tyrosine Tyr
W Tryptophan Trp H Histidine His
K Lysine Lys R Arginine Arg
Q Glutamine Gln N Asparagine Asn
E Glutamic Acid Glu D Aspartic Acid Asp
S Serine Ser T Threonine Thr
Additional codes
B Asn/Asp Z Gln/Glu X Any amino acid

4
(No Transcript)
5
Basic definitions

Primary structure
the linear sequence of amino acids in a protein
Secondary structure
regions of local regularity
i.e., α-helices, β-strands, β-sheets & β-turns

6
Definitions contd.

Super-secondary structure
the packing of secondary structure elements into
stable units
e.g., β-barrels, βαβ units, Greek keys, etc.

7
Definitions contd.

Tertiary structure
the overall chain fold that results from packing
of secondary structure elements

8
Definitions contd.

Quaternary structure
the arrangement of separate chains within a
protein that has more than one subunit
e.g., haemoglobin

9
Definitions contd.

Quinternary structure
the arrangement of separate molecules, such as in
protein-protein or protein-nucleic acid
interactions

10
The practical - BioActivity

BioActivity: sequence analysis in action
begin with a fragment of a DNA sequence
try to find out what protein this codes for, the
family to which it belongs, & whether its
function & structure are known
The practical is entirely Web-based
be mindful of traffic & don't waste time on slow
links
Most important of all
read the instructions!
The Web is constantly evolving....
please report dead links (otherwise they'll stay
dead)!

11
Importance of sequence analysis

>900,000 sequences available in public dbs
millions more (including ESTs) in proprietary
dbs
these numbers will snowball with completion of more
genomes
so what?
Locked up in sequences is a huge amount of
structural, functional & evolutionary info
they're a highly valuable resource
By contrast, the number of unique protein structures
is 2000
a huge information deficit

12
The legacy of the genome projects: sequence-structure deficit

[Figure: non-redundant growth of sequences during
1988-2002, & the corresponding growth in the number
of structures.]

13
Challenges for bioinformatics

Spurred on by the seq/structure deficit, the
challenges are to
rationalise the mass of sequence data
derive more efficient means of data storage
design more incisive & reliable analysis tools
The imperative - to convert sequence information
into biochemical & biophysical knowledge
to decipher the structural, functional &
evolutionary clues encoded in the language of
biological sequences

14
The Holy Grail of bioinformatics

...to be able to understand the words in a
sequence sentence that form a particular protein
structure

15
The reality of sequence analysis

...isn't so glamorous....but means we can
recognise words that form characteristic
patterns, even if we don't know the precise
syntax to build complete protein sentences

16
Pattern recognition & prediction

In investigating the meaning of sequences, two
distinct analytical approaches have emerged
pattern recognition is used to detect similarity
between sequences & hence to infer related
structures & functions
ab initio prediction is used to deduce structure,
& to infer function, directly from sequence
These methods are quite different!
pattern recognition methods demand that some
characteristic has been seen before & housed in a
db
prediction methods remove the need for template
dbs, because deductions are made directly from
sequence

17
Science fact & fiction

Sequence pattern recognition is easier to
achieve, & is much more reliable, than fold
recognition
which is 50% reliable even in expert hands
Prediction is still not possible
& is unlikely to be so for decades to come (if
ever)
Structural genomics will yield representative
structures for many (but not all) proteins in
future
structures of new sequences will be determined by
modelling
prediction will become an academic exercise
But, to debunk a popular myth, knowing structure
alone does not inherently tell us function

18
A reality check

What is the function of this structure?

What is the function of this sequence?

What is the function of this motif?
the fold provides a scaffold, which can be
decorated in different ways by different
sequences to confer different functions; knowing
the fold & function allows us to rationalise how
the structure effects its function at the
molecular level

19
A test case for structural genomics
Structure-based assignment of the biochemical
function of hypothetical protein mj0577
(Zarembinski et al., PNAS 95 1998)
Although the structure co-crystallised with ATP,
the biochemical function of the protein is
unknown
20
The Twilight Zone

Prediction methods don't work because we don't
fully understand the Folding Problem
we can't read the language sequences use to
create their folds
But, with sequence analysis techniques, we can
try to find similarities between new sequences &
those in dbs
whose structures & functions we hope have been
elucidated
This is straightforward at high levels of
identity, but below 50% it is difficult to
establish relationships reliably
Analyses can be pursued with decreasing certainty
towards the Twilight Zone
20% identity, where results may look plausible
to the eye, but are no longer statistically
significant

21
Beyond the Twilight Zone

To penetrate deeper into the Twilight Zone is the
aim of most analytical methods
whether using single sequences, motifs, complex
weighting schemes or raw amino acid frequencies
Each offers a different perspective, depending on
the type of information used in the search
none gives the right answer
It is good practice to devise an analysis
protocol that uses a variety of methods
but don't expect the impossible - no method is
infallible!

22
Application areas of analysis tools

The scale indicates % identity between aligned
sequences
Alignment of 2 random seqs can produce 20%
identity
less than 20% does not constitute a significant
alignment
around this threshold is the Twilight Zone,
where alignments may appear plausible to the eye,
but can't be proved by conventional methods
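
As a rough illustration of the scale, percent identity between two already-aligned sequences can be computed as sketched below. This is only a sketch under stated assumptions: the function names, the gap character and the zone labels are hypothetical, and the 20% and 50% cut-offs are simply the figures quoted in these slides, not a statistical test.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over aligned columns in which neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip gapped columns (an assumption; conventions differ)
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def zone(identity):
    """Label an identity value using the thresholds quoted in the slides."""
    if identity < 20.0:
        return "not significant"
    if identity < 50.0:
        return "increasingly uncertain"
    return "straightforward"


identity = percent_identity("GAVDFIALCDRYF", "GPIDFVCFCERFY")
print(round(identity, 1), zone(identity))
```

Identity alone says nothing about significance for short sequences such as these; the labels only mirror the scale described above.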

23
Homology & analogy

The term homology is confounded & abused!
sequences are homologous if they are related by
divergence from a common ancestor
analogy relates to the acquisition of common
features from unrelated ancestors via convergent
evolution
e.g., β-barrels occur in soluble & membrane
proteins; the enzymes chymotrypsin & subtilisin share
groups of catalytic residues, with near identical
spatial geometries, but no other similarities
Homology is not a measure of similarity & is not
quantifiable
it is an absolute statement that sequences have a
divergent rather than a convergent relationship
the phrases "the level of homology is high" or
"the sequences show 50% homology", or any like
them, are strictly meaningless!
This is not just a semantic issue
loose use muddies thinking about evolutionary
relationships

24
A terminology muddle

The same arguments apply to 3D structures
structures may be similar, as denoted by RMS
positional deviation between compared atomic
positions
but their common evolutionary origin is a
hypothesis
the hypothesis may be correct or mistaken, but
their similarity is a fact, no matter how it is
interpreted
Similarity of sequence or structure is just that
similarity
Homology connotes a common evolutionary origin
Reeck, G.R., de Haen, C., Teller, D.C.,
Doolittle, R.F., Fitch, W.M., Dickerson, R.E.,
Chambon, P., McLachlan, A.D., Margoliash, E.,
Jukes, T.H. & Zuckerkandl, E. (1987) Homology
in proteins and nucleic acids: a terminology
muddle and a way out of it. Cell, 50, 667.

25
More challenges for sequence analysis

Much of the challenge is in getting the biology
right
this is complicated by the problem of orthology
vs paralogy
Following a search, how much functional
annotation can be legitimately inherited by a
query?
source of numerous annotation errors in dbs
error propagation could lead to an error
catastrophe
Further complications arise due to the modular nature
of proteins
modules are autonomous folding units (protein
building blocks)
confer a variety of functions on a parent protein,
by multiple combinations of the same module, or
different modules to form mosaics
Automatic analysis systems don't distinguish
orthologues from paralogues & don't consider the
modular nature of proteins

26
(No Transcript)
27

Monkeys are exploited in different Goldberg
machines, where they perform different functions
here, we could not predict a monkey sitting in
that spot, even with total knowledge of the rest
of the machine
Similarity searches are just like this
identifying the presence of a module tells little
of the function of the complete system
knowing most components of a mosaic, we can't
predict a missing one
modules (monkeys) in different proteins don't
always perform exactly the same function

28
The Midnight Zone

Notwithstanding the lessons of Goldberg machines,
identifying evolutionary links between sequences
is useful
this often implies a shared function
In the genome era, prediction of function from
sequence is of more immediate value than is the
prediction of structure
However, between distantly-related proteins,
structure is more conserved than the underlying
sequences
thus, some relationships are only apparent at the
structural level
Such relationships can't be detected by even the
most sensitive sequence comparison methods
the region of identity where sequence comparisons
fail completely to detect structural similarity
is the Midnight Zone; there is thus a
theoretical limit to the effectiveness of
sequence analysis methods

29
Significance

Appreciating that mathematical & biological
significance are different is crucial; it is
especially important in understanding the
limitations of
search & alignment algorithms, pattern
recognition techniques, functional site &
structure prediction tools
Contrary to popular opinion, there is currently
still
no biologically-reliable automatic multiple
alignment algorithm
no infallible pattern-recognition technique
no reliable gene, function or structure
prediction algorithms

30
(No Transcript)
31
Computers don't do biology!
32
Biological Databases

Overview
Sequence repositories
SWISS-PROT & TrEMBL
Composite sequence databases
NRDB, SPTrEMBL
Family (pattern) resources
PROSITE, PRINTS, profiles, Pfam, Blocks, eMOTIF
Composite family databases
InterPro

33
Primary sequence databases

In the early '80s, when sequence data started to
accumulate, several labs saw advantages to
establishing central repositories
trouble is, many labs thought this was a good
idea & made their own
Nucleic    Protein
EMBL       PIR
GenBank    SWISS-PROT
DDBJ       MIPS
           JIPID
           TrEMBL
The proliferation of dbs causes problems
do they have the same format? Which is the most
accurate? The most up-to-date? The most
comprehensive? Which should we use?

34
SWISS-PROT

Endeavours to provide high-level annotation
e.g., descriptions of the function of the
protein, the organisation of its domains, PTMs,
family & disease relationships, variants, etc.
Contains entries from >5,000 species
the bulk of these from just a handful of model
organisms
H.sapiens, E.coli, M.musculus, D.melanogaster,
S.cerevisiae, etc.
The quality of its annotations sets it apart from
other dbs
Consequently, it cannot keep pace with the rate
of data acquisition from the sequencing centres

35
(No Transcript)
36
(No Transcript)
37
TrEMBL

A computer-annotated supplement to SP
has the SP format & contains translations of all
CDSs in EMBL
It has 2 main sections
SP-TrEMBL contains all entries that will
eventually go into SP, but haven't yet been
manually annotated
REM-TrEMBL contains sequences not destined to
be in SP
Igs, fragments of <8 residues, synthetic
sequences, etc.
Arose from the need for a structured SP-like
resource, allowing rapid access to genome data,
without compromising the quality of SP by
including entries with poor analysis &
insufficient annotation

38
(No Transcript)
39
Composite sequence databases

A solution to the problem of proliferating dbs is
to compile a composite
these render searches very efficient, especially
if non-redundant
Trouble is, there are now several composites,
each with its own format & redundancy criteria
the most commonly used are
NRDB               SPTrEMBL
PDB                SWISS-PROT
SWISS-PROT         TrEMBL
PIR
GenPept
GenPept updates
NRDB & SPTrEMBL are non-identical, not
non-redundant
but which is best? Which the most comprehensive?
The most up-to-date? Which should we use?

40
NRDB

NRDB is built locally at the NCBI
it includes weekly updates of SP & daily updates
of GenBank, so is up-to-date & comprehensive
But the simplistic manner of its construction
causes problems
multiple copies of the same protein are retained
as a result of polymorphisms &/or sequencing
errors
errors corrected in SP are reintroduced when
retranslated from DNA
numerous sequences are duplicates of existing
fragments
The contents of the db are thus error-prone &
redundant
NRDB is the default db of the NCBI BLAST service
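
The distinction between "non-identical" and "non-redundant" can be made concrete with a small sketch. The code below removes only exact duplicates, which is the kind of simplistic construction described above; it is not NRDB's actual build procedure. The record names and sequences are made up for illustration.

```python
def drop_identical(records):
    """Keep the first record seen for each distinct sequence string."""
    seen = set()
    kept = []
    for name, seq in records:
        if seq not in seen:
            seen.add(seq)
            kept.append((name, seq))
    return kept


def is_fragment_of_another(seq, records):
    """True if seq occurs inside a different, longer sequence in records."""
    return any(seq != other and seq in other for _, other in records)


# Hypothetical entries: a full-length protein, an exact copy, a one-residue
# variant (e.g., a polymorphism or sequencing error) and a fragment.
records = [
    ("full_length", "MKTAYIAKQRQISFVKSHFSRQ"),
    ("exact_copy", "MKTAYIAKQRQISFVKSHFSRQ"),
    ("variant", "MKTAYIAKQRQISFVKSHFSRH"),
    ("fragment", "AKQRQISFVK"),
]

non_identical = drop_identical(records)
print([name for name, _ in non_identical])
```

Only the exact copy is removed; the variant and the fragment survive, so the result is non-identical but still redundant.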

41
SPTrEMBL

This resource is intended to be both
comprehensive & minimally redundant
It contains fewer errors than NRDB, but is not
truly non-redundant
30% of the combined total of SP & TrEMBL is
non-unique
Further reduction of error rates requires more
manual intervention & better expert db management
systems

42
Family (pattern) databases

As well as 1° resources, there are also many
family or pattern dbs derived from them
trouble is, they use different 1° sources &
different analysis methods, & all have different
formats!
But it isn't all bad - SWISS-PROT is emerging as
a standard, & most pattern dbs use it as their
basis
Database   Source               Stored information
PROSITE    SWISS-PROT           Regular expressions (patterns)
PRINTS     SWISS-PROT/TrEMBL    Aligned motifs (fingerprints)
Pfam       SWISS-PROT/TrEMBL    Hidden Markov Models (HMMs)
Profiles   SWISS-PROT           Weight matrices (profiles)
Blocks     InterPro/PRINTS      Weighted motifs (blocks)
eMOTIF     Blocks/PRINTS        Permissive regular expressions

43
Why create pattern databases?

Pattern dbs arise from the need to make more
specific functional diagnoses than are possible
simply by searching the 1° dbs
They are built on the principle that homologous
sequences may be gathered together in multiple
alignments, within which are regions (motifs)
that show little variation
these motifs usually reflect some vital
biological role in terms of either structure or
function
Motifs are exploited in different ways to build
diagnostic patterns for protein families
new sequences can be searched against dbs of such
patterns to see if they can be assigned to known
families
hence they offer a fast track to the inference of
function

44
What's in a sequence?
45
Methods for family analysis
Single motif methods
Fuzzy regex (eMOTIF)
Exact regex (PROSITE)
Full domain alignment methods
Profiles (PROFILE LIBRARY)
HMMs (Pfam)
Multiple motif methods
Identity matrices (PRINTS)
Weight matrices (Blocks)
46
The challenge of family analysis

highly divergent family with single function?
superfamily with many diverse functional
families?
must distinguish these if function analysis is done in
silico
a tough challenge!

47
Know your family
48
The problem with domains
49
PROSITE

The first pattern db
based on the idea that a protein family can be
characterised by a pattern of conserved residues
within a single motif
Sequence information in motifs is reduced to
consensus or regular expressions (regexs); the
seed regex is used to search SP
results are inspected manually to achieve optimal
results
Some families can't be characterised by single
motifs
here, additional regexs are created until an
optimal set is achieved that captures most or all
of the family
results are then manually annotated for inclusion
in the db

50
R-Y-x-[DT]-W-x-[LIVM]-[ST]-T-P-[LIVM](3)
51
(No Transcript)
52
(No Transcript)
53
PRINTS

Most protein families are characterised by >1
motif
it is sensible to use many/all of them to build a
diagnostic signature
This is the principle of fingerprints
these offer improved diagnostic reliability by
virtue of the biological context provided by
motif neighbours
Motifs are excised from alignments by hand
residue information is augmented via iterative
searches
results are manually annotated prior to inclusion
in the db

54
Motif context
order
1
2
3
4
5
interval
55
(No Transcript)
56
SUMMARY INFORMATION
  37 codes involving 8 elements
   0 codes involving 7 elements
   0 codes involving 6 elements
   0 codes involving 5 elements
   0 codes involving 4 elements
   1 codes involving 3 elements
   0 codes involving 2 elements

COMPOSITE FINGERPRINT INDEX
  8 | 37 37 37 37 37 37 37 37
  7 |  0  0  0  0  0  0  0  0
  6 |  0  0  0  0  0  0  0  0
  5 |  0  0  0  0  0  0  0  0
  4 |  0  0  0  0  0  0  0  0
  3 |  1  0  0  0  1  1  0  0
  2 |  0  0  0  0  0  0  0  0
  --+------------------------
    |  1  2  3  4  5  6  7  8

True positives..
PRIO_COLGU PRIO_MACFA PRIO_CEREL PRIO_ODOHE
PRIO_GORGO PRIO_PANTR PRIO_HUMAN O46648
PRIO_SHEEP PRIO_CALJA PRIO_BOVIN PRP2_BOVIN
PRIO_ATEPA PRIO_SAISC PRIO_PREFR PRIO_PONPY
O75942     PRIO_CAPHI PRIO_CEBAP PRIO_CAMDR
PRIO_FELCA PRP1_TRAST PRIO_RABIT PRP2_TRAST
PRIO_PIG   PRIO_CANFA PRIO_CRIGR PRIO_CRIMI
Q15216     PRIO_RAT   PRIO_CERAE PRIO_MUSPF
PRIO_MUSVI PRIO_MESAU PRIO_MOUSE O46593
PRIO_TRIVU

Subfamily: Codes involving 3 elements
Subfamily True positives..
PRIO_CHICK
57
(No Transcript)
58
Profiles & Pfam

An alternative to motif-based methods exploits
regions between motifs, which also contain
valuable information
the full alignment effectively becomes the
discriminator
A complex scoring scheme allowing for
substitutions & INDELs is used to create
family-specific profiles
These profiles can be used to detect distant
relationships, where only a few residues are
conserved
this is the basis of the Profile library
In an extension of this approach, alignments are
encoded as probabilistic models termed HMMs
this is the basis of Pfam

59
(No Transcript)
60
(No Transcript)
61
(No Transcript)
62
Blocks & eMOTIF

Various advantages to storing motifs in a raw
form
no information is lost, different scoring
schemes may be used to confer different
diagnostic potentials on the same data
Additional dbs have arisen in this way
Blocks uses families identified in InterPro,
aligns the sequences & detects motifs
automatically
BLOCKS-format PRINTS uses motifs in PRINTS with
the Blocks scoring scheme
eMOTIF creates permissive regexs from Blocks &
PRINTS
These dbs are derived fully automatically & hence
offer
no family annotation (they link back to InterPro &
PRINTS)
no further family coverage

63
(No Transcript)
64
Composite pattern databases

To simplify sequence analysis, the family
databases are being integrated to create a
unified annotation resource: InterPro
release 4.0 contains 4691 entries
a central annotation resource, with pointers to
its satellite dbs
initial partners were PRINTS, PROSITE, profiles &
Pfam
new partners include ProDom, TIGRfam, SMART &
hopefully others (e.g., Blocks, MetaFam)
lags behind its sources
major role in fly & human genome annotation

65
(No Transcript)
66
(No Transcript)
67
Pattern Recognition

Overview
Determining significance of db matches
Pattern recognition methods
regular expression patterns & rules
fingerprints & blocks
profiles & HMMs
Current status of pattern dbs

68
Pattern recognition methods

These methods classify proteins into families
the basis of the methods is multiple sequence
alignment
They depend on developing representations of
conserved elements of alignments that may be
diagnostic of structure or function, whether from
homologous sequence families, or from
sequences that share some structural/functional
domains

69
Single motif methods
Fuzzy regex (eMOTIF)
Exact regex (PROSITE)
Full domain alignment methods
Profiles (PROFILE LIBRARY)
HMMs (Pfam)
Multiple motif methods
Identity matrices (PRINTS)
Weight matrices (Blocks)
70
Determining significance of database matches

When searching a db, the challenge for analysis
methods is to determine if matches are related
(true-positive) or unrelated (true-negative)
At a given scoring threshold, it is likely that
unrelated sequences will be matched erroneously
(false-positives) & some correct matches will be
missed (false-negatives)
The aim is to improve the resolution between the
curves - in the overlap, it is difficult or
impossible to establish if matches are
significant
Different methods tackle this problem in
different ways
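
The effect of a scoring threshold can be sketched by counting the four outcomes for a list of scored matches. This is a minimal illustration: the function name, the scores and the related/unrelated labels below are invented, and in practice the true relationships are exactly what is unknown.

```python
def confusion_counts(scored_matches, threshold):
    """Count TP/FP/FN/TN for (score, is_related) pairs at a given threshold."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for score, is_related in scored_matches:
        reported = score >= threshold  # a match is reported at or above the threshold
        if reported:
            counts["TP" if is_related else "FP"] += 1
        else:
            counts["FN" if is_related else "TN"] += 1
    return counts


# Hypothetical search results: (score, whether the sequence is truly related).
HITS = [(95, True), (80, True), (62, False), (60, True),
        (45, True), (40, False), (20, False)]

for threshold in (50, 70):
    print(threshold, confusion_counts(HITS, threshold))
```

Raising the threshold from 50 to 70 removes the false positive but misses one more true family member, which is the trade-off in the overlap between the two curves.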

71
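Resolving true & false matches
[Figure: number of matches (N) plotted against Score;
the true-negative curve is labelled]
72
Resolving true & false matches
[Figure: number of matches (N) plotted against Score;
the true-negative curve is labelled]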
73
Regular expressions (patterns)

These are derived from single conserved regions
in alignments
they are minimal expressions, so sequence
information is lost
the more divergent the sequences used, the more
fuzzy & poorly discriminating the regex becomes
Alignment        Regex
GAVDFIALCDRYF
GPIDFVCFCERFY    G-X-[IV]-[DE]-F-[IVL]-X(2)-C-[DE]-R-[FY](2)
GRVEFLNRCDRYY
Regexs do not tolerate similarity
sequences either match or not, regardless of how
similar they are
matching is a binary on-off event & frequently
misses true matches
single-motif methods are very hit-or-miss - how
do you know if you've encoded the best region?
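
The all-or-nothing behaviour of regex matching can be sketched in a few lines. The translator below assumes the usual PROSITE-style notation ([..] for allowed residues, {..} for excluded residues, x for any residue, (n) for repeats), in which the regex for the alignment on this slide reads G-X-[IV]-[DE]-F-[IVL]-X(2)-C-[DE]-R-[FY](2). The function name is hypothetical and only this subset of the syntax is handled.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.split("-"):
        m = re.fullmatch(
            r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Za-z])(?:\((\d+)(?:,(\d+))?\))?", element
        )
        if m is None:
            raise ValueError(f"unsupported element: {element!r}")
        core, low, high = m.groups()
        if core in ("x", "X"):
            core = "."                        # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # excluded residues
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


regex = prosite_to_regex("G-X-[IV]-[DE]-F-[IVL]-X(2)-C-[DE]-R-[FY](2)")
alignment = ["GAVDFIALCDRYF", "GPIDFVCFCERFY", "GRVEFLNRCDRYY"]

for seq in alignment + ["GAVDFIALCDRYW"]:
    print(seq, bool(re.fullmatch(regex, seq)))
```

The last sequence differs from the first by a single residue (W for F at the final position) yet fails to match at all, however similar it is otherwise.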

74
In the beginning was PROSITE

G_PROTEIN_RECEPTOR PATTERN
PS00237
G-protein coupled receptor signature
[GSTALIVMYWC]-[GSTANCPDE]-{EDPKRH}-x(2)-[LIVMNQGA]-
x(2)-[LIVMFT]-[GSTANC]-[LIVMFYWSTAC]-[DENH]-R
/TOTAL=1121(1121) /POS=1057(1057)
/FALSE_POS=64(64)
/FALSE_NEG=112 /PARTIAL=48 UNKNOWN=0(0)
This represents an apparent 20% error rate
the actual rate is probably higher
Thus, a match to a pattern is not necessarily
true
a mis-match is not necessarily false!
False-negatives are a fundamental limitation to
this type of pattern matching
if you don't know what you're looking for, you'll
never know you missed it!
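
One way to arrive at a figure close to the quoted error rate is to count every false-positive, false-negative and partial match against the total number of matches. This derivation is an assumption (the slide does not say how the rate was computed); the counts are those listed for PS00237 above.

```python
# Counts quoted on the slide for PROSITE entry PS00237.
total = 1121
true_pos = 1057
false_pos = 64
false_neg = 112
partial = 48

# Assumed derivation: every false, missed or partial match counts as a problem.
problem_matches = false_pos + false_neg + partial
apparent_error_rate = problem_matches / total

print(f"problem matches: {problem_matches}")
print(f"apparent error rate: {apparent_error_rate:.1%}")
```

Under this assumption the rate comes out at about 20% (224 of 1121).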

75
(No Transcript)
76
Regular expressions (rules)

Regex patterns are most effective when applied to
highly-conserved, family-specific motifs
It is often possible to identify shorter generic
patterns within sequences, characteristic of
common functional sites
Functional site                      Rule
N-glycosylation                      N-{P}-[ST]-{P}
Protein kinase C phosphorylation     [ST]-X-[RK]
Casein kinase II phosphorylation     [ST]-X(2)-[DE]
Such features result from convergence to a common
property
glycosylation sites, phosphorylation sites, etc.
They cannot be used for family diagnosis, as they don't
discriminate
they can only be used to suggest whether a
certain functional site might exist (which must
then be tested by experiment)
such patterns are normally termed rules

77
Residue groups for fuzzy regexs

It is possible to assign residues to groups based
on various biochemical properties, e.g., charge &
size
using such groups theoretically ensures that
resulting regexs have sensible biochemical
interpretations
small Ala, Gly
small hydroxyl Ser, Thr
basic His, Lys, Arg
aromatic Phe, Tyr, Trp
aliphatic Val, Leu, Ile, Met
acidic/amide Asp, Glu, Asn, Gln
small/polar Ala, Gly, Ser, Thr, Pro
This is more flexible than exact regex matching

78
Diagnostic limitations

Consider the sequence motif Asp-Ala-Val-Ile-Asp
(DAVID)
results of searching for such a motif will
differ, depending on the db, the motif length &
whether we use exact or permissive fuzzy regexs
Pattern                           Matches
D-A-V-I-D                         71 (99)
D-A-V-I-[DEQN]                    252
[DEQN]-A-V-I-[DEQN]               925
[DEQN]-A-[VLI]-I-[DEQN]           2,739
[DEQN]-[AG]-[VLI]-[VLI]-[DEQN]    51,506
D-A-V-E                           1,088 (1,493)
(number of matches in OWL29.6 (& OWL31.1))
Use of fuzzy regexs has the potential advantage
of being able to recognise more distant
relationships
but the inherent disadvantage that more matches
will be made by chance, making it difficult to
separate true matches from noise

79
Fingerprints

Fingerprints are groups of conserved (ungapped)
motifs excised from alignments & used for
iterative db searching
no weighting scheme is used
searches depend only on residue frequencies
resulting scoring matrices are thus sparse
Each motif trawls the db independently
search results are correlated to determine which
sequences match all the motifs & which match only
partially
no information is thrown away
The iterative process refines the fingerprint &
increases its power
potency is gained from the mutual context of
motif neighbours
results are biologically more meaningful than
those from single motifs

80
TM domain
TM domain
loop region
81
loop region
TM domain
TM domain
82
A fingerprinting overview
PRINTS
annotation
83
How fingerprints are stored
84

(a)
YVTVQHKKLRTPL
YVTVQHKKLRTPL
YVTVQHKKLRTPL
AATMKFKKLRHPL
AATMKFKKLRHPL
YIFATTKSLRTPA
VATLRYKKLRQPL
YIFGGTKSLRTPA
WVFSAAKSLRTPS
WIFSTSKSLRTPS
YLFSKTKSLQTPA
YLFTKTKSLQTPA

(b)
 T  C  A  G  N  S  P  F  L  Y  H  Q  V  K  D  E  I  W  R  M  B  X  Z
 0  0  2  0  0  0  0  0  0  7  0  0  1  0  0  0  0  2  0  0  0  0  0
 0  0  3  0  0  0  0  0  2  0  0  0  4  0  0  0  3  0  0  0  0  0  0
 6  0  0  0  0  0  0  6  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 1  0  1  1  0  3  0  0  1  0  0  0  3  0  0  0  0  0  0  2  0  0  0
 2  0  1  1  0  0  0  0  0  0  0  3  0  4  0  0  0  0  1  0  0  0  0
 4  0  1  0  0  1  0  2  0  1  3  0  0  0  0  0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0  0  0  0  0  0 12  0  0  0  0  0  0  0  0  0
 0  0  0  0  0  6  0  0  0  0  0  0  0  6  0  0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0 12  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  0  0  0  0  2  0  0  0  0  0  0 10  0  0  0  0
 9  0  0  0  0  0  0  0  0  0  2  1  0  0  0  0  0  0  0  0  0  0  0
 0  0  0  0  0  0 12  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 0  0  4  0  0  2  0  0  6  0  0  0  0  0  0  0  0  0  0  0  0  0  0

(c) [first 3 rows]
 T  C  A  G  N  S  P  F  L  Y  H  Q  V  K  D  E  I  W  R  M  B  X  Z
 0  0  4  0  0  0  0  8  4 34  0  0 15  0  0  0  1  7  0  0  0  0  0
 0  4 15  0  0  0  0  0  7  0  0  0 37  0  0  0 10  0  0  0  0  0  0
50  0  0  0  0  3  0  0 18  0  0  0  0  0  0  0  0  0  0  2  0  0  0
Key
(a) motif, with 3 conserved positions
(b) corresponding frequency matrix
(c) same matrix, but after 3 iterations
(d) same matrix, with PAM250 weighting
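
The unweighted frequency matrix for a motif can be reproduced by simply counting residues in each column. The sketch below uses the 12-sequence motif and the column order (T C A G N ...) shown on this slide; the function and variable names are hypothetical.

```python
from collections import Counter

# Column order of the matrix as printed on the slide.
ALPHABET = "TCAGNSPFLYHQVKDEIWRMBXZ"

# The ungapped motif shown on the slide.
MOTIF = [
    "YVTVQHKKLRTPL", "YVTVQHKKLRTPL", "YVTVQHKKLRTPL",
    "AATMKFKKLRHPL", "AATMKFKKLRHPL", "YIFATTKSLRTPA",
    "VATLRYKKLRQPL", "YIFGGTKSLRTPA", "WVFSAAKSLRTPS",
    "WIFSTSKSLRTPS", "YLFSKTKSLQTPA", "YLFTKTKSLQTPA",
]


def frequency_matrix(motif, alphabet=ALPHABET):
    """One row per motif position, one count per residue in alphabet order."""
    rows = []
    for column in zip(*motif):
        counts = Counter(column)
        rows.append([counts.get(residue, 0) for residue in alphabet])
    return rows


matrix = frequency_matrix(MOTIF)

# Positions (1-based) at which a single residue occupies the whole column.
conserved = [i + 1 for i, row in enumerate(matrix) if max(row) == len(MOTIF)]

for row in matrix:
    print(" ".join(f"{count:2d}" for count in row))
print("conserved positions:", conserved)
```

No weighting is applied, so most cells are zero; this is the sparseness referred to in the description of fingerprints.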

85
Fingerprint visualisation

The full potency of fingerprinting is gained from
the mutual context provided by motif neighbours
This is important, as the method inherently
implies a biological context to motifs that are
matched in the correct order in the query
sequences, with appropriate distances between
them
This allows sequence identification even when
parts of the fingerprint are absent
e.g., a sequence that matches only 4 of 7 motifs
may still be diagnosed as a true match if the
pattern of motif matching is consistent with that
expected of true neighbouring motifs
Such matches are best visualised graphically

86
Visualising fingerprints
ID
PRINTS
N
C
Query sequence
Missing motif?
87
N
C
88
N
C
89
(No Transcript)
90
(No Transcript)
91
Blocks

Blocks are groups of motifs derived automatically
from families identified in InterPro
sequences are aligned automatically & motifs are
automatically identified by searching for spaced
residue triplets (e.g., AxxxVxxC)
a block score is calculated using the BLOSUM62
matrix
validity of blocks is confirmed with a 2nd
motif-finding algorithm
blocks found by both methods are considered
reliable
Sequences within motifs are clustered to reduce
contributions to residue frequencies from sets of
closely-related sequences
each cluster is treated as a single sequence &
given a score that gives a measure of its
relatedness
the higher the weight, the more dissimilar the
segment from others in the block, the most
distant being given a score of 100
segments <80% similar are separated by blank lines

92
(No Transcript)
93
CSC triplet
94
Profiles

Profiles are scoring tables derived from full
domain alignments
these define which residues are allowed at given
positions
which positions are conserved & which degenerate
which positions, or regions, can tolerate
insertions
the scoring system is intricate, & may include
evolutionary weights, results from structural
studies, & data implicit in the alignment
variable penalties are specified to weight
against INDELs occurring in core 2° structure
elements
Within a profile, the I & M fields contain
position-specific scores for insert & match
positions
in conserved regions, INDELs aren't totally
forbidden, but are strongly impeded by large
penalties defined in the DEFAULT field
these are superseded by more permissive values in
gapped regions
the inherent complexity of profiles renders them
highly potent discriminators, but they are
time-consuming to derive

95
(No Transcript)
96
(No Transcript)
97
Hidden Markov Models

HMMs are similar in concept to profiles by virtue
of encoding full domain alignments
they are probabilistic models consisting of a
number of inter-connecting states
essentially, linear chains of match, delete or
insert states
Match states are assigned to conserved columns in
an alignment
insert states allow for insertions relative to
match states
delete states allow match positions to be skipped
thus, building an HMM from an alignment requires
each position to be assigned either to match,
delete or insert states
HMMs usually perform well, but can be
over-trained
they may also suffer if they are created from an
iterative automatic alignment process - if this
once accepts a false match, the HMM will become
corrupt
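
The assignment of alignment positions to match, insert and delete states can be sketched as follows. The rule used here (a column is a match column unless more than half of its entries are gaps) is a common heuristic adopted as an assumption, not necessarily the rule used to build Pfam; the function names and the toy alignment are hypothetical, and no probabilities are estimated.

```python
def assign_columns(alignment, max_gap_fraction=0.5):
    """Mark each alignment column as a match column (True) or insert column (False)."""
    n_seqs = len(alignment)
    is_match = []
    for col in range(len(alignment[0])):
        gaps = sum(1 for seq in alignment if seq[col] == "-")
        is_match.append(gaps / n_seqs <= max_gap_fraction)
    return is_match


def state_path(aligned_seq, is_match):
    """State string for one sequence: M = match, D = delete, I = insert."""
    path = []
    for residue, match_col in zip(aligned_seq, is_match):
        if match_col:
            path.append("M" if residue != "-" else "D")  # skipped match position
        elif residue != "-":
            path.append("I")                             # residue in an insert column
    return "".join(path)


alignment = ["CLY-E", "CLW-D", "C-YAE"]   # toy alignment
columns = assign_columns(alignment)
for seq in alignment:
    print(seq, state_path(seq, columns))
```

A full HMM would go on to estimate emission and transition probabilities for these states from the counts in each column.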

98
An HMM
[Figure: an HMM; residue labels C, L, Y, E & C, L, W, D]
99
Which craft is best?

The wide variety of methods available leads to
familiar problems
which should we use?
which is the most reliable?
which is the most comprehensive?
......etc.
None of the pattern-recognition techniques is
infallible, & none of the resulting pattern dbs
is complete
bearing in mind the diagnostic strengths &
weaknesses of the different approaches, & always
keeping biological significance in mind, the best
strategy is simply to use them all

100
Overview of resources

PROSITE (SIB) - 1108 entries
single motifs (regexs) - best with small highly
conserved sites
Profile library (ISREC) - 300 entries
weight matrices - good with divergent domains &
superfamilies
PRINTS (Manchester) - 1750 entries
multiple motifs (fingerprints) - best for
families and sub-families
Pfam (Sanger Centre) - 3071 entries
HMMs - good with divergent domains &
superfamilies
InterPro (EBI) - 4691 entries
derived from PRINTS, PROSITE, Profiles, Pfam,
ProDom, etc.
Blocks (FHCRC) - 2608 entries
multiple motifs (derived from InterPro & PRINTS)
eMOTIF (Stanford)
permissive regexs (derived from PRINTS & BLOCKS)

101
Building a Search Protocol

Overview
The usual starting point
searching the primary data sources
Pattern recognition methods
searching the secondary sources
Structural & functional interpretation of results
Estimating significance
when do we believe a result?

102
A practical approach

Given a newly-determined sequence, we want to
know
what is my protein?
to what family does it belong?
what is its function?
how can we explain its function in structural
terms?
To this end, by searching pattern dbs & fold
libraries, we may recognise patterns that allow
us to infer relationships with
previously-characterised families/folds
Given the variety of dbs to search, how do we use
them to build a sensible search protocol for
novel sequences?

103

Protein sequence database identity search
e.g., for short fragments, pinpoints identical matches
to probe - may identify correct reading frame
Protein sequence database similarity search
e.g., nrdb, SPSPTrEMBL - identifies potential
homologues to probe
Protein pattern database search
e.g., PROSITE, profiles, PRINTS, Blocks, Pfam -
identifies family relationships or pinpoints key
structural or functional sites
Known structure
Structure classification database query
e.g., scop, CATH, FSSP - provides details of
structural class, secondary structure information,
ligand-binding, etc.
No known structure
Protein fold pattern library search
e.g., threading - identifies compatible folds for
the probe sequence

104
Searching the primary databases

Identity searching
the fastest test of an unknown fragment is to
perform an identity search. This will reveal in
seconds whether an exact match to the unknown
peptide already exists
This can be helpful in identifying the correct
reading frame following a 6-frame translation
ccgtactacaactacgctggtgcattcaag
Forward 0 PYYNYAGAFK
Forward 1 RTTTTLVHS
Forward 2 VLQLRWCIQ
Reverse 0 LECTSVVVVR
Reverse 1 LNAPA!L!Y
Reverse 2 !MHQRSCST
(! denotes a stop codon)
Identity search hits for the Forward 0 peptide
TRFE_XENLA 207 AGIKEHKCSRSNNE PYYNYAGAFK CLQDDQGDVAFVKQ
XLTRSFER 207 AGIKEHKCSRSNNE PYYNYAGAFK CLQDDQGDVAFVKQ
TRFE_XENLA TRANSFERRIN PRECURSOR - XENOPUS LAEVIS
XLTRSFER TRANSFERRIN PRECURSOR - XENOPUS LAEVIS
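
The 6-frame translation and identity search on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the software used for the slide: the function names are made up, the "database" is just the short excerpt printed above (not the full entries), and stop codons are written as ! to follow the slide's notation.

```python
# Standard genetic code with bases ordered t, c, a, g.
# '!' marks a stop codon, following the notation used on the slide.
BASES = "tcag"
AMINO = "FFLLSSSSYY!!CC!WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (a, c, g, t only)."""
    return dna.lower().translate(str.maketrans("acgt", "tgca"))[::-1]


def translate(dna, offset):
    """Translate complete codons of dna, starting at the given offset."""
    dna = dna.lower()
    codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


def six_frame_translation(dna):
    """Return the three forward and three reverse reading frames."""
    frames = {}
    for offset in range(3):
        frames[f"Forward {offset}"] = translate(dna, offset)
    reverse = reverse_complement(dna)
    for offset in range(3):
        frames[f"Reverse {offset}"] = translate(reverse, offset)
    return frames


def identity_search(frames, database):
    """List (frame, entry, offset) for every exact peptide match."""
    hits = []
    for frame, peptide in frames.items():
        for entry, sequence in database.items():
            position = sequence.find(peptide)
            if position != -1:
                hits.append((frame, entry, position))
    return hits


# Toy database: only the excerpt shown on the slide, so offsets are
# relative to this excerpt, not to the full database entries.
EXCERPT = "AGIKEHKCSRSNNE" + "PYYNYAGAFK" + "CLQDDQGDVAFVKQ"
DATABASE = {"TRFE_XENLA": EXCERPT, "XLTRSFER": EXCERPT}

frames = six_frame_translation("ccgtactacaactacgctggtgcattcaag")
hits = identity_search(frames, DATABASE)
```

Only the Forward 0 peptide finds an exact match, which is how the identity search singles out the correct reading frame.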

105
Similarity searching

Whether or not an identity search finds a match,
the next step is to look for similar sequences
e.g., you may wish to know if a wider family
exists
The most rapid & simple option is to use BLAST,
flavours of it, or FastA
Several features are worthy of note in BLAST
output
look for high scores with low P-values (unlikely
to be random)
look for clusters of high scores at the top of
the hitlist (a family?)
look for trends in the type of sequences matched
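
BLAST output may quote significance as a P-value or as an E-value (the expected number of chance hits). Under the usual Poisson assumption for chance hits the two are related by P = 1 - exp(-E). This relation comes from standard BLAST statistics, not from the slides, and the function name below is made up for illustration.

```python
import math


def evalue_to_pvalue(e_value):
    """Probability of at least one chance hit, given the expected number.

    Assumes chance hits are Poisson distributed: P = 1 - exp(-E).
    """
    if e_value < 0:
        raise ValueError("E-value must be non-negative")
    # expm1 keeps precision for the very small E-values of good hits
    return -math.expm1(-e_value)
```

For small values the two are nearly identical (E = 0.001 gives P of about 0.001), so a low P-value and a low E-value point to the same hits.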

106
Ideal results show high scores & low E-values
107
(No Transcript)
108
Why bother with pattern searches?

Primary searches won't always allow outright
diagnosis
BLAST & FASTA are not infallible
BLAST, in particular, often can't assign
significant scores
results may be complicated by the presence of
modules, or compositionally-biased regions
annotations of retrieved hits may be incorrect
Pattern dbs contain potent descriptors
so, distant relationships missed by BLAST may be
captured by one or more of the family or
functional site distillations

109
(No Transcript)
110
Searching the pattern databases

Searching PROSITE
when using PROSITE's Web form, it is advisable to
exclude rules from the search, otherwise output
is filled with spurious matches
results are either match, or no match
the user has to judge whether hits are significant
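
The match / no match behaviour follows from the fact that a PROSITE pattern is, in effect, a regular expression. A minimal Python sketch of converting PROSITE-style pattern syntax to a regex and scanning a sequence is given below. It is not PROSITE's own software: the function names, the example pattern (written in PROSITE syntax for illustration) and the toy sequence are all assumptions.

```python
import re

# One pattern element: optional '<' anchor, a residue specification,
# an optional repeat count such as (3) or (2,4), optional '>' anchor.
_ELEMENT = re.compile(
    r"^(<?)(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?(>?)$"
)


def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern string to Python regex syntax."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        match = _ELEMENT.match(element)
        if match is None:
            raise ValueError(f"cannot parse pattern element: {element!r}")
        start, residue, low, high, end = match.groups()
        if residue == "x":                 # x = any residue
            body = "."
        elif residue.startswith("{"):      # {ABC} = any residue except A, B, C
            body = "[^" + residue[1:-1] + "]"
        else:                              # single letter or [ABC] set
            body = residue
        if low is not None:                # (n) or (n,m) repetition
            body += "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if start else "") + body + ("$" if end else ""))
    return "".join(parts)


def scan(pattern, sequence):
    """Return (start, matched_text) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]


# Illustrative pattern in PROSITE syntax and a made-up toy sequence
PATTERN = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H."
TOY_SEQUENCE = "MK" + "CAACAAALAAAAAAAAHAAAH" + "GG"
matches = scan(PATTERN, TOY_SEQUENCE)
```

A sequence either satisfies the expression or it does not, with no score attached, which is why the user has to judge whether a hit is significant.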

111
(No Transcript)
112
(No Transcript)
113
Searching the pattern databases

Searching Profiles
the SIB Web server offers access both to profiles
within PROSITE & pre-release (undocumented)
profiles
results are highly specific & generally
diagnostically reliable
if no match is returned, it's usually because the
entry isn't in the db
matches to undocumented profiles are often
dead-ends

114
(No Transcript)
115
Searching the pattern databases

Searching Pfam
results are returned in HTML tables accompanied
by simple graphics to illustrate matched domains
results are specific & usually diagnostically
reliable
E-values provide the measure of confidence

116
(No Transcript)
117
Searching the pattern databases

Searching PRINTS
results are returned in HTML tables on different
levels
a best "guess"
the top 10 best-scoring matches
the raw data
graphical options provide a visual impression of
the quality of matches
results are specific & usually diagnostically
reliable
combined E- & p-values provide the measure of
confidence

118
(No Transcript)
119
(No Transcript)
120
Searching the pattern databases

Searching Blocks
if results of searching PROSITE & PRINTS are
positive, we would expect these to be confirmed
by searches of the Blocks dbs
key features to note in the output are
the description line, the accession codes (which
indicate which is the matched motif), the
best-scoring or anchor block
most important is the detection of multiple block
hits - where this happens, an E-value denotes the
significance of the match
single block matches are usually spurious

121
(No Transcript)
122
Searching the pattern databases

Searching eMOTIF
as with Blocks, if results of searching PROSITE &
PRINTS are positive, this should be confirmed by
searches of eMOTIF
output is given at several stringency levels,
which indicate the number of false matches to
expect in the reported results

123
(No Transcript)
124
Which approach is best?

BLAST frequently fails to assign significant
scores
The hit-or-miss nature of single-motif regular
expressions can render them worthless
In spite of (because of?) their complexity,
profiles & HMMs are often out-performed by
simpler motif methods
The non-weighting system of fingerprints means
that Twilight relationships may be missed
The scoring system used to create blocks
generates large amounts of noise that may obscure
the signal
Only PROSITE & PRINTS are fully manually
annotated
No method alone is best
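
Since no method alone is best, one simple way to combine them is to count how many methods agree on a family. The Python sketch below is only an illustration: the function name, the family labels and the hit lists are invented, and it assumes that family names have already been mapped to a common label across databases.

```python
from collections import Counter


def consensus(hits_by_method):
    """Rank families by the number of methods that report them.

    hits_by_method maps a method name to the list of family labels it
    matched. Each method gets one vote per family.
    """
    votes = Counter()
    for families in hits_by_method.values():
        votes.update(set(families))
    total = len(hits_by_method)
    return [(family, count, total) for family, count in votes.most_common()]


# Invented example: five methods, one of which finds nothing significant
EXAMPLE_HITS = {
    "BLAST": [],
    "PROSITE": ["transferrin", "kinase_site"],
    "PRINTS": ["transferrin"],
    "Pfam": ["transferrin"],
    "Blocks": ["transferrin"],
}
ranking = consensus(EXAMPLE_HITS)
```

A family supported by several independent methods deserves more belief than one reported by a single method, although a simple vote like this ignores the E-values and the differing strengths of each method.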

125
Structural & functional interpretation

Db searches often do little more than identify a
protein family
this only scratches the surface - we still want
to know what our protein does & what it might
look like
The first step is to examine the detailed family
documentations in PROSITE, PRINTS & InterPro
these should help to elucidate the function of
the protein
The next step is to examine the fold
classification & structure summary resources
e.g., scop, CATH & PDBsum, assuming that a
structure is in fact available

126
(No Transcript)
127
(No Transcript)
128
(No Transcript)
129
(No Transcript)
130
(No Transcript)
131
Estimating significance

When do we believe a result?
a real example...

132
(No Transcript)
133
(No Transcript)
134
(No Transcript)
135
(No Transcript)
136
(No Transcript)
137
(No Transcript)
138
(No Transcript)
139
(No Transcript)
140
(No Transcript)
141
(No Transcript)
142
(No Transcript)
143
(No Transcript)
144
Conclusions

What are the lessons for sequence analysis?
when searching for distant homologues, several
dbs should be searched
different methods provide different perspectives
dbs aren't complete & their contents don't fully
overlap
The more dbs searched, the more difficult it can
be to interpret results
hence s/w is being designed to provide
"intelligent" consensus outputs
The more computers are involved in automating
genome annotation, the greater the need for
collaboration
especially between s/w developers, annotators &
biologists
The more data we have to handle, the more
rigorous we must be in our thinking (& writing)
if we are to make sense of the complexities
We are a long way from having reliable tools for
deducing protein structure & function from
sequence
but with the right approach, there is hope