Title: Biological Databases for Protein Sequence Analysis
1Biological Databases for Protein Sequence Analysis
- Terri Attwood
- School of Biological Sciences
- University of Manchester, Oxford Road
- Manchester M13 9PT, UK
- http://www.bioinf.man.ac.uk/dbbrowser/
2Overview
- Introduction
- Web practical, science fact & fiction
- the Twilight Zone, the Midnight Zone
- Biological databases
- primary, secondary pattern, composite, etc.
- Pattern recognition
- regular expressions, fingerprints, profiles,
etc.
- Building a search protocol
- combining results, estimating significance
3The practical - BioActivity
- BioActivity is intended to support the lectures
- you begin with a DNA sequence fragment
- try to find out what protein this codes for, the
family to which it belongs, whether its function &
structure are known, etc.
- The practical is entirely Web-based
- largely uses local servers, but also links to
external sites
- be patient & mindful of traffic - don't waste
time on slow links
- Most important of all
- please read the instructions!
- The Web is constantly evolving....
- please report dead links (otherwise they'll stay
dead)!
5The stuff you have to know
- Single- & three-letter amino acid codes
- G  Glycine        Gly    P  Proline        Pro
- A  Alanine        Ala    V  Valine         Val
- L  Leucine        Leu    I  Isoleucine     Ile
- M  Methionine     Met    C  Cysteine       Cys
- F  Phenylalanine  Phe    Y  Tyrosine       Tyr
- W  Tryptophan     Trp    H  Histidine      His
- K  Lysine         Lys    R  Arginine       Arg
- Q  Glutamine      Gln    N  Asparagine     Asn
- E  Glutamic Acid  Glu    D  Aspartic Acid  Asp
- S  Serine         Ser    T  Threonine      Thr
- Additional codes
- B  Asn/Asp    Z  Gln/Glu    X  Any amino acid
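As a small aid to memorising the codes, the table can be held as a lookup. This is only a sketch: the names THREE_LETTER, AMBIGUOUS and to_three_letter are made up for illustration, and the three-letter forms Asx, Glx and Xaa for the additional codes are the conventional ones rather than anything given on the slide.

```python
# One-letter -> three-letter amino acid codes, as listed on the slide.
THREE_LETTER = {
    "G": "Gly", "P": "Pro", "A": "Ala", "V": "Val", "L": "Leu",
    "I": "Ile", "M": "Met", "C": "Cys", "F": "Phe", "Y": "Tyr",
    "W": "Trp", "H": "His", "K": "Lys", "R": "Arg", "Q": "Gln",
    "N": "Asn", "E": "Glu", "D": "Asp", "S": "Ser", "T": "Thr",
}

# Additional (ambiguity) codes. Assumption: Asx, Glx and Xaa are the
# conventional three-letter forms; the slide gives only B, Z and X.
AMBIGUOUS = {"B": "Asx", "Z": "Glx", "X": "Xaa"}


def to_three_letter(seq):
    """Translate a one-letter sequence into hyphen-separated three-letter codes."""
    table = {**THREE_LETTER, **AMBIGUOUS}
    return "-".join(table[residue] for residue in seq.upper())


print(to_three_letter("DAVID"))  # Asp-Ala-Val-Ile-Asp
```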
7Basic definitions
- Primary structure
- the linear sequence of amino acids in a protein
- Secondary structure
- regions of local regularity
- i.e., α-helices, β-strands, β-sheets & β-turns
8Definitions contd.
- Super-secondary structure
- the packing of secondary structure elements into
stable units
- e.g., β-barrels, βαβ units, Greek keys, etc.
9Definitions contd.
- Tertiary structure
- the overall chain fold that results from packing
of secondary structure elements
10Definitions contd.
- Quaternary structure
- the arrangement of separate chains within a
protein that has more than one subunit
- e.g., haemoglobin
11Definitions contd.
- Quinternary structure
- the arrangement of separate molecules, such as in
protein-protein or protein-nucleic acid
interactions
12Definitions contd.
- Bioinformatics
- broadly, Information Technology applied to
biology
- this can mean anything from AI & robotics to
genome analysis!
- boundaries with computational biology now
blurred
- originally coined in the '80s to mean
bio-sequence analysis
- with increasing availability of protein
structures, the term now also encompasses
structure analysis
- but the scale of the problem here is vastly
different.....
13Importance of sequence analysis
- 694,000 sequences available in public databases
- millions more (including ESTs) in proprietary
databases
- these numbers will snowball with completion of more
genomes
- so what?
- Locked up in sequences is a huge amount of
structural, functional & evolutionary info
- they're a highly valuable resource
- By contrast, the number of unique protein structures
is 2000
- this represents a huge information deficit
14Sequence-structure deficit
- Figure: non-redundant growth of sequences during
1988-1998 & the corresponding growth in
the number of structures.
15Challenges for bioinformatics
- Spurred on by the sequence/structure deficit, the
challenges are to
- rationalise the mass of sequence data
- derive more efficient means of data storage
- design more incisive & reliable analysis tools
- The imperative - to convert sequence information
into biochemical & biophysical knowledge
- to decipher the structural, functional &
evolutionary clues encoded in the language of
biological sequences
16The Holy Grail of bioinformatics
- ...to be able to understand the words in a
sequence sentence that form a particular protein
structure
17The reality of sequence analysis
- ...isn't so glamorous....but means we can
recognise words that form characteristic
patterns, even if we don't know the precise
syntax to build complete protein sentences
18Pattern recognition & prediction
- In investigating the meaning of sequences, 2
distinct analytical approaches have emerged
- pattern recognition is used to detect similarity
between sequences & hence to infer related
structures & functions
- ab initio prediction is used to deduce structure,
& to infer function, directly from sequence
- These methods are different & shouldn't be
confused
- Sequence- & structure-based pattern recognition
methods demand that some characteristic has been
seen before & housed in a db
- Prediction methods remove the need for template
dbs because deductions are made directly from
sequence
19Science fact & fiction
- Sequence pattern recognition is easier to
achieve, & is much more reliable, than fold
recognition
- which is 40-50% reliable even in expert hands
- Prediction is still not possible
- & is unlikely to be so for decades to come (if
ever)
- Structural genomics will yield representative
structures for more proteins in future
- structures of new sequences will be determined by
modelling
- prediction will become an academic exercise
- But, to debunk a popular myth, knowing structure
alone does not inherently tell us function
20A reality check
- What is the function of this structure?
- What is the function of this sequence?
- What is the function of this motif?
- the fold provides a scaffold, which can be
decorated in different ways by different
sequences to confer different functions
- knowing the fold & function allows us to
rationalise how the structure effects its
function at the molecular level
21 A test case for structural genomics
Structure-based assignment of the biochemical
function of hypothetical protein mj0577
(Zarembinski et al., PNAS 95 1998)
Although the structure co-crystallised with ATP,
the biochemical function of the protein is
unknown
22The Twilight Zone
- Prediction methods don't work because we don't
fully understand the Folding Problem
- we can't read the language sequences use to
create their folds
- But, with sequence analysis techniques, we can
try to find similarities between new sequences
& those in dbs
- whose structures & functions we hope have been
elucidated
- This is straightforward at high levels of
identity, but below 50% it is difficult to
establish relationships reliably
- Analyses can be pursued with decreasing certainty
towards the Twilight Zone
- around 20% identity, where results may look
plausible to the eye, but are no longer
statistically significant
23Application areas of analysis tools
- The scale indicates % identity between aligned
sequences
- Alignment of 2 random seqs can produce 20%
identity
- less than 20% does not constitute a significant
alignment
- around this threshold is the Twilight Zone,
where alignments may appear plausible to the eye,
but can't be proved by conventional methods
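To make the identity scale concrete, here is a minimal sketch of a percent-identity calculation for two already-aligned sequences. The function names and the zone() cut-offs are illustrative assumptions: the 20% and 50% figures are simply the round numbers quoted in these slides, and other definitions of identity (e.g., dividing by the shorter sequence length) are also in use.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percentage of aligned (non-gap) columns holding identical residues."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


def zone(identity, twilight=20.0, easy=50.0):
    """Rough interpretation bands; the cut-offs are illustrative only."""
    if identity >= easy:
        return "relationship usually straightforward"
    if identity > twilight:
        return "decreasing certainty"
    return "Twilight Zone or below"


# Two 13-residue segments that align without gaps
pid = percent_identity("GAVDFIALCDRYF", "GPIDFVCFCERFY")
print(round(pid, 1), zone(pid))  # 38.5 decreasing certainty
```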
24Homology & analogy
- The term homology is confounded & abused in the
literature!
- sequences are homologous if they're related by
divergence from a common ancestor
- analogy relates to the acquisition of common
features from unrelated ancestors via convergent
evolution
- e.g., β-barrels occur in soluble serine proteases
& integral membrane porins; chymotrypsin &
subtilisin share groups of catalytic residues,
with near-identical spatial geometries, but no
other similarities
- Homology is not a measure of similarity & is not
quantifiable
- it is an absolute statement that sequences have a
divergent rather than a convergent relationship
- the phrases "the level of homology is high" or
"the sequences show 50% homology", or any like
them, are strictly meaningless!
- This is not just a semantic issue
- loose use muddies thinking about evolutionary
relationships
25A terminology muddle
- In comparing 3D structures, exactly the same
arguments apply
- structures may be similar, as denoted by RMS
positional deviation between compared atomic
positions
- common evolutionary origin remains a hypothesis,
until supported by other evidence
- homology among similar structures is a
hypothesis
- This may be correct or mistaken, but their
similarity is a fact, no matter how it is
interpreted
- Similarity of sequence or structure is just that
- similarity
- Homology connotes a common evolutionary origin
- Reeck, G.R., de Haen, C., Teller, D.C.,
Doolittle, R.F., Fitch, W.M., Dickerson, R.E.,
Chambon, P., McLachlan, A.D., Margoliash, E.,
Jukes, T.H. & Zuckerkandl, E. (1987) Homology
in proteins and nucleic acids: a terminology
muddle and a way out of it. Cell, 50, 667.
26Orthology & paralogy
- Among homologous sequences we can distinguish
- orthologues - largely perform the same function
in different species
- paralogues - perform different but related
functions in one organism
- Studying orthologues opens the way to molecular
palaeontology
- e.g., using phylogenetic trees to show
cross-species relationships
- Paralogues shed light on underlying evolutionary
mechanisms
- paralogous proteins are thought to have arisen
from single genes via successive duplication
events
- duplicated genes follow separate evolutionary
pathways & new specificities evolve through
variation & adaptation
- Such complexity presents real challenges for
sequence analysis
27Challenges for sequence analysis
- Much of the challenge is in getting the biology
right
- complicated by orthology vs paralogy
- Following a db search, it may be unclear how much
functional annotation can be legitimately
inherited by a query
- source of numerous annotation errors in dbs
- propagation could lead to an error catastrophe
- Further complications result from the modular
nature of proteins
- modules are autonomous folding units, used as
protein building blocks - like Lego bricks, they
can confer a variety of functions on the parent
protein, either by multiple combinations of the
same module, or via different modules to form
mosaics
- Automatic systems don't distinguish orthologues
from paralogues & don't consider the modular
nature of proteins
29
- Monkeys are exploited in different Goldberg
machines, where they perform different functions
- here, we couldn't predict a monkey in that
spot, even with total knowledge of the rest of
the machine
- Similarity searches are just like this
- identifying the presence of a module tells little
of the function of the complete system
- knowing most components of a mosaic, we can't
predict a missing one
- modules (monkeys) in different proteins don't
always perform exactly the same function
30The Midnight Zone
- Identifying evolutionary links between sequences
is useful
- this often implies a shared function
- Arguably, prediction of function from sequence is
of more immediate value than the prediction of
structure
- However, between distantly-related proteins,
structure is more conserved than the underlying
sequences
- thus, some relationships are only apparent at the
structural level
- Such relationships can't be detected by even the
most sensitive sequence comparison methods
- the region of identity where sequence comparisons
fail completely to detect structural similarity
is the Midnight Zone
- there is thus a theoretical limit to the
effectiveness of sequence analysis methods
31Ground rules for bioinformatics
- Don't always believe what programs tell you
- they're often misleading & sometimes wrong!
- Don't always believe what databases tell you
- they're often misleading & sometimes wrong!
- Don't always believe what lecturers tell you
- they're often misleading & sometimes wrong!
- In short, don't be a naive user
- when computers are applied to biology, it is
vital to understand the difference between
mathematical & biological significance
- computers don't do biology
- they do sums
- quickly!
32Significance
- Appreciating that mathematical & biological
significance are different is crucial
- It is especially important in understanding the
limitations of
- database search algorithms
- multiple sequence alignment algorithms
- pattern recognition techniques
- functional site & structure prediction tools
- Contrary to popular opinion, there is currently
still
- no biologically-reliable automatic multiple
alignment algorithm
- no infallible pattern-recognition technique
- no reliable gene, function or structure
prediction algorithm
35Biological Databases
- Overview
- Primary data sources
- GenBank, SWISS-PROT & TrEMBL
- Composite sequence databases
- NRDB, OWL, SP+TrEMBL
- Secondary pattern databases
- PROSITE, PRINTS, Profiles, Pfam, BLOCKS,
IDENTIFY
- Composite pattern databases
- BLOCKS, InterPro
36Primary sequence databases
- In the '80s, when sequences started to
accumulate, several labs saw advantages to
establishing central repositories
- trouble is, many labs thought the same & made
their own
- Nucleic: EMBL, GenBank, DDBJ
- Protein: SWISS-PROT, PIR, MIPS, TrEMBL, NRL-3D
- The proliferation of dbs causes problems
- do they have the same format? Which is the most
accurate? The most up-to-date? The most
comprehensive? Which should we use?
37Composite sequence databases
- A solution to proliferating dbs is to compile a
composite
- these render searches very efficient, especially
if non-redundant
- Trouble is, there are now several composites,
each with their own format & redundancy criteria
- NRDB: PDB, SWISS-PROT, PIR, GenPept, GenPept updates
- OWL: SWISS-PROT, PIR, GenBank, NRL-3D
- SP+TrEMBL: SWISS-PROT, TrEMBL
- NRDB & SP+TrEMBL are non-identical, not
non-redundant
- but which is best? Which the most comprehensive?
The most up-to-date? Which should we use?
38Secondary pattern databases
- As well as 1° resources, there are also many 2°
pattern dbs derived from them
- trouble is, they use different 1° sources &
different analysis methods, & all have different
formats!
- But it isn't all bad - SWISS-PROT is emerging as
a standard, & most of the 2° dbs use it as their
basis
- 2° db     Source                Stored information
- PROSITE   SWISS-PROT            Regular expressions (patterns)
- PRINTS    SWISS-PROT/TrEMBL     Aligned motifs (fingerprints)
- Pfam      SWISS-PROT/TrEMBL     Hidden Markov Models (HMMs)
- Profiles  SWISS-PROT            Weight matrices (profiles)
- BLOCKS    PRINTS/InterPro/Domo  Weighted motifs (blocks)
- IDENTIFY  PRINTS/InterPro       Permissive regular expressions
39Why create pattern databases?
- Arise from the need to make more specific
functional diagnoses than are possible by just
searching the 1° dbs
- They're built on the principle that homologous
sequences may be gathered into alignments, within
which are regions (motifs) that show little
variation
- these usually reflect vital structural or
functional roles
- Motifs are exploited in different ways to build
diagnostic patterns for protein families
- new sequences can be searched against dbs of such
patterns to see if they can be assigned to known
families
- hence they offer a fast track to the inference of
function
40What's in a sequence?
41Single motif methods
- Fuzzy regex (IDENTIFY)
- Exact regex (PROSITE)
Multiple motif methods
- Identity matrices (PRINTS)
- Weight matrices (BLOCKS)
Full domain alignment methods
- Profiles (PROFILE LIBRARY)
- HMMs (Pfam)
42The challenge of family analysis
43Know your family
44The problem with domains
45PROSITE
- This was the first pattern database
- protein families characterised by single motifs
- Sequence information in motifs is reduced to
consensus or regular expressions & the seed
pattern is used to search SWISS-PROT
- results are checked by hand to determine true &
false matches
- noisy patterns are revised to achieve optimal
results
- Some families can't be characterised by single
motifs
- here, additional patterns are created & refined
until an optimal set of patterns is achieved that
captures most or all of the family
- results are then manually annotated for inclusion
in the db
48PRINTS
- Most protein families are characterised by more
than 1 motif
- it is sensible to use them all to build a
diagnostic signature
- This is the principle of fingerprints
- these offer improved diagnostic reliability by
virtue of the biological context provided by
motif neighbours
- Motifs are excised from alignments by hand &
encoded as ungapped, unweighted local alignments
- residue information is augmented via iterative
searches
- sequences matching all motifs that weren't in the
original alignment are added to the motifs, & the
db searched again
- The process is repeated until convergence
- results are manually annotated prior to inclusion
in the db
50SUMMARY INFORMATION
   37 codes involving 8 elements
    0 codes involving 7 elements
    0 codes involving 6 elements
    0 codes involving 5 elements
    0 codes involving 4 elements
    1 codes involving 3 elements
    0 codes involving 2 elements
COMPOSITE FINGERPRINT INDEX
    8   37  37  37  37  37  37  37  37
    7    0   0   0   0   0   0   0   0
    6    0   0   0   0   0   0   0   0
    5    0   0   0   0   0   0   0   0
    4    0   0   0   0   0   0   0   0
    3    1   0   0   0   1   1   0   0
    2    0   0   0   0   0   0   0   0
   ------------------------------------------
         1   2   3   4   5   6   7   8
True positives..
   PRIO_COLGU  PRIO_MACFA  PRIO_CEREL  PRIO_ODOHE  PRIO_GORGO
   PRIO_PANTR  PRIO_HUMAN  O46648      PRIO_SHEEP  PRIO_CALJA
   PRIO_BOVIN  PRP2_BOVIN  PRIO_ATEPA  PRIO_SAISC  PRIO_PREFR
   PRIO_PONPY  O75942      PRIO_CAPHI  PRIO_CEBAP  PRIO_CAMDR
   PRIO_FELCA  PRP1_TRAST  PRIO_RABIT  PRP2_TRAST  PRIO_PIG
   PRIO_CANFA  PRIO_CRIGR  PRIO_CRIMI  Q15216      PRIO_RAT
   PRIO_CERAE  PRIO_MUSPF  PRIO_MUSVI  PRIO_MESAU  PRIO_MOUSE
   O46593      PRIO_TRIVU
Subfamily: Codes involving 3 elements
Subfamily True positives..
   PRIO_CHICK
52Profiles & Pfam
- An alternative to motif-based methods exploits
regions between motifs, which contain valuable
information
- the full alignment effectively becomes the
discriminator
- A complex scoring scheme allowing for
substitutions & INDELs is used to create
family-specific profiles
- These profiles can be used to detect distant
relationships, where only few residues are
conserved
- this is the basis of the Profile library
- In an extension of this approach, alignments are
encoded as probabilistic models termed HMMs
- this is the basis of Pfam
53BLOCKS & IDENTIFY
- There are advantages to storing motifs in a raw
form
- no information is lost
- different scoring schemes may be used to confer
different diagnostic potentials on the same data
- Additional pattern databases have arisen in this
way
- BLOCKS - processed PROSITE families automatically
(BLOCKS includes many other sources)
- BLOCKS-format PRINTS - PRINTS motifs with BLOCKS
scoring
- IDENTIFY - creates fuzzy expressions from PRINTS
& InterPro
- These databases are derived fully automatically,
hence offer
- no family annotation (they link back to PRINTS
& InterPro)
- no further family coverage
54Composite pattern databases
- To simplify sequence analysis, the pattern
databases are being integrated to create a
unified protein family resource - InterPro
- this is a central annotation resource (derived
from PRINTS & PROSITE documentation), with
pointers to its satellite databases
- release 3.0 contains 3591 entries
- current partners are PRINTS, PROSITE, Profiles,
Pfam & ProDom
- future partners will include SMART, TigrFam &
hopefully others (BLOCKS, MetaFam, etc.)
- lags behind its sources
57Pattern Recognition
- Overview
- Pattern recognition methods
- regular expressions, fingerprints, blocks,
profiles & HMMs
- Which method is best?
58Pattern recognition methods
- These methods classify proteins into families
- the basis of the methods is multiple sequence
alignment
- They depend on developing a representation of
conserved elements of alignments that may be
diagnostic of structure or function, whether
from
- homologous sequence families
- sequences that share some structural/functional
domains
59Determining significance of database matches
- When searching a db, the challenge for analysis
methods is to determine if matches are related
(true-positive) or unrelated (true-negative)
- At a given scoring threshold, it is likely that
unrelated sequences will be matched erroneously
(false-positives) & some correct matches will be
missed (false-negatives)
- The aim is to improve the resolution between the
true-positive & true-negative score curves
- in the overlap, it is difficult or
impossible to establish if matches are
significant
- Different methods tackle this problem in
different ways
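The effect of moving a scoring threshold through overlapping score distributions can be sketched with a simple count. Everything here is illustrative: confusion_at is a made-up helper, and the scores and family labels are invented toy data, not results from any database.

```python
def confusion_at(threshold, scored_hits):
    """Count outcomes at a score threshold.

    scored_hits is a list of (score, is_family_member) pairs.
    """
    tp = sum(1 for s, member in scored_hits if s >= threshold and member)
    fp = sum(1 for s, member in scored_hits if s >= threshold and not member)
    fn = sum(1 for s, member in scored_hits if s < threshold and member)
    tn = sum(1 for s, member in scored_hits if s < threshold and not member)
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


# Invented scores: family members and non-members overlap between 47 and 64
hits = [(95, True), (88, True), (72, True), (64, False), (61, True),
        (55, False), (47, True), (40, False), (31, False), (20, False)]

for threshold in (70, 50, 30):
    print(threshold, confusion_at(threshold, hits))
```

A strict threshold (70) gives no false positives but misses two family members; a permissive one (30) recovers every member at the cost of four false positives.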
60Regular expressions/patterns
- These are derived from single conserved regions,
which are reduced to consensus expressions for db
searches
- they are minimal expressions, so sequence
information is lost
- the more divergent the sequences used, the more
fuzzy & poorly discriminating the pattern
becomes
- Alignment
- GAVDFIALCDRYF
- GPIDFVCFCERFY
- GRVEFLNRCDRYY
- Pattern
- G-X-[IV]-[DE]-F-[IVL]-X2-C-[DE]-R-[FY]2
- Patterns do not tolerate similarity
- sequences either match or not, regardless of how
similar they are
- matching is a binary on-off event & frequently
misses true matches
- single-motif methods are very hit-or-miss - how
do you know if you've encoded the best region?
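The on-off behaviour is easy to demonstrate. The sketch below assumes the slide's pattern reads, with bracketed alternatives, as G-X-[IV]-[DE]-F-[IVL]-X2-C-[DE]-R-[FY]2, and rewrites it as a Python regular expression; the near_miss sequence is hypothetical, invented to show a single conservative change.

```python
import re

# G-X-[IV]-[DE]-F-[IVL]-X2-C-[DE]-R-[FY]2 as a Python regex
pattern = re.compile(r"G.[IV][DE]F[IVL].{2}C[DE]R[FY]{2}")

alignment = ["GAVDFIALCDRYF", "GPIDFVCFCERFY", "GRVEFLNRCDRYY"]
print([bool(pattern.fullmatch(s)) for s in alignment])  # [True, True, True]

# Hypothetical relative with one conservative change (R -> K at position 11)
near_miss = "GAVDFIALCDKYF"
print(bool(pattern.fullmatch(near_miss)))  # False
```

The near miss differs from the first aligned sequence at a single position, yet it is rejected just as firmly as an unrelated sequence would be.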
61In the beginning was PROSITE
- G_PROTEIN_RECEPTOR PATTERN
- PS00237
- G-protein coupled receptor signature
- [GSTALIVMYWC]-[GSTANCPDE]-{EDPKRH}-X(2)-[LIVMNQGA]-
X(2)-[LIVMFT]-[GSTANC]-[LIVMFYWSTAC]-[DENH]-R
- /TOTAL=919(919); /POS=869(869); /FALSE_POS=50(50);
/FALSE_NEG=70; /PARTIAL=49; /UNKNOWN=0(0)
- This represents an apparent 18% error rate
- the actual rate is probably higher
- Thus, a match to a pattern is not necessarily
true
- a mis-match is not necessarily false!
- False-negatives are a fundamental limitation to
this type of pattern matching
- if you don't know what you're looking for, you'll
never know you missed it!
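The quoted error rate can be checked from the counts in the entry. This is a sketch under an assumption: the slide does not say how the figure was derived, so here "errors" are taken to be false positives plus false negatives plus partial matches, relative to the total number of matches.

```python
# Counts from the PS00237 entry shown on the slide
total, true_pos = 919, 869
false_pos, false_neg, partial = 50, 70, 49

# Assumption: errors = false positives + false negatives + partials,
# expressed as a percentage of the total number of matches.
errors = false_pos + false_neg + partial
rate = 100 * errors / total

print(true_pos + false_pos == total)  # True
print(errors, round(rate, 1))         # 169 18.4
```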
62R-Y-x-[DT]-W-x-[LIVM]-[ST]-T-P-[LIVM](3)
64Regular expressions/rules
- Regular expression patterns are most effective
when applied to highly-conserved, family-specific
motifs
- It is often possible to identify shorter generic
patterns that are characteristic of common
functional sites
- Functional site                    Rule
- N-glycosylation                    N-{P}-[ST]-{P}
- Protein kinase C phosphorylation   [ST]-X-[RK]
- Casein kinase II phosphorylation   [ST]-X2-[DE]
- Such features result from convergence to a common
property
- glycosylation sites, phosphorylation sites, etc.
- They cannot be used for family diagnosis & don't
discriminate
- they can only be used to suggest whether a
certain functional site might exist (which must
then be tested by experiment)
- such patterns are termed rules
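A rule scan of this kind can be sketched in a few lines. The rules are read here in PROSITE notation, N-{P}-[ST]-{P}, [ST]-X-[RK] and [ST]-X2-[DE], where {P} means "anything but P"; RULES, scan and the test sequence are made up for illustration. A hit only suggests that a site might exist.

```python
import re

# PROSITE-style rules rewritten as Python regexes; {P} becomes [^P].
# Each regex is wrapped in a lookahead so overlapping sites are all reported.
RULES = {
    "N-glycosylation": r"(?=(N[^P][ST][^P]))",
    "Protein kinase C phosphorylation": r"(?=([ST].[RK]))",
    "Casein kinase II phosphorylation": r"(?=([ST].{2}[DE]))",
}


def scan(seq):
    """Return (rule name, 1-based start, matched residues), sorted by position."""
    hits = []
    for name, regex in RULES.items():
        for m in re.finditer(regex, seq):
            hits.append((name, m.start() + 1, m.group(1)))
    return sorted(hits, key=lambda hit: hit[1])


# Invented sequence, not taken from any database
for hit in scan("MKNGSAVTPRSLEDNPTA"):
    print(hit)
```

Note that the second Asn (N-P-T-A) is correctly skipped, because Pro follows it.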
65Diagnostic limitations of short motifs
- Consider the sequence motif Asp-Ala-Val-Ile-Asp
(DAVID)
- results of db searching for such a sequence will
differ, depending on whether we search for exact
or permissive (fuzzy) matches
- Pattern                           Matches
- D-A-V-I-D                         71 (99)
- D-A-V-I-[DEQN]                    252
- [DEQN]-A-V-I-[DEQN]               925
- [DEQN]-A-[VLI]-I-[DEQN]           2,739
- [DEQN]-[AG]-[VLI]-[VLI]-[DEQN]    51,506
- D-A-V-E                           1,088 (1,493)
- (number of matches in OWL29.6; figures in brackets
are for OWL31.1)
- Use of fuzzy regular expressions has the
potential advantage of being able to recognise
more distant relationships
- but the inherent disadvantage is that more matches
will be made by chance, making it difficult to
separate out true matches from noise
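Why permissive patterns attract chance matches can be seen from a back-of-envelope calculation. The sketch assumes, unrealistically, that all 20 residues are equally frequent and independent; chance_per_window is a made-up helper, and the result is only a rough guide, not a model of the OWL counts above.

```python
from math import prod


def chance_per_window(pattern_groups, alphabet_size=20):
    """Probability that a random window matches, assuming equal residue frequencies."""
    return prod(len(group) / alphabet_size for group in pattern_groups)


exact = ["D", "A", "V", "I", "D"]
fuzzy = ["DEQN", "AG", "VLI", "VLI", "DEQN"]

# How many times more likely the fuzzy pattern is to match by chance
ratio = chance_per_window(fuzzy) / chance_per_window(exact)
print(round(ratio))  # 288
```

Under this crude model the fuzziest pattern is about 288 times more likely to match by chance than the exact one; the observed counts on the slide rise by an even larger factor (71 to 51,506).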
66Residue groups for fuzzy patterns
- It is possible to assign residues to groups
corresponding to various biochemical properties -
e.g., charge & size
- using such groups to create fuzzy expressions
theoretically ensures that resulting motifs have
sensible biochemical interpretations
- small: Ala, Gly
- small hydroxyl: Ser, Thr
- basic: His, Lys, Arg
- aromatic: Phe, Tyr, Trp
- aliphatic: Val, Leu, Ile, Met
- acidic/amide: Asp, Glu, Asn, Gln
- small/polar: Ala, Gly, Ser, Thr, Pro
- This is more flexible than exact regular
expression matching
- but the inherent permissiveness of the fuzzy
approach brings an inevitable signal-to-noise
trade-off
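As a sketch of how such groups might be turned into a fuzzy expression, the code below widens each residue of an exact motif to its group. GROUPS and fuzzify are illustrative names; the overlapping small/polar group is left out (an arbitrary simplification) so that each residue falls in at most one group, and ungrouped residues such as Cys and Pro are kept exact.

```python
import re

# Residue groups from the slide, in one-letter code. Assumption: the
# overlapping small/polar group (A, G, S, T, P) is omitted for simplicity.
GROUPS = ["AG", "ST", "HKR", "FYW", "VLIM", "DENQ"]


def fuzzify(motif):
    """Widen each residue of an exact motif to its residue group."""
    parts = []
    for residue in motif:
        group = next((g for g in GROUPS if residue in g), residue)
        parts.append(f"[{group}]" if len(group) > 1 else group)
    return "".join(parts)


fuzzy = fuzzify("DAVID")
print(fuzzy)                               # [DENQ][AG][VLIM][VLIM][DENQ]
print(bool(re.fullmatch(fuzzy, "EGLMN")))  # True
```

The second line shows the trade-off: EGLMN shares no residue with DAVID, yet it matches the fuzzy expression.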
67Fingerprints
- Fingerprints are groups of motifs excised from
alignments & used for iterative db searching
- no weighting scheme is used
- searches depend only on residue frequencies
- resulting scoring matrices are thus sparse
- Each motif trawls the database independently
- search results are correlated to determine which
sequences match all the motifs & which match only
partially
- no information is thrown away
- Iteration refines the fingerprint & increases its
potency
- fingerprints are diagnostically more powerful
than regular expressions
68TM domain
TM domain
69loop region
70A fingerprinting overview
71
(a)
  YVTVQHKKLRTPL
  YVTVQHKKLRTPL
  YVTVQHKKLRTPL
  AATMKFKKLRHPL
  AATMKFKKLRHPL
  YIFATTKSLRTPA
  VATLRYKKLRQPL
  YIFGGTKSLRTPA
  WVFSAAKSLRTPS
  WIFSTSKSLRTPS
  YLFSKTKSLQTPA
  YLFTKTKSLQTPA
(b)
   T  C  A  G  N  S  P  F  L  Y  H  Q  V  K  D  E  I  W  R  M  B  X  Z
   0  0  2  0  0  0  0  0  0  7  0  0  1  0  0  0  0  2  0  0  0  0  0
   0  0  3  0  0  0  0  0  2  0  0  0  4  0  0  0  3  0  0  0  0  0  0
   6  0  0  0  0  0  0  6  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   1  0  1  1  0  3  0  0  1  0  0  0  3  0  0  0  0  0  0  2  0  0  0
   2  0  1  1  0  0  0  0  0  0  0  3  0  4  0  0  0  0  1  0  0  0  0
   4  0  1  0  0  1  0  2  0  1  3  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0 12  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  6  0  0  0  0  0  0  0  6  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0 12  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  2  0  0  0  0  0  0 10  0  0  0  0
   9  0  0  0  0  0  0  0  0  0  2  1  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0 12  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  4  0  0  2  0  0  6  0  0  0  0  0  0  0  0  0  0  0  0  0  0
(c) (first three rows)
   T  C  A  G  N  S  P  F  L  Y  H  Q  V  K  D  E  I  W  R  M  B  X  Z
   0  0  4  0  0  0  0  8  4 34  0  0 15  0  0  0  1  7  0  0  0  0  0
   0  4 15  0  0  0  0  0  7  0  0  0 37  0  0  0 10  0  0  0  0  0  0
  50  0  0  0  0  3  0 18  0  0  0  0  0  0  0  0  0  0  0  2  0  0  0
- Key
- (a) motif, with 3 conserved positions
- (b) corresponding frequency matrix
- (c) same matrix, but after 3 iterations
- (d) same matrix, with PAM250 weighting
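The step from motif (a) to frequency matrix (b) is a plain column count, which can be sketched as follows. The function name frequency_matrix is made up; the column order is the one used on the slide. No weighting is applied, in keeping with the unweighted fingerprint approach.

```python
from collections import Counter

# Motif (a): 12 aligned, ungapped segments of 13 residues
MOTIF = [
    "YVTVQHKKLRTPL", "YVTVQHKKLRTPL", "YVTVQHKKLRTPL", "AATMKFKKLRHPL",
    "AATMKFKKLRHPL", "YIFATTKSLRTPA", "VATLRYKKLRQPL", "YIFGGTKSLRTPA",
    "WVFSAAKSLRTPS", "WIFSTSKSLRTPS", "YLFSKTKSLQTPA", "YLFTKTKSLQTPA",
]
COLUMNS = "TCAGNSPFLYHQVKDEIWRMBXZ"  # residue order used on the slide


def frequency_matrix(motif):
    """One row per motif position, one residue count per column."""
    rows = []
    for position in zip(*motif):  # tuple of the residues at one position
        counts = Counter(position)
        rows.append([counts.get(residue, 0) for residue in COLUMNS])
    return rows


matrix = frequency_matrix(MOTIF)
print(matrix[0])  # first row of (b): A=2, Y=7, V=1, W=2

# Positions where a single residue accounts for every sequence
conserved = [i + 1 for i, row in enumerate(matrix) if max(row) == len(MOTIF)]
print(conserved)  # [7, 9, 12]
```

The three fully conserved positions (K, L and P) are the ones referred to in the key; most other cells are zero, which is why the resulting scoring matrices are sparse.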
72Fingerprint visualisation
- Full potency of fingerprinting is gained from the
mutual context provided by motif neighbours
- Important, as it inherently implies a biological
context to motifs matched in the correct order,
with appropriate distances between them
- results are thus biologically more meaningful
than those from single motifs
- Allows sequence identification even when parts of
the fingerprint are absent
- such matches are best visualised graphically
75Blocks
- Blocks are groups of motifs derived automatically
from families identified in PRINTS & InterPro
- sequences are aligned automatically & motifs are
automatically identified by searching for spaced
residue triplets (e.g., AxxxVxxC)
- a block score is calculated using the BLOSUM62
matrix
- validity of blocks is confirmed with a 2nd
motif-finding algorithm
- blocks found by both methods are considered
reliable
- Sequences within motifs are clustered to reduce
contributions to residue frequencies from sets of
closely-related sequences
- each cluster is treated as a single sequence &
given a score that gives a measure of its
relatedness
- the higher the weight, the more dissimilar the
segment from others in the block, the most
distant segments being given a score of 100
78Profiles
- Profiles are scoring tables derived from full
alignments
- these define which residues are allowed at given
positions
- which positions are conserved & which degenerate
- which positions, or regions, can tolerate
insertions
- the scoring system is intricate, & may include
evolutionary weights, results from structural
studies, & data implicit in the alignment
- variable penalties are specified to weight
against INDELs occurring in core 2° structure
elements
- Within a profile, the I & M fields contain
position-specific scores for insert & match
positions
- in conserved regions, INDELs aren't totally
forbidden, but are strongly impeded by large
penalties defined in the DEFAULT field
- these are superseded by more permissive values in
gapped regions
- the inherent complexity of profiles renders them
highly potent discriminators, but they are
time-consuming to derive
81Hidden Markov Models
- HMMs are similar in concept to profiles
- they are probabilistic models consisting of
inter-connecting states
- essentially, linear chains of match, delete or
insert states
- Match states are assigned to conserved columns in
an alignment
- insert states allow for insertions relative to
match states
- delete states allow match positions to be
skipped
- thus, building an HMM requires each position in
an alignment to be assigned to match, delete or
insert states
- HMMs usually perform well, but can be
over-trained
- they may also suffer if created from automatic
iterative processes
- if it once accepts a false match, an HMM becomes
corrupt
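The assignment of alignment positions to states can be sketched very simply. The rule used here, treating a column as a match state unless more than half of its entries are gaps, is a common heuristic and an assumption of this example, not something stated in the slides; the alignment is invented, and no probabilities are estimated.

```python
def assign_columns(alignment, max_gap_fraction=0.5, gap="-"):
    """Label each alignment column 'M' (match) or 'I' (insert)."""
    n = len(alignment)
    labels = []
    for column in zip(*alignment):
        gap_fraction = column.count(gap) / n
        labels.append("M" if gap_fraction <= max_gap_fraction else "I")
    return "".join(labels)


def state_path(aligned_seq, labels, gap="-"):
    """State path of one sequence: M (match), D (delete) or I (insert)."""
    path = []
    for residue, label in zip(aligned_seq, labels):
        if label == "M":
            path.append("D" if residue == gap else "M")
        elif residue != gap:
            path.append("I")  # residue sitting in an insert column
    return "".join(path)


# Invented four-sequence alignment
alignment = ["CL-YE", "CLWYD", "C--YE", "CL-YE"]
labels = assign_columns(alignment)
print(labels)  # MMIMM
for seq in alignment:
    print(seq, state_path(seq, labels))
```

The second sequence visits an insert state (its W), while the third skips a match position via a delete state.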
82An HMM
- (figure: HMM state diagram, with residue labels C, L, Y, E, C, L, W, D)
83Which method is best?
- The range of methods available leads to familiar
problems
- which should we use?
- which is the most reliable?
- which is the most comprehensive?
- None of the pattern-recognition techniques is
infallible
- each has its optimum area of application
- None of the resulting pattern databases is
complete
- none is the best
- bearing in mind the diagnostic strengths &
weaknesses of the different approaches, & keeping
biological significance in mind, the best
strategy is to use them all
84Current status of pattern databases
- PROSITE (SIB) - 1034 entries
- single motifs (regexs) - best with small, highly
conserved sites
- Profile library (ISREC) - 300 entries
- weight matrices - good with divergent domains &
superfamilies
- PRINTS (Manchester) - 1500 entries
- multiple motifs (fingerprints) - best for
families and sub-families
- Pfam (Sanger Centre) - 2727 entries
- HMMs - good with divergent domains &
superfamilies
- InterPro (EBI) - 3591 entries
- derived from PRINTS, PROSITE, Profiles, Pfam,
ProDom, etc.
- BLOCKS (FHCRC) - 2433 entries
- multiple motifs (derived from PRINTS, InterPro,
Domo etc.)
- IDENTIFY (Stanford)
- permissive regexs (derived from PRINTS & InterPro)
85Tools for predicting protein function from
sequence
86Building a search protocol
- Overview
- The usual starting point
- searching the primary data sources
- NRDB, SPTR, etc.
- Pattern recognition methods
- searching the secondary sources
- patterns, profiles, blocks, fingerprints & HMMs
- Estimating significance
- when do we believe a result?
87A practical approach
- A central goal is to predict protein function
from sequence
- Given a newly-determined sequence, we want to
know
- what is my protein?
- to what family does it belong?
- what is its function?
- how can we explain its function in structural
terms?
- By searching pattern dbs & fold libraries, we may
recognise patterns that allow us to infer
relationships with previously-characterised
families & folds
- Given the variety of dbs to search, how do we use
them to build a sensible search protocol?
88
- Protein sequence database identity search
- e.g., for short fragments, pinpoints identical
matches to probe - may identify correct reading
frame
- Protein sequence database similarity search
- e.g., nrdb, OWL, SP+TrEMBL - identifies
homologues to probe
- Protein pattern database search
- e.g., PROSITE, profiles, PRINTS, BLOCKS, Pfam -
identifies family relationships or pinpoints key
structural or functional sites
- Known structure: structure classification
database query
- e.g., scop, CATH, FSSP - provides details of
structural class, secondary structure
information, ligand-binding, etc.
- No known structure: protein fold pattern library
search
- e.g., threading - identifies compatible folds
for the probe sequence
89Similarity searching
- Whether or not an identity search finds a match,
the next step is to look for similar sequences
- e.g., you may wish to know if a wider family
exists
- The most rapid option is to use BLAST (Basic Local
Alignment Search Tool), flavours of it, or
FastA
- In BLAST output, look for
- high scores with low P-values (unlikely to be
random)
- clusters of high scores at the top of the hitlist
(a family?)
- trends in the type of sequences matched
- To ensure a comprehensive search, identity &
similarity searches are best performed on
composite databases
- e.g., NRDB, SP+TrEMBL
90Ideal results show high scores & low E-values
91Why bother with pattern searches?
- Primary searches won't always allow outright
diagnosis
- BLAST & FASTA are not infallible
- often can't assign mathematically significant
scores
- results may be complicated by modules, domains or
compositionally-biased regions
- annotations of retrieved hits may be incorrect
- Pattern databases contain potent descriptors
- so, distant relationships missed by BLAST may be
captured by one or more of the family or
functional site distillations
100Structural & functional interpretation
- Running db searches often does little more than
identify a protein family
- this only scratches the surface - we still want
to know what our protein does & what it might
look like
- The first step is to examine the detailed family
documentation in PROSITE, PRINTS & InterPro
- these should help to elucidate the function of
the protein
- The next step is to examine the fold
classification & structure summary resources
- e.g., SCOP, CATH & PDBsum (assuming the structure
is known)
106Estimating significance
- When do we believe a result?
- A real example.....
119Conclusions
- Gene prediction, structure & function prediction
are non-trivial
- structure & function prediction tools are, at
best, 70% accurate
- What are the lessons for sequence analysis?
- when searching for distant homologues, several
dbs should be searched
- different methods provide different perspectives
- dbs aren't complete & their contents don't fully
overlap
- The more dbs searched, the more difficult it can
be to interpret results
- The more computers are involved in automating
genome annotation, the greater the need for
collaboration with biologists
- The more data we have to handle, the more
rigorous we must be in our thinking (& writing)
if we are to make sense of the complexities
- We are still a long way from having reliable
tools for deducing protein function from
sequence
- but with the right approach, there is hope
120The right approach risks being obscured by other
issues...