Title: Biological Databases for Protein Sequence Analysis
1Biological Databases for Protein Sequence Analysis
- Teresa K. Attwood
- School of Biological Sciences
- University of Manchester, Oxford Road
- Manchester M13 9PT, UK
- http://www.bioinf.man.ac.uk/dbbrowser/
2Overview
- Introduction
- Web practical, science fact & fiction, some reality checks
- Biological databases
- sequence, family, composite, etc.
- Pattern recognition
- regular expressions, fingerprints, profiles, etc.
- Building a search protocol
- a real example
3Introduction
- Single- & three-letter amino acid codes
- G Glycine Gly P Proline Pro
- A Alanine Ala V Valine Val
- L Leucine Leu I Isoleucine Ile
- M Methionine Met C Cysteine Cys
- F Phenylalanine Phe Y Tyrosine Tyr
- W Tryptophan Trp H Histidine His
- K Lysine Lys R Arginine Arg
- Q Glutamine Gln N Asparagine Asn
- E Glutamic Acid Glu D Aspartic Acid Asp
- S Serine Ser T Threonine Thr
- Additional codes
- B Asn/Asp Z Gln/Glu X Any amino acid
5Basic definitions
- Primary structure
- the linear sequence of amino acids in a protein
- Secondary structure
- regions of local regularity
- i.e., α-helices, β-strands, β-sheets & β-turns
6Definitions contd.
- Super-secondary structure
- the packing of secondary structure elements into stable units
- e.g., β-barrels, βαβ units, Greek keys, etc.
7Definitions contd.
- Tertiary structure
- the overall chain fold that results from packing
of secondary structure elements
8Definitions contd.
- Quaternary structure
- the arrangement of separate chains within a protein that has more than one subunit
- e.g., haemoglobin
9Definitions contd.
- Quinternary structure
- the arrangement of separate molecules, such as in
protein-protein or protein-nucleic acid
interactions
10The practical - BioActivity
- BioActivity: sequence analysis in action
- begin with a fragment of a DNA sequence
- try to find out what protein this codes for, the family to which it belongs, & whether its function & structure are known
- The practical is entirely Web-based
- be mindful of traffic & don't waste time on slow links
- Most important of all
- read the instructions!
- The Web is constantly evolving....
- please report dead links (otherwise they'll stay dead)!
11Importance of sequence analysis
- >900,000 sequences available in public dbs
- millions more (including ESTs) in proprietary dbs
- these numbers will snowball with completion of more genomes
- so what?
- Locked up in sequences is a huge amount of structural, functional & evolutionary info
- they're a highly valuable resource
- By contrast, the number of unique protein structures is 2000
- a huge information deficit
12The legacy of the genome projects: sequence-structure deficit
[Figure: growth curves from 1988 to 2002; vertical axis ticks run from 100 to 800]
- Non-redundant growth of sequences during 1988-2002 & the corresponding growth in the number of structures.
13Challenges for bioinformatics
- Spurred on by the seq/structure deficit, the challenges
- rationalise the mass of sequence data
- derive more efficient means of data storage
- design more incisive & reliable analysis tools
- The imperative
- to convert sequence information into biochemical & biophysical knowledge
- to decipher the structural, functional & evolutionary clues encoded in the language of biological sequences
14The Holy Grail of bioinformatics
- ...to be able to understand the words in a
sequence sentence that form a particular protein
structure
15The reality of sequence analysis
- ...isn't so glamorous....but means we can
recognise words that form characteristic
patterns, even if we don't know the precise
syntax to build complete protein sentences
16Pattern recognition & prediction
- In investigating the meaning of sequences, two distinct analytical approaches have emerged
- pattern recognition is used to detect similarity between sequences & hence to infer related structures & functions
- ab initio prediction is used to deduce structure, & to infer function, directly from sequence
- These methods are quite different!
- pattern recognition methods demand that some characteristic has been seen before & housed in a db
- prediction methods remove the need for template dbs, because deductions are made directly from sequence
17Science fact & fiction
- Sequence pattern recognition is easier to achieve, & is much more reliable, than fold recognition
- which is 50% reliable even in expert hands
- Prediction is still not possible
- & is unlikely to be so for decades to come (if ever)
- Structural genomics will yield representative structures for many (but not all) proteins in future
- structures of new sequences will be determined by modelling
- prediction will become an academic exercise
- But, to debunk a popular myth, knowing structure alone does not inherently tell us function
18A reality check
- What is the function of this structure?
- What is the function of this sequence?
- What is the function of this motif?
- the fold provides a scaffold, which can be decorated in different ways by different sequences to confer different functions
- knowing the fold & function allows us to rationalise how the structure effects its function at the molecular level
19 A test case for structural genomics
Structure-based assignment of the biochemical
function of hypothetical protein mj0577
(Zarembinski et al., PNAS 95 1998)
Although the structure co-crystallised with ATP,
the biochemical function of the protein is
unknown
20The Twilight Zone
- Prediction methods don't work because we don't fully understand the Folding Problem
- we can't read the language sequences use to create their folds
- But, with sequence analysis techniques, we can try to find similarities between new sequences & those in dbs
- whose structures & functions we hope have been elucidated
- This is straightforward at high levels of identity, but below 50% it is difficult to establish relationships reliably
- Analyses can be pursued with decreasing certainty towards the Twilight Zone
- 20% identity, where results may look plausible to the eye, but are no longer statistically significant
21Beyond the Twilight Zone
- To penetrate deeper into the Twilight Zone is the aim of most analytical methods
- whether using single sequences, motifs, complex weighting schemes or raw amino acid frequencies
- Each offers a different perspective, depending on the type of information used in the search
- none gives the right answer
- It is good practice to devise an analysis protocol that uses a variety of methods
- but don't expect the impossible: no method is infallible!
22Application areas of analysis tools
- The scale indicates % identity between aligned sequences
- Alignment of 2 random seqs can produce 20% identity
- less than 20% does not constitute a significant alignment
- around this threshold is the Twilight Zone, where alignments may appear plausible to the eye, but can't be proved by conventional methods
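The identity scale can be made concrete with a short calculation. The sketch below is illustrative only: the function names and the cut-offs in `zone` are assumptions chosen to mirror the slide's wording (50% and 20%), and gapped columns are simply skipped, which is one convention among several.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of aligned columns holding the same residue in both sequences.

    Columns where either sequence has a gap ('-') are ignored (an assumption;
    other conventions divide by the full alignment length).
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not columns:
        return 0.0
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)


def zone(identity: float) -> str:
    """Rough reading of the identity scale; the cut-offs are illustrative."""
    if identity >= 50.0:
        return "relationships usually straightforward"
    if identity > 20.0:
        return "decreasing certainty"
    return "Twilight Zone or below"


# Two short, already-aligned example sequences (no gaps).
pair = ("GAVDFIALCDRYF", "GRVEFLNRCDRYY")
pid = percent_identity(*pair)
print(f"{pid:.1f}% identity: {zone(pid)}")
```

A percentage on its own says nothing about statistical significance for short segments like these; it is only the quantity plotted on the scale.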
23Homology & analogy
- The term homology is confounded & abused!
- sequences are homologous if they are related by divergence from a common ancestor
- analogy relates to the acquisition of common features from unrelated ancestors via convergent evolution
- e.g., β-barrels occur in soluble & membrane proteins; the enzymes chymotrypsin & subtilisin share groups of catalytic residues, with near identical spatial geometries, but no other similarities
- Homology is not a measure of similarity & is not quantifiable
- it is an absolute statement that sequences have a divergent rather than a convergent relationship
- the phrases "the level of homology is high" or "the sequences show 50% homology", or any like them, are strictly meaningless!
- This is not just a semantic issue
- loose use muddies thinking about evolutionary relationships
24A terminology muddle
- The same arguments apply to 3D structures
- structures may be similar, as denoted by RMS positional deviation between compared atomic positions
- but their common evolutionary origin is a hypothesis
- the hypothesis may be correct or mistaken, but their similarity is a fact, no matter how it is interpreted
- Similarity of sequence or structure is just that: similarity
- Homology connotes a common evolutionary origin
- Reeck, G.R., de Haen, C., Teller, D.C., Doolittle, R.F., Fitch, W.M., Dickerson, R.E., Chambon, P., McLachlan, A.D., Margoliash, E., Jukes, T.H. & Zuckerkandl, E. (1987) "Homology" in proteins and nucleic acids: a terminology muddle and a way out of it. Cell, 50, 667.
25More challenges for sequence analysis
- Much of the challenge is in getting the biology right
- this is complicated by the problem of orthology vs paralogy
- Following a search, how much functional annotation can be legitimately inherited by a query?
- source of numerous annotation errors in dbs
- error propagation could lead to an error catastrophe
- Further complications arise due to the modular nature of proteins
- modules are autonomous folding units (protein building blocks)
- they confer a variety of functions on a parent protein, by multiple combinations of the same module, or different modules to form mosaics
- Automatic analysis systems don't distinguish orthologues from paralogues & don't consider the modular nature of proteins
27- Monkeys are exploited in different Goldberg machines, where they perform different functions; here, we could not predict a monkey sitting in that spot, even with total knowledge of the rest of the machine
- Similarity searches are just like this
- identifying the presence of a module tells little of the function of the complete system
- knowing most components of a mosaic, we can't predict a missing one
- modules (monkeys) in different proteins don't always perform exactly the same function
28The Midnight Zone
- Notwithstanding the lessons of Goldberg machines, identifying evolutionary links between sequences is useful
- this often implies a shared function
- In the genome era, prediction of function from sequence is of more immediate value than is the prediction of structure
- However, between distantly-related proteins, structure is more conserved than the underlying sequences
- thus, some relationships are only apparent at the structural level
- Such relationships can't be detected by even the most sensitive sequence comparison methods
- the region of identity where sequence comparisons fail completely to detect structural similarity is the Midnight Zone; there is thus a theoretical limit to the effectiveness of sequence analysis methods
29Significance
- Appreciating that mathematical & biological significance are different is crucial; it is especially important in understanding the limitations of
- search & alignment algorithms, pattern recognition techniques, functional site & structure prediction tools
- Contrary to popular opinion, there is currently still
- no biologically-reliable automatic multiple alignment algorithm
- no infallible pattern-recognition technique
- no reliable gene, function or structure prediction algorithms
31Computers don't do biology!
32Biological Databases
- Overview
- Sequence repositories
- SWISS-PROT & TrEMBL
- Composite sequence databases
- NRDB, SP+TrEMBL
- Family (pattern) resources
- PROSITE, PRINTS, profiles, Pfam, Blocks, eMOTIF
- Composite family databases
- InterPro
33Primary sequence databases
- In the early '80s, when sequence data started to accumulate, several labs saw advantages to establishing central repositories
- trouble is, many labs thought this was a good idea & made their own
- Nucleic acid: EMBL, GenBank, DDBJ
- Protein: PIR, SWISS-PROT, MIPS, JIPID, TrEMBL
- The proliferation of dbs causes problems
- do they have the same format? Which is the most accurate? The most up-to-date? The most comprehensive? Which should we use?
34SWISS-PROT
- Endeavours to provide high-level annotation
- e.g., descriptions of the function of the protein, the organisation of its domains, PTMs, family & disease relationships, variants, etc.
- Contains entries from >5,000 species
- the bulk of these from just a handful of model organisms
- H.sapiens, E.coli, M.musculus, D.melanogaster, S.cerevisiae, etc.
- The quality of its annotations sets it apart from other dbs
- Consequently, it cannot keep pace with the rate of data acquisition from the sequencing centres
37TrEMBL
- A computer-annotated supplement to SP
- has the SP format & contains translations of all CDSs in EMBL
- It has 2 main sections
- SP-TrEMBL contains all entries that will eventually go into SP, but haven't yet been manually annotated
- REM-TrEMBL contains sequences not destined to be in SP
- Igs, fragments of <8 residues, synthetic sequences, etc.
- Arose from the need for a structured SP-like resource, allowing rapid access to genome data, without compromising the quality of SP by including entries with poor analysis & insufficient annotation
39Composite sequence databases
- A solution to the problem of proliferating dbs is to compile a composite
- these render searches very efficient, especially if non-redundant
- Trouble is, there are now several composites, each with their own format & redundancy criteria; the most commonly used are
- NRDB: PDB, SWISS-PROT, PIR, GenPept, GenPept updates
- SP+TrEMBL: SWISS-PROT, TrEMBL
- NRDB & SP+TrEMBL are non-identical, not non-redundant
- but which is best? Which the most comprehensive? The most up-to-date? Which should we use?
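The difference between "non-identical" and "non-redundant" is easy to see in a toy merge. The sketch below is a minimal illustration, not how any real composite is built: the source names, entry identifiers and sequences are all made up, and entries are collapsed only when their sequences are exactly identical.

```python
def merge_non_identical(*sources):
    """Merge (entry_id, sequence) records from several sources,
    keeping one entry per exactly identical sequence."""
    merged = {}
    for name, records in sources:
        for entry_id, seq in records:
            merged.setdefault(seq.upper(), []).append(f"{name}:{entry_id}")
    return merged


# Hypothetical records: G1 duplicates P1 exactly, G2 differs by one residue
# (e.g. a polymorphism or sequencing error), G3 is a fragment of P1.
curated = ("curated", [("P1", "MKTAYIAKQR")])
translated = ("translated", [("G1", "MKTAYIAKQR"),
                             ("G2", "MKTAYIAKQK"),
                             ("G3", "TAYIAKQR")])

composite = merge_non_identical(curated, translated)
for seq, ids in composite.items():
    print(seq, ids)
```

Only the exact duplicate is collapsed; the variant and the fragment both survive, so the composite is non-identical but still redundant.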
40NRDB
- NRDB is built locally at the NCBI
- it includes weekly updates of SP & daily updates of GenBank, so is up-to-date & comprehensive
- But the simplistic manner of its construction causes problems
- multiple copies of the same protein are retained as a result of polymorphisms &/or sequencing errors
- errors corrected in SP are reintroduced when retranslated from DNA
- numerous sequences are duplicates of existing fragments
- The contents of the db are thus error-prone & redundant
- NRDB is the default db of the NCBI BLAST service
41SP+TrEMBL
- This resource is intended to be both comprehensive & minimally redundant
- It contains fewer errors than NRDB, but is not truly non-redundant
- 30% of the combined total of SP & TrEMBL is non-unique
- Further reduction of error rates requires more manual intervention & better expert db management systems
42Family (pattern) databases
- As well as 1° resources, there are also many family or pattern dbs derived from them
- trouble is, they use different 1° sources & different analysis methods, & all have different formats!
- But it isn't all bad: SWISS-PROT is emerging as a standard, & most pattern dbs use it as their basis
- Database: source; stored information
- PROSITE: SWISS-PROT; regular expressions (patterns)
- PRINTS: SWISS-PROT/TrEMBL; aligned motifs (fingerprints)
- Pfam: SWISS-PROT/TrEMBL; Hidden Markov Models (HMMs)
- Profiles: SWISS-PROT; weight matrices (profiles)
- Blocks: InterPro/PRINTS; weighted motifs (blocks)
- eMOTIF: Blocks/PRINTS; permissive regular expressions
43Why create pattern databases?
- Pattern dbs arise from the need to make more specific functional diagnoses than are possible simply by searching the 1° dbs
- They are built on the principle that homologous sequences may be gathered together in multiple alignments, within which are regions (motifs) that show little variation
- these motifs usually reflect some vital biological role in terms of either structure or function
- Motifs are exploited in different ways to build diagnostic patterns for protein families
- new sequences can be searched against dbs of such patterns to see if they can be assigned to known families
- hence they offer a fast track to the inference of function
44What's in a sequence?
45Methods for family analysis
- Single motif methods: exact regex (PROSITE), fuzzy regex (eMOTIF)
- Multiple motif methods: identity matrices (PRINTS), weight matrices (Blocks)
- Full domain alignment methods: profiles (PROFILE LIBRARY), HMMs (Pfam)
46The challenge of family analysis
- highly divergent family with single function?
- superfamily with many diverse functional families?
- must distinguish if function analysis done in silico
- a tough challenge!
47Know your family
48The problem with domains
49PROSITE
- The first pattern db
- based on the idea that a protein family can be characterised by a pattern of conserved residues within a single motif
- Sequence information in motifs is reduced to consensus or regular expressions (regexs); the seed regex is used to search SP
- results are inspected manually to achieve optimal results
- Some families can't be characterised by single motifs
- here, additional regexs are created until an optimal set is achieved that captures most or all of the family
- results are then manually annotated for inclusion in the db
50R-Y-x-[DT]-W-x-[LIVM]-[ST]-T-P-[LIVM](3)
53PRINTS
- Most protein families are characterised by >1 motif
- it is sensible to use many/all of them to build a diagnostic signature
- This is the principle of fingerprints
- these offer improved diagnostic reliability by virtue of the biological context provided by motif neighbours
- Motifs are excised from alignments by hand
- residue information is augmented via iterative searches
- results are manually annotated prior to inclusion in the db
54Motif context
[Figure: motifs 1 to 5 along a sequence, annotated with their order & the intervals between them]
56SUMMARY INFORMATION
37 codes involving 8 elements
0 codes involving 7 elements
0 codes involving 6 elements
0 codes involving 5 elements
0 codes involving 4 elements
1 codes involving 3 elements
0 codes involving 2 elements
COMPOSITE FINGERPRINT INDEX
8   37  37  37  37  37  37  37  37
7    0   0   0   0   0   0   0   0
6    0   0   0   0   0   0   0   0
5    0   0   0   0   0   0   0   0
4    0   0   0   0   0   0   0   0
3    1   0   0   0   1   1   0   0
2    0   0   0   0   0   0   0   0
------------------------------------------
     1   2   3   4   5   6   7   8
True positives..
PRIO_COLGU PRIO_MACFA PRIO_CEREL PRIO_ODOHE
PRIO_GORGO PRIO_PANTR PRIO_HUMAN O46648
PRIO_SHEEP PRIO_CALJA PRIO_BOVIN PRP2_BOVIN
PRIO_ATEPA PRIO_SAISC PRIO_PREFR PRIO_PONPY
O75942 PRIO_CAPHI PRIO_CEBAP PRIO_CAMDR
PRIO_FELCA PRP1_TRAST PRIO_RABIT PRP2_TRAST
PRIO_PIG PRIO_CANFA PRIO_CRIGR PRIO_CRIMI
Q15216 PRIO_RAT PRIO_CERAE PRIO_MUSPF
PRIO_MUSVI PRIO_MESAU PRIO_MOUSE O46593
PRIO_TRIVU
Subfamily: Codes involving 3 elements
Subfamily True positives..
PRIO_CHICK
58Profiles & Pfam
- An alternative to motif-based methods exploits regions between motifs, which also contain valuable information
- the full alignment effectively becomes the discriminator
- A complex scoring scheme allowing for substitutions & INDELs is used to create family-specific profiles
- These profiles can be used to detect distant relationships, where only few residues are conserved
- this is the basis of the Profile library
- In an extension of this approach, alignments are encoded as probabilistic models termed HMMs
- this is the basis of Pfam
62Blocks & eMOTIF
- Various advantages to storing motifs in a raw form
- no information is lost, & different scoring schemes may be used to confer different diagnostic potentials on the same data
- Additional dbs have arisen in this way
- Blocks uses families identified in InterPro, aligns the sequences & detects motifs automatically
- BLOCKS-format PRINTS uses motifs in PRINTS with the Blocks scoring scheme
- eMOTIF creates permissive regexs from Blocks & PRINTS
- These dbs are derived fully automatically & hence offer
- no family annotation (they link back to InterPro & PRINTS)
- no further family coverage
64Composite pattern databases
- To simplify sequence analysis, the family databases are being integrated to create a unified annotation resource: InterPro
- release 4.0 contains 4691 entries
- a central annotation resource, with pointers to its satellite dbs
- initial partners were PRINTS, PROSITE, profiles & Pfam
- new partners include ProDom, TIGRfam, SMART & hopefully others (e.g., Blocks, MetaFam)
- lags behind its sources
- major role in fly & human genome annotation
67Pattern Recognition
- Overview
- Determining significance of db matches
- Pattern recognition methods
- regular expression patterns & rules
- fingerprints & blocks
- profiles & HMMs
- Current status of pattern dbs
68Pattern recognition methods
- These methods classify proteins into families
- the basis of the methods is multiple sequence alignment
- They depend on developing representations of conserved elements of alignments that may be diagnostic of structure or function, whether from
- homologous sequence families
- sequences that share some structural/functional domains
69- Single motif methods: exact regex (PROSITE), fuzzy regex (eMOTIF)
- Multiple motif methods: identity matrices (PRINTS), weight matrices (Blocks)
- Full domain alignment methods: profiles (PROFILE LIBRARY), HMMs (Pfam)
70Determining significance of database matches
- When searching a db, the challenge for analysis methods is to determine if matches are related (true-positive) or unrelated (true-negative)
- At a given scoring threshold, it is likely that unrelated sequences will be matched erroneously (false-positives) & some correct matches will be missed (false-negatives)
- The aim is to improve the resolution between the curves
- in the overlap, it is difficult or impossible to establish if matches are significant
- Different methods tackle this problem in different ways
71Resolving true & false matches
[Figure: number of matches (N) plotted against score; the true-negative curve is labelled]
73Regular expressions (patterns)
- These are derived from single conserved regions in alignments
- they are minimal expressions, so sequence information is lost
- the more divergent the sequences used, the more fuzzy & poorly discriminating the regex becomes
- Alignment
- GAVDFIALCDRYF
- GPIDFVCFCERFY
- GRVEFLNRCDRYY
- Regex
- G-x-[IV]-[DE]-F-[IVL]-x(2)-C-[DE]-R-[FY](2)
- Regexs do not tolerate similarity
- sequences either match or not, regardless of how similar they are
- matching is a binary on-off event & frequently misses true matches
- single-motif methods are very hit-or-miss: how do you know if you've encoded the best region?
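The on-off nature of regex matching can be tried directly. The sketch below converts a PROSITE-style pattern into Python `re` syntax and applies it to the three aligned sequences. The converter `prosite_to_regex` is a hypothetical helper written for this illustration; it covers only plain residues, `x`, `[...]`, `{...}` and repeat counts, and the variant sequence at the end is made up.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into Python regex syntax.

    Handles residues, x (any residue), [ABC] (allowed), {ABC} (forbidden)
    and repeats such as x(2) or x(2,4). Anchors (< and >) are not handled.
    """
    element_re = r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+(?:,\d+)?)\))?"
    parts = []
    for element in pattern.strip(".").split("-"):
        m = re.fullmatch(element_re, element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, repeat = m.groups()
        if core in ("x", "X"):
            core = "."                       # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # forbidden residues
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)


pattern = "G-x-[IV]-[DE]-F-[IVL]-x(2)-C-[DE]-R-[FY](2)"
regex = prosite_to_regex(pattern)

alignment = ["GAVDFIALCDRYF", "GPIDFVCFCERFY", "GRVEFLNRCDRYY"]
matches = [bool(re.search(regex, seq)) for seq in alignment]

# A made-up variant with one conservative change (I -> M at position 6)
# fails outright: the regex has no notion of "nearly matching".
variant_matches = bool(re.search(regex, "GAVDFMALCDRYF"))
print(regex, matches, variant_matches)
```

The variant differs from the first aligned sequence at a single position, yet it is rejected exactly as a completely unrelated sequence would be.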
74In the beginning was PROSITE
- G_PROTEIN_RECEPTOR PATTERN
- PS00237
- G-protein coupled receptor signature
- [GSTALIVMYWC]-[GSTANCPDE]-{EDPKRH}-x(2)-[LIVMNQGA]-x(2)-[LIVMFT]-[GSTANC]-[LIVMFYWSTAC]-[DENH]-R
- /TOTAL=1121(1121); /POS=1057(1057); /FALSE_POS=64(64)
- /FALSE_NEG=112; /PARTIAL=48; /UNKNOWN=0(0)
- This represents an apparent 20% error rate
- the actual rate is probably higher
- Thus, a match to a pattern is not necessarily true
- a mis-match is not necessarily false!
- False-negatives are a fundamental limitation to this type of pattern matching
- if you don't know what you're looking for, you'll never know you missed it!
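The entry's counts can be turned into rates with a few lines of arithmetic. The slide does not say how its error figure was derived; the sketch below shows one reading (false positives, false negatives and partial matches, relative to the total number of hits) that happens to land close to it. The variable names are illustrative.

```python
# Counts quoted in the PS00237 entry above.
total_hits = 1121
true_pos = 1057
false_pos = 64
false_neg = 112
partial = 48

# Fraction of reported hits that are genuine family members.
precision = true_pos / (true_pos + false_pos)

# Fraction of known family members that the pattern finds.
recall = true_pos / (true_pos + false_neg)

# One possible "apparent error rate": everything wrong or doubtful,
# relative to the number of hits (an assumption, see lead-in).
apparent_error = (false_pos + false_neg + partial) / total_hits

print(f"precision {precision:.3f}, recall {recall:.3f}, "
      f"apparent error {apparent_error:.3f}")
```

Whichever definition is used, the false-negative count only covers family members already known to the annotators, which is why the true rate is likely to be higher.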
76Regular expressions (rules)
- Regex patterns are most effective when applied to highly-conserved, family-specific motifs
- It is often possible to identify shorter generic patterns within sequences, characteristic of common functional sites
- Functional site: Rule
- N-glycosylation: N-{P}-[ST]-{P}
- Protein kinase C phosphorylation: [ST]-x-[RK]
- Casein kinase II phosphorylation: [ST]-x(2)-[DE]
- Such features result from convergence to a common property
- glycosylation sites, phosphorylation sites, etc.
- They cannot be used for family diagnosis: they don't discriminate
- they can only be used to suggest whether a certain functional site might exist (which must then be tested by experiment)
- such patterns are normally termed rules
77Residue groups for fuzzy regexs
- It is possible to assign residues to groups based on various biochemical properties, e.g., charge & size
- using such groups theoretically ensures that resulting regexs have sensible biochemical interpretations
- small: Ala, Gly
- small hydroxyl: Ser, Thr
- basic: His, Lys, Arg
- aromatic: Phe, Tyr, Trp
- aliphatic: Val, Leu, Ile, Met
- acidic/amide: Asp, Glu, Asn, Gln
- small/polar: Ala, Gly, Ser, Thr, Pro
- This is more flexible than exact regex matching
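To show how such groups loosen a pattern, the sketch below builds a permissive pattern from an alignment by replacing each variable column with the smallest listed group that covers it. This is a simplified illustration under stated assumptions, not the eMOTIF algorithm: the function names are invented, and the three aligned sequences are the short example alignment used for exact regexs.

```python
# Residue groups from the list above, as single-letter codes.
GROUPS = {
    "small": "AG",
    "small hydroxyl": "ST",
    "basic": "HKR",
    "aromatic": "FYW",
    "aliphatic": "VLIM",
    "acidic/amide": "DENQ",
    "small/polar": "AGSTP",
}


def column_class(residues):
    """Describe one alignment column as a residue, a group, or x."""
    observed = set(residues)
    if len(observed) == 1:
        return next(iter(observed))            # fully conserved
    covering = [g for g in GROUPS.values() if observed <= set(g)]
    if not covering:
        return "x"                             # no single group fits
    return "[" + min(covering, key=len) + "]"  # tightest covering group


def fuzzy_pattern(alignment):
    """Join the per-column descriptions into a PROSITE-like pattern."""
    return "-".join(column_class(col) for col in zip(*alignment))


alignment = ["GAVDFIALCDRYF", "GPIDFVCFCERFY", "GRVEFLNRCDRYY"]
print(fuzzy_pattern(alignment))
```

Compared with an exact regex built from the same columns, the group-based pattern also accepts residues never seen in the alignment (e.g. Met or Trp), which is the source of both its extra sensitivity and its extra noise.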
78Diagnostic limitations
- Consider the sequence motif Asp-Ala-Val-Ile-Asp (DAVID)
- results of searching for such a motif will differ, depending on the db, the motif length & whether we use exact or permissive fuzzy regexs
- Pattern: Matches
- D-A-V-I-D: 71 (99)
- D-A-V-I-[DEQN]: 252
- [DEQN]-A-V-I-[DEQN]: 925
- [DEQN]-A-[VLI]-I-[DEQN]: 2,739
- [DEQN]-[AG]-[VLI]-[VLI]-[DEQN]: 51,506
- D-A-V-E: 1,088 (1,493)
- (number of matches in OWL29.6, with OWL31.1 in brackets)
- Use of fuzzy regexs has the potential advantage of being able to recognise more distant relationships
- but the inherent disadvantage that more matches will be made by chance, making it difficult to separate true matches from noise
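A back-of-envelope model shows why permissive patterns collect chance matches so quickly. The sketch below assumes, purely for illustration, that all 20 residues are equally likely and independent; under that assumption the expected number of chance matches grows in proportion to the product of the class sizes at each position. The function names are invented for this example.

```python
from math import prod


def class_sizes(pattern):
    """Number of residues allowed at each position of a bracketed pattern."""
    return [len(element.strip("[]")) for element in pattern.split("-")]


def fold_increase(pattern):
    """Expected chance matches relative to a fully exact pattern of the
    same length, assuming uniform, independent residue frequencies."""
    return prod(class_sizes(pattern))


patterns = [
    "D-A-V-I-D",
    "D-A-V-I-[DEQN]",
    "[DEQN]-A-V-I-[DEQN]",
    "[DEQN]-A-[VLI]-I-[DEQN]",
    "[DEQN]-[AG]-[VLI]-[VLI]-[DEQN]",
]
factors = [fold_increase(p) for p in patterns]
for p, f in zip(patterns, factors):
    print(f"{p}: x{f}")
```

Real databases do not have uniform residue frequencies, so observed match counts need not follow these factors; the point is only that each widened position multiplies the background.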
79Fingerprints
- Fingerprints are groups of conserved (ungapped) motifs excised from alignments & used for iterative db searching
- no weighting scheme is used
- searches depend only on residue frequencies
- resulting scoring matrices are thus sparse
- Each motif trawls the db independently
- search results are correlated to determine which sequences match all the motifs & which match only partially
- no information is thrown away
- The iterative process refines the fingerprint & increases its power
- potency is gained from the mutual context of motif neighbours
- results are biologically more meaningful than those from single motifs
80[Figure labels: TM domain, TM domain, loop region]
81[Figure labels: loop region, TM domain, TM domain]
82A fingerprinting overview
PRINTS
annotation
83How fingerprints are stored
84(a)
YVTVQHKKLRTPL
YVTVQHKKLRTPL
YVTVQHKKLRTPL
AATMKFKKLRHPL
AATMKFKKLRHPL
YIFATTKSLRTPA
VATLRYKKLRQPL
YIFGGTKSLRTPA
WVFSAAKSLRTPS
WIFSTSKSLRTPS
YLFSKTKSLQTPA
YLFTKTKSLQTPA
(b)
  T  C  A  G  N  S  P  F  L  Y  H  Q  V  K  D  E  I  W  R  M  B  X  Z
  0  0  2  0  0  0  0  0  0  7  0  0  1  0  0  0  0  2  0  0  0  0  0
  0  0  3  0  0  0  0  0  2  0  0  0  4  0  0  0  3  0  0  0  0  0  0
  6  0  0  0  0  0  0  6  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  1  0  1  1  0  3  0  0  1  0  0  0  3  0  0  0  0  0  0  2  0  0  0
  2  0  1  1  0  0  0  0  0  0  0  3  0  4  0  0  0  0  1  0  0  0  0
  4  0  1  0  0  1  0  2  0  1  3  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  0  0 12  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  6  0  0  0  0  0  0  0  6  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0 12  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0  0  0  0  2  0  0  0  0  0  0 10  0  0  0  0
  9  0  0  0  0  0  0  0  0  0  2  1  0  0  0  0  0  0  0  0  0  0  0
  0  0  0  0  0  0 12  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  0  0  4  0  0  2  0  0  6  0  0  0  0  0  0  0  0  0  0  0  0  0  0
(c)
  T  C  A  G  N  S  P  F  L  Y  H  Q  V  K  D  E  I  W  R  M  B  X  Z
  0  0  4  0  0  0  0  8  4 34  0  0 15  0  0  0  1  7  0  0  0  0  0
  0  4 15  0  0  0  0  0  7  0  0  0 37  0  0  0 10  0  0  0  0  0  0
 50  0  0  0  0  3  0 18  0  0  0  0  0  0  0  0  0  0  0  2  0  0  0
- Key
- (a) motif, with 3 conserved positions
- (b) corresponding frequency matrix
- (c) same matrix, but after 3 iterations
- (d) same matrix, with PAM250 weighting
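The frequency matrix is nothing more than per-position residue counts, so it can be rebuilt from the motif. The sketch below counts residues in each column of the 12-sequence motif, using the same column order as the slide; the function and variable names are illustrative.

```python
from collections import Counter

# The 12 aligned motif segments shown on the slide.
MOTIF = [
    "YVTVQHKKLRTPL", "YVTVQHKKLRTPL", "YVTVQHKKLRTPL",
    "AATMKFKKLRHPL", "AATMKFKKLRHPL", "YIFATTKSLRTPA",
    "VATLRYKKLRQPL", "YIFGGTKSLRTPA", "WVFSAAKSLRTPS",
    "WIFSTSKSLRTPS", "YLFSKTKSLQTPA", "YLFTKTKSLQTPA",
]

# Column order used by the matrix on the slide.
COLUMNS = "TCAGNSPFLYHQVKDEIWRMBXZ"


def frequency_matrix(motif, columns=COLUMNS):
    """One row per motif position, one count per residue in `columns`."""
    rows = []
    for position in zip(*motif):
        counts = Counter(position)
        rows.append([counts.get(res, 0) for res in columns])
    return rows


matrix = frequency_matrix(MOTIF)

# Positions (1-based) where a single residue fills the whole column.
conserved = [i + 1 for i, row in enumerate(matrix) if max(row) == len(MOTIF)]

for row in matrix:
    print(" ".join(f"{n:2d}" for n in row))
print("conserved positions:", conserved)
```

Because no substitution weighting is applied, most cells are zero; this is the sparseness that distinguishes fingerprint matrices from weighted profiles.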
85Fingerprint visualisation
- The full potency of fingerprinting is gained from the mutual context provided by motif neighbours
- This is important, as the method inherently implies a biological context to motifs that are matched in the correct order in the query sequences, with appropriate distances between them
- This allows sequence identification even when parts of the fingerprint are absent
- e.g., a sequence that matches only 4 of 7 motifs may still be diagnosed as a true match if the pattern of motif matching is consistent with that expected of true neighbouring motifs
- Such matches are best visualised graphically
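The order-and-interval idea can be sketched as a simple consistency check on a partial match. Everything below is hypothetical: the function name, the reference positions, the tolerance and the minimum of 4 matched motifs are assumptions for illustration, not the scoring used by any fingerprint-scanning software.

```python
def consistent_with_fingerprint(hits, reference, min_motifs=4, tolerance=10):
    """Check a partial fingerprint match for correct order and spacing.

    hits:      {motif number: start position in the query}
    reference: {motif number: typical start position in known members}
    """
    matched = sorted(hits)
    if len(matched) < min_motifs:
        return False
    for a, b in zip(matched, matched[1:]):
        gap = hits[b] - hits[a]
        expected = reference[b] - reference[a]
        # Later motifs must lie downstream, at roughly the expected distance.
        if gap <= 0 or abs(gap - expected) > tolerance:
            return False
    return True


# Made-up typical start positions for a 7-motif fingerprint.
reference = {1: 10, 2: 40, 3: 75, 4: 110, 5: 150, 6: 190, 7: 230}

partial_in_order = {1: 12, 3: 78, 4: 115, 7: 236}   # 4 of 7, sensible spacing
partial_shuffled = {1: 200, 3: 78, 4: 115, 7: 12}   # same motifs, wrong order

print(consistent_with_fingerprint(partial_in_order, reference))
print(consistent_with_fingerprint(partial_shuffled, reference))
```

Four isolated motif hits scattered in the wrong order carry much less weight than four hits that sit where their neighbours predict, which is the context a single-motif method cannot use.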
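86Visualising fingerprints
[Figure: PRINTS fingerprint (ID) motifs plotted along the query sequence from N- to C-terminus, with a possible missing motif marked]
87[Figure: sequence plot from N- to C-terminus]
88[Figure: sequence plot from N- to C-terminus]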
91Blocks
- Blocks are groups of motifs derived automatically from families identified in InterPro
- sequences are aligned automatically & motifs are automatically identified by searching for spaced residue triplets (e.g., AxxxVxxC)
- a block score is calculated using the BLOSUM62 matrix
- validity of blocks is confirmed with a 2nd motif-finding algorithm
- blocks found by both methods are considered reliable
- Sequences within motifs are clustered to reduce contributions to residue frequencies from sets of closely-related sequences
- each cluster is treated as a single sequence & given a score that gives a measure of its relatedness
- the higher the weight, the more dissimilar the segment from others in the block, the most distant being given a score of 100
- segments <80% similar are separated by blank lines
93CSC triplet
94Profiles
- Profiles are scoring tables derived from full domain alignments
- these define which residues are allowed at given positions
- which positions are conserved & which degenerate
- which positions, or regions, can tolerate insertions
- the scoring system is intricate, & may include evolutionary weights, results from structural studies, & data implicit in the alignment
- variable penalties are specified to weight against INDELs occurring in core 2° structure elements
- Within a profile, the I & M fields contain position-specific scores for insert & match positions
- in conserved regions, INDELs aren't totally forbidden, but are strongly impeded by large penalties defined in the DEFAULT field
- these are superseded by more permissive values in gapped regions
- the inherent complexity of profiles renders them highly potent discriminators, but they are time-consuming to derive
97Hidden Markov Models
- HMMs are similar in concept to profiles by virtue of encoding full domain alignments
- they are probabilistic models consisting of a number of inter-connecting states
- essentially, linear chains of match, delete or insert states
- Match states are assigned to conserved columns in an alignment
- insert states allow for insertions relative to match states
- delete states allow match positions to be skipped
- thus, building an HMM from an alignment requires each position to be assigned either to match, delete or insert states
- HMMs usually perform well, but can be over-trained
- they may also suffer if they are created from an iterative automatic alignment process: if this once accepts a false match, the HMM will become corrupt
98An HMM
[Figure: HMM state diagram with residue labels C, L, Y, E & C, L, W, D]
99Which craft is best?
- The wide variety of methods available leads to
familiar problems - which should we use?
- which is the most reliable?
- which is the most comprehensive?
- ......etc.
- None of the pattern-recognition techniques is infallible, & none of the resulting pattern dbs is complete
- bearing in mind the diagnostic strengths & weaknesses of the different approaches, & always keeping biological significance in mind, the best strategy is simply to use them all
100Overview of resources
- PROSITE (SIB) - 1108 entries
- single motifs (regexs) - best with small highly conserved sites
- Profile library (ISREC) - 300 entries
- weight matrices - good with divergent domains & superfamilies
- PRINTS (Manchester) - 1750 entries
- multiple motifs (fingerprints) - best for families and sub-families
- Pfam (Sanger Centre) - 3071 entries
- HMMs - good with divergent domains & superfamilies
- InterPro (EBI) - 4691 entries
- derived from PRINTS, PROSITE, Profiles, Pfam, ProDom, etc.
- Blocks (FHCRC) - 2608 entries
- multiple motifs (derived from InterPro & PRINTS)
- eMOTIF (Stanford)
- permissive regexs (derived from PRINTS & BLOCKS)
101Building a Search Protocol
- Overview
- The usual starting point
- searching the primary data sources
- Pattern recognition methods
- searching the secondary sources
- Structural & functional interpretation of results
- Estimating significance
- when do we believe a result?
102A practical approach
- Given a newly-determined sequence, we want to know
- what is my protein?
- to what family does it belong?
- what is its function?
- how can we explain its function in structural terms?
- To this end, by searching pattern dbs & fold libraries, we may recognise patterns that allow us to infer relationships with previously-characterised families/folds
- Given the variety of dbs to search, how do we use them to build a sensible search protocol for novel sequences?
103- Protein sequence database identity search
- e.g., for short fragments, pinpoints identical matches to probe; may identify correct reading frame
- Protein sequence database similarity search
- e.g., nrdb, SP, SP+TrEMBL; identifies potential homologues to probe
- Protein pattern database search
- e.g., PROSITE, profiles, PRINTS, Blocks, Pfam; identifies family relationships or pinpoints key structural or functional sites
- Known structure: structure classification database query
- e.g., scop, CATH, FSSP; provides details of structural class, secondary structure information, ligand-binding, etc.
- No known structure: protein fold pattern library search
- e.g., threading; identifies compatible folds for the probe sequence
104Searching the primary databases
- Identity searching
- the fastest test of an unknown fragment is to perform an identity search. This will reveal in seconds whether an exact match to the unknown peptide already exists
- This can be helpful in identifying the correct reading frame following a 6-frame translation
- ccgtactacaactacgctggtgcattcaag
- Forward 0: PYYNYAGAFK
- Forward 1: RTTTTLVHS
- Forward 2: VLQLRWCIQ
- Reverse 0: LECTSVVVVR
- Reverse 1: LNAPA!L!Y
- Reverse 2: !MHQRSCST
- (! marks a stop codon)
- Identity search hits for the Forward 0 translation
- TRFE_XENLA 207 AGIKEHKCSRSNNE PYYNYAGAFK CLQDDQGDVAFVKQ
- XLTRSFER 207 AGIKEHKCSRSNNE PYYNYAGAFK CLQDDQGDVAFVKQ
- TRFE_XENLA TRANSFERRIN PRECURSOR - XENOPUS LAEVIS
- XLTRSFER TRANSFERRIN PRECURSOR - XENOPUS LAEVIS
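The 6-frame translation and identity search on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the tool used in the practical: it assumes the standard genetic code, writes stop codons as "!" to match the slide, and searches a toy one-entry "database" (the name `TOY_DB` and all function names are hypothetical) that holds only the TRFE_XENLA fragment shown above rather than the full entry.

```python
# Standard genetic code, built from the usual TCAG-ordered amino-acid string.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Toy database (assumption): only the fragment displayed on the slide.
TOY_DB = {"TRFE_XENLA fragment": "AGIKEHKCSRSNNEPYYNYAGAFKCLQDDQGDVAFVKQ"}


def reverse_complement(dna):
    """Return the reverse complement of an unambiguous DNA string."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna, stop="!"):
    """Translate complete codons only; unknown codons become 'X'."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(c, "X").replace("*", stop) for c in codons)


def six_frame_translation(dna):
    """Return {'Forward 0': ..., ..., 'Reverse 2': ...} for a DNA fragment."""
    frames = {}
    for label, strand in (("Forward", dna), ("Reverse", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{label} {offset}"] = translate(strand[offset:])
    return frames


def identity_search(frames, database):
    """Return (frame, entry, 1-based position) for every exact substring match."""
    hits = []
    for frame, peptide in frames.items():
        for entry, sequence in database.items():
            position = sequence.find(peptide) if peptide else -1
            if position != -1:
                hits.append((frame, entry, position + 1))
    return hits


if __name__ == "__main__":
    frames = six_frame_translation("ccgtactacaactacgctggtgcattcaag")
    for name, peptide in frames.items():
        print(name, peptide)
    print(identity_search(frames, TOY_DB))
```

Only the Forward 0 peptide (PYYNYAGAFK) occurs in the fragment, which is how an exact match can point to the correct reading frame. The reported position is relative to the toy fragment, not to the full database entry.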
105Similarity searching
- Whether or not an identity search finds a match, the next step is to look for similar sequences
- e.g., you may wish to know if a wider family exists
- The most rapid & simple option is to use BLAST, flavours of it, or FastA
- Several features are worthy of note in BLAST output
- look for high scores with low P-values (unlikely to be random)
- look for clusters of high scores at the top of the hitlist (a family?)
- look for trends in the type of sequences matched
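The three checks above can be sketched as a small triage routine over a parsed hit list. This is an illustration under stated assumptions: the hits, their scores and the cut-off of 1e-3 are invented for the example, and the function names are hypothetical. The one fixed relationship used is the BLAST conversion between E-values and P-values, P = 1 - exp(-E), so for small values the two are almost identical.

```python
import math
from collections import Counter

# Invented hit list (assumption): not real BLAST output.
HITS = [
    {"id": "hit_A", "description": "transferrin precursor", "score": 52.0, "evalue": 2e-6},
    {"id": "hit_B", "description": "unrelated protein", "score": 24.0, "evalue": 3.1},
    {"id": "hit_C", "description": "transferrin precursor", "score": 48.0, "evalue": 9e-5},
]


def evalue_to_pvalue(evalue):
    """P = 1 - exp(-E): chance of at least one random hit scoring this well."""
    return -math.expm1(-evalue)


def triage_hits(hits, max_evalue=1e-3):
    """Rank hits by E-value and keep those at or below an arbitrary cut-off."""
    ranked = sorted(hits, key=lambda hit: hit["evalue"])
    return [hit for hit in ranked if hit["evalue"] <= max_evalue]


def description_trend(hits):
    """Count how often each description occurs among the given hits."""
    return Counter(hit["description"].lower() for hit in hits).most_common()


if __name__ == "__main__":
    kept = triage_hits(HITS)
    for hit in kept:
        print(hit["id"], hit["score"], hit["evalue"], evalue_to_pvalue(hit["evalue"]))
    print(description_trend(kept))
```

A cluster of retained hits sharing one description is the kind of trend that hints at a family. The cut-off is a judgment call and should be adjusted to the database size and the question being asked.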
106Ideal results show high scores & low E-values
108Why bother with pattern searches?
- Primary searches won't always allow outright diagnosis
- BLAST & FASTA are not infallible
- BLAST, in particular, often can't assign significant scores
- results may be complicated by the presence of modules, or compositionally-biased regions
- annotations of retrieved hits may be incorrect
- Pattern dbs contain potent descriptors
- so, distant relationships missed by BLAST may be captured by one or more of the family or functional site distillations
110Searching the pattern databases
- Searching PROSITE
- when using PROSITE's Web form, it is advisable to exclude rules from the search, otherwise output is filled with spurious matches
- results are either match, or no match
- the user has to judge whether hits are significant
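The match/no-match behaviour of a PROSITE search can be illustrated by converting a pattern into an ordinary regular expression. The sketch below is a simplified converter, not PROSITE's own software: it handles x, x(n), x(n,m), [..], {..} and the < and > anchors at the ends of a pattern, and ignores rarer syntax. The function names and the 12-residue test sequence are made up for the example; the pattern N-{P}-[ST]-{P} is the PROSITE N-glycosylation site pattern.

```python
import re

# One pattern element: a core (residue, x, [..] or {..}) plus optional (n) or (n,m).
_ELEMENT = re.compile(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern into a Python regular expression string."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        prefix = "^" if element.startswith("<") else ""
        suffix = "$" if element.endswith(">") else ""
        core, low, high = _ELEMENT.fullmatch(element.strip("<>")).groups()
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # any residue except those listed
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def scan(pattern, sequence):
    """Return (1-based start, matched text) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    print(prosite_to_regex("N-{P}-[ST]-{P}."))
    print(scan("N-{P}-[ST]-{P}.", "MKNGSAAANPSA"))
```

A pattern this short will match many unrelated sequences purely by chance, and the regular expression returns no score to separate real sites from random ones. That is why the judgment is left to the user.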
113Searching the pattern databases
- Searching Profiles
- the SIB Web server offers access both to profiles within PROSITE & pre-release (undocumented) profiles
- results are highly specific & generally diagnostically reliable
- if no match is returned, it's usually because the entry isn't in the db
- matches to undocumented profiles are often dead-ends
115Searching the pattern databases
- Searching Pfam
- results are returned in HTML tables accompanied by simple graphics to illustrate matched domains
- results are specific & usually diagnostically reliable
- E-values provide the measure of confidence
117Searching the pattern databases
- Searching PRINTS
- results are returned in HTML tables on different levels
- a best "guess"
- the top 10 best-scoring matches
- the raw data
- graphical options provide a visual impression of the quality of matches
- results are specific & usually diagnostically reliable
- combined E- & p-values provide the measure of confidence
120Searching the pattern databases
- Searching Blocks
- if results of searching PROSITE & PRINTS are positive, we would expect these to be confirmed by searches of the Blocks dbs
- key features to note in the output are
- the description line, the accession codes (which indicate which is the matched motif), & the best-scoring or anchor block
- most important is the detection of multiple block hits; where this happens, an E-value denotes the significance of the match
- single block matches are usually spurious
122Searching the pattern databases
- Searching eMOTIF
- as with Blocks, if results of searching PROSITE & PRINTS are positive, this should be confirmed by searches of eMOTIF
- output is given at several stringency levels, which indicate the number of false matches to expect in the reported results
124Which approach is best?
- BLAST frequently fails to assign significant scores
- The hit-or-miss nature of single-motif regular expressions can render them worthless
- In spite of (because of?) their complexity, profiles & HMMs are often out-performed by simpler motif methods
- The non-weighting system of fingerprints means that Twilight relationships may be missed
- The scoring system used to create blocks generates large amounts of noise that may obscure the signal
- Only PROSITE & PRINTS are fully manually annotated
- No method alone is best
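Since no single method is best, one simple way to read results from several pattern dbs together is to tally which families are supported by more than one of them. The sketch below is only an illustration: the result lists are invented, the function name is hypothetical, and it assumes family names have already been mapped to a common label across databases, which in practice takes cross-references or manual checking.

```python
from collections import defaultdict

# Invented search results (assumption): database name -> family labels reported.
RESULTS = {
    "PROSITE": ["transferrin"],
    "PRINTS": ["transferrin"],
    "Pfam": ["transferrin"],
    "Blocks": ["transferrin", "kinase"],
    "eMOTIF": [],
}


def consensus(results):
    """Return (family, supporting databases) pairs, best-supported first."""
    support = defaultdict(set)
    for database, families in results.items():
        for family in families:
            support[family].add(database)
    return sorted(
        ((family, sorted(databases)) for family, databases in support.items()),
        key=lambda item: (-len(item[1]), item[0]),
    )


if __name__ == "__main__":
    for family, databases in consensus(RESULTS):
        print(f"{family}: {len(databases)} dbs ({', '.join(databases)})")
```

A family reported by several independent methods deserves more trust than one reported by a single method, but a count is no substitute for checking the biological sense of each match.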
125Structural & functional interpretation
- Db searches often do little more than identify a protein family
- this only scratches the surface; we still want to know what our protein does & what it might look like
- The first step is to examine the detailed family documentations in PROSITE, PRINTS & InterPro
- these should help to elucidate the function of the protein
- The next step is to examine the fold classification & structure summary resources
- e.g., scop, CATH & PDBsum, assuming that a structure is in fact available.
131Estimating significance
- When do we believe a result?
- a real example.....
144Conclusions
- What are the lessons for sequence analysis?
- when searching for distant homologues, several dbs should be searched
- different methods provide different perspectives
- dbs aren't complete & their contents don't fully overlap
- The more dbs searched, the more difficult it can be to interpret results
- hence s/w is being designed to provide "intelligent" consensus outputs
- The more computers are involved in automating genome annotation, the greater the need for collaboration
- especially between s/w developers, annotators & biologists
- The more data we have to handle, the more rigorous we must be in our thinking (& writing) if we are to make sense of the complexities
- We are a long way from having reliable tools for deducing protein structure & function from sequence
- but with the right approach, there is hope