Biological Databases for Protein Sequence Analysis

1
Biological Databases for Protein Sequence Analysis
  • Teresa K. Attwood
  • School of Biological Sciences
  • University of Manchester, Oxford Road
  • Manchester M13 9PT, UK
  • http://www.bioinf.man.ac.uk/dbbrowser/

2
Overview
  • Introduction
  • Web practical, science fact & fiction, some
    reality checks
  • Biological databases
  • sequence, family, composite, etc.
  • Pattern recognition
  • regular expressions, fingerprints, profiles, etc.
  • Building a search protocol
  • a real example

3
Introduction
  • Single- & three-letter amino acid codes
  • G Glycine Gly P Proline Pro
  • A Alanine Ala V Valine Val
  • L Leucine Leu I Isoleucine Ile
  • M Methionine Met C Cysteine Cys
  • F Phenylalanine Phe Y Tyrosine Tyr
  • W Tryptophan Trp H Histidine His
  • K Lysine Lys R Arginine Arg
  • Q Glutamine Gln N Asparagine Asn
  • E Glutamic Acid Glu D Aspartic Acid Asp
  • S Serine Ser T Threonine Thr
  • Additional codes
  • B Asn/Asp Z Gln/Glu X Any amino acid
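The code table above maps directly onto a lookup structure. A minimal sketch in Python: the 20 standard codes are those on the slide; the three-letter forms for the ambiguity codes (Asx, Glx, Xaa) are the usual IUPAC ones and are not given on the slide, and the function name is only illustrative.

```python
# One-letter -> three-letter codes, as listed on the slide.
ONE_TO_THREE = {
    "G": "Gly", "P": "Pro", "A": "Ala", "V": "Val", "L": "Leu",
    "I": "Ile", "M": "Met", "C": "Cys", "F": "Phe", "Y": "Tyr",
    "W": "Trp", "H": "His", "K": "Lys", "R": "Arg", "Q": "Gln",
    "N": "Asn", "E": "Glu", "D": "Asp", "S": "Ser", "T": "Thr",
}

# Ambiguity codes; the three-letter forms are an assumption (IUPAC usage).
AMBIGUOUS = {"B": "Asx", "Z": "Glx", "X": "Xaa"}


def to_three_letter(seq: str) -> str:
    """Spell out a one-letter sequence using three-letter codes."""
    table = {**ONE_TO_THREE, **AMBIGUOUS}
    return "-".join(table[residue] for residue in seq.upper())


print(to_three_letter("DAVID"))
```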

4
(No Transcript)
5
Basic definitions
  • Primary structure
  • the linear sequence of amino acids in a protein
  • Secondary structure
  • regions of local regularity
  • i.e., α-helices, β-strands, β-sheets & β-turns

6
Definitions contd.
  • Super-secondary structure
  • the packing of secondary structure elements into
    stable units
  • e.g., β-barrels, βαβ units, Greek keys, etc.

7
Definitions contd.
  • Tertiary structure
  • the overall chain fold that results from packing
    of secondary structure elements

8
Definitions contd.
  • Quaternary structure
  • the arrangement of separate chains within a
    protein that has more than one subunit
  • e.g., haemoglobin

9
Definitions contd.
  • Quinternary structure
  • the arrangement of separate molecules, such as in
    protein-protein or protein-nucleic acid
    interactions

10
The practical - BioActivity
  • BioActivity: sequence analysis in action
  • begin with a fragment of a DNA sequence
  • try to find out what protein this codes for, the
    family to which it belongs, & whether its
    function & structure are known
  • The practical is entirely Web-based
  • be mindful of traffic & don't waste time on slow
    links
  • Most important of all
  • read the instructions!
  • The Web is constantly evolving....
  • please report dead links (otherwise they'll stay
    dead)!

11
Importance of sequence analysis
  • >900,000 sequences available in public dbs
  • millions more (including ESTs) in proprietary
    dbs
  • these numbers will snowball with completion of more
    genomes
  • so what?
  • Locked up in sequences is a huge amount of
    structural, functional & evolutionary info
  • they're a highly valuable resource
  • By contrast, the number of unique protein structures
    is 2000
  • a huge information deficit

12
The legacy of the genome projects: Sequence-structure deficit
[Figure: growth curves, 1988 to 2002; vertical axis ticks from 100 to 800]
  • Non-redundant growth of sequences during
    1988-2002 & the corresponding growth in
    the number of structures.

13
Challenges for bioinformatics
  • Spurred on by the seq/structure deficit, the
    challenges:
  • rationalise the mass of sequence data
  • derive more efficient means of data storage
  • design more incisive & reliable analysis tools
  • The imperative - to convert sequence information
    into biochemical & biophysical knowledge
  • to decipher the structural, functional &
    evolutionary clues encoded in the language of
    biological sequences

14
The Holy Grail of bioinformatics
  • ...to be able to understand the words in a
    sequence sentence that form a particular protein
    structure

15
The reality of sequence analysis
  • ...isn't so glamorous....but means we can
    recognise words that form characteristic
    patterns, even if we don't know the precise
    syntax to build complete protein sentences

16
Pattern recognition & prediction
  • In investigating the meaning of sequences, two
    distinct analytical approaches have emerged
  • pattern recognition is used to detect similarity
    between sequences & hence to infer related
    structures & functions
  • ab initio prediction is used to deduce structure,
    & to infer function, directly from sequence
  • These methods are quite different!
  • pattern recognition methods demand that some
    characteristic has been seen before & housed in a
    db
  • prediction methods remove the need for template
    dbs, because deductions are made directly from
    sequence

17
Science fact & fiction
  • Sequence pattern recognition is easier to
    achieve, & is much more reliable, than fold
    recognition
  • which is 50% reliable even in expert hands
  • Prediction is still not possible
  • & is unlikely to be so for decades to come (if
    ever)
  • Structural genomics will yield representative
    structures for many (but not all) proteins in
    future
  • structures of new sequences will be determined by
    modelling
  • prediction will become an academic exercise
  • But, to debunk a popular myth, knowing structure
    alone does not inherently tell us function

18
A reality check
  • What is the function of this structure?
  • What is the function of this sequence?
  • What is the function of this motif?
  • the fold provides a scaffold, which can be
    decorated in different ways by different
    sequences to confer different functions; knowing
    the fold & function allows us to rationalise how
    the structure effects its function at the
    molecular level

19
A test case for structural genomics
Structure-based assignment of the biochemical
function of hypothetical protein mj0577
(Zarembinski et al., PNAS 95, 1998)
Although the structure co-crystallised with ATP,
the biochemical function of the protein is
unknown
20
The Twilight Zone
  • Prediction methods don't work because we don't
    fully understand the Folding Problem
  • we can't read the language sequences use to
    create their folds
  • But, with sequence analysis techniques, we can
    try to find similarities between new sequences
    & those in dbs
  • whose structures & functions we hope have been
    elucidated
  • This is straightforward at high levels of
    identity, but below 50% it is difficult to
    establish relationships reliably
  • Analyses can be pursued with decreasing certainty
    towards the Twilight Zone
  • ~20% identity, where results may look plausible
    to the eye, but are no longer statistically
    significant

21
Beyond the Twilight Zone
  • To penetrate deeper into the Twilight Zone is the
    aim of most analytical methods
  • whether using single sequences, motifs, complex
    weighting schemes or raw amino acid frequencies
  • Each offers a different perspective, depending on
    the type of information used in the search
  • none gives the right answer
  • It is good practice to devise an analysis
    protocol that uses a variety of methods
  • but don't expect the impossible: no method is
    infallible!

22
Application areas of analysis tools
  • The scale indicates % identity between aligned
    sequences
  • Alignment of 2 random seqs can produce 20%
    identity
  • less than 20% does not constitute a significant
    alignment
  • around this threshold is the Twilight Zone,
    where alignments may appear plausible to the eye,
    but can't be proved by conventional methods
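The % identity scale can be made concrete with a small calculation. A minimal sketch in Python, assuming two sequences that have already been aligned to the same length with "-" for gaps, and assuming identity is counted over gap-free columns only (other denominators are in use); the function name is hypothetical.

```python
def percent_identity(a: str, b: str) -> float:
    """% identity over the gap-free columns of two pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only columns where both sequences have a residue.
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    identical = sum(1 for x, y in pairs if x == y)
    return 100.0 * identical / len(pairs)


# Two 13-residue segments sharing 5 identical positions.
print(round(percent_identity("GAVDFIALCDRYF", "GPIDFVCFCERFY"), 1))
```

At 5 identities in 13 positions (about 38%) this short pair sits between the 50% and 20% marks on the scale; for such a short segment the figure says little on its own, which is the point of the slide.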

23
Homology & analogy
  • The term homology is confounded & abused!
  • sequences are homologous if they are related by
    divergence from a common ancestor
  • analogy relates to the acquisition of common
    features from unrelated ancestors via convergent
    evolution
  • e.g., β-barrels occur in soluble & membrane
    proteins; the enzymes chymotrypsin & subtilisin share
    groups of catalytic residues, with near identical
    spatial geometries, but no other similarities
  • Homology is not a measure of similarity & is not
    quantifiable
  • it is an absolute statement that sequences have a
    divergent rather than a convergent relationship
  • the phrases "the level of homology is high" or
    "the sequences show 50% homology", or any like
    them, are strictly meaningless!
  • This is not just a semantic issue
  • loose use muddies thinking about evolutionary
    relationships

24
A terminology muddle
  • The same arguments apply to 3D structures
  • structures may be similar, as denoted by RMS
    positional deviation between compared atomic
    positions
  • but their common evolutionary origin is a
    hypothesis
  • the hypothesis may be correct or mistaken, but
    their similarity is a fact, no matter how it is
    interpreted
  • Similarity of sequence or structure is just that:
    similarity
  • Homology connotes a common evolutionary origin
  • Reeck, G.R., de Haen, C., Teller, D.C.,
    Doolittle, R.F., Fitch, W.M., Dickerson, R.E.,
    Chambon, P., McLachlan, A.D., Margoliash, E.,
    Jukes, T.H. & Zuckerkandl, E. (1987) Homology
    in proteins and nucleic acids: a terminology
    muddle and a way out of it. Cell, 50, 667.

25
More challenges for sequence analysis
  • Much of the challenge is in getting the biology
    right
  • this is complicated by the problem of orthology
    vs paralogy
  • Following a search, how much functional
    annotation can be legitimately inherited by a
    query?
  • source of numerous annotation errors in dbs
  • error propagation could lead to an error
    catastrophe
  • Further complications arise due to modular nature
    of proteins
  • modules are autonomous folding units (protein
    building blocks)
  • confer variety of functions on a parent protein,
    by multiple combinations of the same module, or
    different modules to form mosaics
  • Automatic analysis systems don't distinguish
    orthologues from paralogues & don't consider the
    modular nature of proteins

26
(No Transcript)
27
  • Monkeys are exploited in different Goldberg
    machines, where they perform different functions;
    here, we could not predict a monkey sitting in
    that spot, even with total knowledge of the rest
    of the machine
  • Similarity searches are just like this
  • identifying the presence of a module tells little
    of the function of the complete system
  • knowing most components of a mosaic, we can't
    predict a missing one
  • modules (monkeys) in different proteins don't
    always perform exactly the same function

28
The Midnight Zone
  • Notwithstanding the lessons of Goldberg machines,
    identifying evolutionary links between sequences
    is useful
  • this often implies a shared function
  • In the genome era, prediction of function from
    sequence is of more immediate value than is the
    prediction of structure
  • However, between distantly-related proteins,
    structure is more conserved than the underlying
    sequences
  • thus, some relationships are only apparent at the
    structural level
  • Such relationships can't be detected by even the
    most sensitive sequence comparison methods
  • the region of identity where sequence comparisons
    fail completely to detect structural similarity
    is the Midnight Zone; there is thus a
    theoretical limit to the effectiveness of
    sequence analysis methods

29
Significance
  • Appreciating that mathematical & biological
    significance are different is crucial; it is
    especially important in understanding the
    limitations of
  • search & alignment algorithms, pattern
    recognition techniques, functional site &
    structure prediction tools
  • Contrary to popular opinion, there is currently
    still
  • no biologically-reliable automatic multiple
    alignment algorithm
  • no infallible pattern-recognition technique
  • no reliable gene, function or structure
    prediction algorithms

30
(No Transcript)
31
Computers don't do biology!
32
Biological Databases
  • Overview
  • Sequence repositories
  • SWISS-PROT & TrEMBL
  • Composite sequence databases
  • NRDB, SPTrEMBL
  • Family (pattern) resources
  • PROSITE, PRINTS, profiles, Pfam, Blocks, eMOTIF
  • Composite family databases
  • InterPro

33
Primary sequence databases
  • In the early '80s, when sequence data started to
    accumulate, several labs saw advantages to
    establishing central repositories
  • trouble is, many labs thought this was a good
    idea & made their own
  • Nucleic acid: EMBL, GenBank, DDBJ
  • Protein: PIR, SWISS-PROT, MIPS, JIPID, TrEMBL
  • The proliferation of dbs causes problems
  • do they have the same format? Which is the most
    accurate? The most up-to-date? The most
    comprehensive? Which should we use?

34
SWISS-PROT
  • Endeavours to provide high-level annotation
  • e.g., descriptions of the function of the
    protein, the organisation of its domains, PTMs,
    family & disease relationships, variants, etc.
  • Contains entries from >5,000 species
  • the bulk of these from just a handful of model
    organisms
  • H.sapiens, E.coli, M.musculus, D.melanogaster,
    S.cerevisiae, etc.
  • The quality of its annotations sets it apart from
    other dbs
  • Consequently, it cannot keep pace with the rate
    of data acquisition from the sequencing centres

35
(No Transcript)
36
(No Transcript)
37
TrEMBL
  • A computer-annotated supplement to SP
  • has the SP format & contains translations of all
    CDSs in EMBL
  • It has 2 main sections
  • SP-TrEMBL contains all entries that will
    eventually go into SP, but haven't yet been
    manually annotated
  • REM-TrEMBL contains sequences not destined to
    be in SP
  • Igs, fragments of <8 residues, synthetic
    sequences, etc.
  • Arose from the need for a structured SP-like
    resource, allowing rapid access to genome data,
    without compromising the quality of SP by
    including entries with poor analysis &
    insufficient annotation

38
(No Transcript)
39
Composite sequence databases
  • A solution to the problem of proliferating dbs is
    to compile a composite
  • these render searches very efficient, especially
    if non-redundant
  • Trouble is, there are now several composites,
    each with their own format & redundancy criteria;
    the most commonly used are
  • NRDB: PDB, SWISS-PROT, PIR, GenPept, GenPept updates
  • SPTrEMBL: SWISS-PROT, TrEMBL
  • NRDB & SPTrEMBL are non-identical, not
    non-redundant
  • but which is best? Which the most comprehensive?
    The most up-to-date? Which should we use?

40
NRDB
  • NRDB is built locally at the NCBI
  • it includes weekly updates of SP & daily updates
    of GenBank, so is up-to-date & comprehensive
  • But the simplistic manner of its construction
    causes problems
  • multiple copies of the same protein are retained
    as a result of polymorphisms &/or sequencing
    errors
  • errors corrected in SP are reintroduced when
    retranslated from DNA
  • numerous sequences are duplicates of existing
    fragments
  • The contents of the db are thus error-prone &
    redundant
  • NRDB is the default db of the NCBI BLAST service
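The difference between a non-identical and a less redundant collection can be sketched in a few lines. This is a toy illustration in Python, not how NRDB is built: the sequences and function names are made up, and "redundant" is taken here to mean only "wholly contained in a longer retained sequence".

```python
def non_identical(seqs):
    """Drop exact duplicates only, keeping the first occurrence of each."""
    return list(dict.fromkeys(seqs))


def drop_fragments(seqs):
    """Also drop sequences wholly contained in a longer retained sequence."""
    kept = []
    # Longest first, so fragments are compared against full-length entries.
    for seq in sorted(non_identical(seqs), key=len, reverse=True):
        if not any(seq in longer for longer in kept):
            kept.append(seq)
    return kept


entries = [
    "MKTAYIAKQR",   # full-length entry
    "MKTAYIAKQR",   # exact duplicate
    "TAYIAK",       # fragment of the first entry
    "MKTAYIAKQK",   # one-residue variant (polymorphism or sequencing error)
]
print(non_identical(entries))
print(drop_fragments(entries))
```

Removing identical copies leaves the fragment and the variant in place; removing contained fragments still leaves the variant, which is the kind of redundancy the slide attributes to polymorphisms and sequencing errors.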

41
SPTrEMBL
  • This resource is intended to be both
    comprehensive & minimally redundant
  • It contains fewer errors than NRDB, but is not
    truly non-redundant
  • 30% of the combined total of SP & TrEMBL is
    non-unique
  • Further reduction of error rates requires more
    manual intervention & better expert db management
    systems

42
Family (pattern) databases
  • As well as 1° resources, there are also many
    family or pattern dbs derived from them
  • trouble is, they use different 1° sources &
    different analysis methods, & all have different
    formats!
  • But it isn't all bad: SWISS-PROT is emerging as
    a standard, & most pattern dbs use it as their
    basis
  • PROSITE (source: SWISS-PROT): regular expressions
    (patterns)
  • PRINTS (source: SWISS-PROT/TrEMBL): aligned motifs
    (fingerprints)
  • Pfam (source: SWISS-PROT/TrEMBL): Hidden Markov Models
    (HMMs)
  • Profiles (source: SWISS-PROT): weight matrices (profiles)
  • Blocks (source: InterPro/PRINTS): weighted motifs (blocks)
  • eMOTIF (source: Blocks/PRINTS): permissive regular
    expressions

43
Why create pattern databases?
  • Pattern dbs arise from the need to make more
    specific functional diagnoses than are possible
    simply by searching the 1° dbs
  • They are built on the principle that homologous
    sequences may be gathered together in multiple
    alignments, within which are regions (motifs)
    that show little variation
  • these motifs usually reflect some vital
    biological role in terms of either structure or
    function
  • Motifs are exploited in different ways to build
    diagnostic patterns for protein families
  • new sequences can be searched against dbs of such
    patterns to see if they can be assigned to known
    families
  • hence they offer a fast track to the inference of
    function

44
What's in a sequence?
45
Methods for family analysis
  • Single motif methods: exact regex (PROSITE), fuzzy regex (eMOTIF)
  • Multiple motif methods: identity matrices (PRINTS), weight matrices (Blocks)
  • Full domain alignment methods: profiles (PROFILE LIBRARY), HMMs (Pfam)
46
The challenge of family analysis
  • highly divergent family with single function?
  • superfamily with many diverse functional
    families?
  • must distinguish between these if functional
    analysis is done in silico
  • a tough challenge!

47
Know your family
48
The problem with domains
49
PROSITE
  • The first pattern db
  • based on the idea that a protein family can be
    characterised by a pattern of conserved residues
    within a single motif
  • Sequence information in motifs is reduced to
    consensus or regular expressions (regexs); the
    seed regex is used to search SP
  • results are inspected manually to achieve optimal
    results
  • Some families can't be characterised by single
    motifs
  • here, additional regexs are created until an
    optimal set is achieved that captures most or all
    of the family
  • results are then manually annotated for inclusion
    in the db

50
R-Y-x-[DT]-W-x-[LIVM]-[ST]-T-P-[LIVM](3)
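A PROSITE-style pattern such as the one above can be translated mechanically into an ordinary regular expression. A minimal sketch in Python, assuming the residue groups in the pattern were written in square brackets on the original slide (brackets are standard PROSITE notation for allowed residues, braces for excluded ones). The function name and the test sequence are made up, and the N- and C-terminal anchors of the full PROSITE syntax are not handled.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into Python regex syntax (sketch)."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Split "core" from an optional repeat count such as (3) or (2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core in ("x", "X"):
            core = "."                        # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"    # excluded residues
        # [..] groups and single letters are already valid regex.
        if low and high:
            core += "{%s,%s}" % (low, high)
        elif low:
            core += "{%s}" % low
        parts.append(core)
    return "".join(parts)


regex = prosite_to_regex("R-Y-x-[DT]-W-x-[LIVM]-[ST]-T-P-[LIVM](3)")
print(regex)
# Hypothetical sequence, for illustration only.
print(re.search(regex, "AARYADWKLSTPLIVGG") is not None)
```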
51
(No Transcript)
52
(No Transcript)
53
PRINTS
  • Most protein families are characterised by >1
    motif
  • it is sensible to use many/all of them to build a
    diagnostic signature
  • This is the principle of fingerprints
  • these offer improved diagnostic reliability by
    virtue of the biological context provided by
    motif neighbours
  • Motifs are excised from alignments by hand
  • residue information is augmented via iterative
    searches
  • results are manually annotated prior to inclusion
    in the db

54
Motif context
[Figure labels: order; motifs 1 to 5; interval]
55
(No Transcript)
56
SUMMARY INFORMATION
   37 codes involving 8 elements
    0 codes involving 7 elements
    0 codes involving 6 elements
    0 codes involving 5 elements
    0 codes involving 4 elements
    1 codes involving 3 elements
    0 codes involving 2 elements

COMPOSITE FINGERPRINT INDEX
    8|  37  37  37  37  37  37  37  37
    7|   0   0   0   0   0   0   0   0
    6|   0   0   0   0   0   0   0   0
    5|   0   0   0   0   0   0   0   0
    4|   0   0   0   0   0   0   0   0
    3|   1   0   0   0   1   1   0   0
    2|   0   0   0   0   0   0   0   0
   --+--------------------------------
     |   1   2   3   4   5   6   7   8

True positives..
  PRIO_COLGU PRIO_MACFA PRIO_CEREL PRIO_ODOHE PRIO_GORGO PRIO_PANTR
  PRIO_HUMAN O46648     PRIO_SHEEP PRIO_CALJA PRIO_BOVIN PRP2_BOVIN
  PRIO_ATEPA PRIO_SAISC PRIO_PREFR PRIO_PONPY O75942     PRIO_CAPHI
  PRIO_CEBAP PRIO_CAMDR PRIO_FELCA PRP1_TRAST PRIO_RABIT PRP2_TRAST
  PRIO_PIG   PRIO_CANFA PRIO_CRIGR PRIO_CRIMI Q15216     PRIO_RAT
  PRIO_CERAE PRIO_MUSPF PRIO_MUSVI PRIO_MESAU PRIO_MOUSE O46593
  PRIO_TRIVU
Subfamily: Codes involving 3 elements
Subfamily True positives..
  PRIO_CHICK
57
(No Transcript)
58
Profiles & Pfam
  • An alternative to motif-based methods exploits
    regions between motifs, which also contain
    valuable information
  • the full alignment effectively becomes the
    discriminator
  • A complex scoring scheme allowing for
    substitutions & INDELs is used to create
    family-specific profiles
  • These profiles can be used to detect distant
    relationships, where only few residues are
    conserved
  • this is the basis of the Profile library
  • In an extension of this approach, alignments are
    encoded as probabilistic models termed HMMs
  • this is the basis of Pfam

59
(No Transcript)
60
(No Transcript)
61
(No Transcript)
62
Blocks & eMOTIF
  • Various advantages to storing motifs in a raw
    form
  • no information is lost, & different scoring
    schemes may be used to confer different
    diagnostic potentials on the same data
  • Additional dbs have arisen in this way
  • Blocks uses families identified in InterPro,
    aligns the sequences & detects motifs
    automatically
  • BLOCKS-format PRINTS uses motifs in PRINTS with
    the Blocks scoring scheme
  • eMOTIF creates permissive regexs from Blocks &
    PRINTS
  • These dbs are derived fully automatically & hence
    offer
  • no family annotation (they link back to InterPro
    & PRINTS)
  • no further family coverage

63
(No Transcript)
64
Composite pattern databases
  • To simplify sequence analysis, the family
    databases are being integrated to create a
    unified annotation resource: InterPro
  • release 4.0 contains 4691 entries
  • a central annotation resource, with pointers to
    its satellite dbs
  • initial partners were PRINTS, PROSITE, profiles
    & Pfam
  • new partners include ProDom, TIGRfam, SMART
    & hopefully others (e.g., Blocks, MetaFam)
  • lags behind its sources
  • major role in fly & human genome annotation

65
(No Transcript)
66
(No Transcript)
67
Pattern Recognition
  • Overview
  • Determining significance of db matches
  • Pattern recognition methods
  • regular expression patterns & rules
  • fingerprints & blocks
  • profiles & HMMs
  • Current status of pattern dbs

68
Pattern recognition methods
  • These methods classify proteins into families
  • the basis of the methods is multiple sequence
    alignment
  • They depend on developing representations of
    conserved elements of alignments that may be
    diagnostic of structure or function, whether from
  • homologous sequence families
  • sequences that share some structural/functional
    domains

69
  • Single motif methods: exact regex (PROSITE), fuzzy regex (eMOTIF)
  • Multiple motif methods: identity matrices (PRINTS), weight matrices (Blocks)
  • Full domain alignment methods: profiles (PROFILE LIBRARY), HMMs (Pfam)
70
Determining significance of database matches
  • When searching a db, the challenge for analysis
    methods is to determine if matches are related
    (true-positive) or unrelated (true-negative)
  • At a given scoring threshold, it is likely that
    unrelated sequences will be matched erroneously
    (false-positives) & some correct matches will be
    missed (false-negatives)
  • The aim is to improve the resolution between the
    curves - in the overlap, it is difficult or
    impossible to establish if matches are
    significant
  • Different methods tackle this problem in
    different ways
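The effect of the scoring threshold can be illustrated with a toy tally. A minimal sketch in Python; the scores and the related/unrelated labels below are invented for illustration, and the function name is hypothetical.

```python
def tally(scored_matches, threshold):
    """Count TP, FP, FN and TN for (score, is_related) pairs at a threshold."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for score, is_related in scored_matches:
        reported = score >= threshold
        if reported and is_related:
            counts["TP"] += 1
        elif reported and not is_related:
            counts["FP"] += 1
        elif not reported and is_related:
            counts["FN"] += 1
        else:
            counts["TN"] += 1
    return counts


# Invented search results: (score, truly related?)
results = [(95, True), (80, True), (62, True), (58, False),
           (55, True), (40, False), (30, False)]

print(tally(results, threshold=60))   # strict: misses one true match
print(tally(results, threshold=50))   # permissive: admits one false match
```

In the overlap between 55 and 58 no threshold separates the two populations, which is the region where the slide says significance is difficult or impossible to establish.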

71
Resolving true & false matches
[Figure: number of matches (N) against Score; labelled curve: True negative]
72
Resolving true & false matches
[Figure: number of matches (N) against Score; labelled curve: True negative]
73
Regular expressions (patterns)
  • These are derived from single conserved regions
    in alignments
  • they are minimal expressions, so sequence
    information is lost
  • the more divergent the sequences used, the more
    fuzzy & poorly discriminating the regex becomes
  • Alignment:
  • GAVDFIALCDRYF
  • GPIDFVCFCERFY
  • GRVEFLNRCDRYY
  • Regex: G-X-[IV]-[DE]-F-[IVL]-X2-C-[DE]-R-[FY]2
  • Regexs do not tolerate similarity
  • sequences either match or not, regardless of how
    similar they are
  • matching is a binary on-off event & frequently
    misses true matches
  • single-motif methods are very hit-or-miss: how
    do you know if you've encoded the best region?
74
In the beginning was PROSITE
  • G_PROTEIN_RECEPTOR PATTERN
  • PS00237
  • G-protein coupled receptor signature
  • [GSTALIVMYWC]-[GSTANCPDE]-{EDPKRH}-X(2)-[LIVMNQGA]-
  • X(2)-[LIVMFT]-[GSTANC]-[LIVMFYWSTAC]-[DENH]-R
  • /TOTAL=1121(1121) /POS=1057(1057)
    /FALSE_POS=64(64)
  • /FALSE_NEG=112 /PARTIAL=48 /UNKNOWN=0(0)
  • This represents an apparent 20% error rate
  • the actual rate is probably higher
  • Thus, a match to a pattern is not necessarily
    true
  • a mis-match is not necessarily false!
  • False-negatives are a fundamental limitation to
    this type of pattern matching
  • if you don't know what you're looking for, you'll
    never know you missed it!
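The apparent error rate can be reproduced from the counts quoted in the entry. A worked calculation in Python; treating false positives, false negatives and partial matches together as errors, divided by the total number of hits, is my reading of how the 20% figure arises and is not stated on the slide.

```python
total_hits = 1121   # /TOTAL
true_pos = 1057     # /POS
false_pos = 64      # /FALSE_POS
false_neg = 112     # /FALSE_NEG
partial = 48        # /PARTIAL

# The reported hits are the true plus the false positives.
assert true_pos + false_pos == total_hits

# Assumed definition: every FP, FN and partial match counts as an error.
errors = false_pos + false_neg + partial
error_rate = errors / total_hits

print(errors, f"{error_rate:.1%}")   # prints: 224 20.0%
```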

75
(No Transcript)
76
Regular expressions (rules)
  • Regex patterns are most effective when applied to
    highly-conserved, family-specific motifs
  • It is often possible to identify shorter generic
    patterns within sequences, characteristic of
    common functional sites
  • Functional site: Rule
  • N-glycosylation: N-{P}-[ST]-{P}
  • Protein kinase C phosphorylation: [ST]-X-[RK]
  • Casein kinase II phosphorylation: [ST]-X2-[DE]
  • Such features result from convergence to a common
    property
  • glycosylation sites, phosphorylation sites, etc.
  • They cannot be used for family diagnosis & don't
    discriminate
  • they can only be used to suggest whether a
    certain functional site might exist (which must
    then be tested by experiment)
  • such patterns are normally termed rules
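Why rules suggest rather than diagnose is easy to see by scanning even a short sequence. A minimal sketch in Python, with the three rules hand-translated into regex syntax on the assumption that the original slide used PROSITE brackets for allowed residues and braces for excluded ones; the test sequence is invented and the function name is hypothetical.

```python
import re

# Rules hand-translated to Python regex (translation is an assumption).
RULES = {
    "N-glycosylation": r"N[^P][ST][^P]",
    "Protein kinase C phosphorylation": r"[ST].[RK]",
    "Casein kinase II phosphorylation": r"[ST].{2}[DE]",
}


def scan(sequence):
    """Return (rule, 1-based position, matched residues), overlaps included."""
    hits = []
    for name, regex in RULES.items():
        # A lookahead reports overlapping sites that finditer would skip.
        for m in re.finditer(f"(?=({regex}))", sequence):
            hits.append((name, m.start() + 1, m.group(1)))
    return hits


for hit in scan("MNGSAKSTRDEE"):   # invented 12-residue sequence
    print(hit)
```

Five candidate sites in twelve residues shows how readily such short rules match by chance, which is why each suggestion has to be tested by experiment.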

77
Residue groups for fuzzy regexs
  • It is possible to assign residues to groups based
    on various biochemical properties, e.g., charge
    & size
  • using such groups theoretically ensures that
    resulting regexs have sensible biochemical
    interpretations
  • small: Ala, Gly
  • small hydroxyl: Ser, Thr
  • basic: His, Lys, Arg
  • aromatic: Phe, Tyr, Trp
  • aliphatic: Val, Leu, Ile, Met
  • acidic/amide: Asp, Glu, Asn, Gln
  • small/polar: Ala, Gly, Ser, Thr, Pro
  • This is more flexible than exact regex matching

78
Diagnostic limitations
  • Consider the sequence motif Asp-Ala-Val-Ile-Asp
    (DAVID)
  • results of searching for such a motif will
    differ, depending on the db, the motif length &
    whether we use exact or permissive (fuzzy) regexs
  • Pattern: Matches
  • D-A-V-I-D: 71 (99)
  • D-A-V-I-[DEQN]: 252
  • [DEQN]-A-V-I-[DEQN]: 925
  • [DEQN]-A-[VLI]-I-[DEQN]: 2,739
  • [DEQN]-[AG]-[VLI]-[VLI]-[DEQN]: 51,506
  • D-A-V-E: 1,088 (1,493)
  • (number of matches in OWL29.6 (& OWL31.1))
  • Use of fuzzy regexs has the potential advantage
    of being able to recognise more distant
    relationships
  • & the inherent disadvantage that more matches
    will be made by chance, making it difficult to
    separate true matches from noise
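The growth in chance matches can be estimated roughly. A back-of-envelope sketch in Python, assuming all 20 residues are equally frequent and positions are independent (neither is true of real databases) and assuming the residue groups were bracketed on the original slide; the function name is hypothetical.

```python
from fractions import Fraction


def chance_per_window(pattern: str, alphabet_size: int = 20) -> Fraction:
    """Probability that a random window matches, under uniform frequencies."""
    probability = Fraction(1)
    for element in pattern.split("-"):
        allowed = element.strip("[]")          # residues permitted here
        probability *= Fraction(len(allowed), alphabet_size)
    return probability


PATTERNS = [
    "D-A-V-I-D",
    "D-A-V-I-[DEQN]",
    "[DEQN]-A-V-I-[DEQN]",
    "[DEQN]-A-[VLI]-I-[DEQN]",
    "[DEQN]-[AG]-[VLI]-[VLI]-[DEQN]",
]

exact = chance_per_window(PATTERNS[0])
for pattern in PATTERNS:
    # How many times more likely than the exact motif to match by chance.
    print(pattern, chance_per_window(pattern) / exact)
```

Under these crude assumptions the most permissive pattern is 288 times more likely than the exact motif to match a random window. The observed counts on the slide rise even faster, since real residue frequencies are far from uniform, but the direction is the same.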

79
Fingerprints
  • Fingerprints are groups of conserved (ungapped)
    motifs excised from alignments & used for
    iterative db searching
  • no weighting scheme is used
  • searches depend only on residue frequencies
  • resulting scoring matrices are thus sparse
  • Each motif trawls the db independently
  • search results are correlated to determine which
    sequences match all the motifs & which match only
    partially
  • no information is thrown away
  • The iterative process refines the fingerprint &
    increases its power
  • potency is gained from the mutual context of
    motif neighbours
  • results are biologically more meaningful than
    those from single motifs

80
[Figure labels: TM domain; TM domain; loop region]
81
[Figure labels: loop region; TM domain; TM domain]
82
A fingerprinting overview
PRINTS
annotation
83
How fingerprints are stored
84
  • (a)
    YVTVQHKKLRTPL
    YVTVQHKKLRTPL
    YVTVQHKKLRTPL
    AATMKFKKLRHPL
    AATMKFKKLRHPL
    YIFATTKSLRTPA
    VATLRYKKLRQPL
    YIFGGTKSLRTPA
    WVFSAAKSLRTPS
    WIFSTSKSLRTPS
    YLFSKTKSLQTPA
    YLFTKTKSLQTPA
  • (b)
     T  C  A  G  N  S  P  F  L  Y  H  Q  V  K  D  E  I  W  R  M  B  X  Z
     0  0  2  0  0  0  0  0  0  7  0  0  1  0  0  0  0  2  0  0  0  0  0
     0  0  3  0  0  0  0  0  2  0  0  0  4  0  0  0  3  0  0  0  0  0  0
     6  0  0  0  0  0  0  6  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
     1  0  1  1  0  3  0  0  1  0  0  0  3  0  0  0  0  0  0  2  0  0  0
     2  0  1  1  0  0  0  0  0  0  0  3  0  4  0  0  0  0  1  0  0  0  0
     4  0  1  0  0  1  0  2  0  1  3  0  0  0  0  0  0  0  0  0  0  0  0
     0  0  0  0  0  0  0  0  0  0  0  0  0 12  0  0  0  0  0  0  0  0  0
     0  0  0  0  0  6  0  0  0  0  0  0  0  6  0  0  0  0  0  0  0  0  0
     0  0  0  0  0  0  0  0 12  0  0  0  0  0  0  0  0  0  0  0  0  0  0
     0  0  0  0  0  0  0  0  0  0  0  2  0  0  0  0  0  0 10  0  0  0  0
     9  0  0  0  0  0  0  0  0  0  2  1  0  0  0  0  0  0  0  0  0  0  0
     0  0  0  0  0  0 12  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
     0  0  4  0  0  2  0  0  6  0  0  0  0  0  0  0  0  0  0  0  0  0  0
  • Second matrix (3 rows in transcript)
     T  C  A  G  N  S  P  F  L  Y  H  Q  V  K  D  E  I  W  R  M  B  X  Z
     0  0  4  0  0  0  0  8  4 34  0  0 15  0  0  0  1  7  0  0  0  0  0
     0  4 15  0  0  0  0  0  7  0  0  0 37  0  0  0 10  0  0  0  0  0  0
    50  0  0  0  0  3  0 18  0  0  0  0  0  0  0  0  0  0  0  2  0  0  0
  • Key
  • (a) motif, with 3 conserved positions
  • (b) corresponding frequency matrix
  • (c) same matrix, but after 3 iterations
  • (d) same matrix, with PAM250 weighting
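The step from motif (a) to frequency matrix (b) is a simple per-column count. A minimal sketch in Python using the twelve motif sequences and the residue column order from the slide; the function names are hypothetical.

```python
from collections import Counter

# Motif (a) from the slide: 12 aligned, ungapped segments.
MOTIF = [
    "YVTVQHKKLRTPL", "YVTVQHKKLRTPL", "YVTVQHKKLRTPL",
    "AATMKFKKLRHPL", "AATMKFKKLRHPL", "YIFATTKSLRTPA",
    "VATLRYKKLRQPL", "YIFGGTKSLRTPA", "WVFSAAKSLRTPS",
    "WIFSTSKSLRTPS", "YLFSKTKSLQTPA", "YLFTKTKSLQTPA",
]

# Residue order of the matrix columns, as on the slide.
ORDER = "TCAGNSPFLYHQVKDEIWRMBXZ"


def frequency_matrix(motif, order=ORDER):
    """One row per motif position; each row counts residues in ORDER."""
    matrix = []
    for column in zip(*motif):
        counts = Counter(column)
        matrix.append([counts[residue] for residue in order])
    return matrix


def conserved_positions(matrix, depth):
    """1-based positions where a single residue fills the whole column."""
    return [i + 1 for i, row in enumerate(matrix) if max(row) == depth]


matrix = frequency_matrix(MOTIF)
print(matrix[0])
print(conserved_positions(matrix, depth=len(MOTIF)))
```

No weighting is applied, so most cells are zero: this is the sparseness the fingerprint slides refer to.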

85
Fingerprint visualisation
  • The full potency of fingerprinting is gained from
    the mutual context provided by motif neighbours
  • This is important, as the method inherently
    implies a biological context to motifs that are
    matched in the correct order in the query
    sequences, with appropriate distances between
    them
  • This allows sequence identification even when
    parts of the fingerprint are absent
  • e.g., a sequence that matches only 4 of 7 motifs
    may still be diagnosed as a true match if the
    pattern of motif matching is consistent with that
    expected of true neighbouring motifs
  • Such matches are best visualised graphically
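The idea that a partial match can still be diagnostic if its motifs fall in the right order can be sketched as a simple check. This is an illustration in Python only: the positions are invented, the minimum of 4 matched motifs is borrowed from the slide's "4 of 7" example rather than from any stated rule, and the check on distances between motifs that the slide also mentions is left out.

```python
def motifs_in_order(hits):
    """hits maps motif number -> start position in the query sequence."""
    positions = [hits[motif] for motif in sorted(hits)]
    return all(a < b for a, b in zip(positions, positions[1:]))


def plausible_partial_match(hits, min_matched=4):
    """Enough motifs matched, and matched in fingerprint order (sketch)."""
    return len(hits) >= min_matched and motifs_in_order(hits)


# Invented examples for a 7-motif fingerprint.
print(plausible_partial_match({1: 10, 2: 55, 5: 140, 7: 230}))   # in order
print(plausible_partial_match({1: 10, 2: 200, 5: 140, 7: 230}))  # 2 after 5
print(plausible_partial_match({1: 10, 2: 55}))                   # too few
```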

86
Visualising fingerprints
[Figure labels: ID; PRINTS; N and C termini; Query sequence; Missing motif?]
87
N
C
88
N
C
89
(No Transcript)
90
(No Transcript)
91
Blocks
  • Blocks are groups of motifs derived automatically
    from families identified in InterPro
  • sequences are aligned automatically & motifs are
    automatically identified by searching for spaced
    residue triplets (e.g., AxxxVxxC)
  • a block score is calculated using the BLOSUM62
    matrix
  • validity of blocks is confirmed with a 2nd
    motif-finding algorithm
  • blocks found by both methods are considered
    reliable
  • Sequences within motifs are clustered to reduce
    contributions to residue frequencies from sets of
    closely-related sequences
  • each cluster is treated as a single sequence &
    given a score that gives a measure of its
    relatedness
  • the higher the weight, the more dissimilar the
    segment from others in the block, the most
    distant being given a score of 100
  • segments <80% similar are separated by blank lines
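The notion of a spaced residue triplet such as AxxxVxxC can be made concrete with a brute-force search. A simplified sketch in Python, not the algorithm Blocks uses: it enumerates every triplet with up to `max_gap` residues between its members and keeps those present in all sequences. The sequences, the gap limit and the function names are invented for illustration.

```python
from itertools import product


def spaced_triplets(seq, max_gap=3):
    """All (a, gap1, b, gap2, c) triplets in seq with gaps of 0..max_gap."""
    found = set()
    for i, first in enumerate(seq):
        for gap1, gap2 in product(range(max_gap + 1), repeat=2):
            j = i + 1 + gap1
            k = j + 1 + gap2
            if k < len(seq):
                found.add((first, gap1, seq[j], gap2, seq[k]))
    return found


def shared_triplets(seqs, max_gap=3):
    """Triplets present, with identical spacing, in every sequence."""
    common = spaced_triplets(seqs[0], max_gap)
    for seq in seqs[1:]:
        common &= spaced_triplets(seq, max_gap)
    return common


def as_pattern(triplet):
    a, gap1, b, gap2, c = triplet
    return a + "x" * gap1 + b + "x" * gap2 + c


family = ["MAKLTVGHCW", "AQRSVPPCLL", "GGATTTVAACK"]   # invented sequences
print(sorted(as_pattern(t) for t in shared_triplets(family)))
```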

92
(No Transcript)
93
CSC triplet
94
Profiles
  • Profiles are scoring tables derived from full
    domain alignments
  • these define which residues are allowed at given
    positions
  • which positions are conserved & which degenerate
  • which positions, or regions, can tolerate
    insertions
  • the scoring system is intricate, & may include
    evolutionary weights, results from structural
    studies, & data implicit in the alignment
  • variable penalties are specified to weight
    against INDELs occurring in core 2° structure
    elements
  • Within a profile, the I & M fields contain
    position-specific scores for insert & match
    positions
  • in conserved regions, INDELs aren't totally
    forbidden, but are strongly impeded by large
    penalties defined in the DEFAULT field
  • these are superseded by more permissive values in
    gapped regions
  • the inherent complexity of profiles renders them
    highly potent discriminators, but they are
    time-consuming to derive

95
(No Transcript)
96
(No Transcript)
97
Hidden Markov Models
  • HMMs are similar in concept to profiles by virtue
    of encoding full domain alignments
  • they are probabilistic models consisting of a
    number of inter-connecting states
  • essentially, linear chains of match, delete or
    insert states
  • Match states are assigned to conserved columns in
    an alignment
  • insert states allow for insertions relative to
    match states
  • delete states allow match positions to be skipped
  • thus, building an HMM from an alignment requires
    each position to be assigned either to match,
    delete or insert states
  • HMMs usually perform well, but can be
    over-trained
  • they may also suffer if they are created from an
    iterative automatic alignment process: if this
    once accepts a false match, the HMM will become
    corrupt
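
The assignment of alignment positions to states can be sketched as below. This assumes one common heuristic (columns with fewer than 50% gaps become match columns), which the slides do not specify; the alignment and function names are hypothetical, and no transition or emission probabilities are estimated.

```python
def assign_columns(alignment, max_gap_fraction=0.5):
    """Flag each alignment column as a match column (True) or an
    insert column (False), based on its fraction of gap characters."""
    n_seqs = len(alignment)
    flags = []
    for column in zip(*alignment):
        gap_fraction = column.count("-") / n_seqs
        flags.append(gap_fraction < max_gap_fraction)
    return flags


def state_path(aligned_seq, match_flags):
    """Spell out the state path of one aligned sequence:
    M = residue in a match column, D = gap in a match column,
    I = residue in an insert column (gaps there use no state)."""
    path = []
    for char, is_match in zip(aligned_seq, match_flags):
        if is_match:
            path.append("D" if char == "-" else "M")
        elif char != "-":
            path.append("I")
    return "".join(path)


# Hypothetical alignment, for illustration only
alignment = ["CLY-EC", "CLWAEC", "C-Y-DC"]
flags = assign_columns(alignment)
paths = [state_path(seq, flags) for seq in alignment]
print(flags)
print(paths)
```

Counting how often each state and each residue occurs along these paths is what would give the transition and emission probabilities of the model.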

98
An HMM
(Figure: HMM diagram with residue labels C, L, Y, E, C, L, W, D)
99
Which craft is best?
  • The wide variety of methods available leads to
    familiar problems
  • which should we use?
  • which is the most reliable?
  • which is the most comprehensive?
  • ......etc.
  • None of the pattern-recognition techniques is
    infallible, & none of the resulting pattern dbs
    is complete
  • bearing in mind the diagnostic strengths &
    weaknesses of the different approaches, & always
    keeping biological significance in mind, the best
    strategy is simply to use them all

100
Overview of resources
  • PROSITE (SIB) - 1108 entries
  • single motifs (regexs) - best with small, highly
    conserved sites
  • Profile library (ISREC) - 300 entries
  • weight matrices - good with divergent domains &
    superfamilies
  • PRINTS (Manchester) - 1750 entries
  • multiple motifs (fingerprints) - best for
    families & sub-families
  • Pfam (Sanger Centre) - 3071 entries
  • HMMs - good with divergent domains &
    superfamilies
  • InterPro (EBI) - 4691 entries
  • derived from PRINTS, PROSITE, Profiles, Pfam,
    ProDom, etc.
  • Blocks (FHCRC) - 2608 entries
  • multiple motifs (derived from InterPro & PRINTS)
  • eMOTIF (Stanford)
  • permissive regexs (derived from PRINTS & Blocks)

101
Building a Search Protocol
  • Overview
  • The usual starting point
  • searching the primary data sources
  • Pattern recognition methods
  • searching the secondary sources
  • Structural & functional interpretation of results
  • Estimating significance
  • when do we believe a result?

102
A practical approach
  • Given a newly-determined sequence, we want to
    know
  • what is my protein?
  • to what family does it belong?
  • what is its function?
  • how can we explain its function in structural
    terms?
  • To this end, by searching pattern dbs & fold
    libraries, we may recognise patterns that allow
    us to infer relationships with
    previously-characterised families/folds
  • Given the variety of dbs to search, how do we use
    them to build a sensible search protocol for
    novel sequences?

103
  • Protein sequence database identity search
  • e.g., for short fragments, pinpoints identical
    matches to probe - may identify correct reading
    frame
  • Protein sequence database similarity search
  • e.g., nrdb, SPSPTrEMBL - identifies potential
    homologues to probe
  • Protein pattern database search
  • e.g., PROSITE, profiles, PRINTS, Blocks, Pfam -
    identifies family relationships or pinpoints key
    structural or functional sites
  • Known structure: structure classification
    database query
  • e.g., scop, CATH, FSSP - provides details of
    structural class, secondary structure
    information, ligand-binding, etc.
  • No known structure: protein fold pattern library
    search
  • e.g., threading - identifies compatible folds
    for the probe sequence

104
Searching the primary databases
  • Identity searching
  • the fastest test of an unknown fragment is to
    perform an identity search. This will reveal in
    seconds whether an exact match to the unknown
    peptide already exists
  • This can be helpful in identifying the correct
    reading frame following a 6-frame translation
  • ccgtactacaactacgctggtgcattcaag
  • Forward 0 PYYNYAGAFK
  • Forward 1 RTTTTLVHS
  • Forward 2 VLQLRWCIQ
  • Reverse 0 LECTSVVVVR
  • Reverse 1 LNAPA!L!Y
  • Reverse 2 !MHQRSCST
  • TRFE_XENLA 207 AGIKEHKCSRSNNE PYYNYAGAFK
    CLQDDQGDVAFVKQ
  • XLTRSFER 207 AGIKEHKCSRSNNE PYYNYAGAFK
    CLQDDQGDVAFVKQ
  • TRFE_XENLA TRANSFERRIN PRECURSOR
    - XENOPUS LAEVIS
  • XLTRSFER TRANSFERRIN PRECURSOR
    - XENOPUS LAEVIS
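
The 6-frame translation and identity search on this slide can be reproduced with a short sketch. It assumes the standard genetic code and prints stop codons as "*" where the slide uses "!"; the function names are hypothetical, and the one-entry "database" is just the transferrin fragment shown on the slide, not a real database query.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna, frame):
    """Translate complete codons only, starting at offset `frame` (0, 1 or 2)."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


def six_frames(dna):
    rc = reverse_complement(dna)
    frames = {f"Forward {i}": translate(dna, i) for i in range(3)}
    frames.update({f"Reverse {i}": translate(rc, i) for i in range(3)})
    return frames


def identity_hits(frames, database):
    """Return (frame, entry) pairs where the translated peptide occurs
    exactly within a database sequence."""
    return [(frame, entry)
            for frame, peptide in frames.items()
            for entry, seq in database.items()
            if peptide in seq]


frames = six_frames("ccgtactacaactacgctggtgcattcaag")
# Fragment copied from the slide; stands in for a real database
database = {"TRFE_XENLA": "AGIKEHKCSRSNNEPYYNYAGAFKCLQDDQGDVAFVKQ"}
hits = identity_hits(frames, database)
for name, peptide in frames.items():
    print(name, peptide)
print(hits)
```

Only one frame yields a peptide with an exact match in the fragment, which is how an identity search can indicate the correct reading frame.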

105
Similarity searching
  • Whether or not an identity search finds a match,
    the next step is to look for similar sequences
  • e.g., you may wish to know if a wider family
    exists
  • The most rapid & simple option is to use BLAST,
    flavours of it, or FastA
  • Several features are worthy of note in BLAST
    output
  • look for high scores with low P-values (unlikely
    to be random)
  • look for clusters of high scores at the top of
    the hitlist (a family?)
  • look for trends in the type of sequences matched
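
BLAST output may report P-values or E-values depending on the version. Under the Poisson assumption of the usual BLAST statistics, the two are related by P = 1 - exp(-E), so they are nearly identical when small. A minimal sketch of the conversion (the function name is hypothetical):

```python
import math


def evalue_to_pvalue(e_value):
    """P = 1 - exp(-E): probability of at least one chance hit scoring
    this well, given E such hits expected by chance."""
    return -math.expm1(-e_value)  # expm1 keeps precision for tiny E


for e in (1e-10, 0.01, 1.0, 10.0):
    print(e, evalue_to_pvalue(e))
```

For E below about 0.01 the two values are practically interchangeable; for larger E the P-value saturates towards 1 while E keeps growing.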

106
Ideal results show high scores & low E-values
107
(No Transcript)
108
Why bother with pattern searches?
  • Primary searches won't always allow outright
    diagnosis
  • BLAST & FASTA are not infallible
  • BLAST, in particular, often can't assign
    significant scores
  • results may be complicated by the presence of
    modules, or compositionally-biased regions
  • annotations of retrieved hits may be incorrect
  • Pattern dbs contain potent descriptors
  • so, distant relationships missed by BLAST may be
    captured by one or more of the family or
    functional site distillations

109
(No Transcript)
110
Searching the pattern databases
  • Searching PROSITE
  • when using PROSITE's Web form, it is advisable to
    exclude rules from the search, otherwise output
    is filled with spurious matches
  • results are either match, or no match
  • the user has to judge whether hits are significant
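
The match / no-match behaviour follows from the fact that a PROSITE pattern is in effect a regular expression. The sketch below converts the basic pattern syntax (x for any residue, [..] for allowed residues, {..} for excluded residues, (n) or (n,m) for repeats, < and > for sequence ends) into a Python regex. It is a simplified converter that ignores rarer syntax, and the pattern and sequences used are hypothetical, not real PROSITE entries.

```python
import re


def prosite_to_regex(pattern):
    """Convert a basic PROSITE-style pattern to a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):          # N-terminal anchor
            prefix, element = "^", element[1:]
        if element.endswith(">"):            # C-terminal anchor
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as (3) or (2,4)
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element).groups()
        if core == "x":
            core = "."                       # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # excluded residues
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


# Hypothetical pattern and sequences, for illustration only
regex = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVM]-{P}-H")
print(regex)
print(bool(re.search(regex, "AACAACAAALGHAA")))  # match
print(bool(re.search(regex, "AACAACAAALPHAA")))  # no match: P is excluded
```

A single substitution at one position turns a match into a miss, with no score to indicate how close the sequence came; this is why the significance of a hit has to be judged by the user.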

111
(No Transcript)
112
(No Transcript)
113
Searching the pattern databases
  • Searching Profiles
  • the SIB Web server offers access both to profiles
    within PROSITE & to pre-release (undocumented)
    profiles
  • results are highly specific & generally
    diagnostically reliable
  • if no match is returned, it's usually because the
    entry isn't in the db
  • matches to undocumented profiles are often
    dead-ends

114
(No Transcript)
115
Searching the pattern databases
  • Searching Pfam
  • results are returned in HTML tables accompanied
    by simple graphics to illustrate matched domains
  • results are specific & usually diagnostically
    reliable
  • E-values provide the measure of confidence

116
(No Transcript)
117
Searching the pattern databases
  • Searching PRINTS
  • results are returned in HTML tables on different
    levels
  • a best "guess"
  • the top 10 best-scoring matches
  • the raw data
  • graphical options provide a visual impression of
    the quality of matches
  • results are specific & usually diagnostically
    reliable
  • combined E- & p-values provide the measure of
    confidence

118
(No Transcript)
119
(No Transcript)
120
Searching the pattern databases
  • Searching Blocks
  • if results of searching PROSITE & PRINTS are
    positive, we would expect these to be confirmed
    by searches of the Blocks dbs
  • key features to note in the output are
  • the description line, the accession codes (which
    indicate which is the matched motif), & the
    best-scoring or anchor block
  • most important is the detection of multiple block
    hits: where this happens, an E-value denotes the
    significance of the match
  • single block matches are usually spurious

121
(No Transcript)
122
Searching the pattern databases
  • Searching eMOTIF
  • as with Blocks, if results of searching PROSITE &
    PRINTS are positive, this should be confirmed by
    searches of eMOTIF
  • output is given at several stringency levels,
    which indicate the number of false matches to
    expect in the reported results

123
(No Transcript)
124
Which approach is best?
  • BLAST frequently fails to assign significant
    scores
  • The hit-or-miss nature of single-motif regular
    expressions can render them worthless
  • In spite of (because of?) their complexity,
    profiles & HMMs are often out-performed by
    simpler motif methods
  • The non-weighting system of fingerprints means
    that Twilight relationships may be missed
  • The scoring system used to create blocks
    generates large amounts of noise that may obscure
    the signal
  • Only PROSITE & PRINTS are fully manually
    annotated
  • No method alone is best

125
Structural & functional interpretation
  • Db searches often do little more than identify a
    protein family
  • this only scratches the surface: we still want
    to know what our protein does & what it might
    look like
  • The first step is to examine the detailed family
    documentations in PROSITE, PRINTS & InterPro
  • these should help to elucidate the function of
    the protein
  • The next step is to examine the fold
    classification & structure summary resources
  • e.g., scop, CATH & PDBsum, assuming that a
    structure is in fact available.

126
(No Transcript)
127
(No Transcript)
128
(No Transcript)
129
(No Transcript)
130
(No Transcript)
131
Estimating significance
  • When do we believe a result?
  • a real example.....

132
(No Transcript)
133
(No Transcript)
134
(No Transcript)
135
(No Transcript)
136
(No Transcript)
137
(No Transcript)
138
(No Transcript)
139
(No Transcript)
140
(No Transcript)
141
(No Transcript)
142
(No Transcript)
143
(No Transcript)
144
Conclusions
  • What are the lessons for sequence analysis?
  • when searching for distant homologues, several
    dbs should be searched
  • different methods provide different perspectives
  • dbs aren't complete & their contents don't fully
    overlap
  • The more dbs searched, the more difficult it can
    be to interpret results
  • hence s/w is being designed to provide
    "intelligent" consensus outputs
  • The more computers are involved in automating
    genome annotation, the greater the need for
    collaboration
  • especially between s/w developers, annotators &
    biologists
  • The more data we have to handle, the more
    rigorous we must be in our thinking (& writing)
    if we are to make sense of the complexities
  • We are a long way from having reliable tools for
    deducing protein structure & function from
    sequence
  • but with the right approach, there is hope