Title: From PSI-BLAST to HMMer
1From PSI-BLAST to HMMer
- Professor Mark Pallen
- Credits Stephanie Minnema
- University of Calgary
- David Wishart
- University of Alberta
2Advanced BLAST Methods
- The NCBI BLAST pages have several advanced BLAST methods available
- PSI-BLAST
- PHI-BLAST
- RPS-BLAST
- All are powerful methods based on protein similarities
3Position-Specific-Iterated-BLAST
- Intuition
- Substitution matrices should be specific to a particular site
- e.g. penalize alanine→glycine more in a helix
- Idea
- Use BLAST with high stringency to get a set of closely related sequences
- Align those sequences to create a new substitution matrix for each position
- Then use that matrix to find additional sequences
- Cycling/iterative method
- Gives increased sensitivity for detecting distantly related proteins
- Can give insight into functional relationships
- Very refined statistical methods
- Fast: still based on BLAST methods
- Simple to use
4PSI-BLAST Principle
- Step 1: a standard blastp is performed
- Step 2: the highest scoring hits are used to generate a multiple alignment
- Step 3: a PSSM is generated from the multiple alignment
- Highly conserved residues get high scores
- Less conserved residues get lower scores
- Step 4: another similarity search is performed, this time using the new PSSM
- Steps 2-4 can be repeated until convergence
- Convergence: no new sequences appear after an iteration
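The PSSM step can be sketched in a few lines of Python. This is a minimal illustration only, not the PSI-BLAST implementation: it assumes a gapless toy alignment, a uniform background frequency and a flat pseudocount, whereas PSI-BLAST uses sequence weighting and more refined pseudocounts. The names `build_pssm`, `score` and `toy_alignment` are made up for this example.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(ALPHABET)  # assumption: uniform background frequencies
    pssm = []
    for column in zip(*alignment):
        # Count residues in this column, ignoring gap characters.
        counts = Counter(res for res in column if res in ALPHABET)
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        pssm.append({
            res: math.log2(((counts[res] + pseudocount) / total) / background)
            for res in ALPHABET
        })
    return pssm

def score(pssm, segment):
    """Sum the position-specific scores of a segment as long as the PSSM."""
    return sum(column[res] for column, res in zip(pssm, segment))

# Hypothetical aligned hits from a first blastp pass.
toy_alignment = ["HLGHL", "HIGHL", "HLGNA", "HVGHV"]
pssm = build_pssm(toy_alignment)

# The fully conserved first column rewards H more than the variable fourth column.
print(round(pssm[0]["H"], 2), round(pssm[3]["H"], 2))
```

In this toy matrix a conserved column gives a higher score to its residue than a variable column does, which is the behaviour the slide describes.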
5Example: Aminoacyl tRNA Synthetases
- 20 enzymes for 20 amino acids
- Each is very different
- Big, small, monomers, tetramers
- All bind to their appropriate tRNAs and amino acids, with high specificity
- TrpRS and TyrRS share only 13% sequence identity
- BUT, overall structures of TrpRS and TyrRS are similar
- Structure-function relationship
6Same SCOP family based on catalytic domain
7So is there sequence similarity between TyrRS and TrpRS?
- Given structural similarities, we would expect to find sequence similarity
- BUT!
- blastp of E. coli TyrRS against bacterial sequences in SwissProt does NOT show similarity with TrpRS at an E-value cutoff of 10
8No TrpRS!?
9Try Using PSI-BLAST
- PSI-BLAST is available from the BLAST main page
- Query form is just like the one for blastp
- BUT one extra formatting option must be used
- "Format for PSI-BLAST": activate the tick box!
- A second E-value cutoff, the threshold for inclusion, determines which alignments will be used for the PSSM build
- First search using TyrRS as query
- Db: SwissProt, limited to Bacteria [ORGN]
- Threshold for inclusion: 0.005
12After A Few Iterations
13TyrRS Similarity to TrpRS!
14Power of PSI-BLAST
- We knew TyrRS and TrpRS were similar
- Functionally and structurally
- BLASTP gave no indication
- PSI-BLAST was able to detect their weak sequence similarity
- Words of caution
- Be sure to inspect and think about the results included in the PSSM build
- Include/exclude sequences on the basis of biological knowledge: you are in the driving seat!
- PSI-BLAST performance varies according to choice of matrix, filter, statistics etc., just like BLASTP
15Why (not) PSI-BLAST
- If the sequences used to construct the Position Specific Scoring Matrices (PSSMs) are all homologous, the sensitivity at a given specificity improves significantly
- However, if non-homologous sequences are included in the PSSMs, the PSSMs are corrupted. They then pull in more non-homologous sequences and become worse than a generic scoring matrix
16Query
Does the query really have a relationship with the results? One way to check is to run the search in the opposite direction, but the search is often not reversible even when the homology is true.
Results
17PSI-BLAST caveats
- Increased ability to find distant homologues
- Cost: additional care is required to prevent non-homologous sequences from being included in the PSSM calculation
- When in doubt, leave it out!
- Examine sequences with moderate similarity carefully
- Be particularly cautious about matches to sequences with highly biased amino acid content
- Low complexity regions, transmembrane regions and coiled-coil regions often display significant similarity without homology
- Screen them out of your query sequences!
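One crude way to picture low-complexity screening is a sliding-window Shannon entropy filter. The sketch below is an illustration only: it is not the SEG filter used by BLAST, and the window size, the entropy threshold and the toy sequence are arbitrary assumptions.

```python
import math
from collections import Counter

def shannon_entropy(window):
    """Shannon entropy (bits) of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def mask_low_complexity(seq, window=12, min_entropy=2.2):
    """Replace every residue that lies in a low-entropy window with 'X'."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            masked[start:start + window] = "X" * window
    return "".join(masked)

# Made-up sequence with a serine-rich stretch in the middle.
toy = "MKTAYIAKQRQISFVKSHFSRQ" + "S" * 16 + "LEERLGLIEVQAPILSRVGDGT"
print(mask_low_complexity(toy))
```

The masked region spreads a few residues into the flanks, because any window dominated by the biased stretch falls below the threshold.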
18PSI-BLAST on the command line
- As with simple BLAST searches, using PSI-BLAST on the command line gives the user more power
- Opens up additional options, e.g.
- PSI-BLASTing over nucleotide databases
- automating number of iterations
- trying out lots of different settings in parallel
- inputting multiple sequences
19PHI-BLAST
- Pattern Hit Initiated BLAST
- PHI-BLAST principle
- Same method as PSI-BLAST
- Starts the first search with the query sequence plus a pattern for a motif in the query
- PHI-BLAST finds sequences containing the motif and having significant sequence similarity in the vicinity of the motif occurrence
- Highly specific
20Example: TyrRS
- TyrRS contains the aaRS class-I signature
- Want to find sequences containing that motif, and regional similarity to TyrRS
- First get the Prosite pattern for the class-I signature
- Prosite: db of protein families and domains
21http://ca.expasy.org/prosite
22P-x(0,2)-[GSTAN]-[DENQGAPK]-x-[LIVMFP]-[HT]-[LIVMYAC]-G-[HNTG]-[LIVMFYSTAGPC]
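To see what such a pattern matches, it can be translated into a regular expression. The sketch below assumes the class-I signature is written in standard PROSITE syntax with bracketed residue sets, i.e. P-x(0,2)-[GSTAN]-[DENQGAPK]-x-[LIVMFP]-[HT]-[LIVMYAC]-G-[HNTG]-[LIVMFYSTAGPC]. It covers only the common syntax elements (no anchors), and the test sequence is made up. PHI-BLAST itself also requires local similarity around the motif, which this does not check.

```python
import re

# One PROSITE element: x, a residue, [allowed set] or {forbidden set}, plus optional (n) or (n,m).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported PROSITE element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            core = "."                          # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"      # any residue except these
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)

CLASS_I = "P-x(0,2)-[GSTAN]-[DENQGAPK]-x-[LIVMFP]-[HT]-[LIVMYAC]-G-[HNTG]-[LIVMFYSTAGPC]"
toy_sequence = "MKKPTADSLHLGHLVPL"  # made-up sequence containing a matching motif

hit = re.search(prosite_to_regex(CLASS_I), toy_sequence)
print(hit.start(), hit.group())  # prints: 3 PTADSLHLGHL
```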
23Insert Query Sequence
Insert PHI Pattern
24PHI-BLAST Results
- After the first search, PHI-BLAST functions the same as PSI-BLAST
- Result page is the same
- Can iterate in the same way
- Try it later if you like
25The Key to PHI- and PSI-BLAST
- Generating the multiple alignments to create PSSMs
- Refines scoring in searches
- Annotated collections of multiple alignments defining domains exist
- Conserved Domain Database (CDD)
- Contains 18039 alignments (10013 last year)
- Can search the CDD using CD search
- Uses RPS-BLAST
26RPS-BLAST
- Reverse Position Specific BLAST
- Opposite of PSI-BLAST
- CDD multiple alignments converted to PSSMs
- PSSMs are processed and turned into a searchable database
- Queries are searched against PSSMs using RPS-BLAST
- Output indicates conserved domains within the query sequence
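The "reverse" idea, one query scored against a library of PSSMs, can be sketched as follows. This is a toy illustration, not RPS-BLAST: it uses gapless sliding windows, hand-made matrices, a flat score of -1 for unlisted residues and an arbitrary score cutoff in place of E-values. The domain names and the query are invented.

```python
def best_window_score(pssm, query):
    """Best gapless score of the PSSM over all windows of the query, with its start."""
    width = len(pssm)
    best = (float("-inf"), -1)
    for start in range(len(query) - width + 1):
        window = query[start:start + width]
        # Residues not listed in a column get a flat penalty of -1 (assumption).
        total = sum(column.get(res, -1.0) for column, res in zip(pssm, window))
        best = max(best, (total, start))
    return best

def scan_domains(query, library, min_score):
    """Report (name, start, score) for every PSSM scoring at least min_score."""
    hits = []
    for name, pssm in library.items():
        total, start = best_window_score(pssm, query)
        if total >= min_score:
            hits.append((name, start, total))
    return sorted(hits, key=lambda hit: -hit[2])

# Hand-made toy PSSMs: one dict of residue scores per position.
library = {
    "toy_HIGH": [{"H": 3}, {"I": 2, "L": 2, "V": 1}, {"G": 3}, {"H": 3, "N": 1}],
    "toy_KMSKS": [{"K": 3}, {"M": 3}, {"S": 3}, {"K": 3}, {"S": 3}],
}
hits = scan_domains("MAPTADSLHLGHLVPK", library, min_score=8)
print(hits)
```

Only the matrix whose motif occurs in the query passes the cutoff, which mirrors how the real output lists the conserved domains found in the query.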
27Example CRADD protein
28Click on picture to see CDD multiple alignment
Click to see alignment with query
29Profile Hidden Markov Models
- statistical models of multiple sequence alignments
- capture position-specific information about
- how conserved each column of the alignment is
- which residues are likely
- use position-specific scores for amino acids (or nucleotides)
- position-specific penalties for opening and extending an insertion or deletion
30Advantages of using HMMs
- HMMs have a formal probabilistic basis
- use probability theory to guide how all the scoring parameters should be set
- can do things that more heuristic methods cannot do easily
- For example, a profile HMM can be trained from unaligned sequences, if a trusted alignment isn't yet known
- HMMs have a consistent theory behind gap and insertion scores
31Advantages of using HMMs
- In most details, profile HMMs are a slight improvement over a carefully constructed profile
- but less skill and manual intervention are necessary to use profile HMMs
- HMMs can produce true global alignments, unlike BLAST
32Limitations of HMMs
- do not capture any higher-order correlations
- assume that the identity of a particular position is independent of the identity of all other positions
- make poor models of RNAs because an HMM cannot describe base pairs
- cf. protein threading methods
- which usually include scoring terms for nearby amino acids in a three-dimensional protein structure
- slower and less user-friendly than PSI-BLAST
33Applications of profile HMMs
- Database searching for weak homologies
- Alternative to PSI-BLAST
- Automated annotation of the domain structure of
proteins
34Applications of profile HMMs
- Useful for organizing sequences into evolutionarily related families
- Databases like Pfam are constructed by distinguishing between
- a stable curated seed alignment of a small number of representative sequences
- full alignments of all detectable homologs
- HMMER used to
- make a model of the seed
- search the database for homologs
- automatically produce the full alignment by aligning every sequence to the seed consensus
35Constructing a profile HMM
- multiple sequence alignment is made of known members of a given protein family
- quality of alignment, number and diversity of the sequences crucial for success
- profile HMM of family built from the alignment
- model-building program uses the alignment together with its prior knowledge of the general nature of proteins
- model-scoring program used to assign a score with respect to the model to any sequence of interest
- the better the score, the higher the chance that the query sequence is homologous to the protein family in the model
- each sequence in a database scored to find the members of the family present in the database
36Profile HMM programs: HMMER
- developed by Sean Eddy
- freely available under GNU General Public License
- includes model-building and model-scoring programs relevant to homology detection
- contains a program that calibrates a model by
- scoring it against a set of random sequences
- fitting an extreme value distribution to the resultant raw scores
- parameters of this distribution then used to calculate accurate E-values for sequences of interest
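The calibration idea can be illustrated with a small extreme value fit. The sketch below is not HMMER's procedure: it fits a Gumbel distribution by the method of moments (a simplification chosen for brevity) and uses simulated scores with arbitrary parameters in place of real scores against random sequences. All names and numbers are assumptions.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329

def fit_gumbel(scores):
    """Method-of-moments estimate of Gumbel location (mu) and scale (beta)."""
    beta = statistics.stdev(scores) * math.sqrt(6) / math.pi
    mu = statistics.mean(scores) - EULER_GAMMA * beta
    return mu, beta

def e_value(score, mu, beta, database_size):
    """Expected number of random hits scoring at least `score` in the database."""
    # P(S >= score) = 1 - exp(-exp(-(score - mu) / beta)), via expm1 for accuracy.
    p_value = -math.expm1(-math.exp(-(score - mu) / beta))
    return database_size * p_value

# Stand-in for raw scores of the model against random sequences:
# Gumbel samples with mu = -10 and beta = 3, drawn by inverse transform.
rng = random.Random(0)
random_scores = [
    -10.0 - 3.0 * math.log(-math.log(rng.uniform(1e-12, 1.0 - 1e-12)))
    for _ in range(5000)
]

mu_hat, beta_hat = fit_gumbel(random_scores)
print(mu_hat, beta_hat, e_value(20.0, mu_hat, beta_hat, database_size=100000))
```

Once mu and beta are fitted for a model, any raw score can be turned into an E-value for a database of a given size without further random searches.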
37Programs in the HMMER 2 package
- hmmalign
- Align sequences to existing model
- hmmbuild
- Build a model from multiple sequence alignment.
- hmmcalibrate
- Takes an HMM and empirically determines parameters used to make searches more sensitive by calculating more accurate E-values
- hmmconvert
- Convert a model file into different formats, including a compact HMMER 2 binary format, and best effort emulation of GCG profiles.
- hmmemit
- Emit sequences probabilistically from a profile HMM.
- hmmfetch
- Get a single model from an HMM database.
- hmmindex
- Index an HMM database.
- hmmpfam
- Search an HMM database for matches to a query sequence.
- hmmsearch
- Search a sequence database for matches to an HMM.
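A typical build, calibrate, search sequence with these programs can be written down as three commands. The sketch below assumes the HMMER 2 argument order (model file before alignment for hmmbuild, model file before sequence database for hmmsearch) and only assembles the commands without running them. The file names are placeholders.

```python
import shlex

def hmmer2_workflow(alignment, model, sequence_db):
    """Commands for building, calibrating and searching with a HMMER 2 model."""
    return [
        ["hmmbuild", model, alignment],      # build the profile HMM from the alignment
        ["hmmcalibrate", model],             # fit E-value parameters for the model
        ["hmmsearch", model, sequence_db],   # score every database sequence against it
    ]

steps = hmmer2_workflow("family.sto", "family.hmm", "swissprot.fasta")
for step in steps:
    print(shlex.join(step))
```

The order matters: calibration has to happen after the model is built and before the search if the E-values are to benefit from it.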
38Profile HMM programs: SAM
- Developed by the bioinformatics group at the University of California, Santa Cruz
- not open source, but free for academic use
- does not include a model-calibration program
- model-scoring program calculates E-values directly using a theoretical function that takes as its argument the difference between raw scores of the query sequence and its reverse
- important component is the target99 script, which generates a multiple sequence alignment suitable for model building
39Clash of the Titans: PSI-BLAST v. HMMer v. SAM!
- Nucleic Acids Research, 2002, Vol. 30 No. 19 4321
- SAM consistently produces better models than HMMER
- relative performance of the model-scoring components varies
- HMMER 1-3x faster than SAM with large databases
- SAM faster with small ones
- both methods have effective low complexity and repeat sequence masking
- accuracy of their E-values was comparable
- SAM T99 iterative database search procedure outperforms PSI-BLAST
- BUT scoring of PSI-BLAST profiles is >30x faster than scoring of SAM models
41Summary
- PSI-BLAST
- Input: SEQUENCE
- Database: SEQUENCES
- Algorithm: Constructs a PSSM from an initial pass and uses this in the next pass
- Output: Distantly related sequences
- sensitive, -specific
- PHI-BLAST
- Input: PROFILE, SEQUENCE
- Database: SEQUENCES
- Algorithm: Same as PSI-BLAST except start with a profile
- Output: Sequences containing the domain and that are similar in the domain region
- sensitive, -> -specific
- RPS-BLAST
- Input: SEQUENCE
- Database: DOMAINS
- Output: Domains found in the sequence
- sensitive, specific
- HMMs
- More sensitive
- But less user-friendly than PSI-BLAST and slower