Title: http://creativecommons.org/licenses/by-sa/2.0/
1http://creativecommons.org/licenses/by-sa/2.0/
2Sequence Similarity Searching: Understanding
and Using Web-Based BLAST
Dr. Joanne Fox joanne_at_bioinformatics.ubc.ca
3Concepts of Sequence Similarity Searching
- The premise
- The sequence itself is not informative; it must
be analyzed by comparative methods against
existing databases to develop hypotheses
concerning relatives and function.
4Important Terms for Sequence Similarity Searching
with very different meanings
- Similarity
- The extent to which nucleotide or protein
sequences are related. The extent of similarity
between two sequences can be based on percent
sequence identity and/or conservation. In BLAST,
similarity refers to a positive matrix score.
- Identity
- The extent to which two (nucleotide or amino
acid) sequences are invariant.
- Homology
- Similarity attributed to descent from a common
ancestor.
- It is your responsibility as an informed
bioinformatician to use these terms correctly. A
sequence is either homologous or not. Don't use
percentages with this term!
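To make the difference between identity and similarity concrete, here is a minimal Python sketch. It assumes the two sequences are already aligned and of equal length. The function name `identity_and_positives` and the small `POSITIVE_PAIRS` set are illustrative assumptions: the set lists only a handful of substitutions that score positively in BLOSUM62, whereas a real program would look up the full matrix.

```python
# Illustrative subset of residue pairs that score positively in BLOSUM62.
# This is an assumption for the example; a real tool uses the full matrix.
POSITIVE_PAIRS = {
    frozenset("IL"), frozenset("IV"), frozenset("LV"),
    frozenset("KR"), frozenset("DE"), frozenset("ST"), frozenset("FY"),
}


def identity_and_positives(aligned_a, aligned_b):
    """Return (percent identity, percent positives) for two aligned sequences.

    Both strings must have the same length; '-' marks a gap.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    identical = positive = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # gap columns count toward the length but never match
        if a == b:
            identical += 1
            positive += 1  # identical residues also score positively
        elif frozenset((a, b)) in POSITIVE_PAIRS:
            positive += 1
    length = len(aligned_a)
    return 100.0 * identical / length, 100.0 * positive / length


ident, pos = identity_and_positives("MKLVDE", "MRLIDD")
print(f"identity={ident:.1f}%  positives={pos:.1f}%")
```

Neither number says anything about homology on its own; that remains a yes-or-no inference drawn from the evidence.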
5Sequence Similarity Searching: The Approach
- Sequence similarity searching involves the use of
a set of algorithms (such as the BLAST programs)
to compare a query sequence to all the sequences
in a specified database.
- Comparisons are made in a pairwise fashion. Each
comparison is given a score reflecting the degree
of similarity between the query and the sequence
being compared.
- The higher the score, the greater the degree of
similarity. The similarity is measured and shown
by aligning two sequences.
6Sequence Similarity Searching: The Alignment
- Alignments can be global or local (this is
algorithm-specific)
- A global alignment is an optimal alignment that
includes all characters from each sequence
(Clustal generates global alignments)
- A local alignment is an optimal alignment that
includes only the most similar local region or
regions (BLAST generates local alignments).
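The contrast between global and local alignment can be sketched with a small dynamic-programming routine. This is a textbook-style illustration (Needleman-Wunsch for global, Smith-Waterman for local) with a linear gap penalty, not the heuristic that BLAST or Clustal actually run. The function name and the match, mismatch, and gap values are assumptions chosen for the example.

```python
def alignment_score(a, b, local, match=1, mismatch=-1, gap=-2):
    """Optimal alignment score by dynamic programming (linear gap penalty).

    local=False: global alignment, every character of both sequences is used.
    local=True:  local alignment, only the best-scoring region counts.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # A global alignment has to pay for leading gaps.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
            if local:
                cell = max(cell, 0)  # a local alignment may restart anywhere
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


query, subject = "ACGT", "TTACGTTT"
print("global:", alignment_score(query, subject, local=False))
print("local: ", alignment_score(query, subject, local=True))
```

With these inputs the local score rewards the shared ACGT region, while the global score is dragged down by the gaps needed to cover the unmatched ends of the longer sequence.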
7QUERY sequence(s)
BLAST results
BLAST program
BLAST database
8Topics
BLAST program
- The different blast programs
- Understanding the BLAST algorithm
- Word size
- HSPs
- Understanding BLAST statistics
- The alignment score (S)
- Scoring Matrices
- Dealing with gaps in an alignment
- The expectation value (E)
9The BLAST algorithm
- The BLAST programs (Basic Local Alignment Search
Tools) are a set of sequence comparison
algorithms introduced in 1990 that are used to
search sequence databases for optimal local
alignments to a query.
- Altschul SF, Gish W, Miller W, Myers EW, Lipman
DJ (1990) Basic local alignment search tool. J.
Mol. Biol. 215:403-410.
- Altschul SF, Madden TL, Schaeffer AA, Zhang J,
Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST
and PSI-BLAST: a new generation of protein
database search programs. NAR 25:3389-3402.
10http://www.ncbi.nlm.nih.gov/BLAST/
blastn
11Several different BLAST programs
12Other BLAST programs
- BLAST 2 Sequences (bl2seq)
- Aligns two sequences of your choice
- Can do different types of comparison, e.g. blastx
- Gives dot-plot-like output
- VecScreen
- Compares query with sequences of known cloning
vectors
- Both very handy for sequencing!
13More BLAST programs
- BLAST against genomes
- Many available
- BLAST parameters pre-optimized
- Handy for mapping query to genome
- Search for short exact matches
- BLAST parameters pre-optimized
- Great for checking probes and primers
14MegaBLAST
- megaBLAST
- For aligning sequences that differ slightly due
to sequencing errors etc.
- Very efficient for long query sequences
- Uses big word (k-tuple) sizes to start search
- Very fast
- Accepts batch submissions of ESTs
- Can upload files of sequences as queries
- For more detailed info, see the megaBLAST pages
15How Does BLAST Really Work?
- The BLAST programs improved the overall speed of
searches while retaining good sensitivity
(important as databases continue to grow) by
breaking the query and database sequences into
fragments ("words"), and initially seeking
matches between fragments.
- Word hits are then extended in either direction
in an attempt to generate an alignment with a
score exceeding the threshold of "S".
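The seed-and-extend idea described above can be sketched as follows. This is a simplified illustration, not the BLAST implementation: it seeds only on exact words, extends without gaps, and stops when the running score drops too far below the best seen. The function names, the scoring values, and the `x_drop` and `threshold` parameters are assumptions for the example; the real programs add refinements that are not shown here.

```python
def find_word_hits(query, subject, word_size):
    """Yield (query_pos, subject_pos) for every exact word shared by both."""
    index = {}
    for i in range(len(query) - word_size + 1):
        index.setdefault(query[i:i + word_size], []).append(i)
    for j in range(len(subject) - word_size + 1):
        for i in index.get(subject[j:j + word_size], ()):
            yield i, j


def extend_hit(query, subject, qpos, spos, word_size,
               match=1, mismatch=-1, x_drop=2):
    """Extend a word hit in both directions without gaps.

    Extension stops once the running score falls more than x_drop below
    the best score seen so far. Returns (score, query_start, query_end).
    """
    score = best = word_size * match
    q_end = qpos + word_size
    i, j = qpos + word_size, spos + word_size
    while i < len(query) and j < len(subject) and best - score <= x_drop:
        score += match if query[i] == subject[j] else mismatch
        i += 1
        j += 1
        if score > best:
            best, q_end = score, i
    score = best  # continue leftwards from the best right-hand extension
    q_start = qpos
    i, j = qpos - 1, spos - 1
    while i >= 0 and j >= 0 and best - score <= x_drop:
        score += match if query[i] == subject[j] else mismatch
        i -= 1
        j -= 1
        if score > best:
            best, q_start = score, i + 1
    return best, q_start, q_end


def best_hsp(query, subject, word_size=4, threshold=6):
    """Return the best extended hit scoring at least `threshold`, or None."""
    hsps = [extend_hit(query, subject, q, s, word_size)
            for q, s in find_word_hits(query, subject, word_size)]
    return max((h for h in hsps if h[0] >= threshold), default=None)


print(best_hsp("GATTACAGG", "CCGATTACATTT"))
```

Indexing the query words once and scanning the subject against that index is what avoids a full dynamic-programming comparison for every database sequence.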
16Picture used with permission from Chapter 11 of
Bioinformatics: A Practical Guide to the
Analysis of Genes and Proteins
19Each BLAST hit generates an alignment that can
contain one or more of these high-scoring segment
pairs (HSPs)
20Where does the score (S) come from?
- The quality of each pairwise alignment is
represented as a score and the scores are ranked.
- Scoring matrices are used to calculate the score
of the alignment base by base (DNA) or amino acid
by amino acid (protein).
- The alignment score will be the sum of the scores
for each position.
21What's a scoring matrix?
- Substitution matrices are used for amino acid
alignments. These are matrices in which each
possible residue substitution is given a score
reflecting the probability that it is related to
the corresponding residue in the query.
- A unitary matrix is used for DNA pairs because
each position can be given a score of 1 if it
matches and a score of zero if it does not.
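As a small illustration of scoring with a matrix, the sketch below builds the unitary DNA matrix described above and sums the per-position scores of an ungapped alignment. The function names `unitary_matrix` and `ungapped_score` are assumptions for the example; a protein search would pass a PAM or BLOSUM table in place of the unitary one.

```python
def unitary_matrix(alphabet="ACGT"):
    """Build a unitary matrix: 1 for a match, 0 for any mismatch."""
    return {(x, y): 1 if x == y else 0 for x in alphabet for y in alphabet}


def ungapped_score(seq_a, seq_b, matrix):
    """Sum the matrix score of each aligned pair (no gaps allowed)."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length")
    return sum(matrix[(a, b)] for a, b in zip(seq_a, seq_b))


dna = unitary_matrix()
print(ungapped_score("GATTACA", "GACTATA", dna))
```

Because the matrix is just a lookup table keyed by residue pair, swapping in a different scoring scheme does not change the summation code.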
22PAM vs. BLOSUM scoring matrices
- BLOSUM 62 is the default matrix in BLAST 2.0.
Though it is tailored for comparisons of
moderately distant proteins, it performs well in
detecting closer relationships. A search for
distant relatives may be more sensitive with a
different matrix.
23PAM vs. BLOSUM scoring matrices
- The PAM family
- PAM matrices are based on global alignments of
closely related proteins.
- PAM1 is the matrix calculated from
comparisons of sequences with no more than 1%
divergence.
- Other PAM matrices are extrapolated from PAM1.
- The BLOSUM family
- BLOSUM matrices are based on local alignments.
- BLOSUM 62 is a matrix calculated from comparisons
of sequences sharing no more than 62% identity.
- All BLOSUM matrices are based on observed
alignments; they are not extrapolated from
comparisons of closely related proteins.
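The extrapolation of higher PAM matrices from PAM1 amounts to repeated matrix multiplication of the mutation probability matrix. The sketch below shows this on a toy two-letter alphabet. The matrix `toy_pam1` and its 1% change rate are made-up values for illustration; real PAM matrices are 20 x 20 and are converted to log-odds scores before use, which is not shown here.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def extrapolate(pam1, n):
    """Raise a PAM1-style mutation probability matrix to the n-th power."""
    size = len(pam1)
    result = [[1.0 if i == j else 0.0 for j in range(size)]
              for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, pam1)
    return result


# Hypothetical two-letter alphabet: each letter has a 1% chance of
# changing into the other per PAM unit.
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]

toy_pam250 = extrapolate(toy_pam1, 250)
for row in toy_pam250:
    print([round(p, 4) for p in row])
```

After many steps the probability of keeping the original letter falls toward the background frequency, which is why high-numbered PAM matrices suit distant relationships.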
24What happens if you have a gap in the alignment?
- A gap is a position in the alignment at which a
letter is paired with a null.
- Gap scores are negative. Since a single
mutational event may cause the insertion or
deletion of more than one residue, the presence
of a gap is frequently ascribed more significance
than the length of the gap.
- Hence the opening of a gap is penalized heavily,
whereas a lesser penalty is assigned to each
subsequent residue in the gap.
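This open-then-extend scheme is usually called an affine gap penalty. A minimal sketch follows, assuming the convention that the first gapped position costs `gap_open` and each subsequent one costs `gap_extend`. Both values are illustrative, and individual tools define the opening cost slightly differently.

```python
def gap_score(length, gap_open=10, gap_extend=1):
    """Score (always <= 0) of one gap spanning `length` residues.

    Assumed convention: the first gapped position costs gap_open,
    every subsequent position costs gap_extend.
    """
    if length < 0:
        raise ValueError("gap length cannot be negative")
    if length == 0:
        return 0
    return -(gap_open + (length - 1) * gap_extend)


print("one gap of length 3:   ", gap_score(3))
print("three gaps of length 1:", 3 * gap_score(1))
```

One long gap is much cheaper than several short ones, matching the idea that a single mutational event can insert or delete several residues at once.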
25What do the Score and the E-value really mean?
- The quality of the alignment is represented by
the Score.
- Score (S)
- The score of an alignment is calculated as the
sum of substitution and gap scores. Substitution
scores are given by a look-up table (PAM, BLOSUM),
whereas gap scores are assigned empirically.
- The significance of each alignment is computed as
an E value.
- E value (E)
- Expectation value. The number of different
alignments with scores equivalent to or better
than S that are expected to occur in a database
search by chance. The lower the E value, the more
significant the score.
26Is the E-value the same as the P-value?
- E value (E)
- Expectation value. The number of different
alignments with scores equivalent to or better
than S that are expected to occur in a database
search by chance. The lower the E value, the more
significant the score.
- When E < 0.01, the P-value and E-value are nearly
identical.
- So, the E-value is the number of times you expect
to see a hit with a score as good as or better
than yours occur in the database due to random
chance alone.
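The relationship between the two values can be checked numerically. Assuming chance hits follow a Poisson distribution, the probability of seeing at least one hit with a score at least as good is P = 1 - exp(-E). The function name `e_to_p` is an assumption for this sketch.

```python
import math


def e_to_p(e_value):
    """P-value of at least one chance hit, given the expected count E."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-e_value)


for e in (10.0, 1.0, 0.1, 0.01, 0.001):
    print(f"E={e:g}  P={e_to_p(e):.6f}")
```

For small E the two numbers agree closely, whereas for large E the P-value saturates near 1 while E keeps growing.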
27QUERY sequence(s)
BLAST results
BLAST program
BLAST database
28Topics
BLAST databases
- The different BLAST databases provided by the
NCBI
- Protein databases
- Nucleotide databases
- Genomic databases
- Considerations for choosing a BLAST database
- Custom databases for BLAST
29BLAST protein databases available through the
blastp web interface at NCBI
30BLAST nucleotide databases available through the
blastn web interface at NCBI
31Considerations for choosing a BLAST database
- First consider your research question
- Are you looking for an ortholog in a particular
species?
- BLAST against the genome of that species.
- Are you looking for additional members of a
protein family across all species?
- BLAST against nr; if you can't find hits, check
wgs, htgs, and the trace archives.
- Are you looking to annotate genes in your species
of interest?
- BLAST against known genes (RefSeq) and/or ESTs
from a closely related species.
32When choosing a database for BLAST
- It is important to know your reagents.
- Changing your choice of database is changing your
search space completely
- Database size affects the BLAST statistics
- Record BLAST parameters, database choice, and
database size in your bioinformatics lab book,
just as you would for your wet-bench experiments.
- Databases change rapidly and are updated
frequently
- It may be necessary to repeat your analyses
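To see why database size matters, recall the Karlin-Altschul relation E = K * m * n * exp(-lambda * S), where m is the query length and n the database length. The sketch below applies it directly. The default values of `k` and `lam` are assumed, illustrative parameters, and real BLAST additionally corrects m and n to effective lengths, which is not modeled here.

```python
import math


def expect_value(raw_score, query_len, db_len, k=0.041, lam=0.267):
    """E = K * m * n * exp(-lambda * S), with assumed K and lambda."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


score, query_len = 100, 300
for db_len in (1e8, 1e9, 1e10):
    e = expect_value(score, query_len, db_len)
    print(f"database length {db_len:.0e}: E = {e:.2e}")
```

The same alignment score becomes ten times less significant each time the database grows tenfold, which is one reason to record the database size alongside each result.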
33Creating Custom Databases for BLAST
UBiC FAQ
34QUERY sequence(s)
BLAST results
BLAST program
BLAST database
35Topics
BLAST results
- Choosing the right BLAST program
- Running a blastp search
- BLAST parameters and options to consider
- Viewing BLAST results
- Look at your alignments
- Using the BLAST taxonomy report
36http://www.ncbi.nlm.nih.gov/BLAST/
blastn
37http://www.ncbi.nlm.nih.gov/BLAST/
Program selection guide
38What BLAST program should I use? Check the
NCBI's BLAST Program selection guide
39http://www.ncbi.nlm.nih.gov/BLAST/
40Input your query (gi231571) as FASTA, raw
sequence, or Accession/ID and choose your
database
41Links to more information can be found on the
BLAST page
links
42BLAST parameters and options to consider
conserved domains
Entrez query
E-value cutoff
Word size
43More BLAST parameters and options to consider
filtering
gap penalties
matrix
44Run your BLAST search
BLAST
45The BLAST Queue
click for more info
Note your RID
46Formatting and Retrieving your BLAST results
Results
options
47A graphical view of your BLAST results
48The BLAST hit list
Score
E-Value
GenBank
alignment
EntrezGene
49The BLAST pairwise alignments
Identity
Similarity
50Sorting BLAST results by Taxonomy
Taxonomy Report
51Tax BLAST Report
Summary hits by lineage
BLAST hits by organism
52BLAST statistics to record in your bioinformatics
lab book
Record the statistics that are found at the bottom of
your BLAST results page
53Homology: Some Guidelines
- Similarity can be indicative of homology
- Generally, if two sequences are significantly
similar over their entire length, they are likely
homologous
- Low complexity regions can be highly similar
without being homologous
- Homologous sequences are not always highly similar
- Suggested BLAST Cutoffs
- (source: Chapter 11, Bioinformatics: A Practical
Guide to the Analysis of Genes and Proteins)
- For nucleotide-based searches, one should look
for hits with E-values of 10^-6 or less and
sequence identity of 70% or more
- For protein-based searches, one should look for
hits with E-values of 10^-3 or less and sequence
identity of 25% or more
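These suggested cutoffs are easy to apply programmatically once hits are in tabular form. The sketch below assumes each hit has already been parsed into an E-value and a percent identity. The `Hit` class, its field names, and the example hit identifiers are hypothetical and do not correspond to any particular BLAST output format.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    subject_id: str          # hypothetical identifier for the example
    e_value: float
    percent_identity: float


# (maximum E-value, minimum percent identity) from the guidelines above
CUTOFFS = {"nucleotide": (1e-6, 70.0), "protein": (1e-3, 25.0)}


def passes_cutoff(hit, search_type):
    """True if the hit meets the suggested cutoffs for this search type."""
    max_e, min_identity = CUTOFFS[search_type]
    return hit.e_value <= max_e and hit.percent_identity >= min_identity


hits = [
    Hit("hit_A", 1e-30, 48.0),
    Hit("hit_B", 5e-4, 27.5),
    Hit("hit_C", 0.2, 31.0),
]
kept = [h.subject_id for h in hits if passes_cutoff(h, "protein")]
print(kept)
```

The cutoffs are guidelines for a first pass; hits that clear them still deserve a look at the alignment itself.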
54Advanced BLAST programs
- The NCBI BLAST pages have several advanced BLAST
methods available
- PSI-BLAST
- PHI-BLAST
- RPS-BLAST
- All are powerful methods based on protein
similarities
55PSI-BLAST
- Position-Specific Iterated BLAST
- A cycling/iterative method
- Gives increased sensitivity for detecting
distantly related proteins
- Can give insight into functional relationships
- Very refined statistical methods
- Fast: still based on BLAST methods
- Simple to use
56How does PSI-BLAST work?
1. First, a standard blastp is performed
2. The highest scoring hits are used to generate a
multiple alignment
3. A Position Specific Scoring Matrix (PSSM) is
generated from the multiple alignment.
- Highly conserved residues get high scores
- Less conserved residues get lower scores
- The PSSM describes the sequence similarity
between your query and all significant blastp
hits
4. Another similarity search is performed, this time
using the new PSSM instead of the standard BLOSUM
or PAM matrices
- This PSSM (scoring matrix) is now customized to
find sequences that are related to your original
query
5. Steps 2-4 can be repeated until convergence
- Convergence occurs when no new sequences appear
after iteration
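The PSSM step can be illustrated with a deliberately simplified sketch: per-column log-odds scores computed from a gap-free multiple alignment, using a flat pseudocount and a uniform background frequency. PSI-BLAST itself uses sequence weighting and more refined pseudocounts and background frequencies, so the function names, the pseudocount, and the toy alignment below are all assumptions for illustration. The iteration to convergence is not shown.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column.

    `alignment` is a list of equal-length, gap-free protein strings.
    """
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    denom = len(alignment) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for column in zip(*alignment):
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / denom
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm


def score_with_pssm(pssm, segment):
    """Score a segment of the same length as the PSSM, column by column."""
    return sum(col[aa] for col, aa in zip(pssm, segment))


toy_alignment = ["MKVL", "MKIL", "MRVL", "MKVF"]  # made-up sequences
pssm = build_pssm(toy_alignment)
print(round(score_with_pssm(pssm, "MKVL"), 2))
print(round(score_with_pssm(pssm, "WWWW"), 2))
```

Fully conserved columns give their residue the highest score, partly conserved columns spread the score across the observed residues, and unseen residues score below zero.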
57http://www.ncbi.nlm.nih.gov/BLAST/
PSI-BLAST
58Format results for PSI-BLAST with inclusion
E-value set at 0.005
PSI-BLAST
BLAST
59Contributors
- Special thanks to David Wishart, Andy Baxevanis,
Stephanie Minnema, Sohrab Shah, and Francis
Ouellette for their contributions to these
materials
- You are now ready to complete the BLAST
assignment