Title: BLAST I
1BLAST (I)
Basic Local Alignment Search Tool
??? ?????? email jimfann_at_itri.org.tw 02/26/2008
2Reference Sources
Ye, J., McGinnis, S. & Madden, T.L. (2006) "BLAST: improvements for better sequence analysis." Nucleic Acids Res. 34 (Web Server issue): W6-W9.
McGinnis, S. & Madden, T.L. (2004) "BLAST: at the core of a powerful and diverse set of sequence analysis tools." Nucleic Acids Res. 32 (Web Server issue): W20-W25.
Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990) "Basic local alignment search tool." J. Mol. Biol. 215: 403-410.
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D.J. (1997) "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." Nucleic Acids Res. 25: 3389-3402.
http://www.ncbi.nlm.nih.gov/BLAST/
ftp://ftp.ncbi.nih.gov/blast/
Bedell, J., Korf, I. & Yandell, M. (2003) BLAST. O'Reilly. http://www.oreilly.com/catalog/blast/
Pevsner, J. (2003) Bioinformatics and Functional Genomics. John Wiley & Sons, Inc. http://www.bioinfbook.org
3Why use BLAST?
- To discover functional, structural and evolutionary similarities
- Because similarity may be an indicator of homology and thus provide some insight into function or gene identification.
- Applications include
- identifying orthologs and paralogs
- discovering new genes or proteins
- discovering variants of genes or proteins
- investigating expressed sequence tags (ESTs)
- exploring protein structure and function
4http://www.ncbi.nlm.nih.gov/BLAST/
5BLAST
6Format for query sequence
- FASTA (Pearson and Lipman, 1988)
- >gi|129295|sp|P01013|OVAX_CHICK Gene X protein (Ovalbumin-related)
- QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAE ...
- Bare sequence
- QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAE
- Identifiers
- accession number (P01013)
- accession number with version code (AAA68881.1)
- gi (129295, gi|129295)
In the FASTA definition line, the key is gi|129295|sp|P01013|OVAX_CHICK and the rest of the line is the description.
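As a minimal sketch of how a FASTA record can be read and its definition line split into key and description, assuming the NCBI-style pipe-delimited key (`gi|...|sp|...|...`). The function names `parse_fasta` and `parse_defline` are hypothetical, not part of any BLAST tool.

```python
def parse_fasta(text):
    """Yield (defline, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def parse_defline(defline):
    """Split a defline into its key, tagged identifiers, and description."""
    key, _, description = defline.partition(" ")
    fields = key.split("|")
    # Pair tag/value fields, e.g. gi -> 129295, sp -> P01013.
    # A trailing unpaired field (the entry name) is ignored here.
    ids = dict(zip(fields[0::2], fields[1::2]))
    return key, ids, description


record = (
    ">gi|129295|sp|P01013|OVAX_CHICK Gene X protein (Ovalbumin-related)\n"
    "QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAE\n"
)
for defline, sequence in parse_fasta(record):
    key, ids, description = parse_defline(defline)
    print(key, ids, description, len(sequence))
```

A bare sequence has no definition line, so a parser like this one would yield nothing for it; handling that case is left out of the sketch.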
7IUPAC Nucleic acid codes
8IUPAC Amino acid codes
9Peptide Sequence Databases (FASTA format)
10Nucleotide Sequence Databases
11BLAST - Algorithm parameters
12SEG filtering of low-complexity segments
Wootton, J.C. & Federhen, S. (1993) "Statistics of local complexity in amino acid sequences and sequence databases." Comput. Chem. 17: 149-163.
FTP site: ftp://ftp.ncbi.nih.gov/pub/seg/seg/
From http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Seg.html
13BLAST background on sequence alignment
There are two main approaches to sequence alignment.
1. Global alignment (Needleman & Wunsch, 1970) uses dynamic programming to find optimal alignments between two sequences. (Although the alignments are optimal, the search is not exhaustive.) Gaps are permitted in the alignments, and the total lengths of both sequences are aligned (hence "global").
From www.bioinfbook.org/
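A minimal sketch of global alignment by dynamic programming, returning only the optimal score. The match, mismatch, and gap values are illustrative assumptions, and a linear gap penalty is used for brevity.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(
                diagonal,               # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,  # gap in b
                score[i][j - 1] + gap,  # gap in a
            )
    # Global alignment: the answer is the bottom-right cell.
    return score[-1][-1]


print(needleman_wunsch("GATTACA", "GATACA"))
```

The table has (len(a) + 1) x (len(b) + 1) cells and each is filled once, which is why the optimal alignment is found without enumerating every possible alignment.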
14BLAST background on sequence alignment
2. The second approach is local sequence alignment (Smith & Waterman, 1981). The alignment may contain just a portion of either sequence, and is appropriate for finding matched domains between sequences. Smith-Waterman (S-W) is guaranteed to find optimal alignments, but it is computationally expensive (it requires $O(n^2)$ time). BLAST and FASTA are heuristic approximations to local alignment. Each requires only $O(n^2/k)$ time because they examine only part of the search space.
From www.bioinfbook.org/
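A minimal sketch of local alignment scoring in the Smith-Waterman style. The scoring values are illustrative assumptions and a linear gap penalty is used. The difference from global alignment is that cell scores are floored at zero and the best cell anywhere in the table is reported.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    best = 0
    previous = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        current = [0]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Flooring at 0 lets an alignment start fresh at any cell.
            cell = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(cell)
            best = max(best, cell)
        previous = current
    return best


print(smith_waterman("XXXGTWYYY", "AAGTWCC"))
```

Every cell of the table is still visited, so the running time grows with the product of the two sequence lengths; heuristics such as BLAST avoid filling most of the table.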
15How a BLAST search works
The central idea of the BLAST algorithm is to
confine attention to segment pairs that contain
a word pair of length w with a score of at least
T. (Altschul et al., 1990)
From www.bioinfbook.org/
16Pairwise alignment scores are determined using a
scoring matrix such as Blosum62
From www.bioinfbook.org/
17How the original BLAST algorithm works: three phases
Phase 1 (Seeding): compile a list of word pairs (w = 3) above threshold T. (For nucleotide searches, words must match perfectly.)
Example: for a human RBP (retinol binding protein) query ...FSGTWYA... (the query word was shown in red), a list of words (w = 3) is:
FSG SGT GTW TWY WYA
YSG TGT ATW SWY WFA
FTG SVT GSW TWF WYS
From www.bioinfbook.org/
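The first step of seeding, splitting the query into overlapping words, can be sketched as follows. The function name `query_words` is hypothetical.

```python
def query_words(query, w=3):
    """Return every overlapping word of length w in the query, in order."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


print(query_words("FSGTWYA"))
```

For the RBP fragment FSGTWYA this gives FSG, SGT, GTW, TWY and WYA; the remaining words in the slide's list are similar words derived from these.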
18Phase 1 Seeding
compile a list of words (w = 3)
Word   Per-position scores   Total
GTW    6, 5, 11              22
GSW    6, 1, 11              18
ATW    0, 5, 11              16
NTW    0, 5, 11              16
GTY    6, 5, 2               13
(neighborhood word hits at or above threshold)
GNW                          10
GAW                          9
(neighborhood word hits below threshold, T = 11)
From www.bioinfbook.org/
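A small sketch of how candidate words are scored against the query word GTW and kept when they reach the threshold T. The pair scores are copied from the per-position values listed above rather than loaded from a full matrix, so only those five candidate words can be scored; the function names are hypothetical.

```python
# (query residue, candidate residue) -> score, taken from the values above.
PAIR_SCORE = {
    ("G", "G"): 6, ("T", "T"): 5, ("W", "W"): 11,
    ("T", "S"): 1, ("G", "A"): 0, ("G", "N"): 0, ("W", "Y"): 2,
}


def word_score(query_word, candidate):
    """Sum the pair scores position by position."""
    return sum(PAIR_SCORE[(q, c)] for q, c in zip(query_word, candidate))


def neighborhood(query_word, candidates, threshold=11):
    """Keep candidates whose score is at least the threshold T."""
    scored = {c: word_score(query_word, c) for c in candidates}
    return {c: s for c, s in scored.items() if s >= threshold}


candidates = ["GTW", "GSW", "ATW", "NTW", "GTY"]
print(neighborhood("GTW", candidates))                # T = 11
print(neighborhood("GTW", candidates, threshold=16))  # stricter T
```

Raising T shortens the word list (GTY drops out at T = 16), which speeds up the search at the cost of sensitivity.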
19How a BLAST search works 3 phases
Phase 1: Seeding. You can modify the threshold parameter. The default value for blastp is 11. To change it, enter -f 16 or -f 5 in the advanced options.
Scan the database for entries that match the compiled list. This is fast and relatively easy.
From www.bioinfbook.org/
20How a BLAST search works 3 phases
Phase 2: Extension. When you manage to find a hit (i.e. a match between a word and a database entry), extend the hit in either direction. Keep track of the score (use a scoring matrix). Stop when the score drops below some cutoff.
KENFDKARFSGTWYAMAKKDPEG  50  RBP (query)
MKGLDIQKVAGTWYSLAMAASD.  44  lactoglobulin (hit)
The hit is the shared word GTW; extension proceeds from it in both directions.
From www.bioinfbook.org/
21How a BLAST search works 3 phases
Phase 2 Extension
The quick brown fox jumps over the lazy dog.
The quiet brown cat purrs when she sees him.
Scoring matrix: match +1, mismatch -1; seed: T
The quick brown fox jump
The quiet brown cat purr
123 45654 56789 876 5654  <- score
000 00012 10000 123 4345  <- drop-off score
A drop-off score of 5 is first reached at the final column shown; a drop-off score of 2 is first reached at the k/t mismatch in quick/quiet.
From Korf et al. 2003
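The sentence example can be replayed with a short sketch of rightward extension under a drop-off limit X. Two assumptions are made to match the numbers above: aligned spaces are not scored, and extension stops as soon as the drop-off reaches X. The function name `extend_right` is hypothetical.

```python
def extend_right(query, subject, x_drop, match=1, mismatch=-1):
    """Extend from position 0 to the right; return (best score, aligned query text)."""
    score = 0
    best_score = 0
    best_length = 0
    for i, (q, s) in enumerate(zip(query, subject)):
        if q == " " and s == " ":
            continue  # assumption: aligned spaces carry no score
        score += match if q == s else mismatch
        if score > best_score:
            best_score, best_length = score, i + 1
        elif best_score - score >= x_drop:
            break  # fallen too far below the best score seen so far
    # The alignment is trimmed back to where the best score occurred.
    return best_score, query[:best_length]


a = "The quick brown fox jump"
b = "The quiet brown cat purr"
print(extend_right(a, b, x_drop=5))
print(extend_right(a, b, x_drop=2))
```

With X = 5 the extension survives the quick/quiet mismatches and is trimmed back to "The quick brown" (score 9); with X = 2 it gives up early and keeps only "The qui" (score 6).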
22How a BLAST search works 3 phases
Phase 2: Extension. In the original (1990) implementation of BLAST, hits were extended in either direction. In a 1997 refinement of BLAST, two independent hits are required. The hits must occur in close proximity to each other. With this modification, only one seventh as many extensions occur, greatly reducing the time required for a search.
-y  Dropoff (X) for blast extensions in bits (default if zero); default: 20 for blastn, 7 for other programs
From www.bioinfbook.org/
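A rough sketch of the two-hit requirement. The slide says only that the hits must be in close proximity; the sketch assumes, following the 1997 paper's description, that the two hits lie on the same diagonal, do not overlap, and fall within a window (40 here, an illustrative value). Only neighbouring hits on a diagonal are compared, which is a simplification.

```python
from collections import defaultdict


def two_hit_seeds(hits, word_size=3, window=40):
    """hits: iterable of (query_pos, subject_pos) word hits.

    Return (diagonal, first_query_pos, second_query_pos) for each pair of
    neighbouring hits on one diagonal that qualifies for extension.
    """
    by_diagonal = defaultdict(list)
    for query_pos, subject_pos in hits:
        by_diagonal[subject_pos - query_pos].append(query_pos)

    seeds = []
    for diagonal, positions in by_diagonal.items():
        positions.sort()
        for first, second in zip(positions, positions[1:]):
            distance = second - first
            # Non-overlapping (at least one word apart) and close enough.
            if word_size <= distance <= window:
                seeds.append((diagonal, first, second))
    return seeds


hits = [(0, 10), (5, 15), (20, 90), (3, 50)]
print(two_hit_seeds(hits))
```

Isolated hits such as (20, 90) and (3, 50) never trigger an extension, which is where the saving in extensions comes from.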
23How a BLAST search works 3 phases
Phase 3: Evaluation of HSPs
- HSP: high-scoring segment pair
- Alignment threshold (set by the software, reported in the footer): S1 for ungapped and S2 for gapped alignments
- Final alignment threshold (set with -e)
24Nucleotide BLAST
- Megablast is intended for comparing a query to closely related sequences. It works best if the target percent identity is 95% or more, and it is very fast.
- Discontiguous megablast uses an initial seed that ignores some bases (allowing mismatches) and is intended for cross-species comparisons.
- BlastN is slow, but allows a word size down to seven bases.
25Nucleotide BLAST - seeding
- Exact word match for
- BlastN: word length 11
- Megablast: word length 28
- Discontiguous megablast
- Template length: 16, 18, 21.
- Word size (i.e. number of 1s in the template): 11, 12
- Template type: coding, non-coding.
- Require two words for extension: yes/no.
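A sketch of how a discontiguous template is applied: positions marked 1 must match and positions marked 0 are ignored. The template below is hypothetical, chosen only so that it has length 16 and eleven 1s; it is not claimed to be one of the templates NCBI uses.

```python
# Hypothetical template: length 16, word size (number of 1s) 11.
TEMPLATE = "1101101101101101"


def template_match(a, b, template=TEMPLATE):
    """True if a and b agree at every position where the template has a 1."""
    if len(a) < len(template) or len(b) < len(template):
        return False
    return all(x == y for x, y, t in zip(a, b, template) if t == "1")


query = "ACGTACGTACGTACGT"
third_base_changed = "ACTTACGTACGTACGT"  # differs at index 2, a 0 position
first_base_changed = "CCGTACGTACGTACGT"  # differs at index 0, a 1 position
print(template_match(query, third_base_changed))
print(template_match(query, first_base_changed))
```

Ignoring every third position is the kind of pattern that tolerates changes at codon third positions, which is one reason such seeds suit cross-species comparisons of coding sequence.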
26Scoring Matrix
FTP site: ftp://ftp.ncbi.nih.gov/blast/matrices/
ftp://ftp.ncbi.nih.gov/blast/matrices/IDENTITY
ftp://ftp.ncbi.nih.gov/blast/matrices/MATCH
27Scoring Matrix
- Conservative amino acid substitutions are due to similar physicochemical properties
- Isoleucine for Valine (both small, hydrophobic)
- Serine for Threonine (both polar)
- ...
[Figure: Venn diagram placing each amino acid in overlapping property sets: tiny, small, aliphatic, hydrophobic, aromatic, polar, charged, positive.]
From www.sanbi.ac.za/
28Scoring Matrix
- Substitution matrix
- to increase sensitivity of the alignment algorithm
- flexible lookup scheme for any pair of amino acids
- PAM, BLOSUM
- Calculating similarity scores (log-odds scores)
29Scoring - (BLOSUM 62)
From http://blast.wustl.edu/doc/infotheory.html
30PAM (Percent Accepted Mutations) matrices
- Derived from global alignments of protein families. Family members share at least 85% identity (Dayhoff et al., 1978).
- Construction of phylogenetic tree and ancestral sequences of each protein family
- Computation of number of replacements for each pair of amino acids
From www.sanbi.ac.za/ http://www.sdsc.edu/babu/UCSD/week02/dbSearch_tut.html
31PAM (Percent Accepted Mutations) matrices
- The numbers of replacements were used to compute a so-called PAM-1 matrix.
- The PAM-1 matrix reflects an average change of 1% of all amino acid positions. PAM matrices for larger evolutionary distances can be extrapolated from the PAM-1 matrix.
- Matrix multiplication using PAM-1
- PAM250: 250 mutations per 100 residues.
- Family of matrices: PAM10 to PAM200
- Greater numbers mean bigger evolutionary distance
From www.sanbi.ac.za/
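The extrapolation by matrix multiplication can be sketched with a toy two-letter alphabet. The matrix `PAM1_TOY` is a hypothetical stand-in in which each letter changes with probability 1% per step; the real PAM-1 matrix is 20 x 20 and is not reproduced here.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(matrix, power):
    """Multiply the matrix by itself `power` times, starting from identity."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(power):
        result = mat_mul(result, matrix)
    return result


# Toy substitution probabilities: rows are "from", columns are "to".
PAM1_TOY = [[0.99, 0.01],
            [0.01, 0.99]]

pam250_toy = mat_pow(PAM1_TOY, 250)
print(pam250_toy)
```

Each row still sums to 1 after 250 steps. In this toy case a letter is unchanged a little over half the time, because repeated changes at one position can revert it; 250 mutations per 100 residues therefore does not mean that every position differs.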
32PAM Matrices
- If changes were purely random
- Frequency of each possible substitution is proportional to background frequencies
- In related proteins
- Observed substitution frequencies, called the target (replacement) frequencies, are biased toward those that do not seriously disrupt the protein's function
- These point mutations are accepted during evolution
- Log-odds approach
- Scores proportional to the natural log of the ratio of target frequencies to background frequencies
From http://omega.cbmi.upmc.edu/vanathi/
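A short sketch of the log-odds calculation. The frequencies are made-up illustrative numbers, and the scaling (twice the base-2 logarithm, i.e. half-bit units, rounded to an integer) is an assumption; changing the base of the logarithm only changes the proportionality constant.

```python
import math


def log_odds(target_freq, background_i, background_j, scale=2.0):
    """Score a residue pair as scale * log2(target / (p_i * p_j)), rounded."""
    expected_by_chance = background_i * background_j
    return round(scale * math.log2(target_freq / expected_by_chance))


# Hypothetical frequencies: both residues have background frequency 0.05.
print(log_odds(0.02, 0.05, 0.05))      # pair seen more often than chance
print(log_odds(0.0025, 0.05, 0.05))    # pair seen exactly as often as chance
print(log_odds(0.000625, 0.05, 0.05))  # pair seen less often than chance
```

Pairs observed more often than the background frequencies predict get positive scores, pairs at chance level get zero, and rarer pairs get negative scores.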
33PAM Matrices salient points
- Derived from global alignments of closely related sequences.
- Matrices for greater evolutionary distances are extrapolated from those for lesser ones.
- The number with the matrix (PAM40, PAM100) refers to the evolutionary distance; greater numbers are greater distances.
- Does not take into account different evolutionary rates between conserved and non-conserved regions.
From http://omega.cbmi.upmc.edu/vanathi/
34BLOSUM Matrices
- Henikoff, S. & Henikoff, J.G. (1992)
- Use blocks of protein sequence fragments from different families (the BLOCKS database)
- Amino acid pair frequencies calculated by summing over all possible pairs in block
- Different evolutionary distances are incorporated into this scheme with a clustering procedure (identity over a particular threshold = same cluster)
From http://omega.cbmi.upmc.edu/vanathi/
35BLOSUM Matrices
- Probabilities estimated from blocks of sequence fragments
- Blocks represent structurally conserved regions
- Target frequencies are identified directly
- Sequences more than x% identical within the block where substitutions are being counted are grouped together and treated as a single sequence
- BLOSUM 50: 50% identity
- BLOSUM 62: 62% identity
From http://omega.cbmi.upmc.edu/vanathi/
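A simplified sketch of the grouping step, assuming that a sequence joins a cluster when it reaches the identity threshold with any existing member (single linkage) and that block sequences have equal length. The toy block and the function names are hypothetical, and the later weighting of clusters when counting pairs is not shown.

```python
def percent_identity(a, b):
    """Percentage of aligned positions at which a and b are identical."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_block(block, threshold):
    """Group the sequences of one ungapped block by percent identity."""
    clusters = []
    for seq in block:
        linked, rest = [], []
        for cluster in clusters:
            is_close = any(percent_identity(seq, member) >= threshold
                           for member in cluster)
            (linked if is_close else rest).append(cluster)
        # Merge the new sequence with every cluster it links to.
        merged = [seq] + [member for cluster in linked for member in cluster]
        clusters = rest + [merged]
    return clusters


block = ["AAAAAAAAAA", "AAAAAAAAAT", "CCCCCCCCCC"]
print(cluster_block(block, 62))
print(cluster_block(block, 95))
```

At a threshold of 62 the first two sequences (90% identical) collapse into one cluster, so they contribute as a single sequence; at 95 all three remain separate.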
36BLOSUM Matrices
From http://www.sdsc.edu/babu/UCSD/week02/dbSearch_tut.html
37BLOSUM Matrices - Summary
- Derived from local, ungapped alignments of distantly related sequences
- All matrices are directly calculated
- The number after the matrix (BLOSUM62) refers to the minimum percent identity of the blocks used to construct the matrix; greater numbers are lesser distances.
- The BLOSUM series of matrices generally perform better than PAM matrices for local similarity searches
From http://omega.cbmi.upmc.edu/vanathi/lec5fall02.ppt
38Comparable BLOSUM and PAM Matrices
Relative entropy: the average information per alignment position that is available to distinguish relevant alignments from alignments expected by chance.
From http://www.sdsc.edu/babu/UCSD/week02/dbSearch_tut.html
39Gap Penalties
Linear gap penalty score: $\gamma(g) = -bk$
Affine gap penalty score: $\gamma(g) = -(a + bk)$
$\gamma(g)$: gap penalty score of a gap of length g
a: gap opening penalty
b: gap extension penalty
k: gap length
Query 85  ADDGCPKPPEIAHGYVEHSVRYQCKNYYKLRTEGDG------VYTLNNEKQWINKAVGDK 138
          ADDGCPKPP IAHGYVEHSVRYQCKNYYKLRTEGDG      VYTLNNEKQWINKAVGDK
Sbjct 62  ADDGCPKPPQIAHGYVEHSVRYQCKNYYKLRTEGDGKMWTTRVYTLNNEKQWINKAVGDK 121
From http://www.ncbi.nlm.nih.gov/blast/html/sub_matrix.html
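The two gap penalty formulas can be compared on the six-residue gap in the alignment above. The values a = 11 and b = 1 are illustrative assumptions, not taken from the slide.

```python
def linear_gap(k, b):
    """Linear penalty: every gap position costs b."""
    return -b * k


def affine_gap(k, a, b):
    """Affine penalty: a once for opening the gap, plus b per gap position."""
    return -(a + b * k)


gap_length = len("KMWTTR")  # the subject residues opposite the gap
print(linear_gap(gap_length, b=1))
print(affine_gap(gap_length, a=11, b=1))
```

Under the affine scheme one gap of length 6 costs 17, whereas six separate single-residue gaps would cost 72, so alignments with a few long gaps are preferred over many scattered ones.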
40Statistics of BLAST searches
Karlin-Altschul equation
Normalized score to bit score
E-value to p-value: $P = 1 - e^{-E}$
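A sketch of the score conversions named on this slide, assuming the usual Karlin-Altschul form E = K m n e^(-lambda S) and the bit score S' = (lambda S - ln K) / ln 2. The values used for lambda, K, and the sequence lengths are illustrative assumptions only.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalize a raw score into bits: (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(raw_score, m, n, lam, k):
    """Expected number of chance HSPs scoring at least raw_score."""
    return k * m * n * math.exp(-lam * raw_score)


# Illustrative values: query length m, database length n, and parameters.
m, n = 250, 1_000_000
lam, k = 0.267, 0.041

raw = 100
bits = bit_score(raw, lam, k)
print(bits)
print(expect_value(raw, m, n, lam, k))
print(m * n * 2 ** (-bits))  # same E, computed from the bit score
```

Because the bit score already absorbs lambda and K, E can be obtained from it with the search space m n alone, which is what makes bit scores comparable across scoring systems.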
41BLAST E values and p values
Very small E values are very similar to p values. E values of about 1 to 10 are far easier to interpret than corresponding p values.
E         p
10        0.99995460
5         0.99326205
2         0.86466472
1         0.63212056
0.1       0.09516258 (about 0.1)
0.05      0.04877058 (about 0.05)
0.001     0.00099950 (about 0.001)
0.0001    0.0001000
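The table can be regenerated from the relation p = 1 - e^(-E). The sketch uses `math.expm1`, which keeps precision for very small E; the function name `p_from_e` is hypothetical.

```python
import math


def p_from_e(e_value):
    """Convert an E value to a p value: p = 1 - exp(-E)."""
    return -math.expm1(-e_value)


for e_value in (10, 5, 2, 1, 0.1, 0.05, 0.001, 0.0001):
    print(f"{e_value:<8g}{p_from_e(e_value):.8f}")
```

For small E the two numbers nearly coincide, while for E above about 1 the p value saturates near 1 and stops being informative.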
42BLAST - Report Format
43Header
BLAST Report
Body
Footer
Bedell et al. 2003
44Header
45Body Graphical Overview
46Body One-line summaries
set by -v
47Body Alignments
set by -b view set by -m ?
- ALIGNMENT_VIEW - Choose how to view alignments.
- Pairwise
- Pairwise with dots for identities
- Query-anchored with dots for identities
- Query-anchored with letters for identities
- Flat query-anchored with dots for identities
- Flat query-anchored with letters for identities
- Hit Table
- The default "pairwise" view shows how each subject sequence aligns individually to the query sequence.
- The "query-anchored" view shows how all subject sequences align to the query sequence.
- For each view type, you can choose to show "identities" (matching residues) as letters or dots.
48Alignments Views - pairwise
set by -m 0
49Alignments Views - Query-anchored with dots for
identities
set by -m 1
50Alignments Views Query-anchored with letters
for identities
set by -m 2
51Alignments Views - Hits Table
set by -m 8
52Footer
BLOSUM matrix
gap penalties
10.0 is the E value
Effective search space: mn = length of query x db length
threshold score (-f): 11
cut-off parameters
53Footer (nucleotide)
No threshold T is reported (nucleotide seeds are exact word matches)
54Thank You!