Title: BLAST I

1
BLAST (I)
Basic Local Alignment Search Tool
Email: jimfann_at_itri.org.tw, 02/26/2008
2
Reference Sources
Jian Ye, Scott McGinnis, and Thomas L. Madden (2006) "BLAST: improvements for better sequence analysis." Nucleic Acids Res. July 1; 34 (Web Server issue): W6-W9.
McGinnis S, Madden TL. (2004) "BLAST: at the core of a powerful and diverse set of sequence analysis tools." Nucleic Acids Res. Jul 1; 32 (Web Server issue): W20-5.
Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990) "Basic local alignment search tool." J. Mol. Biol. 215: 403-410.
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D.J. (1997) "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." Nucleic Acids Res. 25: 3389-3402.
http://www.ncbi.nlm.nih.gov/BLAST/
ftp://ftp.ncbi.nih.gov/blast/
Joseph Bedell, Ian Korf, Mark Yandell (2003) BLAST. O'Reilly. http://www.oreilly.com/catalog/blast/
Jonathan Pevsner (2003) Bioinformatics and Functional Genomics. John Wiley & Sons, Inc. http://www.bioinfbook.org
3
Why use BLAST?

To discover functional, structural and
evolutionary similarities
Because similarity may be an indicator of
homology and thus provide some insight into
function or gene identification.
Applications include
identifying orthologs and paralogs
discovering new genes or proteins
discovering variants of genes or proteins
investigating expressed sequence tags (ESTs)
exploring protein structure and function

4
http://www.ncbi.nlm.nih.gov/BLAST/
5
BLAST
6
Format for query sequence
Key

FASTA (Pearson and Lipman, 1988)
>gi|129295|sp|P01013|OVAX_CHICK Gene X protein (Ovalbumin-related)
QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAE ...
Bare Sequence
QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAE
Identifiers
accession number ( P01013 )
accession number with version code ( AAA68881.1 )
gi ( 129295 , gi|129295 )

Description: the free text that follows the identifiers on the definition line
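As a minimal sketch of how a FASTA definition line in the NCBI gi|...|sp|... layout breaks down into identifiers and description, the following Python may help. The function name `parse_defline` and the dictionary keys are illustrative assumptions, and only this one layout is handled.

```python
def parse_defline(defline):
    """Split a definition line of the form
    >gi|<gi>|<db>|<accession>|<locus> <description>
    into its parts. Other header layouts are not handled (an assumption)."""
    if not defline.startswith(">"):
        raise ValueError("a FASTA definition line starts with '>'")
    # Identifiers run up to the first space; the rest is the description.
    identifiers, _, description = defline[1:].partition(" ")
    fields = identifiers.split("|")
    return {
        "gi": fields[1],
        "database": fields[2],
        "accession": fields[3],
        "locus": fields[4],
        "description": description,
    }


record = parse_defline(
    ">gi|129295|sp|P01013|OVAX_CHICK Gene X protein (Ovalbumin-related)"
)
print(record["gi"], record["accession"], record["description"])
```

For the header on this slide, the sketch recovers the gi 129295 and the accession number P01013 listed under Identifiers.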
7
IUPAC Nucleic acid codes
8
IUPAC Amino acid codes
9
Peptide Sequence Databases (FASTA format)
10
Nucleotide Sequence Databases
11
BLAST - Algorithm parameters
12
SEG filtering of low-complexity segments
Wootton, J.C. & Federhen, S. (1993) "Statistics of local complexity in amino acid sequences and sequence databases." Comput. Chem. 17: 149-163.
Ftp site: ftp://ftp.ncbi.nih.gov/pub/seg/seg/
From http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Seg.html
13
BLAST background on sequence alignment
There are two main approaches to sequence alignment.
1. Global alignment (Needleman & Wunsch, 1970) uses dynamic programming to find optimal alignments between two sequences. (Although the alignments are optimal, the search is not exhaustive.) Gaps are permitted in the alignments, and the total lengths of both sequences are aligned (hence "global").
From www.bioinfbook.org/
14
BLAST background on sequence alignment
2. The second approach is local sequence alignment (Smith & Waterman, 1981). The alignment may contain just a portion of either sequence, and is appropriate for finding matched domains between sequences. S-W is guaranteed to find optimal alignments, but it is computationally expensive (requires O(n^2) time).
BLAST and FASTA are heuristic approximations to local alignment. Each requires only O(n^2/k) time; they examine only part of the search space.
From www.bioinfbook.org/
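To make the cost of exact local alignment concrete, here is a minimal sketch of the Smith-Waterman score computation. The scoring values (match +1, mismatch -1, linear gap -2) and the function name are assumptions chosen for illustration, not BLAST's parameters. The nested loops fill one cell per pair of positions, which is where the quadratic running time comes from.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("FSGTWYA", "VAGTWYS"))
```

With these toy scores the best local alignment of the two strings is the shared segment GTWY.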
15
How a BLAST search works
The central idea of the BLAST algorithm is to
confine attention to segment pairs that contain
a word pair of length w with a score of at least
T. Altschul et al. (1990)
From www.bioinfbook.org/
16
Pairwise alignment scores are determined using a
scoring matrix such as Blosum62
From www.bioinfbook.org/
17
How the original BLAST algorithm works: three phases
Phase 1 (Seeding) <nucleotide: perfect word match>
Compile a list of word pairs (w = 3) above threshold T.
Example for a human RBP (retinol binding protein) query: FSGTWYA (query word is in red on the slide)
A list of words (w = 3) is:
FSG SGT GTW TWY WYA
YSG TGT ATW SWY WFA
FTG SVT GSW TWF WYS
From www.bioinfbook.org/
18
Phase 1: Seeding
Compile a list of words (w = 3)
Neighborhood word hits > threshold (T = 11):
GTW  6,5,11  22
GSW  6,1,11  18
ATW  0,5,11  16
NTW  0,5,11  16
GTY  6,5,2   13
Neighborhood word hits < below threshold (T = 11):
GNW  10
GAW  9
From www.bioinfbook.org/
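The seeding step can be sketched in a few lines of Python. This is only an illustration: the pair scores are the handful of values shown on the slide for the query word GTW, not the full scoring matrix, and the function names are hypothetical.

```python
# Pair scores copied from the slide's table; only the pairs needed here.
PAIR_SCORES = {
    ("G", "G"): 6, ("T", "T"): 5, ("W", "W"): 11,
    ("T", "S"): 1, ("G", "A"): 0, ("G", "N"): 0, ("W", "Y"): 2,
}


def pair_score(a, b):
    """Symmetric lookup; raises KeyError for pairs not in the partial table."""
    if (a, b) in PAIR_SCORES:
        return PAIR_SCORES[(a, b)]
    return PAIR_SCORES[(b, a)]


def words(query, w=3):
    """All overlapping words of length w in the query."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def word_score(query_word, candidate):
    """Sum of position-by-position pair scores."""
    return sum(pair_score(q, c) for q, c in zip(query_word, candidate))


def neighborhood(query_word, candidates, threshold=11):
    """Candidates whose score against the query word is at least the threshold."""
    scored = [(c, word_score(query_word, c)) for c in candidates]
    return [(c, s) for c, s in scored if s >= threshold]


print(words("FSGTWYA"))
print(neighborhood("GTW", ["GTW", "GSW", "ATW", "NTW", "GTY"]))
```

The second call reproduces the five above-threshold scores listed on the slide (22, 18, 16, 16, 13). A real implementation would enumerate all 20^3 candidate words against the full matrix rather than a hand-picked list.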
19
How a BLAST search works: 3 phases
Phase 1: Seeding
You can modify the threshold parameter. The default value for blastp is 11. To change it, enter -f 16 or -f 5 in the advanced options.
Scan the database for entries that match the compiled list. This is fast and relatively easy.
From www.bioinfbook.org/
20
How a BLAST search works: 3 phases
Phase 2: Extension
When you manage to find a hit (i.e. a match between a word and a database entry), extend the hit in either direction. Keep track of the score (use a scoring matrix). Stop when the score drops below some cutoff.
KENFDKARFSGTWYAMAKKDPEG 50 RBP (query)
MKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin (hit)
<- extend   Hit!   extend ->
From www.bioinfbook.org/
21
How a BLAST search works: 3 phases
Phase 2: Extension
The quick brown fox jumps over the lazy dog.
The quiet brown cat purrs when she sees him.
Scoring matrix: match +1, mismatch -1; seed: T
The quick brown fox jump
The quiet brown cat purr
123 45654 56789 876 5654 <- score
000 00012 10000 123 4345 <- drop off score
With a drop off score of 5, extension stops at the final p/r of jump/purr.
With a drop off score of 2, extension stops at the k/t of quick/quiet.
From Korf et al. 2003
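The score and drop off rows of this example can be reproduced with a short sketch of ungapped extension to the right. The function name is hypothetical, spaces are skipped because the slide does not score them, and the stopping rule (stop once the score has fallen x_drop below the best score seen) is an assumption about how the cutoff is applied.

```python
def extend_right(query, subject, x_drop, match=1, mismatch=-1):
    """Extend an ungapped alignment left to right.

    Returns the aligned query prefix ending at the best score, and that score.
    """
    score = best = best_end = 0
    for i, (q, s) in enumerate(zip(query, subject)):
        if q == " " and s == " ":
            continue  # the slide leaves spaces unscored
        score += match if q == s else mismatch
        if score > best:
            best, best_end = score, i + 1
        elif best - score >= x_drop:
            break  # dropped too far below the best score seen so far
    # Trim back to the position where the best score was reached.
    return query[:best_end], best


a = "The quick brown fox jump"
b = "The quiet brown cat purr"
print(extend_right(a, b, x_drop=5))
print(extend_right(a, b, x_drop=2))
```

With a drop off of 5 the sketch keeps "The quick brown" (score 9); with a drop off of 2 it stops early and keeps only "The qui" (score 6), which shows how a small X can end an extension before a longer, better alignment is found.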
22
How a BLAST search works: 3 phases
Phase 2: Extension
In the original (1990) implementation of BLAST, hits were extended in either direction. In a 1997 refinement of BLAST, two independent hits are required. The hits must occur in close proximity to each other. With this modification, only one seventh as many extensions occur, greatly reducing the time required for a search.
-y: Dropoff (X) for blast extensions in bits (default if zero); default is 20 for blastn, 7 for other programs
From www.bioinfbook.org/
23
How a BLAST search works: 3 phases
Phase 3: Evaluation of HSP
High-scoring segment pair (HSP)
Alignment threshold (set by software), shown in the Footer: S1 (ungapped), S2 (gapped)
Final alignment threshold (set as -e)
24
Nucleotide BLAST

Megablast is intended for comparing a query to closely related sequences. It works best if the target percent identity is 95% or more, and it is very fast.
Discontiguous megablast uses an initial seed that
ignores some bases (allowing mismatches) and is
intended for cross-species comparisons.
BlastN is slow, but allows a word-size down to
seven bases.

25
Nucleotide BLAST - seeding

Exact word match for
Blastn: word length 11
Megablast: word length 28
Discontiguous megablast
Template length: 16, 18, 21.
Word size (i.e. number of 1s in the template): 11, 12.
Template type: coding, non-coding.
Require two words for extension: yes/no.

26
Scoring Matrix
Ftp site: ftp://ftp.ncbi.nih.gov/blast/matrices/

Simple scoring system

ftp://ftp.ncbi.nih.gov/blast/matrices/IDENTITY
ftp://ftp.ncbi.nih.gov/blast/matrices/MATCH
27
Scoring Matrix

Conservative amino acid substitutions are those between residues with similar physicochemical properties:
Isoleucine for Valine (both aliphatic, hydrophobic)
Serine for Threonine (both polar)
...

[Figure: Venn diagram grouping the amino acids by property: tiny, small, aliphatic, aromatic, hydrophobic, polar, charged, positive]
From www.sanbi.ac.za/
28
Scoring Matrix

=> Substitution matrix
to increase sensitivity of the alignment algorithm
flexible lookup scheme for any pair of amino acids
PAM, BLOSUM
Calculating similarity scores (log-odds scores)

29
Scoring - (BLOSUM 62)
From http://blast.wustl.edu/doc/infotheory.html
30
PAM (Percent Accepted Mutations) matrices

Derived from global alignments of protein families. Family members share at least 85% identity (Dayhoff et al., 1978).
Construction of a phylogenetic tree and ancestral sequences for each protein family
Computation of the number of replacements for each pair of amino acids

From www.sanbi.ac.za/ and http://www.sdsc.edu/babu/UCSD/week02/dbSearch_tut.html
31
PAM (Percent Accepted Mutations) matrices

The numbers of replacements were used to compute a so-called PAM-1 matrix.
The PAM-1 matrix reflects an average change of 1% of all amino acid positions. PAM matrices for larger evolutionary distances can be extrapolated from the PAM-1 matrix.
Matrix multiplication using PAM-1
PAM250 = 250 mutations per 100 residues.
Family of matrices: PAM10 - PAM200
Greater numbers mean bigger evolutionary distance

From www.sanbi.ac.za/
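The extrapolation by matrix multiplication can be illustrated with a toy example. The 3-letter alphabet and the probabilities below are made up for illustration (the real PAM-1 matrix is 20 x 20 and estimated from data); only the idea that PAM-n is the n-th power of PAM-1 comes from the slide.

```python
def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """m raised to a non-negative integer power by repeated multiplication."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Hypothetical PAM-1-like matrix: each residue stays unchanged with
# probability 0.99, i.e. an average change of 1% of positions.
TOY_PAM1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]

toy_pam250 = mat_pow(TOY_PAM1, 250)
print([round(x, 3) for x in toy_pam250[0]])
```

Each row still sums to 1 after 250 steps, but the diagonal has decayed toward the background frequency. This is also why PAM250 does not mean every position has changed: repeated and reverse substitutions hit the same positions.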
32
PAM Matrices

If changes were purely random:
Frequency of each possible substitution is proportional to background frequencies
In related proteins:
Observed substitution frequencies, called the target (replacement) frequencies, are biased toward those that do not seriously disrupt the protein's function
These point mutations are accepted during evolution
Log-odds approach:
Scores proportional to the natural log of the ratio of target frequencies to background frequencies

From http://omega.cbmi.upmc.edu/vanathi/
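The log-odds approach can be written as a one-line function. The frequencies in the example are hypothetical numbers chosen to show the sign of the score, and the `scale` parameter stands in for whatever proportionality constant a given matrix uses.

```python
import math


def log_odds_score(target_freq, background_i, background_j, scale=1.0):
    """scale * ln(q_ij / (p_i * p_j)).

    q_ij is the target (observed) frequency of the pair; p_i and p_j are
    the background frequencies of the two residues.
    """
    return scale * math.log(target_freq / (background_i * background_j))


# Hypothetical frequencies: chance alone predicts 0.1 * 0.1 = 0.01.
print(log_odds_score(0.02, 0.1, 0.1))   # pair seen more often than chance
print(log_odds_score(0.005, 0.1, 0.1))  # pair seen less often than chance
```

A pair observed more often than chance gets a positive score, and a pair observed less often gets a negative one. This is why conservative substitutions score above zero in PAM and BLOSUM matrices.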
33
PAM Matrices salient points

Derived from global alignments of closely related sequences.
Matrices for greater evolutionary distances are extrapolated from those for lesser ones.
The number with the matrix (PAM40, PAM100) refers to the evolutionary distance; greater numbers are greater distances.
Does not take into account different evolutionary rates between conserved and non-conserved regions.

From http://omega.cbmi.upmc.edu/vanathi/
34
BLOSUM Matrices

Henikoff, S. & Henikoff, J.G. (1992)
Use blocks of protein sequence fragments from different families (the BLOCKS database)
Amino acid pair frequencies calculated by summing over all possible pairs in a block
Different evolutionary distances are incorporated into this scheme with a clustering procedure (identity over a particular threshold = same cluster)

From http://omega.cbmi.upmc.edu/vanathi/
35
BLOSUM Matrices

Probabilities estimated from blocks of sequence fragments
Blocks represent structurally conserved regions

Target frequencies are identified directly
Sequences more than x% identical within the block where substitutions are being counted are grouped together and treated as a single sequence
BLOSUM50: x = 50% identity
BLOSUM62: x = 62% identity

From http://omega.cbmi.upmc.edu/vanathi/
36
BLOSUM Matrices
From http://www.sdsc.edu/babu/UCSD/week02/dbSearch_tut.html
37
BLOSUM Matrices - Summary

Derived from local, ungapped alignments of distantly related sequences
All matrices are directly calculated
The number after the matrix (BLOSUM62) refers to the minimum percent identity of the blocks used to construct the matrix; greater numbers are lesser distances.
The BLOSUM series of matrices generally performs better than PAM matrices for local similarity searches

From http://omega.cbmi.upmc.edu/vanathi/lec5fall02.ppt
38
Comparable BLOSUM and PAM Matrices
Relative entropy: the average information per alignment position available to distinguish relevant alignments from alignments expected by chance
From http://www.sdsc.edu/babu/UCSD/week02/dbSearch_tut.html
39
Gap Penalties
Linear gap penalty score: γ(g) = -bk
Affine gap penalty score: γ(g) = -(a + bk)
γ(g): gap penalty score of gap g
a: gap opening penalty; b: gap extension penalty; k: gap length
Query 85  ADDGCPKPPEIAHGYVEHSVRYQCKNYYKLRTEGDG------VYTLNNEKQWINKAVGDK 138
          ADDGCPKPP+IAHGYVEHSVRYQCKNYYKLRTEGDG      VYTLNNEKQWINKAVGDK
Sbjct 62  ADDGCPKPPQIAHGYVEHSVRYQCKNYYKLRTEGDGKMWTTRVYTLNNEKQWINKAVGDK 121
From http://www.ncbi.nlm.nih.gov/blast/html/sub_matrix.html
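The two gap penalty formulas translate directly into code. In the example, the opening penalty of 11 and extension penalty of 1 are assumed values (the gap costs commonly used with BLOSUM62), and the gap length of 6 is read off the alignment on this slide.

```python
def linear_gap_penalty(k, b):
    """Score of a gap of length k when every gap position costs b."""
    return -(b * k)


def affine_gap_penalty(k, a, b):
    """Score of a gap of length k: opening cost a plus b per gap position."""
    return -(a + b * k)


gap_length = 6  # the run of six dashes in the query line
print(linear_gap_penalty(gap_length, b=1))
print(affine_gap_penalty(gap_length, a=11, b=1))
```

Under the affine scheme one gap of length 6 costs 17, whereas six separate gaps of length 1 would cost 72, so the alignment is pushed toward fewer, longer gaps.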
40
Statistics of BLAST searches
Karlin-Altschul equation
normalized-score to bit-score
E-value to p-value: P = 1 - e^(-E)
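The equations on this slide were not captured in the transcript. As a hedged sketch, the standard forms are E = K m n e^(-lambda S) for the Karlin-Altschul equation and S' = (lambda S - ln K) / ln 2 for the bit score, which gives E = m n 2^(-S'). The lambda and K values below are ones commonly reported for BLOSUM62 with gap costs 11/1 and are used here as assumptions; the query and database lengths are made up.

```python
import math


def e_value_from_raw(raw_score, m, n, lam, k):
    """Karlin-Altschul equation: E = K * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * raw_score)


def bit_score(raw_score, lam, k):
    """Normalized score in bits: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value_from_bits(bits, m, n):
    """The same E value expressed through the bit score: E = m * n * 2^(-S')."""
    return m * n * 2.0 ** (-bits)


LAMBDA, K = 0.267, 0.041          # assumed gapped BLOSUM62 parameters
QUERY_LEN, DB_LEN = 250, 10_000_000  # hypothetical search space

raw = 100
bits = bit_score(raw, LAMBDA, K)
print(bits, e_value_from_raw(raw, QUERY_LEN, DB_LEN, LAMBDA, K))
```

Both routes give the same E value. The bit score folds lambda and K into the score itself, so bit scores from searches with different scoring systems can be compared.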
41
BLAST E values and p values
Very small E values are very similar to p values. E values of about 1 to 10 are far easier to interpret than the corresponding p values.
E        p
10       0.99995460
5        0.99326205
2        0.86466472
1        0.63212056
0.1      0.09516258 (about 0.1)
0.05     0.04877058 (about 0.05)
0.001    0.00099950 (about 0.001)
0.0001   0.0001000
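The E-to-p conversion on this slide follows from p = 1 - e^(-E). A short sketch for recomputing it is below; the function name is illustrative, and `math.expm1` is used only to keep precision for very small E.

```python
import math


def p_value(e_value):
    """p = 1 - exp(-E), computed as -expm1(-E) for accuracy at small E."""
    return -math.expm1(-e_value)


for e in (10, 5, 2, 1, 0.1, 0.05, 0.001, 0.0001):
    print(f"{e:<8} {p_value(e):.8f}")
```

For small E the exponential is nearly linear, so p is approximately E, which is why the last rows of the conversion are labeled "about".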
42
BLAST - Report Format
43
BLAST Report
Header
Body
Footer
Bedell et al. 2003
44
Header
45
Body Graphical Overview
46
Body One-line summaries
set by -v
47
Body Alignments
set by -b view set by -m ?

ALIGNMENT_VIEW - Choose how to view alignments.
Pairwise
Pairwise with dots for identities
Query-anchored with dots for identities
Query-anchored with letters for identities
Flat query-anchored with dots for identities
Flat query-anchored with letters for identities
Hit Table
The default "pairwise" view shows how each
subject sequence aligns individually to the query
sequence.
The "query-anchored" view shows how all subject
sequences align to the query sequence.
For each view type, you can choose to show
"identities" (matching residues) as letters or
dots.

48
Alignments Views - pairwise
set by -m 0
49
Alignments Views - Query-anchored with dots for
identities
set by -m 1
50
Alignments Views - Query-anchored with letters for identities
set by -m 2
51
Alignments Views - Hits Table
set by -m 8
52
Footer
BLOSUM matrix
gap penalties
10.0 is the E value
Effective search space = m x n (length of query x db length)
threshold score (f): 11
cut-off parameters
53
Footer (nucleotide)
No T
54
Thank You!
