BLAST I - PowerPoint PPT Presentation

1
BLAST (I)
Basic Local Alignment Search Tool
email jimfann_at_itri.org.tw 02/26/2008
2
Reference Sources
Jian Ye, Scott McGinnis, and Thomas L. Madden (2006) "BLAST: improvements for better sequence analysis." Nucleic Acids Res. 34 (Web Server issue): W6-W9.
McGinnis S, Madden TL. (2004) "BLAST: at the core of a powerful and diverse set of sequence analysis tools." Nucleic Acids Res. 32 (Web Server issue): W20-5.
Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990) "Basic local alignment search tool." J. Mol. Biol. 215:403-410.
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W. & Lipman, D.J. (1997) "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs." Nucleic Acids Res. 25:3389-3402.
http://www.ncbi.nlm.nih.gov/BLAST/
ftp://ftp.ncbi.nih.gov/blast/
Joseph Bedell, Ian Korf, Mark Yandell (2003) BLAST. O'Reilly. http://www.oreilly.com/catalog/blast/
Jonathan Pevsner (2003) Bioinformatics and Functional Genomics. John Wiley & Sons, Inc. http://www.bioinfbook.org
3
Why use BLAST?
  • To discover functional, structural and
    evolutionary similarities
  • Because similarity may be an indicator of
    homology and thus provide some insight into
    function or gene identification.
  • Applications include:
  • identifying orthologs and paralogs
  • discovering new genes or proteins
  • discovering variants of genes or proteins
  • investigating expressed sequence tags (ESTs)
  • exploring protein structure and function

4
http://www.ncbi.nlm.nih.gov/BLAST/
5
BLAST
6
Format for query sequence
Key
  • FASTA (Pearson and Lipman, 1988)
  • >gi|129295|sp|P01013|OVAX_CHICK Gene X
    protein (Ovalbumin-related)
  • QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAE ...
  • Bare Sequence
  • QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAE
  • Identifiers
  • accession number ( P01013 )
  • accession number with version code (
    AAA68881.1 )
  • gi ( 129295 , gi|129295 )

description
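The FASTA layout above (a ">" definition line followed by sequence lines) can be read with a few lines of Python. This is only a minimal sketch: the function names parse_fasta and parse_defline are made up for illustration, and the defline parser assumes the NCBI-style "tag|value|tag|value" identifier layout shown on the slide.

```python
def parse_fasta(text):
    """Return a list of (defline, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new definition line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def parse_defline(header):
    """Split an NCBI-style defline into an identifier dict and a description.

    Assumes identifiers come as tag|value pairs (gi|129295|sp|P01013|...);
    a trailing unpaired field such as OVAX_CHICK is ignored here.
    """
    ids, _, description = header.partition(" ")
    fields = ids.split("|")
    return dict(zip(fields[0::2], fields[1::2])), description


example = """>gi|129295|sp|P01013|OVAX_CHICK Gene X protein (Ovalbumin-related)
QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAE
"""
defline, sequence = parse_fasta(example)[0]
identifiers, description = parse_defline(defline)
print(identifiers["gi"], identifiers["sp"], description, len(sequence))
```

A bare sequence (no ">" line) yields no records in this sketch; a real tool would treat it as a single unnamed query.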
7
IUPAC Nucleic acid codes
8
IUPAC Amino acid codes
9
Peptide Sequence Databases (FASTA format)
10
Nucleotide Sequence Databases
11
BLAST - Algorithm parameters
12
SEG filtering of low-complexity segments
Wootton, J.C. & Federhen, S. (1993) "Statistics of local complexity in amino acid sequences and sequence databases." Comput. Chem. 17:149-163.
Ftp site: ftp://ftp.ncbi.nih.gov/pub/seg/seg/
From http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Seg.html
13
BLAST background on sequence alignment
There are two main approaches to sequence alignment.
1. Global alignment (Needleman & Wunsch, 1970) uses dynamic
programming to find optimal alignments between
two sequences. (Although the alignments are
optimal, the search is not exhaustive.) Gaps are
permitted in the alignments, and the total
lengths of both sequences are aligned (hence
global).
From www.bioinfbook.org/
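The dynamic programming idea behind global alignment can be sketched as follows. The scoring values (match +1, mismatch -1, linear gap -2) are assumptions chosen for illustration, not values from the slides, and the function returns only the optimal score, not the alignment itself.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of a and b (linear gap penalty)."""
    # Row 0: aligning an empty prefix of a against b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * gap]  # column 0: a[:i] against an empty prefix of b
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    # Bottom-right cell: both sequences aligned over their full lengths.
    return prev[-1]


print(needleman_wunsch_score("GATTACA", "GATTACA"))
print(needleman_wunsch_score("ACGT", "AGT"))
```

Every cell of the table is filled, so the time grows with the product of the two sequence lengths.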
14
BLAST background on sequence alignment
2. The second approach is local sequence
alignment (Smith & Waterman, 1981). The
alignment may contain just a portion of either
sequence, and is appropriate for finding
matched domains between sequences. S-W is
guaranteed to find optimal alignments, but it is
computationally expensive (requires O(n^2)
time). BLAST and FASTA are heuristic
approximations to local alignment. Each requires
only O(n^2/k) time because they examine only part
of the search space.
From www.bioinfbook.org/
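For comparison with the heuristics, a minimal Smith-Waterman scorer is sketched below. The scoring values (match +1, mismatch -1, linear gap -2) are illustrative assumptions; BLAST itself uses a substitution matrix and different gap costs. The key difference from global alignment is that a cell is never allowed to drop below zero, so an alignment can start and end anywhere.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # The 0 option restarts the alignment: this makes it local.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


# Two short peptide fragments that share the stretch GTWY.
print(smith_waterman_score("FSGTWYA", "VAGTWYS"))
```

The two nested loops visit every cell, which is the O(n^2) cost mentioned above.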
15
How a BLAST search works
The central idea of the BLAST algorithm is to
confine attention to segment pairs that contain
a word pair of length w with a score of at least
T. Altschul et al. (1990)
From www.bioinfbook.org/
16
Pairwise alignment scores are determined using a
scoring matrix such as Blosum62
From www.bioinfbook.org/
17
How the original BLAST algorithm works: three
phases
Phase 1 (Seeding) <Nucleotide: word perfect
match> Compile a list of word pairs (w=3) above
threshold T.
Example: for a human RBP (retinol
binding protein) query ...FSGTWYA... (query word is
in red)
A list of words (w=3) is: FSG SGT GTW
TWY WYA YSG TGT ATW SWY WFA FTG SVT GSW TWF WYS
From www.bioinfbook.org/
18
Phase 1: Seeding
compile a list of words (w=3)
Word   Scores    Total
GTW    6,5,11    22
GSW    6,1,11    18
ATW    0,5,11    16
NTW    0,5,11    16
GTY    6,5,2     13
(above: neighborhood word hits > threshold)
GNW              10
GAW              9
(below: neighborhood word hits < threshold, T=11)
From www.bioinfbook.org/
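The seeding step can be sketched as a brute-force enumeration of candidate words. The pair scores below are only the handful that can be read off the slide (for example G/G = 6, T/S = 1, W/Y = 2). Every other pair is given an assumed placeholder score of -4, so totals for words not listed on the slide are artifacts of this toy table and not real matrix values. The small alphabet and the threshold of 16 are also chosen just to keep the output short.

```python
from itertools import product

# Pair scores read off the slide for the query word GTW.
PAIR_SCORES = {
    ("G", "G"): 6, ("T", "T"): 5, ("W", "W"): 11,
    ("T", "S"): 1, ("G", "A"): 0, ("G", "N"): 0, ("W", "Y"): 2,
}
DEFAULT_SCORE = -4  # assumption: placeholder for pairs not shown on the slide


def pair_score(q, c):
    """Symmetric lookup in the toy table."""
    return PAIR_SCORES.get((q, c), PAIR_SCORES.get((c, q), DEFAULT_SCORE))


def neighborhood(query_word, alphabet, threshold):
    """All words of the same length scoring at least `threshold` against the query."""
    hits = []
    for cand in product(alphabet, repeat=len(query_word)):
        total = sum(pair_score(q, c) for q, c in zip(query_word, cand))
        if total >= threshold:
            hits.append(("".join(cand), total))
    # Highest score first, ties broken alphabetically.
    return sorted(hits, key=lambda h: (-h[1], h[0]))


for word, total in neighborhood("GTW", "AGNSTWY", threshold=16):
    print(word, total)
```

With the full 20-letter alphabet this enumerates 8000 candidates per query word, which is still cheap; the resulting word list is what gets looked up in the database.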
19
How a BLAST search works: 3 phases
Phase 1: Seeding
You can modify the threshold
parameter. The default value for blastp is 11.
To change it, enter -f 16 or -f 5 in the
advanced options.
Scan the database for entries
that match the compiled list. This is fast and
relatively easy.
From www.bioinfbook.org/
20
How a BLAST search works: 3 phases
Phase 2: Extension
When you manage to find a
hit (i.e. a match between a word and a
database entry), extend the hit in either
direction. Keep track of the score (use a
scoring matrix). Stop when the score drops below
some cutoff.
KENFDKARFSGTWYAMAKKDPEG 50 RBP (query)
MKGLDIQKVAGTWYSLAMAASD. 44 lactoglobulin (hit)
[Figure labels: "Hit!" with "extend" arrows in both directions]
From www.bioinfbook.org/
21
How a BLAST search works: 3 phases
Phase 2: Extension
The quick brown fox jumps over the lazy dog.
The quiet brown cat purrs when she sees him.
Scoring matrix: match 1, mismatch -1. Seed: T
The quick brown fox jump
The quiet brown cat purr
123 45654 56789 876 5654 <- score
000 00012 10000 123 4345 <- drop off score
<- drop off score 5
<- drop off score 2
From Korf et al. 2003
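The sentence example can be reproduced with a short ungapped extension routine. This is a sketch under stated assumptions: spaces are skipped (as in the score rows above), extension runs to the right only, and it stops as soon as the running score has fallen X below the best score seen so far. The function name extend_right is made up for illustration.

```python
def extend_right(query, subject, match=1, mismatch=-1, x_drop=5):
    """Ungapped extension to the right with an X-drop stopping rule.

    Returns (best score, aligned prefix of the query at the best score).
    """
    score = best = best_len = 0
    for i, (q, s) in enumerate(zip(query, subject), start=1):
        score += match if q == s else mismatch
        if score > best:
            best, best_len = score, i
        elif best - score >= x_drop:
            break  # the drop-off has reached X: give up extending
    return best, query[:best_len]


# Spaces are not scored in the slide's example, so remove them first.
q = "The quick brown fox jump".replace(" ", "")
s = "The quiet brown cat purr".replace(" ", "")

print(extend_right(q, s, x_drop=5))  # extension survives the "ck"/"et" mismatch
print(extend_right(q, s, x_drop=2))  # extension stops inside "quick"
```

A larger X lets the extension ride through short mismatched stretches at the cost of more work per hit; the alignment reported is trimmed back to the position of the best score.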
22
How a BLAST search works: 3 phases
Phase 2: Extension
In the original (1990)
implementation of BLAST, hits were extended in
either direction. In a 1997 refinement of BLAST,
two independent hits are required. The hits must
occur in close proximity to each other. With this
modification, only one seventh as many extensions
occur, greatly reducing the time required for a
search.
-y  Dropoff (X) for blast extensions in bits
(default if zero); default: 20 for blastn, 7
for other programs
From www.bioinfbook.org/
23
How a BLAST search works: 3 phases
Phase 3: Evaluation of HSP
High-scoring segment pair (HSP)
Alignment threshold (set by software), reported
in the Footer: S1 (ungapped), S2 (gapped)
Final alignment threshold (set as -e)
24
Nucleotide BLAST
  • Megablast is intended for comparing a query to
    closely related sequences. It is very fast and
    works best if the target percent identity is
    95% or more.
  • Discontiguous megablast uses an initial seed that
    ignores some bases (allowing mismatches) and is
    intended for cross-species comparisons.
  • BlastN is slow, but allows a word-size down to
    seven bases.

25
Nucleotide BLAST - seeding
  • Exact word match for
  • Blastn: word length 11
  • Megablast: word length 28
  • Discontiguous Megablast
  • Template length: 16, 18, 21.
  • Word size (i.e. number of 1s in the template):
    11, 12
  • Template type: coding, non-coding.
  • Require two words for extension: yes/no.

26
Scoring Matrix
Ftp site: ftp://ftp.ncbi.nih.gov/blast/matrices/
  • Simple scoring system

ftp://ftp.ncbi.nih.gov/blast/matrices/IDENTITY
ftp://ftp.ncbi.nih.gov/blast/matrices/MATCH
27
Scoring Matrix
  • Conservative amino acids substitution due to
    similar physicochemical properties
  • Isoleucine for Valine (both small, hydrophobic)
  • Serine for Threonine (both polar)
  • ...

[Figure: Venn diagram grouping the amino acids by physicochemical property. Set labels: tiny, small, aliphatic, aromatic, hydrophobic, polar, charged, positive]
From www.sanbi.ac.za/
28
Scoring Matrix
  • Substitution matrix:
  • to increase sensitivity of the alignment
    algorithm
  • flexible lookup scheme for any pair of amino
    acids
  • PAM, BLOSUM
  • Calculating Similarity Scores (log-odds scores)

29
Scoring - (BLOSUM 62)
From http://blast.wustl.edu/doc/infotheory.html
30
PAM (Percent Accepted Mutations) matrices
  • Derived from global alignments of protein
    families. Family members share at least 85%
    identity (Dayhoff et al., 1978).
  • Construction of phylogenetic tree and ancestral
    sequences of each protein family
  • Computation of number of replacements for each
    pair of amino acids

From www.sanbi.ac.za/ http://www.sdsc.edu/babu/UCSD/week02/dbSearch_tut.html
31
PAM (Percent Accepted Mutations) matrices
  • The numbers of replacements were used to compute
    a so-called PAM-1 matrix.
  • The PAM-1 matrix reflects an average change of
    1% of all amino acid positions. PAM matrices for
    larger evolutionary distances can be extrapolated
    from the PAM-1 matrix.
  • Matrix multiplication using PAM-1
  • PAM250 = 250 mutations per 100 residues.
  • Family of matrices: PAM10 ... PAM200
  • Greater numbers mean bigger evolutionary
    distance

From www.sanbi.ac.za/
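The extrapolation by matrix multiplication can be illustrated with a deliberately tiny model. The 2-letter "alphabet" and its mutation probabilities below are invented for the sketch (the real PAM-1 matrix is 20 x 20 and comes from Dayhoff's counts); only the idea that PAM-n is PAM-1 multiplied by itself n times is taken from the slide.

```python
def mat_mul(a, b):
    """Plain matrix product of two square lists-of-lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power by repeated squaring."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = m
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result


# Toy "PAM-1": each residue has a 1% chance of being replaced per step.
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]
toy_pam250 = mat_pow(toy_pam1, 250)

print(toy_pam1[0][0], round(toy_pam250[0][0], 4))
```

Even after 250 steps the diagonal stays above the random-expectation value (0.5 in this toy model), which is why distant matrices such as PAM250 still carry signal.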
32
PAM Matrices
  • If changes were purely random
  • Frequency of each possible substitution is
    proportional to background frequencies
  • In related proteins
  • Observed substitution frequencies, called the
    target (replacement) frequencies, are biased
    toward those that do not seriously disrupt the
    protein's function
  • These point mutations are accepted during
    evolution
  • Log-odds approach
  • Scores are proportional to the natural log of
    the ratio of target frequencies to background
    frequencies

From http://omega.cbmi.upmc.edu/vanathi/
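The log-odds idea fits in one line of arithmetic: score(i, j) = (1/lambda) * ln(q_ij / (p_i * p_j)), where q_ij is the target frequency of the pair and p_i, p_j are the background frequencies. The sketch below uses invented frequencies purely to show the sign of the score; lambda is a scaling constant and is left at 1 here as an assumption.

```python
import math


def log_odds(q_ij, p_i, p_j, lam=1.0):
    """Log-odds score: ln(target frequency / chance frequency), scaled by 1/lam."""
    return math.log(q_ij / (p_i * p_j)) / lam


# Hypothetical frequencies, for illustration only.
p_a, p_b = 0.10, 0.10                       # background frequencies
print(round(log_odds(0.04, p_a, p_b), 3))    # pair seen more often than chance
print(round(log_odds(0.0025, p_a, p_b), 3))  # pair seen less often than chance
```

Pairs observed more often than chance get positive scores, pairs observed less often get negative scores, and a pair seen exactly as often as chance scores zero.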
33
PAM Matrices salient points
  • Derived from global alignments of closely related
    sequences.
  • Matrices for greater evolutionary distances are
    extrapolated from those for lesser ones.
  • The number with the matrix (PAM40, PAM100) refers
    to the evolutionary distance; greater numbers are
    greater distances.
  • Does not take into account different evolutionary
    rates between conserved and non-conserved
    regions.

From http://omega.cbmi.upmc.edu/vanathi/
34
BLOSUM Matrices
  • Henikoff, S. & Henikoff, J.G. (1992)
  • Use blocks of protein sequence fragments from
    different families (the BLOCKS database)
  • Amino acid pair frequencies calculated by summing
    over all possible pairs in block
  • Different evolutionary distances are incorporated
    into this scheme with a clustering procedure
    (identity over a particular threshold = same
    cluster)

From http://omega.cbmi.upmc.edu/vanathi/
35
BLOSUM Matrices
  • Probabilities estimated from blocks of sequence
    fragments
  • Blocks represent structurally conserved regions
  • Target frequencies are identified directly
  • Sequences more than x% identical within the
    block where substitutions are being counted are
    grouped together and treated as a single sequence
  • BLOSUM 50: > 50% identity
  • BLOSUM 62: > 62% identity

From http://omega.cbmi.upmc.edu/vanathi/
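The clustering step can be sketched as follows. This is a simplified illustration, not the Henikoffs' actual procedure: it assumes equal-length ungapped block rows, uses single-linkage grouping (a sequence joins a cluster if it reaches the threshold against any member), and the block itself is invented.

```python
def percent_identity(a, b):
    """Percent of positions that match between two equal-length block rows."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_block(seqs, threshold):
    """Group sequences whose identity to any cluster member is >= threshold."""
    clusters = []
    for s in seqs:
        linked = [c for c in clusters
                  if any(percent_identity(s, m) >= threshold for m in c)]
        rest = [c for c in clusters if all(c is not l for l in linked)]
        # Merge the new sequence with every cluster it links to.
        clusters = rest + [[s] + [m for c in linked for m in c]]
    return clusters


block = ["AAAA", "AAAT", "TTTT"]   # hypothetical block, for illustration
print(cluster_block(block, 62))    # AAAA and AAAT are 75% identical: one cluster
print(cluster_block(block, 80))    # at 80% nothing is grouped
```

Each cluster is then counted as a single sequence, so a lower threshold merges more sequences and leaves only the more divergent pairs to contribute substitution counts.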
36
BLOSUM Matrices
From http://www.sdsc.edu/babu/UCSD/week02/dbSearch_tut.html
37
BLOSUM Matrices - Summary
  • Derived from local, ungapped alignments of
    distantly related sequences
  • All matrices are directly calculated
  • The number after the matrix (BLOSUM62) refers to
    the minimum percent identity of the blocks used
    to construct the matrix; greater numbers are
    lesser distances.
  • The BLOSUM series of matrices generally perform
    better than PAM matrices for local similarity
    searches

From http://omega.cbmi.upmc.edu/vanathi/lec5fall02.ppt
38
Comparable BLOSUM and PAM Matrices
Relative entropy: the average information per
alignment position available to distinguish
relevant alignments from alignments expected by
chance
From http://www.sdsc.edu/babu/UCSD/week02/dbSearch_tut.html
39
Gap Penalties
Linear gap penalty score: γ(g) = -bk
Affine gap penalty score: γ(g) = -(a + bk)
γ(g) = gap penalty score of a gap of length g
a = gap opening penalty, b = gap extension
penalty, k = gap length
Query 85 ADDGCPKPPEIAHGYVEHSVRYQCKNYYKLRTEGDG------VYTLNNEKQWINKAVGDK 138
         ADDGCPKPP IAHGYVEHSVRYQCKNYYKLRTEGDG      VYTLNNEKQWINKAVGDK
Sbjct 62 ADDGCPKPPQIAHGYVEHSVRYQCKNYYKLRTEGDGKMWTTRVYTLNNEKQWINKAVGDK 121
From http://www.ncbi.nlm.nih.gov/blast/html/sub_matrix.html
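The two penalty schemes differ only in the opening term, which a short sketch makes concrete. The values a = 11 and b = 1 are used here as illustrative assumptions, and the gap length of 6 matches the run of dashes in the alignment above.

```python
def linear_gap(k, b):
    """Linear gap penalty: every gapped position costs b."""
    return -b * k


def affine_gap(k, a, b):
    """Affine gap penalty: pay a once to open, then b per gapped position."""
    return -(a + b * k)


k = 6  # the query above has one gap of six positions
print(linear_gap(k, b=1))
print(affine_gap(k, a=11, b=1))
# Two separate gaps of three positions cost more than one gap of six
# under the affine scheme, because the opening penalty is paid twice.
print(2 * affine_gap(3, a=11, b=1))
```

The affine form favours a few long gaps over many short ones, which matches how insertions and deletions tend to occur as contiguous blocks.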
40
Statistics of BLAST searches
Karlin-Altschul equation
normalized-score to bit-score
P = 1 - e^(-E)
E-value p-value
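The formulas named on this slide did not survive in the transcript, so the sketch below uses the standard published forms: E = K m n e^(-lambda S) for the Karlin-Altschul equation and S' = (lambda S - ln K) / ln 2 for the bit score, which gives E = m n 2^(-S'). The values of lambda and K, the raw score and the sequence lengths are illustrative assumptions, not parameters taken from the slides.

```python
import math


def expect_value(raw_score, m, n, lam, K):
    """Karlin-Altschul equation: E = K * m * n * exp(-lambda * S)."""
    return K * m * n * math.exp(-lam * raw_score)


def bit_score(raw_score, lam, K):
    """Normalized (bit) score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def expect_from_bits(bits, m, n):
    """The same E value expressed through the bit score: E = m * n * 2^(-S')."""
    return m * n * 2.0 ** (-bits)


# Illustrative values only (assumptions, not from the slides).
lam, K = 0.318, 0.134
m, n = 250, 1_000_000       # query length and database length
S = 80                      # raw alignment score

bits = bit_score(S, lam, K)
print(round(bits, 1), expect_value(S, m, n, lam, K), expect_from_bits(bits, m, n))
```

Because the bit score absorbs lambda and K, bit scores from searches that used different scoring systems can be compared directly.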
41
BLAST E values and p values
Very small E values are very similar to p values.
E values of about 1 to 10 are far easier to
interpret than corresponding p values.
E        p
10       0.99995460
5        0.99326205
2        0.86466472
1        0.63212056
0.1      0.09516258 (about 0.1)
0.05     0.04877058 (about 0.05)
0.001    0.00099950 (about 0.001)
0.0001   0.0001000
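The table follows directly from P = 1 - e^(-E) and can be regenerated in a few lines. The helper name p_from_e is made up for this sketch; math.expm1 is used so that very small E values do not lose precision.

```python
import math


def p_from_e(e_value):
    """Convert an E value to a p value: P = 1 - exp(-E)."""
    return -math.expm1(-e_value)


for e in (10, 5, 2, 1, 0.1, 0.05, 0.001, 0.0001):
    print(f"{e:<8g}{p_from_e(e):.8f}")
```

For E below about 0.01 the two numbers agree to within about half a percent, which is why very small E values can be read as p values.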
42
BLAST - Report Format
43
Header
BLAST Report
Body
Footer
Bedell et al. 2003
44
Header
45
Body Graphical Overview
46
Body One-line summaries
set by -v
47
Body Alignments
set by -b; view set by -m ?
  • ALIGNMENT_VIEW - Choose how to view alignments.
  • Pairwise
  • Pairwise with dots for identities
  • Query-anchored with dots for identities
  • Query-anchored with letters for identities
  • Flat query-anchored with dots for identities
  • Flat query-anchored with letters for identities
  • Hit Table
  • The default "pairwise" view shows how each
    subject sequence aligns individually to the query
    sequence.
  • The "query-anchored" view shows how all subject
    sequences align to the query sequence.
  • For each view type, you can choose to show
    "identities" (matching residues) as letters or
    dots.

48
Alignments Views - pairwise
set by -m 0
49
Alignments Views - Query-anchored with dots for
identities
set by -m 1
50
Alignments Views - Query-anchored with letters
for identities
set by -m 2
51
Alignments Views - Hits Table
set by -m 8
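The hit table is tab-delimited, so it is the easiest report format to post-process. The sketch below assumes the usual twelve columns of this format (query id, subject id, percent identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, E value, bit score); the sample line and the field names are invented for illustration.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query_id: str
    subject_id: str
    percent_identity: float
    alignment_length: int
    mismatches: int
    gap_openings: int
    q_start: int
    q_end: int
    s_start: int
    s_end: int
    evalue: float
    bit_score: float


def parse_hit(line):
    """Parse one tab-delimited hit-table line into a Hit record."""
    f = line.rstrip("\n").split("\t")
    if len(f) != 12:
        raise ValueError(f"expected 12 columns, got {len(f)}")
    return Hit(f[0], f[1], float(f[2]), *(int(x) for x in f[3:10]),
               float(f[10]), float(f[11]))


# Hypothetical line, for illustration only.
sample = "query1\tsubj1\t98.50\t200\t3\t0\t1\t200\t51\t250\t1e-100\t365"
hit = parse_hit(sample)
print(hit.subject_id, hit.percent_identity, hit.evalue)
```

Filtering hits by E value or percent identity then becomes a one-line list comprehension over the parsed records.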
52
Footer
BLOSUM matrix
gap penalties
10.0 is the E value
Effective search space = mn = length of query x
db length
threshold score (f): 11
cut-off parameters
53
Footer (nucleotide)
No T
54
Thank You!