Title: BCB 444/544 F07 ISU Dobbs
1 BCB 444/544
- Lecture 8
- Finish Dynamic Programming
- Global vs Local Alignment
- Scoring Matrices & Alignment Statistics
- BLAST
- 8_Sept7
2 Required Reading (before lecture)
- Last week - for Lectures 4-7
  - Pairwise Sequence Alignment, Dynamic Programming, Global vs Local Alignment, Scoring Matrices, Statistics
  - Xiong Chp 3
  - Eddy, What is Dynamic Programming? 2004 Nature Biotechnol 22:909
    http://www.nature.com/nbt/journal/v22/n7/abs/nbt0704-909.html
- Wed Sept 5 - for Lecture 7 & Lab 3
  - Database Similarity Searching: BLAST (nope, more DP)
  - Chp 4 - pp 51-62
- Fri Sept - for Lecture 8 (will finish on Monday)
  - BLAST variations; BLAST vs FASTA
  - Chp 4 - pp 51-62
3 Assignments & Announcements
- Tues Sept 4 - Lab 2 Exercise Writeup due by 5 PM
  - Send via email to Pete Zaback, petez_at_iastate.edu
  - (For now, no late penalty - just send ASAP)
- Wed Sept 5 - Notes for Lecture 5 posted online
  - HW2 posted online, sent via email & handed out in class
- Fri Sept 14 - HW2 Due by 5 PM
- Fri Sept 21 - Exam 1
4 Chp 3 - Sequence Alignment
- SECTION II SEQUENCE ALIGNMENT
- Xiong Chp 3
- Pairwise Sequence Alignment
- Evolutionary Basis
- Sequence Homology versus Sequence Similarity
- Sequence Similarity versus Sequence Identity
- Methods - cont
- Scoring Matrices
- Statistical Significance of Sequence Alignment
Adapted from Brown and Caragea, 2007, with some
slides from Altman, Fernandez-Baca, Batzoglou,
Craven, Hunter, Page.
5 Methods
- Global and Local Alignment
- Alignment Algorithms
- Dot Matrix Method
- Dynamic Programming Method - cont
- Gap penalties
- DP for Global Alignment
- DP for Local Alignment
- Scoring Matrices
- Amino acid scoring matrices
- PAM
- BLOSUM
- Comparisons between PAM & BLOSUM
- Statistical Significance of Sequence Alignment
6 Dynamic Programming - 4 Steps
- 1. Define score of optimal alignment, using recursion
- 2. Initialize and fill in a DP matrix for storing optimal scores of subproblems, by solving smallest subproblems first (bottom-up approach)
- 3. Calculate score of optimal alignment(s)
- 4. Trace back through matrix to recover optimal alignment(s) that generated optimal score
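7 Step 1 - Define Score of Optimal Alignment using Recursion
Define
- $\alpha$ = Match Reward
- $\beta$ = Mismatch Penalty
- $\gamma$ = Gap penalty
Recursive definition: for $1 \le i \le N$, $1 \le j \le M$
$$S(i,j) = \max \begin{cases} S(i-1,j-1) + \sigma(x_i,y_j) \\ S(i-1,j) + \gamma \\ S(i,j-1) + \gamma \end{cases}$$
where $\sigma(x_i,y_j) = \alpha$ if $x_i = y_j$, or $\beta$ otherwise.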
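8 Step 2 - Initialize & Fill in DP Matrix for Storing Optimal Scores of Subproblems
- Construct sequence vs sequence matrix
- Fill in from 0,0 to N,M (row by row), calculating best possible score for each alignment ending at residues at i,j
[Figure: DP matrix indexed 0..N along one axis and 0..M along the other, with S(0,0) = 0 in the starting corner, S(i,j) in the interior and S(N,M) in the final corner]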
9 How do we calculate S(i,j)? i.e., Score for alignment of x1..i to y1..j?
- 1 of 3 cases gives the optimal score for this subproblem
10 Specific Example
Note: I changed sequences on this slide (to match the rest of DP example)
Scoring Consequence?
Case 1: Line up xi with yj (Match Bonus)
  x: C - T C G C | A      (columns up to i-1 | column i)
  y: C A T - T C | A      (columns up to j-1 | column j)
Case 2: Line up xi with space (Space Penalty)
  x: C - T C G C - | A    (columns up to i-1 | column i)
  y: C A T - T C A | -    (columns up to j | space)
Case 3: Line up yj with space (Space Penalty)
  x: C - T C G C A | -    (columns up to i | space)
  y: C A T - T C - | A    (columns up to j-1 | column j)
11 Ready? Fill in DP Matrix
Keep track of dependencies of scores (in a pointer matrix)
[Figure: DP matrix from S(0,0) = 0 to S(N,M); S(i,j) is computed from S(i-1,j-1) (match reward or mismatch penalty), S(i-1,j) (gap penalty) and S(i,j-1) (gap penalty)]
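12 Fill in the DP matrix !!
Top sequence (columns): C T C G C A G C
Left sequence (rows): C A T T C A C
First row (all spaces): 0 -5 -10 -15 -20 -25 -30 -35 -40
First two filled entries of row C: 10, 5
10 for match, -2 for mismatch, -5 for space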
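The fill step can be sketched in a few lines of Python. This is a minimal illustration, assuming the slide's scoring (10 for match, -2 for mismatch, -5 for space) and the two example sequences; the function name `fill_global_matrix` and its parameter names are made up for this sketch.

```python
def fill_global_matrix(top, left, match=10, mismatch=-2, space=-5):
    """Return S, where S[i][j] is the best score for aligning left[:i] with top[:j]."""
    rows, cols = len(left) + 1, len(top) + 1
    S = [[0] * cols for _ in range(rows)]
    # Boundary: aligning a prefix with nothing costs one space per residue.
    for j in range(1, cols):
        S[0][j] = j * space
    for i in range(1, rows):
        S[i][0] = i * space
    # Fill row by row; each cell takes the best of the three cases.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if left[i - 1] == top[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + pair,  # diagonal: residue with residue
                          S[i - 1][j] + space,     # vertical: gap in top sequence
                          S[i][j - 1] + space)     # horizontal: gap in left sequence
    return S


S = fill_global_matrix("CTCGCAGC", "CATTCAC")
print(S[0])              # first row: 0, -5, -10, ... -40
print(S[1][1], S[1][2])  # 10 5
print(S[-1][-1])         # 33, the optimal global score S(N,M)
```

Storing the whole matrix is what later makes the traceback possible; if only the score is needed, less memory suffices.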
13 Step 3 - Calculate Score S(N,M) of Optimal Alignment - for Global Alignment
[Figure: filled DP matrix, top sequence CTCGCAGC, left sequence CATTCAC]
10 for match, -2 for mismatch, -5 for space
15 Step 4 - Trace back through matrix to recover optimal alignment(s) that generated the optimal score
- How? "Repeat" alignment calculations in reverse order, starting from the position with the highest score and following the path, position by position, back through the matrix
- Result? Optimal alignment(s) of sequences
16 Traceback - for Global Alignment
- Start in lower right corner & trace back to upper left
- Each arrow introduces one character at end of alignment
- A horizontal move puts a gap in left sequence
- A vertical move puts a gap in top sequence
- A diagonal move uses one character from each sequence
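The traceback rules above can be sketched as a recursive walk that follows every move consistent with the stored scores. This is a minimal illustration under the lecture's scoring (10 / -2 / -5); the names `global_alignments` and `walk` are hypothetical, and enumerating all paths is only practical for short sequences like these.

```python
def global_alignments(top, left, match=10, mismatch=-2, space=-5):
    """Return (optimal score, list of (top_row, left_row) optimal alignments)."""
    rows, cols = len(left) + 1, len(top) + 1
    S = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):
        S[0][j] = j * space
    for i in range(1, rows):
        S[i][0] = i * space
        for j in range(1, cols):
            pair = match if left[i - 1] == top[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + pair, S[i - 1][j] + space, S[i][j - 1] + space)

    found = []

    def walk(i, j, top_row, left_row):
        # Reached the upper left corner: one complete alignment recovered.
        if i == 0 and j == 0:
            found.append((top_row, left_row))
            return
        if i > 0 and j > 0:
            pair = match if left[i - 1] == top[j - 1] else mismatch
            if S[i][j] == S[i - 1][j - 1] + pair:      # diagonal move
                walk(i - 1, j - 1, top[j - 1] + top_row, left[i - 1] + left_row)
        if j > 0 and S[i][j] == S[i][j - 1] + space:   # horizontal: gap in left sequence
            walk(i, j - 1, top[j - 1] + top_row, "-" + left_row)
        if i > 0 and S[i][j] == S[i - 1][j] + space:   # vertical: gap in top sequence
            walk(i - 1, j, "-" + top_row, left[i - 1] + left_row)

    walk(rows - 1, cols - 1, "", "")
    return S[-1][-1], found


score, alignments = global_alignments("CTCGCAGC", "CATTCAC")
print(score)
for top_row, left_row in alignments:
    print(top_row)
    print(left_row)
    print()
```

Each recursive call prepends one column, which mirrors "each arrow introduces one character at end of alignment" when the path is read from the lower right corner.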
17 Traceback to Recover Alignment
[Figure: DP matrix with traceback arrows, top sequence CTCGCAGC, left sequence CATTCAC]
Can have more than 1 optimal alignment; this example has 2
18 Traceback to Recover Alignment
[Figure: DP matrix with traceback arrows, top sequence CTCGCAGC, left sequence CATTCAC]
Where did red arrows come from?
19 Traceback to Recover Alignment
[Figure: DP matrix with traceback arrows, top sequence CTCGCAGC, left sequence CATTCAC]
10 for match, -2 for mismatch, -5 for space
- Where did 33 come from? Match = 10, so 33 - 10 = 23
- Must have come from diagonal
- Where did 23 come from? (Not a match)
- Left? 28 - 5 = 23; Diag? 13 - 2 = 11; Top? 8 - 5 = 3. Only the left neighbor gives 23
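The check on this slide (which neighbor could have produced a given score?) can be written as a small helper. This is a sketch with a hypothetical function name, using the scoring and the neighbor values quoted on the slide.

```python
def predecessors(score, diag, top, left, is_match, match=10, mismatch=-2, space=-5):
    """List the moves whose arithmetic reproduces `score` from a neighboring cell."""
    moves = []
    if diag + (match if is_match else mismatch) == score:
        moves.append("diagonal")
    if left + space == score:
        moves.append("left")
    if top + space == score:
        moves.append("top")
    return moves


# The cell holding 23 is not a match; its neighbors are 28 (left), 13 (diag), 8 (top).
print(predecessors(23, diag=13, top=8, left=28, is_match=False))  # ['left']
```

When the returned list has more than one entry, the traceback branches and there is more than one optimal alignment.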
20 Traceback to Recover Alignment
[Figure: DP matrix with traceback arrows, top sequence CTCGCAGC, left sequence CATTCAC]
10 for match, -2 for mismatch, -5 for space
- Where did 8 come from?
- Two possibilities: 13 - 5 = 8 or 10 - 2 = 8
- Then, follow both paths
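21 Traceback to Recover Alignment
[Figure: DP matrix with traceback path 1, top sequence CTCGCAGC, left sequence CATTCAC]
Path 1, top sequence with left sequence, read from upper left to lower right:
C with C
- with A
T with T
C with -
G with T
C with C
A with A
G with -
C with C
Great - but what are the alignments?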
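22 Traceback to Recover Alignment
[Figure: DP matrix with traceback path 2, top sequence CTCGCAGC, left sequence CATTCAC]
Path 2, top sequence with left sequence, read from upper left to lower right:
C with C
- with A
T with T
C with T
G with -
C with C
A with A
G with -
C with C
Great - but what are the alignments?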
24 What are the 2 Global Alignments with Optimal Score 33?
Top:  C T C G C A G C
Left: C A T T C A C
Alignment 1:
  C - T C G C A G C
  C A T - T C A - C
Alignment 2:
  C - T C G C A G C
  C A T T - C A - C
- A horizontal move puts a gap in left sequence
- A vertical move puts a gap in top sequence
- A diagonal move uses one character from each sequence
25 Check Traceback?
[Figure: DP matrix with traceback arrows, top sequence CTCGCAGC, left sequence CATTCAC]
- A horizontal move puts a gap in left sequence
- A vertical move puts a gap in top sequence
- A diagonal move uses one character from each
sequence
26 Local Alignment Motivation
- To "ignore" stretches of non-coding DNA
  - Non-coding regions (if "non-functional") are more likely to contain mutations than coding regions
  - Local alignment between two protein-encoding sequences is likely to be between two exons
- To locate protein domains or motifs
  - Proteins with similar structures and/or similar functions but from different species (for example) often exhibit local sequence similarities
  - Local sequence similarities may indicate "functional modules"
Non-coding = "not encoding protein"
Exons = "protein-encoding" parts of genes, vs Introns = "intervening sequences" - segments of eukaryotic genes that "interrupt" exons
Introns are transcribed into RNA, but are later removed by RNA processing & are not translated into protein
27 Local Alignment Example
X: G G T C T G A G
Y: A A A C G A
Match = 2; Mismatch or space = -1
Best local alignment:
  C T G A
  C - G A
Score = 2 - 1 + 2 + 2 = 5
28 Local Alignment Algorithm
- S(i,j) = Score for optimally aligning a suffix of x1..i with a suffix of y1..j
- Initialize top row & leftmost column of matrix with "0"
- Recall, for Global Alignment:
- S(i,j) = Score for optimally aligning the prefix x1..i with the prefix y1..j
- Initialize top row & leftmost column with cumulative gap penalties
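A minimal local alignment sketch follows. It assumes the standard Smith-Waterman recurrence, which adds one thing beyond the slide text: each cell also takes the maximum with 0, so an alignment can restart anywhere. The function name `local_alignment` is hypothetical, and ties during traceback are broken in a fixed order (diagonal, then vertical, then horizontal), so only one alignment is returned per best-scoring cell.

```python
def local_alignment(x, y, match=2, mismatch=-1, space=-1):
    """Return (best score, cells holding it, one traced-back alignment per cell)."""
    rows, cols = len(x) + 1, len(y) + 1
    S = [[0] * cols for _ in range(rows)]  # top row and leftmost column stay 0
    best, cells = 0, []
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(0,
                          S[i - 1][j - 1] + pair,
                          S[i - 1][j] + space,
                          S[i][j - 1] + space)
            if S[i][j] > best:
                best, cells = S[i][j], [(i, j)]
            elif S[i][j] == best and best > 0:
                cells.append((i, j))

    def trace(i, j):
        # Walk back from a best-scoring cell until the score reaches 0.
        ax, ay = "", ""
        while S[i][j] > 0:
            pair = match if x[i - 1] == y[j - 1] else mismatch
            if S[i][j] == S[i - 1][j - 1] + pair:
                ax, ay, i, j = x[i - 1] + ax, y[j - 1] + ay, i - 1, j - 1
            elif S[i][j] == S[i - 1][j] + space:
                ax, ay, i = x[i - 1] + ax, "-" + ay, i - 1
            else:
                ax, ay, j = "-" + ax, y[j - 1] + ay, j - 1
        return ax, ay

    return best, cells, [trace(i, j) for i, j in cells]


best, cells, alignments = local_alignment("GGTCTGAG", "AAACGA")
print(best, alignments)

# Same routine with 1 for a match, -1 for a mismatch, -5 for a space.
best2, cells2, alignments2 = local_alignment("CTCGCAGC", "CATTCAC", 1, -1, -5)
print(best2, len(cells2), alignments2)
```

Unlike the global case, the traceback starts at the highest-scoring cell anywhere in the matrix and stops at the first 0 rather than at the upper left corner.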
29 Filling in DP Matrix for Local Alignment
[Figure: DP matrix for local alignment, top sequence CTCGCAGC, left sequence CATTCAC]
1 for a match, -1 for a mismatch, -5 for a space
30 Traceback - for Local Alignment
[Figure: DP matrix for local alignment with traceback arrows, top sequence CTCGCAGC, left sequence CATTCAC]
1 for a match, -1 for a mismatch, -5 for a space
31 What are the 4 Local Alignments with Optimal Score 2?
Top:  C T C G C A G C
Left: C A T T C A C
32 Some Results re Alignment Algorithms (for ComS, CprE & Math types!)
- Most pairwise sequence alignment problems can be solved in O(mn) time
- Space requirement can be reduced to O(m+n), while keeping run-time at O(mn) [Myers88]
- Highly similar sequences can be aligned in O(dn) time, where d measures the distance between the sequences [Landau86]
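The space saving is easiest to see for the score alone: each row of the DP matrix depends only on the row above it, so two rows are enough. The sketch below shows only that observation, with a hypothetical function name and the lecture's global scoring as defaults. It is not the cited linear-space alignment method, which also recovers the alignment itself using divide and conquer.

```python
def global_score_two_rows(x, y, match=10, mismatch=-2, space=-5):
    """Optimal global alignment score using O(min(m, n)) memory (score only)."""
    if len(y) > len(x):
        x, y = y, x  # keep the shorter sequence along the row
    prev = [j * space for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        curr = [i * space] + [0] * len(y)
        for j in range(1, len(y) + 1):
            pair = match if x[i - 1] == y[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + pair, prev[j] + space, curr[j - 1] + space)
        prev = curr  # the older row is no longer needed
    return prev[-1]


print(global_score_two_rows("CTCGCAGC", "CATTCAC"))  # 33
```

Run time is unchanged, since every cell is still computed once.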
33 Affine Gap Penalty Functions
- Affine Gap Penalties = "Differential" Gap Penalties, used to reflect cost differences between opening a gap and extending an existing gap
- Total Gap Penalty is linear function of gap length:
- $W = \gamma + \delta \times (k - 1)$
- where $\gamma$ = gap opening penalty
- $\delta$ = gap extension penalty
- $k$ = length of gap
- Sometimes a Constant Gap Penalty is used, but it is usually less realistic than the Affine Gap Penalty
Can also be solved in O(nm) time using DP
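As a quick numeric illustration of the formula, the sketch below evaluates the total penalty for a few gap lengths. The function name and the values 11 (opening) and 1 (extension) are assumptions chosen only for this example; the comparison against charging the opening cost for every gap position is likewise just for illustration.

```python
def affine_gap_penalty(k, gap_open, gap_extend):
    """Total penalty W = gap_open + gap_extend * (k - 1) for a gap of length k >= 1."""
    if k < 1:
        raise ValueError("gap length k must be at least 1")
    return gap_open + gap_extend * (k - 1)


# Hypothetical penalties: 11 to open a gap, 1 for each additional position.
for k in (1, 2, 5):
    affine = affine_gap_penalty(k, gap_open=11, gap_extend=1)
    per_position = 11 * k  # what the gap would cost if every position paid 11
    print(k, affine, per_position)
```

With these numbers one gap of length 5 costs 15, while five separate gaps of length 1 cost 55, which is the cost difference the affine scheme is meant to capture.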
34 Methods
- Global and Local Alignment
- Alignment Algorithms
- Dot Matrix Method
- Dynamic Programming Method - cont
- Gap penalties
- DP for Global Alignment
- DP for Local Alignment
- Scoring Matrices
- Amino acid scoring matrices
- PAM
- BLOSUM
- Comparisons between PAM & BLOSUM
- Statistical Significance of Sequence Alignment
35 "Scoring" or "Substitution" Matrices
- 2 Major types for Amino Acids: PAM & BLOSUM
- PAM = Point Accepted Mutation
  - relies on "evolutionary model" based on observed differences in alignments of closely related proteins
- BLOSUM = BLOck SUbstitution Matrix
  - based on aa substitutions observed in blocks of conserved sequences within evolutionarily divergent proteins
36 PAM Matrix
- PAM = Point Accepted Mutation
- relies on "evolutionary model" based on observed differences in closely related proteins
- Model includes defined rate for each type of sequence change
- Suffix number (n) reflects amount of "time" passed: rate of expected mutation if n% of amino acids had changed
- PAM1 - for less divergent sequences (shorter time)
- PAM250 - for more divergent sequences (longer time)
37 BLOSUM Matrix
- BLOSUM = BLOck SUbstitution Matrix
- based on aa substitutions observed in blocks of conserved sequences within evolutionarily divergent proteins
- Doesn't rely on a specific evolutionary model
- Suffix number (n) reflects expected similarity: average % aa identity in the MSA from which the matrix was generated
- BLOSUM45 - for more divergent sequences
- BLOSUM62 - for less divergent sequences
38 PAM250 vs BLOSUM62
- See Text
- Fig 3.5 = PAM250
- Fig 3.6 = BLOSUM62
- Usually only 1/2 of matrix is displayed (it is symmetric)
- Here, s(a,b) corresponds to score of aligning character a with character b
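Because the matrix is symmetric, storing one triangle is enough to look up any s(a,b). The sketch below shows the idea with a three-residue toy table; the scores are illustrative placeholders, not values copied from Fig 3.5 or Fig 3.6, and the names `HALF_MATRIX` and `s` are made up for this example.

```python
# Toy lower-triangle scores for three residues (illustrative values only).
HALF_MATRIX = {
    ("A", "A"): 4,
    ("R", "A"): -1, ("R", "R"): 5,
    ("N", "A"): -2, ("N", "R"): 0, ("N", "N"): 6,
}


def s(a, b, half=HALF_MATRIX):
    """Score of aligning residue a with residue b, whichever order is stored."""
    return half[(a, b)] if (a, b) in half else half[(b, a)]


print(s("A", "R"), s("R", "A"))                     # same score in either order
print(sum(s(a, b) for a, b in zip("ARN", "RRA")))   # score of an ungapped alignment
```

In a DP alignment, a lookup like `s(a, b)` takes the place of the fixed match reward and mismatch penalty.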
39 Which is Better? PAM or BLOSUM
- PAM matrices
  - derived from evolutionary model
  - often used in reconstructing phylogenetic trees - but not very good for highly divergent sequences
- BLOSUM matrices
  - based on direct observations
  - more "realistic" - and outperform PAM matrices in terms of accuracy in local alignment
40 Which Type of Matrix Should You Use?
- Several other types of matrices available
  - Gonnet & Jones-Taylor-Thornton
  - very robust in tree construction
- "Best" matrix depends on task
  - different matrices for different applications
- ADVICE: if unsure, try several different matrices & choose the one that gives best alignment result
41 Sequence Alignment Statistics
- Distribution of similarity scores in sequence alignment is not a simple "normal" distribution
- "Gumbel extreme value distribution" - a highly skewed distribution with a long tail
42 How to Assess Statistical Significance of an Alignment?
- Compare score of an alignment with distribution of scores of alignments for many 'randomized' (shuffled) versions of the original sequence
- If score is in extreme margin, then unlikely due to random chance
- P-value = probability that original alignment is due to random chance (lower P is better)
- P = 10^-5 to 10^-50: sequences have clear homology
- P > 10^-1: no better than random
Check out PRSS (Probability of Random Shuffles): http://www.ch.embnet.org/software/PRSS_form.html
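The shuffling idea can be sketched directly: score the real pair, score many shuffled versions of one sequence, and see how often a shuffle does at least as well. The code below is a rough illustration with hypothetical function names, a simple local alignment score (2 / -1 / -1) and a plain empirical fraction; a dedicated tool such as PRSS may estimate significance differently, so this is not a stand-in for it.

```python
import random


def local_score(x, y, match=2, mismatch=-1, space=-1):
    """Best local alignment score, keeping only two rows of the DP matrix."""
    prev = [0] * (len(y) + 1)
    best = 0
    for i in range(1, len(x) + 1):
        curr = [0] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            pair = match if x[i - 1] == y[j - 1] else mismatch
            curr[j] = max(0, prev[j - 1] + pair, prev[j] + space, curr[j - 1] + space)
            best = max(best, curr[j])
        prev = curr
    return best


def shuffle_p_value(x, y, n_shuffles=200, seed=0):
    """Return (observed score, fraction of shuffles scoring at least as high)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = local_score(x, y)
    letters = list(y)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)   # same composition, randomized order
        if local_score(x, "".join(letters)) >= observed:
            hits += 1
    # Add 1 to both counts so the estimate is never exactly 0.
    return observed, (hits + 1) / (n_shuffles + 1)


observed, p = shuffle_p_value("GGTCTGAG", "AAACGA")
print(observed, p)
```

Shuffling keeps the residue composition fixed, so the comparison asks whether the order of residues, not just their frequencies, explains the score. Very short sequences like these rarely give a small P.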
43 Chp 4 - Database Similarity Searching
- SECTION II SEQUENCE ALIGNMENT
- Xiong Chp 4
- Database Similarity Searching
- Unique Requirements of Database Searching
- Heuristic Database Searching
- Basic Local Alignment Search Tool (BLAST)
- FASTA
- Comparison of FASTA and BLAST
- Database Searching with Smith-Waterman Method
44 Exhaustive vs Heuristic Methods
- Exhaustive - tests every possible solution
  - guaranteed to give best answer (identifies optimal solution)
  - can be very time/space intensive!
  - e.g., Dynamic Programming, as in Smith-Waterman algorithm
- Heuristic - does NOT test every possibility
  - no guarantee that answer is best (but often can identify optimal solution)
  - sacrifices accuracy (potentially) for speed
  - uses "rules of thumb" or "shortcuts"
  - e.g., BLAST & FASTA
45 Today's Lab focus on BLAST: Basic Local Alignment Search Tool
- STEPS
- Create list of every possible "word" (e.g., 3-11 letters) from query sequence
- Search database to identify sequences that contain matching words
- Score match of word with sequence, using a substitution matrix
- Extend match (seed) in both directions, while calculating alignment score at each step
- Continue extension until score drops below a threshold (due to mismatches)
- Contiguous aligned segment pair (no gaps) is called a High Scoring Segment Pair (HSP)
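The steps above can be sketched as a toy seed-and-extend search. This is a simplified illustration, not BLAST itself: it uses exact word matches and a plain match/mismatch score (1 / -1) instead of a substitution matrix, and it stops extending once the score falls more than `drop` below the best seen. All function and parameter names here are hypothetical.

```python
def query_words(query, k):
    """Map each k-letter word in the query to the positions where it starts."""
    index = {}
    for i in range(len(query) - k + 1):
        index.setdefault(query[i:i + k], []).append(i)
    return index


def extend_seed(query, subject, qi, si, k, match=1, mismatch=-1, drop=2):
    """Extend an exact seed in both directions without gaps; return (score, q_seg, s_seg)."""
    score = best = k * match
    left = right = 0  # letters added on each side at the best score so far
    step = 0
    while qi + k + step < len(query) and si + k + step < len(subject):
        score += match if query[qi + k + step] == subject[si + k + step] else mismatch
        step += 1
        if score > best:
            best, right = score, step
        elif best - score > drop:
            break
    score, step = best, 0
    while qi - step - 1 >= 0 and si - step - 1 >= 0:
        score += match if query[qi - step - 1] == subject[si - step - 1] else mismatch
        step += 1
        if score > best:
            best, left = score, step
        elif best - score > drop:
            break
    return best, query[qi - left:qi + k + right], subject[si - left:si + k + right]


def best_hsp(query, subject, k=3):
    """Scan the subject for query words and keep the highest scoring extension."""
    index = query_words(query, k)
    best = (0, "", "")
    for si in range(len(subject) - k + 1):
        for qi in index.get(subject[si:si + k], []):
            hsp = extend_seed(query, subject, qi, si, k)
            if hsp[0] > best[0]:
                best = hsp
    return best


print(query_words("ACGTACGT", 3))
print(best_hsp("ACGTACGT", "TTACGTACGTTT"))
```

Only diagonals that contain a word hit are ever examined, which is where the speed-up over filling a full DP matrix comes from, and also why an alignment with no seed can be missed.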
46 Lab3 focus on BLAST: Basic Local Alignment Search Tool
- BLAST Results?
- Original version of BLAST?
  - List of HSPs = Maximum Scoring Pairs
- More recent, improved version of BLAST?
  - Allows gaps: Gapped Alignment
  - How? Allows score to drop below threshold (but only temporarily)
47 BLAST - a few details
- Developed by Stephen Altschul at NCBI in 1990
- Word length?
- Typically 3 aa for protein sequence
- 11 nt for DNA sequence
- Substitution matrix?
- Default is BLOSUM62
- Can change under Algorithm Parameters
- Choose other BLOSUM or PAM matrices
- Stop-Extension Threshold?
- Typically 22 for proteins
- 20 for DNA
48 BLAST - Statistical Significance?
- 1. E-value: E = m x n x P
- m = total number of residues in database
- n = number of residues in query sequence
- P = probability that an HSP is result of random chance
- lower E-value, less likely to result from random chance, thus higher significance
- 2. Bit Score S'
- normalized score, to account for differences in sequence length & size of database
- 3. Low Complexity Masking
- remove repeats that confound scoring
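The E-value formula on this slide is a direct product, so a one-line function is enough to see how database size changes the result. The numbers below are made up for illustration, and the function name is hypothetical.

```python
def e_value(m, n, p):
    """E = m x n x P: database residues times query residues times chance probability."""
    return m * n * p


# Hypothetical inputs: 1,000,000 database residues, 100-residue query, P = 1e-10.
print(e_value(1_000_000, 100, 1e-10))    # about 0.01
# The same hit against a database 100 times larger has a 100 times larger E-value.
print(e_value(100_000_000, 100, 1e-10))  # about 1
```

This is why the same alignment can look significant in a small database and unremarkable in a large one.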
49 BLAST - a Family of Programs: Several different BLAST "flavors"
- BLASTN -
- BLASTP -
- BLASTX -
- TBLASTN -
- TBLASTX -