Title: BCB 444/544
1 BCB 444/544
- Lecture 6
- Finish Dynamic Programming
- Scoring Matrices
- Alignment Statistics
- 6_Aug31
2 Required Reading (before lecture)
- Mon Aug 27 - for Lecture 4
- Pairwise Sequence Alignment
- Chp 3 - pp 31-41
- Wed Aug 29 - for Lecture 5
- Dynamic Programming
- Eddy, "What is Dynamic Programming?" 2004, Nature
Biotechnol 22:909
- http://www.nature.com/nbt/journal/v22/n7/abs/nbt0704-909.html
- Thurs Aug 30 - Lab 2
- Databases, ISU Resources & Pairwise Sequence
Alignment
- Fri Aug 31 - for Lecture 6
- Scoring Matrices & Alignment Statistics
- Chp 3 - pp 41-49
3 Announcements
- Fri Aug 31 - Revised notes for Lecture 5 posted
online
- Changes? Mainly re-ordering, symbols, color
"coding"
- Mon Sept 3 - NO CLASSES AT ISU (Labor Day)!!
Enjoy!!
- Tues Sept 4 - Lab 2 Exercise Writeup Due by 5 PM
(or sooner!)
- Send via email to Pete Zaback,
petez_at_iastate.edu
- (HW2 assignment will be posted online)
- Fri Sept 14 - HW2 Due by 5 PM (or sooner!)
- Fri Sept 21 - Exam 1
4 Chp 3 - Sequence Alignment
- SECTION II: SEQUENCE ALIGNMENT
- Xiong: Chp 3
- Pairwise Sequence Alignment
- Evolutionary Basis
- Sequence Homology versus Sequence Similarity
- Sequence Similarity versus Sequence Identity
- Methods - cont
- Scoring Matrices
- Statistical Significance of Sequence Alignment
Adapted from Brown and Caragea, 2007, with some
slides from Altman, Fernandez-Baca, Batzoglou,
Craven, Hunter, Page.
5 Methods
- Global and Local Alignment
- Alignment Algorithms
- Dot Matrix Method
- Dynamic Programming Method - cont
- Gap penalties
- DP for Global Alignment
- DP for Local Alignment
- Scoring Matrices
- Amino acid scoring matrices
- PAM
- BLOSUM
- Comparisons between PAM & BLOSUM
- Statistical Significance of Sequence Alignment
6 Sequence Homology vs Similarity
- Homologous sequences - sequences that share a
common evolutionary ancestry
- Similar sequences - sequences that have a high
percentage of aligned residues with similar
physicochemical properties
- (e.g., size, hydrophobicity, charge)
- IMPORTANT
- Sequence homology
- An inference about a common ancestral
relationship, drawn when two sequences share a
high enough degree of sequence similarity
- Homology is qualitative
- Sequence similarity
- The direct result of observation from a sequence
alignment
- Similarity is quantitative; can be described
using percentages
7 Goal of Sequence Alignment
- Find the best pairing of 2 sequences, such that
there is maximum correspondence between residues
- DNA: 4 letter alphabet (+ gap)
- TTGACAC
- TTTACAC
- Proteins: 20 letter alphabet (+ gap)
- RKVA-GMA
- RKIAVAMA
8 Statement of Problem
- Given:
- 2 sequences
- Scoring system for evaluating match (or mismatch)
of two characters
- Penalty function for gaps in sequences
- Find: Optimal pairing of sequences that
- Retains the order of characters
- Introduces gaps where needed
- Maximizes total score
9 Avoiding Random Alignments with a Scoring
Function
- Introducing too many gaps generates nonsense
alignments
- s--e-----qu---en--ce
- sometimesquipsentice
- Need to distinguish between alignments that occur
due to homology and those that occur by chance
- Define a scoring function that rewards matches
(+) and penalizes mismatches (-) and gaps (-)
Scoring Function (S), e.g.
Match: $\alpha = 1$, Mismatch: $\beta = 1$, Gap: $\gamma = 0$
$S = \alpha(\text{matches}) - \beta(\text{mismatches}) - \gamma(\text{gaps})$
Note: I changed symbols & colors on this slide!
10 Not All Mismatches are the Same
- Some amino acids are more "exchangeable" than
others (physicochemical properties are similar)
- e.g., Ser & Thr are more similar than Trp & Ala
- Substitution matrix can be used to introduce
"mismatch costs" for handling different types of
substitutions
- Mismatch costs are not usually used in aligning
DNA or RNA sequences, because no substitution is
"better" than any other (in general)
11 Substitution Matrix
- s(a,b) corresponds to score of aligning character
a with character b
- Match scores are often calculated based on
frequency of mutations in very similar sequences
- (more details later)
12 Global vs Local Alignment
- Global alignment
- Finds best possible alignment across entire
length of 2 sequences
- Aligned sequences assumed to be generally similar
over entire length
- Local alignment
- Finds local regions with highest similarity
between 2 sequences
- Aligns these without regard for rest of sequence
- Sequences are not assumed to be similar over
entire length
13 Global vs Local Alignment - example
1 = CTGTCGCTGCACG
2 = TGCCGTG
Which is better?
14 Global vs Local Alignment: Which should be used
when?
- It is critical to choose correct method!
- Global Alignment vs Local Alignment?
- Shout out the answers!! Which should we use for?
- Searching for conserved motifs in DNA or protein
sequences?
- Aligning two closely related sequences with
similar lengths?
- Aligning highly divergent sequences?
- Generating an extended alignment of closely
related sequences?
- Generating an extended alignment of closely
related sequences with very different lengths?
- Hmmm - we'll work on that
- Excellent!
15 Alignment Algorithms
- 3 major methods for pairwise sequence alignment
- Dot matrix analysis
- Dynamic programming
- Word or k-tuple methods (later, in Chp 4)
16 Dot Matrix Method (Dot Plots)
- Place 1 sequence along top row of matrix
- Place 2nd sequence along left column of matrix
- Plot a dot each time there is a match between an
element of row sequence and an element of column
sequence
- For proteins, usually use more sophisticated
scoring schemes than "identical match"
- Diagonal lines indicate areas of match
- Contiguous diagonal lines reveal alignment;
"breaks" = gaps (indels)
17 Interpretation of Dot Plots
- When comparing 2 sequences
- Diagonal lines of dots indicate regions of
similarity between 2 sequences
- Reverse diagonals (perpendicular to diagonal)
indicate inversions
- What do such patterns mean when comparing a
sequence with itself (or its reverse complement)?
- e.g., Reverse diagonals crossing diagonals (X's)
indicate palindromes
Exploring Dot Plots
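A minimal Python sketch of the dot matrix construction described on the two slides above, assuming the simplest "identical match" rule with no sliding window or threshold. The function names dot_plot and render are illustrative only and do not come from the lecture.

```python
def dot_plot(top: str, left: str) -> list[str]:
    """Return one text row per residue of `left`.

    A '*' marks a position where the residue in `left` is identical to the
    residue in `top`; '.' marks everything else.
    """
    return ["".join("*" if a == b else "." for b in top) for a in left]


def render(top: str, left: str) -> str:
    """Format the dot plot with `top` as header row and `left` as first column."""
    rows = dot_plot(top, left)
    lines = ["  " + top]
    lines += [f"{a} {row}" for a, row in zip(left, rows)]
    return "\n".join(lines)


if __name__ == "__main__":
    # The two DNA sequences from the "Goal of Sequence Alignment" slide.
    print(render("TTGACAC", "TTTACAC"))
```

Runs of '*' along a diagonal correspond to the "diagonal lines" of the slides; real dot plot tools usually add a window size and a match threshold to suppress noise, which this sketch leaves out.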
18 Dynamic Programming
For Pairwise sequence alignment
Idea: Display one sequence above another with
spaces inserted in both to reveal similarity
C A T - T C A - C
C - T C G C A G C
19 Global Alignment Scoring
CTGTCG-CTGCACG
-TGC-CG-TG----
Reward for matches: $\alpha$
Mismatch penalty: $\beta$
Space/gap penalty: $\gamma$
$\text{Score} = \alpha w - \beta x - \gamma y$
w = number of matches, x = number of mismatches,
y = number of spaces
Note: I changed symbols & colors on this slide!
20 Global Alignment Scoring
Reward for matches: 10
Mismatch penalty: -2
Space/gap penalty: -5
  C   T   G   T   C   G   -   C   T   G   C
  -   T   G   C   -   C   G   -   T   G   -
 -5  10  10  -2  -5  -2  -5  -5  10  10  -5
Total = 11
Note: I changed symbols & colors on this slide!
We could have done better!!
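The column-by-column scoring on this slide can be sketched in Python. This is an illustrative helper, not code from the lecture: the name score_alignment is an assumption, and the defaults are the slide's values (match 10, mismatch -2, space -5), each added directly to the total.

```python
def score_alignment(row_x: str, row_y: str, match: int = 10,
                    mismatch: int = -2, gap: int = -5) -> int:
    """Sum per-column scores of an existing alignment ('-' marks a space)."""
    if len(row_x) != len(row_y):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for a, b in zip(row_x, row_y):
        if a == "-" or b == "-":
            total += gap        # column containing a space
        elif a == b:
            total += match      # identical characters
        else:
            total += mismatch   # two different characters
    return total


if __name__ == "__main__":
    # The 11-column alignment scored on the slide.
    print(score_alignment("CTGTCG-CTGC", "-TGC-CG-TG-"))  # 11
```

This only evaluates a given alignment. Finding the alignment that maximizes the score is what the dynamic programming algorithms on the following slides are for.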
21 Alignment Algorithms
- Global: Needleman-Wunsch
- Local: Smith-Waterman
- Both NW and SW use dynamic programming
- Variations
- Gap penalty functions
- Scoring matrices
22 Dynamic Programming - Key Idea
- The score of the best possible alignment that
ends at a given pair of positions (i, j) is equal
to
- the score of best alignment ending just
previous to those two positions (i.e., ending at
i-1, j-1)
- PLUS
- the score for aligning x_i and y_j
23 Global Alignment: DP Problem Formulation
Notations
- Given two sequences (strings)
- $X = x_1 x_2 \ldots x_N$ of length N (e.g., x = AGC, N = 3)
- $Y = y_1 y_2 \ldots y_M$ of length M (e.g., y = AAAC, M = 4)
- Construct a matrix with (N+1) x (M+1) elements,
where
- S(i,j) = Score of best alignment of
$x_{1..i} = x_1 x_2 \ldots x_i$ with $y_{1..j} = y_1 y_2 \ldots y_j$
Which means: Score of best alignment of a prefix
of X and a prefix of Y
24 Dynamic Programming - 4 Steps
- Define score of optimum alignment, using
recursion
- Initialize and fill in a DP matrix for storing
optimal scores of subproblems, by solving
smallest subproblems first (bottom-up approach)
- Calculate score of optimum alignment(s)
- Trace back through matrix to recover optimum
alignment(s) that generated optimal score
25 1 - Define Score of Optimum Alignment using
Recursion
Define
Initial conditions
Recursive definition: For 1 ≤ i ≤ N, 1 ≤ j ≤ M
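The formulas for this slide did not survive in the text. A standard Needleman-Wunsch formulation with a linear gap penalty, consistent with the S(i,j) notation of slide 23, would read as follows. The symbols $s(x_i, y_j)$ (substitution score) and $\gamma > 0$ (penalty per space) are assumptions chosen here and may differ from the symbols on the original slide.

```latex
% Initial conditions
S(0,0) = 0, \qquad S(i,0) = -i\,\gamma, \qquad S(0,j) = -j\,\gamma

% Recursive definition, for 1 \le i \le N, 1 \le j \le M
S(i,j) = \max
\begin{cases}
S(i-1,\,j-1) + s(x_i, y_j) & \text{align } x_i \text{ with } y_j \\
S(i-1,\,j) - \gamma        & \text{align } x_i \text{ with a space} \\
S(i,\,j-1) - \gamma        & \text{align } y_j \text{ with a space}
\end{cases}
```

With the lecture's example values (match 10, mismatch -2, $\gamma = 5$), the first cell for two sequences that both start with C is $S(1,1) = \max(0 + 10,\ -5 - 5,\ -5 - 5) = 10$.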
26 2 - Initialize & Fill in DP Matrix for Storing
Optimal Scores of Subproblems
- Construct sequence vs sequence matrix
Recursion
Initialization
27 2 - cont: Fill in DP Matrix
- Fill in from 0,0 to N,M (row by row),
calculating best possible score for each
alignment including residues at i,j
- Keep track of dependencies of scores (in a
pointer matrix).
28 3 - Calculate Score S(N,M) of Optimum Alignment
- for Global Alignment
- What happens in last step in alignment of x1..i
to y1..j?
- 1 of 3 cases applies
29 Example
30 Fill in the matrix
       λ    C    T    C    G    C    A    G    C
  λ    0   -5  -10  -15  -20  -25  -30  -35  -40
  C        10    5
  A
  T
  T
  C
  A
  C
10 for match, -2 for mismatch, -5 for space
31 Calculate score of optimum alignment
       λ    C    T    C    G    C    A    G    C
  λ    0   -5  -10  -15  -20  -25  -30  -35  -40
  C   -5   10    5    0   -5  -10  -15  -20  -25
  A  -10    5    8    3   -2   -7    0   -5  -10
  T  -15    0   15   10    5    0   -5   -2   -7
  T  -20   -5   10   13    8    3   -2   -7   -4
  C  -25  -10    5   20   15   18   13    8    3
  A  -30  -15    0   15   18   13   28   23   18
  C  -35  -20   -5   10   13   28   23   26   33
10 for match, -2 for mismatch, -5 for space
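The fill step can be reproduced with a short Python sketch. It assumes the standard Needleman-Wunsch recurrence with a linear gap penalty and the slide's scores (10, -2, -5). The function name nw_matrix is illustrative, and the pointer matrix mentioned on slide 27 is left out to keep the sketch short.

```python
def nw_matrix(x: str, y: str, match: int = 10,
              mismatch: int = -2, gap: int = -5) -> list[list[int]]:
    """Fill the (N+1) x (M+1) global alignment score matrix row by row."""
    n, m = len(x), len(y)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    # Initialization: aligning a prefix against nothing costs one gap per residue.
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap
    # Each cell takes the best of the three possible last columns.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sub,   # x_i aligned with y_j
                          S[i - 1][j] + gap,       # x_i aligned with a space
                          S[i][j - 1] + gap)       # y_j aligned with a space
    return S


if __name__ == "__main__":
    # Rows: CATTCAC (left sequence); columns: CTCGCAGC (top sequence).
    for row in nw_matrix("CATTCAC", "CTCGCAGC"):
        print(" ".join(f"{v:4d}" for v in row))
```

The bottom-right entry, S(N,M), is the score of the optimum global alignment (33 for this example).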
32 4 - Trace back through matrix to recover optimum
alignment(s) that generated the optimal score
- How? "Repeat" alignment calculations in reverse
order, starting from position with highest
score and following path, position by position,
back through matrix
- Result? Optimal alignment(s) of sequences
33 Traceback - for Global Alignment
- Start in lower right corner & trace back to upper
left
- Each arrow introduces one character at end of
sequence alignment
- A horizontal move puts a gap in left sequence
- A vertical move puts a gap in top sequence
- A diagonal move uses one character from each
sequence
34 Traceback to Recover Alignment
       λ    C    T    C    G    C    A    G    C
  λ    0   -5  -10  -15  -20  -25  -30  -35  -40
  C   -5   10    5    0   -5  -10  -15  -20  -25
  A  -10    5    8    3   -2   -7    0   -5  -10
  T  -15    0   15   10    5    0   -5   -2   -7
  T  -20   -5   10   13    8    3   -2   -7   -4
  C  -25  -10    5   20   15   18   13    8    3
  A  -30  -15    0   15   18   13   28   23   18
  C  -35  -20   -5   10   13   28   23   26   33
Can have >1 optimum alignment; this example has 2
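A hedged Python sketch of fill plus traceback for global alignment. Instead of storing a pointer matrix, it re-derives each move from the scores, and when several moves are optimal it prefers the diagonal, so it returns only one of the optimum alignments. The function name needleman_wunsch and this tie-breaking rule are choices made here, not taken from the lecture.

```python
def needleman_wunsch(x: str, y: str, match: int = 10,
                     mismatch: int = -2, gap: int = -5):
    """Return (score, aligned_x, aligned_y) for one optimal global alignment."""
    n, m = len(x), len(y)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sub,
                          S[i - 1][j] + gap,
                          S[i][j - 1] + gap)

    # Traceback: start in the lower right corner, stop in the upper left.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if x[i - 1] == y[j - 1] else mismatch
            if S[i][j] == S[i - 1][j - 1] + sub:      # diagonal move
                ax.append(x[i - 1]); ay.append(y[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and S[i][j] == S[i - 1][j] + gap:    # vertical move: gap in top (y)
            ax.append(x[i - 1]); ay.append("-")
            i -= 1
        else:                                         # horizontal move: gap in left (x)
            ax.append("-"); ay.append(y[j - 1])
            j -= 1
    return S[n][m], "".join(reversed(ax)), "".join(reversed(ay))


if __name__ == "__main__":
    score, ax, ay = needleman_wunsch("CATTCAC", "CTCGCAGC")
    print(score)
    print(ax)
    print(ay)
```

Here x plays the role of the left (row) sequence and y the top (column) sequence, matching the move rules on slide 33. Enumerating all optimum alignments would require following every tied move rather than the first one found.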
35 Local Alignment: Motivation
- To "ignore" stretches of non-coding DNA
- Non-coding regions (if "non-functional") are more
likely to contain mutations than coding regions
- Local alignment between two protein-encoding
sequences is likely to be between two exons
- To locate protein domains or motifs
- Proteins with similar structures and/or similar
functions but from different species (for
example), often exhibit local sequence
similarities
- Local sequence similarities may indicate
functional modules
Non-coding - "not encoding protein"
Exons - "protein-encoding" parts of genes
vs Introns - "intervening sequences" - segments of
eukaryotic genes that "interrupt" exons
Introns are transcribed into RNA, but are later
removed by RNA processing & are not translated
into protein
36 Local Alignment: Example
g g t c t g a g
a a a c g a
Match: 2, Mismatch or space: -1
Best local alignment
g g t c t g a g
a a a c - g a
(aligned region: c t g a over c - g a)
Score = 5
37 Local Alignment Algorithm
- S[i, j] = Score for optimally aligning a suffix
of x1..i with a suffix of y1..j
- Initialize top row & leftmost column of matrix
with "0"
- Recall: for Global Alignment,
- S[i, j] = Score for optimally aligning a
prefix of X with a prefix of Y
- Initialize top row & leftmost column with gap
penalties
38 Traceback - for Local Alignment
       λ    C    T    C    G    C    A    G    C
  λ    0    0    0    0    0    0    0    0    0
  C    0    1    0    1    0    1    0    0    1
  A    0    0    0    0    0    0    2    0    0
  T    0    0    1    0    0    0    0    1    0
  T    0    0    1    0    0    0    0    0    0
  C    0    1    0    2    0    1    0    0    1
  A    0    0    0    0    1    0    2    0    0
  C    0    1    0    1    0    2    0    1    1
1 for a match, -1 for a mismatch, -5 for a space
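A minimal Smith-Waterman sketch in Python, assuming the usual formulation: the same three cases as in global alignment plus the option of restarting at 0, with traceback from the highest-scoring cell until a 0 is reached. The function name smith_waterman is illustrative. The defaults are this slide's scores (1, -1, -5).

```python
def smith_waterman(x: str, y: str, match: int = 1,
                   mismatch: int = -1, gap: int = -5):
    """Return (best_score, aligned_x, aligned_y) for one best local alignment."""
    n, m = len(x), len(y)
    # Top row and leftmost column stay 0: a local alignment may start anywhere.
    S = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            S[i][j] = max(0,                       # start a new local alignment
                          S[i - 1][j - 1] + sub,
                          S[i - 1][j] + gap,
                          S[i][j - 1] + gap)
            if S[i][j] > best:
                best, best_pos = S[i][j], (i, j)

    # Traceback from the highest-scoring cell until the score drops to 0.
    ax, ay = [], []
    i, j = best_pos
    while i > 0 and j > 0 and S[i][j] > 0:
        sub = match if x[i - 1] == y[j - 1] else mismatch
        if S[i][j] == S[i - 1][j - 1] + sub:
            ax.append(x[i - 1]); ay.append(y[j - 1])
            i -= 1; j -= 1
        elif S[i][j] == S[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-")
            i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1])
            j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


if __name__ == "__main__":
    # Matrix example from this slide: the highest entry is 2.
    print(smith_waterman("CATTCAC", "CTCGCAGC")[0])
    # Example from slide 36 (match 2, mismatch or space -1).
    print(smith_waterman("ggtctgag", "aaacga", match=2, mismatch=-1, gap=-1))
```

When several cells share the highest score, this sketch keeps only the first one found in row-by-row order, so other equally good local alignments are not reported.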
39 Some Results re: Alignment Algorithms (for
ComS, CprE & Math types!)
- Most pairwise sequence alignment problems can be
solved in O(mn) time
- Space requirement can be reduced to O(m+n), while
keeping run-time fixed [Myers88]
- Highly similar sequences can be aligned in O(dn)
time, where d measures the distance between the
sequences [Landau86]
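The space saving is easiest to see for the score alone: each row of the DP matrix depends only on the row above it, so two rows are enough. The sketch below illustrates only that observation. It is not the Myers88 algorithm, which also recovers the alignment itself in linear space, and the function name is hypothetical.

```python
def nw_score_linear_space(x: str, y: str, match: int = 10,
                          mismatch: int = -2, gap: int = -5) -> int:
    """Global alignment score using O(min(N, M)) memory and O(NM) time."""
    # The score is symmetric in x and y, so keep the shorter one along the row.
    if len(y) > len(x):
        x, y = y, x
    prev = [j * gap for j in range(len(y) + 1)]       # row 0 of the matrix
    for i, a in enumerate(x, start=1):
        curr = [i * gap]                              # column 0 of row i
        for j, b in enumerate(y, start=1):
            sub = match if a == b else mismatch
            curr.append(max(prev[j - 1] + sub,        # diagonal
                            prev[j] + gap,            # from the row above
                            curr[j - 1] + gap))       # from the left
        prev = curr                                   # discard the older row
    return prev[-1]


if __name__ == "__main__":
    print(nw_score_linear_space("CATTCAC", "CTCGCAGC"))  # 33
```

Because earlier rows are discarded, the traceback path is no longer available; recovering it without the full matrix needs a divide-and-conquer scheme.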
40 "Scoring" or "Substitution" Matrices
- 2 Major types for Amino Acids: PAM & BLOSUM
- PAM = Point Accepted Mutation
- relies on "evolutionary model" based on
observed differences in alignments of closely
related proteins
- BLOSUM = BLOck SUbstitution Matrix
- based on aa substitutions observed in blocks
of conserved sequences within evolutionarily
divergent proteins
41 PAM Matrix
- PAM = Point Accepted Mutation
- relies on "evolutionary model" based on observed
differences in closely related proteins
- Model includes defined rate for each type of
sequence change
- Suffix number (n) reflects amount of "time"
passed: rate of expected mutation if n% of amino
acids had changed
- PAM1 - for less divergent sequences (shorter
time)
- PAM250 - for more divergent sequences (longer
time)
42 BLOSUM Matrix
- BLOSUM = BLOck SUbstitution Matrix
- based on aa substitutions observed in blocks
of conserved sequences within evolutionarily
divergent proteins
- Doesn't rely on a specific evolutionary model
- Suffix number (n) reflects expected similarity:
average aa identity in the MSA from which the
matrix was generated
- BLOSUM45 - for more divergent sequences
- BLOSUM62 - for less divergent sequences
43 Statistical Significance of Sequence Alignment
44 Affine Gap Penalty Functions
- Gap penalty = h + gk
- where
- k = length of gap
- h = gap opening penalty
- g = gap extension penalty
Can also be solved in O(nm) time using dynamic
programming
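A small Python illustration of the affine penalty h + gk from this slide. The numeric values h = 10 and g = 1 are hypothetical, chosen only to show why one long gap costs less than several short ones. The function name is also illustrative.

```python
def affine_gap_penalty(k: int, h: float, g: float) -> float:
    """Penalty for a single gap of length k: opening cost h plus g per position."""
    if k < 0:
        raise ValueError("gap length must be non-negative")
    return h + g * k if k > 0 else 0


if __name__ == "__main__":
    h, g = 10, 1  # hypothetical opening and extension penalties
    one_long_gap = affine_gap_penalty(5, h, g)          # one gap of length 5
    five_short_gaps = 5 * affine_gap_penalty(1, h, g)   # five gaps of length 1
    print(one_long_gap, five_short_gaps)  # 15 55
```

Under a linear penalty both cases would cost the same. The opening term h is what makes the affine model favor a single insertion or deletion event over many scattered ones.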