Title: Alignment methods
1Alignment methods
- April 21, 2009
- Quiz 1-April 23 (JAM lectures through today)
- Writing assignment topic due Tues, April 23
- Hand in homework 3
- Why has HbS stayed in the population?
- Learning objectives: Understand the difference
between global alignment and local alignment.
Understand the Needleman-Wunsch algorithm.
Understand the Smith-Waterman algorithm in global
alignment mode.
- Workshop: Perform alignment of two nucleotide
sequences
- Homework 4 due Tues, April 23
2Evolutionary Basis of Sequence Alignment
Why are there regions of identity when comparing
protein sequences?
1) Conserved function: amino acid residues
participate in the reaction.
2) Structural: for example, conserved cysteine
residues that form a disulfide linkage.
3) Historical: residues that are conserved solely
due to a common ancestor gene.
3Identity Matrix
     A  C  I  L
A    1
C    0  1
I    0  0  1
L    0  0  0  1
Simplest type of scoring matrix
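As a minimal sketch of the identity matrix above, the scores can be generated rather than typed in. The function name `identity_matrix` and the dictionary layout are illustrative assumptions, not something taken from the slides.

```python
def identity_matrix(residues):
    """Map every ordered residue pair to 1 (identical) or 0 (different)."""
    return {(a, b): 1 if a == b else 0 for a in residues for b in residues}


# The four residues shown on the slide: Ala, Cys, Ile, Leu.
scores = identity_matrix("ACIL")
print(scores[("I", "I")])  # identical residues score 1
print(scores[("I", "L")])  # non-identical residues score 0
```

Under this scheme isoleucine versus leucine scores 0, exactly like any other non-identical pair.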
4Similarity
It is easy to score if an amino acid is identical
to another (the score is 1 if identical and 0 if
not). However, it is not easy to give a score
for amino acids that are somewhat similar.
(Figure: chemical structures of isoleucine and leucine.)
Should they get a 0 (non-identical), a 1
(identical), or something in between?
5One is mouse trypsin and the other is crayfish
trypsin. They are homologous proteins. The
sequences share 41% identity.
7Evolutionary Basis of Sequence Alignment (Cont. 2)
Note: it is possible for two proteins to share a
high degree of similarity but have different
functions. For example, human gamma-crystallin
is a lens protein that has no known enzymatic
activity. It shares a high percentage of
identity with E. coli quinone oxidoreductase.
These proteins likely had a common ancestor, but
their functions diverged.
This is analogous to a railroad car and a diner:
both have the same form but different functions.
8Global Alignment Method
For example, the two hypothetical sequences
abcdefghajklm
abbdhijk
could be aligned like this:
abcdefghajklm
abbd...hijk
As shown, there are 6 matches, 2 mismatches, and
one gap of length 3.
9Global Alignment Method Scored
The alignment is scored according to a payoff
matrix:
payoff: match => match,
        mismatch => mismatch,
        gap_open => gap_open,
        gap_extend => gap_extend
For the algorithm to operate correctly, the match
value must be positive and the other payoff
entries must be negative.
10Global Alignment Method (cont. 3)
Example: Given the payoff matrix
payoff: match => 4,
        mismatch => -3,
        gap_open => -2,
        gap_extend => -1
11Global Alignment Method (cont. 4)
The sequences abcdefghajklm and abbdhijk are
aligned and scored like this:
              a  b  c  d  e  f  g  h  a  j  k  l  m
              a  b  b  d  .  .  .  h  i  j  k
match         4  4     4           4     4  4
mismatch            -3                -3
gap_open                  -2
gap_extend                -1 -1 -1
for a total score of 24 - 6 - 2 - 3 = 13.
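The arithmetic above can be checked with a short sketch. The function name `score_alignment` is hypothetical, and two assumptions are read off the slide's total rather than stated on it: a gap of length n costs gap_open plus n times gap_extend, and the unaligned trailing letters (l and m) are not penalized.

```python
def score_alignment(top, bottom, payoff, gap_char="."):
    """Score a gapped alignment column by column.

    Assumptions: a gap of length n costs gap_open + n * gap_extend, and
    columns past the end of the shorter row are left unscored.
    """
    total = 0
    in_gap = False
    for a, b in zip(top, bottom):  # zip stops at the shorter row
        if a == gap_char or b == gap_char:
            if not in_gap:
                total += payoff["gap_open"]  # charged once per gap
            total += payoff["gap_extend"]    # charged for every gap column
            in_gap = True
        else:
            total += payoff["match"] if a == b else payoff["mismatch"]
            in_gap = False
    return total


payoff = {"match": 4, "mismatch": -3, "gap_open": -2, "gap_extend": -1}
total = score_alignment("abcdefghajklm", "abbd...hijk", payoff)
print(total)  # 13
```

This only scores one given alignment; it does not search for the best one.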
12Global Alignment Method (cont. 5)
The algorithm should guarantee that no
other alignment of these two sequences has
a higher score under this payoff matrix.
13Let's align the following with a simple payoff
matrix:
ABCNJRQCLCRPM and AJCJNRCKCRBP
where match = 1, mismatch = 0, gap = 0, gap
extension = 0
Alignment A
Sequence 1  ABCNJ-RQCLCR-PM
Sequence 2  AJC-JNR-CKCRBP-
Score       101010101011010
Total Score 8
Alignment B
Sequence 1  ABC-NJRQCLCR-PM
Sequence 2  AJCJN-R-CKCRBP-
Score       101010101011010
Total Score 8
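The per-column score strings for Alignment A and Alignment B can be reproduced with a small sketch. The helper name `column_scores` is an assumption for illustration; the payoff values are the ones given on the slide.

```python
def column_scores(row1, row2, match=1, mismatch=0, gap=0, gap_char="-"):
    """Return the score of each column of an already gapped alignment."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    scores = []
    for a, b in zip(row1, row2):
        if a == gap_char or b == gap_char:
            scores.append(gap)
        elif a == b:
            scores.append(match)
        else:
            scores.append(mismatch)
    return scores


alignment_a = column_scores("ABCNJ-RQCLCR-PM", "AJC-JNR-CKCRBP-")
alignment_b = column_scores("ABC-NJRQCLCR-PM", "AJCJN-R-CKCRBP-")
print("".join(map(str, alignment_a)), sum(alignment_a))  # 101010101011010 8
print("".join(map(str, alignment_b)), sum(alignment_b))  # 101010101011010 8
```

Two different alignments reach the same total, which is why the optimal alignment is not necessarily unique.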
14Three steps in Dynamic Programming
1. Initialization
2. Matrix fill or scoring
3. Traceback and alignment
15Initialization step
16Matrix Fill (bottom two rows)
17Matrix Fill (bottom three rows)
18Matrix Fill (entire matrix)
Sequence 1  ABC-NJRQCLCR-PM
Sequence 2  AJCJN-R-CKCRBP-
Score       101010101011010
Total Score 8
Sequence 1  ABCNJ-RQCLCR-PM
Sequence 2  AJC-JNR-CKCRBP-
Score       101010101011010
Total Score 8
19Smith-Waterman algorithm
$$M_{i,j} = \max\left\{ M_{i-1,j-1} + s_{i,j},\; M_{i,j-1} + w,\; M_{i-1,j} + w,\; 0 \right\}$$
The first term is a match or mismatch in the
diagonal, the second a gap in sequence 1, and the
third a gap in sequence 2.
- $M_{i-1,j-1}$ is the value in the cell diagonally
adjacent to $M_{i,j}$. (The (i-1, j-1) cell is up
and to the left of the (i, j) cell.)
- $s_{i,j}$ is the value for the match or mismatch
in the (i, j) cell.
- $M_{i,j-1}$ is the value in the cell above
$M_{i,j}$.
- $w$ is the value for the gap penalty.
- $M_{i-1,j}$ is the value in the cell to the left
of $M_{i,j}$.
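The recurrence for a single cell can be written as a one-line sketch. The function name `sw_cell` and the numbers in the example calls are hypothetical and chosen only to show the four-way maximum.

```python
def sw_cell(diagonal, above, left, s, w):
    """Value of one cell from its three neighbors.

    diagonal: value up and to the left, extended by the match/mismatch score s
    above:    value in the cell above, extended by the gap penalty w
    left:     value in the cell to the left, extended by the gap penalty w
    The trailing 0 keeps the cell from ever going negative.
    """
    return max(diagonal + s, above + w, left + w, 0)


print(sw_cell(diagonal=2, above=3, left=1, s=1, w=-2))   # 3, from the diagonal
print(sw_cell(diagonal=0, above=0, left=0, s=-1, w=-2))  # 0, floor applies
```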
20Initialization step
Create a matrix with M + 1 columns and N + 1
rows, where M = number of letters in sequence 1
and N = number of letters in sequence 2. First
column (M-1) and first row (N-1) will be filled
with 0s.
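A minimal sketch of the initialization step, using a list of rows. The function name `init_matrix` is an assumption, and so is the choice to start every cell at 0; the slide only requires zeros in the first row and first column, and the remaining cells are overwritten during the fill.

```python
def init_matrix(seq1, seq2):
    """Build an (N + 1)-row by (M + 1)-column matrix of zeros.

    M = len(seq1) gives the columns, N = len(seq2) gives the rows.
    """
    m, n = len(seq1), len(seq2)
    return [[0] * (m + 1) for _ in range(n + 1)]


matrix = init_matrix("ABCNJRQCLCRPM", "AJCJNRCKCRBP")
print(len(matrix), len(matrix[0]))  # 13 rows, 14 columns
```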
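21Matrix fill step
Each position $M_{i,j}$ is defined to be the
MAXIMUM score at position (i, j):
$$M_{i,j} = \max\left\{ M_{i-1,j-1} + s_{i,j},\; M_{i,j-1} + w,\; M_{i-1,j} + w \right\}$$
The first term is a match or mismatch in the
diagonal, the second a gap in sequence 1, and the
third a gap in sequence 2.
(Figure: scoring matrix with its row and column
axes labeled.)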
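The fill step can be sketched as a double loop over rows and columns. The function name `fill_matrix` is hypothetical, and filling from the top-left corner is an assumption chosen to match the "up and to the left" description of the diagonal cell. The zero first row and column correspond to the initialization step; with a non-zero gap penalty that choice would leave leading gaps unpenalized.

```python
def fill_matrix(seq1, seq2, match=1, mismatch=0, gap=0):
    """Fill the scoring matrix; columns follow seq1 (i), rows follow seq2 (j)."""
    m, n = len(seq1), len(seq2)
    matrix = [[0] * (m + 1) for _ in range(n + 1)]  # initialization step
    for j in range(1, n + 1):
        for i in range(1, m + 1):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            matrix[j][i] = max(
                matrix[j - 1][i - 1] + s,  # match or mismatch in the diagonal
                matrix[j - 1][i] + gap,    # cell above: gap in sequence 1
                matrix[j][i - 1] + gap,    # cell to the left: gap in sequence 2
            )
    return matrix


filled = fill_matrix("ABCNJRQCLCRPM", "AJCJNRCKCRBP")
print(filled[-1][-1])  # 8, the best score for the two example sequences
```

The value in the last cell is the score that the traceback step then turns into an alignment.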
22Sequence 1  ABCNJ-RQCLCR-PM
Sequence 2  AJC-JNR-CKCRBP-
Score 8
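The three dynamic programming steps can be put together in one sketch that fills the matrix and then traces back from the last cell. The function name `align` and the tie-breaking order (diagonal first, then the cell above, then the cell to the left) are assumptions. Because several alignments share the top score, the alignment this sketch returns may differ from the ones shown on the slides while still scoring 8.

```python
def align(seq1, seq2, match=1, mismatch=0, gap=0):
    """Return (aligned seq1, aligned seq2, score) using fill plus traceback."""
    m, n = len(seq1), len(seq2)
    matrix = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, n + 1):
        for i in range(1, m + 1):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            matrix[j][i] = max(matrix[j - 1][i - 1] + s,
                               matrix[j - 1][i] + gap,
                               matrix[j][i - 1] + gap)

    # Traceback: walk from the bottom-right cell toward the top-left.
    top, bottom = [], []
    i, j = m, n
    while i > 0 and j > 0:
        s = match if seq1[i - 1] == seq2[j - 1] else mismatch
        if matrix[j][i] == matrix[j - 1][i - 1] + s:   # diagonal step
            top.append(seq1[i - 1]); bottom.append(seq2[j - 1])
            i -= 1; j -= 1
        elif matrix[j][i] == matrix[j - 1][i] + gap:   # from above
            top.append("-"); bottom.append(seq2[j - 1])
            j -= 1
        else:                                          # from the left
            top.append(seq1[i - 1]); bottom.append("-")
            i -= 1
    while i > 0:  # leftover letters of sequence 1
        top.append(seq1[i - 1]); bottom.append("-")
        i -= 1
    while j > 0:  # leftover letters of sequence 2
        top.append("-"); bottom.append(seq2[j - 1])
        j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom)), matrix[n][m]


row1, row2, score = align("ABCNJRQCLCRPM", "AJCJNRCKCRBP")
print(row1)
print(row2)
print(score)  # 8
```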