Title: Sequence Alignments
1Sequence Alignments
- Chris Bailey
- Bacterial Pathogenesis Genomics Unit
- cmb036_at_bham.ac.uk
2Why align sequences
- What does my new gene do?
- Known gene x, with a function z
- Unknown gene y
- If an alignment of x and y shows a high degree of similarity, gene y may also have function z
3Why align sequences
4Why align sequences
5Why align sequences
- Myelin sheath proteins were sequenced
- Protein database searched for similar bacterial and viral sequences
- Lab research to determine T-cell reaction to bacterial / viral proteins
6Proteins Evolve!
- Substitution
- Insertion
- Deletion
- Duplication
- Inversion
[Figure: common ancestor Z (probably extinct) with descendants X and Y, which are available (and probably homologous)]
7How to Align
- Take the following sequences
- ACBCDB, and
- CADBD
- An example alignment
- A C - - B C D B
- - C A D B - D -
- The - character represents a space or gap. This could be due to
- Insertion
- Deletion
8Evaluating Alignments
- Use a scoring function
- Exact match between two characters scores 2
- Mismatch or space scores -1
- A C - - B C D B
- - C A D B - D -
- Column scores: -1, 2, -1, -1, 2, -1, 2, -1
- Total score: 1
9Scoring Functions
- Let x and y be single characters (or spaces)
- Then s(x,y) denotes the score of aligning x and y
- s is called the scoring function
- E.g.
- s(A,A) = 2
- s(B,D) = -1
- s(-,A) = -1
- s(B,-) = -1
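The scoring function above can be sketched in a few lines of Python. This is only an illustration: the names `score` and `GAP` and the keyword parameters are hypothetical, and the values are the match / mismatch / space scores from the previous slide.

```python
GAP = "-"  # assumed symbol for a space, as used in the slides


def score(x, y, match=2, mismatch=-1, space=-1):
    """Return s(x, y) for two single characters (either may be a space)."""
    if x == GAP or y == GAP:
        return space
    return match if x == y else mismatch


# The four examples from the slide
print(score("A", "A"), score("B", "D"), score("-", "A"), score("B", "-"))
```

Keeping the three values as parameters makes it easy to try other scoring schemes later without touching the alignment code.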
10More Definitions
- If S is a string, $|S|$ denotes the length of S
- $S_i$ is the ith character of S
- Let S and T be strings. An alignment A maps S and T into S' and T', which may contain spaces, where
- $l = |S'| = |T'|$
- In the example S = acbcdb, T = cadbd, S' = ac--bcdb and T' = -cadb-d-
11Yet More Definitions
- For the alignment A, the value is $\sum_{i=1}^{l} s(S'_i, T'_i)$
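The value of an alignment is the sum of the column scores over its l columns. A minimal sketch, assuming the scoring values used earlier (match 2, mismatch or space -1); the function names `score` and `alignment_value` are made up for this example.

```python
def score(x, y, match=2, mismatch=-1, space=-1):
    """s(x, y): score of aligning two characters, '-' being a space."""
    if x == "-" or y == "-":
        return space
    return match if x == y else mismatch


def alignment_value(s_aligned, t_aligned):
    """Sum s(S'_i, T'_i) over all l columns of the alignment."""
    if len(s_aligned) != len(t_aligned):
        raise ValueError("S' and T' must have the same length l")
    return sum(score(x, y) for x, y in zip(s_aligned, t_aligned))


# The example alignment from the earlier slides
print(alignment_value("ac--bcdb", "-cadb-d-"))
```

For the example alignment this gives 1, which matches the total worked out on the "Evaluating Alignments" slide.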
13Brute Force Alignment
- Get all subsequences of S and T
- Form an alignment of the 2 subsequences
- Score the alignment
- Where n > 3, the number of basic operations approximates to $2^{2n}$
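One way to read the brute-force recipe: choose k positions from S and k positions from T, pair them up in order, and place every remaining character opposite a space. The sketch below follows that reading; it is an assumption about what the slide intends, and the name `brute_force` and the scoring defaults are illustrative only.

```python
from itertools import combinations


def brute_force(S, T, match=2, mismatch=-1, space=-1):
    """Try every pair of equal-length subsequences; return (best score, pairs tried)."""
    n, m = len(S), len(T)
    best, tried = float("-inf"), 0
    for k in range(min(n, m) + 1):
        for ps in combinations(range(n), k):      # positions chosen in S
            for qs in combinations(range(m), k):  # positions chosen in T
                paired = sum(
                    match if S[p] == T[q] else mismatch for p, q in zip(ps, qs)
                )
                # every unchosen character is aligned with a space
                unpaired = space * ((n - k) + (m - k))
                best = max(best, paired + unpaired)
                tried += 1
    return best, tried


print(brute_force("acbcdb", "cadbd"))
```

Even for these two short strings there are 462 pairs of subsequences to score, and the count grows roughly like $2^{2n}$ for two strings of length n, which is why this approach is only useful as a reference for tiny inputs.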
14Dynamic Alignment
- Using strings S and T where
- $|S| = n$ and $|T| = m$
- V(i,j) is the value of the optimal alignment of $S_1 \ldots S_i$ and $T_1 \ldots T_j$
- The value of the optimal alignment of S and T is V(n,m)
- Basic operations: $n^2$, vs $2^{2n}$
15Needleman-Wunsch Algorithm
- Starting at i = 0 and j = 0
- V(0,0) = 0
- $V(i,0) = V(i-1,0) + s(S_i,-)$, for i > 0
- $V(0,j) = V(0,j-1) + s(-,T_j)$, for j > 0
16Needleman-Wunsch Algorithm
- And for V(i,j) where i > 0 and j > 0
- $V(i,j) = \max\{\, V(i-1,j-1) + s(S_i,T_j),\ V(i-1,j) + s(S_i,-),\ V(i,j-1) + s(-,T_j) \,\}$
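Each interior cell V(i,j) is the best of three candidates: one from the diagonal neighbour, one from the left neighbour and one from the neighbour above. A small sketch of that single step, assuming i runs across the table (over S) and j runs down it (over T); `cell_candidates` and the dictionary-based table are hypothetical choices made for this example.

```python
def cell_candidates(V, S, T, i, j, match=2, mismatch=-1, space=-1):
    """Return the three candidate values for V(i, j), keyed by arrow direction."""
    s = match if S[i - 1] == T[j - 1] else mismatch  # S_i is S[i-1] in 0-based Python
    return {
        "diagonal": V[(i - 1, j - 1)] + s,  # align S_i with T_j
        "left": V[(i - 1, j)] + space,      # align S_i with a space
        "up": V[(i, j - 1)] + space,        # align a space with T_j
    }


S, T = "acbcdb", "cadbd"
V = {(0, 0): 0, (1, 0): -1, (2, 0): -2, (0, 1): -1}  # border cells needed so far

c11 = cell_candidates(V, S, T, 1, 1)
V[(1, 1)] = max(c11.values())
c21 = cell_candidates(V, S, T, 2, 1)
V[(2, 1)] = max(c21.values())
print(c11, V[(1, 1)])
print(c21, V[(2, 1)])
```

Recording which candidate produced the maximum is what later gives the arrows used for the backtrace.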
18What !??
- That's kind of hard to work out in your head
- So use a matrix
19Needleman-Wunsch Algorithm
i (columns): 0 1 2 3 4 5 6
j (rows): 0 1 2 3 4 5
V(0,0) = 0
20Needleman-Wunsch Algorithm
- Fill in the table using the scoring function s
- Match = 2
- Mismatch or space = -1
- Therefore
- $V(i,0) = V(i-1,0) + s(S_i,-)$, and
- $V(0,j) = V(0,j-1) + s(-,T_j)$
- Become
- $V(i,0) = V(i-1,0) - 1$, and
- $V(0,j) = V(0,j-1) - 1$
21Needleman-Wunsch Algorithm
i (columns): 0 1 2 3 4 5 6
j (rows): 0 1 2 3 4 5
Row j = 0: 0 -1 -2 -3 -4 -5 -6
V(1,0) = V(0,0) - 1 = -1
V(2,0) = V(1,0) - 1 = -2
22Needleman-Wunsch Algorithm
i (columns): 0 1 2 3 4 5 6
j (rows): 0 1 2 3 4 5
Column i = 0: 0 -1 -2 -3 -4 -5
V(0,1) = V(0,0) - 1 = -1
V(0,2) = V(0,1) - 1 = -2
23Needleman-Wunsch Algorithm
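- Okay, so now we need this formula for V(i,j) where i > 0 and j > 0
- $V(i,j) = \max\{\, V(i-1,j-1) + s(S_i,T_j),\ V(i-1,j) + s(S_i,-),\ V(i,j-1) + s(-,T_j) \,\}$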
24Needleman-Wunsch Algorithm
25Needleman-Wunsch Algorithm
i (columns): 0 1 2 3 4 5 6
j (rows): 0 1 2 3 4 5
Since s(a,c) = -1, the candidates for V(1,1) are
V(0,0) - 1 = -1
V(0,1) - 1 = -2
V(1,0) - 1 = -2
Max = -1 (from V(0,0) - 1)
26Needleman-Wunsch Algorithm
i (columns): 0 1 2 3 4 5 6
j (rows): 0 1 2 3 4 5
Since s(c,c) = 2, the candidates for V(2,1) are
V(1,0) + 2 = 1
V(1,1) - 1 = -2
V(2,0) - 1 = -3
Max = 1 (from V(1,0) + 2)
27Needleman-Wunsch Algorithm
[Table: the completed matrix, with i = 0 to 6 across and j = 0 to 5 down]
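Filling in the whole table by hand is tedious, so here is a sketch that does it for the running example (S = acbcdb, T = cadbd, match 2, mismatch or space -1). The name `nw_matrix` and the list-of-lists layout `V[i][j]` are assumptions made for illustration.

```python
def nw_matrix(S, T, match=2, mismatch=-1, space=-1):
    """Fill the global-alignment table; V[i][j] scores S[:i] against T[:j]."""
    n, m = len(S), len(T)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):           # first row of the slide's table (j = 0)
        V[i][0] = V[i - 1][0] + space
    for j in range(1, m + 1):           # first column of the slide's table (i = 0)
        V[0][j] = V[0][j - 1] + space
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if S[i - 1] == T[j - 1] else mismatch
            V[i][j] = max(
                V[i - 1][j - 1] + s,    # diagonal
                V[i - 1][j] + space,    # left
                V[i][j - 1] + space,    # up
            )
    return V


V = nw_matrix("acbcdb", "cadbd")
# Print in the slide's orientation: one line per j, i running across
for j in range(len(V[0])):
    print(" ".join(f"{V[i][j]:3d}" for i in range(len(V))))
```

The bottom-right entry, V(6,5), is the value of the optimal alignment of the two full strings.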
28Needleman-Wunsch Algorithm
- Creating the optimal alignments
- Follow the arrows back from the bottom-right of the table
- Following an arrow left inserts a gap into T, and uses a letter from S
- Following an arrow up inserts a gap into S, and uses a letter from T
- Following an arrow diagonally uses a letter from S and T
29A bit of code
- if (going left)
- unshift S', S_i
- unshift T', '-'
- --i
- if (going up)
- unshift S', '-'
- unshift T', T_j
- --j
- if (going diagonally)
- unshift S', S_i
- unshift T', T_j
- --i; --j
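The pseudocode above can be turned into a runnable sketch. Here `deque.appendleft` plays the role of `unshift`, and the arrows are recomputed from the table instead of being stored. The name `nw_align`, the scoring defaults and the tie-breaking order (diagonal, then left, then up) are assumptions made for this example, so it returns just one of possibly several optimal alignments.

```python
from collections import deque


def nw_align(S, T, match=2, mismatch=-1, space=-1):
    """Global alignment: fill the table, then trace back from the bottom-right."""
    n, m = len(S), len(T)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + space
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + space
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if S[i - 1] == T[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j - 1] + s, V[i - 1][j] + space, V[i][j - 1] + space)

    s_out, t_out = deque(), deque()
    i, j = n, m
    while i > 0 or j > 0:
        s = match if (i > 0 and j > 0 and S[i - 1] == T[j - 1]) else mismatch
        if i > 0 and j > 0 and V[i][j] == V[i - 1][j - 1] + s:
            s_out.appendleft(S[i - 1])   # going diagonally: a letter from each
            t_out.appendleft(T[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and V[i][j] == V[i - 1][j] + space:
            s_out.appendleft(S[i - 1])   # going left: letter from S, gap in T
            t_out.appendleft("-")
            i -= 1
        else:
            s_out.appendleft("-")        # going up: gap in S, letter from T
            t_out.appendleft(T[j - 1])
            j -= 1
    return "".join(s_out), "".join(t_out), V[n][m]


print(nw_align("acbcdb", "cadbd"))
```

A different tie-breaking order would follow different arrows and could return a different alignment with the same score.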
30Needleman-Wunsch Algorithm
Alignment 1
S' = - A C B C D B
T' = C A D B - D -
Alignment 2
S' = A C B C D B -
T' = - C A - D B D
Alignment 3
S' = A C B C D B -
T' = - C - A D B D
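When a cell has more than one arrow, each choice leads to a different alignment with the same score. The following sketch follows every arrow instead of just one; `all_optimal_alignments` is a hypothetical name, and the scoring values are the ones used throughout the slides.

```python
def all_optimal_alignments(S, T, match=2, mismatch=-1, space=-1):
    """Return every optimal global alignment as a list of (S', T') pairs."""
    n, m = len(S), len(T)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + space
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + space
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if S[i - 1] == T[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j - 1] + s, V[i - 1][j] + space, V[i][j - 1] + space)

    def walk(i, j):
        """Yield all optimal alignments of S[:i] and T[:j]."""
        if i == 0 and j == 0:
            yield "", ""
            return
        if i > 0 and j > 0:
            s = match if S[i - 1] == T[j - 1] else mismatch
            if V[i][j] == V[i - 1][j - 1] + s:           # diagonal arrow
                for a, b in walk(i - 1, j - 1):
                    yield a + S[i - 1], b + T[j - 1]
        if i > 0 and V[i][j] == V[i - 1][j] + space:      # left arrow
            for a, b in walk(i - 1, j):
                yield a + S[i - 1], b + "-"
        if j > 0 and V[i][j] == V[i][j - 1] + space:      # up arrow
            for a, b in walk(i, j - 1):
                yield a + "-", b + T[j - 1]

    return sorted(walk(n, m))


for s_aligned, t_aligned in all_optimal_alignments("acbcdb", "cadbd"):
    print(s_aligned)
    print(t_aligned)
    print()
```

For the running example this yields the three alignments listed on the slide, each with a value of 2. The number of optimal alignments can grow quickly for longer sequences, so in practice a single backtrace is usually reported.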
31Global vs Semi-Global vs Local
32Semi-Global Alignments
- Same as global alignment algorithm, except
- Initialise 1st row and column with 0
- No gap penalty in last row/column
- (or start from max value in last row/column)
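The two changes listed above are small in code. A minimal sketch, assuming the "start from max value in last row/column" variant and the same match / mismatch / space scores as before; `semi_global_score` is a hypothetical name and only the score is returned, not the alignment.

```python
def semi_global_score(S, T, match=2, mismatch=-1, space=-1):
    """Score of the best alignment in which leading and trailing gaps are free."""
    n, m = len(S), len(T)
    # 1st row and column are initialised with 0: leading gaps cost nothing
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if S[i - 1] == T[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j - 1] + s, V[i - 1][j] + space, V[i][j - 1] + space)
    # Trailing gaps are free: take the best value where S or T is used up
    end_of_s = max(V[n][j] for j in range(m + 1))
    end_of_t = max(V[i][m] for i in range(n + 1))
    return max(end_of_s, end_of_t)


print(semi_global_score("xxabcxx", "abc"))    # short sequence inside a longer one
print(semi_global_score("ttacg", "acgtt"))    # end of S overlapping start of T
```

Both calls score 6 (three matches, no penalised gaps), whereas a global alignment of the same pairs would be charged for every overhanging character.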
33Semi-global alignment
37Smith-Waterman Alignments
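- An adaptation of Needleman-Wunsch
- Only difference in the recurrence: 0 is an extra option, $V(i,j) = \max\{\, 0,\ V(i-1,j-1) + s(S_i,T_j),\ V(i-1,j) + s(S_i,-),\ V(i,j-1) + s(-,T_j) \,\}$
- Initialize the matrix with 0 in the 1st row and column
- Start the backtrace from the maximum value in the matrix, end it at 0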
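The adaptations above can be sketched as follows. This is an illustration under the usual assumptions (match 2, mismatch or space -1, linear gaps); the name `smith_waterman` is hypothetical, only the first cell holding the maximum is traced back, and ties between arrows are broken diagonal first.

```python
def smith_waterman(S, T, match=2, mismatch=-1, space=-1):
    """Local alignment: return (S', T', score) for one best-scoring local alignment."""
    n, m = len(S), len(T)
    V = [[0] * (m + 1) for _ in range(n + 1)]   # 1st row and column stay 0
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if S[i - 1] == T[j - 1] else mismatch
            V[i][j] = max(0, V[i - 1][j - 1] + s, V[i - 1][j] + space, V[i][j - 1] + space)
            if V[i][j] > best:
                best, bi, bj = V[i][j], i, j

    # Backtrace from the maximum value and stop at the first 0
    s_out, t_out = [], []
    i, j = bi, bj
    while V[i][j] > 0:
        s = match if S[i - 1] == T[j - 1] else mismatch
        if V[i][j] == V[i - 1][j - 1] + s:
            s_out.append(S[i - 1]); t_out.append(T[j - 1])
            i, j = i - 1, j - 1
        elif V[i][j] == V[i - 1][j] + space:
            s_out.append(S[i - 1]); t_out.append("-")
            i -= 1
        else:
            s_out.append("-"); t_out.append(T[j - 1])
            j -= 1
    return "".join(reversed(s_out)), "".join(reversed(t_out)), best


print(smith_waterman("xxabcxx", "yyabcyy"))
```

Because no cell can drop below 0, poorly matching flanks are simply left out of the alignment instead of dragging the score down.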
38Smith-Waterman Algorithm
39Smith-Waterman Algorithm
- Gives optimal local alignments of
- c x d e x d e
- c d e x c d e
and
40Gap Penalties
- Linear Gap Scores
- g(k) = ak
- Affine Gap Scores
- g(k) = b + ak
- Convex Gap Scores
- g(k) = log(k)
- Where
- k is gap size
- a is gap extension penalty
- b is gap introduction penalty
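To see how the three models differ, the sketch below evaluates each g(k) for a few gap sizes. The values a = 1 and b = 2 are arbitrary choices for illustration, the function names are made up, and g(k) is treated here as a penalty size that would be subtracted from the alignment score.

```python
import math


def linear_gap(k, a=1):
    """g(k) = a*k: every gap position costs the same."""
    return a * k


def affine_gap(k, a=1, b=2):
    """g(k) = b + a*k: pay b once to open the gap, then a per position."""
    return b + a * k


def log_gap(k):
    """g(k) = log(k): each extra position costs less than the one before."""
    return math.log(k)


for k in (1, 2, 5, 10):
    print(k, linear_gap(k), affine_gap(k), round(log_gap(k), 2))
```

Under the affine model, one gap of length 4 costs b + 4a while four separate gaps of length 1 cost 4b + 4a, so long single gaps are preferred over many scattered ones.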
41Gap Scoring
- Concave (called convex gap scores in some texts): g(k) = log(k)
- Best model of real life
- Computationally complex
- Linear: g(k) = ak
- Not a good model of reality
- Computationally simple
- Affine: g(k) = b + ak
- Closer to reality
- Computationally manageable
42Gap Penalties Biological Motivation
- Insertion/deletion events (indels) involving a whole substring often happen as 1 event
- Therefore, the linear model is not representative of real life
- However, the affine model requires 3 matrices (E, F and G) to calculate V (the alignment score)
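A sketch of how three extra matrices can be used to compute V under the affine model g(k) = b + ak. The slide does not say which matrix is which, so the roles below are an assumption following one common textbook convention: G ends in a match or mismatch, E ends in a gap in S, and F ends in a gap in T. The name `affine_score` and the default values a = 1, b = 2 are also illustrative.

```python
NEG_INF = float("-inf")


def affine_score(S, T, match=2, mismatch=-1, a=1, b=2):
    """Global alignment score where a gap of length k costs b + a*k."""
    n, m = len(S), len(T)
    G = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # last column pairs S_i with T_j
    E = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # last column is a gap in S
    F = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # last column is a gap in T
    V = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    V[0][0] = 0
    for i in range(1, n + 1):
        F[i][0] = V[i][0] = -(b + a * i)   # one gap of length i
    for j in range(1, m + 1):
        E[0][j] = V[0][j] = -(b + a * j)   # one gap of length j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if S[i - 1] == T[j - 1] else mismatch
            G[i][j] = V[i - 1][j - 1] + s
            # extend an existing gap (cost a) or open a new one (cost b + a)
            E[i][j] = max(E[i][j - 1], V[i][j - 1] - b) - a
            F[i][j] = max(F[i - 1][j], V[i - 1][j] - b) - a
            V[i][j] = max(G[i][j], E[i][j], F[i][j])
    return V[n][m]


print(affine_score("AAAA", "AA"))                 # two matches and one gap of length 2
print(affine_score("acbcdb", "cadbd", a=1, b=0))  # b = 0 reduces to the linear model
```

With b = 0 the result for the running example is 2, the same value the linear-gap table gives, which is a useful sanity check on the recurrences.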
43Thoughts for next time
- How can you align multiple sequences by adapting pairwise alignment algorithms?
- When aligning proteins, should two similar (e.g. hydrophobic) amino acids still receive a positive score?
- How does BLAST actually work?