Sequence Alignment

About This Presentation

Title:

Sequence Alignment

Description:

S. B. Needleman ... max ( subproblem ( i 1, j ) , subproblem ( i, j 1 ) ); Memorizing ... gap x in sequence B } max { S[i, j y] w( y ); gap y in sequence ... – PowerPoint PPT presentation

Number of Views:28

Avg rating:3.0/5.0

Slides: 64

Provided by: aleph0

Category:

more less

Transcript and Presenter's Notes

Title: Sequence Alignment

1
Sequence Alignment

Arthur Chou
Clark University
Spring 2005

2
Sequence Alignment

Input two sequences over the same alphabet
Output an alignment of the two sequences
Example
GCGCATGGATTGAGCGA
TGCGCCATTGATGACCA
A possible alignment
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A

3
Why align sequences?

Lots of sequences dont have known ancestry,
structure, or function. A few of them do.
If they align, they are similar.
If they are similar, they might have the same
ancestry, similar structure or function.
If one of them has known ancestry, structure,
or
function, then alignment to the others yields
insight about them.

4
Alignments
-GCGC-ATGGATTGAGCGA TGCGCCATTGAT-GACC-A Three
kinds of match Exact matches Mismatches
Indels (gaps)
5
Choosing Alignments

There are many possible alignments
For example, compare
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
to
------GCGCATGGATTGAGCGA
TGCGCC----ATTGATGACCA--
Which one is better?

6
Scoring Alignments

Similar sequences evolved from a common ancestor
Evolution changed the sequences from this
ancestral sequence by mutations
Replacement one letter replaced by another
Deletion deletion of a character
Insertion insertion of a character
Scoring of sequence similarity should examine how
many and which operations took place

7
Simple Scoring Rule

Score each position independently
Match 1
Mismatch -1
Indel -2
Score of an alignment is sum of position scores

8
Example

-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
Score (1?13) (-1 ? 2) (-2 ? 4)
3
------GCGCATGGATTGAGCGA
TGCGCC----ATTGATGACCA--
Score (1 ? 5) (-1 ? 6) (-2 ? 11)
-23

9
More General Scores

The choice of 1,-1, and -2 scores is quite
arbitrary
Depending on the context, some changes are more
plausible than others
Exchange of an amino-acid by one with similar
properties (size, charge, etc.) vs.
Exchange of an amino-acid by one with opposite
properties
Probabilistic interpretation How likely is one
alignment versus another ?

10
Dot Matrix Method

A dot is placed at each position where two
residues match.
It's a visual aid. The human eye can rapidly
identify similar regions in sequences.
It's a good way to explore sequence organization
e.g. sequence repeats.
It does not provide an alignment.

THEFA-TCAT THEFASTCAT

This method produces dot-plots with too much
noise
to be useful
The noise can be reduced by calculating a score
using a window of residues.
The score is compared to a threshold or
stringency.

11
Dot Matrix Representation

Produces a graphical representation of similarity
regions
The horizontal and vertical dimensions correspond
to the compared sequences
A region of similarity stands out as a diagonal

12
Dot Matrix or Dot-plot

Each window of the first sequence is aligned
(without
gaps) to each window of the 2nd sequence
A colour is set into a rectangular array
according to the
score of the aligned windows

13
Dot Matrix Display

Diagonal rows ( ) of dots
reveal sequence similarity
or repeats.
Anti-diagonal rows ( )
of dots represent inverted
repeats.
Isolated dots represent
random similarity.

H C G E T F G R W F T P E W K C G
P T
F G R
I A C G E
M
14
We can filter it by using a sliding window
looking for longer strings of matches and
eliminates random matches
15
Longest Common Subsequence

Sequence A nematode_knowledge
Sequence B empty_bottle
n e m a t o d e _ k n o w l e d g e
e m p t y _ b o t t l e
LCS Alignment with match score 1,
mismatch score 0, and gap penalty 0

16
What is an algorithm?

A step-by-step description of the procedures to
accomplish a task.
Properties

Determination of output for each input
Generality
Termination

Criteria

Correctness (proof, test, etc.)
Time efficiency (no. of steps is small)
Space efficiency (spaced used is small)

17
Naïve algorithm exhaustive search

G C G A A T G G A T T G A G C G T
T G A G C C A T T G A T G A C C A

sequences of length n
i
j
i j j i j i j j i i . . . . . . . . . . . . . .
2n
Worst case time complexity is 2
18
Dynamic programming algorithms for pairwise
sequence alignment

Similar to Longest Common Subsequence
Introduced for biological sequences by
S. B. Needleman C. D. Wunsch. A general method
applicable to the search for similarities in the
amino acid sequence of two proteins. J. Mol.
Biol. 48443-453 (1970)

19
Dynamic Programming

Optimality substructure
Reduction to a small number of sub-problems
Memorization of solutions to sub-problems in a
table
Table look-up and tracing

- G C G C A T G G A T T G A G C G A T G C G C C
A T T G A T G A C C - A
20
Recursive LCS
lcs_len( i , j ) length of LCS from i-th
position onward in String A and from j-th
position onward in String B
int lcs_len ( i , j ) if (A i \0
B j \0 ) return 0 else
if (A i B j ) return ( 1 lcs_len (
i1, j1 ) ) else return max ( lcs_len
( i1, j ) , lcs_len ( i, j1 )
)
21
Reduction to Subproblems
int lcs_len ( String A , String B )
return subproblem ( 0, 0 ) int
subproblem ( int i, int j ) if (A i
\0 B j \0) return 0 else
if ( A i B j ) return (1
subproblem ( i1, j1 )) else return max (
subproblem ( i1, j ) , subproblem (
i, j1 ) )
22
Memorizing the solutions

Matrix L i , j -1 // initializing
the memory device
int subproblem ( int i, int j )
if ( Li, j lt 0 )
if (A i \0 B j \0) Li ,
j 0
else if ( A i B j )
Li, j 1 subproblem(i1,
j1)
else Li, j max( subproblem(i1,
j),
subproblem(i, j1))
return L i, j

23
Iterative LCS Table Look-up
To get the length of LCS of Sq. A and Sq. B
first allocate storage for the matrix L
for each row i from m downto 0 for each
column j from n downto 0 if (A i
\0 or B j \0) L i, j 0
else if (A i B j ) L i, j 1
Li1, j1 else L i, j
max(Li1, j, Li, j1)
return L0, 0
24
Iterative LCS Table Look-up

int lcs_len ( String A , String B ) // the
length
// First allocate storage for the matrix L
for ( i m i gt 0 i-- ) // A has
length m1
for ( j n j gt 0 j-- ) // B
has length n1
if (A i \0 B j \0)
L i, j 0
else if (A i B j ) L i, j 1
Li1, j1
else L i, j max(Li1, j,
Li, j1)
return L0, 0

25
Dynamic Programming Algorithm

Li, j 1 Li1, j1 , if A i B
j
Li, j max ( Li1, j, Li, j1 )
otherwise

j
j1
B
A
Matrix L
L i, j
L i, j1
i
Li1, j1
L i1, j
i1
26
n e m a t o d e _ k n o w l e d
g ee   7 7 6 5 5 5 5 5 4 3 3 3 2
2 2 1 1 1 0m   6 6 6 5 5 4 4 4 4
3 3 3 2 2 1 1 1 1 0p   5 5 5 5 5
4 4 4 4 3 3 3 2 2 1 1 1 1 0t   5
5 5 5 5 4 4 4 4 3 3 3 2 2 1 1 1
1 0y   4 4 4 4 4 4 4 4 4 3 3 3 2
2 1 1 1 1 0_   4 4 4 4 4 4 4 4 4
3 3 3 2 2 1 1 1 1 0b   3 3 3 3 3
3 3 3 3 3 3 3 2 2 1 1 1 1 0o   3
3 3 3 3 3 3 3 3 3 3 3 2 2 1 1 1
1 0t   3 3 3 3 3 2 2 2 2 2 2 2 2
2 1 1 1 1 0t   3 3 3 3 3 2 2 2 2
2 2 2 2 2 1 1 1 1 0l   2 2 2 2 2
2 2 2 2 2 2 2 2 2 1 1 1 1 0e   1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 0    0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0
27
Obtain the subsequence

Sequence S empty // the LCS
i 0 j 0
while ( i lt m j lt n)
if ( A i B j )
add Ai to end of S
i j
else
if ( Li1, j gt Li, j1) i
else j

28
n e m a t o d e _ k n o w l e d g e
e o-o o-o-o-o-o-o o-o-o-o-o-o-o o-o-o o
\ \ \ \    m
o-o-o o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o
\     p
o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o
    t
o-o-o-o-o o-o-o-o-o-o-o-o-o-o-o-o-o-o
\     y
o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o
    _
o-o-o-o-o-o-o-o-o o-o-o-o-o-o-o-o-o-o
\     b
o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o-o
    o
o-o-o-o-o-o o-o-o-o o o o-o-o-o-o-o-o
\     t
o-o-o-o-o o-o-o-o-o-o-o-o-o-o-o-o-o-o
\     t
o-o-o-o-o o-o-o-o-o-o-o-o-o-o-o-o-o-o
\     l
o-o-o-o-o-o-o-o-o-o-o-o-o-o o-o-o-o-o
\     e o-o
o-o-o-o-o-o o-o-o-o-o-o-o o-o-o o       \
\ \ \       o o o o o o
o o o o o o o o o o o o o
29
Dynamic Programming with scores and penalties
x

y
j
30
Dynamic Programming with scores and penalties

from i-th pos. in A and j-th pos. in B
onward
s ( Ai , Bj )
Si1, j1
Si , j max max Six, j w( x )
gap x in
sequence B
max Si, jy
w( y )
gap y in sequence A

s score
w penalty function
best score from i, j onward
31
Algorithm for simple gap penalty

If for each gap, the penalty is a fixed constant
c, then
s(A i , B j ) Si1, j1
Si , j max S i1, j c //
one gap
S i, j1 c // one gap

32
Table Tracing

To do table tracing based on similarity matrix
of amino acids, we re-define Si , j to be the
optimal score of choosing the match of Ai with
Bj.
S i , j s (A i , B j ) // s
score
Si1, j1 // w
gap penalty
max Si1x, j1 w( x )
max gap x in sequence B
max Si1, j1y w( y )
gap y in
sequence A

33
Diagram
Matrix S
j
j1
i
i1
34
Summation operation

1. Start at lower right corner.
2. Move diagonally up one position.
3. Find largest value in either
? row segment starting diagonally below
current position and extending to the right or
? column segment starting diagonally
below current position and extending down.
4. Add this value to the value in the current
cell.
5. Repeat steps 3 and 4 for all cells to the left
in current row and all cells above in current
column.
6. If we are not in the top left corner, go to
step 2.

35
(No Transcript)
36
(No Transcript)
37
(No Transcript)
38
(No Transcript)
39
(No Transcript)
40
(No Transcript)
41
(No Transcript)
42
(No Transcript)
43
(No Transcript)
44
(No Transcript)
45
(No Transcript)
46
(No Transcript)
47
(No Transcript)
48
(No Transcript)
49
(No Transcript)
50
----V HGQKV
51
(No Transcript)
52
----VA HGQKVA
53
----VADALTK HGQKVADALTK
54
----VADALTK HGQKVADALTK
55
----VADALTKPVNFKFA HGQKVADALTK------A
56
----VADALTKPVNFKFAVAH HGQKVADALTK------AVAH
57
Dynamic Programming by Tracing a Similarity Matrix

Recall the algorithm Table Tracing
Tracing a Similarity Matrix of Amino Acids

58
Use of dynamic programming to evaluate homology
between pairs of sequences

If we just want to know maximum match possible
between two sequences, then we dont need to do
trace-back but can just look at the highest value
in the first row or column (match score). This
represents the best possible alignment score.

59
Gap penalty alternatives

constant gap penalty for gap gt 1
gap penalty proportional to gap size (affine gap
penalty)
one penalty for starting a gap (gap opening
penalty)
different (lower) penalty for adding to a gap
(gap extension penalty)
dynamic programming algorithm can be made more
efficient

60
Gap penalty alternatives (cont.)

gap penalty proportional to gap size and sequence
for nucleic acids, can be used to mimic
thermodynamics of helix formation.
two kinds of gap opening penalties
one for gap closed by AT, different for GC.
different gap extension penalty.

61
End gaps

Some programs treat end gaps as normal gaps and
apply penalties, other programs do not apply
penalties for end gaps.

62
End gaps (cont.)

Can determine which a program does by adding
extra (unmatched) bases to the end of one
sequence and seeing if match score changes.
Penalties for end gaps appropriate for aligned
sequences where ends "should match.
Penalties for end gaps inappropriate when
surrounding sequences are expected to be
different (e.g., conserved exon surrounded by
varying introns).

63
Global vs. Local Similarity

Should result of alignment include all amino
acids or proteins or just those that match?
If yes, a global alignment is desired
If no, a local alignment is desired
Global alignment is accomplished by including
negative scores for mismatched positions, thus
scores get worse as we move away from region of
match (local alignment).
Instead of starting trace-back with highest value
in first row or column, start with highest value
in entire matrix, stop when score hits zero.

Write a Comment

User Comments (0)

About PowerShow.com

Sequence Alignment - PowerPoint PPT Presentation

Sequence Alignment

S. B. Needleman ... max ( subproblem ( i 1, j ) , subproblem ( i, j 1 ) ); Memorizing ... gap x in sequence B } max { S[i, j y] w( y ); gap y in sequence ... – PowerPoint PPT presentation