Sequence alignment algorithms - PowerPoint PPT Presentation

Provided by: compbi


1
Sequence alignment algorithms
Presented by Cary Miller, Sastry Akella,
Daisuke Yasuda
2
Overview

Biological background / motivation / applications
Dot matrix / dynamic programming
FASTA / BLAST

3
biology

Biomolecules are strings over a restricted
alphabet
Alphabet of size 4: DNA
Alphabet of size 20: protein
Proteins are the working part

4
Proteins

Protein is a linear sequence of 20 characters
(amino acids)
Proteins do not maintain linearity
Folding happens
Folding determines overall 3-D shape
Shape determines function

5
Sequence → Structure → Function

Sequence alone does not reveal structure,
much less function
A sequence: ARTUVEDYERRWWUHUK

6
Structure

Pic 1
Pic 2

7
Structure
8
function

Protein A is a constituent of muscle, skin,
cartilage, or
Protein B catalyzes the transformation of glucose
to fructose, or
How do we find proteins with similar function?

9
Nature does not solve the same problem twice
(usually)

Short sequence with a specific function (or
shape) is called a domain
The same domain appears in multiple proteins
If we find the same domain in multiple proteins
that provides a clue to function and/or structure

10
Amino acids

Each has the same basic chemical configuration
but has a functional group that makes it
chemically unique
They occur in families
Some functional groups are similar

11
How biologists study proteins

Expensive (NMR, x-ray crystallography)
Discovery of function is difficult
Few proteins are understood in detail
Many are known by sequence
Sequence is easier to get than structure or
function

12
A biological scenario

Biologist discovers the sequence of a new protein
with unknown function
She has no idea of function
If sequence can be associated with a known
protein sequence we have a clue about structure
and/or function
Most proteins have unknown function

13
Public databases

Vast quantities of sequence, structure, function
info is deposited into public databases
A new sequence should be compared to the database

14
Comparing sequences

Alignment with exact match
ABCTUVAB
UV

ABCTUVAB
----UV

15
Alignment with inexact match

Inexact
GARUIPPRST
GARVVBUIEEYST

GAR------UIPPRST
GARVVBUIEEYST

16
Global vs. local alignment

ABQRTASGGBV
ABRRRASGVBB
ABQRTASGGBV
ABQ------SGGBV

17
A real alignment

Myoglobin
PDLRKY FKG-A ENFTA DDVQ KSDR
PDTKAY FPKFG DLSTA AALK SSPK
Homology = common ancestry

18
Real alignment

Pic 3

19
Real alignment
20
Scoring pairs of amino acids

For amino acid pairs, assign a score based on
frequency of substitution
ATRGUVXQ
ATRCVVXT

ATRGVVEQ
AT-----VVEQ

21
A substitution matrix

Pic 4

22
A substitution matrix
23
Substitution matrices

PAM and BLOSUM are standard substitution matrices
Scoring schemes also include penalties for
Gap opening
Gap extension

24
Scoring amino acid strings

Sum the individual pair scores
Database is huge
Spurious match to random sequence is likely
Try your name
E-value is the number of matches with a given
score (or higher) expected by chance from random
sequence
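Summing pair scores over an alignment can be sketched as follows. This is a minimal illustration, not the scoring used by any particular tool: `toy_score` is a hypothetical stand-in for a PAM or BLOSUM lookup, and the gap opening and extension values are assumed for the example.

```python
def toy_score(x, y):
    """Hypothetical stand-in for a substitution-matrix lookup."""
    return 2 if x == y else -1


def score_alignment(row_a, row_b, pair_score=toy_score,
                    gap_open=-5, gap_extend=-1):
    """Sum pair scores over two gapped alignment rows of equal length.

    A gap character '-' in either row costs gap_open for the first
    gapped column of a run and gap_extend for each following one.
    For simplicity, a run of gaps is not split by which row holds it.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    total = 0
    in_gap = False
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            total += pair_score(x, y)
            in_gap = False
    return total


print(score_alignment("GATTACA", "GA--ACA"))
```

Five matching columns contribute 2 each, and the two-column gap costs one opening plus one extension.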

25
Alignment algorithms

Dot matrix
Dynamic programming
FASTA
BLAST

26
Dot Matrix and DP
27
Dot Matrix

Locating regions of similarity between two DNA or
protein sequences provides a great deal of
information about the function and structure of
the query sequence.
Similar structure indicates homology, or similar
evolution, which provides critical information
about the functions of these sequences.

28
Dot Matrix Contd..

A dot matrix plot is a method of aligning two
sequences to provide a picture of the homology
between them.
The dot matrix plot is created by designating one
sequence to be the subject and placing it on the
horizontal axis and designating the second
sequence to be the query and placing it on the
vertical axis of the matrix.

29
Dot Matrix Contd..

At each position within the matrix, a point is
plotted if the horizontal and vertical elements
are identical.
Diagonal lines within the resulting matrix
indicate regions of similarity. A simple dot
matrix plot is shown in Figure A.
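The construction described above can be sketched in a few lines. The function names and the text rendering (`*` for a dot, `.` for empty) are choices made for this illustration only.

```python
def dot_matrix(subject, query):
    """Return a grid with one row per query element (vertical axis)
    and one column per subject element (horizontal axis).
    A cell is True where the two elements are identical."""
    return [[q == s for s in subject] for q in query]


def render(matrix):
    """Draw the grid as text: '*' for a plotted point, '.' otherwise."""
    return "\n".join(
        "".join("*" if cell else "." for cell in row) for row in matrix
    )


print(render(dot_matrix("GATTACA", "ATTAC")))
```

Runs of `*` along a diagonal correspond to the regions of similarity the slide mentions.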

30
(No Transcript)
31
Dot Matrix with noise reduction

A certain percentage of the matches between
sequence elements can be expected to be the
result of the random nature of their evolution.
These random matches are considered "noise" and
are filtered out to enhance the diagonal lines.

32
Dot Matrix

Noise Reduction
a) Noise reduction in dot matrix can be done
by centering a substring of elements of the
query sequence over each element in the
subject sequence and determining the number of
corresponding elements within this window.

33
Dot Matrix

b) If the number of corresponding elements
exceeds a specified threshold then a point is
plotted for the center element. This is
demonstrated in figure B.
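A minimal sketch of the windowed filter from steps a) and b), assuming an odd window size and treating positions that fall off either sequence as non-matching. The parameter names `window` and `threshold` are illustrative.

```python
def filtered_dot_matrix(subject, query, window=3, threshold=2):
    """Plot a point at (i, j) only if the number of matching elements
    in a diagonal window centered on query[i] / subject[j] exceeds
    the threshold."""
    half = window // 2
    rows = []
    for i in range(len(query)):
        row = []
        for j in range(len(subject)):
            matches = 0
            for k in range(-half, half + 1):
                qi, sj = i + k, j + k
                inside = 0 <= qi < len(query) and 0 <= sj < len(subject)
                if inside and query[qi] == subject[sj]:
                    matches += 1
            row.append(matches > threshold)
        rows.append(row)
    return rows


grid = filtered_dot_matrix("ABCD", "ABCD", window=3, threshold=2)
print(grid[1][1], grid[0][0])
```

With a window of 3 and a threshold of 2, only cells whose whole window matches are plotted, so isolated single-element matches disappear. Cells at the edges are also dropped here because their windows are truncated.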

34
Dot Matrix (Figure B)
35
Dot Matrix

Advantages: Readily reveals the presence of
insertions/deletions and direct and inverted
repeats that are more difficult to find by the
other, more automated methods.
Disadvantages: Most dot matrix computer programs
do not show an actual alignment, and they do not
return a score to indicate how optimal a given
alignment is.

36
Dynamic Programming

Dynamic programming (DP) algorithms are a general
class of algorithms typically applied to
optimization problems.
For DP to be applicable, an optimization problem
must have two key ingredients:
a) Optimal substructure: an optimal solution
to the problem contains within it optimal
solutions to sub-problems.
b) Overlapping sub-problems: the pieces
of the larger problem have a sequential
dependency, so the same sub-problems are
needed repeatedly.

37
Dynamic Programming

DP works by solving every sub-problem just once
and saving its answer in a table, thereby
avoiding the work of recomputing the answer
every time the sub-problem is encountered. Each
intermediate answer is stored with a score, and
DP finally chooses the sequence of solutions
that yields the highest score.

38
Dynamic Programming

Path Matrix

39
Dynamic Programming

Both global and local types of alignments may be
made by simple changes in the basic DP algorithm.
Alignments depend on the choice of a scoring
system for comparing character pairs and penalty
scores (e.g. the PAM and BLOSUM matrices covered
before)
Scoring function example:
w(match) = 2 or substitution matrix
w(mismatch) = -1 or substitution matrix
w(gap) = -3

40
Dynamic Programming

Global Alignment (Needleman-Wunsch)
a) The general goal is to obtain an optimal
global alignment between two sequences, allowing
gaps.
b) We construct a matrix F indexed by i and j,
one index for each sequence, where the value
F(i,j) is the score of the best alignment between
the initial segment $x_{1..i}$ of x up to $x_i$
and the initial segment $y_{1..j}$ of y up to
$y_j$. We begin by initializing F(0,0) = 0.
We then proceed to fill the matrix from top
left to bottom right. If F(i-1,j-1),
F(i-1,j) and F(i,j-1) are known, it is
possible to calculate F(i,j).

41
Dynamic Programming

$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \end{cases}$$
where s(a,b) is the likelihood score
that residues a and b occur as an aligned
pair, and d is the gap penalty.
Once you construct the matrix, you trace back the
path that leads to F(n,m), which is by definition
the best score for an alignment of $x_{1..n}$ to
$y_{1..m}$.
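The fill and trace-back steps can be sketched as below. This is an illustrative version under simple assumptions: a match/mismatch score of 2 and -1 stands in for a substitution matrix, and the gap penalty is passed as a positive d = 3 that is subtracted, which corresponds to the example weights w(gap) = -3 given earlier. The first row and column are initialized to multiples of -d, the usual convention for global alignment.

```python
def needleman_wunsch(x, y, match=2, mismatch=-1, d=3):
    """Global alignment by dynamic programming.
    Returns (score, aligned_x, aligned_y)."""
    n, m = len(x), len(y)

    def s(a, b):
        return match if a == b else mismatch

    # F[i][j] = best score aligning x[:i] with y[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -i * d
    for j in range(1, m + 1):
        F[0][j] = -j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                F[i - 1][j] - d,
                F[i][j - 1] - d,
            )

    # Trace back from F[n][m] to F[0][0], rebuilding one optimal path.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


score, ax, ay = needleman_wunsch("GATTACA", "GATACA")
print(score)
print(ax)
print(ay)
```

When several paths tie, this trace-back prefers the diagonal move, so it returns one optimal alignment rather than all of them.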

42
Dynamic Programming

Global Dynamic programming matrix

43
Dynamic Programming

Local alignment (Smith-Waterman)
Two changes from global alignment:
1. F(i,j) can take the value 0 if all other
options have value less than 0. This corresponds
to starting a new alignment.
2. Alignments can end anywhere in the matrix, so
instead of taking the value in the bottom right
corner, F(n,m), as the best score, we look for
the highest value of F(i,j) over the whole matrix
and start the trace-back from there.
$$F(i,j) = \max \begin{cases} 0 \\ F(i-1,j-1) + s(x_i, y_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \end{cases}$$
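A short sketch of the two changes, again assuming a toy match/mismatch score of 2 and -1 and a positive gap penalty d = 3 that is subtracted. The trace-back starts at the highest cell and stops at the first cell with value 0.

```python
def smith_waterman(x, y, match=2, mismatch=-1, d=3):
    """Local alignment by dynamic programming.
    Returns (best_score, aligned_x, aligned_y)."""
    n, m = len(x), len(y)

    def s(a, b):
        return match if a == b else mismatch

    F = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                0,  # change 1: start a new alignment here
                F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                F[i - 1][j] - d,
                F[i][j - 1] - d,
            )
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)

    # Change 2: trace back from the highest-scoring cell.
    ax, ay = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


print(smith_waterman("XXGATTACAYY", "ZZGATTACAWW"))
```

The flanking residues that do not match are left out of the result, which is the practical difference from the global version.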

44
Dynamic Programming

Local Dynamic programming matrix

45
Dynamic Programming

Advantages: Guaranteed in a mathematical
sense to provide the optimal (very best or
highest-scoring) alignment for a given set of
scoring functions.
Disadvantages:
a) Slow due to the very large number of
computational steps, O(n^2).
b) Computer memory requirement also increases
as the square of the sequence lengths.
Therefore, it is difficult to use the
method for very long sequences.

46
FASTA and BLAST
47
FASTA - Idea -

Problem with Dynamic Programming
D.P. computes scores over large areas of the
matrix that are useless for the optimal alignment
FASTA focuses on the diagonal area

48
FASTA - Heuristic -

Heuristic
A good local alignment should contain some
exactly matching subsequences.

FASTA focuses on this area
49
FASTA - High-Level Algorithm -

High-level algorithm
Let q be a query
max ← 0
For each sequence s in DB
  compare q with s and compute a score y
  if y > max
    max ← y
    bestSequence ← s
Return bestSequence

50
FASTA - Algorithm -

Step 1
Find all hot spots
// Hot spots are pairs of words of length k
that exactly match

Sequence 1
Hot Spots
Sequence 2
51
FASTA - Algorithm -

Step 1 in detail
Use look-up Table
Query G A A T T C A G T T A
Sequence G G A T C G A

DotMatrix
Look-up Table
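The look-up table idea can be sketched with the query and sequence shown on this slide. The word length k = 2 is an assumption made so that the short example produces hits; the slides do not state the value used.

```python
from collections import defaultdict


def find_hot_spots(query, sequence, k=2):
    """Return (query_pos, sequence_pos) pairs where words of length k
    match exactly, using a look-up table built from the query."""
    table = defaultdict(list)
    for i in range(len(query) - k + 1):
        table[query[i:i + k]].append(i)

    hot_spots = []
    for j in range(len(sequence) - k + 1):
        # One table look-up per sequence position.
        for i in table.get(sequence[j:j + k], ()):
            hot_spots.append((i, j))
    return hot_spots


print(find_hot_spots("GAATTCAGTTA", "GGATCGA"))
```

Each hot spot lies on the diagonal `query_pos - sequence_pos` of the dot matrix, which is what the next step groups by.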
52
FASTA - Algorithm -

Step 2
Score the hot spots and locate the ten best
diagonal runs.
// Uses a scoring system, e.g. PAM250
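One way to sketch this step: for every diagonal that contains a hot spot, find the best-scoring contiguous run along it, then keep the top ten. The maximum-subarray scan and the toy `score` function are assumptions for illustration, standing in for PAM250-based scoring.

```python
import heapq


def best_run_on_diagonal(query, sequence, diag, score):
    """Best-scoring contiguous run on diagonal diag = i - j.
    Returns (run_score, query_start, query_end), end exclusive."""
    i, j = (diag, 0) if diag >= 0 else (0, -diag)
    best = (0, i, i)
    run, start = 0, i
    while i < len(query) and j < len(sequence):
        if run <= 0:            # a non-positive prefix never helps
            run, start = 0, i
        run += score(query[i], sequence[j])
        if run > best[0]:
            best = (run, start, i + 1)
        i += 1
        j += 1
    return best


def top_diagonal_runs(query, sequence, hot_spots, score, keep=10):
    """Score each diagonal holding a hot spot and keep the best runs.
    Each result is (run_score, diagonal, query_start, query_end)."""
    runs = []
    for diag in {i - j for i, j in hot_spots}:
        run_score, start, end = best_run_on_diagonal(query, sequence, diag, score)
        runs.append((run_score, diag, start, end))
    return heapq.nlargest(keep, runs)


def toy(a, b):
    """Hypothetical stand-in for a substitution-matrix lookup."""
    return 2 if a == b else -1


spots = [(0, 1), (2, 2), (4, 3), (0, 5)]
print(top_diagonal_runs("GAATTCAGTTA", "GGATCGA", spots, toy))
```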

53
FASTA - Algorithm -

Step 3
Combine sub-alignments into one alignment
with gaps

GAP
One of local alignment
54
FASTA - Algorithm -

Step 4
Consider a weighted directed graph.
Let each node be a sub-alignment found in step 1
Let u and v be nodes
Edge (u,v) exists if alignment u comes before v
in the sequence.
Each edge has a gap penalty (negative weight)
Find the maximum weight path

Sub-sequence
Edge
One Sequence
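The maximum weight path over sub-alignments can be sketched as a small dynamic program. This illustration assumes each sub-alignment is given as `(score, q_start, q_end, s_start, s_end)` with exclusive ends, that an edge u→v requires u to end before v starts in both sequences, and that every edge carries the same penalty. The slides show per-edge penalties, so the constant `gap_penalty` is a simplification.

```python
def best_chain(runs, gap_penalty=3):
    """Find the maximum weight path through sub-alignments.
    Node weight = run score; each edge subtracts gap_penalty.
    Returns (total_weight, list_of_run_indices_in_order)."""
    if not runs:
        return 0, []
    # Any valid predecessor starts earlier in the query, so sorting by
    # query start gives a topological order of the graph.
    order = sorted(range(len(runs)), key=lambda k: runs[k][1])
    best, prev = {}, {}
    for idx, v in enumerate(order):
        score_v, q_start, _, s_start, _ = runs[v]
        best[v], prev[v] = score_v, None
        for u in order[:idx]:
            _, _, u_q_end, _, u_s_end = runs[u]
            if u_q_end <= q_start and u_s_end <= s_start:
                candidate = best[u] + score_v - gap_penalty
                if candidate > best[v]:
                    best[v], prev[v] = candidate, u

    end = max(best, key=best.get)
    path, node = [], end
    while node is not None:
        path.append(node)
        node = prev[node]
    return best[end], path[::-1]


runs = [(10, 0, 5, 0, 5), (8, 6, 10, 7, 11), (4, 2, 4, 20, 22)]
print(best_chain(runs))
```

In the example, the third run cannot be chained with the others because it is out of order in the second sequence.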
55
FASTA - Algorithm -

Step 4 in detail

[Figure: sub-alignments joined by gap edges with weights -5, -3, -3; the max weight path]
56
FASTA - Algorithm -

Step 5
Use dynamic programming in a restricted area
around the best-scoring alignment to look for an
alignment that scores higher than it

Width of this band is a parameter
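Restricting dynamic programming to a band can be sketched as follows. This is a simplified global-alignment version under assumed toy scores (match 2, mismatch -1, gap penalty d = 3). `center` is the diagonal i - j around which the band is placed and `width` is the half-width. The full table is still allocated for clarity, so only the running time, not the memory, is reduced here.

```python
def banded_global_score(x, y, center=0, width=2,
                        match=2, mismatch=-1, d=3):
    """Best global alignment score using only cells with
    |(i - j) - center| <= width. Returns -inf if the end cell
    cannot be reached inside the band."""
    neg_inf = float("-inf")
    n, m = len(x), len(y)
    F = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    F[0][0] = 0
    for i in range(n + 1):
        lo = max(0, i - center - width)
        hi = min(m, i - center + width)
        for j in range(lo, hi + 1):      # only cells inside the band
            if i == 0 and j == 0:
                continue
            candidates = []
            if i > 0 and j > 0:
                pair = match if x[i - 1] == y[j - 1] else mismatch
                candidates.append(F[i - 1][j - 1] + pair)
            if i > 0:
                candidates.append(F[i - 1][j] - d)
            if j > 0:
                candidates.append(F[i][j - 1] - d)
            F[i][j] = max(candidates)
    return F[n][m]


print(banded_global_score("GATTACA", "GATACA", width=2))
print(banded_global_score("GATTACA", "GATACA", width=0))
```

A band that is too narrow can miss the alignment entirely, which is why the width is exposed as a parameter.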
57
FASTA - Algorithm -

Summary of Algorithm
1 Find all hot spots
// Hot spots are pairs of words of length k
that exactly match
2 Score the hot spots and locate the ten best
diagonal runs.
3 Combine sub-alignments into one alignment
4 Score each alignment with gap penalties and
pick the best-scoring alignment
5 Use dynamic programming in a restricted
area around the best-scoring alignment to look
for an alignment that scores higher than it.

58
FASTA - Complexity -

Complexity
Step 1 and 2 // select the best 10
diagonal runs
Let n be the length of a sequence from the DB
O(n), because Step 1 just looks up the table

59
FASTA - Complexity -

Step 3 and 4 // compute the max weight path
Let r be the number of sub-alignments (r = 10)
Let s be the number of edges
O(r^2)
→ about 1% of D.P., because r^2 = 10^2
and mn = 10^4

[Figure: grid over nodes n1, n2, n3; positive-weight nodes joined by edges weighted -5, -3, -3; max weight path]
60
FASTA - Complexity -

Step 5 // compute partial D.P.
Depends on the restricted area
Therefore, FASTA is faster than D.P.

Width of this band is a parameter
61
BLAST - Heuristic -

Another heuristic algorithm
Heuristic, but the result is evaluated
statistically.
Homologous sequences are likely to contain a
short high-scoring word pair, a hit.
BLAST tries to extend it on both sides to
get an optimal alignment.

A T T A G .
Sequence
Short high score Word
62
BLAST - Algorithm -
Neighborhood Word

Step 1: Preprocessing the query
Compile the list of short high-scoring words
from the query.
The query word length, w, is 3 for BLOSUM
scoring
Threshold T is 13

63
BLAST - Algorithm -

Step 1-2
Create neighborhood words for each query word

Query Word
Neighborhood words
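Generating neighborhood words can be sketched by brute force: every candidate word of length w is scored against each query word and kept if it reaches the threshold T. The `score` function, the alphabet, and the threshold in the example are hypothetical toy values. A real run would use a BLOSUM matrix over the 20 amino acids with w = 3 and T = 13 as stated above.

```python
from itertools import product


def neighborhood_words(query, score, alphabet, w=3, threshold=13):
    """Map each neighborhood word to the query positions whose word
    scores at least `threshold` against it."""
    neighbors = {}
    for pos in range(len(query) - w + 1):
        word = query[pos:pos + w]
        for candidate in product(alphabet, repeat=w):
            total = sum(score(a, b) for a, b in zip(word, candidate))
            if total >= threshold:
                neighbors.setdefault("".join(candidate), []).append(pos)
    return neighbors


def toy(a, b):
    """Hypothetical score: 5 for identity, -2 otherwise."""
    return 5 if a == b else -2


words = neighborhood_words("ACG", toy, "ACGT", w=3, threshold=8)
print(sorted(words))
```

With these toy values a word qualifies if it differs from the query word in at most one position.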
64
BLAST - Algorithm -

Step 2 Scanning DB
For each word list, identify all exact matches
with DB sequences

Query Word
Neighborhood Word list
Sequences in DB
Sequence 1
Sequence 2
Step 2
Step 1
The purpose of Steps 1 and 2 is the same as in FASTA
65
BLAST - Algorithm -

Step 2-2
Method 1 Hash Table
Query LAALLNKCKTPQGQRLVNQWIKQPLMD

Hash Table
Word list
66
BLAST - Algorithm -

Step 2-3
Method 2 Finite Automata

[Figure: finite automaton built from the word list; transition labels include A, G, L, I]
67
BLAST - Algorithm -

Step 3 (Search optimal alignment)
Let S be the score of a hit word
For each hit word, extend the ungapped alignment
in both directions.
Step 4 (Evaluate the alignment statistically)
Stop extension when the E-value (which depends on
score S) becomes less than the threshold. The
extended hit word is called a High-scoring
Segment Pair (HSP). BLAST returns it.
E-value: the number of HSPs having score S
(or higher) expected to
occur only by chance.
→ Smaller E-value: more statistically
significant
→ Larger E-value: more likely a chance match

A T T A G .
Sequence
Hit Word
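Ungapped extension of a hit in both directions can be sketched as below. The stopping rule used here, giving up once the running score falls more than `x_drop` below the best seen so far, is an assumption for this illustration and is not taken from the slides. `toy` is again a hypothetical scoring function.

```python
def extend_hit(query, subject, q_pos, s_pos, w, score, x_drop=5):
    """Extend a word hit of length w without gaps in both directions.
    Returns (total_score, query_start, query_end), end exclusive."""

    def extend(qi, si, step):
        best = run = 0
        best_len = length = 0
        while 0 <= qi < len(query) and 0 <= si < len(subject):
            run += score(query[qi], subject[si])
            length += 1
            if run > best:
                best, best_len = run, length
            elif best - run > x_drop:
                break               # score has dropped too far
            qi += step
            si += step
        return best, best_len

    seed = sum(score(query[q_pos + k], subject[s_pos + k]) for k in range(w))
    right, right_len = extend(q_pos + w, s_pos + w, +1)
    left, left_len = extend(q_pos - 1, s_pos - 1, -1)
    return seed + left + right, q_pos - left_len, q_pos + w + right_len


def toy(a, b):
    return 2 if a == b else -1


total, start, end = extend_hit("CCGATTACAGG", "AAGATTACATT", 4, 4, 3, toy)
print(total, "CCGATTACAGG"[start:end])
```

Only the best-scoring prefix of each extension is kept, so trailing mismatches are trimmed from the segment.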
68
BLAST - Algorithm -

Step 3-2
Definition of E-value
The expected number of HSPs with score at
least S is
$$E = K n m e^{-\lambda S}$$
K and λ are constants depending on the model
n and m are the lengths of the query and the
sequence
The probability of finding at least one such HSP
is
$$P = 1 - e^{-E}$$
→ If a hit is likely to occur by chance
(larger E-value),
P becomes larger, approaching 1.
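The two formulas translate directly into code. The values of K and λ used in the example are placeholders chosen for illustration. In practice they depend on the scoring model, as the slide notes.

```python
import math


def e_value(S, n, m, K, lam):
    """Expected number of HSPs with score at least S: K*n*m*exp(-lam*S)."""
    return K * n * m * math.exp(-lam * S)


def p_value(E):
    """Probability of at least one such HSP: 1 - exp(-E).
    expm1 keeps precision when E is very small."""
    return -math.expm1(-E)


# Placeholder parameters, for illustration only.
K, lam = 0.1, 0.3
for S in (20, 40, 60):
    E = e_value(S, n=153, m=300, K=K, lam=lam)
    print(S, E, p_value(E))
```

Higher scores give smaller E-values, and for small E the probability P is approximately equal to E.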

69
BLAST - Running Time -

Running Time
Query length: 153
DB size: 5997 sequences
PC: Pentium 4
By Dr. Takeshi Kawabata
Nara Sentan Gijyutu University

70
Comparison of Algorithm

Dynamic Programming
1. Most sensitive result
→ D.P. uses all the information in the two
sequences
2. Running time is slow
→ D.P. computes areas of the matrix that are
useless for finding the optimal alignment.

71
Comparison of Algorithm

FASTA
1. Less sensitive than D.P. and BLAST
→ FASTA uses partial information to speed up
the computation.
→ FASTA does not evaluate the result
statistically.
2. Running time is faster than D.P.
→ For the same reason as above.

72
Comparison of Algorithms

BLAST
1. More sensitive than FASTA
→ BLAST evaluates the result statistically.
2. Faster than FASTA
→ Because BLAST evaluates the entire DB with
the same threshold based on statistics, it
eliminates noise and reduces the running time.

73
FASTA vs BLAST