Sequence alignment algorithms - PowerPoint PPT Presentation

Provided by: compbi
1
Sequence alignment algorithms
Presented by Cary Miller, Sastry Akella, Daisuke Yasuda
2
Overview
  • Biological background / motivation / applications
  • Dot matrix / dynamic programming
  • FASTA / BLAST

3
biology
  • Biomolecules are strings from a restricted
    alphabet
  • Alphabet length 4: DNA
  • Alphabet length 20: protein
  • Proteins are the working part

4
Proteins
  • Protein is a linear sequence of 20 characters
    (amino acids)
  • Proteins do not maintain linearity
  • Folding happens
  • Folding determines overall 3-D shape
  • Shape determines function

5
Sequence Structure Function
  • Sequence does not reveal structure
  • Much less function
  • A sequence: ARTUVEDYERRWWUHUK

6
Structure
  • Pic 1
  • Pic 2

7
Structure
8
function
  • Protein A is a constituent of muscle, skin,
    cartilage, or
  • Protein B catalyzes the transformation of glucose
    to fructose, or
  • How do we find proteins with similar function?

9
Nature does not solve the same problem twice
(usually)
  • Short sequence with a specific function (or
    shape) is called a domain
  • The same domain appears in multiple proteins
  • If we find the same domain in multiple proteins
    that provides a clue to function and/or structure

10
Amino acids
  • Each has the same basic chemical configuration
    but has a functional group that makes it
    chemically unique
  • They occur in families
  • Some functional groups are similar

11
How biologists study proteins
  • Expensive (NMR, x-ray crystallography)
  • Discovery of function is difficult
  • Few proteins are understood in detail
  • Many are known by sequence
  • Sequence is easier to get than structure or
    function

12
A biological scenario
  • Biologist discovers the sequence of a new protein
    with unknown function
  • She has no idea of function
  • If sequence can be associated with a known
    protein sequence we have a clue about structure
    and/or function
  • Most proteins have unknown function

13
Public databases
  • Vast quantities of sequence, structure, function
    info is deposited into public databases
  • A new sequence should be compared to the database

14
Comparing sequences
  • Alignment with exact match:
      ABCTUVAB
      UV
    aligned as
      ABCTUVAB
      ----UV

15
Alignment with inexact match
  • Inexact:
      GARUIPPRST
      GARVVBUIEEYST
    aligned as
      GAR------UIPPRST
      GARVVBUIEEYST

16
Global vs. local alignment
  • ABQRTASGGBV
  • ABRRRASGVBB
  • ABQRTASGGBV
  • ABQ------SGGBV

17
A real alignment
  • Myoglobin:
      PDLRKY FKG-A ENFTA DDVQ KSDR
      PDTKAY FPKFG DLSTA AALK SSPK
  • Homology = common ancestry

18
Real alignment
  • Pic 3

19
Real alignment
20
Scoring pairs of amino acids
  • For amino acid pairs, assign a score based on
    frequency of substitution
      ATRGUVXQ
      ATRCVVXT
      ATRGVVEQ
      AT-----VVEQ

21
A substitution matrix
  • Pic 4

22
A substitution matrix
23
Substitution matrices
  • PAM and BLOSUM are standard substitution matrices
  • Scoring schemes also include scores for:
  • Gap opening
  • Gap extension

24
Scoring amino acid strings
  • Sum the individual pair scores
  • Database is huge
  • Spurious match to random sequence is likely
  • Try your name
  • E-value is the number of matches with at least a
    given score expected by chance from random sequence
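
As a rough illustration of "sum the individual pair scores", the sketch below scores two already-aligned rows column by column. The flat match, mismatch, and gap values are assumptions standing in for a real substitution matrix such as PAM or BLOSUM, and `score_alignment` is a hypothetical helper name, not something from the slides.

```python
def score_alignment(row_a, row_b, match=2, mismatch=-1, gap=-3):
    """Sum per-column scores of two equal-length aligned rows ('-' marks a gap)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            total += gap        # assumed flat gap score
        elif a == b:
            total += match      # assumed score for identical residues
        else:
            total += mismatch   # assumed score for a substitution
    return total


# Score the ungapped pair of rows from the global alignment slide.
print(score_alignment("ABQRTASGGBV", "ABRRRASGVBB"))
```

With a real matrix, the `match`/`mismatch` branches would become a table lookup on the residue pair.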

25
Alignment algorithms
  • Dot matrix
  • Dynamic programming
  • FASTA
  • BLAST

26
Dot Matrix and DP
27
Dot Matrix
  • Locating regions of similarity between two DNA or
    protein sequences provides a great deal of
    information about the function and structure of
    the query sequence.
  • Similar structure indicates homology, or similar
    evolution, which provides critical information
    about the functions of these sequences.

28
Dot Matrix Contd..
  • A dot matrix plot is a method of aligning two
    sequences to provide a picture of the homology
    between them.
  • The dot matrix plot is created by designating one
    sequence to be the subject and placing it on the
    horizontal axis and designating the second
    sequence to be the query and placing it on the
    vertical axis of the matrix.

29
Dot Matrix Contd..
  • At each position within the matrix, a point is
    plotted if the horizontal and vertical elements
    are identical.
  • Diagonal lines within the resulting matrix
    indicate regions of similarity. A simple dot
    matrix plot is shown in Figure A.
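
The plotting rule above can be sketched in a few lines. This is a minimal illustration, not the presenters' program; `dot_matrix` and `render` are hypothetical names, and the text rendering with `*` and `.` is an assumption made for display.

```python
def dot_matrix(subject, query):
    """Rows follow the query (vertical axis), columns the subject (horizontal axis).

    A cell is True when the two elements at that position are identical.
    """
    return [[q == s for s in subject] for q in query]


def render(matrix):
    """Draw a dot ('*') for every True cell and '.' elsewhere."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in matrix)


# Diagonal runs of '*' mark regions of similarity.
print(render(dot_matrix("GATTACA", "GATCA")))
```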

30
(No Transcript)
31
Dot Matrix with noise reduction
  • A certain percentage of the matches between
    sequence elements can be expected to be the
    result of the random nature of their evolution.
    These random matches are considered "noise" and
    are filtered out to enhance the diagonal lines.

32
Dot Matrix
  • Noise Reduction
  • a) Noise reduction in dot matrix can be done
    by centering a substring of elements of the
    query sequence over each element in the
    subject sequence and determining the number of
    corresponding elements within this window.

33
Dot Matrix
  • b) If the number of corresponding elements
    exceeds a specified threshold then a point is
    plotted for the center element. This is
    demonstrated in figure B.
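
A sketch of the windowed filter described in a) and b), under stated assumptions: the window is centered on each cell and runs along the diagonal, positions falling outside either sequence are ignored, and a point is plotted when the count reaches at least `threshold`. The function name and default values are illustrative only.

```python
def filtered_dot_matrix(subject, query, window=3, threshold=2):
    """Plot a point only where enough elements match inside a diagonal window."""
    half = window // 2
    rows = []
    for i in range(len(query)):
        row = []
        for j in range(len(subject)):
            matches = 0
            # Walk the diagonal window centered on cell (i, j).
            for k in range(-half, half + 1):
                qi, sj = i + k, j + k
                if 0 <= qi < len(query) and 0 <= sj < len(subject):
                    if query[qi] == subject[sj]:
                        matches += 1
            row.append(matches >= threshold)
        rows.append(row)
    return rows


# An isolated single-letter match is filtered out as noise.
print(filtered_dot_matrix("AXX", "YAY"))
```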

34
Dot Matrix (Figure B)
35
Dot Matrix
  • Advantages: Readily reveals the presence of
    insertions/deletions and direct and inverted
    repeats that are more difficult to find by the
    other, more automated methods.
  • Disadvantages: Most dot matrix computer programs
    do not show an actual alignment. Does not return
    a score to indicate how optimal a given
    alignment is.

36
Dynamic Programming
  • Dynamic programming (DP) algorithms are a general
    class of algorithms typically applied to
    optimization problems.
  • For DP to be applicable, an optimization problem
    must have two key ingredients:
  • a) Optimal substructure: an optimal solution
    to the problem contains within it optimal
    solutions to sub-problems.
  • b) Overlapping sub-problems: the pieces
    of the larger problem have a sequential
    dependency, so the same sub-problems recur.

37
Dynamic Programming
  • DP works by solving every sub-sub-problem
    just once and saving its answer in a table,
    thereby avoiding the work of recomputing the
    answer every time the sub-sub-problem is
    encountered. Each intermediate answer is stored
    with a score, and DP finally chooses the sequence
    of solutions that yields the highest score.

38
Dynamic Programming
  • Path Matrix

39
Dynamic Programming
  • Both global and local types of alignments may be
    made by simple changes in the basic DP algorithm.
  • Alignments depend on the choice of a scoring
    system for comparing character pairs and penalty
    scores (e.g. PAM and BLOSUM matrices covered
    before)
  • Scoring function example:
  • w(match) = +2, or substitution matrix
  • w(mismatch) = -1, or substitution matrix
  • w(gap) = -3

40
Dynamic Programming
  • Global Alignment (Needleman-Wunsch)
  • a) The general goal is to obtain an optimal
    global alignment between two sequences, allowing
    gaps.
  • b) We construct a matrix F indexed by i
    and j, one index for each sequence, where the
    value F(i,j) is the score of the best
    alignment between the initial segment x1..i of x
    up to xi and the initial segment y1..j of y up
    to yj. We begin by initializing F(0,0) = 0.
    We then proceed to fill the matrix from top
    left to bottom right. If F(i-1, j-1),
    F(i-1,j) and F(i,j-1) are known, it is
    possible to calculate F(i,j).

41
Dynamic Programming
  • F(i,j) = max { F(i-1, j-1) + s(xi, yj),
    F(i-1, j) - d, F(i, j-1) - d }
  • where s(a,b) is the likelihood score
    that residues a and b occur as an aligned
    pair, and d is the gap penalty (a positive cost
    that is subtracted).
  • Once you construct the matrix, you trace back the
    path that leads to F(n,m), which is by definition
    the best score for an alignment of x1..n to y1..m.
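
The recurrence and trace-back can be sketched as follows. This is a minimal version under stated assumptions: flat match/mismatch scores stand in for s(a,b), the gap penalty d is a positive number that is subtracted, and the first row and column are initialized to -i*d and -j*d (a common convention that the slide does not spell out).

```python
def needleman_wunsch(x, y, match=2, mismatch=-1, d=3):
    """Return (score, aligned_x, aligned_y) for one optimal global alignment."""
    def s(a, b):
        # Assumed flat scores in place of a substitution matrix.
        return match if a == b else mismatch

    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # leading gaps in y
        F[i][0] = -d * i
    for j in range(1, m + 1):          # leading gaps in x
        F[0][j] = -d * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)

    # Trace back from F(n, m) to F(0, 0), rebuilding the alignment in reverse.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-")
            i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


print(needleman_wunsch("ABC", "AC"))
```

When several alignments tie for the best score, this trace-back prefers the diagonal move, so it returns only one of them.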

42
Dynamic Programming
  • Global Dynamic programming matrix

43
Dynamic Programming
  • Local alignment (Smith-Waterman)
  • Two changes from global alignment:
  • 1. Possibility of taking the value 0 if all
    other options have value less than 0. This
    corresponds to starting a new alignment.
  • 2. Alignments can end anywhere in the matrix,
    so instead of taking the value in the bottom
    right corner, F(n,m), for the best score, we
    look for the highest value of F(i,j) over the
    whole matrix and start the trace-back from there.
  • F(i,j) = max { 0, F(i-1, j-1) + s(xi, yj),
    F(i-1, j) - d, F(i, j-1) - d }
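
A minimal sketch of the two changes, assuming flat match/mismatch scores in place of s(a,b) and a positive gap penalty d that is subtracted. The trace-back starts at the highest cell and stops at the first cell whose value is 0.

```python
def smith_waterman(x, y, match=2, mismatch=-1, d=3):
    """Return (score, aligned_x, aligned_y) for one best local alignment."""
    def s(a, b):
        # Assumed flat scores in place of a substitution matrix.
        return match if a == b else mismatch

    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]   # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(0,
                          F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
            if F[i][j] > best:                  # change 2: track the global maximum
                best, best_pos = F[i][j], (i, j)

    # Trace back from the maximum until a 0 cell (change 1) is reached.
    ax, ay = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1])
            i, j = i - 1, j - 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-")
            i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1])
            j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


print(smith_waterman("XXGATTACAYY", "ZZGATTACAWW"))
```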

44
Dynamic Programming
  • Local Dynamic programming matrix

45
Dynamic Programming
  • Advantages: Guaranteed in a mathematical
    sense to provide the optimal (very best or
    highest-scoring) alignment for a given set of
    scoring functions.
  • Disadvantages:
  • a) Slow due to the very large number of
    computational steps, O(n^2).
  • b) Computer memory requirement also increases
    as the square of the sequence lengths.
  • Therefore, it is difficult to use the
    method for very long sequences.

46
FASTA and BLAST
47
FASTA - Idea -
  • Problem of Dynamic Programming
  • D.P. computes scores over a large area of the
    matrix that is useless for the optimal alignment
  • FASTA focuses on the diagonal area

48
FASTA - Heuristic -
  • Heuristic
  • A good local alignment should contain some
    exactly matching subsequence.

FASTA focuses on this area
49
FASTA - High-Level Algorithm -
  • High-level algorithm
  • Let q be a query
  • max ← 0
  • For each sequence s in DB
  • compare q with s and compute a score y
  • if max < y
  • max ← y
  • bestSequence ← s
  • Return bestSequence
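
The loop above can be written out directly. In this sketch the scoring step is passed in as a function, because the slide leaves "compare q with s" abstract; `shared_words` is a toy stand-in chosen for illustration and is not FASTA's actual score.

```python
def best_match(query, database, score):
    """Scan the database and keep the sequence with the highest score."""
    best_score, best_sequence = 0, None
    for s in database:
        y = score(query, s)
        if best_score < y:
            best_score, best_sequence = y, s
    return best_sequence


def shared_words(q, s, k=3):
    """Toy score (assumption): number of length-k words of s that occur in q."""
    words = {q[i:i + k] for i in range(len(q) - k + 1)}
    return sum(1 for j in range(len(s) - k + 1) if s[j:j + k] in words)


db = ["TTTTTTT", "GGATTCA", "CCCCCCC"]
print(best_match("GAATTCAGTTA", db, shared_words))
```

Because the running maximum starts at 0 and the comparison is strict, the function returns `None` when no sequence scores above zero.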

50
FASTA - Algorithm -
  • Step 1
  • Find all hot-spots
  • // A hot spot is a pair of words of length k
    that match exactly

(Figure: hot spots between Sequence 1 and Sequence 2)
51
FASTA - Algorithm -
  • Step 1 in detail
  • Use look-up Table
  • Query G A A T T C A G T T A
  • Sequence G G A T C G A

(Figure: dot matrix and look-up table)
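
A sketch of the look-up table idea, using the query and sequence from this slide. The word length k = 2 is an assumption (the slide does not state k), and `find_hot_spots` is a hypothetical name. Each hot spot is reported with its diagonal, i - j, since hot spots on the same diagonal form a diagonal run.

```python
from collections import defaultdict


def find_hot_spots(query, sequence, k=2):
    """Return (query_pos, sequence_pos, diagonal) for every exact k-word match."""
    # Look-up table: each k-word of the query maps to its start positions.
    table = defaultdict(list)
    for i in range(len(query) - k + 1):
        table[query[i:i + k]].append(i)

    hot_spots = []
    # One pass over the database sequence, one table look-up per position.
    for j in range(len(sequence) - k + 1):
        for i in table.get(sequence[j:j + k], []):
            hot_spots.append((i, j, i - j))
    return hot_spots


print(find_hot_spots("GAATTCAGTTA", "GGATCGA"))
```

The table is built once per query, so scanning a database sequence is a single linear pass.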
52
FASTA - Algorithm -
  • Step 2
  • Score the hot spots and locate the ten best
    diagonal runs.
  • // A scoring system is used, e.g. PAM250

53
FASTA - Algorithm -
  • Step 3
  • Combine sub-alignments into one alignment
    with gaps

(Figure: sub-alignments joined across a gap into one local alignment)
54
FASTA - Algorithm -
  • Step 4
  • Consider a weighted directed graph.
  • Let each node be a sub-alignment found in step 1
  • Let u and v be nodes
  • Edge (u,v) exists if alignment u comes before
    v in the sequence.
  • Each edge has a gap penalty (negative weight)
  • Find the maximum weight path

(Figure labels: sub-sequence, edge, one sequence)
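
The maximum weight path in Step 4 can be found with a small dynamic program over the sub-alignments. This is a sketch under assumptions: each sub-alignment is a tuple (q_start, q_end, s_start, s_end, score), "u before v" means u ends before v starts in both sequences, and `gap_penalty` is a caller-supplied function returning a non-negative cost. None of these names come from the slides.

```python
def best_chain(runs, gap_penalty):
    """Return (weight, indices) of the maximum weight path through the runs."""
    if not runs:
        return 0, []
    # Sorting by query start guarantees every predecessor is processed first.
    order = sorted(range(len(runs)), key=lambda idx: runs[idx][0])
    best, prev = {}, {}
    for pos, v in enumerate(order):
        best[v], prev[v] = runs[v][4], None      # path consisting of v alone
        for u in order[:pos]:
            # Edge (u, v): u must end before v starts in both sequences.
            if runs[u][1] < runs[v][0] and runs[u][3] < runs[v][2]:
                cand = best[u] + runs[v][4] - gap_penalty(runs[u], runs[v])
                if cand > best[v]:
                    best[v], prev[v] = cand, u
    end = max(best, key=best.get)
    chain, node = [], end
    while node is not None:                      # follow back-pointers
        chain.append(node)
        node = prev[node]
    return best[end], chain[::-1]


runs = [(0, 4, 0, 4, 10), (6, 9, 7, 10, 8), (2, 5, 20, 23, 9)]
print(best_chain(runs, lambda u, v: 3))
```

With r sub-alignments the double loop does on the order of r^2 work.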
55
FASTA - Algorithm -
  • Step 4 in detail

(Figure: sub-alignments connected by gap edges with weights -5, -3, -3; max weight path)
56
FASTA - Algorithm -
  • Step 5
  • Use dynamic programming in a restricted area
    around the best-scoring alignment to look for an
    alignment that scores higher than it

Width of this band is a parameter
57
FASTA - Algorithm -
  • Summary of Algorithm
  • 1 Find all hot spots
  • // A hot spot is a pair of words of length k
    that match exactly
  • 2 Score the hot spots and locate the ten best
    diagonal runs.
  • 3 Combine sub-alignments into one alignment
  • 4 Score each alignment with gap penalties and
    pick the best-scoring alignment
  • 5 Use dynamic programming in a restricted
    area around the best-scoring alignment to look
    for an alignment that scores higher than it.

58
FASTA - Complexity -
  • Complexity
  • Steps 1 and 2 // select the 10 best
    diagonal runs
  • Let n be the length of a sequence from the DB
  • O(n), because Step 1 only uses the look-up
    table
  • O(n)

59
FASTA - Complexity -
  • Steps 3 and 4 // compute the max weight path
  • Let r be the number of sub-alignments (r ≤ 10)
  • Let s be the number of edges
  • O(r^2)
  • (Figure labels: n1, n2, n3)
  • → about 1% of the D.P. cost, because r^2 ≈ 10^2
    and mn ≈ 10^4

(Figure: positive node weights and gap edges with weights -5, -3, -3; max weight path)
60
FASTA - Complexity -
  • Step 5 // compute partial D.P.
  • Depends on the restricted area
  • Therefore, FASTA is faster than D.P.

Width of this band is a parameter
61
BLAST - Heuristic -
  • Another heuristic algorithm
  • Heuristic, but evaluates the result
    statistically.
  • Homologous sequences are likely to contain a
    short high-scoring word pair, a hit.
  • BLAST tries to extend it on both sides to
    get an optimal alignment.

(Figure: sequence A T T A G ... with a short high-scoring word marked)
62
BLAST - Algorithm -
Neighborhood Word
  • Step 1: preprocessing the query
  • Compile the list of short high-scoring words
    from the query.
  • The length of a query word, w, is 3 for BLOSUM
    scoring
  • Threshold T is 13

63
BLAST - Algorithm -
  • Step 1-2
  • Create neighborhood words for each query word

(Figure: a query word and its neighborhood words)
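
A sketch of how neighborhood words might be generated: enumerate every word of the same length and keep those whose score against the query word reaches the threshold T. The scores in `toy_score` and the five-letter alphabet are invented for illustration; a real run would look the scores up in BLOSUM and use T = 13 with all 20 amino acids.

```python
from itertools import product

# Toy residue groups (assumption): substitutions within a group score higher.
GROUP = {"I": "aliphatic", "L": "aliphatic", "V": "aliphatic",
         "K": "basic", "R": "basic"}


def toy_score(a, b):
    """Hypothetical pair score: 5 for identity, 2 within a group, -2 otherwise."""
    if a == b:
        return 5
    return 2 if GROUP[a] == GROUP[b] else -2


def neighborhood_words(word, alphabet, score, threshold):
    """All words of len(word) scoring at least `threshold` against `word`."""
    neighbors = []
    for cand in product(alphabet, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, cand))
        if total >= threshold:
            neighbors.append("".join(cand))
    return neighbors


print(neighborhood_words("LIV", "ILVKR", toy_score, threshold=12))
```

Brute-force enumeration costs |alphabet|^w score evaluations per query word, which is workable for w = 3.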
64
BLAST - Algorithm -
  • Step 2 Scanning DB
  • For each word list, identify all exact matches
    with DB sequences

(Figure: Step 1 builds the neighborhood word list from a query word; Step 2 matches the list against Sequence 1 and Sequence 2 in the DB)
The purpose of Steps 1 and 2 is the same as in FASTA
65
BLAST - Algorithm -
  • Step 2-2
  • Method 1 Hash Table
  • Query LAALLNKCKTPQGQRLVNQWIKQPLMD

(Figure: hash table and word list)
66
BLAST - Algorithm -
  • Step 2-3
  • Method 2 Finite Automata

(Figure: finite automaton with transitions labeled A,G, L, A, G, A, A, A, I)
67
BLAST - Algorithm -
  • Step 3 (Search optimal alignment)
  • Let S be the score of a hit word
  • For each hit word, extend the ungapped alignment
    in both directions.
  • Step 4 (Evaluate the alignment statistically)
  • Stop extending when the E-value (which depends
    on the score S) becomes less than the threshold.
    The resulting hit is called a High-scoring
    Segment Pair (HSP). BLAST returns it.
  • E-value: the number of HSPs having score S
    (or higher) expected to
  • occur by chance alone.
  • → Smaller E-value: more statistically
    significant
  • → Bigger E-value: more likely due to chance

(Figure: sequence A T T A G ... with the hit word marked)
68
BLAST - Algorithm -
  • Step 3-2
  • Definition of E-value
  • The expected number of HSPs with score at
    least S is
  • E = K n m e^(-λS)
  • K and λ are constants depending on the model
  • n, m are the lengths of the query and the sequence
  • The probability of finding at least one such HSP
    is
  • P = 1 - e^(-E)
  • → If a word is likely to be hit by chance
    (E-value is bigger),
  • P becomes larger (closer to 1).
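
The two formulas are easy to evaluate numerically. In the sketch below the values K = 0.1 and λ = 0.3 are placeholders chosen only to make the example run; they are not fitted constants from any scoring model. The query length and database count are taken from the running-time slide purely as sample inputs.

```python
import math


def expected_hsps(S, m, n, K, lam):
    """E = K * m * n * exp(-lam * S): expected number of HSPs scoring at least S."""
    return K * m * n * math.exp(-lam * S)


def prob_at_least_one(E):
    """P = 1 - exp(-E), using expm1 to stay accurate when E is tiny."""
    return -math.expm1(-E)


for S in (30, 50, 70):
    E = expected_hsps(S, m=153, n=5997, K=0.1, lam=0.3)  # placeholder K and lam
    print(S, E, prob_at_least_one(E))
```

E falls exponentially as the score S grows, and P tracks E closely once E is much smaller than 1.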

69
BLAST - Running Time -
  • Running Time
  • Query length: 153
  • DB size: 5997 sequences
  • PC: Pentium 4
  • By Dr. Takeshi Kawabata
  • Nara Sentan Gijyutu University

70
Comparison of Algorithm
  • Dynamic Programming
  • 1. Most sensitive result
  • → D.P. uses all the information in the two
    sequences
  • 2. Running time is slow
  • → D.P. computes areas of the matrix that are
    useless for finding the optimal alignment.

71
Comparison of Algorithm
  • FASTA
  • 1. Less sensitive than D.P. and BLAST
  • → FASTA uses partial information to speed up
    the computation.
  • → FASTA does not evaluate the result
    statistically.
  • 2. Running time is faster than D.P.
  • → For the same reason as above.

72
Comparison of Algorithms
  • BLAST
  • 1. More sensitive than FASTA
  • → BLAST evaluates the result statistically.
  • 2. Faster than FASTA
  • → Because BLAST evaluates the entire DB with
    the same threshold based on statistics, it
    eliminates noise and reduces the running time.

73
FASTA vs BLAST
  • BLAST
  • Compares the query with the sequences in the DB
  • using the same threshold.
  • FASTA
  • Compares the query with one sequence at a time
  • and then compares the individual results.

(Figure: the query compared against the DB under BLAST and under FASTA)
74
Conclusion