Pairwise Sequence Alignment - PowerPoint PPT Presentation

Slides: 32
Provided by: cs146
Transcript and Presenter's Notes

1
Pair-wise Sequence Alignment
  • What happened to the sequences of similar genes?
  • random mutation
  • deletion, insertion
  • What is pair-wise sequence alignment?

Seq. 1  515  EVIRMQDNNPFSFQSDVYSYG
             EVI      P     DV SY
Seq. 2  451  EVI---EHKPYNHKADVFSYA
  • Why pair-wise alignment?
  • Homology vs. similarity

2
Some concepts
  • Gaps
  • Gap penalty
  • Optimal alignment
  • Global alignment
  • Local alignment
  • Substitution matrix

3
Dotplot
  • A simplified representation
  • What dotplot shows
  • What dotplot does not show
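A dotplot can be sketched in a few lines. The snippet below is only an illustration, assuming the simplest definition (a dot wherever the two symbols are identical, with no window or threshold); the function names `dotplot` and `render` are hypothetical.

```python
def dotplot(x, y):
    """Return a matrix with True at (i, j) whenever x[i] == y[j]."""
    return [[a == b for b in y] for a in x]


def render(x, y):
    """Draw the dotplot as text: '*' marks identical symbols, '.' everything else."""
    lines = ["  " + " ".join(y)]
    for a, row in zip(x, dotplot(x, y)):
        lines.append(a + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


if __name__ == "__main__":
    # Runs of '*' along a diagonal correspond to shared subsequences.
    print(render("ACTCG", "ACAGTAG"))
```

Diagonal runs show where the sequences agree; the plot by itself gives no score and no single best alignment.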

4
Sequence Alignment
  • Total number of possible alignments for two sequences of length n
  • $\binom{2n}{n} \approx 2^{2n} / \sqrt{\pi n}$
  • Dynamic programming
  • a method for some optimization problems
  • determine a scoring scheme
  • best solution based on a scoring scheme
  • Needleman-Wunsch - global
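To see why exhaustive enumeration is hopeless, the count of alignments can be evaluated directly. This is a small sketch, assuming the count is the binomial coefficient C(2n, n) and using its Stirling approximation 4^n / sqrt(pi n); the function names are hypothetical.

```python
import math


def exact_alignment_count(n):
    """Exact value of C(2n, n)."""
    return math.comb(2 * n, n)


def approx_alignment_count(n):
    """Stirling approximation: 2^(2n) / sqrt(pi * n)."""
    return 4 ** n / math.sqrt(math.pi * n)


if __name__ == "__main__":
    for n in (10, 50, 100):
        exact = exact_alignment_count(n)
        print(n, exact, approx_alignment_count(n) / exact)
```

Dynamic programming replaces this exponential search with a table of (n + 1) x (n + 1) entries.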

5
Questions
  • How does it work?
  • How to come up with a DP approach to an exponential problem?
  • How to implement a DP approach?

6
Dynamic Programming Algorithm
  • Break a problem into subproblems
  • Solve each subproblem separately

$$
F(i,j) = \max
\begin{cases}
F(i-1,j-1) + s(x_i, y_j) \\
F(i,j-1) + g \\
F(i-1,j) + g
\end{cases}
$$

  • $F(i,j)$: the maximum score for aligning the first i symbols of sequence 1 with the first j symbols of sequence 2
  • $s(x_i, y_j)$: substitution score for aligning $x_i$ with $y_j$
  • $g$: gap penalty

7
Example
Match = 1, Mismatch = 0, Gap = -1
ACAGTAG
ACTCG
  • Initialization
  • matrix filling (scoring)
  • Trace back
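The three steps can be sketched as follows. This is a minimal illustration rather than a reference implementation: it assumes the scoring scheme of this example (match 1, mismatch 0, gap -1) as defaults, and when several tracebacks are optimal it returns only one of them (diagonal moves are preferred). The function name `needleman_wunsch` is an assumption.

```python
def needleman_wunsch(x, y, match=1, mismatch=0, gap=-1):
    """Global alignment score plus one optimal alignment of x and y."""
    m, n = len(x), len(y)
    # Initialization: aligning a prefix with nothing costs one gap per symbol.
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = i * gap
    for j in range(1, n + 1):
        F[0][j] = j * gap
    # Matrix filling: each cell takes the best of its three predecessors.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)
    # Trace back from (m, n) to (0, 0), re-deriving which move produced each cell.
    ax, ay = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if x[i - 1] == y[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + s:
                ax.append(x[i - 1])
                ay.append(y[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] + gap:
            ax.append(x[i - 1])
            ay.append("-")
            i -= 1
        else:
            ax.append("-")
            ay.append(y[j - 1])
            j -= 1
    return F[m][n], "".join(reversed(ax)), "".join(reversed(ay))


if __name__ == "__main__":
    score, top, bottom = needleman_wunsch("ACAGTAG", "ACTCG")
    print(score)
    print(top)
    print(bottom)
```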

8
Group Discussion (9/23/08, due in class)
  • Fill in the table in the previous example
  • What is the optimal alignment in this example?

9
j =      0   1   2   3   4   5   6   7
             A   C   A   G   T   A   G
i = 0
i = 1  A
i = 2  C
i = 3  T
i = 4  C
i = 5  G

10
Local Alignment: Smith-Waterman
  • Biological significance

$$
F(i,j) = \max
\begin{cases}
0 \\
F(i-1,j-1) + s(x_i, y_j) \\
F(i,j-1) + g \\
F(i-1,j) + g
\end{cases}
$$

  • $O(n^2)$ time
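A minimal sketch of the local recurrence follows. The scoring values (match +1, mismatch -1, gap -1) are assumptions chosen for illustration, since local alignment needs negative mismatch scores to be meaningful; the function name `smith_waterman` is also an assumption. Traceback starts at the largest cell and stops at the first cell with score 0.

```python
def smith_waterman(x, y, match=1, mismatch=-1, gap=-1):
    """Best local alignment score of x and y plus one optimal local alignment."""
    m, n = len(x), len(y)
    # Row 0 and column 0 stay at 0: a local alignment may start anywhere.
    F = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(0, F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)
    # Trace back from the largest cell until a cell with score 0 is reached.
    ax, ay = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1])
            ay.append(y[j - 1])
            i, j = i - 1, j - 1
        elif F[i][j] == F[i - 1][j] + gap:
            ax.append(x[i - 1])
            ay.append("-")
            i -= 1
        else:
            ax.append("-")
            ay.append(y[j - 1])
            j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


if __name__ == "__main__":
    print(smith_waterman("AACCTATAGCT", "GCGATATA"))
```

The two nested loops over an (m + 1) x (n + 1) table are where the quadratic running time comes from.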

11
Local Alignment
    A A C C T A T A G C T
G
C
G
A
T
A
T
A

Sequences: AACCTATAGCT and GCGATATA

12
Issues in alignment
  • Different ways to fill the table
  • Multiple optimal alignments
  • s(xi, yj) from substitution matrix
  • gap penalty
  • linear: $w(k) = gk$
  • affine: $w(k) = \begin{cases} h + gk, & k \ge 1 \\ 0, & k = 0 \end{cases}$

13
Gap models
  • New gap vs. gap extension
  • A gap of length k vs. k gaps of length 1
  • 1 insertion / deletion event vs. k events
  • gap penalty
  • linear: $w(k) = gk$
  • affine: $w(k) = \begin{cases} h + gk, & k \ge 1 \\ 0, & k = 0 \end{cases}$
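The difference between the two models shows up when one gap of length k is compared with k gaps of length 1. The sketch below assumes negative penalties, with the example values h = -3 and g = -1 chosen only for illustration; the function names are hypothetical.

```python
def linear_gap(k, g):
    """Linear model: w(k) = g * k."""
    return g * k


def affine_gap(k, h, g):
    """Affine model: w(k) = h + g * k for k >= 1, and 0 for k = 0."""
    return 0 if k == 0 else h + g * k


if __name__ == "__main__":
    h, g, k = -3, -1, 3  # example values, not taken from the slides
    print("linear, one gap of length k:", linear_gap(k, g))
    print("linear, k gaps of length 1: ", k * linear_gap(1, g))
    print("affine, one gap of length k:", affine_gap(k, h, g))
    print("affine, k gaps of length 1: ", k * affine_gap(1, h, g))
```

Under the linear model both cases cost the same; under the affine model a single long gap is cheaper than several short ones, which matches the view of one insertion / deletion event versus k events.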
14
Affine Gap Penalty
  • What is wrong with the F(i,j) formula if an affine gap penalty (AGP) is used?
  • Aligning the first i symbols of x with the first j symbols of y
  • $M(i, j)$: best score when $x_i$ is aligned with $y_j$
  • $I_x(i, j)$: best score when $x_i$ is aligned with a gap
  • $I_y(i, j)$: best score when $y_j$ is aligned with a gap
  • Three matrices

15
DP for global alignment for AGP
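$$
M(i,j) = \max
\begin{cases}
M(i-1,j-1) + s(x_i, y_j) \\
I_x(i-1,j-1) + s(x_i, y_j) \\
I_y(i-1,j-1) + s(x_i, y_j)
\end{cases}
$$

$$
I_x(i,j) = \max
\begin{cases}
M(i-1,j) + h + g \\
I_y(i-1,j) + h + g \\
I_x(i-1,j) + g
\end{cases}
$$

$$
I_y(i,j) = \max
\begin{cases}
M(i,j-1) + h + g \\
I_x(i,j-1) + h + g \\
I_y(i,j-1) + g
\end{cases}
$$
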
16
DP for global alignment using AGP
  • Initialization
  • $M(0, 0) = 0$
  • $I_x(i, 0) = h + g i$
  • $I_y(0, j) = h + g j$
  • all other cases: $-\infty$
  • Start at the largest of the three final entries
  • $M(m, n)$, $I_x(m, n)$, $I_y(m, n)$
  • Traceback to (0,0)
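The three-matrix scheme can be sketched as below. This is a score-only illustration (no traceback), assuming negative penalties with example defaults h = -3, g = -1 and a simple match / mismatch substitution score; the function name `affine_global_score` and all default values are assumptions.

```python
NEG_INF = float("-inf")


def affine_global_score(x, y, match=1, mismatch=-1, h=-3, g=-1):
    """Global alignment score with affine gap penalty w(k) = h + g*k."""
    m, n = len(x), len(y)
    M = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    Ix = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    Iy = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    # Initialization: only a leading run of gaps is allowed on the borders.
    M[0][0] = 0
    for i in range(1, m + 1):
        Ix[i][0] = h + g * i
    for j in range(1, n + 1):
        Iy[0][j] = h + g * j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            # x_i aligned with y_j: come from any state at (i-1, j-1).
            M[i][j] = max(M[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1]) + s
            # x_i aligned with a gap: open a new gap (h + g) or extend one (g).
            Ix[i][j] = max(M[i - 1][j] + h + g, Iy[i - 1][j] + h + g, Ix[i - 1][j] + g)
            # y_j aligned with a gap: same logic in the other direction.
            Iy[i][j] = max(M[i][j - 1] + h + g, Ix[i][j - 1] + h + g, Iy[i][j - 1] + g)
    return max(M[m][n], Ix[m][n], Iy[m][n])


if __name__ == "__main__":
    print(affine_global_score("AAAA", "AA"))
```

A traceback would additionally need to remember which of the three matrices each step came from.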

17
DP for local alignment for AGP
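$$
M(i,j) = \max
\begin{cases}
M(i-1,j-1) + s(x_i, y_j) \\
I_x(i-1,j-1) + s(x_i, y_j) \\
I_y(i-1,j-1) + s(x_i, y_j) \\
0
\end{cases}
$$

$$
I_x(i,j) = \max
\begin{cases}
M(i-1,j) + h + g \\
I_y(i-1,j) + h + g \quad \text{(ignored)} \\
I_x(i-1,j) + g
\end{cases}
$$

$$
I_y(i,j) = \max
\begin{cases}
M(i,j-1) + h + g \\
I_x(i,j-1) + h + g \quad \text{(ignored)} \\
I_y(i,j-1) + g
\end{cases}
$$
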
18
DP for Local Alignment for AGP
  • Initialization
  • $M(0, 0) = 0$
  • $I_x(i, 0) = 0$
  • $I_y(0, j) = 0$
  • all other cases: $-\infty$
  • Start at the largest $M(i, j)$, $I_x(i, j)$, $I_y(i, j)$
  • Trace back until $M(i, j) = 0$

19
Database searching methods
  • Idea: regions that are similar are likely to share
    short identical subsequences
  • Quick search for the regions, then check
    carefully locally

20
FASTA related methods
  • Heuristic approximation

1. Quick initial guess: common subsequences
  • Word, word size (2, 6), sensitivity vs. speed
  • Which words in the query also occur in the target?
  • Pre-computed table that stores the locations of words
    (hashing)
  • An example
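The word lookup can be sketched with a hash table. This is only an illustration, assuming the table is built over the target and then probed with every word of the query; the function names `build_word_index` and `find_word_hits` are hypothetical.

```python
from collections import defaultdict


def build_word_index(seq, k):
    """Map every word of length k in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(seq) - k + 1):
        index[seq[pos:pos + k]].append(pos)
    return index


def find_word_hits(query, target, k):
    """Return (query_pos, target_pos) for every word shared by query and target."""
    index = build_word_index(target, k)
    hits = []
    for q_pos in range(len(query) - k + 1):
        for t_pos in index.get(query[q_pos:q_pos + k], []):
            hits.append((q_pos, t_pos))
    return hits


if __name__ == "__main__":
    print(find_word_hits("ACGT", "TACGA", k=2))
```

A smaller word size finds more hits (more sensitive) but produces more candidates to process (slower).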

21
FASTA related methods
2. Find the regions with a high density of common
words
  • Process diagonals, rescore, join regions, using
    gaps
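Word hits that lie on the same diagonal (same offset between target and query position) belong to the same ungapped region. The sketch below covers only that counting step, not rescoring or joining; it assumes hits are given as (query_pos, target_pos) pairs, and the function name `best_diagonals` is hypothetical.

```python
from collections import Counter


def best_diagonals(hits, top=1):
    """Count word hits per diagonal (target_pos - query_pos); return the densest ones."""
    counts = Counter(t_pos - q_pos for q_pos, t_pos in hits)
    return counts.most_common(top)


if __name__ == "__main__":
    # Two hits share diagonal +1, one isolated hit lies on diagonal -5.
    print(best_diagonals([(0, 1), (1, 2), (5, 0)]))
```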

3. Local alignment (DP) in the region identified
  • Use Smith-Waterman method
  • in a band, 32 aa wide around the best score

22
Limitation of FASTA
  • Speed vs. sensitivity
  • Identical words
  • Can miss biologically significant similarity
  • some related proteins do not share identical amino acids
  • initial step
  • Different codons can encode the same amino acid

23
BLAST
  • Previous two kinds of approaches
  • Theoretically sound
  • search for common subsequences
  • 1. Word list
  • Incorporate similarity measurement for words
  • PAM120
  • e.g. ACDE
  • Scan for word occurrences
  • hash table
  • Finite state machine
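The word list can be illustrated by enumerating every word that scores at least a threshold T against a query word. The sketch below does not use PAM120: it substitutes a toy score (identical +2, different -1) that is purely an assumption, and the names `toy_score`, `neighborhood` and the threshold value are hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(a, b):
    """Stand-in for a substitution matrix entry (NOT PAM120): +2 identical, -1 otherwise."""
    return 2 if a == b else -1


def neighborhood(word, threshold, score=toy_score):
    """All words of the same length scoring at least `threshold` against `word`."""
    result = []
    for cand in product(AMINO_ACIDS, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, cand))
        if total >= threshold:
            result.append("".join(cand))
    return result


if __name__ == "__main__":
    words = neighborhood("ACDE", threshold=5)
    print(len(words), words[:5])
```

With a real substitution matrix the neighborhood contains biologically similar words rather than just words with one changed letter, which is what lets the scan find hits that share no identical word.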

24
BLAST
  • 2. Extend words to HSPs (high-scoring segment pairs, locally optimal pairs)
  • Find additional words within threshold
  • Merge within distance A
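Extending a word hit can be sketched as an ungapped walk along the hit's diagonal in both directions, keeping the extension with the best running score. This is a simplification: it scans to the end of the sequences rather than stopping early with a drop-off cutoff, and the scores (match +1, mismatch -1) and function names are assumptions.

```python
def best_gain(pairs, match, mismatch):
    """Best cumulative score over prefixes of `pairs`, and the prefix length reaching it."""
    best = run = length = 0
    for step, (a, b) in enumerate(pairs, start=1):
        run += match if a == b else mismatch
        if run > best:
            best, length = run, step
    return best, length


def extend_hit(x, y, i, j, k, match=1, mismatch=-1):
    """Extend the word hit x[i:i+k] / y[j:j+k] without gaps.

    Returns (score, start_in_x, start_in_y, length) of the extended segment pair.
    """
    seed = sum(match if a == b else mismatch for a, b in zip(x[i:i + k], y[j:j + k]))
    gain_r, len_r = best_gain(zip(x[i + k:], y[j + k:]), match, mismatch)
    gain_l, len_l = best_gain(zip(reversed(x[:i]), reversed(y[:j])), match, mismatch)
    return seed + gain_l + gain_r, i - len_l, j - len_l, k + len_l + len_r


if __name__ == "__main__":
    print(extend_hit("GGTATACC", "AATATATT", i=2, j=2, k=3))
```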

3. Select significant HSPs, then use DP in a banded region

25
Multiple Sequence Alignment
  • Motivation
  • What is MSA?
  • How do we extend knowledge of pair-wise alignment?
  • An example AGAC, AC, AG

Pair-wise alignments:
  AGAC    AGAC    AC
  --AC    AG--    AG
Some possibilities:
  AG--
  --AC
  AGAC
  • Fix pair-wise alignment and then add?
  • Evaluate all the possible alignment of N
    sequences?

26
Scoring MSA
  • Sum of pairs (SP) scoring methods

Given an alignment of N sequences, each of length L (an L x N alignment):
take the pair-wise sum for each column, then sum over all columns.
  • Example
  • c(match) = 1, c(mismatch) = -1, c(gap) = -2, c(gap, gap) = 0
  • SP4 = SP(I, -, I, V) = -2 + 1 - 1 - 2 - 2 - 1 = -7
  • SP = SP1 + SP2 + ... + SP8

AQPILLLV
ALR-LL-
AK-ILLL-
CPPVLILV
  • SP tends to overweight a single mutation

SP(A,A,A,C) = 0, SP(A,A,A,A) = 6
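The sum-of-pairs score can be sketched directly from its definition, using the scores of this example (match 1, mismatch -1, gap -2, gap against gap 0) as defaults. The function names are hypothetical.

```python
from itertools import combinations


def pair_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score one pair of symbols in a column; two gaps score 0."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sp_column(column):
    """Sum the scores of all unordered pairs of symbols in one column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))


def sp_alignment(rows):
    """Sum the column scores over all columns of an alignment (rows of equal length)."""
    return sum(sp_column(col) for col in zip(*rows))


if __name__ == "__main__":
    print(sp_column("I-IV"), sp_column("AAAC"), sp_column("AAAA"))
```

Changing one symbol out of four drops the column score from 6 to 0, because that single symbol takes part in three of the six pairs.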
27
Extension of DP for N sequences
  • Extend F(i,j) for N dimensions
  • DP of N dimensions using SP
  • Time on the order of $L^N (2^N - 1) N^2 = O((2L)^N N^2)$
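Plugging numbers into this bound shows how quickly it grows with the number of sequences. The sketch below simply evaluates the expression; the function name `msa_dp_cost` and the example sizes are assumptions.

```python
def msa_dp_cost(L, N):
    """Evaluate L^N * (2^N - 1) * N^2: cells, predecessors per cell, pairs per SP score."""
    return (L ** N) * (2 ** N - 1) * (N ** 2)


if __name__ == "__main__":
    for n_seqs in (2, 3, 5, 10):
        print(n_seqs, msa_dp_cost(100, n_seqs))
```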

28
STAR method
  • DP provides an optimal solution but is costly
  • Heuristic methods: STAR, CLUSTALW, ...
  • Progressive alignment
  • STAR
  • - align all pairs (pair-wise)
  • - build a similarity matrix
  • - find a star sequence
  • - use the star to align the other sequences
  • - once a gap, always a gap
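The first three STAR steps can be sketched as follows: score every pair, sum each row of the similarity matrix, and take the sequence with the largest total as the star. The sketch stops there and does not perform the merging step; the pair-wise scores (match 1, mismatch 0, gap -1) and the function names are assumptions.

```python
def nw_score(x, y, match=1, mismatch=0, gap=-1):
    """Global alignment score of x and y, keeping only one DP row at a time."""
    prev = [j * gap for j in range(len(y) + 1)]
    for i, a in enumerate(x, start=1):
        cur = [i * gap]
        for j, b in enumerate(y, start=1):
            s = match if a == b else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def pick_star(seqs):
    """Return (index of the star sequence, total pair-wise score of each sequence)."""
    n = len(seqs)
    totals = [
        sum(nw_score(seqs[a], seqs[b]) for b in range(n) if b != a)
        for a in range(n)
    ]
    star = max(range(n), key=totals.__getitem__)
    return star, totals


if __name__ == "__main__":
    print(pick_star(["ACGT", "ACGA", "TCGA"]))
```

Ties are broken in favour of the earliest sequence in the list, which is an arbitrary choice.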

29
STAR method
  • Example

30
CLUSTAL family
  • What are the disadvantages of the STAR method?
  • Build a similarity tree (clustering)
  • Alignment starts with the most similar sequences
  • Pair-wise alignment -> distance matrix
  • Fast approximate approach or DP

31
CLUSTALW
2. Construct a similarity tree (the guide tree): UPGMA (unweighted
pair-group method using arithmetic averages)
3. Progressive alignment
  • Start with most similar sequences
  • Align group with group using pair-wise alignment
  • e.g.
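The guide-tree step can be sketched with a small UPGMA implementation. This is an illustration under stated assumptions, not CLUSTALW's own code: distances are passed as a dictionary keyed by `frozenset({a, b})`, the tree is returned as nested tuples without branch lengths, and the function name `upgma` is hypothetical.

```python
from itertools import combinations


def upgma(names, dist):
    """Build a guide tree by repeatedly merging the two closest clusters.

    The distance between two clusters is the arithmetic average of the
    distances between all pairs of their leaves.
    """
    # Key: the subtree built so far (leaf name or nested tuple); value: its leaves.
    clusters = {name: [name] for name in names}

    def avg_dist(c1, c2):
        pairs = [(a, b) for a in clusters[c1] for b in clusters[c2]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: avg_dist(*pair))
        merged_leaves = clusters.pop(c1) + clusters.pop(c2)
        clusters[(c1, c2)] = merged_leaves
    return next(iter(clusters))


if __name__ == "__main__":
    d = {
        frozenset("AB"): 2, frozenset("AC"): 6, frozenset("BC"): 6,
        frozenset("AD"): 10, frozenset("BD"): 10, frozenset("CD"): 10,
    }
    print(upgma(["A", "B", "C", "D"], d))
```

Progressive alignment would then follow the tree from the innermost pair outwards, aligning the most similar sequences first.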