Sequence Alignment - PowerPoint PPT Presentation

1 / 36
About This Presentation
Title:

Sequence Alignment

Description:

we need a score for the substitution of the symbol i with j ... 'atoh' enhancer in human, mouse, rat, fugu fish. The local alignment problem ... – PowerPoint PPT presentation

Number of Views:22
Avg rating:3.0/5.0
Slides: 37
Provided by: CIRB
Category:

less

Transcript and Presenter's Notes

Title: Sequence Alignment


1
Sequence Alignment
2
Sequence Alignment
AGGCTATCACCTGACCTCCAGGCCGATGCCC TAGCTATCACGACCGCGG
TCGATTTGCCCGAC
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGA
TTTGCCCGAC
3
Distance from sequences
  • Mutation events
  • Mutation
  • we need a score for the substitution of the
    symbol i with j
  • (amino acidic residues, nucleotides, etc.)
  • substitution matrices s(i,j)
  • A ALASVLIRLITRLYP
  • B ASAVHLNRLITRLYP

The substitution matrix should account for the
underlying biological process (conserved
structures or functions)
4
Substitution matrices
The basic idea is to measure the correlation
between 2 sequences Given a pair of correlated
sequences we measure the substitution frequency
of a-gt b ( assuming symmetry) Pab The null
hypothesis (random correlation or independent
events) is the product PaPb We then define s
(a,b) log(Pab/PaPb) (log is additive so that
the probability gt sum over pairs) Minimal
distance Maximum score (s)
5
Substitution matrices
Following this idea several matrices have been
derived Their main difference is relative to the
alignment types used for computing the
frequencies PAMx (Point Accepted Mutation).
Number x of mutation events. The matrix is
built as A1ik P(ki) for sequences with 1
mutations. Anik(A1ik)n n mutation events
(number of mutations NOT mutated
residues. E.g. n2 P(ik) Pa P(ia) P(ak)
PAMn Log(Anij /Pi)
6
PAM10
Very stringent matrix, no out of diagonal element
is gt 0
7
PAM160
Some positive values outside the diagonal appear
8
PAM250
This is one of the most widely used matrix
9
PAM500
10
Substitution matrices
PAM matrices are computed starting from very
similar sequences and more distance scoring
matrices are derived by matrix products BLOSUMx
This family is computed directly from blocks of
alignments with a defined sequence identity
threshold (gt x). gt For very related
sequences we must use PAM with low numbers and
BLOSUM with large numbers. The opposite holds for
distant sequences
11
BLOSUM62
Probably the most widely used
12
BLOSUM90
13
BLOSUM30
14
Distance between sequences
What kind of operations we consider? Mutation
Deletion and Insertion (rarely rearrangements
) A ALASVLIRLIT--YP B ASAVHL---ITRLYP
The negative gap score value depends only on the
number of holes s(n) -nd linear s(n) -d -
(n-1)e affine (d open e extension) !! Please
note that all scores are position-independent
along the sequence
15
Pairwise alignment
Given 2 sequences what is an alignment with a
maximum score? Naïf solution try every
possible alignments and select one with the best
score Using the score function
16
How may alignments there are?
Case without internal gaps --AAA -AAA AAA
AAA AAA- AAA-- BB--- BB-- BB- -BB --BB
---BB Given 2 sequences of length m e n, the
number of shifts is m n 1
17
How may alignments there are?
Case with internal gaps --AAA -AAA -AAA
-AAA A-AA BB--- BB-- B-B- B--B
BB-- BBAAA BABAA BAABA BAAAB ABBAA AAA
AAA AA-A AAA AAA- BB- B-B -BB-
-BB --BB ... ABABA ABAAB AABBA AABAB AAABB
The number of the alignments is equal at the
number of all possible way of mixing 2 sequences
keeping track of the original sequence order.
Given 2 sequences of lengths n and m, they are
?k0,min(m,n)(mn-k)!/k!(n-k)!(m-k)! For
nm80 we get gt 1043 possible alignments !!!!!!!
18
Basic idea
We can keep a table of the precomputed substring
alignments (dynamic programming) ALSKLASPALSAKDL
DSPALS ALSKIADSLAPIKDLSPASLT ALSKLASPALSAKDLDSPAL
-S ALSKIADSLAPIKDLSPASLT- ALSKLASPALSAKDLDSPALS-
ALSKIADSLAPIKDLSPASL-T
19
Basic idea
Building step by step Given the 2
sequences ALSKLASPALSAKDLDSPALS,
ALSKIADSLAPIKDLSPASLT The best alignment between
the two substrings ALSKLASPA ALSKIAD can be
computed taking only into consideration
ALSKLASP A ALSKLASP A ALSKLASPA - ALSKIA
D ALSKIAD - ALSKIA D The best among
these 3 possibilities



20
The Needleman-Wunsch Matrix
x1 xM
Every nondecreasing path from (0,0) to (M, N)
corresponds to an alignment of the two
sequences
yN y1
21
The Needleman-Wunsch Algorithm
  • x AGTA m 1
  • y ATA s -1
  • d -1

F(i,j) i 0 1 2
3 4
Optimal Alignment F(4,3) 2 AGTA A - TA
j 0
1
2
3
22
The Needleman-Wunsch Algorithm
  • Initialization.
  • F(0, 0) 0
  • F(0, j) - j ? d
  • F(i, 0) - i ? d
  • Main Iteration. Filling-in partial alignments
  • For each i 1M
  • For each j 1N
  • F(i-1,j) d case 1
  • F(i, j) max F(i, j-1) d case
    2
  • F(i-1, j-1) s(xi, yj) case 3
  • UP, if case 1
  • Ptr(i,j) LEFT if case 2
  • DIAG if case 3
  • Termination. F(M, N) is the optimal score, and
  • from Ptr(M, N) can trace back optimal alignment

23
The Overlap Detection variant
x1 xN
  • Changes
  • Initialization
  • For all i, j,
  • F(i, 0) 0
  • F(0, j) 0
  • Termination
  • maxi F(i, N)
  • FOPT max maxj F(M, j)

yM y1
24
Structure of a genome
a gene
transcription
pre-mRNA
splicing
mature mRNA
translation
Human 3x109 bp Genome 30,000 genes
200,000 exons 23 Mb coding 15 Mb
noncoding
protein
25
Structure of a genome
gene D
A
B
Make D
C
If B then NOT D
If A and B then D
short sequences regulate expression of
genes lots of junk sequence e.g. 50
repeats selfish DNA
gene B
Make B
D
C
If D then B
26
Cross-species genome similarity
  • 98 of genes are conserved between any two
    mammals
  • 75 average similarity in protein sequence

hum_a GTTGACAATAGAGGGTCTGGCAGAGGCTC------------
--------- _at_ 57331/400001 mus_a
GCTGACAATAGAGGGGCTGGCAGAGGCTC---------------------
_at_ 78560/400001 rat_a GCTGACAATAGAGGGGCTGGCAGAGA
CTC--------------------- _at_ 112658/369938 fug_a
TTTGTTGATGGGGAGCGTGCATTAATTTCAGGCTATTGTTAACAGGCTCG
_at_ 36008/68174 hum_a CTGGCCGCGGTGCGGAGCGTCTGGA
GCGGAGCACGCGCTGTCAGCTGGTG _at_ 57381/400001 mus_a
CTGGCCCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG
_at_ 78610/400001 rat_a CTGGCCCCGGTGCGGAGCGTCTGGAG
CGGAGCACGCGCTGTCAGCTGGTG _at_ 112708/369938 fug_a
TGGGCCGAGGTGTTGGATGGCCTGAGTGAAGCACGCGCTGTCAGCTGGCG
_at_ 36058/68174 hum_a AGCGCACTCTCCTTTCAGGCAGCT
CCCCGGGGAGCTGTGCGGCCACATTT _at_ 57431/400001 mus_a
AGCGCACTCG-CTTTCAGGCCGCTCCCCGGGGAGCTGAGCGGCCACATTT
_at_ 78659/400001 rat_a AGCGCACTCG-CTTTCAGGCCGCTCC
CCGGGGAGCTGCGCGGCCACATTT _at_ 112757/369938 fug_a
AGCGCTCGCG------------------------AGTCCCTGCCGTGTCC
_at_ 36084/68174 hum_a AACACCATCATCACCCCTCCCCGGC
CTCCTCAACCTCGGCCTCCTCCTCG _at_ 57481/400001 mus_a
AACACCGTCGTCA-CCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG
_at_ 78708/400001 rat_a AACACCGTCGTCA-CCCTCCCCGGCC
TCCTCAACCTCGGCCTCCTCCTCG _at_ 112806/369938 fug_a
CCGAGGACCCTGA-------------------------------------
_at_ 36097/68174
atoh enhancer in human, mouse, rat, fugu fish
27
The local alignment problem
  • Given two strings x x1xM,
  • y y1yN
  • Find substrings x, y whose similarity
  • (optimal global alignment value)
  • is maximum
  • e.g. x aaaacccccgggg
  • y cccgggaaccaacc

28
Why local alignment
  • Genes are shuffled between genomes
  • Portions of proteins (domains) are often conserved

29
The Smith-Waterman algorithm
  • Idea Ignore badly aligning regions
  • Modifications to Needleman-Wunsch
  • Initialization F(0, j) F(i, 0) 0
  • 0
  • Iteration F(i, j) max F(i 1, j) d
  • F(i, j 1) d
  • F(i 1, j 1) s(xi, yj)

30
The Smith-Waterman algorithm
  • Termination
  • If we want the best local alignment
  • FOPT maxi,j F(i, j)
  • If we want all local alignments scoring gt t
  • For all i, j find F(i, j) gt t, and trace back

31
Scoring the gaps more accurately
  • Current model
  • Gap of length n
  • incurs penalty n?d
  • However, gaps usually occur in bunches
  • Convex gap penalty function
  • ?(n)
  • for all n, ?(n 1) - ?(n) ? ?(n) - ?(n 1)

32
General gap dynamic programming
  • Initialization same
  • Iteration
  • F(i-1, j-1) s(xi, yj)
  • F(i, j) max maxk0i-1F(k,j) ?(i-k)
  • maxk0j-1F(i,k) ?(j-k)
  • Termination same
  • Running Time O(N2M) (assume NgtM)
  • Space O(NM)

33
Compromise affine gaps
  • ?(n) d (n 1)?e
  • gap gap
  • open extend
  • To compute optimal alignment,
  • At position i,j, need to remember best score if
    gap is open
  • best score if gap is not open
  • F(i, j) score of alignment x1xi to y1yj
  • if xi aligns to yj
  • G(i, j) score if xi, or yj, aligns to a gap

34
Needleman-Wunsch with affine gaps
  • Initialization F(i, 0) d (i 1)?e
  • F(0, j) d (j 1)?e
  • Iteration
  • F(i 1, j 1) s(xi, yj)
  • F(i, j) max
  • G(i 1, j 1) s(xi, yj)
  • F(i 1, j) d
  • F(i, j 1) d
  • G(i, j) max
  • G(i, j 1) e
  • G(i 1, j) e
  • Termination same

35
Bounded Dynamic Programming
  • Assume we know that x and y are very similar
  • Assumption gaps(x, y) lt k(N) ( say NgtM )
  • xi
  • Then, implies i j lt k(N)
  • yj
  • We can align x and y more efficiently
  • Time, Space O(N ? k(N)) ltlt O(N2)

36
Bounded Dynamic Programming
  • Initialization
  • F(i,0), F(0,j) undefined for i, j gt k
  • Iteration
  • For i 1M
  • For j max(1, i k)min(N, ik)
  • F(i 1, j 1) s(xi, yj)
  • F(i, j) max F(i, j 1) d, if j gt i k(N)
  • F(i 1, j) d, if j lt i k(N)
  • Termination same
  • Easy to extend to the affine gap case

x1 xM
y1 yN
k(N)
Write a Comment
User Comments (0)
About PowerShow.com