Title: Sequence Alignment
1Sequence Alignment
2Sequence Alignment
AGGCTATCACCTGACCTCCAGGCCGATGCCC TAGCTATCACGACCGCGG
TCGATTTGCCCGAC
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGA
TTTGCCCGAC
3Distance from sequences
Mutation events Mutation we need a score for the
substitution of the symbol i with j (amino
acidic residues, nucleotides, etc.)
substitution matrices s(i,j)? A ALASVLIRLITR
LYP B ASAVHLNRLITRLYP
The substitution matrix should account for the
underlying biological process (conserved
structures or functions)?
4Substitution matrices
The basic idea is to measure the correlation
between 2 sequences Given a pair of correlated
sequences we measure the substitution frequency
of a-gt b ( assuming symmetry) Pab The null
hypothesis (random correlation or independent
events) is the product PaPb We then define s
(a,b) log(Pab/PaPb)? (log is additive so that
the probability gt sum over pairs) Minimal
distance Maximum score (s)?
5Substitution matrices
Following this idea several matrices have been
derived Their main difference is relative to the
alignment types used for computing the
frequencies PAMx (Point Accepted Mutation).
Number x of mutation events. The matrix is
built as A1ik P(ki) for sequences with 1
mutations. Anik(A1ik)n n mutation events
(number of mutations NOT mutated
residues. E.g. n2 P(ik) Sa?P(ia) P(ak)
PAMn Log(Anij /Pi)?
6PAM10
Very stringent matrix, no out of diagonal element
is gt 0
7PAM160
Some positive values outside the diagonal appear
8PAM250
This is one of the most widely used matrix
9PAM500
10Substitution matrices
PAM matrices are computed starting from very
similar sequences and more distance scoring
matrices are derived by matrix products BLOSUMx
This family is computed directly from blocks of
alignments with a defined sequence identity
threshold (gt x). gt For very related
sequences we must use PAM with low numbers and
BLOSUM with large numbers. The opposite holds for
distant sequences
11BLOSUM62
Probably the most widely used
12BLOSUM90
13BLOSUM30
14Exercise
- Write your own substitution Matrix
- Given e set of aligned sequences compute the
- Pabfrequency of mutations between a-gtb (assume
symmetry, a-gtb counts also as b-gta). - Paas the marginal probability of Pab
- Finally, derive the substitution matrix
- s (a,b) log(Pab/PaPb)?
- Use the files provided at
- http//www.biocomp.unibo.it/piero/ICSA/Malignments
/ - Hints
- start with toy.aln to check your algorithm
- 2. Initialize your Pab matrix with
pseudocounts (such as 1 instead of 0)?
15DotPlots
16DotPlot Exercise
Write a program dotplot.py That takes as input a
fasta file with two sequences A scoring matrix
and sliding window, such as dotplot.py fasta.in
score.mat 11 and prints on the standard oputput
the dotplot.
17Distance between sequences
What kind of operations we consider? Mutation
Deletion and Insertion (rarely rearrangements
)? A ALASVLIRLIT--YP B ASAVHL---ITRLYP
The negative gap score value depends only on the
number of holes ?(n) -nd linear ?(n) -d -
(n-1)e affine (d open e extension)? !! Please
note that all scores are position-independent
along the sequence
18Pairwise alignment
Given 2 sequences what is an alignment with a
maximum score? Naïf solution try every
possible alignments and select one with the best
score Using the score function
19How may alignments there are?
Case without internal gaps --AAA -AAA AAA
AAA AAA- AAA-- BB--- BB-- BB- -BB --BB
---BB Given 2 sequences of length m e n, the
number of shifts is m n 1
20How may alignments there are?
Case with internal gaps --AAA -AAA -AAA
-AAA A-AA BB--- BB-- B-B- B--B
BB-- BBAAA BABAA BAABA BAAAB ABBAA AAA
AAA AA-A AAA AAA- BB- B-B -BB-
-BB --BB ... ABABA ABAAB AABBA AABAB AAABB
The number of the alignments is equal at the
number of all possible way of mixing 2 sequences
keeping track of the original sequence order.
Given 2 sequences of lengths n and m, they are
?k0,min(m,n)(mn-k)!/k!(n-k)!(m-k)! For
nm80 we get gt 1043 possible alignments !!!!!!!
21Basic idea
We can keep a table of the precomputed substring
alignments (dynamic programming)? ALSKLASPALSAKD
LDSPALS ALSKIADSLAPIKDLSPASLT ALSKLASPALSAKDLDSPA
L-S ALSKIADSLAPIKDLSPASLT- ALSKLASPALSAKDLDSPALS-
ALSKIADSLAPIKDLSPASL-T
22Basic idea
Building step by step Given the 2
sequences ALSKLASPALSAKDLDSPALS,
ALSKIADSLAPIKDLSPASLT The best alignment between
the two substrings ALSKLASPA ALSKIAD can be
computed taking only into consideration
ALSKLASP A ALSKLASP A ALSKLASPA - ALSKIA
D ALSKIAD - ALSKIA D The best among
these 3 possibilities
23The Needleman-Wunsch Matrix
x1 xM
Every nondecreasing path from (0,0) to (M, N)
corresponds to an alignment of the two
sequences
yN y1
24(No Transcript)
25The Needleman-Wunsch Algorithm
- Initialization.
- F(0, 0) 0
- F(0, j) - j ? d
- F(i, 0) - i ? d
- Main Iteration. Filling-in partial alignments
- For each i 1M
- For each j 1N
- F(i-1,j) d case 1
- F(i, j) max F(i, j-1) d case
2 - F(i-1, j-1) s(xi, yj) case 3
- UP, if case 1
- Ptr(i,j) LEFT if case 2
- DIAG if case 3
- Termination. F(M, N) is the optimal score, and
- from Ptr(M, N) can trace back optimal alignment
26Exercise
Suppose you want only to know the score of a
global alignment. gt Write a program that given
two input sequence (in a single file in fasta
format), a gap cost and a similarity matrix
computes the score of the global alignment in
O(NM) time and in O(M) space, where M and N
are the lengths of the input sequences and MltN
27The Overlap Detection variant
x1 xN
- Changes
- Initialization
- For all i, j,
- F(i, 0) 0
- F(0, j) 0
- Termination
- maxi F(i, N)?
- FOPT max maxj F(M, j)?
yM y1
28The local alignment problem
Given two strings x x1xM, y
y1yN Find substrings x, y whose similarity
(optimal global alignment value)? is
maximum e.g. x aaaacccccgggg y
cccgggaaccaacc
29(No Transcript)
30The Smith-Waterman algorithm
Idea Ignore badly aligning regions Modifications
to Needleman-Wunsch Initialization F(0, j)
F(i, 0) 0 0 Iteration F(i, j)
max F(i 1, j) d F(i, j 1)
d F(i 1, j 1) s(xi, yj)
31The Smith-Waterman algorithm
- Termination
- If we want the best local alignment
-
- FOPT maxi,j F(i, j)?
- If we want all local alignments scoring gt t
- For all i, j find F(i, j) gt t, and trace back
32(No Transcript)
33General gap dynamic programming
Initialization same Iteration F(i-1, j-1)
s(xi, yj)? F(i, j) max maxk0i-1F(k,j)
?(i-k) maxk0j-1F(i,k)
?(j-k)? Termination same Running Time
O(N2M) (assume NgtM)? Space O(NM)?
34Exercise
Write a program that given two input sequence
(in a single file in fasta format), and a choice
of a general gap function and scoring matrix
computes the alignments of the two sequences and
returns one of the possible best
alignments. Remember that when you store that
the best score is obtained using
maxk0i-1F(k,j) g(i-k) maxk0j-1F(i,k)
g(j-k)? You have to store this information in
the corresponding pointer (back-trace) matrix.
35(No Transcript)
36Needleman-Wunsch with affine gaps
Initialization F(0,0)0, F(i, 0) d (i
1)?e, F(0, j) d (j 1)?e R(0,j) -8 ,
C(i,0) -8 Iteration F(i 1, j
1) s(xi, yj)? F(i, j) max R(i , j)
C(i , j)? F(i
1, j) d R(i, j) max R(i
1, j) e F(i , j -1) d
C(i, j) max C(i , j -1 )
e Termination same
37Bounded Dynamic Programming
Assume we know that x and y are very
similar Assumption gaps(x, y) lt k(N) ( say
NgtM )? xi Then, implies i j lt
k(N)? yj We can align x and y more
efficiently Time, Space O(N ? k(N)) ltlt
O(N2)?
38Bounded Dynamic Programming
Initialization F(i,0), F(0,j) undefined for i,
j gt k Iteration For i 1M For j max(1, i
k)min(N, ik)? F(i 1, j 1) s(xi,
yj)? F(i, j) max F(i, j 1) d, if j
gt i k(N)? F(i 1, j) d, if j lt i
k(N)? Termination same
x1 xM
y1 yN
k(N)?
Easy to extend to the affine gap case
39Significance of an alignment
Given an alignment with score S, is it
significant? How are the score of random
alignments distributed? 100,000 alignments of
unrelated and shuffled sequences
Occorrenza
Significant alignments
Score
40Z-score
Z(S-ltSgt)/?s S Alignment score ltSgt average of
the scores on a random set of alignments ?s
Standard deviation of the scores on a random set
of alignments Significance of the
alignment Zlt3 not significant 3ltZlt10
probably significant Zgt10 significant
41Execise Z-score
Write a program that takes in input a fasta with
two sequences, and a number N. Compute the score
of the global alignment of the two sequence and
the Z-score with respect N shuffled sequences
(generated from the first of the fasta) against
the original second sequence of the
fasta. Z(S-ltSgt)/?s S Alignment score ltSgt
average of the scores on a random set of
alignments ?s Standard deviation of the scores
on a random set of alignments
42Is the Z-score reliable?
The Z-score of this alignment is 7.5 over 54
residues Sequence identity is as high as 25.9.
The sequences have a completely different
structure
Citrate synthase (2cts) vs transthyritin (2paba)?
43E-value
Expected number of random alignments obtaining a
score greater or equal to a given score
(s)? Relies on the Extreme Value Distribution
EKmn e-?s m, n lengths of the
sequences K, ? scaling constants The number
of high scoring random alignment increases when
the sequence lengths increase and decreases in an
exponential way when the score increases.
44E-value
Alignment significance The significance of the
E-value depends on the length of the considered
database. Considering Swiss Prot, Egt 10-1 non
significant 10-1 gt E gt 10-3 probably not
significant 10 -3 gt E gt 10-8 probably
significant E lt 10-8 significant
45P-value
Probability for random alignments to obtain a
score greater or equal to a given score
(s)? Given the E-value (expected number of
alignments), which statistics do describe the
probability of having a number a of random
alignments with score S?
Poisson
P(a)
Which is the probability of finding at least one
random alignment with score S?
P(a 1) 1 P(0) 1 exp (-E)?
46Similarity search in Data Bases
Given a sequence, search for similar sequences
in huge data sets In principle, the alignments
between the query sequence and ALL the sequences
in the data sets could be tried Too many
sequences! Heuristic algorithms can be used.
They do not assure to find the optimal
alignment FASTA BLAST
47FASTA
The query sequence is chopped in words of k-tup
characters. Usually k-tup 2 for proteins, 6 for
DNA
ADKLPTLPLRLDPTNMVFGHLRI
Words (indexed by position) AD, DK, KL, LP, PT,
TL, LP, PR, RL, ,, 1 2 3 4 5 6 7
8 9 .
The list of indexed words is compiled for each
sequence in the data set (subject)? The search
of the correspondence between the words is very
fast. The difference between the indexes of the
matches in the query and the subject sequences
determines the distance from the main diagonal
48FASTA
Query
Subject
Many matches along the same diagonal correspond
to longer identical segments along the sequences
49FASTA
Query
Subject
The alignment of the longest matched diagonals
are evaluated with a score matrix (PAM or BLOSUM)?
50FASTA
Query
Subject
Most similar regions on close diagonals are
isolated
51FASTA
Query
Subject
An exact Smith-Waterman alignment is computed on
a narrow band around the diagonal endowed with
the highest similarity (a 32-residue band is
usually adopted)?
52Sequence similarity with FASTA
53BLAST
The sequence data set is indexed as follow for
each possible residue triplet the occurrence and
the position along the sequences are stored.
(FORMATDB)?
AAA AAC AAD ACA ... ... ...
54BLAST
The query sequence is chopped in words, W-residue
long (usually W3 for proteins)?
LSHLPTLPLRLDPTNMVFGHLRI
LSH, SHL, HLP, LPT, PTL, TLR, ,,
For each word, all the similar proteins are
generated, using the BLOSUM62 matrix and setting
a similarity threshold (usually T 11--13)?
LSH 16 ISH 14 MSH 14 VSH 13 LAH 13 LTH 13 LNH 13
55BLAST
Each word included in the list of the similar
words is retrieved in the sequence of the data
set by means of the indexes
The match is extended, until the score is higher
than a threshold S
56Sequence similarity with BLAST (Basic Local
Alignment Search Tool)?
57Alignment of all the retrieved sequences
ATTENTION It is not a multiple sequence
alignment
58Sequence profile
59Usefulness of the sequence profiles
Sequence profiles describes the basic features
of all the sequence used in the alignment Most
conserved regions and most frequent mutations for
each position are highlighted Sequence-to-profil
e alignment The alignment score are weighted
position by position using the profile. The same
mutations in different positions are scored with
different values
60Sequence-to-profile alignment
Given the position i along a sequence profile, it
is represented by a 20-valued vector Pi Pi(A)
Pi(C) Pi (Y)? Given the residue in position j
along the sequence to align Sj The score for
aligning Sj to the vector Pi is where
M is a matrix score (BLOSUM or PAM)?
61PSI-BLAST
http//www.ncbi.nlm.nih.gov/BLAST/
Sequence
BLAST
Data Base
Sequence profile
PSI-BLAST
Until converges
62The design of PSI-BLAST
- PSI-BLAST takes as an input a single protein
sequence and compares it to a protein database,
using the gapped BLAST program - The program constructs a multiple alignment, and
then a profile, from any significant local
alignments found. The original query sequence
serves as a template for the multiple alignment
and profile, whose lengths are identical to that
of the query. Different numbers of sequences can
be aligned in different template positions - The profile is compared to the protein database,
again seeking local alignments. After a few minor
modifications, the BLAST algorithm can be used
for this directly. - PSI-BLAST estimates the statistical significance
of the local alignments found. Because profile
substitution scores are constructed to a fixed
scale, and gap scores remain independent of
position, the statistical theory and parameters
for gapped BLAST alignments remain applicable to
profile alignments. - Finally, PSI-BLAST iterates, by returning to step
(2), an arbitrary number of times or until
convergence.