Title: Algorithms for Sequence Alignments
1Algorithms forSequence Alignments
- A. Brüngger, Labhead BioinformaticsNovartis
Pharma AG - adrian.bruengger_at_pharma.novartis.com
2Algorithms for Sequence Alignments
- Definition of Sequence Alignment and Origins of
Sequence Similarity - Global Alignment
- Local Alignment
- Alignment of Pairs of Sequences
- Alignment of Multiple Sequences
- Methods for Pairwise Sequence Alignments
- Dot matrix
- Scoring Matrices
- Dynamic Programming Smith-Waterman,
Needleman-Wunsch - Methods for Multiple Sequence Alignment
- Dynamic Programming
- Progressive multiple alignments CLUSTALW, PILEUP
- Database Searching for Similar Sequences
- FASTA, BLAST
- Patscan
- Hidden Markov Models
Outline follows David W. Mount, "Bioionformatics
- Sequence and Genome Analysis Cold Spring
Harbour Laboratory Press, 2001. Online
http//www.bioinformaticsonline.org
3Rational for Sequence Analysis, Origins of
Sequence Similarity
Similar sequence leads to similar function
Sequence Analysis as the basic tool to
discover functional, structural,evolutionary
information in biological sequences
Sequence A
Sequence B
Evolutionary relationship between two similar
sequences and a possible common ancestor. The
number of steps to convert one sequence into the
other is the "evolutionary" distance between the
sequences (x y). Usually, the ancestor sequence
is not available, only (x y) can be computed.
y Steps
x Steps
common ancestor sequence
4Origins of Homology ? Significance of Sequence
Alignments
- Possible Origins of Sequence Homology
- orthologs (panel A and B) a1 in species I and a1
in species II (same ancestor!) - paralogs (panel A and B) a1 and a2 (arose from
gene duplication event) - analogs (panel C) different genes converge to
same function by different evolutionary paths - transfer of genetic material (panel D) between
different species - Homology vs. Similarity
- Similarity can be computed (by sequence
alignments) - Homology is deduced (e.g. from similarity, but
also from other evidence!)
5Definition of Sequence Alignment
- Computational procedure (algorithm) for
comparing two/many sequences - identify series of identical residues or patterns
of identical residuesthat appear in the same
order in the sequences - visualized by writing sequences as follows
- sequence alignment is an optimiztion
problembringing as many identical residues as
possible into corresponding positions
MLGPSSKQTGKGS-SRIWDN
MLN-ITKSAGKGAIMRLGDA
Pairwise Global Alignment (over whole length of
sequences)
GKG GKG
Pairwise Local Alignment (similar parts of
sequences)
6Pairwise Sequence Alignment
- Alignment Score Quantify the Quality of an
alignment - simple scoring system (eg. for DNA sequences)
- when two identical residues are aligned 1
- when two non-identical residues are aligned -1
- when a gap is introduced -1
- total score of alignment is the sum of the scores
in each position
agaag-tagattcta aggaggtag-tt-ta
- Compute Score
- 11 matches
- 1 mismatch
- 3 gaps
- Score 11 - 1 -3 7
Alignment is 'optimal' when assigned score is
maximal Note Scoring scheme defines 'optimal'
alignment
7Algorithms for Pairwise Sequence Alignments Dot
Plot
- Algorithm introduced by Gibbs and McIntyre
(1970) - write one sequence on top from left to right
- write other sequence on the left from top to
bottom - place a dot whenever two residues are identical
variants use colors - visual inspection (variants identify diagonals
with many dots
a g a c c t a g a g
a c c
t
8Algorithms for Pairwise Sequence Alignments Dot
Plot
- Example
- Dot Plots of DNA and Protein of Phage l cI
(horizontal) and P22 c3 (vertical)
DNA
Protein
Removing noise Plot a dot only if 7
("stringency") out of the next 11 ("window size")
residues are identical!
9Algorithms for Pairwise Sequence Alignments Dot
Plot
- Example Human LDL receptor against itself.
Window size 1, stringency 1.
Window size 23, stringency 7.
"Repeats" in first part of sequence.
10Algorithms for Pairwise Sequence Alignments Dot
Plot
- Strengths
- visualization of sequence similarity
- finding direct and inverted repeats in sequences
- finding self-complementary regions in RNA
(secondary structures) - simple to compute, simple to visualize
- Implementations
- DNA Strider (Macintosh)
- DOTTER (UNIX X-Windows)
- GCG "COMPARE", "DOTPLOT"
- online SIB http//www.isrec.isb-sib.ch/java/dotle
t/Dotlet.html - Computational Complexity
- two sequences of length n, m O(nm)
11Scoring Matrices for Proteins
- Observation Protein (Function, Structure) is
preserved when substitutions in certain AAs
happen - scoring system less simple in case of proteins
- Example The same protein in three different
species
A W T V A S A V R T S I A Y T V A S A V R T S I A
W T V A A A V L T S I
Organism A Organism B Organism C
A
B
C
Therefore Assigning L to R scores higher than
assigning it to another AA
W to Y
L to R A to S
12Scoring Matrices for Proteins
- Generalization Dayhoff PAM Weight Matrices
(percent accepted mutation) - observed mutations during a unit of evolutionary
time(1 PAM time that is required that an AA is
changed with probability 1) - initial PAM matrix ("PAM1") derived from 1572
changes in 71 groups of protein sequences that
are at least 85 similar - observed in proteins that have the same function
gt "accepted" mutations
C S T P .. C fcc fcs fct fcp S fsc
fss fst fsp . .
13Scoring Matrices for Proteins
- Extrapolation to longer evolutionary distances
possible!
M
PAM1 p(HM) 0 PAM2 p(HM) p(HK)p(KM) p(HR
)p(RM)
R
K
H
14Scoring Matrices for Proteins
- For each pair of residues, a score is given by a
"scoring matrix" - Scoring of two pairs of residues depends on
- frequency of the two residues
- exchange that tends to preserves function should
have high score - Scoring matrices are constructed by empirically
evaluating (manually constructed) alignments of
closely related proteins - PAM
- introduced by Margarete Dayhoff, 1978
- inspecting 1572 changes in 71 groups of protein
sequences - different matrices for varying degree of
similarity (or, evolutionary distance) - BLOSUM
- inspection of about 2'000 conserved AA-stretches,
"BLOCKS"
Sequence 1 V D S - C Y Sequence 2 V E
S L C Y score 4 2 4 -2 4 4 Total
16
15Dynamic Programming Approach to Sequence
Alignments
- Description by Example
- input two AA sequences VDSCY and
VESLCY scoring matrix match 4, mismatch
2, gap -2 - Some possible alignments and their
score VDSCY VD-SCY VDS-CY
VESLCY V-ESLCY VESLCY 14 8 16 - Observation Longer alignments can be derived
from shorter
new score old score score of new
pair VDS-CY VDS-C Y VESLCY
VESLC Y 16 12
4 VDS-C VDS- C VESLC
VESL C 12 8
4
16Dynamic Programming Approach to Sequence
Alignments
- Alignment Path in matrix from "top left" to
"bottom right"
Seq 1 V D S - C Y Seq 2 V E S L C Y
possible extensions of "current" alignment ( )
- Three situations for extending previous
alignment
introduce gap in sequence 1
introduce gap in sequence 2
17Dynamic Programming Approach to Sequence
Alignments
- In each matrix element, we can write the score of
the best alignment up to this element, and a
pointer to the shorter alignment it is derived
from
V
D
S
Y
C
Goal is to compute this path!
4
V
6
E
Seq 1 V D S - C Y Seq 2 V E S L C
Y 4 2 4 -2 4 4
10
S
L
C
12
match 4, mismatch 2, gap -2
16
Y
V
D
S
Y
C
two possibilities 4 (E-gtS) gap 4 2
(E-gtS) 4 choose one (first)
V
D
S
Y
C
4
fill firstrow/column
V
2
2
2
2
4
V
2
2
2
2
2
6
4
4
4
E
2
E
2
4
S
2
S
three possibilities 4 (L-gtD) 2 gaps 2 2
(L-gtD) 1 gap 2 2 (L-gtD)
4 choose third (score max)
L
2
4
L
2
C
2
4
C
2
Y
2
4
Y
2
18Dynamic Programming Approach to Sequence
Alignments
- Formally
- ai, bj Residue of Sequence A at position i and
Residue of Sequence B at position j - wx Penalty for introducing gap of length x
- Sij Score of alignment at position (i, j)
- s(a-gtb) Score for aligning a and b (known from
scoring matrix) - Iterate process until whole matrix is filled with
scores and back-pointers - Choose maximum score in last column or row
- Follow pointers to construct alignment
Si-1, j-1 s(ai -gt bi)
Sij max
max x³1 (Si-x,j - wx)
max y³1 (Si,j-y - wy)
19Dynamic Programming Approach to Sequence
Alignments
- Global AlignmentNeedleman and Wunsch(1970)
- Local AlignmentSmith-Waterman(minor
modification)
20Dynamic Programming Approach to Sequence
Alignments
- Implementations available on the web
21Dynamic Programming Approach to Sequence
Alignments
- Strengths
- finds optimal (mathematically best) alignment
- suited for both, local and global alignments
- global alignment Needleman-Wunsch
- local alignment Smith-Waterman
- statistical significance can be
attached(recompute alignment when one or both
sequences are randomly changed) - Implementations
- LALIGN
- GCG "GAP" (global) and BESTFIT (local)
- ...
- Complexity
- two sequences of length n, m
- time O(nm), space O(nm)
- space is crucial
- example matrix element 4 bytes, nm10000, space
requirement 400MB - can be improved trade time for space
22Multiple Sequence Alignment
- When aligning k (kgt2) sequences, the dynamic
programming idea can theoretically be used - instead of a two-dimensional scoring schema, a
k-dimensional one is used - instead of two-dimensional residue substitution
matrix, a k-dimensional one is used - However, this is not feasible in practical terms
- size of these matrices increase too rapidly!
- example 5 sequences, 100 residues each, matrix
size 1005 1010in terms of computer memory
10 GB - ConsequenceOptimal alignment for multiple
sequences is not computable! - Approximative methods used (heuristics)
23Multiple Sequence Alignment Progressive methodes
- Idea for progressive methodes
- step 1 Align two of the sequences
- step 2 Compute "consensus" sequence
- step 3 Align a third sequence to the consensus
- step 4 Repeat steps 2 and 3
24Multiple Sequence Alignment Progressive
methodes, CLUSTALW
- Basic steps
- step 1 Perform pairwise sequence alignments of
all possible pairs - step 2 Use scores to produce a phylogenetic
tree - step 3 align the sequences sequentially, guided
by the tree - the most closely related sequences are aligned
first - other sequences or groups of sequences are added
- Example Phylogenetic tree for 4 sequences
A B C D A 100 90 85 90 B 100 95
80 C 100 97 D
100 Percentage Identities inpairwise alignment
A
B
90
90
80
85
95
C
D
97
Join sequences with highest similarity
25Multiple Sequence Alignment Progressive
methodes, CLUSTALW
- Computation of Phylogenetic tree
Join sequences with highest similarity
A B C D A 100 95 80 90 B 100 95
85 C 100 97 D
100 Percentage Identities inpairwise alignment
A
A
B
B
95
95
90
85
80
(8595)/2 90
(8090)/2 85
95
C
D
97
C, D
Join sequences with highest similarity
A, B
(8590)/2 87.5
A
B
C, D
C
D
26Multiple Sequence Alignment Progressive
methodes, CLUSTALW
- Example SevenGlobins fromSWISSPROT
27Multiple Sequence Alignment Progressive
methodes, CLUSTALW
- Available Implementations
28Riddle Probability that a short string occurs in
a longer text
- Given an alphabet A a, c, g, t a random
string s of length k over A a random text t of
length n (ngtk) over A - Find probability p, by which the string s occurs
in the text t. - Find expected number of occurrences e of s in t.
- Example s tggtacaaatgttct (glucocorticoid
response element GRE) t 10000 bp (promoter,
upstream DNA to start codon)
29Basics of Sequence DB Searches Definition
- Given one short sequence q (query) sequence
DB, conceptually one long sequence s (subject) - Find occurrences of similar sequences to q in
s. - Attach to each similar occurrence a statistical
significance. - Example s human genomic DNA, 3109 b q
tggtacaaatgttct (glucocorticoid response element
GRE) - Dynamic programming approach not feasible!
30Basics of Sequence DB Searches Detection of
identical k-mers
- Idea identify identical k-mers in s and q
(seed) expand alignment from seed in both
directions - Example
q MAAARLCLSLLLLSTCVALLLQPLLGAQGAPLEPVYPGDNATPEQM
AQYAADLRRYINMLTRPRYGKRHKEDTLAFSEWGS
...
MAVAYCCLSLFLVSTWVALLLQPLQGTWGAPLEPMYPGDYATPEQMAQYE
TQLRRYINTLTRPRYGKRAEEENTGGLP...
- BLAST
- HSP, high scoring pair
- gapped alignment
- starting extension also from similar (and not
only identical) seeds
31Basics of Sequence DB Dearches Detection of
identical k-mers
- Precompute position of all k-mers in DB sequence
- Indexing all Peptides of length k in "database
- Example
0 1 2 3 1234567890123456789
012345678901234 MAAARLCLSLLLLSTCVALLLQPLLGAQGAPLEP
MAAAR AAARL AARLC ....
APLEP Sorted AAARL 2
AARLC 3 APLEP 30 ... MAAAR 1 ...
VALLL 17
- For each peptide of length k in "query", the
identical peptides in "database" are identified
efficiently (binary search in sorted list) - For identical pairs, extension step in both
directions is performed
32Basics of Sequence DB Dearches Detection of
identical k-mers
- Indexing all Peptides of length k in "database
some refinements
0 1 2 3 1234567890123456789
012345678901234 MAAARLCVALLLLSTCVALLLQPLLGAQGAPLEP
- 8 Pointers
to previous occurrences List of all words of
length k and last occurrence in query AAAAA
- AAAAC - AAAAD - .... AARLC 3
.... APLEP 30 ... ... VALLL 17 ...
Simple, yet efficient data-structure - array of
integers (sizelength of db) - array of integers
(sizenumber of words with length
k) Book-keeping - more than one db/query
sequence - build database chunks that fit into
main memory(speeds up computation
1000x) Extension step optimizations
- For each peptide of length k in "query", the
position in the wordlist can be easily computed
(no binary search!)
33Significance of matches DNA case
- issue searching with short query vs. large
database ? found match could have occurred by
pure chance - assume equal distribution of c,g,a,t
- what is ...
- the probability q, that sequence B (lenm) is
contained in sequence A (lenn)? - the expected length of a common subsequence of
two sequences? - the expected score when locally aligning two
sequences of length n, m?
- Solution for first
- a. There are (n-m) 'words' of length m in
sequence A - b. In total, there are 4m sequences of length m
- c. p (n-m) / 4m
This is wrong. Why?
34Significance of matches Statistics
- statistics
- the statistical distribution of alignment scores
found in a DB search follows the extreme value
distribution (not normal distribution) - extreme value distribution changes with length of
sequences and their residual composition - scores of actual database search results are
plotted vs. expected scores(FASTA) - BLAST computes E-Value (number of expected hits
with this score, when comparing the query with
unrelated database sequences)
35Putting it together BLAST2
A W T V A S A V R T S I
- (optional) filtering for low complexity region in
query - all words of length 3 are listed
- to each word, 50 'high scoring' additional words
are added - matching words are found in DB (with
datastructure as described before) - ungapped alignment constructed from word matches,
'HSP' - statistics determines, whether HSP is significant
- SW-alignment for significant HSPs
AWT VAS AVR TSI WTV ASA VRT TVA SAV RTS
AWT VAS AVR TSI WTV ASA VRT TVA SAV RTS AWA
IAS TVR ... TWA LAS AIR ... ART ITS AVS ... ...
... ...
36An example comparing mus chr X vs. hum chr X,
syntenic regions
mouse chromosome X (147Mbp)
human chromosome X (149Mbp)
Each dot conserved stretch of AA HSP, high
scoring pair
37Conclusions
- types of sequence alignments pairwise, multiple,
query vs. database - local and global alignment
- scoring matrices attach score to each pair of
residues(allow similar residues to be aligned) - sequence alignment problem is mathematical
optimization problemfind optimal alignment
maximise score - optimal alignment in practical terms only
feasible for pairwise alignment - Heuristics for multiple sequence
alignmentsprogressive alignment guided by all
pairwise alignments - Sequence database searches
- efficiently find occurrence of query k-mers in db
sequences (seeds) - expand (ungapped) HSP from seeds
- attach statistical significance