Algorithms for Sequence Alignments

About This Presentation

Title:

Algorithms for Sequence Alignments

Description:

Dynamic Programming: Smith-Waterman, Needleman-Wunsch. Methods for Multiple Sequence Alignment ... local alignment: Smith-Waterman. statistical significance can ... – PowerPoint PPT presentation

Number of Views:342

Avg rating:3.0/5.0

Slides: 38

Provided by: z149

Category:

more less

Transcript and Presenter's Notes

Title: Algorithms for Sequence Alignments

1
Algorithms forSequence Alignments

A. Brüngger, Labhead BioinformaticsNovartis
Pharma AG
adrian.bruengger_at_pharma.novartis.com

2
Algorithms for Sequence Alignments

Definition of Sequence Alignment and Origins of
Sequence Similarity
Global Alignment
Local Alignment
Alignment of Pairs of Sequences
Alignment of Multiple Sequences
Methods for Pairwise Sequence Alignments
Dot matrix
Scoring Matrices
Dynamic Programming Smith-Waterman,
Needleman-Wunsch
Methods for Multiple Sequence Alignment
Dynamic Programming
Progressive multiple alignments CLUSTALW, PILEUP
Database Searching for Similar Sequences
FASTA, BLAST
Patscan
Hidden Markov Models

Outline follows David W. Mount, "Bioionformatics
- Sequence and Genome Analysis Cold Spring
Harbour Laboratory Press, 2001. Online
http//www.bioinformaticsonline.org
3
Rational for Sequence Analysis, Origins of
Sequence Similarity
Similar sequence leads to similar function
Sequence Analysis as the basic tool to
discover functional, structural,evolutionary
information in biological sequences
Sequence A
Sequence B
Evolutionary relationship between two similar
sequences and a possible common ancestor. The
number of steps to convert one sequence into the
other is the "evolutionary" distance between the
sequences (x y). Usually, the ancestor sequence
is not available, only (x y) can be computed.
y Steps
x Steps
common ancestor sequence
4
Origins of Homology ? Significance of Sequence
Alignments

Possible Origins of Sequence Homology
orthologs (panel A and B) a1 in species I and a1
in species II (same ancestor!)
paralogs (panel A and B) a1 and a2 (arose from
gene duplication event)
analogs (panel C) different genes converge to
same function by different evolutionary paths
transfer of genetic material (panel D) between
different species
Homology vs. Similarity
Similarity can be computed (by sequence
alignments)
Homology is deduced (e.g. from similarity, but
also from other evidence!)

5
Definition of Sequence Alignment

Computational procedure (algorithm) for
comparing two/many sequences
identify series of identical residues or patterns
of identical residuesthat appear in the same
order in the sequences
visualized by writing sequences as follows
sequence alignment is an optimiztion
problembringing as many identical residues as
possible into corresponding positions

MLGPSSKQTGKGS-SRIWDN
MLN-ITKSAGKGAIMRLGDA
Pairwise Global Alignment (over whole length of
sequences)
GKG GKG
Pairwise Local Alignment (similar parts of
sequences)
6
Pairwise Sequence Alignment

Alignment Score Quantify the Quality of an
alignment
simple scoring system (eg. for DNA sequences)
when two identical residues are aligned 1
when two non-identical residues are aligned -1
when a gap is introduced -1
total score of alignment is the sum of the scores
in each position

agaag-tagattcta aggaggtag-tt-ta

Compute Score
11 matches
1 mismatch
3 gaps
Score 11 - 1 -3 7

Alignment is 'optimal' when assigned score is
maximal Note Scoring scheme defines 'optimal'
alignment
7
Algorithms for Pairwise Sequence Alignments Dot
Plot

Algorithm introduced by Gibbs and McIntyre
(1970)
write one sequence on top from left to right
write other sequence on the left from top to
bottom
place a dot whenever two residues are identical
variants use colors
visual inspection (variants identify diagonals
with many dots

a g a c c t a g a g
a c c
t
8
Algorithms for Pairwise Sequence Alignments Dot
Plot

Example
Dot Plots of DNA and Protein of Phage l cI
(horizontal) and P22 c3 (vertical)

DNA
Protein
Removing noise Plot a dot only if 7
("stringency") out of the next 11 ("window size")
residues are identical!
9
Algorithms for Pairwise Sequence Alignments Dot
Plot

Example Human LDL receptor against itself.

Window size 1, stringency 1.
Window size 23, stringency 7.
"Repeats" in first part of sequence.
10
Algorithms for Pairwise Sequence Alignments Dot
Plot

Strengths
visualization of sequence similarity
finding direct and inverted repeats in sequences
finding self-complementary regions in RNA
(secondary structures)
simple to compute, simple to visualize
Implementations
DNA Strider (Macintosh)
DOTTER (UNIX X-Windows)
GCG "COMPARE", "DOTPLOT"
online SIB http//www.isrec.isb-sib.ch/java/dotle
t/Dotlet.html
Computational Complexity
two sequences of length n, m O(nm)

11
Scoring Matrices for Proteins

Observation Protein (Function, Structure) is
preserved when substitutions in certain AAs
happen
scoring system less simple in case of proteins
Example The same protein in three different
species

A W T V A S A V R T S I A Y T V A S A V R T S I A
W T V A A A V L T S I
Organism A Organism B Organism C
A
B
C
Therefore Assigning L to R scores higher than
assigning it to another AA
W to Y
L to R A to S
12
Scoring Matrices for Proteins

Generalization Dayhoff PAM Weight Matrices
(percent accepted mutation)
observed mutations during a unit of evolutionary
time(1 PAM time that is required that an AA is
changed with probability 1)
initial PAM matrix ("PAM1") derived from 1572
changes in 71 groups of protein sequences that
are at least 85 similar
observed in proteins that have the same function
gt "accepted" mutations

C S T P .. C fcc fcs fct fcp S fsc
fss fst fsp . .
13
Scoring Matrices for Proteins

Extrapolation to longer evolutionary distances
possible!

M
PAM1 p(HM) 0 PAM2 p(HM) p(HK)p(KM) p(HR
)p(RM)
R
K
H
14
Scoring Matrices for Proteins

For each pair of residues, a score is given by a
"scoring matrix"
Scoring of two pairs of residues depends on
frequency of the two residues
exchange that tends to preserves function should
have high score
Scoring matrices are constructed by empirically
evaluating (manually constructed) alignments of
closely related proteins
PAM
introduced by Margarete Dayhoff, 1978
inspecting 1572 changes in 71 groups of protein
sequences
different matrices for varying degree of
similarity (or, evolutionary distance)
BLOSUM
inspection of about 2'000 conserved AA-stretches,
"BLOCKS"

Sequence 1 V D S - C Y Sequence 2 V E
S L C Y score 4 2 4 -2 4 4 Total
16
15
Dynamic Programming Approach to Sequence
Alignments

Description by Example
input two AA sequences VDSCY and
VESLCY scoring matrix match 4, mismatch
2, gap -2
Some possible alignments and their
score VDSCY VD-SCY VDS-CY
VESLCY V-ESLCY VESLCY 14 8 16
Observation Longer alignments can be derived
from shorter

new score old score score of new
pair VDS-CY VDS-C Y VESLCY
VESLC Y 16 12
4 VDS-C VDS- C VESLC
VESL C 12 8
4
16
Dynamic Programming Approach to Sequence
Alignments

Alignment Path in matrix from "top left" to
"bottom right"

Seq 1 V D S - C Y Seq 2 V E S L C Y
possible extensions of "current" alignment ( )

Three situations for extending previous
alignment

introduce gap in sequence 1
introduce gap in sequence 2
17
Dynamic Programming Approach to Sequence
Alignments

In each matrix element, we can write the score of
the best alignment up to this element, and a
pointer to the shorter alignment it is derived
from

V
D
S
Y
C
Goal is to compute this path!
4
V
6
E
Seq 1 V D S - C Y Seq 2 V E S L C
Y 4 2 4 -2 4 4
10
S
L
C
12
match 4, mismatch 2, gap -2
16
Y
V
D
S
Y
C
two possibilities 4 (E-gtS) gap 4 2
(E-gtS) 4 choose one (first)
V
D
S
Y
C
4
fill firstrow/column
V
2
2
2
2
4
V
2
2
2
2
2
6
4
4
4
E
2
E
2
4
S
2
S
three possibilities 4 (L-gtD) 2 gaps 2 2
(L-gtD) 1 gap 2 2 (L-gtD)
4 choose third (score max)
L
2
4
L
2
C
2
4
C
2
Y
2
4
Y
2
18
Dynamic Programming Approach to Sequence
Alignments

Formally
ai, bj Residue of Sequence A at position i and
Residue of Sequence B at position j
wx Penalty for introducing gap of length x
Sij Score of alignment at position (i, j)
s(a-gtb) Score for aligning a and b (known from
scoring matrix)
Iterate process until whole matrix is filled with
scores and back-pointers
Choose maximum score in last column or row
Follow pointers to construct alignment

Si-1, j-1 s(ai -gt bi)
Sij max
max x³1 (Si-x,j - wx)
max y³1 (Si,j-y - wy)
19
Dynamic Programming Approach to Sequence
Alignments

Global AlignmentNeedleman and Wunsch(1970)
Local AlignmentSmith-Waterman(minor
modification)

20
Dynamic Programming Approach to Sequence
Alignments

Implementations available on the web

21
Dynamic Programming Approach to Sequence
Alignments

Strengths
finds optimal (mathematically best) alignment
suited for both, local and global alignments
global alignment Needleman-Wunsch
local alignment Smith-Waterman
statistical significance can be
attached(recompute alignment when one or both
sequences are randomly changed)
Implementations
LALIGN
GCG "GAP" (global) and BESTFIT (local)
...
Complexity
two sequences of length n, m
time O(nm), space O(nm)
space is crucial
example matrix element 4 bytes, nm10000, space
requirement 400MB
can be improved trade time for space

22
Multiple Sequence Alignment

When aligning k (kgt2) sequences, the dynamic
programming idea can theoretically be used
instead of a two-dimensional scoring schema, a
k-dimensional one is used
instead of two-dimensional residue substitution
matrix, a k-dimensional one is used
However, this is not feasible in practical terms
size of these matrices increase too rapidly!
example 5 sequences, 100 residues each, matrix
size 1005 1010in terms of computer memory
10 GB
ConsequenceOptimal alignment for multiple
sequences is not computable!
Approximative methods used (heuristics)

23
Multiple Sequence Alignment Progressive methodes

Idea for progressive methodes
step 1 Align two of the sequences
step 2 Compute "consensus" sequence
step 3 Align a third sequence to the consensus
step 4 Repeat steps 2 and 3

24
Multiple Sequence Alignment Progressive
methodes, CLUSTALW

Basic steps
step 1 Perform pairwise sequence alignments of
all possible pairs
step 2 Use scores to produce a phylogenetic
tree
step 3 align the sequences sequentially, guided
by the tree
the most closely related sequences are aligned
first
other sequences or groups of sequences are added
Example Phylogenetic tree for 4 sequences

A B C D A 100 90 85 90 B 100 95
80 C 100 97 D
100 Percentage Identities inpairwise alignment
A
B
90
90
80
85
95
C
D
97
Join sequences with highest similarity
25
Multiple Sequence Alignment Progressive
methodes, CLUSTALW

Computation of Phylogenetic tree

Join sequences with highest similarity
A B C D A 100 95 80 90 B 100 95
85 C 100 97 D
100 Percentage Identities inpairwise alignment
A
A
B
B
95
95
90
85
80
(8595)/2 90
(8090)/2 85
95
C
D
97
C, D
Join sequences with highest similarity
A, B
(8590)/2 87.5
A
B
C, D
C
D
26
Multiple Sequence Alignment Progressive
methodes, CLUSTALW

Example SevenGlobins fromSWISSPROT

27
Multiple Sequence Alignment Progressive
methodes, CLUSTALW

Available Implementations

28
Riddle Probability that a short string occurs in
a longer text

Given an alphabet A a, c, g, t a random
string s of length k over A a random text t of
length n (ngtk) over A
Find probability p, by which the string s occurs
in the text t.
Find expected number of occurrences e of s in t.
Example s tggtacaaatgttct (glucocorticoid
response element GRE) t 10000 bp (promoter,
upstream DNA to start codon)

29
Basics of Sequence DB Searches Definition

Given one short sequence q (query) sequence
DB, conceptually one long sequence s (subject)
Find occurrences of similar sequences to q in
s.
Attach to each similar occurrence a statistical
significance.
Example s human genomic DNA, 3109 b q
tggtacaaatgttct (glucocorticoid response element
GRE)
Dynamic programming approach not feasible!

30
Basics of Sequence DB Searches Detection of
identical k-mers

Idea identify identical k-mers in s and q
(seed) expand alignment from seed in both
directions
Example

q MAAARLCLSLLLLSTCVALLLQPLLGAQGAPLEPVYPGDNATPEQM
AQYAADLRRYINMLTRPRYGKRHKEDTLAFSEWGS

...
MAVAYCCLSLFLVSTWVALLLQPLQGTWGAPLEPMYPGDYATPEQMAQYE
TQLRRYINTLTRPRYGKRAEEENTGGLP...

BLAST
HSP, high scoring pair
gapped alignment
starting extension also from similar (and not
only identical) seeds

31
Basics of Sequence DB Dearches Detection of
identical k-mers

Precompute position of all k-mers in DB sequence
Indexing all Peptides of length k in "database
Example

0 1 2 3 1234567890123456789
012345678901234 MAAARLCLSLLLLSTCVALLLQPLLGAQGAPLEP
MAAAR AAARL AARLC ....
APLEP Sorted AAARL 2
AARLC 3 APLEP 30 ... MAAAR 1 ...
VALLL 17

For each peptide of length k in "query", the
identical peptides in "database" are identified
efficiently (binary search in sorted list)
For identical pairs, extension step in both
directions is performed

32
Basics of Sequence DB Dearches Detection of
identical k-mers

Indexing all Peptides of length k in "database
some refinements

0 1 2 3 1234567890123456789
012345678901234 MAAARLCVALLLLSTCVALLLQPLLGAQGAPLEP
- 8 Pointers
to previous occurrences List of all words of
length k and last occurrence in query AAAAA
- AAAAC - AAAAD - .... AARLC 3
.... APLEP 30 ... ... VALLL 17 ...
Simple, yet efficient data-structure - array of
integers (sizelength of db) - array of integers
(sizenumber of words with length
k) Book-keeping - more than one db/query
sequence - build database chunks that fit into
main memory(speeds up computation
1000x) Extension step optimizations

For each peptide of length k in "query", the
position in the wordlist can be easily computed
(no binary search!)

33
Significance of matches DNA case

issue searching with short query vs. large
database ? found match could have occurred by
pure chance
assume equal distribution of c,g,a,t
what is ...
the probability q, that sequence B (lenm) is
contained in sequence A (lenn)?
the expected length of a common subsequence of
two sequences?
the expected score when locally aligning two
sequences of length n, m?

Solution for first
a. There are (n-m) 'words' of length m in
sequence A
b. In total, there are 4m sequences of length m
c. p (n-m) / 4m

This is wrong. Why?
34
Significance of matches Statistics

statistics
the statistical distribution of alignment scores
found in a DB search follows the extreme value
distribution (not normal distribution)
extreme value distribution changes with length of
sequences and their residual composition
scores of actual database search results are
plotted vs. expected scores(FASTA)
BLAST computes E-Value (number of expected hits
with this score, when comparing the query with
unrelated database sequences)

35
Putting it together BLAST2
A W T V A S A V R T S I

(optional) filtering for low complexity region in
query
all words of length 3 are listed
to each word, 50 'high scoring' additional words
are added
matching words are found in DB (with
datastructure as described before)
ungapped alignment constructed from word matches,
'HSP'
statistics determines, whether HSP is significant
SW-alignment for significant HSPs

AWT VAS AVR TSI WTV ASA VRT TVA SAV RTS
AWT VAS AVR TSI WTV ASA VRT TVA SAV RTS AWA
IAS TVR ... TWA LAS AIR ... ART ITS AVS ... ...
... ...
36
An example comparing mus chr X vs. hum chr X,
syntenic regions
mouse chromosome X (147Mbp)
human chromosome X (149Mbp)
Each dot conserved stretch of AA HSP, high
scoring pair
37
Conclusions

types of sequence alignments pairwise, multiple,
query vs. database
local and global alignment
scoring matrices attach score to each pair of
residues(allow similar residues to be aligned)
sequence alignment problem is mathematical
optimization problemfind optimal alignment
maximise score
optimal alignment in practical terms only
feasible for pairwise alignment
Heuristics for multiple sequence
alignmentsprogressive alignment guided by all
pairwise alignments
Sequence database searches
efficiently find occurrence of query k-mers in db
sequences (seeds)
expand (ungapped) HSP from seeds
attach statistical significance