Title: Algorithms for Pairwise Sequence Alignment
1Algorithms for Pairwise Sequence Alignment
- Craig A. Struble, Ph.D.
- Marquette University
2Overview
- Pairwise Sequence Alignment
- Dynamic Programming Solution
- Global Alignment
- Local Alignment
3Goals
- Define the pairwise sequence alignment problem
- Understand the difference between global and
local alignment - Understand the significance of pairwise sequence
alignment
4Pairwise Sequence Alignment
- Problem
- Given two sequences (DNA or AA), line them up
in a biologically meaningful way.
HEAGAWGHE-E
HEAGAWGHEE
PAWHEAE
P-A--W-HEAE
5Origins Of Similar Sequences
2
a
duplication
1
a1
a2
speciation
duplication
a1
a2
a2
a1
Species 2
Species 1
2
2
1
1
Transfer
Convergence
6Why is comparing sequences important?
- One of the fundamental phenomena explored by
bioinformatics, around which many tools are built - Databases, data selection, etc.
- Researchers compare sequences in order to
- infer the function of genes
- infer the structure of genes and gene products
- infer the evolutionary history of genes and
organisms - identify variation responsible for disease and
other complex phenotypes
7Why is this a challenging problem?
- Similar sequences contain variation
- Sequences mutate over time
- Mutations are spontaneous changes in sequence
caused by replication (or other) errors. Mutation
rates vary, and can be influenced by many
factors. - Sequence data contains errors
- Sequencing techniques are imperfect
8Four Basic Types of Mutations
C. Insertion
Thr Tyr Leu Leu ACC TAT TTG CTG
Thr Tyr Leu Leu ACC TAT TTG CTG
ACC TCT TTG CTG Thr Ser Leu Leu
ACC TAC TTT GCT G-- Thr Tyr Phe Ala
B. Deletion
D. Inversion
Thr Tyr Leu Leu ACC TAT TTG CTG
Thr Tyr Leu Leu ACC TAT TTG CTG
ACC TTT ATG CTG Thr Phe Met Leu
ACC TAT TGC TG- Thr Tyr Cys
9Influences on Variation
- Rates of mutations are influenced by
- Substitution class (transition/transversion)
- Coding site (synonymous/nonsynonymous)
- Length of insertion/deletion
- Codon usage bias
- Nucleotide consist (GC content)
- Stability fate of variation depends upon
- Drift
- Selection (positive Darwinian/purifying, sexual,
artificial) - Other mutations (reversions are not uncommon)
10Homology vs. Similarity
- Homology is a discrete state pertaining to
relatedness - two genes are homologues if and
only if they share a commone gene ancestor - Orthologues in different organisms, a result of
speciation - Paralogues in the same organism, a result of
gene duplication - Homologues may have the same, similar, or
different functions - Similarity is a continuous state describing the
degree of to which two homologues share
characteristics - Generally a percentage
- Distance estimates are also estimates of
similarity
11Kinds of Alignments
- The local alignment includes only regions of
identity (or strong similarity). The favors
finding conserved regions. - The global alignment is stretched over the entire
sequence length, including as many matches as
possible.
12When do you choose local vs. global?
- Choose local alignment when
- DNA sequences encode genes with introns
- Amino acid sequences encoding proteins
- Choose a global alignment when
- Sequences can be seen to be very similar
- Similar regions are in the same order and
orientation
13Methods Of Sequence Alignment
- Dot matrix analysis
- Dynamic programming algorithms
- Word or k-tuple methods
- BLAST, FASTA
- Discussed later in the semester
14Dot Matrix Analysis
- Visualization of sequence similarity
- First technique to use on pairs of sequences
- Insertions/deletions
- Inverted repeats
- Does not show actual alignment
- Optimal alignment not obvious
15Simple Dot Matrix Example
- For sequences
- ATGCGTCGTT
- ATCCGCGAT
A T G C G T C G T T
A T C C G C G A T
- Steps
- Arrange sequences on a matrix
- Place a dot anywhere nucleotides match
- Diagonal stretches (here indicated by a line) are
areas of alignment - More than one area of alignment can appear
16DNA sequence matrix Noisy
- Sequence alignment of 2 long DNA sequences
- Many random matches make it difficult or
impossible to find areas of alignment - Using a window stringency setting, we can
eliminate some of the noise
17DNA sequence matrix Less noisy
- To decrease noise of random matches, a window of
11 nucleotides was defined, and a dot placed when
at least 7 matches occur - Window 11, Stringency 7
- Some diagonal lines begin to appear
18DNA sequence matrix Less noisy
- To decrease noise of random matches, a window of
23 nucleotides was defined, and a dot placed when
at least 15 matches occur - Window 23, Stringency 15
- A clear diagonal line appears, indicating an area
of alignment - A few other areas are still apparent - probably
long random matches
19Protein sequence matrix Noisy
- Sequence comparison of amino acid sequence (same
gene as previous example) - Window 1, stringency 1
- To decrease noise due to random matches,
conditions can be tightened
20Protein sequence matrix Less noisy
- Same sequence comparison, tighter analysis
conditions - Window 3, stringency 2
- A single aligned region is visible, with a number
of areas of random matches
21Evidence of repeats in a DNA sequence
Window 1, stringency 1
Window 23, stringency 7
22Programs for Dot Matrix Analysis
- DNA Strider (Macintosh)
- Dotter (Unix/Linux, X-Windows)
- In the lab
- COMPARE, DOTPLOT in GCG
- PLALIGN (FASTA)
- Plots alignments found by DP method
- Dotlet
- http//www.isrec.isb-sib.ch/java/dotlet/Dotlet.htm
l
23Optimal Sequence Alignments
- Example
- Which one is better?
HEAGAWGHEE
PAWHEAE
HEAGAWGHE-E
HEAGAWGHE-E
P-A--W-HEAE
--P-AW-HEAE
24Scoring
- To compare two sequence alignments, calculate a
score - Scoring matrix
- Provide a score for each match/mismatch
- Sometimes a mismatch is acceptable
- PAM, BLOSUM are two classes of scoring matrices
- Gap penalty
- Initiating a gap
- Gap extension penalty
- Extending a gap
25Scoring Matrix Example
- Gap penalty -8
- Gap extension -4
HEAGAWGHE-E
--P-AW-HEAE
(-8) (-4) (-1) 5 15 (-8) 10 6
(-8) 6 13
HEAGAWGHE-E
Exercise Calculate for
P-A--W-HEAE
26Formal Description
- Problem PairSeqAlign
- Input Two sequences x,y
- Scoring matrix s
- Gap penalty d
- Gap extension penalty e
-
- Output The optimal sequence alignment
27How Difficult Is This?
- Consider two sequences of length n
- There are
- possible global alignments, and we need to find
an optimal one from amongst those!
28So what?
- So at n 20, we have over 120 billion possible
alignments - We want to be able to align much, much longer
sequences - Some proteins have 1000 amino acids
- Genes can have several thousand base pairs
29Dynamic Programming
- General algorithmic development technique
- Reuses the results of previous computations
- Store intermediate results in a table for reuse
- Look up in table for earlier result to build from
30Global Alignment
- Needleman-Wunsch 1970
- Idea Build up optimal alignment from optimal
alignments of subsequences - Three ways to align x1..i with y1..j
Extend both strings at the same time
xi already aligned, align yj with a gap
IGAxi LGVyj
AIG Axi GVyj--
GAxi-- SLG Vyj
yj already aligned, align xi with a gap
31Global Alignment
- Notation
- xi ith letter of string x
- yj jth letter of string y
- x1..i Prefix of x from letters 1 through I
- F matrix of optimal scores
- F(i,j) represents optimal score lining up x1..i
with y1..j - d gap penalty
- s scoring matrix
32Global Alignment
- The work is to build up F
- Initialize F(0,0) 0, F(i,0) id, F(0,j)jd
- Fill from top left to bottom right using the
recursive relation
33Global Alignment
yj aligned to gap
Move ahead in both
s(xi,yj)
d
d
xi aligned to gap
While building the table, keep track of where
optimal score came from, reverse arrows
34Example
35Completed Table
36Traceback
- Trace arrows back from the lower right to top
left - Diagonal both
- Up upper gap
- Left lower gap
HEAGAWGHE-E --P-AW-HEAE
37Summary
- Uses recursion to fill in intermediate results
table - Uses O(nm) space and time
- O(n2) algorithm
- Feasible for moderate sized sequences, but not
for aligning whole genomes.
38Local Alignment
- Smith-Waterman (1981)
- Another dynamic programming solution
39Example
40Traceback
Start at highest score and traceback to first 0
AWGHE AW-HE
41Summary
- Similar to global alignment algorithm
- For this to work, expected match with random
sequence must have negative score. - Behavior is like global alignment otherwise
- Similar extensions for repeated and overlap
matching - Care must be given to gap penalties to maintain
O(nm) time complexity
42Scoring Matrices
- Substitutions
- Models of substitutions
- PAM
- BLOSUM
- Gap penalties
43DNA
44Transitional and Transversional Nucleotide
Substitutions
- ? ? are rates of transitional and
transversional substitutions, respectively - Generally, ? gt ?
- Possible substitutions (total 16)
- Identical (freq O) 4
- Transitions (P) 4
- Transversions (Q) 8
- Giving us
- p P Q
- R P/Q
- R is usually between 0.5 and 2 for nuclear
genes, higher for mitochondrial genes (up to 15)
45Synonymous and Non-synonymous substitutions
Non-synonymous
Synonymous
Thr Tyr Leu Leu ACC TAT TTG CTG
Thr Tyr Leu Leu ACC TAT TTG CTG
ACC TCT TTG CTG Thr Ser Leu Leu
ACC TAC TTG CTG Thr Tyr Leu Leu
- Synonymous substitutions more likely to occur
- Preserve AA
46Categories of Amino Acids
Grouped according to properties of side chain
47Amino Acid Substitutions
- Tend to preserve chemical similarity
- Tend to preserve structure
- Tend to preserve function
- More frequent in non-functional domains
48Models of Substitution
- Percept Accepted Mutation (PAM)
- Dayhoff 1978
- Accepted Mutation changes accepted by natural
selection - PAM1 represents evolutionary divergence where 1
of amino change - Blocks Amino Acid Substitution Matrices (BLOSUM)
- Henikoff and Henikoff 1992
- Observed AA substitutions in conserved AA blocks
- Maximum level of identity, BLOSUM62 represents
62 identity
49PAM
Probability of transitioning from one state to
another
pstpts
S
T
C
State for amino acid
P
50PAM
- Assumes substitutions are independent
- pxy is calculated from observations
- 1572 changes in 71 groups of proteins
- Organized into phylogenetic trees
- Changes counted
- Divided by normalizing factor
- The probabilities are stored in a matrix
- Probability form
- PAM1 represents 10 my evolutionary distance
- PAMN is derived from PAM1N because Markov Model
is used
51PAM1 for DNA
0.00333
0.00333
0.00333
0.99
52BLOSUM
- 2000 conserved amino acid patterns
- blocks ungapped patterns
- 3-60 AA long
- gt500 families of related proteins
- Software
- MOTIF (H. Smith et al. 1990)
- PROTOMAT (Henikoff and Henikoff)
53Computing BLOSUM Scores
- Consider all pairs (dont know ancestor)
- fAA3216 fAL4 fAS4 fLS1
- Calculate frequency of occurrence
- qAAfAA/(fAAfALfASfLS) 0.4
- Calculate expected frequency of being in a pair
- pA(qAAqAS/2qAL/2)0.66
- Calculate expected frequency of a pair
- eAApApA0.44
- Matrix entry for pair
- mAA qAA/eAA 0.9
A
L
A
S
A
A
54Log Odds Scoring
- Each of the previous matrices are converted to
log odds matrices - DP algorithm based on addition
- log(xy)log(x)log(y)
- Compares real occurrence with random occurence.
- BLOSUM
- sAAlog2(qAA/eAA) 2 -0.304 (will be rounded)
- PAM1 DNA (uniform)
- sCT log2(pCMCT / pCpT)
- log2(0.25 0.00333/ 0.252)
- -6.23
55The PAM250 Matrix
Each matrix value is calculated by first dividing
the frequency of change, for each amino acid
pair, in related proteins separated by one step
in an evolutionary tree by the probability of a
chance alignment based on the frequency of the
amino acids. The ratios are expressed as
logarithms to the base 10 (approx. 1/3 bit
values).
- Note
- High values on diagonal
- High values for similar groups
56The Blosum62 Matrix
Each entry is the actual frequency of occurrence
of the amino acid pair in the blocks database,
clustered at the 62 level, divided by the
expected probability of occurrence. The expected
value is calculated from the frequency of
occurrence of each of the two individual amino
acids in the blocks database,and provides a
measure of a chance alignment of the two amino
acids. The actual/expected ratio is expressed as
a log odds. A zero score means that the frequency
of the amino acid pair in the database was as
expected by chance, a positive score that the
pair was found more often than by chance, and a
negative score that the pair was found less often
than by chance.
57Selecting Matrices
- PAM
- Mutational model of evolution
- Tracks evolutionary origins of proteins/sequences
- Use lower numbers for evolutionarily close
sequences, higher numbers for distance sequences - BLOSUM
- No model of evolution, conserved AA motifs
- Designed to find conserved domains
- Similar sequences, use higher numbers,
- Divergent sequences, use lower numbers.
58GAP Penalties
- Recall d is gap opening penalty, e is gap
extension penalty - Total gap penalty wxde(x-1)
- In order to make things work properly, need
affine gap function (Smith et al. 1981) - wx dx
- Any affine function works
- For the linear function above, e d
- Typical gap penalties (Mount p.113)
- BLOSUM50 d15, e8-15
- PAM250 d15, e5-15
59Next Time
- Put on your math/stats caps
- Significance of scores
- Bayesian statistics
- Bayes block alignment