Title: Pairwise Sequence Alignment
1 Pairwise Sequence Alignment
- What is an alignment, and why might it be significant?
- An alignment is a mapping from one sequence to another, identifying elements that are likely to have arisen from a common ancestor.
- A good alignment is an indication of homology.
- Alignments are NOT exact matches. We will need a method to find good alignments in a database...
2 Similarity vs. Homology; Paralogs vs. Orthologs
- Homology is an evolutionary relationship that either exists or does not. It cannot be partial.
- An ortholog is a homolog that arose through a speciation event. Orthologs usually share function.
- A paralog is a homolog that arose through a gene duplication event. Paralogs often have divergent function.
- Similarity is a measure of the quality of alignment between two sequences. High similarity is evidence for homology. Similar sequences may be orthologs or paralogs.
3 How do we compute similarity?
- Similarity can be defined by counting positions that are identical between two sequences
- Gaps (insertions/deletions) can be important

abcdef   abcdef   abcdef
abceef   acdef    a-cdef
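
A minimal sketch of identity counting, using the three pairings shown above. The function name `count_identities` and the choice to pad the unaligned pair with a trailing "-" are assumptions made for illustration.

```python
def count_identities(top: str, bottom: str, gap: str = "-") -> int:
    """Count columns where both rows hold the same non-gap character."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    return sum(1 for a, b in zip(top, bottom) if a == b and a != gap)


print(count_identities("abcdef", "abceef"))  # one mismatch
print(count_identities("abcdef", "acdef-"))  # deletion, no internal gap
print(count_identities("abcdef", "a-cdef"))  # same deletion, gap placed
```

Placing the gap recovers the identities that the shifted, ungapped pairing loses, which is why gaps matter.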
4 Not all mismatches are the same
- Some amino acids are more substitutable for each other than others. Serine and threonine are more alike than tryptophan and alanine.
- We can introduce "mismatch costs" for handling different substitutions.
- We don't usually use mismatch costs in aligning nucleotide sequences, since no substitution is per se better than any other.
5 Many possible alignments to consider
- Without gaps, there are NxM possible alignments between sequences of length N and M
- Once we start allowing gaps, there are many possible arrangements to consider

abcbcd   abcbcd   abcbcd
abc--d   a--bcd   ab--cd

- This becomes a very large number when we allow mismatches, since we then need to look at every possible pairing between elements: there are roughly N^M possible alignments.
6 Exponential computations get big fast
- If n = m = 100, there are 100^100 = 10^200 =
100,000,000,000,000,000,000,000,000,000,000,000,00
0,000,000,000,000,000,000,000,000,000,000,000,000,
000,000,000,000,000,000,000,000,000,000,000,000,00
0,000,000,000,000,000,000,000,000,000,000,000,000,
000,000,000,000,000,000,000,000,000,000,000,000,00
0,000,000,000,000 different alignments.
- And 100 amino acids is a small protein!
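
The arithmetic on this slide can be checked directly. This is only a sanity check of the count, assuming the rough N^M estimate from the previous slide.

```python
n = m = 100
count = n ** m  # Python integers have arbitrary precision

print(count == 10 ** 200)  # 100^100 and 10^200 are the same number
print(len(str(count)))     # number of decimal digits in the count
```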
7 Avoiding random alignments with a score function
- Not only are there many possible gapped alignments, but introducing too many gaps makes nonsense alignments possible

s--e-----qu---en--ce
sometimesquipsentice

- Need to distinguish between alignments that occur due to homology, and those that could be expected to be seen just by chance.
- Define a score function that accounts for both element mismatches and a gap penalty
8 Match scores
- Match scores are often calculated on the basis of the frequency of particular mutations in very similar sequences.
- We can transform substitution frequencies into log odds scores, which can then be added together.
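
A sketch of the log odds idea, assuming the usual form: the log of how often a pair is observed aligned, divided by how often it would be expected by chance. The frequencies below are hypothetical, made up for illustration, and `log_odds` is not taken from any library.

```python
import math


def log_odds(pair_freq: float, freq_a: float, freq_b: float) -> float:
    """log2 of the observed pair frequency over the chance expectation."""
    return math.log2(pair_freq / (freq_a * freq_b))


# Hypothetical frequencies, for illustration only.
common = log_odds(pair_freq=0.02, freq_a=0.1, freq_b=0.1)   # more often than chance
rare = log_odds(pair_freq=0.005, freq_a=0.1, freq_b=0.1)    # less often than chance

# Adding log odds scores corresponds to multiplying the underlying odds ratios.
combined = math.log2((0.02 / (0.1 * 0.1)) * (0.005 / (0.1 * 0.1)))
print(common, rare, math.isclose(common + rare, combined, abs_tol=1e-9))
```

Pairs seen more often than chance get positive scores and pairs seen less often get negative ones, so a sum over columns rewards plausible substitutions.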
9 Local vs. Global alignments
- A global alignment includes all elements of a sequence, and includes gaps
- A global alignment may or may not include "end gap" penalties.
- A local alignment includes only subsequences, and is sometimes computed without gaps.
- Local alignments can find shared domains in divergent proteins and are fast to compute
- Global alignments are better indicators of homology and take longer to compute.
10 An alignment score
- An alignment score is the sum of all the match scores of an alignment, with a penalty subtracted for each gap.
- Gap penalties are usually "affine", meaning that the penalty for one long gap is smaller than the penalty for many smaller gaps that add up to the same size.

a  b  c  -  -  d
a  c  c  e  f  d
9  2  7        6

Alignment score = match score - (gap start + continuation penalty) = 24 - (10 + 2) = 12
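
The worked example on this slide can be reproduced with a short sketch. The match scores (9, 2, 7, 6) and the penalties (10 to start a gap, 2 to continue it) are read off the slide; the function name and the dictionary layout are assumptions.

```python
def alignment_score(top, bottom, match_scores, gap_open=10, gap_extend=2):
    """Sum per-column match scores and subtract an affine penalty per gap run."""
    total = 0
    in_gap = False
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            # First gap column pays the start penalty, later ones the continuation.
            total -= gap_extend if in_gap else gap_open
            in_gap = True
        else:
            total += match_scores[(a, b)]
            in_gap = False
    return total


# Column scores from the slide: a/a = 9, b/c = 2, c/c = 7, d/d = 6.
scores = {("a", "a"): 9, ("b", "c"): 2, ("c", "c"): 7, ("d", "d"): 6}
print(alignment_score("abc--d", "accefd", scores))  # 24 - (10 + 2) = 12
```

As a simplification, a gap in one row that is immediately followed by a gap in the other row is treated here as a single run.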
11 Finding the optimal alignment
- Given a pair of sequences and a score function, identify the best scoring (optimal) alignment between the sequences.
- Remember, exponential number of possible alignments (most with terrible scores).
- Computer science to the rescue: dynamic programming identifies optimal alignments in time roughly proportional to the product of the lengths of the sequences
12 Dynamic programming
- The name comes from an operations research task, and has nothing to do with writing programs.
- The key idea is to start aligning the sequences left to right: once a prefix is optimally aligned, nothing about the remainder of the alignment changes the alignment of the prefix.
- We construct a matrix of possible alignment scores (N x M^2 calculations worst case) and then "traceback" to find the optimal alignment.
- Called Needleman-Wunsch (global alignment) or Smith-Waterman (local alignment)
13 Alignment matrix
- Create a matrix with each sequence to be aligned along one edge and the score of the alignment of each pair of elements in a cell.
- Best local alignment is just the highest scoring diagonal
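
A sketch of the "highest scoring diagonal" idea for ungapped local alignment. It assumes simple match/mismatch scores of +1/-1 rather than a real substitution matrix, and the function name is hypothetical. Each diagonal of the matrix is scanned for its best-scoring contiguous run.

```python
def best_ungapped_local(s, t, match=1, mismatch=-1):
    """Return (score, segment_of_s, segment_of_t) for the best diagonal run."""
    best = (0, "", "")
    for offset in range(-(len(t) - 1), len(s)):
        # Walk one diagonal: s[i] is paired with t[i - offset].
        i = max(offset, 0)
        j = i - offset
        run_score, run_start = 0, i
        while i < len(s) and j < len(t):
            run_score += match if s[i] == t[j] else mismatch
            if run_score <= 0:
                # A non-positive run cannot help; restart after this cell.
                run_score, run_start = 0, i + 1
            elif run_score > best[0]:
                length = i - run_start + 1
                best = (run_score, s[run_start:i + 1], t[j - length + 1:j + 1])
            i += 1
            j += 1
    return best


print(best_ungapped_local("xxabcdyy", "zabcdz"))
```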
14 Dynamic programming matrix
- Each cell has the score for the best aligned sequence prefix up to that position.
- Number in ( )s is the alignment score for the pair of amino acids at that position.
- Gap penalty here is -12 to start and -4 to continue.
15 Optimal alignment by traceback
- We trace back a path that gets us the highest score. If we don't have end gap penalties, then take any path from the last row or column to the first.
- Otherwise we need to include the top and bottom corners
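
The matrix fill and traceback from the last few slides can be sketched as follows. This is a simplified global alignment, not the exact scheme on the slides: it assumes a linear gap penalty instead of an affine one, made-up scores of +1/-1/-2, and end gap penalties, so the traceback runs from the bottom corner to the top corner.

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-2):
    """Global alignment with a linear gap penalty; returns (score, row1, row2)."""
    n, m = len(s), len(t)
    # score[i][j] = best score for aligning the prefixes s[:i] and t[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align s[i-1] with t[j-1]
                score[i - 1][j] + gap,      # s[i-1] against a gap
                score[i][j - 1] + gap,      # t[j-1] against a gap
            )
    # Traceback: follow a move that explains each cell's score.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        diagonal = i > 0 and j > 0
        sub = match if diagonal and s[i - 1] == t[j - 1] else mismatch
        if diagonal and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(s[i - 1])
            bottom.append(t[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(s[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(t[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


print(needleman_wunsch("abcdef", "acdef"))
```

When several moves explain a cell equally well, this sketch prefers the diagonal, so it reports one optimal alignment out of possibly many.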
16 Study guide....
- Dynamic programming alignments are a key technology in bioinformatics, and you should understand how they work.
- The method is counterintuitive
- Work some examples by hand. The textbook has a very good explanation, and there is more detail and supplementary material on the textbook web site, www.bioinformaticsonline.org
17 How do we pick match scores?
- For match scores, two main options
  - PAM: based on global alignments of closely related sequences. Normalized to changes per 100 sites, then exponentiated for more distant relatives.
  - BLOSUM: based on local alignments in much more diverse sequences
- Picking the right distance is important, and may be hard to do. BLOSUM seems to work better for more evolutionarily distant sequences. BLOSUM62 is a good default.
18 Picking gap penalties
- Many different possible forms
- Most common is affine (gap open + gap continue penalties)
- More complex penalties have been proposed.
- Penalties must be commensurate with match scores. Therefore, the match scoring scheme influences the gap penalty
- Most alignment programs suggest appropriate penalties for each match score option.
19 Searching for optimal scores
- One possibility is to try several different match scores and gap penalties, and choose the best result.
- In general, this is called parameter space search and it is important in many areas.
- Problems
  - requires a lot of computation
  - we need some principled way to compare the results.
- Use significance testing to compare...
20 The significance of an alignment
- Significance testing is the branch of statistics that is concerned with assessing the probability that a particular result could have occurred by chance.
- How do we calculate the probability that an alignment occurred by chance?
  - Either with a model of evolution, or
  - Empirically, by scrambling our sequences and calculating scores on many randomized sequences.
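
A sketch of the empirical approach: score the real pair, then score many scrambled versions of one sequence and see how often chance does as well. The scoring function is a simplified global alignment score (linear gap penalty, made-up +1/-1/-2 values), and the trial count, seed, and add-one correction are assumptions chosen for illustration.

```python
import random


def global_score(s, t, match=1, mismatch=-1, gap=-2):
    """Best global alignment score with a linear gap penalty, one row at a time."""
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * gap]
        for j in range(1, len(t) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def empirical_p_value(s, t, trials=200, seed=0):
    """Fraction of scrambled-sequence scores that reach the observed score."""
    rng = random.Random(seed)
    observed = global_score(s, t)
    letters = list(t)
    hits = 0
    for _ in range(trials):
        rng.shuffle(letters)  # scrambling keeps composition, destroys order
        if global_score(s, "".join(letters)) >= observed:
            hits += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (hits + 1) / (trials + 1)


print(empirical_p_value("abcdefghij", "abcdefghij"))
```

Scrambling preserves the composition of the sequence, so the comparison asks whether the order of the elements, and not just their frequencies, explains the score.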
21 For next week
- Read Mount, Chapter 3 on pairwise sequence alignment.
- Finish Assignment 1. Start Assignment 2.