Pairwise Sequence Alignment

Slides: 22

Provided by: leahHa2

Transcript and Presenter's Notes

1
Pairwise Sequence Alignment

What is an alignment, and why might it be
significant?
An alignment is a mapping from one sequence to
another, identifying elements that are likely to
have arisen from a common ancestor
A good alignment is an indication of homology
Alignments are NOT exact matches. We will need a
method to find good alignments in a database...

2
Similarity vs. Homology; Paralogs vs. Orthologs

Homology is an evolutionary relationship that
either exists or does not. It cannot be partial.
An ortholog is a homolog that arose through a
speciation event. Orthologs usually share function.
A paralog is a homolog that arose through a gene
duplication event. Paralogs often have divergent
function.
Similarity is a measure of the quality of
alignment between two sequences. High similarity
is evidence for homology. Similar sequences may
be orthologs or paralogs.

3
How do we compute similarity?

Similarity can be defined by counting positions
that are identical between two sequences
Gaps (insertions/deletions) can be important
abcdef    abcdef    abcdef
abceef    acdef     a-cdef
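
As a minimal sketch of the counting idea above, the function below scores two already-aligned strings by the fraction of columns that hold identical characters. The name percent_identity and the choice to count a gap column as a non-match are assumptions made for illustration, not something the slides specify.

```python
def percent_identity(a: str, b: str) -> float:
    """Fraction of aligned columns holding identical, non-gap characters."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    if not a:
        return 0.0
    # A column counts only if both characters agree and neither is a gap.
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return matches / len(a)


# First and third examples from the slide: both have 5 identical columns of 6.
print(percent_identity("abcdef", "abceef"))
print(percent_identity("abcdef", "a-cdef"))
```

The second example on the slide (abcdef over acdef) cannot be scored this way until a gap is placed, which is why gaps matter.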

4
Not all mismatches are the same

Some amino acids are more substitutable for each
other than others. Serine and threonine are
more alike than tryptophan and alanine.
We can introduce "mismatch costs" for handling
different substitutions.
We don't usually use mismatch costs in aligning
nucleotide sequences, since no substitution is
per se better than any other.

5
Many possible alignments to consider

Without gaps, there are NxM possible
alignments between sequences of length N and M
Once we start allowing gaps, there are many
possible arrangements to consider:
abcbcd    abcbcd    abcbcd
abc--d    a--bcd    ab--cd
This becomes a very large number when we allow
mismatches, since we then need to look at every
possible pairing between elements: there are
roughly N^M possible alignments.

6
Exponential computations get big fast

If n = m = 100, there are 100^100 = 10^200
(a 1 followed by 200 zeros) different alignments.
And 100 amino acids is a small protein!
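
The arithmetic on this slide can be checked directly, since Python integers have arbitrary precision. This is only a sanity check of the slide's figure; the variable names are illustrative.

```python
n = m = 100
count = n ** m  # the slide's rough estimate, N^M

# 100^100 = (10^2)^100 = 10^200
print(count == 10 ** 200)

# Number of decimal digits: a 1 followed by 200 zeros
print(len(str(count)))
```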

7
Avoiding random alignments with a score function

Not only are there many possible gapped
alignments, but introducing too many gaps makes
nonsense alignments possible:
s--e-----qu---en--ce
sometimesquipsentice
Need to distinguish between alignments that occur
due to homology, and those that could be expected
to be seen just by chance.
Define a score function that accounts for both
element mismatches and a gap penalty

8
Match scores

Match scores are often calculated on the basis
of the frequency of particular mutations in
very similar sequences.
We can transform substitution frequencies into
log odds scores, which can then be added
together.
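
A minimal sketch of the log odds transformation, assuming the usual form: the observed frequency of a pair of residues divided by the frequency expected if the two residues were paired by chance. The function name and all numbers below are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import math


def log_odds(pair_freq: float, freq_a: float, freq_b: float) -> float:
    """log2 of the observed pair frequency over the chance expectation."""
    return math.log2(pair_freq / (freq_a * freq_b))


# Hypothetical frequencies, for illustration only.
favoured = log_odds(0.02, 0.1, 0.1)     # seen twice as often as chance
disfavoured = log_odds(0.005, 0.1, 0.1)  # seen half as often as chance
print(favoured, disfavoured)

# Adding log odds scores corresponds to multiplying the odds ratios,
# which is why per-position scores can be summed along an alignment.
total = favoured + disfavoured
print(total)
```

Pairs seen more often than chance get positive scores, and pairs seen less often get negative scores.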

9
Local vs. Global alignments

A global alignment includes all elements of a
sequence, and includes gaps
A global alignment may or may not include "end
gap" penalties.
A local alignment includes only subsequences,
and is sometimes computed without gaps.
Local alignments can find shared domains in
divergent proteins and are fast to compute
Global alignments are better indicators of
homology and take longer to compute.

10
An alignment score

An alignment score is the sum of all the match
scores of an alignment, with a penalty subtracted
for each gap.
Gap penalties are usually "affine", meaning that
the penalty for one long gap is smaller than the
penalty for many smaller gaps that add up to the
same size.
a b c - - d
a c c e f d
9 2 7     6   =>   24 - (10 + 2) = 12
Match scores: 9 + 2 + 7 + 6 = 24
Gap start + continuation penalty: 10 + 2
Alignment score: 12

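
The worked example on this slide can be sketched as a small scoring function. It assumes the reading used on the slide: the first position of a gap costs the gap start penalty (10) and each further position costs the continuation penalty (2). The function name, the dictionary of match scores, and the simplification of treating adjacent gaps in either sequence as one run are assumptions for illustration.

```python
def alignment_score(top, bottom, match_score, gap_start=10, gap_continue=2):
    """Sum match scores and subtract an affine penalty for each gap run."""
    total = 0
    in_gap = False
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            # First gap position pays gap_start, later ones gap_continue.
            total -= gap_continue if in_gap else gap_start
            in_gap = True
        else:
            total += match_score[(x, y)]
            in_gap = False
    return total


# Match scores taken from the slide's example columns.
scores = {("a", "a"): 9, ("b", "c"): 2, ("c", "c"): 7, ("d", "d"): 6}
print(alignment_score("abc--d", "accefd", scores))  # 24 - (10 + 2) = 12
```

With these penalties, two separate one-position gaps would cost 20 rather than 12, which is the sense in which one long gap is cheaper than several short ones.
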
11
Finding the optimal alignment

Given a pair of sequences and a score function,
identify the best scoring (optimal) alignment
between the sequences.
Remember, exponential number of possible
alignments (most with terrible scores).
Computer science to the rescue: dynamic
programming identifies optimal alignments in time
roughly proportional to the product of the
lengths of the sequences

12
Dynamic programming

The name comes from an operations research task,
and has nothing to do with writing programs.
The key idea is to start aligning the sequences
left to right: once a prefix is optimally
aligned, nothing about the remainder of the
alignment changes the alignment of the prefix.
We construct a matrix of possible alignment
scores (N x M^2 calculations worst case) and then
"traceback" to find the optimal alignment.
Called Needleman-Wunsch (global) or
Smith-Waterman (local)
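
The matrix fill and traceback described above can be sketched as follows. This is a simplified global alignment in the Needleman-Wunsch style, assuming a single linear gap penalty and a flat match/mismatch score rather than the affine penalties and substitution matrices discussed elsewhere in the slides. The function name and default scores are illustrative.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for a global alignment."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning prefix a[:i] with prefix b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # pair a[i-1], b[j-1]
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    # Traceback from the bottom-right corner to the top-left corner.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


best, row_a, row_b = needleman_wunsch("abcdef", "acdef")
print(best)
print(row_a)
print(row_b)
```

With a linear gap penalty each cell looks at only three neighbours, so this sketch does work proportional to N x M; more general gap penalties require looking further back along each row and column.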

13
Alignment matrix

Create a matrix with each sequence to be aligned
along one edge and the score of the alignment of
each pair of elements in a cell.
Best local alignment is just the highest
scoring diagonal
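
For the ungapped case, "highest scoring diagonal" can be sketched by scanning every diagonal of the matrix and keeping the best-scoring contiguous run. The function name, the flat match/mismatch scores, and the returned tuple layout are assumptions for illustration.

```python
def best_ungapped_local(a, b, match=1, mismatch=-1):
    """Return (score, start_in_a, start_in_b, length) of the best diagonal run."""
    best = (0, 0, 0, 0)
    # Each diagonal is identified by its offset i - j.
    for offset in range(-(len(b) - 1), len(a)):
        i, j = max(offset, 0), max(-offset, 0)
        run, start = 0, (i, j)
        while i < len(a) and j < len(b):
            s = match if a[i] == b[j] else mismatch
            if run <= 0:
                # A non-positive run cannot help; restart here.
                run, start = s, (i, j)
            else:
                run += s
            if run > best[0]:
                best = (run, start[0], start[1], i - start[0] + 1)
            i += 1
            j += 1
    return best


score, ia, ib, length = best_ungapped_local("xxabcyy", "zabcw")
print(score, "xxabcyy"[ia:ia + length], "zabcw"[ib:ib + length])
```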

14
Dynamic programming matrix

Each cell has the score for the best aligned
sequence prefix up to that position.
Number in ( )s is the alignment score for the
pair of amino acids at that position.
Gap penalty here is -12 to start and -4 to
continue.

15
Optimal alignment by traceback

We traceback a path that gets us the highest
score. If we don't have end gap penalties,
then take any path from the last row or column to
the first.
Otherwise we need to include the top and bottom
corners

16
Study guide....

Dynamic programming alignments are a key
technology in bioinformatics, and you should
understand how they work.
The method is counterintuitive
Work some examples by hand. The textbook has a
very good explanation, and there is more detail
and supplementary material on the textbook web
site, www.bioinformaticsonline.org

17
How do we pick match scores?

For match scores, there are two main options:
PAM: based on global alignments of closely related
sequences. Normalized to changes per 100 sites,
then exponentiated for more distant relatives.
BLOSUM: based on local alignments in much more
diverse sequences.
Picking the right distance is important, and may
be hard to do. BLOSUM seems to work better for
more evolutionarily distant sequences. BLOSUM62
is a good default.

18
Picking gap penalties

Many different possible forms
Most common is affine (gap open + gap continue
penalties)
More complex penalties have been proposed.
Penalties must be commensurate with match scores.
Therefore, the match scoring scheme influences
the gap penalty
Most alignment programs suggest appropriate
penalties for each match score option.

19
Searching for optimal scores

One possibility is to try several different match
score and gap penalties, and choose the best
result.
In general, this is called parameter space search
and it is important in many areas.
Problems:
It requires a lot of computation.
We need some principled way to compare the
results.
Use significance testing to compare...

20
The significance of an alignment

Significance testing is the branch of statistics
that is concerned with assessing the probability
that a particular result could have occurred by
chance.
How do we calculate the probability that an
alignment occurred by chance?
Either with a model of evolution, or
Empirically, by scrambling our sequences and
calculating scores on many randomized sequences.
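
The empirical approach can be sketched as a shuffle test: score the real pair, then score one sequence against many scrambled copies of the other and see how often chance does as well. The scoring function here is a simplified global alignment with a linear gap penalty; the function names, default scores, trial count, and the (hits + 1) / (trials + 1) estimate are all assumptions for illustration.

```python
import random


def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score only, keeping one matrix row at a time."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[-1]


def empirical_p_value(a, b, trials=200, seed=0):
    """Estimate how often a scrambled b scores at least as well as the real b."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = global_score(a, b)
    letters = list(b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(letters)  # scrambling keeps composition, destroys order
        if global_score(a, "".join(letters)) >= observed:
            hits += 1
    # Add one to both counts so the estimate is never exactly zero.
    return (hits + 1) / (trials + 1)


seq = "ACDEFGHIKLMNPQRSTVWY"
print(empirical_p_value(seq, seq))
```

A small value suggests the observed score is unlikely to arise from composition alone; the estimate can be no smaller than 1 / (trials + 1), so more trials are needed to resolve very small probabilities.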
