Title: Search
1Search Sequence Alignment
2Similarity vs. HomologyParalogs vs. Orthologs
- Homology is an evolutionary relationship that
either exists or does not. It cannot be partial. - An ortholog is a homolog with shared function.
- A paralog is a homolog that arose through a gene
duplication event. Paralogs often have divergent
function. - Similarity is a measure of the quality of
alignment between two sequences. High similarity
is evidence for homology. Similar sequences may
be orthologs or paralogs.
3Sequence search alignment
- In order to understand how the example from last
week works, we need to look at databases and
sequence alignment - Database organization is focused on efficiency
- Sequence search doesnt match the traditional
database model perfectly - Start with dynamic programming (a central idea in
computational biology) - Then explore approximations to it (BLAST)
4Computational Complexity
- A key idea in computer science How much work
does it take to solve a class of problems? - How do we measure complexity?
- Relative to problem size
- How long does it take?
- Clock time versus operations
- Order O(?) notation
- Other resources used (particularly space)
5Linear search
- Database
- ACTGA
- TTAGG
- CGTAA
- AGAGA
- CGATA
- CCGGA
- GCCCT
- TTACG
- Test query against each target sequentially
- Worst case, query matches last target and you
have as manytests as targets (size of database) - Average case, test half the targets.
- Linear in the size of the database
6Indexed (binary) search
- Create an sorted set of keys that point to
entries - Start in the middle, then figure out which half
- Eliminate half the database each step, so need
log2 steps at worst - Need to build the index (takes space and time at
each database update)
1
2
3
7Hash tables
- Map each query to an arbitrary number with a
hash function - Use those numbers as an index into a table
- Collisions can happen, but are rare
- Constant time lookup,no index construction
f (TTACG) 8
8How to define a hash function
- Basic must map keys to a number that is within
the size of the table - Desired minimize collisions
- So similar keys should lead to different hashes
- Good general method map key to a number, and
then take the remainder when divided by a prime
number. Specialized hash functions can be
better. - Hash tables are the basis of most database
lookups.
9Approximate searches
- Recall the needs of sequence searches
- Not looking for exact match, but similar
sequences - Database search methods only help us find exact
matches. - Hash tables particularly bad at similar because
we need similar keys to map to different hashes - First, need to define what is similar, then
find efficient ways to search for similar
sequences.
10What counts as similarity?
- Similarity can be defined by counting positions
that match between two sequences - But which positions? Allowing gaps makes a
difference in the number of matching positions
abcdef abcdef abcdef-
abceef acdefg a-cdefg
11Pairwise Sequence Alignment
- Sequence similarity depends on an alignment.
- What is an alignment, and why might it be
significant? - An alignment is a mapping from one sequence to
another. - Biological alignment maps together elements that
are likely to have arisen from a common ancestor - The existence of an alignment with many matches
is an indication of homology
12Not all mismatches are the same
- Some amino acids are more substitutable for each
other than others. Serine and threonine are
more alike than tryptophan and alanine. - We can introduce "mismatch costs" for handling
different substitutions. - We don't usually use mismatch costs in aligning
nucleotide sequences, since no substitution is
per se better than any other.
13Many possible alignments to consider
- Without gaps, there are are NM-1 possible
alignments between sequences of length N and M - Once we start allowing gaps, there are many more
possible arrangements to consider abcbcd
abcbcd abcbcd
abc--d a--bcd ab--cd - This becomes a very large number when we allow
mismatches, since we then need to look at every
possible pairing between elements there are
roughly NM possible alignments. Aligning length
100 sequences this way is impractical
14Avoiding random alignments with a score function
- Not only are there many possible gapped
alignments, but introducing too many gaps makes
nonsense alignments possible
s--e-----qu---en--ce sometimesquipsentice - Want to distinguish between alignments that occur
due to homology, and those that could be expected
to be seen just by chance. - Define a score function that accounts for both
element mismatches and a gap penalty
15Match scores
- Match scores are often calculated on the basis of
the frequency of particular mutations in very
similar sequences. - We can transform substitution frequencies into
log odds scores, which can then be added together.
16An alignment score
- An alignment score is the sum of all the match
scores of an alignment, with a penalty subtracted
for each gap. - Gap penalties are usually "affine" meaning that
the penalty for one long gap is smaller than the
penalty for many smaller gaps that add up to the
same size.a b c - - da c c e f d9 2 7 6
24 - (10 2) 12
Gap start continuationpenalty
Matchscore
AlignmentScore
17Local vs. Global alignments
- A global alignment includes all elements of a
sequence, and includes gaps - A global alignment may or may not include "end
gap" penalties. - A local alignment is includes only subsequences,
and sometimes computed without gaps. - Local alignments can find shared domains in
divergent proteins and are fast to compute - Global alignments are better indicators of
homology and take longer to compute.
18Finding the optimal alignment
- Given a pair of sequences and a score function,
identify the best scoring (optimal) alignment
between the sequences. - Remember, exponential number of possible
alignments (most with terrible scores). - Computer science to the rescue dynamic
programming identifies optimal alignments in time
proportional to the sum of the lengths of the
sequences
19Dynamic programming
- The name comes from an operations research task,
and has nothing to do with writing programs. - The key idea is to start aligning the sequences
left to right - Once a prefix is optimally aligned, nothing about
the remainder of the alignment can change the
alignment of the prefix. - We construct a matrix of possible alignment
scores (NxM2 calculations worst case) and then
"traceback" to find the optimal alignment. - Called Needleman-Wunch or Smith-Waterman
20Dynamic programming alignment
- Each cell contains the score for the best aligned
sequence prefix up to that position. - Start by filling in initial gap and first element
to first element match score - Use arrow to indicate path to that alignment
Align ACD to AACADCD
21Continue filling in optimal path scores
- For each cell, have three choices for how to get
there from the last optimal alignment (match, gap
sequence 1, gap sequence 2). - Best score(s) are selected, and arrows added
indicated route. - From -5 align As
- -5 5 0
- From 5, insert gap
- 5 -5 0
- From -7, insert gap
- -7 -5 -12
22Optimal alignment by traceback
- We traceback a path that gets us the highest
score. If we don't have end gap penalties,
then take any path from the last row or columnto
the first. - Otherwise we needto include the top and bottom
corners - AACADCD---A-CD
23Study guide....
- Dynamic programming alignments are a key
technology in bioinformatics, and you should
understand how they work. - The method is counterintuitive
- Work some examples by hand.
- All of the textbooks describe D-P, and there is
more detail and supplementary material on the
course web site.
24Parameter Selection
- The optimal alignment between a pair of sequences
depends critically on the selection of the score
matrix and the gap penalty. - These sorts of generic inputs to a program are
called parameters. - How do we pick the ones that give the most
biologically meaningful alignments (and alignment
scores?)
25How do we pick match scores?
- For match scores, two main options
- PAM based on global alignments of closely related
sequences. Normalized to changes per 100 sites,
then exponentiated for more distant relatives. - BLOSUM based on local alignments in much more
diverse sequences - Each matrix has versions aimed at different
evolutionary distances. - BLOSUM62 is NCBIs default. The unnumbered
BLOSUM may work better for more evolutionarily
distant sequences.
26Picking gap penalties
- Many different possible forms
- Most common is affine (gap open gap continue
penalities) - More complex penalties have been proposed.
- Penalties must be commensurate with match scores.
Therefore, the match scoring scheme influences
the gap penalty - Most alignment programs suggest appropriate
penalties for each match score option.
27Searching for optimal scores
- One possibility is to try several different match
score and gap penalties, and choose the best - In general, this is called parameter space search
and it is important in many areas. - Problems
- requires a lot computation
- we need some principled way to compare the
results. - Use significance testing to compare...
28The significance of an alignment
- Significance testing is the branch of statistics
that is concerned with assessing the probability
that a particular result could have occurred by
chance. - How do we calculate the probability that an
alignment occurred by chance? - Either with a model of evolution, or
- Empirically, by scrambling our sequences and
calculating scores on many randomized (and by
assumption unrelated) sequences.
29Why BLAST?
- Dynamic programming solutions to alignment
problems are relatively slow, and don't lend
themselves to efficient database search. - Time complexity proportional to the size of the
database. - Need some way to search a large database to find
sequences that have an inexact match to a query
sequence - BLAST an imperfect approximation to DP. DP
finds some distantly related sequences the
approximations don't
30Sequence search basics
- BLAST is 50-100x faster than DP
- Proper use is similar to DP
- Use appropriate substitution and gap scores
- BLOSUM62 is good for weak protein similarities
- Use PAM30, PAM70 or BLOSUM45 for better results
on more similar sequences, BLOSUM80 for most
distant - Use low-complexity (repetitive seq) filters and
filter out human repeats (ALUs, etc) - If searching for coding regions, always translate
nucleotide to amino acid sequence.
31How BLAST works
- Break sequence into overlapping words, by
default of length 3. - Sequence of length n makes n-m1 m-size words
ABCDE ? ABC, BCD, CDE - For each word, define 50 other words that are
similar (use substitution matrix threshold T) - Repeat for each of the n-m1 words, giving about
50n words (out of 2038000 possible) - Use a hash table to find all places in DB with an
exact match to any of those words.
32Extending HSPs
- Identify database sequences that contain several
matching words on the same diagonal (think DP
alignments) and within a short distance. - Extend these short, ungapped alignments in both
directions along the sequence so long as score of
alignment increases. - BLAST alignments scored simply with a log-odds
matrix no gap penalties at this point. - Call these extended alignments HSPs for high
scoring pairs
33Is an HSP Significant?
- What is the probability of scoring at least as
large as x by chance? - Extreme value (not Normal!) distributionWh
ere m is size of the database, n is length of
query, and l is average length of alignment
between two random sequences of those lengths
using this scoring scheme. - Called E value for expectation (analogous to p
value)
34K and ?
- Parameters of the extreme value distribution
- Depend on the particular substitution matrix
- Estimated by aligning a lot of random sequences
drawn on a particular distribution of amino
acids, and fitting the extreme value distribution
to those alignments - These empirical estimates may not be correct
(error in the assumed distribution of AAs used to
create the random sequences) but seem to be
reasonably close.
35BLAST2 add gaps
- Multiple HSPs in one target sequence ?
possibility of gapped alignment. - Combine HSP scores to score whole sequence
- Add HSP scores
- Adjust K and ? for this scoring method
- Set modest e-value threshold to identify
reasonable target set - Use DP to produce final gapped alignments
- Run DP on the (relatively) small number of
database sequences that were above the threshold
with multiple HSPs
36Practical Gapped BLAST
- Default on NCBI web site
- BLAST versus DP on whole databases
- Still might miss some alignments DP would find as
database search tool - DP on fractions of the database (e.g. all human
sequences) can be done with parallel hardware,
but computational complexity scales with database
size. - BLAST allows users to set certain gap penalties,
word sizes and thresholds in Advanced settings
but not all (since K ? have to be calculated in
advance)