Search - PowerPoint PPT Presentation

1 / 36

About This Presentation

Title:

Search

Description:

Sequence search doesn't match the traditional database model perfectly ... Similarity can be defined by counting positions that match between two sequences ... – PowerPoint PPT presentation

Number of Views:43

Avg rating:3.0/5.0

Slides: 37

Provided by: compbi

Category:

more less

Transcript and Presenter's Notes

Title: Search

1
Search Sequence Alignment
2
Similarity vs. HomologyParalogs vs. Orthologs

Homology is an evolutionary relationship that
either exists or does not. It cannot be partial.
An ortholog is a homolog with shared function.
A paralog is a homolog that arose through a gene
duplication event. Paralogs often have divergent
function.
Similarity is a measure of the quality of
alignment between two sequences. High similarity
is evidence for homology. Similar sequences may
be orthologs or paralogs.

3
Sequence search alignment

In order to understand how the example from last
week works, we need to look at databases and
sequence alignment
Database organization is focused on efficiency
Sequence search doesnt match the traditional
database model perfectly
Start with dynamic programming (a central idea in
computational biology)
Then explore approximations to it (BLAST)

4
Computational Complexity

A key idea in computer science How much work
does it take to solve a class of problems?
How do we measure complexity?
Relative to problem size
How long does it take?
Clock time versus operations
Order O(?) notation
Other resources used (particularly space)

5
Linear search

Database
ACTGA
TTAGG
CGTAA
AGAGA
CGATA
CCGGA
GCCCT
TTACG

Test query against each target sequentially
Worst case, query matches last target and you
have as manytests as targets (size of database)
Average case, test half the targets.
Linear in the size of the database

6
Indexed (binary) search

Create an sorted set of keys that point to
entries
Start in the middle, then figure out which half
Eliminate half the database each step, so need
log2 steps at worst
Need to build the index (takes space and time at
each database update)

1
2
3
7
Hash tables

Map each query to an arbitrary number with a
hash function
Use those numbers as an index into a table
Collisions can happen, but are rare
Constant time lookup,no index construction

f (TTACG) 8
8
How to define a hash function

Basic must map keys to a number that is within
the size of the table
Desired minimize collisions
So similar keys should lead to different hashes
Good general method map key to a number, and
then take the remainder when divided by a prime
number. Specialized hash functions can be
better.
Hash tables are the basis of most database
lookups.

9
Approximate searches

Recall the needs of sequence searches
Not looking for exact match, but similar
sequences
Database search methods only help us find exact
matches.
Hash tables particularly bad at similar because
we need similar keys to map to different hashes
First, need to define what is similar, then
find efficient ways to search for similar
sequences.

10
What counts as similarity?

Similarity can be defined by counting positions
that match between two sequences
But which positions? Allowing gaps makes a
difference in the number of matching positions
abcdef abcdef abcdef-
abceef acdefg a-cdefg

11
Pairwise Sequence Alignment

Sequence similarity depends on an alignment.
What is an alignment, and why might it be
significant?
An alignment is a mapping from one sequence to
another.
Biological alignment maps together elements that
are likely to have arisen from a common ancestor
The existence of an alignment with many matches
is an indication of homology

12
Not all mismatches are the same

Some amino acids are more substitutable for each
other than others. Serine and threonine are
more alike than tryptophan and alanine.
We can introduce "mismatch costs" for handling
different substitutions.
We don't usually use mismatch costs in aligning
nucleotide sequences, since no substitution is
per se better than any other.

13
Many possible alignments to consider

Without gaps, there are are NM-1 possible
alignments between sequences of length N and M
Once we start allowing gaps, there are many more
possible arrangements to consider abcbcd
abcbcd abcbcd
abc--d a--bcd ab--cd
This becomes a very large number when we allow
mismatches, since we then need to look at every
possible pairing between elements there are
roughly NM possible alignments. Aligning length
100 sequences this way is impractical

14
Avoiding random alignments with a score function

Not only are there many possible gapped
alignments, but introducing too many gaps makes
nonsense alignments possible
s--e-----qu---en--ce sometimesquipsentice
Want to distinguish between alignments that occur
due to homology, and those that could be expected
to be seen just by chance.
Define a score function that accounts for both
element mismatches and a gap penalty

15
Match scores

Match scores are often calculated on the basis of
the frequency of particular mutations in very
similar sequences.
We can transform substitution frequencies into
log odds scores, which can then be added together.

16
An alignment score

An alignment score is the sum of all the match
scores of an alignment, with a penalty subtracted
for each gap.
Gap penalties are usually "affine" meaning that
the penalty for one long gap is smaller than the
penalty for many smaller gaps that add up to the
same size.a b c - - da c c e f d9 2 7 6
24 - (10 2) 12

Gap start continuationpenalty
Matchscore
AlignmentScore
17
Local vs. Global alignments

A global alignment includes all elements of a
sequence, and includes gaps
A global alignment may or may not include "end
gap" penalties.
A local alignment is includes only subsequences,
and sometimes computed without gaps.
Local alignments can find shared domains in
divergent proteins and are fast to compute
Global alignments are better indicators of
homology and take longer to compute.

18
Finding the optimal alignment

Given a pair of sequences and a score function,
identify the best scoring (optimal) alignment
between the sequences.
Remember, exponential number of possible
alignments (most with terrible scores).
Computer science to the rescue dynamic
programming identifies optimal alignments in time
proportional to the sum of the lengths of the
sequences

19
Dynamic programming

The name comes from an operations research task,
and has nothing to do with writing programs.
The key idea is to start aligning the sequences
left to right
Once a prefix is optimally aligned, nothing about
the remainder of the alignment can change the
alignment of the prefix.
We construct a matrix of possible alignment
scores (NxM2 calculations worst case) and then
"traceback" to find the optimal alignment.
Called Needleman-Wunch or Smith-Waterman

20
Dynamic programming alignment

Each cell contains the score for the best aligned
sequence prefix up to that position.
Start by filling in initial gap and first element
to first element match score
Use arrow to indicate path to that alignment

Align ACD to AACADCD
21
Continue filling in optimal path scores

For each cell, have three choices for how to get
there from the last optimal alignment (match, gap
sequence 1, gap sequence 2).
Best score(s) are selected, and arrows added
indicated route.
From -5 align As
-5 5 0
From 5, insert gap
5 -5 0
From -7, insert gap
-7 -5 -12

22
Optimal alignment by traceback

We traceback a path that gets us the highest
score. If we don't have end gap penalties,
then take any path from the last row or columnto
the first.
Otherwise we needto include the top and bottom
corners
AACADCD---A-CD

23
Study guide....

Dynamic programming alignments are a key
technology in bioinformatics, and you should
understand how they work.
The method is counterintuitive
Work some examples by hand.
All of the textbooks describe D-P, and there is
more detail and supplementary material on the
course web site.

24
Parameter Selection

The optimal alignment between a pair of sequences
depends critically on the selection of the score
matrix and the gap penalty.
These sorts of generic inputs to a program are
called parameters.
How do we pick the ones that give the most
biologically meaningful alignments (and alignment
scores?)

25
How do we pick match scores?

For match scores, two main options
PAM based on global alignments of closely related
sequences. Normalized to changes per 100 sites,
then exponentiated for more distant relatives.
BLOSUM based on local alignments in much more
diverse sequences
Each matrix has versions aimed at different
evolutionary distances.
BLOSUM62 is NCBIs default. The unnumbered
BLOSUM may work better for more evolutionarily
distant sequences.

26
Picking gap penalties

Many different possible forms
Most common is affine (gap open gap continue
penalities)
More complex penalties have been proposed.
Penalties must be commensurate with match scores.
Therefore, the match scoring scheme influences
the gap penalty
Most alignment programs suggest appropriate
penalties for each match score option.

27
Searching for optimal scores

One possibility is to try several different match
score and gap penalties, and choose the best
In general, this is called parameter space search
and it is important in many areas.
Problems
requires a lot computation
we need some principled way to compare the
results.
Use significance testing to compare...

28
The significance of an alignment

Significance testing is the branch of statistics
that is concerned with assessing the probability
that a particular result could have occurred by
chance.
How do we calculate the probability that an
alignment occurred by chance?
Either with a model of evolution, or
Empirically, by scrambling our sequences and
calculating scores on many randomized (and by
assumption unrelated) sequences.

29
Why BLAST?

Dynamic programming solutions to alignment
problems are relatively slow, and don't lend
themselves to efficient database search.
Time complexity proportional to the size of the
database.
Need some way to search a large database to find
sequences that have an inexact match to a query
sequence
BLAST an imperfect approximation to DP. DP
finds some distantly related sequences the
approximations don't

30
Sequence search basics

BLAST is 50-100x faster than DP
Proper use is similar to DP
Use appropriate substitution and gap scores
BLOSUM62 is good for weak protein similarities
Use PAM30, PAM70 or BLOSUM45 for better results
on more similar sequences, BLOSUM80 for most
distant
Use low-complexity (repetitive seq) filters and
filter out human repeats (ALUs, etc)
If searching for coding regions, always translate
nucleotide to amino acid sequence.

31
How BLAST works

Break sequence into overlapping words, by
default of length 3.
Sequence of length n makes n-m1 m-size words
ABCDE ? ABC, BCD, CDE
For each word, define 50 other words that are
similar (use substitution matrix threshold T)
Repeat for each of the n-m1 words, giving about
50n words (out of 2038000 possible)
Use a hash table to find all places in DB with an
exact match to any of those words.

32
Extending HSPs

Identify database sequences that contain several
matching words on the same diagonal (think DP
alignments) and within a short distance.
Extend these short, ungapped alignments in both
directions along the sequence so long as score of
alignment increases.
BLAST alignments scored simply with a log-odds
matrix no gap penalties at this point.
Call these extended alignments HSPs for high
scoring pairs

33
Is an HSP Significant?

What is the probability of scoring at least as
large as x by chance?
Extreme value (not Normal!) distributionWh
ere m is size of the database, n is length of
query, and l is average length of alignment
between two random sequences of those lengths
using this scoring scheme.
Called E value for expectation (analogous to p
value)

34
K and ?

Parameters of the extreme value distribution
Depend on the particular substitution matrix
Estimated by aligning a lot of random sequences
drawn on a particular distribution of amino
acids, and fitting the extreme value distribution
to those alignments
These empirical estimates may not be correct
(error in the assumed distribution of AAs used to
create the random sequences) but seem to be
reasonably close.

35
BLAST2 add gaps

Multiple HSPs in one target sequence ?
possibility of gapped alignment.
Combine HSP scores to score whole sequence
Add HSP scores
Adjust K and ? for this scoring method
Set modest e-value threshold to identify
reasonable target set
Use DP to produce final gapped alignments
Run DP on the (relatively) small number of
database sequences that were above the threshold
with multiple HSPs

36
Practical Gapped BLAST

Default on NCBI web site
BLAST versus DP on whole databases
Still might miss some alignments DP would find as
database search tool
DP on fractions of the database (e.g. all human
sequences) can be done with parallel hardware,
but computational complexity scales with database
size.
BLAST allows users to set certain gap penalties,
word sizes and thresholds in Advanced settings
but not all (since K ? have to be calculated in
advance)