Search - PowerPoint PPT Presentation

1 / 36
About This Presentation
Title:

Search

Description:

Sequence search doesn't match the traditional database model perfectly ... Similarity can be defined by counting positions that match between two sequences ... – PowerPoint PPT presentation

Number of Views:43
Avg rating:3.0/5.0
Slides: 37
Provided by: compbi
Category:
Tags: match | search

less

Transcript and Presenter's Notes

Title: Search


1
Search Sequence Alignment
2
Similarity vs. HomologyParalogs vs. Orthologs
  • Homology is an evolutionary relationship that
    either exists or does not. It cannot be partial.
  • An ortholog is a homolog with shared function.
  • A paralog is a homolog that arose through a gene
    duplication event. Paralogs often have divergent
    function.
  • Similarity is a measure of the quality of
    alignment between two sequences. High similarity
    is evidence for homology. Similar sequences may
    be orthologs or paralogs.

3
Sequence search alignment
  • In order to understand how the example from last
    week works, we need to look at databases and
    sequence alignment
  • Database organization is focused on efficiency
  • Sequence search doesnt match the traditional
    database model perfectly
  • Start with dynamic programming (a central idea in
    computational biology)
  • Then explore approximations to it (BLAST)

4
Computational Complexity
  • A key idea in computer science How much work
    does it take to solve a class of problems?
  • How do we measure complexity?
  • Relative to problem size
  • How long does it take?
  • Clock time versus operations
  • Order O(?) notation
  • Other resources used (particularly space)

5
Linear search
  • Database
  • ACTGA
  • TTAGG
  • CGTAA
  • AGAGA
  • CGATA
  • CCGGA
  • GCCCT
  • TTACG
  • Test query against each target sequentially
  • Worst case, query matches last target and you
    have as manytests as targets (size of database)
  • Average case, test half the targets.
  • Linear in the size of the database

6
Indexed (binary) search
  • Create an sorted set of keys that point to
    entries
  • Start in the middle, then figure out which half
  • Eliminate half the database each step, so need
    log2 steps at worst
  • Need to build the index (takes space and time at
    each database update)

1
2
3
7
Hash tables
  • Map each query to an arbitrary number with a
    hash function
  • Use those numbers as an index into a table
  • Collisions can happen, but are rare
  • Constant time lookup,no index construction

f (TTACG) 8
8
How to define a hash function
  • Basic must map keys to a number that is within
    the size of the table
  • Desired minimize collisions
  • So similar keys should lead to different hashes
  • Good general method map key to a number, and
    then take the remainder when divided by a prime
    number. Specialized hash functions can be
    better.
  • Hash tables are the basis of most database
    lookups.

9
Approximate searches
  • Recall the needs of sequence searches
  • Not looking for exact match, but similar
    sequences
  • Database search methods only help us find exact
    matches.
  • Hash tables particularly bad at similar because
    we need similar keys to map to different hashes
  • First, need to define what is similar, then
    find efficient ways to search for similar
    sequences.

10
What counts as similarity?
  • Similarity can be defined by counting positions
    that match between two sequences
  • But which positions? Allowing gaps makes a
    difference in the number of matching positions
    abcdef abcdef abcdef-
    abceef acdefg a-cdefg

11
Pairwise Sequence Alignment
  • Sequence similarity depends on an alignment.
  • What is an alignment, and why might it be
    significant?
  • An alignment is a mapping from one sequence to
    another.
  • Biological alignment maps together elements that
    are likely to have arisen from a common ancestor
  • The existence of an alignment with many matches
    is an indication of homology

12
Not all mismatches are the same
  • Some amino acids are more substitutable for each
    other than others. Serine and threonine are
    more alike than tryptophan and alanine.
  • We can introduce "mismatch costs" for handling
    different substitutions.
  • We don't usually use mismatch costs in aligning
    nucleotide sequences, since no substitution is
    per se better than any other.

13
Many possible alignments to consider
  • Without gaps, there are are NM-1 possible
    alignments between sequences of length N and M
  • Once we start allowing gaps, there are many more
    possible arrangements to consider abcbcd
    abcbcd abcbcd
    abc--d a--bcd ab--cd
  • This becomes a very large number when we allow
    mismatches, since we then need to look at every
    possible pairing between elements there are
    roughly NM possible alignments. Aligning length
    100 sequences this way is impractical

14
Avoiding random alignments with a score function
  • Not only are there many possible gapped
    alignments, but introducing too many gaps makes
    nonsense alignments possible
    s--e-----qu---en--ce sometimesquipsentice
  • Want to distinguish between alignments that occur
    due to homology, and those that could be expected
    to be seen just by chance.
  • Define a score function that accounts for both
    element mismatches and a gap penalty

15
Match scores
  • Match scores are often calculated on the basis of
    the frequency of particular mutations in very
    similar sequences.
  • We can transform substitution frequencies into
    log odds scores, which can then be added together.

16
An alignment score
  • An alignment score is the sum of all the match
    scores of an alignment, with a penalty subtracted
    for each gap.
  • Gap penalties are usually "affine" meaning that
    the penalty for one long gap is smaller than the
    penalty for many smaller gaps that add up to the
    same size.a b c - - da c c e f d9 2 7 6
    24 - (10 2) 12

Gap start continuationpenalty
Matchscore
AlignmentScore
17
Local vs. Global alignments
  • A global alignment includes all elements of a
    sequence, and includes gaps
  • A global alignment may or may not include "end
    gap" penalties.
  • A local alignment is includes only subsequences,
    and sometimes computed without gaps.
  • Local alignments can find shared domains in
    divergent proteins and are fast to compute
  • Global alignments are better indicators of
    homology and take longer to compute.

18
Finding the optimal alignment
  • Given a pair of sequences and a score function,
    identify the best scoring (optimal) alignment
    between the sequences.
  • Remember, exponential number of possible
    alignments (most with terrible scores).
  • Computer science to the rescue dynamic
    programming identifies optimal alignments in time
    proportional to the sum of the lengths of the
    sequences

19
Dynamic programming
  • The name comes from an operations research task,
    and has nothing to do with writing programs.
  • The key idea is to start aligning the sequences
    left to right
  • Once a prefix is optimally aligned, nothing about
    the remainder of the alignment can change the
    alignment of the prefix.
  • We construct a matrix of possible alignment
    scores (NxM2 calculations worst case) and then
    "traceback" to find the optimal alignment.
  • Called Needleman-Wunch or Smith-Waterman

20
Dynamic programming alignment
  • Each cell contains the score for the best aligned
    sequence prefix up to that position.
  • Start by filling in initial gap and first element
    to first element match score
  • Use arrow to indicate path to that alignment

Align ACD to AACADCD
21
Continue filling in optimal path scores
  • For each cell, have three choices for how to get
    there from the last optimal alignment (match, gap
    sequence 1, gap sequence 2).
  • Best score(s) are selected, and arrows added
    indicated route.
  • From -5 align As
  • -5 5 0
  • From 5, insert gap
  • 5 -5 0
  • From -7, insert gap
  • -7 -5 -12

22
Optimal alignment by traceback
  • We traceback a path that gets us the highest
    score. If we don't have end gap penalties,
    then take any path from the last row or columnto
    the first.
  • Otherwise we needto include the top and bottom
    corners
  • AACADCD---A-CD

23
Study guide....
  • Dynamic programming alignments are a key
    technology in bioinformatics, and you should
    understand how they work.
  • The method is counterintuitive
  • Work some examples by hand.
  • All of the textbooks describe D-P, and there is
    more detail and supplementary material on the
    course web site.

24
Parameter Selection
  • The optimal alignment between a pair of sequences
    depends critically on the selection of the score
    matrix and the gap penalty.
  • These sorts of generic inputs to a program are
    called parameters.
  • How do we pick the ones that give the most
    biologically meaningful alignments (and alignment
    scores?)

25
How do we pick match scores?
  • For match scores, two main options
  • PAM based on global alignments of closely related
    sequences. Normalized to changes per 100 sites,
    then exponentiated for more distant relatives.
  • BLOSUM based on local alignments in much more
    diverse sequences
  • Each matrix has versions aimed at different
    evolutionary distances.
  • BLOSUM62 is NCBIs default. The unnumbered
    BLOSUM may work better for more evolutionarily
    distant sequences.

26
Picking gap penalties
  • Many different possible forms
  • Most common is affine (gap open gap continue
    penalities)
  • More complex penalties have been proposed.
  • Penalties must be commensurate with match scores.
    Therefore, the match scoring scheme influences
    the gap penalty
  • Most alignment programs suggest appropriate
    penalties for each match score option.

27
Searching for optimal scores
  • One possibility is to try several different match
    score and gap penalties, and choose the best
  • In general, this is called parameter space search
    and it is important in many areas.
  • Problems
  • requires a lot computation
  • we need some principled way to compare the
    results.
  • Use significance testing to compare...

28
The significance of an alignment
  • Significance testing is the branch of statistics
    that is concerned with assessing the probability
    that a particular result could have occurred by
    chance.
  • How do we calculate the probability that an
    alignment occurred by chance?
  • Either with a model of evolution, or
  • Empirically, by scrambling our sequences and
    calculating scores on many randomized (and by
    assumption unrelated) sequences.

29
Why BLAST?
  • Dynamic programming solutions to alignment
    problems are relatively slow, and don't lend
    themselves to efficient database search.
  • Time complexity proportional to the size of the
    database.
  • Need some way to search a large database to find
    sequences that have an inexact match to a query
    sequence
  • BLAST an imperfect approximation to DP. DP
    finds some distantly related sequences the
    approximations don't

30
Sequence search basics
  • BLAST is 50-100x faster than DP
  • Proper use is similar to DP
  • Use appropriate substitution and gap scores
  • BLOSUM62 is good for weak protein similarities
  • Use PAM30, PAM70 or BLOSUM45 for better results
    on more similar sequences, BLOSUM80 for most
    distant
  • Use low-complexity (repetitive seq) filters and
    filter out human repeats (ALUs, etc)
  • If searching for coding regions, always translate
    nucleotide to amino acid sequence.

31
How BLAST works
  • Break sequence into overlapping words, by
    default of length 3.
  • Sequence of length n makes n-m1 m-size words
    ABCDE ? ABC, BCD, CDE
  • For each word, define 50 other words that are
    similar (use substitution matrix threshold T)
  • Repeat for each of the n-m1 words, giving about
    50n words (out of 2038000 possible)
  • Use a hash table to find all places in DB with an
    exact match to any of those words.

32
Extending HSPs
  • Identify database sequences that contain several
    matching words on the same diagonal (think DP
    alignments) and within a short distance.
  • Extend these short, ungapped alignments in both
    directions along the sequence so long as score of
    alignment increases.
  • BLAST alignments scored simply with a log-odds
    matrix no gap penalties at this point.
  • Call these extended alignments HSPs for high
    scoring pairs

33
Is an HSP Significant?
  • What is the probability of scoring at least as
    large as x by chance?
  • Extreme value (not Normal!) distributionWh
    ere m is size of the database, n is length of
    query, and l is average length of alignment
    between two random sequences of those lengths
    using this scoring scheme.
  • Called E value for expectation (analogous to p
    value)

34
K and ?
  • Parameters of the extreme value distribution
  • Depend on the particular substitution matrix
  • Estimated by aligning a lot of random sequences
    drawn on a particular distribution of amino
    acids, and fitting the extreme value distribution
    to those alignments
  • These empirical estimates may not be correct
    (error in the assumed distribution of AAs used to
    create the random sequences) but seem to be
    reasonably close.

35
BLAST2 add gaps
  • Multiple HSPs in one target sequence ?
    possibility of gapped alignment.
  • Combine HSP scores to score whole sequence
  • Add HSP scores
  • Adjust K and ? for this scoring method
  • Set modest e-value threshold to identify
    reasonable target set
  • Use DP to produce final gapped alignments
  • Run DP on the (relatively) small number of
    database sequences that were above the threshold
    with multiple HSPs

36
Practical Gapped BLAST
  • Default on NCBI web site
  • BLAST versus DP on whole databases
  • Still might miss some alignments DP would find as
    database search tool
  • DP on fractions of the database (e.g. all human
    sequences) can be done with parallel hardware,
    but computational complexity scales with database
    size.
  • BLAST allows users to set certain gap penalties,
    word sizes and thresholds in Advanced settings
    but not all (since K ? have to be calculated in
    advance)
Write a Comment
User Comments (0)
About PowerShow.com