Title: Biological Sequence Comparison Database Homology Searching
1. Biological Sequence Comparison / Database Homology Searching
- Aoife McLysaght
- Summer Intern,
- Compaq Computer Corporation
- Ballybrit Business Park, Galway, Ireland
2. Database Homology Searching
- Use algorithms to increase efficiency and to provide a mathematical basis for searches which can be translated into statistical significance
- Assumes that sequence, structure and function are inter-related
- BLAST (Basic Local Alignment Search Tool) and FASTA ("FAST-All")
- Heuristic approximations of the Needleman-Wunsch and Smith-Waterman algorithms that reduce computation
3. Needleman-Wunsch Algorithm
- General algorithm for sequence comparison
- Maximise a similarity score, to give maximum match
- Maximum match = largest number of residues of one sequence that can be matched with another, allowing for all possible deletions
- Finds the best GLOBAL alignment of any two sequences
- N-W involves an iterative matrix method of calculation
- All possible pairs of residues (bases or amino acids), one from each sequence, are represented in a 2-dimensional array
- All possible alignments (comparisons) are represented by pathways through this array
4. Needleman-Wunsch Algorithm (cont.)
- Three main steps
- 1. Assign similarity values
- 2. For each cell, look at all possible pathways back to the beginning of the sequence (allowing insertions and deletions) and give that cell the value of the maximum scoring pathway
- 3. Construct an alignment (pathway) back from the highest scoring cell to give the highest scoring alignment
5. Needleman-Wunsch Algorithm (cont.)
- Similarity values
- A numerical value is assigned to every cell in the array depending on the similarity/dissimilarity of the two residues
- These may be simple scores or more complicated, e.g. related to chemical similarities or frequency of observed substitutions
- The example shown has
- match = 1
- mismatch = 0
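The similarity-value step can be sketched in a few lines of Python. This is only an illustration under the example's match = 1 / mismatch = 0 scheme; the function name `similarity_matrix` and the two short sequences are made up here and do not come from the slides.

```python
def similarity_matrix(a, b, match=1, mismatch=0):
    """Return a len(a) x len(b) array of per-cell similarity values."""
    return [[match if x == y else mismatch for y in b] for x in a]


# Hypothetical short sequences, chosen only to keep the array small.
grid = similarity_matrix("ABCNJ", "AJCJN")
for row in grid:
    print(row)
```

A more elaborate scheme (chemical similarity, observed substitution frequencies) would replace the `match if x == y else mismatch` expression with a table lookup.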
6. Needleman-Wunsch Algorithm (cont.)
- Score pathways through array
- For each cell want to know the maximum possible score for an alignment ending at that point
- Searches subrow and subcolumn, as shown, for the highest score
- Adds this to the score for the current cell
- Proceeds row by row through the array
- Gap penalty for the introduction of gaps in the alignment (presumed insertions or deletions into one sequence), here = 0
$$H_{ij} = \max\left\{ H_{i-1,j-1} + s(a_i,b_j),\ \max_{k}\left(H_{i-k,j-1} - W_k\right) + s(a_i,b_j),\ \max_{l}\left(H_{i-1,j-l} - W_l\right) + s(a_i,b_j) \right\}$$
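The row-by-row scoring can be sketched in Python as follows. This is a minimal sketch, not a definitive implementation: the function names `nw_scores` and `maximum_match` are invented for illustration, `gap(k)` stands in for W_k (the default of 0 is the value used in the example), and cells in the first row or column are assumed to have no predecessor and no penalty for leading gaps. The two sequences are those of the example alignment with the gaps removed.

```python
def nw_scores(a, b, match=1, mismatch=0, gap=lambda k: 0):
    """Fill the Needleman-Wunsch array row by row.

    H[i][j] is the best score of any pathway that ends by pairing a[i]
    with b[j]. `gap(k)` plays the role of W_k (assumed 0 by default).
    """
    n, m = len(a), len(b)
    H = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = match if a[i] == b[j] else mismatch
            best_prev = 0  # first row/column: nothing precedes this cell
            if i > 0 and j > 0:
                candidates = [H[i - 1][j - 1]]  # direct diagonal
                # subcolumn: column j-1, rows above the diagonal cell
                candidates += [H[i - k][j - 1] - gap(k) for k in range(2, i + 1)]
                # subrow: row i-1, columns left of the diagonal cell
                candidates += [H[i - 1][j - l] - gap(l) for l in range(2, j + 1)]
                best_prev = max(candidates)
            H[i][j] = best_prev + s
    return H


def maximum_match(H):
    """Largest value in the outer (last) row or column."""
    return max(max(H[-1]), max(row[-1] for row in H))


H = nw_scores("MPRCLCQRJNCBA", "PBRCKCRNJCJA")
print(maximum_match(H))  # 8 for these two sequences
```

Searching the whole subrow and subcolumn for every cell is what makes this direct form expensive; it is written this way only to mirror the recurrence term by term.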
7. Needleman-Wunsch Algorithm (cont.)
- Construct alignment
- The alignment score is cumulative by adding along a path through the array
- The best alignment has the highest score, i.e. the maximum match
- Maximum match = largest number resulting from summing the cell values of every pathway
- The maximum match will ALWAYS be somewhere in the outer row or column shown
- The alignment is constructed by working backwards from the maximum match
MP-RCLCQR-JNCBA
-PBRCKC-RNJ-CJA
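Working backwards from the maximum match can be sketched as below. This is an illustrative Python sketch assuming match = 1, mismatch = 0 and no gap penalty; the name `nw_align` and the `back` table of predecessor cells are choices made here, not part of the slides. Several pathways can share the maximum score, so the alignment printed need not be identical to the one shown above, but it has the same number of matches.

```python
def nw_align(a, b, match=1, mismatch=0):
    """Global alignment with no gap penalty; returns (score, top, bottom)."""
    n, m = len(a), len(b)
    H = [[0] * m for _ in range(n)]
    back = [[None] * m for _ in range(n)]  # predecessor cell on the best pathway
    for i in range(n):
        for j in range(m):
            if i > 0 and j > 0:
                # Diagonal first, then the rest of column j-1 and of row i-1.
                cells = [(i - 1, j - 1)]
                cells += [(r, j - 1) for r in range(i - 1)]
                cells += [(i - 1, c) for c in range(j - 1)]
                back[i][j] = max(cells, key=lambda rc: H[rc[0]][rc[1]])
                H[i][j] = H[back[i][j][0]][back[i][j][1]]
            H[i][j] += match if a[i] == b[j] else mismatch

    # The maximum match lies in the outer row or column.
    outer = [(n - 1, c) for c in range(m)] + [(r, m - 1) for r in range(n)]
    cell = max(outer, key=lambda rc: H[rc[0]][rc[1]])
    score = H[cell[0]][cell[1]]

    # Work backwards, collecting the paired residues.
    pairs = []
    while cell is not None:
        pairs.append(cell)
        cell = back[cell[0]][cell[1]]
    pairs.reverse()

    # Write out both sequences, padding skipped residues with "-".
    top, bottom, pi, pj = [], [], 0, 0
    for i, j in pairs + [(n, m)]:  # (n, m) is a sentinel that flushes the tails
        top += list(a[pi:i]) + ["-"] * (j - pj)
        bottom += ["-"] * (i - pi) + list(b[pj:j])
        if i < n:
            top.append(a[i])
            bottom.append(b[j])
        pi, pj = i + 1, j + 1
    return score, "".join(top), "".join(bottom)


score, top, bottom = nw_align("MPRCLCQRJNCBA", "PBRCKCRNJCJA")
print(score)
print(top)
print(bottom)
```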
8. Needleman-Wunsch Algorithm (cont.)
- Statistical Significance
- Maximum match is a function of sequence relationship and composition
- Would like to know probability of obtaining result (maximum match) from a pair of random sequences
- Estimate this experimentally
- Form pairs of random sequences by randomly drawing one member from each set (i.e. the random sequences have the same composition as the real proteins)
- If the value found for the real proteins is significantly different from that for the random proteins then the difference is a function of the sequences alone and not of their composition
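One way to run such an experiment is sketched below in Python. Several things here are assumptions rather than the slides' method: the random sets are produced by shuffling each real sequence (which preserves composition), the maximum match is computed with a longest-common-subsequence routine (equivalent to match = 1, mismatch = 0, gap penalty 0), and the names `max_match`, `shuffled` and `random_scores`, the trial count and the seed are all arbitrary. The slides do not say what counts as "significantly different", so the sketch only reports the mean and standard deviation of the random scores.

```python
import random
import statistics


def max_match(a, b):
    """Maximum match for match = 1, mismatch = 0 and no gap penalty.

    Under that scheme the maximum match equals the length of the longest
    common subsequence, computed here by standard dynamic programming.
    """
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(max(prev[j], cur[j - 1], prev[j - 1] + (x == y)))
        prev = cur
    return prev[-1]


def shuffled(seq, rng):
    """Random sequence with exactly the same composition as `seq`."""
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)


def random_scores(a, b, trials=200, seed=0):
    """Maximum match for `trials` pairs of shuffled sequences."""
    rng = random.Random(seed)
    return [max_match(shuffled(a, rng), shuffled(b, rng)) for _ in range(trials)]


a, b = "MPRCLCQRJNCBA", "PBRCKCRNJCJA"
real = max_match(a, b)
scores = random_scores(a, b)
mean, sd = statistics.mean(scores), statistics.pstdev(scores)
print(f"real = {real}, random mean = {mean:.2f}, sd = {sd:.2f}")
```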
9. Smith-Waterman Algorithm
- Instead of looking at each sequence in its entirety this compares segments of all possible lengths (LOCAL alignments) and chooses whichever maximises the similarity measure
- For every cell the algorithm calculates ALL possible paths leading to it. These paths can be of any length and can contain insertions and deletions
10. Smith-Waterman Algorithm (cont.)
- Only works effectively when gap penalties are used
- In example shown
- match = 1
- mismatch = -1/3
- gap = -(1 + k/3) (k = extent of gap)
- Start with all cell values = 0
- Looks in subcolumn and subrow shown and in direct diagonal for a score that is the highest when you take alignment score or gap penalty into account
$$H_{ij} = \max\left\{ H_{i-1,j-1} + s(a_i,b_j),\ \max_{k}\left(H_{i-k,j} - W_k\right),\ \max_{l}\left(H_{i,j-l} - W_l\right),\ 0 \right\}$$
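The recurrence can be written almost term for term in Python. The sketch below uses the example's scores (match = 1, mismatch = -1/3, W_k = 1 + k/3) with exact fractions so that comparisons are not affected by rounding. The function names are invented for illustration, and the first test sequence is hypothetical: it is the longer aligned segment of the example with a few extra A residues added at each end, so that the best local segment sits in the middle.

```python
from fractions import Fraction


def sw_scores(a, b, match=Fraction(1), mismatch=Fraction(-1, 3),
              gap=lambda k: 1 + Fraction(k, 3)):
    """Fill the Smith-Waterman array; row 0 and column 0 stay at 0."""
    n, m = len(a), len(b)
    H = [[Fraction(0)] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            diagonal = H[i - 1][j - 1] + s
            subcolumn = max(H[i - k][j] - gap(k) for k in range(1, i + 1))
            subrow = max(H[i][j - l] - gap(l) for l in range(1, j + 1))
            H[i][j] = max(diagonal, subcolumn, subrow, 0)
    return H


def best_cell(H):
    """Highest score and its (i, j) position; it may be anywhere in the array."""
    return max((value, (i, j))
               for i, row in enumerate(H)
               for j, value in enumerate(row))


H = sw_scores("AAAGCCAUUGAA", "GCCUCG")
score, (i, j) = best_cell(H)
print(score, (i, j))  # 10/3 (10, 6)
```

The final 0 in the `max` is what lets a path start afresh at any cell, which is the difference from the global recurrence.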
11. Smith-Waterman Algorithm (cont.)
- Four possible ways of forming a path
- For every residue in the query sequence
- 1. Align with next residue of db sequence: score is previous score plus similarity score for the two residues
- 2. Deletion (i.e. match residue of query with a gap): score is previous score minus gap penalty dependent on size of gap
- 3. Insertion (i.e. match residue of db sequence with a gap): score is previous score minus gap penalty dependent on size of gap
- 4. Stop: score is zero
- Choose whichever of these is the highest
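The four choices for a single cell can be made explicit with a small Python sketch. It assumes that rows index the query and columns the database sequence, that row 0 and column 0 hold zeros, and that the gap penalty is the example's W_k = 1 + k/3; the names `gap_penalty` and `path_options` and the tiny array are hypothetical.

```python
from fractions import Fraction


def gap_penalty(k):
    """W_k = 1 + k/3, the penalty used in the example."""
    return 1 + Fraction(k, 3)


def path_options(H, i, j, similarity):
    """Score of each of the four ways to form a path ending at cell (i, j)."""
    return {
        "align": H[i - 1][j - 1] + similarity,
        "deletion": max(H[i - k][j] - gap_penalty(k) for k in range(1, i + 1)),
        "insertion": max(H[i][j - l] - gap_penalty(l) for l in range(1, j + 1)),
        "stop": 0,
    }


# Hypothetical partly filled array: one earlier match scored 1 at H[1][1].
H = [[0, 0, 0],
     [0, 1, 0],
     [0, 0, 0]]
options = path_options(H, 2, 2, similarity=1)  # the two residues match
choice = max(options, key=options.get)
H[2][2] = options[choice]
print(choice, H[2][2])  # align 2
```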
12. Smith-Waterman Algorithm (cont.)
- Construct Alignment
- The score in each cell is the maximum possible score for an alignment of ANY LENGTH ending at those coordinates
- Trace pathway back from highest scoring cell
- This cell can be anywhere in the array
- Align highest scoring segment
GCC-UCG
GCCAUUG
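The traceback from the highest scoring cell can be sketched as follows, again assuming the example's scores (match = 1, mismatch = -1/3, W_k = 1 + k/3). The function name `sw_align` and the `back` table are illustrative choices, and the A residues flanking the first sequence are hypothetical padding added to show that the best segment is found wherever it lies.

```python
from fractions import Fraction


def sw_align(a, b, match=Fraction(1), mismatch=Fraction(-1, 3),
             gap=lambda k: 1 + Fraction(k, 3)):
    """Best local alignment; returns (score, segment of a, segment of b)."""
    n, m = len(a), len(b)
    H = [[Fraction(0)] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best, best_cell = Fraction(0), (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score, origin = H[i - 1][j - 1] + s, (i - 1, j - 1)
            for k in range(1, i + 1):  # gap in b
                if H[i - k][j] - gap(k) > score:
                    score, origin = H[i - k][j] - gap(k), (i - k, j)
            for l in range(1, j + 1):  # gap in a
                if H[i][j - l] - gap(l) > score:
                    score, origin = H[i][j - l] - gap(l), (i, j - l)
            if score > 0:  # otherwise the cell stays at 0 ("stop")
                H[i][j], back[i][j] = score, origin
            if H[i][j] > best:
                best, best_cell = H[i][j], (i, j)

    # Trace back from the highest scoring cell until a zero is reached.
    top, bottom = [], []
    i, j = best_cell
    while H[i][j] > 0:
        pi, pj = back[i][j]
        if (pi, pj) == (i - 1, j - 1):  # two residues paired
            top.append(a[i - 1])
            bottom.append(b[j - 1])
        elif pj == j:  # residues of a against a gap
            top.extend(reversed(a[pi:i]))
            bottom.extend("-" * (i - pi))
        else:  # residues of b against a gap
            top.extend("-" * (j - pj))
            bottom.extend(reversed(b[pj:j]))
        i, j = pi, pj
    return best, "".join(reversed(top)), "".join(reversed(bottom))


score, seg_a, seg_b = sw_align("AAAGCCAUUGAA", "GCCUCG")
print(score)   # 10/3
print(seg_a)   # GCCAUUG
print(seg_b)   # GCC-UCG
```

The score of 10/3 is five matches (5), one mismatch (-1/3) and one gap of extent 1 (-4/3).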
13. Differences
- Needleman-Wunsch
- 1. Global alignments
- 2. Requires alignment score for a pair of residues to be >= 0
- 3. No gap penalty required
- 4. Score cannot decrease between two cells of a pathway
- Smith-Waterman
- 1. Local alignments
- 2. Residue alignment score may be positive or negative
- 3. Requires a gap penalty to work effectively
- 4. Score can increase, decrease or stay level between two cells of a pathway