Biological Sequence Comparison / Database Homology Searching

Provided by: aoifemc

1
Biological Sequence Comparison / Database
Homology Searching
  • Aoife McLysaght
  • Summer Intern,
  • Compaq Computer Corporation
  • Ballybrit Business Park, Galway, Ireland

2
Database Homology Searching
  • Use algorithms to increase efficiency and to
    provide a mathematical basis for searches which
    can be translated into statistical significance
  • Assumes that sequence, structure and function are
    inter-related
  • BLAST (Basic Local Alignment Search Tool) and
    FastA (Fast-All)
  • Both are heuristic approximations of the
    Needleman-Wunsch and Smith-Waterman algorithms
  • Both reduce the computation required

3
Needleman-Wunsch Algorithm
  • General algorithm for sequence comparison
  • Maximise a similarity score, to give maximum
    match
  • Maximum match = the largest number of residues of
    one sequence that can be matched with another,
    allowing for all possible deletions
  • Finds the best GLOBAL alignment of any two
    sequences
  • N-W involves an iterative matrix method of
    calculation
  • All possible pairs of residues (bases or amino
    acids) - one from each sequence - are represented
    in a 2-dimensional array
  • All possible alignments (comparisons) are
    represented by pathways through this array

4
Needleman-Wunsch Algorithm (cont.)
  • Three main steps
  • 1. Assign similarity values
  • 2. For each cell, look at all possible pathways
    back to the beginning of the sequence (allowing
    insertions and deletions) and give that cell the
    value of the maximum scoring pathway
  • 3. Construct an alignment (pathway) back from the
    highest scoring cell to give the highest scoring
    alignment

5
Needleman-Wunsch Algorithm (cont.)
  • Similarity values
  • A numerical value is assigned to every cell in
    the array depending on the similarity/dissimilarity
    of the two residues
  • These may be simple scores or more complicated,
    e.g. related to chemical similarities or
    frequency of observed substitutions
  • The example shown has
  • match = 1
  • mismatch = 0
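
A minimal Python sketch of this step, assuming the simple match = 1 / mismatch = 0 scheme. The function name `similarity_matrix` is hypothetical, and the two sequences are taken from the alignment example in this presentation with the gap characters removed.

```python
def similarity_matrix(seq_a, seq_b, match=1, mismatch=0):
    """Return one row per residue of seq_a and one column per residue of seq_b."""
    return [[match if a == b else mismatch for b in seq_b] for a in seq_a]


# Sequences from the presentation's alignment example, gaps removed.
seq_a = "MPRCLCQRJNCBA"
seq_b = "PBRCKCRNJCJA"

S = similarity_matrix(seq_a, seq_b)
print("  " + " ".join(seq_b))
for residue, row in zip(seq_a, S):
    print(residue, " ".join(str(value) for value in row))
```

A substitution-frequency or chemical-similarity table could replace the two constants without changing the rest of the method.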

6
Needleman-Wunsch Algorithm (cont.)
  • Score pathways through array
  • For each cell, we want to know the maximum
    possible score for an alignment ending at that
    point
  • Searches subrow and subcolumn, as shown, for the
    highest score
  • Adds this to the score for the current cell
  • Proceeds row by row through the array
  • A gap penalty is applied for the introduction of
    gaps in the alignment (presumed insertions or
    deletions into one sequence); here it is 0

\[
H_{ij} = \max\left\{
  H_{i-1,j-1} + s(a_i,b_j),\;
  \max_{k}\left\{H_{i-k,j-1} - W_k + s(a_i,b_j)\right\},\;
  \max_{l}\left\{H_{i-1,j-l} - W_l + s(a_i,b_j)\right\}
\right\}
\]
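
The scoring pass above can be sketched in Python as follows. This is an illustration under stated assumptions, not the presenter's code: the names `nw_score_matrix` and `maximum_match` are hypothetical, cells in the first row or column are assumed to have no predecessor (so unaligned leading residues cost nothing), and `gap_penalty(k)` stands in for W_k, indexed by the offset k used in the recurrence.

```python
def nw_score_matrix(seq_a, seq_b, match=1, mismatch=0, gap_penalty=lambda k: 0):
    """Fill H so that H[i][j] is the best score of a path ending at cell (i, j)."""
    n, m = len(seq_a), len(seq_b)
    H = [[0] * m for _ in range(n)]
    for i in range(n):  # proceed row by row
        for j in range(m):
            best_prev = 0  # assumption: first row/column cells start a path
            if i > 0 and j > 0:
                best_prev = H[i - 1][j - 1]  # diagonal neighbour, no gap
                for k in range(2, i + 1):    # rest of the subcolumn j-1
                    best_prev = max(best_prev, H[i - k][j - 1] - gap_penalty(k))
                for l in range(2, j + 1):    # rest of the subrow i-1
                    best_prev = max(best_prev, H[i - 1][j - l] - gap_penalty(l))
            s = match if seq_a[i] == seq_b[j] else mismatch
            H[i][j] = s + best_prev
    return H


def maximum_match(H):
    """The maximum match lies in the outer (last) row or column."""
    last_row = H[-1]
    last_col = [row[-1] for row in H]
    return max(last_row + last_col)


H = nw_score_matrix("MPRCLCQRJNCBA", "PBRCKCRNJCJA")
print(maximum_match(H))  # 8
```

The loops over k and l make the cost per cell proportional to the sequence lengths, which is part of why the heuristic tools mentioned earlier trade exactness for speed.
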
7
Needleman-Wunsch Algorithm (cont.)
  • Construct alignment
  • The alignment score is cumulative by adding along
    a path through the array
  • The best alignment has the highest score i.e. the
    maximum match
  • Maximum match = the largest number obtained by
    summing the cell values along any pathway
  • The maximum match will ALWAYS be somewhere in the
    outer row or column shown
  • The alignment is constructed by working backwards
    from the maximum match

MP-RCLCQR-JNCBA
-PBRCKC-RNJ-CJA
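
A hedged sketch of the traceback step in Python. The function name `needleman_wunsch` and the back-pointer table are illustrative assumptions; the scoring follows the recurrence given earlier, with first-row and first-column cells assumed to start a path and unaligned end residues shown against gaps at no cost. Both sequences are assumed to be non-empty. Where several pathways tie, this sketch keeps the first one found, so the printed alignment may differ from the one on the slide while having the same score.

```python
def needleman_wunsch(seq_a, seq_b, match=1, mismatch=0, gap_penalty=lambda k: 0):
    """Return (aligned_a, aligned_b, score) for a global alignment."""
    n, m = len(seq_a), len(seq_b)
    H = [[0] * m for _ in range(n)]
    prev = [[None] * m for _ in range(n)]  # back-pointers for the traceback
    for i in range(n):
        for j in range(m):
            best, best_prev = 0, None
            if i > 0 and j > 0:
                cands = [(H[i - 1][j - 1], (i - 1, j - 1))]
                cands += [(H[i - k][j - 1] - gap_penalty(k), (i - k, j - 1))
                          for k in range(2, i + 1)]
                cands += [(H[i - 1][j - l] - gap_penalty(l), (i - 1, j - l))
                          for l in range(2, j + 1)]
                best, best_prev = max(cands, key=lambda c: c[0])
            s = match if seq_a[i] == seq_b[j] else mismatch
            H[i][j], prev[i][j] = s + best, best_prev

    # The maximum match is somewhere in the outer row or column.
    border = [(n - 1, j) for j in range(m)] + [(i, m - 1) for i in range(n)]
    i, j = max(border, key=lambda c: H[c[0]][c[1]])
    score = H[i][j]
    tail_a, tail_b = seq_a[i + 1:], seq_b[j + 1:]  # at most one is non-empty

    a_cols, b_cols = [], []  # built from the end of the alignment backwards
    while True:
        a_cols.append(seq_a[i])
        b_cols.append(seq_b[j])
        if prev[i][j] is None:
            break
        pi, pj = prev[i][j]
        for r in range(i - 1, pi, -1):  # residues of seq_a skipped by the path
            a_cols.append(seq_a[r])
            b_cols.append("-")
        for c in range(j - 1, pj, -1):  # residues of seq_b skipped by the path
            a_cols.append("-")
            b_cols.append(seq_b[c])
        i, j = pi, pj
    for r in range(i - 1, -1, -1):      # unaligned start of seq_a
        a_cols.append(seq_a[r])
        b_cols.append("-")
    for c in range(j - 1, -1, -1):      # unaligned start of seq_b
        a_cols.append("-")
        b_cols.append(seq_b[c])

    aligned_a = "".join(reversed(a_cols)) + tail_a + "-" * len(tail_b)
    aligned_b = "".join(reversed(b_cols)) + "-" * len(tail_a) + tail_b
    return aligned_a, aligned_b, score


top, bottom, score = needleman_wunsch("MPRCLCQRJNCBA", "PBRCKCRNJCJA")
print(top)
print(bottom)
print("maximum match:", score)
```
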
8
Needleman-Wunsch Algorithm (cont.)
  • Statistical Significance
  • Maximum match is a function of sequence
    relationship and composition
  • Would like to know probability of obtaining
    result (maximum match) from a pair of random
    sequences
  • Estimate this experimentally
  • form pairs of random sequences by randomly
    drawing one member from each set (i.e. they have
    the same composition as the real proteins)
  • if the value found for the real proteins is
    significantly different from that for the random
    proteins then the difference is a function of the
    sequences alone and not of their composition
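
One way to sketch this experiment in Python is shown below. Several things here are assumptions rather than the presenter's procedure: the random sequences are produced by shuffling each real sequence (which preserves composition), the helper names are hypothetical, and the result is summarised as the fraction of random pairs that score at least as high as the real pair. Under the example scoring (match 1, mismatch 0, gap 0) the maximum match equals the length of the longest common subsequence, so a plain LCS routine is swapped in as the scoring function to keep the example short.

```python
import random


def lcs_length(a, b):
    """Longest common subsequence length (stand-in for the maximum match)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def shuffled(seq, rng):
    """Random sequence with the same composition as seq."""
    letters = list(seq)
    rng.shuffle(letters)
    return "".join(letters)


def random_scores(seq_a, seq_b, trials=200, seed=0):
    """Scores for pairs of shuffled sequences; seed fixed for repeatability."""
    rng = random.Random(seed)
    return [lcs_length(shuffled(seq_a, rng), shuffled(seq_b, rng))
            for _ in range(trials)]


def fraction_at_least(real_score, scores):
    """Share of random pairs scoring as high as the real pair."""
    return sum(1 for s in scores if s >= real_score) / len(scores)


seq_a, seq_b = "MPRCLCQRJNCBA", "PBRCKCRNJCJA"
real = lcs_length(seq_a, seq_b)
scores = random_scores(seq_a, seq_b)
print("real maximum match:", real)
print("fraction of random pairs >= real:", fraction_at_least(real, scores))
```

A small fraction suggests the real score reflects the order of the residues and not just their composition; how small is small enough is a judgement the slides leave open.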

9
Smith-Waterman Algorithm
  • Instead of looking at each sequence in its
    entirety this compares segments of all possible
    lengths (LOCAL alignments) and chooses whichever
    maximise the similarity measure
  • For every cell the algorithm calculates ALL
    possible paths leading to it. These paths can be
    of any length and can contain insertions and
    deletions

10
Smith-Waterman Algorithm (cont.)
  • Only works effectively when gap penalties are
    used
  • In the example shown:
  • match = 1
  • mismatch = -1/3
  • gap = -(1 + k/3) (k = extent of gap)
  • Start with all cell values 0
  • Looks in subcolumn and subrow shown and in direct
    diagonal for a score that is the highest when you
    take alignment score or gap penalty into account

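\[
H_{ij} = \max\left\{
  H_{i-1,j-1} + s(a_i,b_j),\;
  \max_{k \ge 1}\left\{H_{i-k,j} - W_k\right\},\;
  \max_{l \ge 1}\left\{H_{i,j-l} - W_l\right\},\;
  0
\right\}
\]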
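
A minimal Python sketch of this fill, assuming the example scores (match 1, mismatch -1/3, gap penalty 1 + k/3 for a gap of extent k). The function name `sw_score_matrix` is hypothetical, and `fractions.Fraction` is used only so that the thirds stay exact. The two test sequences are the ones in the presentation's local alignment example, with the gap removed.

```python
from fractions import Fraction


def sw_score_matrix(query, db,
                    match=Fraction(1),
                    mismatch=Fraction(-1, 3),
                    gap_penalty=lambda k: 1 + Fraction(k, 3)):
    """H[i][j] = best score of a local alignment ending at query[i-1], db[j-1]."""
    n, m = len(query), len(db)
    # Row 0 and column 0 are the all-zero starting values.
    H = [[Fraction(0)] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if query[i - 1] == db[j - 1] else mismatch
            diagonal = H[i - 1][j - 1] + s
            # Best gap of any extent k ending here, looking up the subcolumn.
            up = max(H[i - k][j] - gap_penalty(k) for k in range(1, i + 1))
            # Best gap of any extent l ending here, looking along the subrow.
            left = max(H[i][j - l] - gap_penalty(l) for l in range(1, j + 1))
            H[i][j] = max(diagonal, up, left, Fraction(0))
    return H


H = sw_score_matrix("GCCAUUG", "GCCUCG")
print(max(max(row) for row in H))  # 10/3
```

The final `Fraction(0)` term is what lets a path stop and restart anywhere, which is the difference from the global recurrence.
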
11
Smith-Waterman Algorithm (cont.)
  • Four possible ways of forming a path
  • For every residue in the query sequence:
  • 1. Align with next residue of db sequence: score
    is previous score plus similarity score for the
    two residues
  • 2. Deletion (i.e. match residue of query with a
    gap): score is previous score minus gap penalty
    dependent on size of gap
  • 3. Insertion (i.e. match residue of db sequence
    with a gap): score is previous score minus gap
    penalty dependent on size of gap
  • 4. Stop: score is zero
  • Choose whichever of these is the highest

12
Smith-Waterman Algorithm (cont.)
  • Construct Alignment
  • The score in each cell is the maximum possible
    score for an alignment of ANY LENGTH ending at
    those coordinates
  • Trace pathway back from highest scoring cell
  • This cell can be anywhere in the array
  • Align highest scoring segment

GCC-UCG
GCCAUUG
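
The traceback can be sketched in Python as below, again under assumptions: the name `smith_waterman` and the back-pointer table are illustrative, the scores are the example values (match 1, mismatch -1/3, gap penalty 1 + k/3), and ties are broken by taking the first candidate, with "stop" listed first.

```python
from fractions import Fraction


def smith_waterman(query, db,
                   match=Fraction(1),
                   mismatch=Fraction(-1, 3),
                   gap_penalty=lambda k: 1 + Fraction(k, 3)):
    """Return (aligned_query, aligned_db, score) for the best local alignment."""
    n, m = len(query), len(db)
    H = [[Fraction(0)] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]  # None means "stop"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if query[i - 1] == db[j - 1] else mismatch
            cands = [(Fraction(0), None),
                     (H[i - 1][j - 1] + s, (i - 1, j - 1))]
            cands += [(H[i - k][j] - gap_penalty(k), (i - k, j))
                      for k in range(1, i + 1)]
            cands += [(H[i][j - l] - gap_penalty(l), (i, j - l))
                      for l in range(1, j + 1)]
            H[i][j], back[i][j] = max(cands, key=lambda c: c[0])

    # The highest scoring cell can be anywhere in the array.
    cells = ((i, j) for i in range(n + 1) for j in range(m + 1))
    i, j = max(cells, key=lambda c: H[c[0]][c[1]])
    score = H[i][j]

    top, bottom = [], []  # built from the end of the segment backwards
    while back[i][j] is not None:
        pi, pj = back[i][j]
        if (pi, pj) == (i - 1, j - 1):   # residues aligned with each other
            top.append(query[i - 1])
            bottom.append(db[j - 1])
        elif pj == j:                    # query residues against a gap
            for r in range(i, pi, -1):
                top.append(query[r - 1])
                bottom.append("-")
        else:                            # db residues against a gap
            for c in range(j, pj, -1):
                top.append("-")
                bottom.append(db[c - 1])
        i, j = pi, pj
    return "".join(reversed(top)), "".join(reversed(bottom)), score


aligned_query, aligned_db, score = smith_waterman("GCCAUUG", "GCCUCG")
print(aligned_db)     # GCC-UCG
print(aligned_query)  # GCCAUUG
print(score)          # 10/3
```
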
13
Differences
  • Needleman-Wunsch
  • 1. Global alignments
  • 2. Requires alignment score for a pair of
    residues to be ≥ 0
  • 3. No gap penalty required
  • 4. Score cannot decrease between two cells of a
    pathway
  • Smith-Waterman
  • 1. Local alignments
  • 2. Residue alignment score may be positive or
    negative
  • 3. Requires a gap penalty to work effectively
  • 4. Score can increase, decrease or stay level
    between two cells of a pathway