Provided by: AnatolyR7
1
Sequence Alignment
2
Outline
  • Alignment of pairs of sequences
  • Local and global alignments
  • Methods of alignments
  • Dot matrix analysis
  • Dynamic programming approach
  • Use of scoring matrices and gap penalties
  • Scoring matrices -- PAM and BLOSUM

3
What is sequence alignment?
  • Sequence alignment is a way of arranging the
    sequences of DNA, RNA or protein to identify
    regions of similarity that may be a consequence
    of functional, structural or evolutionary
    relationships between the sequences.
  • Comparing two sequences (pairwise alignment) or
    more (multiple sequence alignment) means
    searching for a series of individual characters
    or patterns that occur in the same order in the
    sequences.
  • There are two types of alignment: local and
    global.

4
Global alignment vs Local alignment
  • Global alignment attempts to match as much of
    the sequences as possible, from end to end.
  • Tools for global alignment are based on the
    Needleman-Wunsch algorithm.
  • Local alignment tries to find the regions with
    the highest density of matches. Tools for local
    alignment are based on the Smith-Waterman
    algorithm.
  • Both algorithms are derived from the basic
    dynamic programming algorithm.

Global alignment
    L G P S S K Q T G K G S - S R I W D N
    L N - I T K S A G K G A I M R L G D A
Local alignment
    - - - - - - - T G K G - - - - - - - -
    - - - - - - - A G K G - - - - - - - -
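The difference between the two modes can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of any particular tool: the match, mismatch and gap values are assumptions chosen for readability, and only the scores are computed, not the alignments themselves.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch style score: both sequences are aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]  # first row: leading gaps are charged
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + pair, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Smith-Waterman style score: best pair of substrings; cells never drop below 0."""
    best = 0
    prev = [0] * (len(b) + 1)  # first row: an alignment may start anywhere for free
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(0, prev[j - 1] + pair, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


# The two sequences from the figure above, with gap characters removed.
seq1 = "LGPSSKQTGKGSSRIWDN"
seq2 = "LNITKSAGKGAIMRLGDA"
print("global:", global_score(seq1, seq2))
print("local: ", local_score(seq1, seq2))
```

The only structural differences are the zero floor and the free start in the local version, which is what lets it report a short, dense region such as G K G instead of forcing the whole sequences to pair up.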
5
Why do sequence alignment?
  • Sequence alignment is useful for discovering
    structural, functional and evolutionary
    information in biological sequences.
  • Sequences that are very much alike may have
    similar secondary and 3D structure, similar
    function and likely a common ancestral sequence.
    It is extremely unlikely that such sequences
    became similar by chance.
  • -- For DNA molecules with n nucleotides, the
    probability of a chance match is very low:
    P = 4^(-n).
  • -- For proteins with n amino acids, the
    probability is even lower: P = 20^(-n).
  • Sequence alignment makes the following tasks
    easier: 1. annotation of new sequences, 2.
    modelling of protein structures, 3. design and
    analysis of gene expression experiments.
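To get a feel for how small these probabilities are, the two formulas can be evaluated directly. The sketch below assumes, as the formulas do, that every residue is independent and equally likely, which is a simplification of real sequence composition; the function name is made up for this example.

```python
def chance_match_probability(n, alphabet_size):
    """Probability that one specific sequence of length n arises by chance,
    assuming independent, equally likely residues."""
    return alphabet_size ** -n


for n in (5, 10, 20):
    dna = chance_match_probability(n, 4)       # 4 nucleotides
    protein = chance_match_probability(n, 20)  # 20 amino acids
    print(f"n={n:2d}  DNA: {dna:.3e}  protein: {protein:.3e}")
```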

6
An example of aligning text strings
  • Raw data: T C A T G and C A T T G
  • 2 matches, 0 gaps
        T C A T G
        C A T T G
  • 3 matches (2 end gaps)
        T C A T G .
        . C A T T G
  • 4 matches, 1 insertion
        T C A - T G
        . C A T T G
  • 4 matches, 1 insertion
        T C A T - G
        . C A T T G
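The match counts in this example can be checked mechanically. The following sketch assumes each alignment is written as two rows of equal length, with "-" marking an internal gap and "." an end gap; the helper name is hypothetical.

```python
def count_matches(row1, row2, gap_chars="-."):
    """Count columns where both rows carry the same non-gap character."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    return sum(1 for x, y in zip(row1, row2) if x == y and x not in gap_chars)


alignments = {
    "no gaps": ("TCATG", "CATTG"),
    "end gaps": ("TCATG.", ".CATTG"),
    "insertion after A": ("TCA-TG", ".CATTG"),
    "insertion after T": ("TCAT-G", ".CATTG"),
}
for label, (top, bottom) in alignments.items():
    print(f"{label}: {count_matches(top, bottom)} matches")
```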

7
Terminologies of sequence comparison
  • Sequence identity -- exactly the same amino acid
    or nucleotide in the same position.
  • Sequence similarity -- substitutions between
    residues with similar chemical properties.
  • Sequence homology -- general term that indicates
    evolutionary relatedness among sequences;
    percent identity is usually used as a measure of
    homology.
  • Pairwise alignment -- used to find the
    best-matching piecewise (local) or global
    alignments of two query sequences. Pairwise
    alignments can only be used between two sequences
    at a time.
  • Multiple sequence alignment -- tries to align all
    of the sequences in a given query set.

8
Methods of pairwise alignment
  • Dot matrix analysis
  • The dynamic programming (DP) algorithm
  • Word methods

9
What is Dot matrix analysis?
  • Dot matrix analysis is a method for comparing
    two sequences to look for possible alignments
    (Gibbs and McIntyre 1970).
  • The algorithm for a dot matrix:
  • 1. One sequence (A) is listed across the top of
    the matrix and the other (B) is listed down the
    left side.
  • 2. Starting from the first character in B, one
    moves across the page, keeping to the first row,
    and places a dot in every column where the
    character in A is the same.
  • 3. The process is repeated for each following
    character in B (one row each) until all possible
    comparisons between A and B are made.
  • 4. Any region of similarity is revealed by a
    diagonal row of dots.
  • 5. Isolated dots not on a diagonal represent
    random matches.
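The five steps above can be sketched as a short Python function. This is only an illustration of the idea: sequence A runs across the top, sequence B down the left side, and "*" stands in for a dot. The function names are made up for this example.

```python
def dot_matrix(a, b):
    """One row per character of b, one column per character of a;
    '*' marks a position where the two characters are the same."""
    return ["".join("*" if x == y else " " for x in a) for y in b]


def render(a, b):
    """Add sequence A as a header line and sequence B as row labels."""
    lines = ["  " + a]
    lines += [y + " " + row for y, row in zip(b, dot_matrix(a, b))]
    return "\n".join(lines)


print(render("TCATG", "CATTG"))
```

In the printed matrix, the run of stars going down and to the right corresponds to the shared C A T stretch; stars off that diagonal are chance matches of single characters.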

10
What can Dot matrix analysis do?
  • Detection of matching regions can be improved
    by filtering out random matches; this can be
    achieved by using a sliding window.
  • It can be used to assess repetitiveness in a
    single sequence, such as direct and inverted
    repeats within the sequence.
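A sliding-window filter can be sketched as follows. Instead of placing a dot for every single matching character, a dot is placed only where a whole window of consecutive positions agrees well enough. The window size and threshold below are assumptions picked for the example, not values from the slides.

```python
def windowed_dots(a, b, window=3, min_matches=3):
    """Return (i, j) start positions where a[i:i+window] and b[j:j+window]
    agree in at least min_matches positions."""
    hits = []
    for j in range(len(b) - window + 1):
        for i in range(len(a) - window + 1):
            matches = sum(a[i + k] == b[j + k] for k in range(window))
            if matches >= min_matches:
                hits.append((i, j))
    return hits


# Two different sequences: only the shared CAT word survives the filter.
print(windowed_dots("TCATG", "CATTG"))

# A sequence compared with itself: hits off the main diagonal
# (i != j) point to a direct repeat.
self_hits = windowed_dots("ACGACG", "ACGACG")
print([hit for hit in self_hits if hit[0] != hit[1]])
```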

11
1st example of Dot matrix analysis: two identical
sequences
  • http://arbl.cvmbs.colostate.edu/molkit/dnadot/index.html

12
2nd example of Dot matrix analysis: two very
different sequences
  • http://arbl.cvmbs.colostate.edu/molkit/dnadot/index.html

13
3rd example of Dot matrix analysis: two similar
sequences
  • http://arbl.cvmbs.colostate.edu/molkit/dnadot/index.html

14
Dynamic programming algorithm
  • The approach compares every pair of characters
    in the two sequences and generates the best
    (optimal) alignment under the chosen scoring
    system.
  • The method can be useful in aligning nucleotide
    to protein sequences.
  • The method is computationally demanding and
    requires large amounts of computing power,
    because it is recursive: each score is built
    from the scores of shorter alignments.
  • Algorithmic improvements and increasing computer
    capacity make it possible to align a query
    sequence against a large database in a few
    minutes.
  • There are two approaches to dynamic programming:
    top-down and bottom-up.
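The two approaches can be compared on the same recurrence. The sketch below computes a global alignment score both ways; the scoring constants are illustrative assumptions, and the point is only that the recursive (top-down, memoised) and the table-filling (bottom-up) versions give the same answer.

```python
from functools import lru_cache

MATCH, MISMATCH, GAP = 1, -1, -2  # illustrative scores, not taken from the slides


def top_down(a, b):
    """Recursive formulation; lru_cache stores each subproblem once."""
    @lru_cache(maxsize=None)
    def score(i, j):
        if i == 0:
            return j * GAP
        if j == 0:
            return i * GAP
        pair = MATCH if a[i - 1] == b[j - 1] else MISMATCH
        return max(score(i - 1, j - 1) + pair,
                   score(i - 1, j) + GAP,
                   score(i, j - 1) + GAP)
    return score(len(a), len(b))


def bottom_up(a, b):
    """Iterative formulation; the table is filled from the shortest prefixes up."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        table[i][0] = i * GAP
    for j in range(len(b) + 1):
        table[0][j] = j * GAP
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            pair = MATCH if a[i - 1] == b[j - 1] else MISMATCH
            table[i][j] = max(table[i - 1][j - 1] + pair,
                              table[i - 1][j] + GAP,
                              table[i][j - 1] + GAP)
    return table[len(a)][len(b)]


print(top_down("TCATG", "CATTG"), bottom_up("TCATG", "CATTG"))
```

For long sequences the bottom-up form is usually preferred in practice, since the recursive form can hit Python's recursion limit.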

15
The procedure of the dynamic programming algorithm
  • The alignment procedure depends on a scoring
    system based on the probability that
  • 1) a particular amino acid pair is found in
    alignments of related proteins (pxy);
  • 2) the same amino acid pair is aligned by
    chance (pxpy);
  • 3) introducing a gap would be a better choice
    because it increases the score.
  • A substitution matrix is built from the ratio
    of the first two probabilities, pxy / (pxpy).
    There are many such matrices; two of them, PAM
    and BLOSUM, are discussed in the next few slides.
  • The scores for introducing a gap and for
    extending it are chosen relative to the matrix
    values and reflect prior knowledge and some
    assumptions.
  • One such assumption is quite simple: if the
    gap penalty is too high, a reasonable alignment
    between slightly different sequences will never
    be achieved, but if it is too low, an optimal
    alignment is hardly possible. Other assumptions
    are based on sophisticated statistical
    procedures.

16
An example: scoring a sequence alignment with a
gap penalty
Sequence 1    V    D    S    -    C    Y
Sequence 2    V    E    S    L    C    Y
Score         4    2    4  -11    9    7
Score = sum of amino acid pair scores (26) minus
single gap penalty (11) = 15
Notes
1. Non-identical amino acids are likely to be
placed in corresponding positions.
2. Scores gained by each match are not always the
same; for instance, a match between two rare amino
acids scores more than a match between two common
ones.
3. Alignment gaps may be introduced to optimise
the score. Each gap introduced incurs a penalty.
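The arithmetic of this example can be reproduced with a few lines of Python. The pair scores and the gap penalty are the ones quoted on the slide; the small lookup table is a stand-in for a full substitution matrix and only covers the pairs that occur here.

```python
# Pair scores quoted on the slide; a real program would look them up
# in a complete substitution matrix.
PAIR_SCORES = {("V", "V"): 4, ("D", "E"): 2, ("S", "S"): 4,
               ("C", "C"): 9, ("Y", "Y"): 7}
GAP_PENALTY = 11


def alignment_score(row1, row2):
    """Sum pair scores column by column, subtracting the penalty at each gap."""
    total = 0
    for x, y in zip(row1, row2):
        if x == "-" or y == "-":
            total -= GAP_PENALTY
        else:
            total += PAIR_SCORES[(x, y)]
    return total


print(alignment_score("VDS-CY", "VESLCY"))  # 26 - 11 = 15
```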
17
Steps for the dynamic programming algorithm
  • 1. Score of new alignment = Score of previous
    alignment (A) + Score of new aligned pair
        V D S - C Y     V D S - C     Y
        V E S L C Y  =  V E S L C  +  Y
            15              8         7
  • 2. Score of alignment (A) = Score of previous
    alignment (B) + Score of new aligned pair
        V D S - C     V D S -     C
        V E S L C  =  V E S L  +  C
            8           -1        9
  • 3. Repeat removing aligned pairs until the end
    of the alignment is reached.
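The same decomposition can be seen by accumulating the column scores from left to right: every prefix of the alignment has a score equal to the score of the previous prefix plus the score of the newly added column. A minimal sketch, using the column scores of the V D S - C Y / V E S L C Y example:

```python
from itertools import accumulate

# Column scores for V/V, D/E, S/S, -/L, C/C, Y/Y (gap counted as -11).
column_scores = [4, 2, 4, -11, 9, 7]

# prefix_scores[k] is the score of the alignment truncated after column k.
prefix_scores = list(accumulate(column_scores))
print(prefix_scores)  # [4, 6, 10, -1, 8, 15]

# Step 1 on the slide: 15 = 8 + 7; step 2: 8 = -1 + 9.
for k in range(len(column_scores) - 1, 0, -1):
    print(f"{prefix_scores[k]} = {prefix_scores[k - 1]} + {column_scores[k]}")
```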

18
Why use a substitution matrix?
  • To determine the likelihood of homology between
    two sequences.
  • Substitutions that are more likely should get a
    higher score.
  • Substitutions that are less likely should get a
    lower score.

19
How to calculate Scoring Matrices
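  • Log-odds matrix, where each cell gives the log
    of the odds of aligning those two residues
    (probability in related sequences relative to
    chance)
  • Score of alignment = sum of log-odds scores of
    the aligned residues
  • Score for each residue pair is given by the
    log-odds ratio of pxy to pxpy (the pair and
    chance probabilities defined on slide 15)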
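As a rough sketch of how a log-odds score behaves, the function below takes the log of the ratio between the observed pair frequency (pxy) and the frequency expected by chance (px times py). The frequencies used are made-up numbers chosen to give round results, and base 2 is an assumption; published matrices also scale and round their values.

```python
import math


def log_odds(p_xy, p_x, p_y):
    """Log2 of observed pair frequency over the frequency expected by chance."""
    return math.log2(p_xy / (p_x * p_y))


def alignment_log_odds(columns):
    """Score of an alignment = sum of the log-odds scores of its columns."""
    return sum(log_odds(p_xy, p_x, p_y) for p_xy, p_x, p_y in columns)


# Hypothetical frequencies, for illustration only.
print(log_odds(0.125, 0.25, 0.25))    # pair seen twice as often as chance
print(log_odds(0.0625, 0.25, 0.25))   # pair seen exactly as often as chance
print(log_odds(0.03125, 0.25, 0.25))  # pair seen half as often as chance
```

A positive score means the pair is seen more often in related sequences than chance predicts, zero means no difference, and a negative score means the pair is seen less often.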

20
Types of Matrices
  • Percent identity: standard scoring matrix to
    align DNA sequences
  • PAM: estimates the rate at which each possible
    residue in a sequence changes to each other
    residue over time
  • BLOSUM-X: identifies sequences that are X%
    similar to the query sequence

21
Scoring matrices: PAM (Percent Accepted Mutation)
and BLOSUM62 (BLOcks amino acid SUbstitution
Matrices)
Amino acids are grouped according to the
chemistry of the side group: (C) sulfhydryl,
(STPAG) small hydrophilic, (NDEQ) acid, acid
amide and hydrophilic, (HRK) basic, (MILV) small
hydrophobic, and (FYW) aromatic. A log-odds value
of 10 means that the ancestor probability is
greater, 0 means that the probabilities are
equal, and -4 means that the change is random.
Thus the score of the alignment YY/YY is
10 + 10 = 20, whereas that of YY/TP is
-3 - 5 = -8, a rare and unexpected substitution
between homologous sequences.
BLOSUM is based on local alignments. BLOSUM was
first introduced in a paper by Henikoff and
Henikoff. They scanned the BLOCKS database for
very conserved regions of protein families (that
do not have gaps in the sequence alignment) and
then counted the relative frequencies of amino
acids and their substitution probabilities. Then,
they calculated a log-odds score for each of the
210 possible substitutions of the 20 standard
amino acids.
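Two numbers in this slide can be checked with a short script: the count of 210 residue pairs, and the YY/YY and YY/TP alignment scores. The three matrix values are the ones quoted above; the tiny lookup table is only an excerpt made for this example, not a complete matrix.

```python
from itertools import combinations_with_replacement

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes

# Unordered pairs, identities included: 20 + (20 * 19) / 2 = 210.
pairs = list(combinations_with_replacement(AMINO_ACIDS, 2))
print(len(pairs))

# Excerpt of log-odds values as quoted on the slide (symmetric lookup).
EXCERPT = {frozenset("Y"): 10, frozenset("YT"): -3, frozenset("YP"): -5}


def score(seq1, seq2):
    """Sum the quoted log-odds values over aligned positions (no gaps)."""
    return sum(EXCERPT[frozenset(x + y)] for x, y in zip(seq1, seq2))


print(score("YY", "YY"), score("YY", "TP"))
```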
22
Word methods
  • Word methods, also known as k-tuple methods, are
    heuristic methods that are not guaranteed to find
    an optimal alignment solution, but are
    significantly more efficient than dynamic
    programming.
  • The typical tools that use this method are
    BLAST and FASTA.
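The core idea of a word method can be sketched as follows: index every k-letter word of one sequence, then look up the words of the other sequence to find exact "seed" matches. This is a simplified illustration with hypothetical function names; BLAST and FASTA add scoring, seed extension and statistics on top of this step.

```python
from collections import defaultdict


def kmer_index(seq, k):
    """Map each k-letter word to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def seed_hits(query, target, k=3):
    """Return (query_pos, target_pos) pairs where the same k-letter word
    occurs in both sequences."""
    index = kmer_index(target, k)
    return [(i, j)
            for i in range(len(query) - k + 1)
            for j in index.get(query[i:i + k], [])]


print(seed_hits("TCATG", "CATTG"))
```

Because only positions that share a word are examined further, most of the dynamic programming table is never computed. That is where the speed comes from, and also why an optimal alignment is not guaranteed.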

23
The list of sequence alignment software
  • http://en.wikipedia.org/wiki/List_of_sequence_alignment_software