Multiple Sequence Alignment - PowerPoint PPT Presentation

Provided by: Martin488
Transcript and Presenter's Notes

1
Multiple Sequence Alignment
2
Alignment can be easy or difficult
Easy
Difficult due to insertions or deletions
(indels)
3
Homology Definition
  • Homology: similarity that is the result of
    inheritance from a common ancestor. The
    identification and analysis of homologies is
    central to phylogenetic systematics.
  • An alignment is a hypothesis of positional
    homology between bases or amino acids.

4
Multiple Sequence Alignment- Goals
  • To generate a concise, information-rich summary
    of sequence data.
  • Sometimes used to illustrate the similarity
    between a group of sequences.
  • Sometimes used to illustrate the dissimilarity
    between a group of sequences.
  • Alignments can be treated as models that can be
    used to test hypotheses.

7
Alignment of 16S rRNA can be guided by secondary
structure
Alignment of 16S rRNA sequences from different
bacteria
8
Protein Alignment may be guided by Tertiary
Structure Interactions
Homo sapiens DjlA protein
Escherichia coli DjlA protein
9
Multiple Sequence Alignment- Methods
  • 3 main methods of alignment
  • Manual
  • Automatic
  • Combined

10
Manual Alignment - reasons
  • Might be carried out because:
  • Alignment is easy.
  • There is some additional information available
    (e.g. structural).
  • Automated alignment methods have encountered the
    local minimum problem.
  • The output of an automated alignment method can
    be improved.

11
Dynamic programming
  • 2 methods:
  • Dynamic programming
  • Consider 2 protein sequences of 100 amino acids
    in length.
  • If it takes 100^2 seconds to exhaustively align
    these sequences, then it will take 100^3 seconds
    to align 3 sequences, 100^4 to align 4
    sequences, etc.
  • Aligning 20 sequences exhaustively would take
    100^20 = 10^40 seconds, or about 3.17 x 10^32
    years.
  • Progressive alignment
12
Progressive Alignment
  • Devised by Feng and Doolittle in 1987.
  • Essentially a heuristic method and as such is not
    guaranteed to find the optimal alignment.
  • Requires (n-1) + (n-2) + (n-3) + ... + (n-n+1) =
    n(n-1)/2 pairwise alignments as a starting point
  • Most successful implementation is Clustal (Des
    Higgins)

13
Overview of ClustalW Procedure
CLUSTAL W

Step 1: Quick pairwise alignment, calculate distance matrix

               1     2     3     4     5
Hbb_Human  1   -
Hbb_Horse  2  .17    -
Hba_Human  3  .59   .60    -
Hba_Horse  4  .59   .59   .13    -
Myg_Whale  5  .77   .77   .75   .75    -

Step 2: Neighbor-joining tree (guide tree)
[Figure: guide tree with leaves Hbb_Human, Hbb_Horse, Hba_Human, Hba_Horse and Myg_Whale; internal nodes numbered 1 to 4]

Step 3: Progressive alignment following guide tree
[Figure: alignment steps numbered 1 to 4; alpha-helices marked above the alignment]

1 PEEKSAVTALWGKVN--VDEVGG
2 GEEKAAVLALWDKVN--EEEVGG
3 PADKTNVKAAWGKVGAHAGEYGA
4 AADKTNVKAAWSKVGGHAGEYGA
5 EHEWQLVLHVWAKVEADVAGHGQ
14
ClustalW- Pairwise Alignments
  • First perform all possible pairwise alignments
    between each pair of sequences. There are
    (n-1) + (n-2) + ... + (n-n+1) = n(n-1)/2
    possibilities.
  • Calculate the distance between each pair of
    sequences based on these isolated pairwise
    alignments.
  • Generate a distance matrix.
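
As a rough illustration of these three steps, the sketch below enumerates all n(n-1)/2 pairs and fills a distance matrix. It takes a shortcut that ClustalW does not take: instead of aligning each pair separately, it reuses the short, already aligned globin fragments from the overview slide and uses the fraction of mismatching columns as the distance. The function names are illustrative, and the values will not match the slide's matrix, which was presumably computed from the full-length sequences.

```python
from itertools import combinations

# Aligned fragments copied from the overview slide (rows 1 to 5).
ALIGNED = {
    "Hbb_Human": "PEEKSAVTALWGKVN--VDEVGG",
    "Hbb_Horse": "GEEKAAVLALWDKVN--EEEVGG",
    "Hba_Human": "PADKTNVKAAWGKVGAHAGEYGA",
    "Hba_Horse": "AADKTNVKAAWSKVGGHAGEYGA",
    "Myg_Whale": "EHEWQLVLHVWAKVEADVAGHGQ",
}


def p_distance(a, b):
    """Fraction of mismatching columns, ignoring columns that contain a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    mismatches = sum(1 for x, y in pairs if x != y)
    return mismatches / len(pairs)


def distance_matrix(seqs):
    """Map every unordered pair of names to its distance."""
    return {(p, q): p_distance(seqs[p], seqs[q]) for p, q in combinations(seqs, 2)}


matrix = distance_matrix(ALIGNED)
n = len(ALIGNED)
print(f"{len(matrix)} pairs for {n} sequences (n(n-1)/2 = {n * (n - 1) // 2})")
for (p, q), d in matrix.items():
    print(f"{p:10s} {q:10s} {d:.2f}")
```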

15
Path Graph for aligning two sequences.
16
Possible alignment
  • Scoring scheme:
  • Match: 1
  • Mismatch: 0
  • Indel: -1

Scores along this path: 1, 1, 0, 1, 0, -1
Score for this path: 2
17
Alignment using this path:
GATTC-
GAATTC
Column scores: 1, 1, 0, 1, 0, -1
18
Optimal Alignment 1
Alignment using this path:
GA-TTC
GAATTC
Column scores: 1, 1, -1, 1, 1, 1
Alignment score: 4
19
Optimal Alignment 2
Alignment using this path:
G-ATTC
GAATTC
Column scores: 1, -1, 1, 1, 1, 1
Alignment score: 4
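
The path-graph slides can be reproduced with a small dynamic-programming sketch. The code below is a plain global alignment (Needleman-Wunsch style) using the scoring scheme from the slides (match 1, mismatch 0, indel -1). The function names are illustrative, and the traceback returns only one of the co-optimal alignments.

```python
def score_alignment(row_a, row_b, match=1, mismatch=0, indel=-1):
    """Sum the column scores of two already aligned rows."""
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total += indel
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


def global_alignment(a, b, match=1, mismatch=0, indel=-1):
    """Return (best score, aligned a, aligned b) for a global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * indel
    for j in range(1, cols):
        score[0][j] = j * indel
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + indel, score[i][j - 1] + indel)

    # Trace back from the bottom-right corner, preferring diagonal moves.
    out_a, out_b = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            if score[i][j] == diag:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + indel:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


best, row_a, row_b = global_alignment("GATTC", "GAATTC")
print(best)  # 4
print(row_a)
print(row_b)
print(score_alignment("GATTC-", "GAATTC"))  # 2, the non-optimal path
```

Both optimal alignments on the slides score 4; which of the two the traceback reports depends only on its tie-breaking order.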
20
ClustalW- Guide Tree
  • Generate a Neighbor-Joining guide tree from
    these pairwise distances.
  • This guide tree gives the order in which the
    progressive alignment will be carried out.
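
To show how a distance matrix turns into an alignment order, here is a minimal sketch that uses the distances from the overview slide. Note the substitution: ClustalW builds a neighbor-joining tree, whereas this sketch uses UPGMA (average linkage) because it is much shorter. The merge order it reports is therefore only an approximation of a guide tree and can differ from neighbor-joining on other data. All names are illustrative.

```python
NAMES = ["Hbb_Human", "Hbb_Horse", "Hba_Human", "Hba_Horse", "Myg_Whale"]

# Lower triangle of the distance matrix from the overview slide.
LOWER = [
    [],
    [0.17],
    [0.59, 0.60],
    [0.59, 0.59, 0.13],
    [0.77, 0.77, 0.75, 0.75],
]


def upgma_order(names, lower):
    """Return the list of merges (cluster, cluster, distance) in UPGMA order."""
    dist = {}
    for i, row in enumerate(lower):
        for j, d in enumerate(row):
            dist[frozenset((i, j))] = d

    def average(c1, c2):
        # Mean distance over all leaf pairs between the two clusters.
        total = sum(dist[frozenset((x, y))] for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    clusters = [(i,) for i in range(len(names))]
    merges = []
    while len(clusters) > 1:
        candidates = [
            (average(c1, c2), c1, c2)
            for k, c1 in enumerate(clusters)
            for c2 in clusters[k + 1:]
        ]
        d, c1, c2 = min(candidates, key=lambda t: t[0])
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
        merges.append(
            (tuple(names[i] for i in c1), tuple(names[i] for i in c2), round(d, 4))
        )
    return merges


for left, right, d in upgma_order(NAMES, LOWER):
    print(f"join {left} + {right} at distance {d}")
```

With these distances the two alpha-globins are joined first, then the two beta-globins, then the two globin pairs, and myoglobin last.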

21
Multiple Alignment- First pair
  • Align the two most closely-related sequences
    first.
  • This alignment is then fixed and will never
    change. If a gap is to be introduced
    subsequently, then it will be introduced in the
    same place in both sequences, but their relative
    alignment remains unchanged.
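
The rule that an existing alignment stays fixed can be sketched as a helper that inserts the same gap into every sequence of an already aligned group. The function name and arguments are illustrative; the example pair and the gap position are taken from the progressive-alignment example later in the deck.

```python
def insert_gap(group, position, length=1):
    """Insert the same gap into every sequence of an already aligned group.

    The sequences keep their alignment relative to each other, because the
    gap lands in the same column in all of them.
    """
    if len({len(s) for s in group}) != 1:
        raise ValueError("all sequences in an aligned group must have equal length")
    return [s[:position] + "-" * length + s[position:] for s in group]


pair = ["gctcgatacacgatgactagcta", "gctcgatacacgatgacgagcga"]
for row in insert_gap(pair, 13, 3):
    print(row)
```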

22
ClustalW- Decision time
  • Consult the guide tree to see what alignment is
    performed next.
  • Option 1: align a third sequence to the first
    two, or
  • Option 2: align two entirely different sequences
    to each other.

23
ClustalW- Alternative 1
If the situation arises where a third sequence is
aligned to the first two, then when a gap has to
be introduced to improve the alignment, the two
entities (the aligned pair and the third sequence)
are each treated as a single sequence.

24
ClustalW- Alternative 2
  • If, on the other hand, two separate sequences
    have to be aligned together, then the first
    pairwise alignment is placed to one side and the
    pairwise alignment of the other two is carried
    out.


25
ClustalW- Progression
  • The alignment is progressively built up in this
    way, with each step being treated as a pairwise
    alignment, sometimes with each member of a pair
    having more than one sequence.

26
Progressive alignment - step 1
Input sequences:
gctcgatacgatacgatgactagcta
gctcgatacaagacgatgacagcta
gctcgatacacgatgactagcta
gctcgatacacgatgacgagcga
ctcgaacgatacgatgactagct

First pair aligned:
gctcgatacgatacgatgactagcta
gctcgatacaagacgatgac-agcta
27
Progressive alignment - step 2
Input sequences:
gctcgatacgatacgatgactagcta
gctcgatacaagacgatgacagcta
gctcgatacacgatgactagcta
gctcgatacacgatgacgagcga
ctcgaacgatacgatgactagct

Second pair aligned:
gctcgatacacgatgactagcta
gctcgatacacgatgacgagcga
28
Progressive alignment - step 3
The two aligned pairs:
gctcgatacgatacgatgactagcta
gctcgatacaagacgatgac-agcta
gctcgatacacgatgactagcta
gctcgatacacgatgacgagcga

Pairs aligned to each other:
gctcgatacgatacgatgactagcta
gctcgatacaagacgatgac-agcta
gctcgatacacga---tgactagcta
gctcgatacacga---tgacgagcga
29
Progressive alignment - final step
Four aligned sequences plus the remaining sequence:
gctcgatacgatacgatgactagcta
gctcgatacaagacgatgac-agcta
gctcgatacacga---tgactagcta
gctcgatacacga---tgacgagcga
ctcgaacgatacgatgactagct

Final alignment:
gctcgatacgatacgatgactagcta
gctcgatacaagacgatgac-agcta
gctcgatacacga---tgactagcta
gctcgatacacga---tgacgagcga
-ctcga-acgatacgatgactagct-
30
ClustalW-Good points/Bad points
  • Advantages
  • Speed.
  • Disadvantages
  • No objective function.
  • No way of quantifying whether or not the
    alignment is good
  • No way of knowing if the alignment is correct.

31
ClustalW-Local Minimum
  • Potential problems
  • Local minimum problem. If an error is introduced
    early in the alignment process, it is impossible
    to correct this later in the procedure.
  • Arbitrary alignment.

32
Increasing the sophistication of the alignment
process.
  • Should we treat all the sequences in the same
    way? - even though some sequences are
    closely-related and some sequences are distant
    relatives.
  • Should we treat all positions in the sequences as
    though they were the same? - even though they
    might have different functions and different
    locations in the 3-dimensional structure.

34
ClustalW- Caveats
  • Sequence weighting
  • Varying substitution matrices
  • Residue-specific gap penalties and reduced
    penalties in hydrophilic regions (external
    regions of protein sequences), encourage gaps in
    loops rather than in core regions.
  • Positions in early alignments where gaps have
    been opened receive locally reduced gap penalties
    to encourage openings in subsequent alignments

35
ClustalW- User-supplied values
  • Two penalties are set by the user (there are
    default values, but you should know that it is
    possible to change these).
  • GOP- Gap Opening Penalty is the cost of opening a
    gap in an alignment.
  • GEP- Gap Extension Penalty is the cost of
    extending this gap.
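
The interplay of the two penalties can be sketched with an affine gap cost. The sketch assumes one common convention (a gap of length L costs GOP + L x GEP); other programs charge the extension only from the second position onward. The numeric values are placeholders chosen for illustration, not a statement about ClustalW's defaults.

```python
def gap_cost(length, gop, gep):
    """Affine gap cost: one opening penalty plus an extension per gap position.

    Assumed convention: every position of the gap, including the first,
    pays the extension penalty.
    """
    if length <= 0:
        return 0.0
    return gop + gep * length


GOP, GEP = 10.0, 0.2  # hypothetical values for illustration only

one_long_gap = gap_cost(3, GOP, GEP)
three_short_gaps = 3 * gap_cost(1, GOP, GEP)
print(f"one gap of length 3:    {one_long_gap:.1f}")
print(f"three gaps of length 1: {three_short_gaps:.1f}")
```

Because opening is much more expensive than extending, a single long gap is preferred over several scattered short ones.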

36
Position-Specific gap penalties
  • Before any pair of (groups of) sequences is
    aligned, a table of GOPs is generated for each
    position in the two (sets of) sequences.
  • The GOP is manipulated in a position-specific
    manner, so that it can vary over the sequences.
  • If there is a gap at a position, the GOP and GEP
    are lowered and the other rules do not apply.
  • This makes gaps more likely at positions where
    gaps already exist.

37
Discouraging too many gaps
  • If there is no gap opened, then the GOP is
    increased if the position is within 8 residues of
    an existing gap.
  • This discourages gaps that are too close
    together.
  • At any position within a run of hydrophilic
    residues, the GOP is decreased.
  • These runs usually indicate loop regions in
    protein structures.
  • A run of 5 hydrophilic residues is considered to
    be a hydrophilic stretch.
  • The default hydrophilic residues are
  • D, E, G, K, N, Q, P, R, S
  • But this can be changed by the user.
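
The position-specific rules from this slide and the previous one can be sketched as a function that builds a GOP table for one aligned group. The slides give the rules but not the sizes of the adjustments, so the three factors below are hypothetical placeholders. Applying the "near a gap" and "hydrophilic run" rules together when both hold, and marking a column as hydrophilic if any sequence has a run there, are also assumptions made for this sketch.

```python
HYDROPHILIC = set("DEGKNQPRS")  # default residues listed on the slide


def hydrophilic_positions(seq, run_length=5):
    """Indices lying inside a run of at least run_length hydrophilic residues."""
    inside = set()
    start = None
    for i, res in enumerate(seq + "#"):  # sentinel closes a trailing run
        if res in HYDROPHILIC:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= run_length:
                inside.update(range(start, i))
            start = None
    return inside


def gop_table(group, base_gop, gap_factor=0.3, near_factor=1.5,
              hydro_factor=0.67, window=8):
    """Per-column gap opening penalties for an aligned group (factors are hypothetical)."""
    width = len(group[0])
    gap_cols = [c for c in range(width) if any(s[c] == "-" for s in group)]
    hydro_cols = set()
    for s in group:
        hydro_cols |= hydrophilic_positions(s)

    table = []
    for c in range(width):
        if c in gap_cols:
            # Existing gap: lower the penalty; the other rules do not apply.
            table.append(base_gop * gap_factor)
            continue
        gop = base_gop
        if any(abs(c - g) <= window for g in gap_cols):
            gop *= near_factor   # discourage gaps close to an existing gap
        if c in hydro_cols:
            gop *= hydro_factor  # encourage gaps in likely loop regions
        table.append(gop)
    return table


print(sorted(hydrophilic_positions("AAADEGKNAAA")))
print(gop_table(["AW-WA", "AWWWA"], 10.0, window=1))
```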

38
Divergent Sequences
  • The most divergent sequences (most different, on
    average from all of the other sequences) are
    usually the most difficult to align.
  • It is sometimes better to delay their alignment
    until later (when the easier sequences have
    already been aligned).
  • The user has the choice of setting a cutoff
    (default is 40% identity).
  • This will delay the alignment until the others
    have been aligned.
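
A minimal sketch of this delay rule, following the slide's description of divergence as the average difference from all other sequences. It reuses the distances from the overview slide and assumes they can be read as 1 minus the fractional identity, which the slides do not state. The exact criterion used by ClustalW may differ; the function name is illustrative.

```python
NAMES = ["Hbb_Human", "Hbb_Horse", "Hba_Human", "Hba_Horse", "Myg_Whale"]

# Lower triangle of the distance matrix from the overview slide.
LOWER = [
    [],
    [0.17],
    [0.59, 0.60],
    [0.59, 0.59, 0.13],
    [0.77, 0.77, 0.75, 0.75],
]


def delayed_sequences(names, lower, cutoff=40.0):
    """Names whose mean percent identity to all other sequences is below cutoff."""
    dist = {}
    for i, row in enumerate(lower):
        for j, d in enumerate(row):
            dist[frozenset((i, j))] = d

    delayed = []
    for i, name in enumerate(names):
        others = [dist[frozenset((i, j))] for j in range(len(names)) if j != i]
        # Assumption: distance = 1 - fractional identity.
        mean_identity = 100.0 * (1.0 - sum(others) / len(others))
        if mean_identity < cutoff:
            delayed.append(name)
    return delayed


print(delayed_sequences(NAMES, LOWER))
```

Under these assumptions only the myoglobin sequence falls below the 40% cutoff and would be aligned last.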

39
T-COFFEE (Tree-based Consistency Objective
Function for alignment Evaluation)
  • Generate a library of all the pairwise alignments
    between the sequences.
  • This gives positional information concerning
    which residues are homologous to which other
    residues.
  • This can then be used to guide progressive
    alignments.

40
An example dataset
  • SequenceA GARFIELD THE LAST FAT CAT
  • SequenceB GARFIELD THE FAST CAT
  • SequenceC GARFIELD THE VERY FAST CAT
  • SequenceD THE FAT CAT

Clustal alignment
Sequence A GARFIELD THE LAST FA-T CAT
Sequence B GARFIELD THE FAST CA-T ---
Sequence C GARFIELD THE VERY FAST CAT
Sequence D -------- THE ---- FA-T CAT
41
Primary library
  • SeqA GARFIELD THE LAST FAT CAT
    SeqB GARFIELD THE FAST CAT --- (weight 88)
  • SeqA GARFIELD THE LAST FA-T CAT
    SeqC GARFIELD THE VERY FAST CAT (weight 77)
  • SeqA GARFIELD THE LAST FAT CAT
    SeqD -------- THE ---- FAT CAT (weight 100)
  • SeqB GARFIELD THE ---- FAST CAT
    SeqC GARFIELD THE VERY FAST CAT (weight 100)
  • SeqB GARFIELD THE FAST CAT
    SeqD -------- THE FA-T CAT (weight 100)
  • SeqC GARFIELD THE VERY FAST CAT
    SeqD -------- THE ---- FA-T CAT (weight 100)

42
Secondary library
  • SeqA GARFIELD THE LAST FAT CAT
    SeqB GARFIELD THE FAST CAT
    Weight 88
  • SeqA GARFIELD THE LAST FAT CAT
    SeqC GARFIELD THE VERY FAST CAT
    SeqB GARFIELD THE FAST CAT
    Weight 77
  • SeqA GARFIELD THE LAST FAT CAT
    SeqD THE FAT CAT
    SeqB GARFIELD THE FAST CAT
    Weight 100
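
The three weights on this slide can be reproduced with a small sketch of library extension. It assumes the primary-library weights are A-B 88, A-C 77 and 100 for every other pair, which is how the primary library slide appears to read. It also simplifies heavily: in T-COFFEE the weights attach to individual residue pairs, whereas this sketch works on whole sequence pairs. The weight of a route through a third sequence is taken as the smaller of the two primary weights along it. All names are illustrative.

```python
# Assumed primary-library weights, keyed by unordered sequence pair.
PRIMARY = {
    frozenset("AB"): 88,
    frozenset("AC"): 77,
    frozenset("AD"): 100,
    frozenset("BC"): 100,
    frozenset("BD"): 100,
    frozenset("CD"): 100,
}


def extension_weights(x, y, sequences, primary):
    """Weights supporting the pair (x, y): directly and via each third sequence."""
    result = {"direct": primary[frozenset((x, y))]}
    for z in sequences:
        if z in (x, y):
            continue
        # A route through z is only as reliable as its weaker half.
        result[f"via {z}"] = min(primary[frozenset((x, z))], primary[frozenset((z, y))])
    return result


print(extension_weights("A", "B", "ABCD", PRIMARY))
```

For the pair SeqA and SeqB this gives 88 for the direct alignment, 77 through SeqC and 100 through SeqD, matching the slide.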

43
Extended library
  • SeqA GARFIELD THE LAST FAT CAT
  • SeqB GARFIELD THE FAST CAT

Dynamic programming
  • SeqA GARFIELD THE LAST FA-T CAT
  • SeqB GARFIELD THE ---- FAST CAT
44
Advice on progressive alignment
  • Progressive alignment is a mathematical process
    that is completely independent of biological
    reality.
  • Can be a very good estimate
  • Can be an impossibly poor estimate.
  • Requires user input and skill.
  • Treat cautiously
  • Can be improved by eye (usually)
  • Often helps to have colour-coding.
  • Depending on the use, the user should be able to
    make a judgement on those regions that are
    reliable or not.
  • For phylogeny reconstruction, only use those
    positions whose hypothesis of positional homology
    is unimpeachable
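
One crude, mechanical proxy for the last point is to keep only the columns that contain no gaps. This is an assumption made for illustration: gap-free columns are not necessarily reliable, and judging positional homology still needs the user's eye. The function name is illustrative.

```python
def gap_free_columns(alignment):
    """Drop every column in which at least one sequence has a gap."""
    width = len(alignment[0])
    keep = [c for c in range(width) if all(s[c] != "-" for s in alignment)]
    return ["".join(s[c] for c in keep) for s in alignment]


for row in gap_free_columns(["GA-TTC", "GAATTC"]):
    print(row)
```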

45
Alignment of protein-coding DNA sequences
  • It is not very sensible to align the DNA
    sequences of protein-coding genes.

Unaligned:
ATGCTGTTAGGG
ATGCTCGTAGGG
Aligned as DNA:
ATGCT-GTTAGGG
ATGCTCGT-AGGG
The result might be highly implausible and might
not reflect what is known about biological
processes. It is much more sensible to translate
the sequences to their corresponding amino acid
sequences, align these protein sequences and then
put the gaps in the DNA sequences according to
where they are found in the amino acid alignment.
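
The last step, putting the gaps back into the DNA, can be sketched as follows. The function takes one row of a protein alignment and the corresponding unaligned coding sequence and emits three gap characters for every gap in the protein. Translation and the protein alignment itself are left out. The two DNA sequences on this slide translate to MLLG and MLVG, which align without gaps, so the gapped example below is hypothetical. The function name is illustrative.

```python
def thread_gaps(aligned_protein, dna):
    """Copy the gaps of an aligned protein row into its coding DNA sequence."""
    if len(dna) % 3 != 0:
        raise ValueError("DNA length must be a multiple of 3")
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    residues = sum(1 for r in aligned_protein if r != "-")
    if residues != len(codons):
        raise ValueError("protein and DNA lengths do not correspond")
    remaining = iter(codons)
    return "".join("---" if r == "-" else next(remaining) for r in aligned_protein)


# Sequences from the slide: no gaps in the protein alignment, none in the DNA.
print(thread_gaps("MLLG", "ATGCTGTTAGGG"))
print(thread_gaps("MLVG", "ATGCTCGTAGGG"))

# Hypothetical shorter sequence aligned with one gap at the protein level.
print(thread_gaps("M-LG", "ATGCTCGGG"))
```

Gaps introduced this way always fall between codons, so the reading frame is preserved.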
46
Manual Alignment- software
  • GDE- The Genetic Data Environment (UNIX)
  • CINEMA- Java applet available from
  • http://www.biochem.ucl.ac.uk
  • Seqapp/Seqpup- Mac/PC/UNIX available from
  • http://iubio.bio.indiana.edu
  • SeAl for Macintosh, available from
  • http://evolve.zoo.ox.ac.uk/Se-Al/Se-Al.html
  • BioEdit for PC, available from
  • http://www.mbio.ncsu.edu/RNaseP/info/programs/BIOEDIT/bioedit.html