Title: Multiple Sequence Alignment
1Multiple Sequence Alignment
2Alignment can be easy or difficult
Easy
Difficult due to insertions or deletions (indels)
3Homology Definition
- Homology: similarity that is the result of inheritance from a common ancestor. The identification and analysis of homologies is central to phylogenetic systematics.
- An alignment is a hypothesis of positional homology between bases or amino acids.
4Multiple Sequence Alignment- Goals
- To generate a concise, information-rich summary of sequence data.
- Sometimes used to illustrate the similarity between a group of sequences.
- Sometimes used to illustrate the dissimilarity between a group of sequences.
- Alignments can be treated as models that can be used to test hypotheses.
7Alignment of 16S rRNA can be guided by secondary
structure
Alignment of 16S rRNA sequences from different
bacteria
8Protein Alignment may be guided by Tertiary
Structure Interactions
Homo sapiens DjlA protein
Escherichia coli DjlA protein
9Multiple Sequence Alignment- Methods
- 3 main methods of alignment
- Manual
- Automatic
- Combined
10Manual Alignment - reasons
- Might be carried out because
- Alignment is easy.
- There is some additional information (e.g. structural) that can guide the alignment.
- Automated alignment methods have encountered the local minimum problem.
- The result of an automated alignment method can be improved.
11Dynamic programming
- 2 methods
- Dynamic programming
- Consider 2 protein sequences of 100 amino acids in length.
- If it takes 100^2 seconds to exhaustively align these sequences, then it will take 100^3 seconds to align 3 sequences, 100^4 seconds to align 4 sequences, etc.
- About 3.17 x 10^32 years (100^20 seconds) to align 20 sequences exhaustively.
- Progressive alignment
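The growth in cost described above can be checked with a few lines of arithmetic. This is only a sketch: it assumes the cost model used on this slide (aligning n sequences of length 100 exhaustively takes 100^n seconds) and a 365-day year. The function names are hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: a 365-day year


def exhaustive_seconds(num_sequences: int, length: int = 100) -> int:
    """Assumed cost model from the slide: length ** num_sequences seconds."""
    return length ** num_sequences


def exhaustive_years(num_sequences: int, length: int = 100) -> float:
    """The same cost expressed in years."""
    return exhaustive_seconds(num_sequences, length) / SECONDS_PER_YEAR


if __name__ == "__main__":
    for n in (2, 3, 4, 20):
        print(f"{n:>2} sequences: {exhaustive_seconds(n):.3e} s "
              f"= {exhaustive_years(n):.3e} years")
```

Each extra sequence multiplies the cost by the sequence length, which is why exhaustive alignment is abandoned for heuristics once there are more than a handful of sequences.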
12Progressive Alignment
- Devised by Feng and Doolittle in 1987.
- Essentially a heuristic method and as such is not guaranteed to find the optimal alignment.
- Requires (n-1) + (n-2) + (n-3) + ... + (n-n+1) = n(n-1)/2 pairwise alignments as a starting point.
- Most successful implementation is Clustal (Des Higgins).
13Overview of ClustalW Procedure
CLUSTAL W
Quick pairwise alignment: calculate distance matrix
Hbb_Human 1 -
Hbb_Horse 2 .17 -
Hba_Human 3 .59 .60 -
Hba_Horse 4 .59 .59 .13 -
Myg_Whale 5 .77 .77 .75 .75 -
Neighbor-joining tree (guide tree)
Leaves: Hbb_Human, Hbb_Horse, Hba_Human, Hba_Horse, Myg_Whale
Progressive alignment following guide tree
alpha-helices
1 PEEKSAVTALWGKVN--VDEVGG
2 GEEKAAVLALWDKVN--EEEVGG
3 PADKTNVKAAWGKVGAHAGEYGA
4 AADKTNVKAAWSKVGGHAGEYGA
5 EHEWQLVLHVWAKVEADVAGHGQ
14ClustalW- Pairwise Alignments
- First perform all possible pairwise alignments between each pair of sequences. There are (n-1) + (n-2) + ... + (n-n+1) = n(n-1)/2 possibilities.
- Calculate the distance between each pair of sequences based on these isolated pairwise alignments.
- Generate a distance matrix.
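The pairwise step can be sketched as follows. This is a simplified illustration, not ClustalW's own code: it assumes each pair has already been aligned (the rows are the aligned globin fragments from the overview slide, so all strings have equal length) and uses a plain p-distance, the fraction of differing columns, as a stand-in for whatever distance the program computes. The function names are hypothetical.

```python
from itertools import combinations


def p_distance(row_a: str, row_b: str) -> float:
    """Fraction of differing columns, skipping columns that contain a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    columns = [(x, y) for x, y in zip(row_a, row_b) if x != "-" and y != "-"]
    if not columns:
        return 1.0  # nothing comparable: treat as maximally distant
    return sum(x != y for x, y in columns) / len(columns)


def distance_matrix(rows: dict[str, str]) -> dict[tuple[str, str], float]:
    """One distance per unordered pair, so n(n-1)/2 entries in total."""
    return {(a, b): p_distance(rows[a], rows[b])
            for a, b in combinations(rows, 2)}


FRAGMENTS = {
    "Hbb_Human": "PEEKSAVTALWGKVN--VDEVGG",
    "Hbb_Horse": "GEEKAAVLALWDKVN--EEEVGG",
    "Hba_Human": "PADKTNVKAAWGKVGAHAGEYGA",
    "Hba_Horse": "AADKTNVKAAWSKVGGHAGEYGA",
    "Myg_Whale": "EHEWQLVLHVWAKVEADVAGHGQ",
}

if __name__ == "__main__":
    for pair, d in distance_matrix(FRAGMENTS).items():
        print(pair, round(d, 2))
```

Because only short fragments are compared here, the values need not match the distance matrix shown on the overview slide, which was computed from the full sequences.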
15Path Graph for aligning two sequences.
16Possible alignment
- Scoring Scheme
- Match 1
- Mismatch 0
- Indel -1
Scores along this path: 1, 1, 0, 1, 0, -1
Score for this path 2
17Alignment using this path
Alignment using this path:
GATTC-
GAATTC
Column scores: 1, 1, 0, 1, 0, -1
18Optimal Alignment 1
Alignment using this path:
GA-TTC
GAATTC
Column scores: 1, 1, -1, 1, 1, 1
Alignment score 4
19Optimal Alignment 2
Alignment using this path:
G-ATTC
GAATTC
Column scores: 1, -1, 1, 1, 1, 1
Alignment score 4
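The path-graph scoring on the preceding slides can be reproduced with a small dynamic-programming sketch. It uses the scoring scheme given above (match 1, mismatch 0, indel -1) and the standard global-alignment recurrence (Needleman-Wunsch); the function names are hypothetical, and only the optimal score is returned, since more than one path can reach it.

```python
def alignment_score(row_a: str, row_b: str,
                    match: int = 1, mismatch: int = 0, indel: int = -1) -> int:
    """Score an existing alignment column by column."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total += indel
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


def optimal_score(a: str, b: str,
                  match: int = 1, mismatch: int = 0, indel: int = -1) -> int:
    """Best global alignment score, found by filling the path graph."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * indel          # leading gaps in b
    for j in range(1, cols):
        score[0][j] = j * indel          # leading gaps in a
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diagonal,
                              score[i - 1][j] + indel,   # gap in b
                              score[i][j - 1] + indel)   # gap in a
    return score[-1][-1]


if __name__ == "__main__":
    print(alignment_score("GATTC-", "GAATTC"))  # first path shown
    print(alignment_score("GA-TTC", "GAATTC"))  # optimal alignment 1
    print(alignment_score("G-ATTC", "GAATTC"))  # optimal alignment 2
    print(optimal_score("GATTC", "GAATTC"))
```

The two optimal alignments tie because placing the single gap opposite either A of GAATTC costs the same under this scheme.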
20ClustalW- Guide Tree
- Generate a Neighbor-Joining guide tree from these pairwise distances.
- This guide tree gives the order in which the progressive alignment will be carried out.
21Multiple Alignment- First pair
- Align the two most closely-related sequences first.
- This alignment is then fixed and will never change. If a gap is to be introduced subsequently, then it will be introduced in the same place in both sequences, but their relative alignment remains unchanged.
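Picking the first pair can be sketched as a search for the smallest entry in the distance matrix. This is a simplification: in ClustalW the order comes from the guide tree, and the smallest distance is used here only as a stand-in for "most closely related". The values are the ones shown on the overview slide; the function name is hypothetical.

```python
# Pairwise distances copied from the overview slide's distance matrix.
DISTANCES = {
    ("Hbb_Human", "Hbb_Horse"): 0.17,
    ("Hbb_Human", "Hba_Human"): 0.59,
    ("Hbb_Horse", "Hba_Human"): 0.60,
    ("Hbb_Human", "Hba_Horse"): 0.59,
    ("Hbb_Horse", "Hba_Horse"): 0.59,
    ("Hba_Human", "Hba_Horse"): 0.13,
    ("Hbb_Human", "Myg_Whale"): 0.77,
    ("Hbb_Horse", "Myg_Whale"): 0.77,
    ("Hba_Human", "Myg_Whale"): 0.75,
    ("Hba_Horse", "Myg_Whale"): 0.75,
}


def closest_pair(distances: dict[tuple[str, str], float]) -> tuple[str, str]:
    """Return the pair of sequence names with the smallest distance."""
    return min(distances, key=distances.get)


if __name__ == "__main__":
    pair = closest_pair(DISTANCES)
    print(pair, DISTANCES[pair])
```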
22ClustalW- Decision time
- Consult the guide tree to see what alignment is performed next.
- Option 1: Align a third sequence to the first two
- Or
- Option 2: Align two entirely different sequences to each other.
23ClustalW- Alternative 1
If the situation arises where a third sequence is
aligned to the first two, then when a gap has to
be introduced to improve the alignment, the two
entities (the existing pairwise alignment and the
third sequence) are each treated as a single
sequence.
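Treating an existing alignment as a single entity means that any new gap is a whole column: it goes into every row at the same position, so the rows keep their relative alignment. A minimal sketch of that operation, using the first aligned pair from the progressive alignment example later in these slides (the function name and the chosen column are hypothetical):

```python
def insert_gap_column(group: list[str], column: int) -> list[str]:
    """Insert one gap at the same column in every row of an aligned group."""
    if not group:
        return []
    width = len(group[0])
    if any(len(row) != width for row in group):
        raise ValueError("all rows of an alignment must have equal length")
    if not 0 <= column <= width:
        raise ValueError("column is outside the alignment")
    return [row[:column] + "-" + row[column:] for row in group]


if __name__ == "__main__":
    pair = ["gctcgatacgatacgatgactagcta",
            "gctcgatacaagacgatgac-agcta"]
    for row in insert_gap_column(pair, 13):
        print(row)
```

Removing the inserted column gives back the original pair unchanged, which is the sense in which the earlier alignment is fixed.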
24ClustalW- Alternative 2
- If, on the other hand, two separate sequences
have to be aligned together, then the first
pairwise alignment is placed to one side and the
pairwise alignment of the other two is carried
out.
25ClustalW- Progression
- The alignment is progressively built up in this
way, with each step being treated as a pairwise
alignment, sometimes with each member of a pair
having more than one sequence.
26Progressive alignment - step 1
gctcgatacgatacgatgactagcta
gctcgatacaagacgatgacagcta
gctcgatacacgatgactagcta
gctcgatacacgatgacgagcga
ctcgaacgatacgatgactagct

gctcgatacgatacgatgactagcta
gctcgatacaagacgatgac-agcta
27Progressive alignment - step 2
gctcgatacgatacgatgactagcta
gctcgatacaagacgatgacagcta
gctcgatacacgatgactagcta
gctcgatacacgatgacgagcga
ctcgaacgatacgatgactagct

gctcgatacacgatgactagcta
gctcgatacacgatgacgagcga
28Progressive alignment - step 3
gctcgatacgatacgatgactagcta
gctcgatacaagacgatgac-agcta
gctcgatacacgatgactagcta
gctcgatacacgatgacgagcga

gctcgatacgatacgatgactagcta
gctcgatacaagacgatgac-agcta
gctcgatacacga---tgactagcta
gctcgatacacga---tgacgagcga
29Progressive alignment - final step
gctcgatacgatacgatgactagcta
gctcgatacaagacgatgac-agcta
gctcgatacacga---tgactagcta
gctcgatacacga---tgacgagcga
ctcgaacgatacgatgactagct

gctcgatacgatacgatgactagcta
gctcgatacaagacgatgac-agcta
gctcgatacacga---tgactagcta
gctcgatacacga---tgacgagcga
-ctcga-acgatacgatgactagct-
30ClustalW-Good points/Bad points
- Advantages
- Speed.
- Disadvantages
- No objective function.
- No way of quantifying whether or not the alignment is good.
- No way of knowing if the alignment is correct.
31ClustalW-Local Minimum
- Potential problems
- Local minimum problem. If an error is introduced early in the alignment process, it is impossible to correct this later in the procedure.
- Arbitrary alignment.
32Increasing the sophistication of the alignment
process.
- Should we treat all the sequences in the same way, even though some sequences are closely related and some are distant relatives?
- Should we treat all positions in the sequences as though they were the same, even though they might have different functions and different locations in the 3-dimensional structure?
34ClustalW- Caveats
- Sequence weighting
- Varying substitution matrices
- Residue-specific gap penalties and reduced penalties in hydrophilic regions (external regions of protein sequences) encourage gaps in loops rather than in core regions.
- Positions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage openings in subsequent alignments.
35ClustalW- User-supplied values
- Two penalties are set by the user (there are default values, but you should know that it is possible to change these).
- GOP- Gap Opening Penalty is the cost of opening a gap in an alignment.
- GEP- Gap Extension Penalty is the cost of extending this gap.
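How the two penalties combine can be sketched with an affine gap cost. This is an illustration under assumptions, not ClustalW's exact formula: here a gap of length k costs GOP + k * GEP (some programs charge the extension only from the second position onwards), and the numeric values are arbitrary illustrative choices, not claimed to be program defaults.

```python
def gap_cost(length: int, gop: float, gep: float) -> float:
    """Assumed affine model: opening cost once, extension cost per position."""
    if length <= 0:
        return 0.0
    return gop + gep * length


if __name__ == "__main__":
    gop, gep = 10.0, 0.2  # illustrative values only
    print(gap_cost(1, gop, gep))       # a single-position gap
    print(gap_cost(3, gop, gep))       # one gap of three positions
    print(3 * gap_cost(1, gop, gep))   # three separate one-position gaps
```

With an opening penalty much larger than the extension penalty, one long gap is cheaper than several short ones, which is the behaviour the two separate penalties are meant to produce.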
36Position-Specific gap penalties
- Before any pair of (groups of) sequences is aligned, a table of GOPs is generated for each position in the two (sets of) sequences.
- The GOP is manipulated in a position-specific manner, so that it can vary over the sequences.
- If there is a gap at a position, the GOP and GEP penalties are lowered; the other rules do not apply.
- This makes gaps more likely at positions where gaps already exist.
37Discouraging too many gaps
- If there is no gap opened, then the GOP is increased if the position is within 8 residues of an existing gap.
- This discourages gaps that are too close together.
- At any position within a run of hydrophilic residues, the GOP is decreased.
- These runs usually indicate loop regions in protein structures.
- A run of 5 hydrophilic residues is considered to be a hydrophilic stretch.
- The default hydrophilic residues are
- D, E, G, K, N, Q, P, R, S
- But this can be changed by the user.
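Finding the hydrophilic stretches described above can be sketched as a scan for runs of at least 5 residues from the default set. The residue set and run length come from this slide; the function name and the example sequence are hypothetical, and the sketch only reports positions, it does not change any penalty.

```python
HYDROPHILIC = frozenset("DEGKNQPRS")  # default set listed on the slide
MIN_RUN = 5                           # run length listed on the slide


def hydrophilic_positions(sequence: str,
                          residues: frozenset = HYDROPHILIC,
                          min_run: int = MIN_RUN) -> set[int]:
    """0-based positions lying inside a run of >= min_run hydrophilic residues."""
    positions: set[int] = set()
    run_start = None
    # The trailing "*" is a sentinel that closes a run ending at the last residue.
    for index, residue in enumerate(sequence.upper() + "*"):
        if residue in residues:
            if run_start is None:
                run_start = index
        else:
            if run_start is not None and index - run_start >= min_run:
                positions.update(range(run_start, index))
            run_start = None
    return positions


if __name__ == "__main__":
    print(sorted(hydrophilic_positions("AAADEGKNAAA")))
```

Passing a different `residues` set corresponds to the user changing the default list.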
38Divergent Sequences
- The most divergent sequences (most different, on average, from all of the other sequences) are usually the most difficult to align.
- It is sometimes better to delay their alignment until later (when the easier sequences have already been aligned).
- The user has the choice of setting a cutoff (default is 40% identity).
- This will delay the alignment of these sequences until the others have been aligned.
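The delay rule can be sketched as follows, following the description on this slide: a sequence is delayed when its average identity to all other sequences falls below the cutoff. Two things are assumptions made for illustration: percent identity is taken as 100 * (1 - distance), using the distances from the overview slide, and the function names are hypothetical. The program's actual criterion may differ in detail.

```python
NAMES = ["Hbb_Human", "Hbb_Horse", "Hba_Human", "Hba_Horse", "Myg_Whale"]

# Pairwise distances copied from the overview slide's distance matrix.
DISTANCES = {
    ("Hbb_Human", "Hbb_Horse"): 0.17,
    ("Hbb_Human", "Hba_Human"): 0.59,
    ("Hbb_Horse", "Hba_Human"): 0.60,
    ("Hbb_Human", "Hba_Horse"): 0.59,
    ("Hbb_Horse", "Hba_Horse"): 0.59,
    ("Hba_Human", "Hba_Horse"): 0.13,
    ("Hbb_Human", "Myg_Whale"): 0.77,
    ("Hbb_Horse", "Myg_Whale"): 0.77,
    ("Hba_Human", "Myg_Whale"): 0.75,
    ("Hba_Horse", "Myg_Whale"): 0.75,
}


def mean_identity(name: str, names: list[str],
                  distances: dict[tuple[str, str], float]) -> float:
    """Average percent identity of one sequence to all the others."""
    values = []
    for other in names:
        if other == name:
            continue
        d = distances[(name, other)] if (name, other) in distances else distances[(other, name)]
        values.append(100.0 * (1.0 - d))  # assumption: identity = 100 * (1 - distance)
    return sum(values) / len(values)


def delayed_sequences(names: list[str],
                      distances: dict[tuple[str, str], float],
                      cutoff: float = 40.0) -> list[str]:
    """Sequences whose average identity to the rest is below the cutoff."""
    return [n for n in names if mean_identity(n, names, distances) < cutoff]


if __name__ == "__main__":
    print(delayed_sequences(NAMES, DISTANCES))
```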
39T-COFFEE (Tree-based consistency objective function for alignment evaluation)
- Generate a library of all the pairwise alignments between the sequences.
- This gives positional information concerning which residues are homologous to which other residues.
- This can then be used to guide progressive alignments.
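The "positional information" in such a library can be pictured as a list of residue pairs read off each pairwise alignment. The sketch below extracts those pairs from two aligned rows; it illustrates the idea only and is not T-COFFEE's own data structure. The function name is hypothetical, and indices are 0-based positions in the ungapped sequences.

```python
def aligned_residue_pairs(row_a: str, row_b: str, gap: str = "-") -> list[tuple[int, int]]:
    """Pairs (i, j): residue i of sequence A is aligned with residue j of B."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    pairs = []
    i = j = 0  # positions in the ungapped sequences
    for x, y in zip(row_a, row_b):
        if x != gap and y != gap:
            pairs.append((i, j))
        if x != gap:
            i += 1
        if y != gap:
            j += 1
    return pairs


if __name__ == "__main__":
    # The two rows of "Optimal Alignment 1" from the path-graph slides.
    print(aligned_residue_pairs("GA-TTC", "GAATTC"))
```

Collecting these pairs over every pairwise alignment, each with a weight attached, gives the kind of library that can then guide the progressive alignment.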
40An example dataset
- SequenceA GARFIELD THE LAST FAT CAT
- SequenceB GARFIELD THE FAST CAT
- SequenceC GARFIELD THE VERY FAST CAT
- SequenceD THE FAT CAT
Clustal alignment
SequenceA GARFIELD THE LAST FA-T CAT
SequenceB GARFIELD THE FAST CA-T ---
SequenceC GARFIELD THE VERY FAST CAT
SequenceD -------- THE ---- FA-T CAT
41Primary library
- SeqA GARFIELD THE LAST FAT CAT
- SeqB GARFIELD THE FAST CAT --- Weight 88
- SeqA GARFIELD THE LAST FA-T CAT
- SeqC GARFIELD THE VERY FAST CAT Weight 77
- SeqA GARFIELD THE LAST FAT CAT
- SeqD -------- THE ---- FAT CAT Weight 100
- SeqB GARFIELD THE ---- FAST CAT
- SeqC GARFIELD THE VERY FAST CAT Weight 100
- SeqB GARFIELD THE FAST CAT
- SeqD -------- THE FA-T CAT Weight 100
- SeqC GARFIELD THE VERY FAST CAT
- SeqD -------- THE ---- FA-T CAT Weight 100
42Secondary library
- SeqA GARFIELD THE LAST FAT CAT
- SeqB GARFIELD THE FAST CAT Weight 88
- SeqA GARFIELD THE LAST FAT CAT
- SeqC GARFIELD THE VERY FAST CAT
- SeqB GARFIELD THE FAST CAT Weight 77
- SeqA GARFIELD THE LAST FAT CAT
- SeqD THE FAT CAT
- SeqB GARFIELD THE FAST CAT Weight 100
43Extended library
- SeqA GARFIELD THE LAST FAT CAT
- SeqB GARFIELD THE FAST CAT
- SeqA GARFIELD THE LAST FA-T CAT
- SeqB GARFIELD THE ---- FAST CAT
Dynamic programming
44Advice on progressive alignment
- Progressive alignment is a mathematical process that is completely independent of biological reality.
- Can be a very good estimate.
- Can be an impossibly poor estimate.
- Requires user input and skill.
- Treat cautiously.
- Can be improved by eye (usually).
- Often helps to have colour-coding.
- Depending on the use, the user should be able to make a judgement on which regions are reliable and which are not.
- For phylogeny reconstruction, only use those positions whose hypothesis of positional homology is unimpeachable.
45Alignment of protein-coding DNA sequences
- It is not very sensible to align the DNA sequences of protein-coding genes.
Sequences:
ATGCTGTTAGGG
ATGCTCGTAGGG
Aligned as DNA:
ATGCT-GTTAGGG
ATGCTCGT-AGGG
The result might be highly implausible and might
not reflect what is known about biological
processes. It is much more sensible to translate
the sequences to their corresponding amino acid
sequences, align these protein sequences and then
put the gaps in the DNA sequences according to
where they are found in the amino acid alignment.
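The last step, putting the gaps back into the DNA, can be sketched as follows. The sketch assumes the protein alignment has already been made and that each DNA sequence is exactly three times the length of its ungapped protein (no stop codon, no partial codons). The function name and the example rows are hypothetical.

```python
def thread_codons(protein_row: str, dna: str, gap: str = "-") -> str:
    """Place a three-base gap in the DNA wherever the protein row has a gap."""
    residues = protein_row.replace(gap, "")
    if len(dna) != 3 * len(residues):
        raise ValueError("DNA length must be three times the ungapped protein length")
    pieces = []
    position = 0  # index of the next unused base in the DNA
    for symbol in protein_row:
        if symbol == gap:
            pieces.append(gap * 3)
        else:
            pieces.append(dna[position:position + 3])
            position += 3
    return "".join(pieces)


if __name__ == "__main__":
    # Hypothetical protein alignment rows and their coding sequences.
    print(thread_codons("MLLG", "ATGCTGTTAGGG"))
    print(thread_codons("ML-G", "ATGCTGGGG"))
```

Gaps produced this way always come in multiples of three, so the reading frame is preserved, unlike the single-base gaps in the DNA-level alignment shown above.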
46Manual Alignment- software
- GDE- The Genetic Data Environment (UNIX)
- CINEMA- Java applet available from
- http://www.biochem.ucl.ac.uk
- Seqapp/Seqpup- Mac/PC/UNIX available from
- http://iubio.bio.indiana.edu
- SeAl for Macintosh, available from
- http://evolve.zoo.ox.ac.uk/Se-Al/Se-Al.html
- BioEdit for PC, available from
- http://www.mbio.ncsu.edu/RNaseP/info/programs/BIOEDIT/bioedit.html