Multiple Sequence Alignment - PowerPoint PPT Presentation

About This Presentation

Title:

Multiple Sequence Alignment

Slides: 47

Provided by: Martin488

Transcript and Presenter's Notes

Title: Multiple Sequence Alignment

1
Multiple Sequence Alignment
2
Alignment can be easy or difficult
Easy
Difficult due to insertions or deletions
(indels)
3
Homology Definition

Homology: similarity that is the result of
inheritance from a common ancestor. The
identification and analysis of homologies is
central to phylogenetic systematics.
An alignment is a hypothesis of positional
homology between bases/amino acids.

4
Multiple Sequence Alignment- Goals

To generate a concise, information-rich summary
of sequence data.
Sometimes used to illustrate the similarity
between a group of sequences.
Sometimes used to illustrate the dissimilarity
between a group of sequences.
Alignments can be treated as models that can be
used to test hypotheses.

7
Alignment of 16S rRNA can be guided by secondary
structure
Alignment of 16S rRNA sequences from different
bacteria
8
Protein Alignment may be guided by Tertiary
Structure Interactions
Homo sapiens DjlA protein
Escherichia coli DjlA protein
9
Multiple Sequence Alignment- Methods

3 main methods of alignment
Manual
Automatic
Combined

10
Manual Alignment - reasons

Might be carried out because
Alignment is easy.
There is some extraneous information
(structural).
Automated alignment methods have encountered the
local minimum problem.
An automated alignment method can be improved.

11
Dynamic programming

2 methods
Dynamic programming
Consider 2 protein sequences of 100 amino acids
in length.
If it takes 100^2 seconds to exhaustively align
these sequences, then it will take 100^3 seconds
to align 3 sequences, 100^4 seconds to align 4
sequences, and so on.
At that rate, aligning 20 sequences exhaustively
takes 100^20 seconds, or about 3.17 x 10^32 years.
Progressive alignment
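
The growth in cost can be checked with a few lines of arithmetic. This is only a sketch: it assumes, as the slide does, that exhaustively aligning k sequences of length 100 costs 100^k seconds, and it assumes a 365-day year. The function name is illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: 365-day year


def exhaustive_alignment_years(num_sequences, length=100):
    """Years needed if aligning k sequences costs length**k seconds."""
    seconds = length ** num_sequences
    return seconds / SECONDS_PER_YEAR


for k in (2, 3, 4, 20):
    print(k, "sequences:", f"{exhaustive_alignment_years(k):.3e} years")
```

The point is the exponent: every extra sequence multiplies the cost by the sequence length, which is why a heuristic such as progressive alignment is used instead.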

12
Progressive Alignment

Devised by Feng and Doolittle in 1987.
Essentially a heuristic method and as such is not
guaranteed to find the optimal alignment.
Requires (n-1) + (n-2) + (n-3) + ... + (n-n+1) =
n(n-1)/2 pairwise alignments as a starting point.
Most successful implementation is Clustal (Des
Higgins)

13
Overview of ClustalW Procedure
CLUSTAL W
Quick pairwise alignment: calculate distance matrix
Hbb_Human 1 -
Hbb_Horse 2 .17 -
Hba_Human 3 .59 .60 -
Hba_Horse 4 .59 .59 .13 -
Myg_Whale 5 .77 .77 .75 .75 -
Neighbor-joining tree (guide tree)
(Figure: guide tree joining Hbb_Human, Hbb_Horse, Hba_Human, Hba_Horse and Myg_Whale)
Progressive alignment following guide tree
(Figure label: alpha-helices)
1 PEEKSAVTALWGKVN--VDEVGG
2 GEEKAAVLALWDKVN--EEEVGG
3 PADKTNVKAAWGKVGAHAGEYGA
4 AADKTNVKAAWSKVGGHAGEYGA
5 EHEWQLVLHVWAKVEADVAGHGQ
14
ClustalW- Pairwise Alignments

First perform all possible pairwise alignments
between each pair of sequences. There are
(n-1) + (n-2) + ... + (n-n+1) = n(n-1)/2 possibilities.
Calculate the distance between each pair of
sequences based on these isolated pairwise
alignments.
Generate a distance matrix.
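
A rough sketch of this step follows. The pair count is standard combinatorics; the distance measure (fraction of differing columns, ignoring columns with a gap) is an assumption made for illustration and is not stated in the slides. Both function names are hypothetical.

```python
from itertools import combinations


def pair_count(n):
    """Number of distinct pairs among n sequences: n(n-1)/2."""
    return n * (n - 1) // 2


def p_distance(row_a, row_b):
    """Fraction of gap-free columns that differ in a pairwise alignment.

    Assumed distance measure, chosen only for illustration.
    """
    columns = [(x, y) for x, y in zip(row_a, row_b) if x != "-" and y != "-"]
    if not columns:
        return 1.0
    mismatches = sum(1 for x, y in columns if x != y)
    return mismatches / len(columns)


names = ["Hbb_Human", "Hbb_Horse", "Hba_Human", "Hba_Horse", "Myg_Whale"]
pairs = list(combinations(names, 2))
print(len(pairs), "pairwise alignments for", len(names), "sequences")
print(p_distance("GATTC-", "GAATTC"))
```

Each pair returned by `combinations` would be aligned on its own, and the resulting distances fill one cell each of the lower-triangular matrix.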

15
Path Graph for aligning two sequences.
16
Possible alignment

Scoring Scheme
Match 1
Mismatch 0
Indel -1

Column scores along this path: 1, 1, 0, 1, 0, -1
Score for this path: 2
17
Alignment using this path
GATTC-
GAATTC
Column scores: 1, 1, 0, 1, 0, -1
18
Optimal Alignment 1
Alignment using this path
GA-TTC
GAATTC
Column scores: 1, 1, -1, 1, 1, 1
Alignment score: 4
19
Optimal Alignment 2
Alignment using this path
G-ATTC
GAATTC
Column scores: 1, -1, 1, 1, 1, 1
Alignment score: 4
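
The path-graph slides can be reproduced with a small dynamic programming sketch. It uses the scoring scheme from the slides (match 1, mismatch 0, indel -1) and the sequences GATTC and GAATTC. The function names are illustrative, and the code returns only the best score, not the alignments themselves.

```python
def score_alignment(row_a, row_b, match=1, mismatch=0, indel=-1):
    """Score an existing pairwise alignment column by column."""
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total += indel
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


def best_global_score(a, b, match=1, mismatch=0, indel=-1):
    """Best global alignment score via a dynamic programming table."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = i * indel  # leading gaps in b
    for j in range(1, cols):
        table[0][j] = j * indel  # leading gaps in a
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            table[i][j] = max(
                table[i - 1][j - 1] + pair,  # align two residues
                table[i - 1][j] + indel,     # gap in b
                table[i][j - 1] + indel,     # gap in a
            )
    return table[-1][-1]


print(score_alignment("GATTC-", "GAATTC"))
print(score_alignment("GA-TTC", "GAATTC"))
print(best_global_score("GATTC", "GAATTC"))
```

Both optimal alignments shown on the slides reach the same best score, which is why the table alone cannot choose between them.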
20
ClustalW- Guide Tree

Generate a Neighbor-Joining guide tree from
these pairwise distances.
This guide tree gives the order in which the
progressive alignment will be carried out.
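
The sketch below shows how a distance matrix turns into an alignment order. Note the swap: it uses simple average-linkage clustering as a stand-in, not the Neighbor-Joining method that ClustalW uses, because it is shorter to show. The distances are the ones from the ClustalW overview slide; the function names are hypothetical.

```python
from itertools import combinations

NAMES = ["Hbb_Human", "Hbb_Horse", "Hba_Human", "Hba_Horse", "Myg_Whale"]
ROWS = [[], [0.17], [0.59, 0.60], [0.59, 0.59, 0.13], [0.77, 0.77, 0.75, 0.75]]

# Lower-triangular matrix from the slide, keyed by unordered name pairs.
DIST = {
    frozenset((NAMES[i], NAMES[j])): value
    for i, row in enumerate(ROWS)
    for j, value in enumerate(row)
}


def average_distance(pair):
    """Mean distance between all members of two clusters."""
    left, right = pair
    total = sum(DIST[frozenset((a, b))] for a in left for b in right)
    return total / (len(left) * len(right))


def merge_order(names):
    """Order in which clusters are joined (average linkage, not NJ)."""
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        left, right = min(combinations(clusters, 2), key=average_distance)
        merges.append((left, right))
        clusters = [c for c in clusters if c not in (left, right)]
        clusters.append(left + right)
    return merges


for step, (left, right) in enumerate(merge_order(NAMES), start=1):
    print(step, left, "+", right)
```

With these distances the two alpha globins are joined first, then the two beta globins, then the two pairs, and myoglobin last.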

21
Multiple Alignment- First pair

Align the two most closely-related sequences
first.
This alignment is then fixed and will never
change. If a gap is to be introduced
subsequently, then it will be introduced in the
same place in both sequences, but their relative
alignment remains unchanged.
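
A minimal sketch of this rule: a gap added to an already aligned group goes into the same column of every row, so the rows keep their alignment relative to each other. The function name is illustrative; the two sequences are taken from the progressive alignment example in this presentation.

```python
def insert_gap_column(aligned_rows, position, length=1):
    """Insert the same gap at the same column in every row of a group."""
    gap = "-" * length
    return [row[:position] + gap + row[position:] for row in aligned_rows]


pair = [
    "gctcgatacacgatgactagcta",
    "gctcgatacacgatgacgagcga",
]
for row in insert_gap_column(pair, 13, length=3):
    print(row)
```

Because every row receives the identical gap, any error in the original pairing is carried forward unchanged.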

22
ClustalW- Decision time

Consult the guide tree to see what alignment is
performed next.
Align a third sequence to the first two
Or
Align two entirely different sequences to each
other.

Option 1
Option 2
23
ClustalW- Alternative 1
If the situation arises where a third sequence is
aligned to the first two, then when a gap has to
be introduced to improve the alignment, each of
these two entities (the existing pair and the new
sequence) is treated as a single sequence.

24
ClustalW- Alternative 2

If, on the other hand, two separate sequences
have to be aligned together, then the first
pairwise alignment is placed to one side and the
pairwise alignment of the other two is carried
out.

25
ClustalW- Progression

The alignment is progressively built up in this
way, with each step being treated as a pairwise
alignment, sometimes with each member of a pair
having more than one sequence.

26
Progressive alignment - step 1
gctcgatacgatacgatgactagcta
gctcgatacaagacgatgacagcta
gctcgatacacgatgactagcta
gctcgatacacgatgacgagcga
ctcgaacgatacgatgactagct

gctcgatacgatacgatgactagcta
gctcgatacaagacgatgac-agcta
27
Progressive alignment - step 2
gctcgatacgatacgatgactagcta
gctcgatacaagacgatgacagcta
gctcgatacacgatgactagcta
gctcgatacacgatgacgagcga
ctcgaacgatacgatgactagct

gctcgatacacgatgactagcta
gctcgatacacgatgacgagcga
28
Progressive alignment - step 3
gctcgatacgatacgatgactagcta
gctcgatacaagacgatgac-agcta
gctcgatacacgatgactagcta
gctcgatacacgatgacgagcga

gctcgatacgatacgatgactagcta
gctcgatacaagacgatgac-agcta
gctcgatacacga---tgactagcta
gctcgatacacga---tgacgagcga
29
Progressive alignment - final step
gctcgatacgatacgatgactagcta
gctcgatacaagacgatgac-agcta
gctcgatacacga---tgactagcta
gctcgatacacga---tgacgagcga
ctcgaacgatacgatgactagct

gctcgatacgatacgatgactagcta
gctcgatacaagacgatgac-agcta
gctcgatacacga---tgactagcta
gctcgatacacga---tgacgagcga
-ctcga-acgatacgatgactagct-
30
ClustalW-Good points/Bad points

Advantages
Speed.
Disadvantages
No objective function.
No way of quantifying whether or not the
alignment is good
No way of knowing if the alignment is correct.

31
ClustalW-Local Minimum

Potential problems
Local minimum problem. If an error is introduced
early in the alignment process, it is impossible
to correct this later in the procedure.
Arbitrary alignment.

32
Increasing the sophistication of the alignment
process.

Should we treat all the sequences in the same
way? - even though some sequences are
closely-related and some sequences are distant
relatives.
Should we treat all positions in the sequences as
though they were the same? - even though they
might have different functions and different
locations in the 3-dimensional structure.

34
ClustalW- Caveats

Sequence weighting
Varying substitution matrices
Residue-specific gap penalties and reduced
penalties in hydrophilic regions (external
regions of protein sequences), encourage gaps in
loops rather than in core regions.
Positions in early alignments where gaps have
been opened receive locally reduced gap penalties
to encourage openings in subsequent alignments

35
ClustalW- User-supplied values

Two penalties are set by the user (there are
default values, but you should know that it is
possible to change these).
GOP- Gap Opening Penalty is the cost of opening a
gap in an alignment.
GEP- Gap Extension Penalty is the cost of
extending this gap.
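
To see why the two penalties are kept separate, here is a small sketch. It assumes the common affine form, cost = GOP + GEP x gap length; the slides do not give the formula, and some programs charge GEP only for positions after the first. The numeric values are illustrative and are not claimed to be the ClustalW defaults.

```python
def gap_cost(length, gop=10.0, gep=0.5):
    """Affine gap cost: one opening charge plus a per-position charge.

    Assumed form and illustrative values, not ClustalW's actual defaults.
    """
    if length <= 0:
        return 0.0
    return gop + gep * length


one_long_gap = gap_cost(3)
three_short_gaps = 3 * gap_cost(1)
print(one_long_gap, three_short_gaps)
```

With a high GOP and a low GEP, a single three-residue gap is much cheaper than three separate one-residue gaps, which matches the idea that one indel event is more plausible than several.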

36
Position-Specific gap penalties

Before any pair of (groups of) sequences is
aligned, a table of GOPs is generated for each
position in the two (sets of) sequences.
The GOP is manipulated in a position-specific
manner, so that it can vary over the sequences.
If there is a gap at a position, the GOP and GEP
penalties are lowered and the other rules do not
apply.
This makes gaps more likely at positions where
gaps already exist.

37
Discouraging too many gaps

If there is no gap opened, then the GOP is
increased if the position is within 8 residues of
an existing gap.
This discourages gaps that are too close
together.
At any position within a run of hydrophilic
residues, the GOP is decreased.
These runs usually indicate loop regions in
protein structures.
A run of 5 hydrophilic residues is considered to
be a hydrophilic stretch.
The default hydrophilic residues are
D, E, G, K, N, Q, P, R, S
But this can be changed by the user.
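
The hydrophilic-stretch rule can be sketched as follows. The residue set and the run length of 5 come from the slide; the base GOP and the reduction factor are hypothetical numbers chosen for illustration, since the slides do not state by how much the penalty is lowered. Function names are also illustrative.

```python
HYDROPHILIC = set("DEGKNQPRS")  # default residues listed on the slide


def hydrophilic_positions(seq, min_run=5):
    """Indices that lie inside a run of at least min_run hydrophilic residues."""
    positions = set()
    run_start = None
    for i, residue in enumerate(seq + "#"):  # "#" is a sentinel ending any final run
        if residue in HYDROPHILIC:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_run:
                positions.update(range(run_start, i))
            run_start = None
    return positions


def gop_table(seq, base_gop=10.0, hydrophilic_factor=0.5):
    """Per-position GOP, lowered inside hydrophilic stretches.

    base_gop and hydrophilic_factor are hypothetical values.
    """
    lowered = hydrophilic_positions(seq)
    return [
        base_gop * hydrophilic_factor if i in lowered else base_gop
        for i in range(len(seq))
    ]


print(sorted(hydrophilic_positions("AVLDEGKNQWF")))
print(gop_table("AVLDEGKNQWF"))
```

The other adjustments described above (lowering penalties at existing gaps, raising the GOP within 8 residues of a gap) would modify the same table and are left out here.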

38
Divergent Sequences

The most divergent sequences (most different, on
average from all of the other sequences) are
usually the most difficult to align.
It is sometimes better to delay their alignment
until later (when the easier sequences have
already been aligned).
The user has the choice of setting a cutoff
(default is 40% identity).
This will delay the alignment until the others
have been aligned.
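
A sketch of the delay rule, under two loudly stated assumptions: first, that a sequence is delayed when its mean percent identity to all other sequences falls below the cutoff (this follows the wording of the slide; the exact ClustalW criterion may differ); second, that the distances on the ClustalW overview slide equal 1 minus fractional identity. Function names are hypothetical.

```python
# Distances copied from the ClustalW overview slide.
DISTANCES = {
    ("Hbb_Human", "Hbb_Horse"): 0.17,
    ("Hbb_Human", "Hba_Human"): 0.59,
    ("Hbb_Horse", "Hba_Human"): 0.60,
    ("Hbb_Human", "Hba_Horse"): 0.59,
    ("Hbb_Horse", "Hba_Horse"): 0.59,
    ("Hba_Human", "Hba_Horse"): 0.13,
    ("Hbb_Human", "Myg_Whale"): 0.77,
    ("Hbb_Horse", "Myg_Whale"): 0.77,
    ("Hba_Human", "Myg_Whale"): 0.75,
    ("Hba_Horse", "Myg_Whale"): 0.75,
}
NAMES = ["Hbb_Human", "Hbb_Horse", "Hba_Human", "Hba_Horse", "Myg_Whale"]


def mean_identity(name, distances):
    """Mean percent identity of one sequence to all others (assumed 1 - distance)."""
    values = [100.0 * (1.0 - d) for pair, d in distances.items() if name in pair]
    return sum(values) / len(values)


def split_by_divergence(names, distances, cutoff=40.0):
    """Return (sequences aligned first, sequences delayed until later)."""
    early = [n for n in names if mean_identity(n, distances) >= cutoff]
    delayed = [n for n in names if mean_identity(n, distances) < cutoff]
    return early, delayed


early, delayed = split_by_divergence(NAMES, DISTANCES)
print("align first:", early)
print("delayed:", delayed)
```

Under these assumptions only the myoglobin sequence falls below the 40% cutoff and is held back.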

39
T-COFFEE (Tree-based consistency objective
function for alignment evaluation)

Generate a library of all the pairwise alignments
between the sequences.
This gives positional information concerning
which residues are homologous to which other
residues.
This can then be used to guide progressive
alignments.

40
An example dataset

SequenceA GARFIELD THE LAST FAT CAT
SequenceB GARFIELD THE FAST CAT
SequenceC GARFIELD THE VERY FAST CAT
SequenceD THE FAT CAT

Clustal alignment
Sequence A GARFIELD THE LAST FA-T CAT
Sequence B GARFIELD THE FAST CA-T ---
Sequence C GARFIELD THE VERY FAST CAT
Sequence D -------- THE ---- FA-T CAT
41
Primary library

SeqA GARFIELD THE LAST FAT CAT
SeqB GARFIELD THE FAST CAT --- Weight 88

SeqA GARFIELD THE LAST FA-T CAT
SeqC GARFIELD THE VERY FAST CAT Weight 77

SeqA GARFIELD THE LAST FAT CAT
SeqD -------- THE ---- FAT CAT Weight 100

SeqB GARFIELD THE ---- FAST CAT
SeqC GARFIELD THE VERY FAST CAT Weight 100

SeqB GARFIELD THE FAST CAT
SeqD -------- THE FA-T CAT Weight 100

SeqC GARFIELD THE VERY FAST CAT
SeqD -------- THE ---- FA-T CAT Weight 100
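
One way to arrive at such weights is sketched below. It assumes the weight is the percent identity over columns in which both sequences have a letter, truncated to an integer. This is an assumption, not the presentation's stated formula: it reproduces the 88 for SeqA/SeqB and the 100 for the pairs involving SeqD, but it gives 80 rather than 77 for SeqA/SeqC, so the original presumably counts columns differently there. The function name is hypothetical.

```python
def primary_weight(row_a, row_b):
    """Percent identity over columns where both rows hold a letter, truncated.

    Assumed definition; spaces and gap characters are skipped.
    """
    columns = [(x, y) for x, y in zip(row_a, row_b) if x.isalpha() and y.isalpha()]
    matches = sum(1 for x, y in columns if x == y)
    return 100 * matches // len(columns)


print(primary_weight("GARFIELD THE LAST FAT CAT",
                     "GARFIELD THE FAST CAT ---"))
print(primary_weight("GARFIELD THE VERY FAST CAT",
                     "-------- THE ---- FA-T CAT"))
```

Each aligned residue pair in a pairwise alignment inherits that alignment's weight, which is how the library records how much each pairing should be trusted.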

42
Secondary library

SeqA GARFIELD THE LAST FAT CAT
SeqB GARFIELD THE FAST CAT Weight 88
SeqA GARFIELD THE LAST FAT CAT
SeqC GARFIELD THE VERY FAST CAT
SeqB GARFIELD THE FAST CAT Weight 77
SeqA GARFIELD THE LAST FAT CAT
SeqD THE FAT CAT
SeqB GARFIELD THE FAST CAT Weight 100

43
Extended library

SeqA GARFIELD THE LAST FAT CAT
SeqB GARFIELD THE FAST CAT
SeqA GARFIELD THE LAST FA-T CAT
SeqB GARFIELD THE ---- FAST CAT

Dynamic programming
44
Advice on progressive alignment

Progressive alignment is a mathematical process
that is completely independent of biological
reality.
Can be a very good estimate
Can be an impossibly poor estimate.
Requires user input and skill.
Treat cautiously
Can be improved by eye (usually)
Often helps to have colour-coding.
Depending on the use, the user should be able to
make a judgement on those regions that are
reliable or not.
For phylogeny reconstruction, only use those
positions whose hypothesis of positional homology
is unimpeachable

45
Alignment of protein-coding DNA sequences

It is not very sensible to align the DNA
sequences of protein-coding genes.

Unaligned:
ATGCTGTTAGGG
ATGCTCGTAGGG
Aligned as DNA:
ATGCT-GTTAGGG
ATGCTCGT-AGGG
The result might be highly implausible and might
not reflect what is known about biological
processes. It is much more sensible to translate
the sequences to their corresponding amino acid
sequences, align these protein sequences and then
put the gaps in the DNA sequences according to
where they are found in the amino acid alignment.
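
The last step, copying gaps from the protein alignment back onto the DNA, can be sketched like this. The code assumes the protein alignment already exists and that each DNA sequence is an in-frame coding sequence with no stop codon included; it does not translate or align anything itself. The function name and the example input are hypothetical.

```python
def thread_gaps(aligned_protein, dna):
    """Copy gaps from an aligned protein row onto its coding DNA, codon by codon."""
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    residues = sum(1 for c in aligned_protein if c != "-")
    if len(dna) % 3 != 0 or residues != len(codons):
        raise ValueError("protein and DNA lengths disagree")
    pieces = []
    next_codon = 0
    for c in aligned_protein:
        if c == "-":
            pieces.append("---")  # one protein gap becomes a three-base gap
        else:
            pieces.append(codons[next_codon])
            next_codon += 1
    return "".join(pieces)


# Hypothetical example: protein row "ML-G" encoded by ATG CTG GGG.
print(thread_gaps("ML-G", "ATGCTGGGG"))
```

Gaps placed this way always come in multiples of three and never split a codon, so the reading frame is preserved.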
46
Manual Alignment- software