Multiple Sequence Alignment

Slides: 60

Provided by: stua70

1
Multiple Sequence Alignment
2
(No Transcript)
3
Terminology

Motif: the biological object one attempts to
model - a functional or structural domain, active
site, phosphorylation site, etc.
Pattern: a qualitative motif description based on
a regular expression-like syntax
Profile: a quantitative motif description that
assigns a degree of similarity to a potential
match
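
The pattern/profile distinction can be sketched in a few lines. This is a minimal illustration, not taken from the slides: the pattern is the PROSITE N-glycosylation site N-{P}-[ST]-{P} written as a regular expression, and the profile scores in TOY_PROFILE are made-up numbers chosen only to show that a profile returns a degree of similarity rather than a yes/no answer.

```python
import re

# Pattern (qualitative): N, then anything but P, then S or T, then anything but P.
# The lookahead lets overlapping matches be reported.
PATTERN = re.compile(r"(?=N[^P][ST][^P])")

def pattern_hits(seq):
    """Return the 0-based start positions where the pattern matches."""
    return [m.start() for m in PATTERN.finditer(seq)]

# Profile (quantitative): one score table per motif position.
# These values are hypothetical and only serve the illustration.
TOY_PROFILE = [
    {"N": 2.0},
    {"P": -2.0},
    {"S": 1.0, "T": 1.0},
    {"P": -2.0},
]

def profile_score(window, profile=TOY_PROFILE, default=0.0):
    """Sum the per-position scores of a window as long as the profile."""
    return sum(col.get(res, default) for res, col in zip(window, profile))

print(pattern_hits("AANGSAKNPTA"))
print(profile_score("NGSA"), profile_score("NPSA"))
```

The pattern either matches or does not; the profile ranks NGSA above NPSA without rejecting either outright.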

4
Global Alignment

Global algorithms are often not effective for
highly diverged sequences and do not reflect the
biological reality that two sequences may only
share limited regions of conserved sequence.
Sometimes two sequences may be derived from
ancient recombination events where only a single
functional domain is shared.

5
What is Multiple Sequence Alignment (MSA)?

Multiple sequence alignment (MSA) can be seen as
a generalization of pairwise sequence alignment -
instead of aligning two sequences, n sequences
are aligned simultaneously, where n > 2.
Definition: a multiple sequence alignment is an
alignment of n > 2 sequences obtained by
inserting gaps (-) into the sequences such that
the resulting sequences all have length L and can
be arranged in a matrix of n rows and L columns,
where each column represents a homologous
position (each column corresponds to a specific
residue in the 'prototypical' protein).
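
The definition can be checked mechanically. The sketch below assumes the alignment is given as a list of gapped strings; the function name msa_columns and the toy rows are illustrative, not from the slides.

```python
def msa_columns(rows):
    """Check the n x L matrix property and return the L columns."""
    if len(rows) < 3:
        raise ValueError("a multiple alignment needs n > 2 sequences")
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all gapped sequences must have the same length L")
    # zip(*rows) transposes the matrix: one tuple per homologous position.
    return ["".join(col) for col in zip(*rows)]

rows = ["AC-GT", "ACGGT", "A--GT"]
columns = msa_columns(rows)
print(columns)
```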

6
Multiple Sequence Alignment

MSA applies both to nucleotide and amino acid
sequences
To construct a multiple alignment, one may have
to introduce gaps in sequences at positions where
there were no gaps in the corresponding pairwise
alignment.
This means that multiple alignments typically
contain more gaps than any given pair of aligned
sequences
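
One way to see why an MSA carries extra gaps: take two rows out of a multiple alignment and drop the columns where both have a gap. What remains is the pairwise alignment that the MSA induces, and it has fewer gaps than the rows had inside the MSA. The helper name induced_pair and the toy rows are assumptions for illustration.

```python
def induced_pair(row_a, row_b, gap="-"):
    """Project two MSA rows to a pairwise alignment by removing gap-only columns."""
    kept = [(a, b) for a, b in zip(row_a, row_b) if not (a == gap and b == gap)]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)

msa = ["AC-GT", "A--GT", "ACGGT"]
pair = induced_pair(msa[0], msa[1])
print(pair)
```

The third column is a gap in both of the first two rows only because the third sequence needs it.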

7
(No Transcript)
8
How to optimize alignment algorithms?

Use structural information
reading frame
protein structure
Sequence elements are not truly independent but
related by phylogenetic descent
Sequences often contain highly conserved regions

9
Optimize alignment algorithms
10
Pairwise Alignment

The alignment of two sequences (DNA or protein)
is a relatively straightforward computational
problem.

11
The big-O notation

One of the most important properties of an
algorithm is how its execution time increases as
the problem is made larger. By a larger problem,
we mean more sequences to align, or longer
sequences to align.
This is the so-called algorithmic (or
computational) complexity of the algorithm
There is a notation to describe the algorithmic
complexity, called the big-O notation.
If we have a problem size (number of input data
points) n, then an algorithm takes O(n) time if
the time increases linearly with n.
If the algorithm needs time proportional to the
square of n, then it is O(n^2).

12
The big-O notation

It is important to realize that an algorithm that
is quick on small problems may be totally useless
on large problems if it has a bad O() behavior.
As a rule of thumb one can use the following
characterizations, where n is the size of the
problem, and c is a constant
O(c): utopian
O(log n): excellent
O(n): very good
O(n^2): not so good
O(n^3): pretty bad
O(c^n): disaster

13
The big-O notation

To compute an N-wise alignment, the algorithmic
complexity is something like O(c^(2n)), where c
is a constant, and n is the number of sequences.
This is a big-O disaster!

14
(No Transcript)
15
The best solution is Dynamic Programming.
16
Multiple Sequence Alignment

In pairwise alignments, you have a
two-dimensional matrix with the sequences on each
axis.
The number of operations required to locate the
best path through the matrix is approximately
proportional to the product of the lengths of the
two sequences.
A possible general method would be to extend the
pairwise alignment method into a simultaneous
N-wise alignment, using a complete
dynamic-programming algorithm in N dimensions.
Algorithmically, this is not difficult to do.
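
The catch is the cost, not the algorithm. A rough count, assuming a full N-dimensional matrix with one axis per sequence: the matrix has (L1+1) x (L2+1) x ... cells, and each cell can be reached from 2^N - 1 neighbouring cells (every non-empty choice of sequences that advance). The function name full_dp_cost is illustrative.

```python
def full_dp_cost(seq_lengths):
    """Return (number of cells, predecessor moves per cell) for full N-dimensional DP."""
    cells = 1
    for length in seq_lengths:
        cells *= length + 1          # one extra row/column for the empty prefix
    moves_per_cell = 2 ** len(seq_lengths) - 1
    return cells, moves_per_cell

for n in (2, 3, 5, 10):
    cells, moves = full_dp_cost([100] * n)
    print(n, cells, moves)
```

Two sequences of length 100 need about ten thousand cells; ten such sequences need more than 10^20.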

17
Dynamic Programming

Dynamic Programming is a very general programming
technique.
It is applicable when a large search space can be
structured into a succession of stages, such
that:
the initial stage contains trivial solutions to
sub-problems
each partial solution in a later stage can be
calculated by recurring on a fixed number of
partial solutions in an earlier stage
the final stage contains the overall solution
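
For the pairwise case the three stages map directly onto a score matrix. The sketch below computes a global (Needleman-Wunsch style) alignment score with a linear gap penalty; the match, mismatch and gap values are arbitrary defaults chosen for illustration, not values from the slides.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of strings a and b by dynamic programming."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Initial stage: aligning a prefix against nothing costs only gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Later stages: each cell depends on exactly three earlier cells.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap       # gap in b
            left = score[i][j - 1] + gap     # gap in a
            score[i][j] = max(diag, up, left)

    # Final stage: the bottom-right cell holds the overall solution.
    return score[-1][-1]

print(global_alignment_score("ACGT", "AGT"))
```

The work is proportional to len(a) x len(b), the product of the two sequence lengths.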

18
Multiple Alignments

In theory, making an optimal alignment between
two sequences is computationally straightforward
(Smith-Waterman algorithm), but aligning a large
number of sequences using the same method is
almost impossible.
The size of the problem increases exponentially
with the number of sequences involved
(it grows as the product of the sequence lengths).

19
(No Transcript)
20
Optimal Alignment

For a given group of sequences, there is no
single "correct" alignment, only an alignment
that is "optimal" according to some set of
calculations.
Determining what alignment is best for a given
set of sequences is really up to the judgement of
the investigator.

21
Why do we do multiple alignments?

To characterize protein families: identify
shared regions of homology in a multiple
sequence alignment (this generally happens when
a sequence search has revealed homologies to
several sequences).
Determination of the consensus sequence of
several aligned sequences.
Consensus sequences can help to develop a
sequence fingerprint, which allows the
identification of members of a distantly related
protein family (motifs).
MSA can help us to reveal biological facts about
proteins (e.g. analysis of the secondary/tertiary
structure).
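
A consensus can be read off an alignment column by column. This is a minimal sketch using a simple majority rule that ignores gaps; real programs usually apply thresholds or residue groups, and the toy rows here are illustrative.

```python
from collections import Counter

def consensus(rows, gap="-"):
    """Most frequent non-gap residue in each column (gap if the column is all gaps)."""
    out = []
    for col in zip(*rows):
        counts = Counter(res for res in col if res != gap)
        out.append(counts.most_common(1)[0][0] if counts else gap)
    return "".join(out)

rows = ["VTISCTG", "VTISCSG", "LRLSCSG"]
print(consensus(rows))
```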

22
(No Transcript)
23
Why do we do multiple alignments?

Crucial for genome sequencing
Random fragments of a large molecule are
sequenced and those that overlap are found by a
multiple sequence alignment program.
There should be one correct alignment that
corresponds to the genomic sequence rather than
a range of possibilities
Sequence may be from one strand of DNA or the
other, so complements of each sequence must also
be compared
Sequence fragments will usually overlap, but by
an unknown amount and in some cases, one sequence
may be included within another
All of the overlapping pairs of sequence
fragments must be assembled into a large
composite genome sequence
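
Two of these points, comparing complements and finding overlaps of unknown length, can be sketched as follows. This is a toy that only accepts exact suffix/prefix matches; real assemblers tolerate sequencing errors and also handle fragments contained in other fragments. All function names and the min_len default are assumptions for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """The same fragment as read from the other DNA strand."""
    return seq.translate(COMPLEMENT)[::-1]

def overlap(left, right, min_len=3):
    """Length of the longest suffix of left that equals a prefix of right."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0

def best_overlap(left, right, min_len=3):
    """Try right as given and as its reverse complement; report the better one."""
    forward = overlap(left, right, min_len)
    reverse = overlap(left, reverse_complement(right), min_len)
    return ("forward", forward) if forward >= reverse else ("reverse", reverse)

print(best_overlap("GGATTACA", "TACAGG"))
print(best_overlap("GGATTACA", reverse_complement("TACAGG")))
```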

24
An example of Multiple Alignment
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWWSNG--
25
Three Types of Algorithms

Progressive: ClustalW
Iterative: MUSCLE
Consistency-based: T-Coffee and ProbCons

26
Progressive Multiple Alignment

The most practical and widely used method in
multiple sequence alignment is the hierarchical
extension of pairwise alignment methods.
The principle is that a multiple alignment is
achieved by successive application of pairwise
methods.

27
Choosing sequences for alignment

The more sequences to align the better.
Don't include very similar (> 80%) sequences.
Sub-groups should be pre-aligned separately, and
one member of each subgroup should be included in
the final multiple alignment.

28
Progressive Pairwise Methods

Most of the available multiple alignment programs
use some sort of incremental or progressive
method that makes pairwise alignments, then adds
new sequences one at a time to these aligned
groups.
This is an approximate or heuristic method!

29
(No Transcript)
30
Multiple Alignment Method

Compare all sequences pairwise.
Perform cluster analysis on the pairwise data to
generate a hierarchy for alignment. This may be
in the form of a binary tree or a simple ordering.
Build the multiple alignment by first aligning
the most similar pair of sequences, then the next
most similar pair, and so on. Once an alignment of
two sequences has been made, it is fixed.
Thus, for a set of sequences A, B, C, D, having
aligned A with C and B with D, the alignment of A,
B, C, D is obtained by comparing the alignment
of A and C with that of B and D, using averaged
scores at each aligned position.
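
The clustering step can be sketched as a greedy merge on the pairwise data. This is a simplified average-linkage scheme, not the exact procedure of any particular program, and the similarity values below are invented so that the result reproduces the A/C and B/D example.

```python
from itertools import combinations

def join_order(names, similarity):
    """Greedy average-linkage clustering; returns the merges in the order made."""
    clusters = [(name,) for name in names]

    def avg(c1, c2):
        # Average similarity over all cross-cluster sequence pairs.
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(similarity[frozenset(p)] for p in pairs) / len(pairs)

    merges = []
    while len(clusters) > 1:
        c1, c2 = max(combinations(clusters, 2), key=lambda pair: avg(*pair))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        merges.append((c1, c2))
    return merges

# Hypothetical pairwise similarities for four sequences.
similarity = {
    frozenset("AC"): 0.9, frozenset("BD"): 0.8,
    frozenset("AB"): 0.3, frozenset("AD"): 0.2,
    frozenset("BC"): 0.4, frozenset("CD"): 0.3,
}
order = join_order("ABCD", similarity)
print(order)
```

The merge list is the hierarchy: align A with C, then B with D, then the two aligned pairs with each other.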

31
Gap Penalties

In the MSA scoring scheme, a penalty is
subtracted for each gap introduced into an
alignment because each gap adds uncertainty
to the alignment.
The gap penalty is used to help decide whether or
not to accept a gap or insertion in an alignment.
Biologically, it should in general be easier for
a sequence to accept a different residue in a
position, rather than having parts of the
sequence chopped away or inserted.
Gaps/insertions should therefore be rarer
than point mutations (substitutions).
In general, the lower the gapping penalties, the
more gaps and more identities are detected but
this should be considered in relation to
biological significance
Most MSA programs allow for an adjustment of gap
penalties
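
A common way to express adjustable gap penalties is an affine scheme: one cost for opening a gap and a smaller cost for each further position. The sketch below uses the convention open + extend x (length - 1); some programs charge the extension cost for every position instead, and the default values here are placeholders, not the defaults of any named program.

```python
def gap_penalty(length, gap_open=10.0, gap_extend=0.5):
    """Affine penalty for one gap of the given length."""
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * (length - 1)

def gap_runs(row, gap="-"):
    """Lengths of the consecutive gap runs in one aligned row."""
    runs, current = [], 0
    for ch in row:
        if ch == gap:
            current += 1
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs

def total_gap_penalty(row, gap_open=10.0, gap_extend=0.5):
    """Sum of the affine penalties over all gap runs in a row."""
    return sum(gap_penalty(n, gap_open, gap_extend) for n in gap_runs(row))

print(gap_runs("AC--GT-A"), total_gap_penalty("AC--GT-A"))
```

Lowering gap_open makes many short gaps cheap, which is the "more gaps, more identities" effect described above.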

32
The PILEUP Algorithm

First, PILEUP calculates approximate pairwise
similarity scores between all sequences to be
aligned, and they are clustered into a dendrogram
(tree structure).
Then the most similar pairs of sequences are
aligned.
Averages (similar to consensus sequences) are
calculated for the aligned pairs.
New sequences and clusters of sequences are added
one by one, according to the branching order in
the dendrogram.
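
The slides do not give the formula PILEUP uses for its averages, so the following is only a sketch of the general idea: when two already-aligned groups are compared, a column from one group is scored against a column from the other by averaging over all residue pairs. The match/mismatch values and the choice to skip gap characters are assumptions.

```python
def column_score(col_a, col_b, match=1.0, mismatch=-1.0, gap="-"):
    """Average pairwise score between the residues of two alignment columns."""
    total, pairs = 0.0, 0
    for a in col_a:
        for b in col_b:
            if a == gap or b == gap:
                continue              # gaps are ignored in this simplified version
            total += match if a == b else mismatch
            pairs += 1
    return total / pairs if pairs else 0.0

print(column_score("AA", "AG"))
print(column_score("AA", "AA"))
```

With such a column score, the pairwise dynamic programming routine can align two groups exactly as it aligns two sequences.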

33
Choosing sequences for MSA

As far as possible, try to align sequences of
similar length.
Pileup can align sequences of up to 5000
residues, with 2000 gaps (total 7000 characters).
Pileup is a good program only for similar (close)
sequences.

34
PileUp considerations

PileUp does global multiple alignment, and
therefore is good for a group of similar
sequences.
PileUp will fail to find the best local region of
similarity (such as a shared motif) among
distantly related sequences.
PileUp always aligns all of the sequences you
specified in the input file, even if they are not
related.
The alignment can be degraded if some of the
sequences are only distantly related.

35
PILEUP Considerations

Since the alignment is calculated on a
progressive basis, the order of the initial
sequences can affect the final alignment.
PILEUP parameters: two gap penalties (gap insert
and gap extend) and an amino acid comparison
matrix.
PILEUP will refuse to align sequences that
require too many gaps or mismatches.
PILEUP will take quite a while to align more than
about 10 sequences

36
CLUSTAL

CLUSTAL is a stand-alone (i.e. not integrated
into GCG) multiple alignment program that is
superior in some respects to PILEUP
Works by progressive alignment: it aligns a pair
of sequences, then aligns the next one onto the
first pair
Most closely related sequences are aligned first,
and then additional sequences and groups of
sequences are added, guided by the initial
alignments
Uses alignment scores to produce a phylogenetic
tree

37
CLUSTAL

Aligns the sequences sequentially, guided by the
phylogenetic relationships indicated by the tree
Gap penalties can be adjusted based on specific
amino acid residues, regions of hydrophobicity,
proximity to other gaps, or secondary structure
Is available with a great web interface
http://www.ebi.ac.uk/clustalw/
Also available in Biology Workbench

38
Multiple Alignment tools on the Web

There are a variety of multiple alignment tools
available for free on the web.
CLUSTAL is available from a number of sites (with
a variety of restrictions)
Other algorithms are available too

39
MUSCLE Algorithm: Using Iteration
40
Consistency Based Algorithms T-Coffee

Gotoh (1990)
Iterative strategy using consistency
Martin Vingron (1991)
Dot matrix multiplications
Accurate but too stringent
Dialign (1996, Morgenstern)
Consistency
Agglomerative assembly
T-Coffee (2000, Notredame)
Consistency
Progressive algorithm

41
(No Transcript)
42
T-Coffee and Consistency
43
T-Coffee and Consistency
44
T-Coffee and Consistency
45
T-Coffee and Consistency
46
T-Coffee and Consistency
47
APPROXIMATE, FAST
ACCURATE, SLOW
48
Some URLs

EMBL-EBI
http://www.ebi.ac.uk/clustalw/
BCM Search Launcher Multiple Alignment
http://dot.imgen.bcm.tmc.edu:9331/multi-align/multi-align.html
Multiple Sequence Alignment for Proteins (Wash.
U. St. Louis)
http://www.ibc.wustl.edu/service/msa/

49
Editing and displaying alignments

Sequence editors are used for:
manual alignment/editing of sequences
visualization of data
data management
import/export of data
graphical enhancement of data for presentations

50
Editing Multiple Alignments

There are a variety of tools that can be used to
modify a multiple alignment.
These programs can be very useful in formatting
and annotating an alignment for publication.
An editor can also be used to make modifications
by hand to improve biologically significant
regions in a multiple alignment created by one of
the automated alignment programs.

51
Displaying a multiple alignment in GCG

There are several programs to display the
multiple alignment prettily.
The Pretty program prints sequences with their
columns aligned and can display a consensus for
the alignment, allowing you to look at
relationships among the sequences.
The PrettyBox program displays the alignment
graphically with the conserved regions of the
alignment as shaded boxes. The output is in
Postscript format.

52
Example of PrettyBox Output
53
GCG alignment editors

Alignments produced with PILEUP (or CLUSTAL) can
be adjusted with LINEUP.
Nicely shaded printouts can be produced with
PRETTYBOX
GCG's SeqLab X-Windows interface has a superb
multiple sequence editor - the best editor of any
kind.

54
(No Transcript)
55
Other editors

The MACAW and SeqVu programs for Macintosh and
GeneDoc and DCSE for PCs are free and provide
excellent editor functionality.
Many comprehensive molecular biology programs
include multiple alignment functions
MacVector, OMIGA, Vector NTI, and
GeneTool/PepTool all include a built-in version
of CLUSTAL

56
SeqVu
57
CINEMA

CINEMA (Colour INteractive Editor for Multiple
Alignments)
It is an editor created completely in JAVA (old
browsers beware)
It includes a fully functional version of
CLUSTAL, BLAST, and a DotPlot module

http://www.bioinf.man.ac.uk/dbbrowser/CINEMA2.1/
58
Informative Colors

By default, the alignment is coloured crudely
according to residue type (proline and glycine
have special structural properties, particularly
in membrane proteins, so are grouped separately;
similarly for cysteine, which is often involved
in disulphide bond formation):
Residue type           Residues         Colour
Polar positive         H, K, R          Blue
Polar negative         D, E             Red
Polar neutral          S, T, N, Q       Green
Non-polar aliphatic    A, V, L, I, M    White
Non-polar aromatic     F, Y, W          Purple
Proline, glycine       P, G             Brown
Cysteine               C                Yellow
Special characters     B, Z, X, -       Grey
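
The colour scheme above is easy to apply to any alignment row. A minimal sketch follows; the names COLOUR_GROUPS and colour_row are illustrative and not part of CINEMA, and treating unlisted characters as Grey is an assumption.

```python
# Residue groups and colours exactly as listed in the table above.
COLOUR_GROUPS = {
    "Blue": "HKR",
    "Red": "DE",
    "Green": "STNQ",
    "White": "AVLIM",
    "Purple": "FYW",
    "Brown": "PG",
    "Yellow": "C",
    "Grey": "BZX-",
}

# Invert the table: one colour per residue character.
RESIDUE_COLOUR = {
    res: colour for colour, group in COLOUR_GROUPS.items() for res in group
}

def colour_row(row):
    """Colour name for each character of an aligned row (unknown characters: Grey)."""
    return [RESIDUE_COLOUR.get(res.upper(), "Grey") for res in row]

print(colour_row("VTISC-G"))
```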