Title: Multiple Alignment
1Multiple Alignment
- Stuart M. Brown
- NYU School of Medicine
3Pairwise Alignment
- The alignment of two sequences (DNA or protein)
is a relatively straightforward computational
problem.
- The standard solution is an approach called
Dynamic Programming, which finds an optimal
alignment for a given scoring scheme.
5Dynamic Programming
- Dynamic Programming is a general programming
technique.
- It is applicable when a large search space can be
structured into a succession of stages, such
that
  - the initial stage contains trivial solutions to
  sub-problems
  - each partial solution in a later stage can be
  calculated by recurring on a fixed number of partial
  solutions in an earlier stage
  - the final stage contains the overall solution
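The three stages can be illustrated with a minimal sketch of global pairwise alignment scoring (the Needleman-Wunsch recurrence). The function name and the scoring values (match 1, mismatch -1, gap -2) are assumptions chosen for illustration, not values from these slides.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Initial stage: trivial sub-problems (a prefix aligned against nothing).
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Later stages: each cell is computed from exactly three earlier cells.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )

    # Final stage: the bottom-right cell holds the overall solution.
    return score[-1][-1]
```

Only the score is returned here; recovering the alignment itself would need a traceback through the matrix, which is omitted to keep the sketch short.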
6Multiple Alignments
- Making an optimal alignment between two sequences
is computationally straightforward, but aligning
a large number of sequences using the same method
is almost impossible.
- The amount of computation grows exponentially with the
number of sequences involved, so it becomes
computationally prohibitive for
large numbers of sequences.
7Longer Sequences
- What happens to the number of cells in the matrix
when we add another base to one sequence?
- How about to both?
- Number of cells = L1 x L2, or L^2 if we use 2
sequences of the same length.
- So the amount of computing grows with the square
of sequence length: bad but not terrible, because
the compute time for each cell remains constant.
8Align Three Sequences by Dynamic programming
Georg Fullen, VSNS Biocomputing, Univ. Munster
So how many cells (that contain values that must
be computed) do we add for each additional
sequence? It's a power function! For N
sequences of length L, number of cells = 2^N x L^N.
This is very bad for computing alignments of a
lot of sequences!
If the calculation takes 1 nanosecond per cell,
then for 6 sequences of length 100, we'll have a
running time of 2^6 x 100^6 x 10^-9 seconds
(64,000 seconds, about 18 hours). Just add 2 more
sequences, and the running time is
2^8 x 100^8 x 10^-9 = 2.6 x 10^9 seconds
(about 81 years).
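The arithmetic on this slide can be checked with a short sketch. It uses the slide's own formula (2^N x L^N cells) and its 1 nanosecond per cell figure; the function names are hypothetical.

```python
def dp_cells(num_sequences, length):
    """Cell count from the slide's formula: 2^N x L^N."""
    return (2 ** num_sequences) * (length ** num_sequences)


def running_time_seconds(num_sequences, length, seconds_per_cell=1e-9):
    """Estimated running time at a fixed cost per cell (1 ns by default)."""
    return dp_cells(num_sequences, length) * seconds_per_cell


if __name__ == "__main__":
    for n in (2, 6, 8):
        seconds = running_time_seconds(n, 100)
        years = seconds / (86400 * 365.25)
        print(f"{n} sequences of length 100: {seconds:.3g} s ({years:.3g} years)")
```

Going from 6 to 8 sequences multiplies the estimate by 2^2 x 100^2 = 40,000, which is why a run of hours turns into decades.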
9Global vs. Local Multiple Alignments
- Global alignment algorithms start at the
beginning of two sequences and add gaps to each
until the end of one is reached.
- Local alignment algorithms find the region (or
regions) of highest similarity between two
sequences and build the alignment outward from
there.
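One standard dynamic programming formulation of local alignment is the Smith-Waterman recurrence. It is named here as an assumption, since the slide does not say which algorithm it has in mind. The sketch below differs from global scoring in two ways: a cell is never allowed to drop below zero, and the answer is the best cell anywhere in the matrix. The function name and scoring values are illustrative.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,                           # start a fresh local alignment
                score[i - 1][j - 1] + pair,  # extend along the diagonal
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
            best = max(best, score[i][j])    # best region may end anywhere
    return best
```

With these values, a shared region such as GATTACA embedded in otherwise unrelated flanking sequence scores 14, however long the flanks are.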
10Optimal Alignment
- For a given group of sequences, there is no
single "correct" alignment, only an alignment
that is "optimal" according to some set of
calculations.
- Determining what alignment is best for a given
set of sequences is really up to the judgement of
the investigator.
11Progressive Pairwise Methods
- Most of the available multiple alignment programs
use some sort of incremental or progressive
method that makes pairwise alignments, averages
them into a consensus (actually a profile), then
adds new sequences one at a time to the aligned
set.
- This is an approximate method!
12CLUSTALW
- CLUSTAL is the most popular multiple alignment
program
- Gap penalties can be adjusted based on specific
amino acid residues, regions of hydrophobicity,
proximity to other gaps, or secondary structure.
- It can re-align just selected sequences or
selected regions in an existing alignment
- It can compute phylogenetic trees from a set of
aligned sequences.
- Unix command line program
- Website http://www.ebi.ac.uk/Tools/clustalw2/index.html
- There are also Mac and PC versions with a nice
graphical interface (CLUSTALX).
13CLUSTALW2 at the EBI website
http://www.ebi.ac.uk/Tools/clustalw2/index.html
14Other Multiple Alignment Tools
- MUSCLE http://www.ebi.ac.uk/Tools/muscle/index.html
- T-COFFEE http://www.ebi.ac.uk/Tools/t-coffee/
- MSA
15Editing Multiple Alignments
- There are a variety of tools that can be used to
modify and display a multiple alignment.
- These programs can be very useful in formatting
and annotating an alignment for publication.
- An editor can also be used to make modifications
by hand to improve biologically significant
regions in a multiple alignment created by an
alignment program.
16Alignment editors
- The MACAW and SeqVu programs for Macintosh and
GeneDoc and DCSE for PCs are free and provide
excellent editor functionality.
- Many comprehensive molecular biology programs
include multiple alignment functions
- Sequencher, MacVector, DS Gene, Vector NTI, all
include a built-in version of CLUSTAL
17SeqVu
18JalView
- Install on your machine
- or run as a Java WebStart application
19- Check out CINEMA (Colour INteractive Editor for
Multiple Alignments)
- It is an editor created completely in JAVA (old
browsers beware)
- It includes a fully functional version of
CLUSTAL, BLAST, and a DotPlot module
http://www.bioinf.man.ac.uk/dbbrowser/CINEMA2.1/
21Analysis of Alignments
- Once you have a multiple alignment, what can you
do with it?
- 1) Identify regions of similarity and difference
  - Conserved regions may be functionally important,
  and/or sites for inclusive (cross species) primer
  design
  - Variable regions may be functionally important,
  and/or sites for gene/allele-specific primer
  design
- 2) Create a sequence logo
- 3) Build a Phylogenetic Tree (next week)
22Format a Multiple Alignment
- The concept of a consensus sequence is implied
by any multiple alignment. There can be various
rules for building the consensus: simple majority
rules, plurality by a specific percentage, etc.
- The alignment may look nicer by showing how each
letter matches the consensus: highlight the
differences.
- PLOTSIMILARITY (a graph of overall similarity
across the alignment): EMBOSS plotcon
- Show match to consensus: showalign
- Shade by similarity: prettyplot/Boxshade
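As a rough illustration of the consensus rules above, the following sketch builds a consensus by a fractional plurality threshold and then shows only the differences from it. The function names, the `plurality` parameter (a fraction of sequences per column), and the toy alignment are hypothetical; they do not reproduce the scoring used by any of the programs listed.

```python
from collections import Counter


def consensus(alignment, plurality=0.5, gap_char="-"):
    """Build a consensus; '-' marks columns where no residue exceeds the threshold."""
    result = []
    for column in zip(*alignment):
        residues = [c.upper() for c in column if c != gap_char]
        if not residues:
            result.append(gap_char)
            continue
        residue, count = Counter(residues).most_common(1)[0]
        # The residue must occur in more than `plurality` of all sequences.
        result.append(residue if count / len(column) > plurality else gap_char)
    return "".join(result)


def show_differences(alignment, cons):
    """Replace letters that match the consensus with '.', leaving differences visible."""
    return [
        "".join("." if k != "-" and c.upper() == k else c for c, k in zip(seq, cons))
        for seq in alignment
    ]


if __name__ == "__main__":
    toy = ["ACGT", "ACGA", "ACCT"]
    cons = consensus(toy)
    for line in show_differences(toy, cons) + [cons]:
        print(line)
```

With the default threshold of 0.5 this is a simple majority rule; lowering the threshold gives a plurality rule, at the cost of having to break ties between equally common residues (here the first one counted wins).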
24Plurality 2.00 Threshold 4 AveWeight 0.55 AveMatch 2.91 AvMisMatch -2.00
PRETTY of @pretty.list October 7, 1998 10:35 ..

           1                                                   50
fa10.ugly  .......... .......... .......... ..TTttGESA D.PvtTtVE.
fa12.ugly  .......... .......... .......... ..TTatGESA D.PvtTtVE.
fo1k.ugly  .......... .......... .......... ..TTsaGESA D.PvtTtVE.
e.ugly     Gvenae.kgv tEnTna.Tad fvaqpvyLPe .nqT...... kv.Affynrs
p1m.ugly   GlgqmlEsmI .dnTvreTvg AatsrdaLPn teasGPthSk eiPALTAVET
p1s.ugly   GlgqmlEsmI .dnTvreTvg AatsrdaLPn teasGPahSk eiPALTAVET
p2s.ugly   GigdmiEgav .Egitknalv pptstnsLPg hkpsGPahSk eiPALTAVET
p3s.ugly   Giedliseva .qgal..Tls lpkqqdsLPd tkasGPahSk evPALTAVET
cb3.ugly   ...gpvEdaI .......T.. Aaigr..vad tvgTGPtnSe aiPALTAaET
r14.ugly   GlgdelEevI vEkT.kqTv. Asi....... ..ssGPkhtq kvPiLTAnET
r2.ugly    ...npvEnyI dEvlnevlv. .......vPn inssnPttSn saPALdAaET
Consensus  G-----E--I -E-T---T-- A------LP- --TTGPGESA D-PALTAVET

/////////////////////////////////////////////////////////////////

           301                                               349
fa10.ugly  aElyCPRPll AIkvtsqdRy KqKI.iAPa. ..KQll.... .........
fa12.ugly  aElyCPRPll AIevssqdRh KqKI.iAPg. ..KQll.... .........
fo1k.ugly  aEtyCPRPll AIhpt.eaRh KqKI.vAPv. ..KQTl.... .........
e.ugly     krvfCPRPtv ffPwpTsG.D Kidmtpragv lmlespnald isrty....
p1m.ugly   irvWCPRPPR AlaYygpGvD ykdgtltPls tkdlTTy... .........
p1s.ugly   irvWCPRPPR AvaYygpGvD ykdgtltPls tkdlTTy... .........
p2s.ugly   VrvWCPRPPR AvPYfgpGvD ykdg.ltPlp ekglTTy... .........
p3s.ugly   VrvWCPRPPR AvPYygpGvD yrn.nldPls ekglTTy... .........
cb3.ugly   VkaWiPRPPR lcqYekakn. vnfrssgvtt trqsiTtmtn tgaiwtti.
r14.ugly   VEaWiPRaPR AlPY.Tsigr tny..pknte pvikkrk.gd i.ksy....
r2.ugly    VkaWCPRPPR AleY.Trahr tnfkiedrsi qtaivTrpii ttagpsdmy
Consensus  VE-WCPRPPR AIPY-T-GRD K-KI--AP-- --KQTT---- ---------
25Boxshade
Shade each letter of the alignment based on its
match to the consensus. This highlights conserved
regions and is much more informative for protein
alignments (shades of grey for similar amino
acids).
http://mobyle.pasteur.fr/cgi-bin/MobylePortal/portal.py?form=boxshade
http://www.ch.embnet.org/software/BOX_form.html
28Sequence Logos
http://weblogo.berkeley.edu/logo.cgi
http://weblogo.threeplusone.com/create.cgi
http://genome.tugraz.at/Logo/
T. D. Schneider and R. M. Stephens. Sequence
logos: a new way to display consensus sequences.
Nucleic Acids Research, Vol. 18, No. 20, p.
6097-6100.
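In a sequence logo, the height of each column stack reflects its information content in bits: the maximum possible entropy (log2 of the alphabet size) minus the observed entropy of the column. The sketch below computes that quantity for a single column. It is a simplified illustration that leaves out the small-sample correction described by Schneider and Stephens, and the function name and parameters are hypothetical.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=4, gap_chars="-."):
    """Information content (bits) of one alignment column, ignoring gaps.

    alphabet_size is 4 for DNA/RNA and 20 for protein.
    """
    residues = [c.upper() for c in column if c not in gap_chars]
    if not residues:
        return 0.0
    total = len(residues)
    # Shannon entropy of the observed residue frequencies.
    entropy = -sum(
        (n / total) * math.log2(n / total) for n in Counter(residues).values()
    )
    return math.log2(alphabet_size) - entropy
```

A fully conserved DNA column gives 2 bits, and a column with all four bases equally represented gives 0 bits; within a stack, each letter's height is its frequency times the column's information content.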
29Building on Alignments
- Multiple Alignments are the starting point for
calculating phylogenetic trees
- Motifs and Profiles are calculated from multiple
alignments