Title: Multiple Alignment
1Multiple Alignment
- Stuart M. Brown
- NYU School of Medicine
3Pairwise Alignment
- The alignment of two sequences (DNA or protein) is a relatively straightforward computational problem.
- The best solution seems to be an approach called Dynamic Programming.
4Dynamic Programming
- Dynamic Programming is a very general programming technique.
- It is applicable when a large search space can be structured into a succession of stages, such that:
  - the initial stage contains trivial solutions to sub-problems
  - each partial solution in a later stage can be calculated by recurring on a fixed number of partial solutions in an earlier stage
  - the final stage contains the overall solution
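The three stages can be seen in a small table-filling example. This is a minimal sketch, not taken from the slides: it uses the longest common subsequence problem as a stand-in for alignment, and the function name `lcs_length` is hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of strings a and b."""
    # Initial stage: row 0 and column 0 are the trivial sub-problems
    # (anything compared with an empty prefix shares 0 characters).
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]

    # Later stages: each cell depends on a fixed number (three) of
    # cells that were already computed.
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Final stage: the last cell holds the overall solution.
    return table[len(a)][len(b)]


print(lcs_length("GATTACA", "GCATGCU"))
```

Pairwise alignment fills the same kind of table; only the scoring rule in each cell changes.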
6Global vs. Local Alignments
- Global alignment algorithms start at the beginning of two sequences and add gaps to each until the end of one is reached.
- Local alignment algorithms find the region (or regions) of highest similarity between two sequences and build the alignment outward from there.
8GAP
- The GCG program GAP implements the Needleman-Wunsch global alignment algorithm.
- Global algorithms are often not effective for highly diverged sequences and do not reflect the biological reality that two sequences may only share limited regions of conserved sequence.
- Sometimes two sequences may be derived from ancient recombination events where only a single functional domain is shared.
- GAP is useful when you want to force two sequences to align over their entire length.
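A global alignment of this kind can be sketched in a few lines. This is an illustration under stated assumptions, not GAP itself: the scores (match +1, mismatch -1, gap -1) and the single linear gap penalty are hypothetical simplifications, and the function name `needleman_wunsch` is ours.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for a global alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Leading gaps are penalised: both sequences must be used end to end.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


print(needleman_wunsch("ACGT", "AGT"))
```

Because the traceback always starts in the last cell and ends in the first, every residue of both sequences appears in the result, which is the "entire length" behaviour described above.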
9BESTFIT
- The GCG program BESTFIT implements the Smith-Waterman local alignment algorithm.
- FASTA and BLAST are local alignment algorithms.
- NCBI has a BLAST 2 Sequences feature on its website:
  - http://www.ncbi.nlm.nih.gov/gorf/bl2.html
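A local alignment can be sketched with the same table-filling idea, with two changes: scores never drop below zero, and the traceback starts at the highest-scoring cell. This is a minimal sketch, not BESTFIT itself; the scores (match +2, mismatch -1, gap -2) and the name `smith_waterman` are assumptions for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, fragment_a, fragment_b) for the best local alignment."""
    score = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # The floor of 0 lets an alignment restart anywhere.
            score[i][j] = max(0,
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until the score falls to 0.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("CCCCACGTGGGG", "TTACGTTT"))
```

Only the shared region is reported; the unrelated flanking residues are left out of the alignment.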
10Pairwise Alignment on the Web
- The ALIGN global alignment program is available at several servers:
  - http://molbiol.soton.ac.uk/compute/align.html
  - http://www2.igh.cnrs.fr/bin/align-guess.cgi
- The LALIGN local alignment program is available at several servers:
  - http://www2.igh.cnrs.fr/bin/lalign-guess.cgi
  - http://www.ch.embnet.org/software/LALIGN_form.html
- LFASTA uses FASTA for local alignment of 2 sequences:
  - http://pbil.univ-lyon1.fr/lfasta.html
- BLAST 2 Sequences (NCBI):
  - http://www.ncbi.nlm.nih.gov/blast/bl2seq/bl2.html
12Multiple Alignments
- In theory, making an optimal alignment between two sequences is computationally straightforward (Smith-Waterman algorithm), but aligning a large number of sequences using the same method is almost impossible.
- The problem increases exponentially with the number of sequences involved (the product of the sequence lengths).
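The growth is easy to put into numbers. The sketch below simply multiplies the sequence lengths to count the cells a full dynamic programming table would need; the function name `dp_cells` and the example length of 300 residues are assumptions for illustration.

```python
from math import prod


def dp_cells(lengths):
    """Number of table cells for an exact alignment of sequences of these lengths."""
    return prod(lengths)


for n in (2, 3, 5, 10):
    print(n, "sequences of 300 residues:", dp_cells([300] * n), "cells")
```

Two sequences need 90,000 cells, but each additional 300-residue sequence multiplies the table by another 300.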
13Optimal Alignment
- For a given group of sequences, there is no single "correct" alignment, only an alignment that is "optimal" according to some set of calculations.
- Determining what alignment is best for a given set of sequences is really up to the judgement of the investigator.
14Progressive Pairwise Methods
- Most of the available multiple alignment programs use some sort of incremental or progressive method that makes pairwise alignments, then adds new sequences one at a time to these aligned groups.
- This is an approximate method!
15PILEUP
- PILEUP is the multiple alignment program in the GCG package.
- CLUSTAL is another popular program (also available on the RCR server) that uses a similar algorithm.
16The PILEUP Algorithm
- First, PILEUP calculates approximate pairwise similarity scores between all sequences to be aligned, and they are clustered into a dendrogram (tree structure).
- Then the most similar pairs of sequences are aligned.
- Averages (similar to consensus sequences) are calculated for the aligned pairs.
- New sequences and clusters of sequences are added one by one, according to the branching order in the dendrogram.
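The first step, turning pairwise scores into a joining order, can be sketched as follows. This is not PILEUP's code: the similarity measure (`difflib.SequenceMatcher.ratio`) and the average-linkage clustering rule are stand-ins chosen for illustration, and `join_order` is a hypothetical name. The example sequences are the first 20 residues of four rows from the PILEUP output slide.

```python
from difflib import SequenceMatcher
from itertools import combinations


def join_order(seqs):
    """Return the order in which clusters of sequence names would be merged."""
    # Approximate pairwise similarity between all sequences.
    sim = {frozenset(pair): SequenceMatcher(None, seqs[pair[0]], seqs[pair[1]]).ratio()
           for pair in combinations(seqs, 2)}

    def average(c1, c2):
        # Mean similarity between all members of two clusters.
        return sum(sim[frozenset((x, y))] for x in c1 for y in c2) / (len(c1) * len(c2))

    clusters = [(name,) for name in seqs]
    order = []
    while len(clusters) > 1:
        # Merge the two most similar clusters; this is one branch of the dendrogram.
        c1, c2 = max(combinations(clusters, 2), key=lambda p: average(*p))
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(c1 + c2)
        order.append((c1, c2))
    return order


example = {
    "Hsirf2": "SERPSKKGKKPKTEKEDKVK",
    "Muirf2": "SERPSKKGKKPKTEKEERVK",
    "Hsirf4": "PEGAKKGAKQLTLEDPQMSM",
    "Mupip": "PEGAKKGAKQLTLDDTQMAM",
}
for left, right in join_order(example):
    print(left, "+", right)
```

A progressive aligner would then align each merged pair in this order, which is why an early mistake is carried into every later step.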
17PILEUP Considerations
- Since the alignment is calculated on a progressive basis, the order of the initial sequences can affect the final alignment.
- PILEUP parameters: 2 gap penalties (gap insert and gap extend) and an amino acid comparison matrix.
- PILEUP will refuse to align sequences that require too many gaps or mismatches.
- PILEUP will take quite a while to align more than about 10 sequences.
18Instructions for running PILEUP
- PILEUP uses a list of sequence files as input.
- You can use output from a FASTA or LOOKUP search as a list or make your own list in a text editor.
- A list file can include files from your own directory and/or GCG database files.
19LIST file format
- List files always begin with two dots (..):
    ..
    gp:S31321
    gp:Yno3_Yeast
    S51900.pep
    Yan2_Schpo
    Ypd1_Caeel
    A36205
    Mpp1_Rat begin:100 end:345
    B46665.pep
    Ymxg_Bacsu begin:150 end:464
    A48043.pep
- List files can also include Begin and End positions within a sequence.
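To make the layout concrete, here is a minimal sketch of a reader for a list file like the one above. It is not part of GCG: the function name `parse_list_file` is hypothetical, and it assumes the Begin and End attributes are written as `begin:100 end:345` (it also tolerates the colon being absent).

```python
import re

ATTRIBUTE = re.compile(r"\b(begin|end):?(\d+)", re.IGNORECASE)


def parse_list_file(text):
    """Return a list of (name, begin, end) tuples; begin/end are None if not given."""
    entries = []
    started = False
    for line in text.splitlines():
        line = line.strip()
        if not started:
            # Entries start after the line that carries the two dots.
            started = line.endswith("..")
            continue
        if not line:
            continue
        name = line.split()[0]
        attrs = {key.lower(): int(value) for key, value in ATTRIBUTE.findall(line)}
        entries.append((name, attrs.get("begin"), attrs.get("end")))
    return entries


sample = """..
S51900.pep
Mpp1_Rat begin:100 end:345
"""
print(parse_list_file(sample))
```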
20PILEUP @myseqs.list
- Now at the > prompt, type PILEUP and the name of the file that is your list of sequence names.
- However, GCG requires that you precede the name of your list file with the @ character.
- So the command looks like this:
    > PILEUP @myseqs.list
21PILEUP Output
> more myseqs.msf

          1501                                              1550
Hsirf2    SERPSKKGKK PKTEKEDKVK HIKQEPVESS LGLSNGVSDL SPEYAVLTST
Muirf2    SERPSKKGKK PKTEKEERVK HIKQEPVESS LGLSNGVSGF SPEYAVLTSA
Chirf2    SERPSKKGKK TKSEKDDKFK QIKQEPVESS FGI.NGLNDV TSDY.FLSSS
Muirf1    LTRNQRKERK SKSSRDTKSK TKRKLCGDVS PDTFS..DGL SSSTLPDDHS
Ratirf1   LTKNQRKERK SKSSRDTKSK TKRKLCGDSS PDTLS..DGL SSSTLPDDHS
Hsirf1    LTKNQRKERK SKSSRDAKSK AKRKSCGDSS PDTFS..DGL SSSTLPDDHS
Chkirf1a  LTKDQKKERK SKSSREARNK SKRKLYEDMR MEESA..ERL TSTPLPDDHS
Hsirf3a
Mmuirf3
Hsirf5    GPAPTDSQPP EDYSFGAGEE EEEEEELQRM LPSLSLTDAV QSGPHMTPYS
Mmuirf6   IPQPQGS.VI NPGSTGSAPW DEKDNDVDED EEEDELEQSQ HHVPIQDTFP
Hump48    ...PPGIVSG QPGTQKVPSK RQHSSVSSER KEEEDAMQNC TLSPSVLQDS
Mup48     ...PAGTLPN QPRNQKSPCK RSISCVSPER EEN...MENG RTNGVVNHSD
Hsirf4    ...PEGAKKG AKQLTLEDPQ MSMSHPYTMT TPYPSLPA.Q VHNYMMPPLD
Mupip     ...PEGAKKG AKQLTLDDTQ MAMGHPYPMT APYGSLPAQQ VHNYMMPPHD
Huicsbp   ...PEEDQK. .......... .......... CKLGVATAGC VNEVTEMECG
Muicsbp   ...PEEEQK. .......... .......... CKLGVAPAGC MSEVPEMECG
Chkicsbp  ...PEEEQK. .......... .......... CKIGVGNGSS LTDVGDMDCS

          1551                                              1600
Hsirf2    IKNEVDSTVN IIVVGQSHLD SNIENQEIVT NPPDICQVVE VTTESDEQPV
Muirf2    IKNEVDSTVN IIVVGQSHLD SNIEDQEIVT NPPDICQVVE VTTESDDQPV
Chirf2    IKNEVDSTVN IVVVGQPHLD GSSEEQVIVA NPPDVCQVVE VTTESDEQPL
Muirf1    SYTTQGYLGQ DLDMER.DIT PALSPCVVSS SLSEWHMQMD I.IPDSTTDL
Ratirf1   SYTAQGYLGQ DLDMDR.DIT PALSPCVVSS SLSEWHMQMD I.MPDSTTDL
Hsirf1    SYTVPGYM.Q DLEVEQ.ALT PALSPCAVSS TLPDWHIPVE V.VPDSTSDL
Chkirf1a  SYTAHDYTGQ EVEVENTSIT LDLSSCEVSG SLTDWRMPME IAMADSTNDI
Hsirf3a
Mmuirf3
Hsirf5    LLKEDVKWPP TLQPPTLQPP VVLGPPAPDP SPLAPPPGNP AGFRELLSEV
Mmuirf6   FL........ NINGSPMAPA SVGNCSVGNC SPESVWP... ......KTEP
Hump48    LNNEEEGASG GAVHSDIGSS SSSSSPEPQE VTDTTEAPFQ ........GD
Mup48     SGSNIGGGGN GSNRSD...S NSNCNSELEE GAGTTEATIR ........ED
Hsirf4    RSWRDYVPDQ PHPEIPYQCP MTFGPRGHHW QGPACENGCQ VTGTFYACAP
Mupip     RSWRDYAPDQ SHPEIPYQCP VTFGPRGHHW QGPSCENGCQ VTGTFYACAP
Huicsbp   RSEIDELIKE .PSVDDYMGM IKRSPSP... P.DACRS..Q LLPDWWAHEP
Muicsbp   RSEIEELIKE .PSVDEYMGM TKRSPSP... P.EACRS..Q ILPDWWVQQP
Chkicsbp  PSAIDDLMKE PPCVDEYLGI IKRSPSP... PQETCRN..P PIPDWWMQQP
22PILEUP options
- For a first try, take the default options, but give the output file a meaningful name.
- If you don't get a good alignment, try a less stringent matrix and/or gap penalties:
    > PILEUP -matr=oldpep.cmp
- It is a good idea to run PILEUP in batch mode if you have more than 10 sequences to align:
    > PILEUP -bat
23CLUSTAL
- CLUSTAL is a stand-alone (i.e. not integrated into GCG) multiple alignment program that is superior in some respects to PILEUP.
- Gap penalties can be adjusted based on specific amino acid residues, regions of hydrophobicity, proximity to other gaps, or secondary structure.
- It can re-align just selected sequences or selected regions in an existing alignment.
- It can compute phylogenetic trees from a set of aligned sequences.
- There are also Mac and PC versions with a nice graphical interface (CLUSTALX).
24Using CLUSTAL
- On mcrcr0, type: clustal
- CLUSTAL can only work with sequences in multi-sequence FASTA format.
- The GCG program TOFASTA can convert lists of file names into FASTA multi-sequence format.
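For readers who have not seen it, multi-sequence FASTA is simply one `>name` header line per sequence, followed by the residues. The sketch below writes that layout from a dictionary of sequences; it is not TOFASTA, and the function name `to_fasta` and the 60-character line width are assumptions.

```python
def to_fasta(records, width=60):
    """Format a {name: sequence} mapping as multi-sequence FASTA text."""
    lines = []
    for name, seq in records.items():
        lines.append(">" + name)
        # Wrap the sequence into fixed-width lines.
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


print(to_fasta({
    "Hsirf2": "SERPSKKGKKPKTEKEDKVK",
    "Muirf2": "SERPSKKGKKPKTEKEERVK",
}), end="")
```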
25Multiple Alignment tools on the Web
- There are a variety of multiple alignment tools available for free on the web.
- CLUSTAL is available from a number of sites (with a variety of restrictions).
- Other algorithms are available too.
- Watch out for experimental algorithms: there may be a good reason why you have never heard of some oddball program.
26Some URLs
- EMBL-EBI
  - http://www.ebi.ac.uk/clustalw/
- BCM Search Launcher Multiple Alignment
  - http://dot.imgen.bcm.tmc.edu:9331/multi-align/multi-align.html
- Multiple Sequence Alignment for Proteins (Wash. U. St. Louis)
  - http://www.ibc.wustl.edu/service/msa/
27Editing Multiple Alignments
- There are a variety of tools that can be used to modify a multiple alignment.
- These programs can be very useful in formatting and annotating an alignment for publication.
- An editor can also be used to make modifications by hand to improve biologically significant regions in a multiple alignment created by one of the automated alignment programs.
28GCG alignment editors
- Alignments produced with PILEUP (or CLUSTAL) can be adjusted with LINEUP.
- Nicely shaded printouts can be produced with PRETTYBOX.
- GCG's SeqLab X-Windows interface has a superb multiple sequence editor, the best editor of any kind.
30Other editors
- The MACAW and SeqVu programs for Macintosh and GeneDoc and DCSE for PCs are free and provide excellent editor functionality.
- Many comprehensive molecular biology programs include multiple alignment functions.
- MacVector, OMIGA, Vector NTI, and GeneTool/PepTool all include a built-in version of CLUSTAL.
31SeqVu
32Editors on the Web
- Check out CINEMA (Colour INteractive Editor for Multiple Alignments).
- It is an editor created completely in Java (old browsers beware).
- It includes a fully functional version of CLUSTAL, BLAST, and a DotPlot module.
- http://www.bioinf.man.ac.uk/dbbrowser/CINEMA2.1/