Multiple Sequence Alignment - PowerPoint PPT Presentation

Provided by: stua70
1
Multiple Sequence Alignment
2
(No Transcript)
3
Terminology
  • Motif: the biological object one attempts to
    model - a functional or structural domain, active
    site, phosphorylation site, etc.
  • Pattern: a qualitative motif description based on
    a regular expression-like syntax
  • Profile: a quantitative motif description that
    assigns a degree of similarity to a potential
    match
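
A pattern in the sense above can be written directly as a regular expression. The sketch below is only an illustration: it uses the well-known PROSITE-style N-glycosylation pattern N-{P}-[ST]-{P}, which is not taken from these slides, and the helper name `find_pattern` and the toy sequence are made up for the example.

```python
import re

# PROSITE-style pattern N-{P}-[ST]-{P} as a Python regular expression:
# N, then any residue except P, then S or T, then any residue except P.
N_GLYCOSYLATION = re.compile(r"N[^P][ST][^P]")

def find_pattern(sequence, pattern=N_GLYCOSYLATION):
    """Return the 0-based start positions of all matches, overlapping ones included."""
    # Wrapping the pattern in a lookahead lets overlapping matches be reported.
    lookahead = re.compile(f"(?=({pattern.pattern}))")
    return [m.start() for m in lookahead.finditer(sequence)]

toy = "MKNGSAANPTLNVTQ"  # hypothetical sequence, not real data
print(find_pattern(toy))
```

The result is qualitative: a position either matches or it does not. A profile would instead assign each candidate position a score.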

4
Global Alignment
  • Global algorithms are often not effective for
    highly diverged sequences and do not reflect the
    biological reality that two sequences may only
    share limited regions of conserved sequence.
  • Sometimes two sequences may be derived from
    ancient recombination events where only a single
    functional domain is shared.

5
What is Multiple Sequence Alignment (MSA)?
  • Multiple sequence alignment (MSA) can be seen as
    a generalization of pairwise sequence alignment:
    instead of aligning two sequences, n sequences
    are aligned simultaneously, where n > 2
  • Definition: a multiple sequence alignment is an
    alignment of n > 2 sequences obtained by
    inserting gaps (-) into the sequences such that the
    resulting sequences all have length L and can be
    arranged in a matrix of n rows and L columns,
    where each column represents a homologous
    position (each column corresponds to a specific
    residue in the 'prototypical' protein)
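
The matrix view in the definition can be made concrete with a few lines of Python. This is a minimal sketch, assuming the alignment is given as a list of gapped strings; the function name `as_matrix` and the toy rows are illustrative only.

```python
def as_matrix(aligned_rows):
    """Check that every gapped row has the same length L and return the L columns."""
    lengths = {len(row) for row in aligned_rows}
    if len(lengths) != 1:
        raise ValueError(f"rows have different lengths: {sorted(lengths)}")
    # zip(*rows) transposes the n x L matrix into L columns of n characters each.
    return ["".join(column) for column in zip(*aligned_rows)]

rows = ["ACG-T", "A-GCT", "ACGCT"]  # toy alignment, n = 3 and L = 5
columns = as_matrix(rows)
print(len(rows), "rows x", len(columns), "columns")
print(columns)
```

Each returned string is one column, that is, one putatively homologous position across all n sequences.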

6
Multiple Sequence Alignment
  • MSA applies both to nucleotide and amino acid
    sequences
  • To construct a multiple alignment, one may have
    to introduce gaps in sequences at positions where
    there were no gaps in the corresponding pairwise
    alignment.
  • This means that multiple alignments typically
    contain more gaps than any given pair of aligned
    sequences

7
(No Transcript)
8
How to optimize alignment algorithms?
  • Use structural information
    • reading frame
    • protein structure
  • Sequence elements are not truly independent but
    related by phylogenetic descent
  • Sequences often contain highly conserved regions

9
Optimize alignment algorithms
10
Pairwise Alignment
  • The alignment of two sequences (DNA or protein)
    is a relatively straightforward computational
    problem.

11
The big-O notation
  • One of the most important properties of an
    algorithm is how its execution time increases as
    the problem is made larger. By a larger problem,
    we mean more sequences to align, or longer
    sequences to align.
  • This is the so-called algorithmic (or
    computational) complexity of the algorithm
  • There is a notation to describe the algorithmic
    complexity, called the big-O notation.
  • If we have a problem size (number of input data
    points) n, then an algorithm takes O(n) time if
    the time increases linearly with n.
  • If the algorithm needs time proportional to the
    square of n, then it is O(n^2)

12
The big-O notation
  • It is important to realize that an algorithm that
    is quick on small problems may be totally useless
    on large problems if it has a bad O() behavior.
  • As a rule of thumb one can use the following
    characterizations, where n is the size of the
    problem, and c is a constant
    • O(c): utopian
    • O(log n): excellent
    • O(n): very good
    • O(n^2): not so good
    • O(n^3): pretty bad
    • O(c^n): disaster

13
The big-O notation
  • To compute an N-wise alignment, the algorithmic
    complexity is something like O(c^(2n)), where c is a
    constant, and n is the number of sequences.
  • This is a big-O disaster!

14
(No Transcript)
15
The best solution is Dynamic Programming.
16
Multiple Sequence Alignment
  • In pairwise alignments, you have a
    two-dimensional matrix with the sequences on each
    axis.
  • The number of operations required to locate the
    best path through the matrix is approximately
    proportional to the product of the lengths of the
    two sequences
  • A possible general method would be to extend the
    pairwise alignment method into a simultaneous
    N-wise alignment, using a complete
    dynamic-programming algorithm in N dimensions.
  • Algorithmically, this is not difficult to do

17
Dynamic Programming
  • Dynamic Programming is a very general programming
    technique.
  • It is applicable when a large search space can be
    structured into a succession of stages, such
    that
    • the initial stage contains trivial solutions to
      sub-problems
    • each partial solution in a later stage can be
      calculated by recurring on a fixed number of partial
      solutions in an earlier stage
    • the final stage contains the overall solution
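
The three stages can be seen in a small pairwise example. The sketch below computes a global alignment score in the style of Needleman-Wunsch with a linear gap penalty; the scoring values and the function name are assumptions chosen for illustration, not parameters given in the slides.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of a and b under a simple linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]

    # Initial stage: trivial sub-problems (a prefix aligned against nothing).
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Later stages: each cell depends on a fixed number (three) of earlier cells.
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap inserted in b
            left = score[i][j - 1] + gap    # gap inserted in a
            score[i][j] = max(diagonal, up, left)

    # Final stage: the bottom-right cell holds the overall solution.
    return score[-1][-1]

print(global_alignment_score("ACGT", "AGT"))
```

The two nested loops visit every cell once, which is why the work grows with the product of the two sequence lengths.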

18
Multiple Alignments
  • In theory, making an optimal alignment between
    two sequences is computationally straightforward
    (Smith-Waterman algorithm), but aligning a large
    number of sequences using the same method is
    almost impossible.
  • The problem size increases exponentially with the
    number of sequences involved (it is proportional to
    the product of the sequence lengths)
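
A quick count shows how fast the search space grows. The sketch assumes a full dynamic-programming lattice with one axis per sequence and (length + 1) positions per axis; `dp_cells` is an illustrative name, and the count ignores the extra work needed per cell.

```python
from math import prod

def dp_cells(lengths):
    """Number of cells in a full DP lattice: the product of (length + 1) over all sequences."""
    return prod(length + 1 for length in lengths)

for n in (2, 3, 5, 10):
    print(n, "sequences of length 100 ->", dp_cells([100] * n), "cells")
```

Two sequences of length 100 need about ten thousand cells; each additional sequence multiplies that by roughly another hundred.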

19
(No Transcript)
20
Optimal Alignment
  • For a given group of sequences, there is no
    single "correct" alignment, only an alignment
    that is "optimal" according to some set of
    calculations.
  • Determining what alignment is best for a given
    set of sequences is really up to the judgement of
    the investigator.

21
Why do we do multiple alignments?
  • To characterize protein families by
    identifying shared regions of homology in a multiple
    sequence alignment (this generally happens when
    a sequence search has revealed homologies to several
    sequences).
  • To determine the consensus sequence of
    several aligned sequences.
  • Consensus sequences can help to develop a
    sequence fingerprint, which allows the
    identification of members of a distantly related
    protein family (motifs)
  • MSA can help to reveal biological facts about
    proteins, for example through analysis of the
    secondary/tertiary structure

22
(No Transcript)
23
Why do we do multiple alignments?
  • Crucial for genome sequencing
  • Random fragments of a large molecule are
    sequenced and those that overlap are found by a
    multiple sequence alignment program.
  • There should be one correct alignment that
    corresponds to the genomic sequence rather than
    a range of possibilities
  • A sequence may be from one strand of DNA or the
    other, so the complement of each sequence must also
    be compared
  • Sequence fragments will usually overlap, but by
    an unknown amount, and in some cases one sequence
    may be included within another
  • All of the overlapping pairs of sequence
    fragments must be assembled into a large composite
    genome sequence
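
The strand and overlap points can be sketched as follows. This is a deliberately simplified illustration that assumes error-free fragments and exact string matching, which real assemblers cannot rely on; all function names and the `min_len` parameter are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def suffix_prefix_overlap(left, right, min_len=3):
    """Length of the longest suffix of left that equals a prefix of right, or 0."""
    for size in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0

def best_overlap(a, b, min_len=3):
    """Compare a with b in both orientations; report overlap length and orientation."""
    best = (0, "none")
    for label, other in (("forward", b), ("reverse-complement", reverse_complement(b))):
        # One fragment may lie entirely inside the other.
        if other in a or a in other:
            return (min(len(a), len(other)), label + ", contained")
        # The overlap may be at either end, so try both orders.
        size = max(suffix_prefix_overlap(a, other, min_len),
                   suffix_prefix_overlap(other, a, min_len))
        if size > best[0]:
            best = (size, label)
    return best

print(best_overlap("ACGTTGCA", "GACCTGCA"))
```

Because sequencing reads contain errors, a practical tool would replace the exact comparison with an alignment that tolerates mismatches and gaps.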

24
An example of Multiple Alignment
VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWWSNG--
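
As an illustration of the consensus idea mentioned earlier, the sketch below derives a simple majority consensus from the eight aligned rows of this example. The 50% threshold, the "." placeholder and the function names are assumptions made for the example, not a convention stated in the slides.

```python
from collections import Counter

ALIGNMENT = [
    "VTISCTGSSSNIGAG-NHVKWYQQLPG",
    "VTISCTGTSSNIGS--ITVNWYQQLPG",
    "LRLSCSSSGFIFSS--YAMYWVRQAPG",
    "LSLTCTVSGTSFDD--YYSTWVRQPPG",
    "PEVTCVVVDVSHEDPQVKFNWYVDG--",
    "ATLVCLISDFYPGA--VTVAWKADS--",
    "AALGCLVKDYFPEP--VTVSWNSG---",
    "VSLTCLVKGFYPSD--IAVEWWSNG--",
]

def consensus(rows, threshold=0.5):
    """Most common residue per column, or '.' if it is a gap or too infrequent."""
    result = []
    for column in zip(*rows):
        residue, count = Counter(column).most_common(1)[0]
        keep = residue != "-" and count / len(rows) >= threshold
        result.append(residue if keep else ".")
    return "".join(result)

def conserved_columns(rows):
    """0-based indices of columns where every row has the same non-gap residue."""
    return [i for i, column in enumerate(zip(*rows))
            if len(set(column)) == 1 and column[0] != "-"]

print(consensus(ALIGNMENT))
print(conserved_columns(ALIGNMENT))
```

In this alignment only two columns are fully conserved, a cysteine and a tryptophan, which is the kind of observation a sequence fingerprint is built from.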
25
Three Types of Algorithms
  • Progressive: ClustalW
  • Iterative: Muscle
  • Consistency-based: T-Coffee and ProbCons

26
Progressive Multiple Alignment
  • The most practical and widely used method in
    multiple sequence alignment is the hierarchical
    extension of pairwise alignment methods.
  • The principle is that a multiple alignment is
    achieved by successive application of pairwise
    methods.

27
Choosing sequences for alignment
  • The more sequences to align, the better.
  • Don't include highly similar (> 80% identity) sequences.
  • Sub-groups should be pre-aligned separately, and
    one member of each subgroup should be included in
    the final multiple alignment.

28
Progressive Pairwise Methods
  • Most of the available multiple alignment programs
    use some sort of incremental or progressive
    method that makes pairwise alignments, then adds
    new sequences one at a time to these aligned
    groups.
  • This is an approximate or heuristic method!

29
(No Transcript)
30
Multiple Alignment Method
  • Compare all sequences pairwise.
  • Perform cluster analysis on the pairwise data to
    generate a hierarchy for alignment. This may be
    in the form of a binary tree or a simple ordering
  • Build the multiple alignment by first aligning
    the most similar pair of sequences, then the next
    most similar pair and so on. Once an alignment of
    two sequences has been made, then this is fixed.
    Thus for a set of sequences A, B, C, D having
    aligned A with C and B with D the alignment of A,
    B, C, D is obtained by comparing the alignments
    of A and C with that of B and D using averaged
    scores at each aligned position.
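
The first two steps (all-against-all comparison, then clustering into a hierarchy) can be sketched as below. This is only a rough illustration: `difflib.SequenceMatcher` stands in for a real pairwise alignment score, the clustering is a plain average-linkage loop, and the sequences and names are invented for the example.

```python
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a, b):
    """Stand-in for a pairwise alignment score (assumption: difflib ratio, 0 to 1)."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()

def average_similarity(members_x, members_y, seqs):
    """Mean pairwise similarity between the members of two clusters."""
    pairs = [(p, q) for p in members_x for q in members_y]
    return sum(similarity(seqs[p], seqs[q]) for p, q in pairs) / len(pairs)

def merge_order(seqs):
    """Return the order in which clusters would be aligned, most similar first."""
    clusters = {name: (name,) for name in seqs}  # cluster label -> member names
    merges = []
    while len(clusters) > 1:
        x, y = max(
            combinations(sorted(clusters), 2),
            key=lambda pair: average_similarity(clusters[pair[0]], clusters[pair[1]], seqs),
        )
        clusters[f"({x},{y})"] = clusters.pop(x) + clusters.pop(y)
        merges.append((x, y))
    return merges

toy = {
    "A": "ACGTACGTAC",
    "B": "TTGGCCAATT",
    "C": "ACGTACGTAG",
    "D": "TTGGCCTTAA",
}
print(merge_order(toy))
```

For these toy sequences the order mirrors the A, B, C, D case in the text: A joins C, B joins D, and the two groups are combined last. The third step, aligning two groups with averaged column scores, is not shown here.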

31
Gap Penalties
  • In the MSA scoring scheme, a penalty is
    subtracted for each gap introduced into an
    alignment because the gap introduces uncertainty
    into the alignment
  • The gap penalty is used to help decide whether or
    not to accept a gap or insertion in an alignment
  • Biologically, it should in general be easier for
    a sequence to accept a different residue in a
    position, rather than having parts of the
    sequence chopped away or inserted.
    Gaps/insertions should therefore be more rare
    than point mutations (substitutions)
  • In general, the lower the gapping penalties, the
    more gaps and more identities are detected but
    this should be considered in relation to
    biological significance
  • Most MSA programs allow for an adjustment of gap
    penalties
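
The trade-off between substitutions and gaps can be illustrated by scoring two candidate alignments of the same pair. The sketch assumes an affine scheme (one charge for opening a gap, a smaller one for each extension); the parameter names and values are illustrative, not those of any particular program.

```python
def score_alignment(row_a, row_b, match=2, mismatch=-1, gap_open=-5, gap_extend=-1):
    """Score two aligned rows; each gap run costs gap_open plus gap_extend per extra column."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    total = 0
    in_gap = False
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            # Simplification: adjacent gap columns count as one run,
            # even if the gap switches from one row to the other.
            total += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            total += match if x == y else mismatch
            in_gap = False
    return total

# Same two sequences, aligned with one substitution or with two gaps.
print(score_alignment("ACGTA", "ACTTA"))
print(score_alignment("ACG-TA", "AC-TTA"))
# With very low gap penalties the gapped version is no longer penalised relative to it.
print(score_alignment("ACG-TA", "AC-TTA", gap_open=-1, gap_extend=0))
```

With the default penalties the substitution wins clearly; once the penalties are lowered the gapped alignment scores just as well, which is why low penalties produce more gaps.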

32
The PILEUP Algorithm
  • First, PILEUP calculates approximate pairwise
    similarity scores between all sequences to be
    aligned, and they are clustered into a dendrogram
    (tree structure).
  • Then the most similar pairs of sequences are
    aligned.
  • Averages (similar to consensus sequences) are
    calculated for the aligned pairs.
  • New sequences and clusters of sequences are added
    one by one, according to the branching order in
    the dendrogram.
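
The "averages" computed for aligned groups can be thought of as per-column residue frequencies. The sketch below is a generic illustration of that idea and is not PILEUP's actual implementation; the identity-based scoring and the function names are assumptions.

```python
from collections import Counter

def profile(rows):
    """Per-column residue frequencies (gaps counted as a symbol) for aligned rows."""
    return [
        {residue: count / len(rows) for residue, count in Counter(column).items()}
        for column in zip(*rows)
    ]

def column_score(col_a, col_b, match=1.0, mismatch=0.0):
    """Average score between two profile columns, weighted by residue frequencies."""
    return sum(
        freq_a * freq_b * (match if res_a == res_b else mismatch)
        for res_a, freq_a in col_a.items()
        for res_b, freq_b in col_b.items()
    )

group_1 = profile(["AC", "AG"])   # toy aligned pair
group_2 = profile(["AC", "TC"])   # another toy aligned pair
print(group_1)
print(column_score(group_1[0], group_2[0]))
```

A real program would use an amino acid comparison matrix in place of the simple match/mismatch values, but the averaging over both groups works the same way.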

33
Choosing sequences for MSA
  • As far as possible, try to align sequences of
    similar length.
  • Pileup can align sequences of up to 5000
    residues, with 2000 gaps (total 7000 characters).
  • Pileup is a good program only for similar (close)
    sequences.

34
PileUp considerations
  • PileUp does global multiple alignment, and
    therefore is good for a group of similar
    sequences.
  • PileUp will fail to find the best local region of
    similarity (such as a shared motif) among distantly
    related sequences.
  • PileUp always aligns all of the sequences you
    specified in the input file, even if they are not
    related.
  • The alignment can be degraded if some of the
    sequences are only distantly related.

35
PILEUP Considerations
  • Since the alignment is calculated on a
    progressive basis, the order of the initial
    sequences can affect the final alignment.
  • PILEUP parameters: two gap penalties (gap insert
    and gap extend) and an amino acid comparison
    matrix.
  • PILEUP will refuse to align sequences that
    require too many gaps or mismatches.
  • PILEUP will take quite a while to align more than
    about 10 sequences

36
CLUSTAL
  • CLUSTAL is a stand-alone (i.e. not integrated
    into GCG) multiple alignment program that is
    superior in some respects to PILEUP
  • Works by progressive alignment: it aligns a pair
    of sequences, then aligns the next one onto the
    first pair
  • Most closely related sequences are aligned first,
    and then additional sequences and groups of
    sequences are added, guided by the initial
    alignments
  • Uses alignment scores to produce a phylogenetic
    tree

37
CLUSTAL
  • Aligns the sequences sequentially, guided by the
    phylogenetic relationships indicated by the tree
  • Gap penalties can be adjusted based on specific
    amino acid residues, regions of hydrophobicity,
    proximity to other gaps, or secondary structure
  • Is available with a great web interface:
    http://www.ebi.ac.uk/clustalw/
  • Also available in Biology Workbench

38
Multiple Alignment tools on the Web
  • There are a variety of multiple alignment tools
    available for free on the web.
  • CLUSTAL is available from a number of sites (with
    a variety of restrictions)
  • Other algorithms are available too

39
Muscle Algorithm Using The Iteration
40
Consistency-Based Algorithms: T-Coffee
  • Gotoh (1990)
    • Iterative strategy using consistency
  • Martin Vingron (1991)
    • Dot matrix multiplications
    • Accurate but too stringent
  • Dialign (1996, Morgenstern)
    • Consistency
    • Agglomerative assembly
  • T-Coffee (2000, Notredame)
    • Consistency
    • Progressive algorithm

41
(No Transcript)
42
T-Coffee and Consistency
43
T-Coffee and Consistency
44
T-Coffee and Consistency
45
T-Coffee and Consistency
46
T-Coffee and Consistency
47
APPROXIMATE, FAST
ACCURATE, SLOW
48
Some URLs
  • EMBL-EBI
    • http://www.ebi.ac.uk/clustalw/
  • BCM Search Launcher Multiple Alignment
    • http://dot.imgen.bcm.tmc.edu:9331/multi-align/multi-align.html
  • Multiple Sequence Alignment for Proteins (Wash.
    U. St. Louis)
    • http://www.ibc.wustl.edu/service/msa/

49
Editing and displaying alignments
  • Sequence editors are used for
    • manual alignment/editing of sequences
    • visualization of data
    • data management
    • import/export of data
    • graphical enhancement of data for presentations

50
Editing Multiple Alignments
  • There are a variety of tools that can be used to
    modify a multiple alignment.
  • These programs can be very useful in formatting
    and annotating an alignment for publication.
  • An editor can also be used to make modifications
    by hand to improve biologically significant
    regions in a multiple alignment created by one of
    the automated alignment programs.

51
Displaying a multiple alignment in GCG
  • There are several programs to display the
    multiple alignment prettily.
  • The Pretty program prints sequences with their
    columns aligned and can display a consensus for
    the alignment, allowing you to look at
    relationships among the sequences.
  • The PrettyBox program displays the alignment
    graphically with the conserved regions of the
    alignment as shaded boxes. The output is in
    Postscript format.

52
Example of PrettyBox Output
53
GCG alignment editors
  • Alignments produced with PILEUP (or CLUSTAL) can
    be adjusted with LINEUP.
  • Nicely shaded printouts can be produced with
    PRETTYBOX
  • GCG's SeqLab X-Windows interface has a superb
    multiple sequence editor - the best editor of any
    kind.

54
(No Transcript)
55
Other editors
  • The MACAW and SeqVu programs for Macintosh and
    GeneDoc and DCSE for PCs are free and provide
    excellent editor functionality.
  • Many comprehensive molecular biology programs
    include multiple alignment functions
  • MacVector, OMIGA, Vector NTI, and
    GeneTool/PepTool all include a built-in version
    of CLUSTAL

56
SeqVu
57
CINEMA
  • CINEMA (Colour INteractive Editor for Multiple
    Alignments)
  • It is an editor created completely in JAVA (old
    browsers beware)
  • It includes a fully functional version of
    CLUSTAL, BLAST, and a DotPlot module

http://www.bioinf.man.ac.uk/dbbrowser/CINEMA2.1/
58
Informative Colors
  • By default, the alignment is coloured crudely
    according to residue type (proline and glycine
    have special structural properties, particularly
    in membrane proteins, so are grouped separately;
    similarly for cysteine, which is often involved
    in disulphide bond formation)
    • Polar positive (H, K, R): Blue
    • Polar negative (D, E): Red
    • Polar neutral (S, T, N, Q): Green
    • Non-polar aliphatic (A, V, L, I, M): White
    • Non-polar aromatic (F, Y, W): Purple
    • P, G: Brown
    • C: Yellow
    • Special characters (B, Z, X, -): Grey
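
The colour scheme listed above amounts to a lookup table from residue to colour. The sketch below encodes it as a Python dictionary; falling back to grey for any unlisted symbol is an assumption of this example, and the names are illustrative rather than part of CINEMA.

```python
COLOUR_GROUPS = [
    ("HKR", "blue"),      # polar positive
    ("DE", "red"),        # polar negative
    ("STNQ", "green"),    # polar neutral
    ("AVLIM", "white"),   # non-polar aliphatic
    ("FYW", "purple"),    # non-polar aromatic
    ("PG", "brown"),      # special structural properties
    ("C", "yellow"),      # disulphide bond formation
    ("BZX-", "grey"),     # special characters and gaps
]

# Flatten the groups into a residue -> colour lookup table.
RESIDUE_COLOURS = {residue: colour
                   for residues, colour in COLOUR_GROUPS
                   for residue in residues}

def colour_row(row):
    """Colour name for each position of an aligned row (unknown symbols -> grey)."""
    return [RESIDUE_COLOURS.get(residue, "grey") for residue in row.upper()]

print(colour_row("VTISC-"))
```

Applied column by column, the same table makes conserved physicochemical classes stand out even where the residues themselves differ.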

59
(No Transcript)