Title: Recent Progress in Multiple Sequence Alignments: A Survey


1
Recent Progress in Multiple Sequence
Alignments: A Survey
Cédric Notredame
2
Our Scope
What Are the Existing Methods?
How Do They Work?
-Assembly Algorithms
-Weighting Schemes
When Do They Work?
Which Future?
3
Outline
-Introduction
-A taxonomy of the existing Packages
-A few algorithms
-Performance Comparison using BaliBase
4
Introduction
5
What Is A Multiple Sequence Alignment?
An MSA is a MODEL
It Indicates the RELATIONSHIP between residues of
different sequences.
It REVEALS:
-Similarities
-Inconsistencies
6
How Can I Use A Multiple Sequence Alignment?
  chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
  wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
  trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
  mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP

  chite AATAKQNYIRALQEYERNGG-
  wheat ANKLKGEYNKAIAAYNKGESA
  trybr AEKDKERYKREM---------
  mouse AKDDRIRYDNEMKSWEEQMAE

Multiple Alignments Are CENTRAL to MOST
Bioinformatics Techniques:
-Extrapolation
-Motifs/Patterns
-Profiles
-Phylogeny
-Struc. Prediction
7
How Can I Use A Multiple Sequence Alignment?
Multiple Alignment Is the Most INTEGRATIVE
Method Available Today.
We Need MSAs to INCORPORATE existing DATA
8
Why Is It Difficult To Compute A multiple
Sequence Alignment?
A CROSSROAD PROBLEM
9
Why Is It Difficult To Compute A Multiple
Sequence Alignment?
BIOLOGY
COMPUTATION
CIRCULAR PROBLEM: Good Sequences <-> Good Alignment
10
A Taxonomy of Multiple Sequence Alignment Methods
11
Grouping According to the assembly Algorithm
12
Simultaneous As Opposed to Progressive
Simultaneous: they use all the information simultaneously
Exact As Opposed to Heuristic
Heuristics: cut corners, like Blast vs SW
Heuristics: do not guarantee an optimal solution
Stochastic As Opposed to Deterministic
Stochastic: contain an element of randomness
Stochastic: example of a Monte Carlo surface estimation
Iterative As Opposed to Non-Iterative
Iterative: run the same algorithm many times
Iterative: most stochastic methods are iterative
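The Monte Carlo surface estimation mentioned on this slide can be sketched as follows. This is a minimal illustration, not code from the presentation: the function names, the unit disc used as the shape, and the sample count are all assumptions chosen for the example.

```python
import random


def estimate_area(inside, x_range, y_range, n_samples=100_000, seed=0):
    """Estimate a surface by sampling random points in a bounding box.

    `inside` is a predicate telling whether a point belongs to the shape.
    The estimate is the box area times the fraction of points that hit.
    """
    rng = random.Random(seed)  # seeded so that a run can be reproduced
    (x0, x1), (y0, y1) = x_range, y_range
    hits = 0
    for _ in range(n_samples):
        x = rng.uniform(x0, x1)
        y = rng.uniform(y0, y1)
        if inside(x, y):
            hits += 1
    box_area = (x1 - x0) * (y1 - y0)
    return box_area * hits / n_samples


def in_unit_disc(x, y):
    """Hypothetical example shape: the disc of radius 1 centred on the origin."""
    return x * x + y * y <= 1.0


if __name__ == "__main__":
    # The exact surface of the unit disc is pi; the estimate fluctuates around it.
    print(estimate_area(in_unit_disc, (-1, 1), (-1, 1)))
```

Changing the seed changes the answer slightly, which is the element of randomness the slide refers to; running more samples (or more iterations) narrows the spread.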
13
OMA
Praline
MAFFT
14
Stochastic
SAGA
16
Grouping According to the Objective Function
17
Scoring an Alignment: Evolutionary based methods
BIOLOGY: How many events separate my sequences?
Such an evaluation relies on a biological
model.
COMPUTATION: Every position must be independent
18
REAL Tree
Model: ALL the sequences evolved from the same
ancestor
[Figure: tree labelled with the residues A, C, A, A, C; Tree Cost = 1]
PROBLEM: We do not know the true tree
19
STAR Tree
Model: ALL the sequences have the same ancestor
[Figure: star tree labelled with the residues A, C, A, A, C, A; Star Tree Cost = 2]
PROBLEM: the star tree is phylogenetically wrong
20
Sums of Pairs
Model: Every sequence is the ancestor of every other
sequence
PROBLEM:
-Over-estimation of the mutation costs
-Requires a weighting scheme
21
Sums of Pairs: Some of its limitations (Durbin,
p140)
22
Sums of Pairs: Some of its limitations (Durbin,
p140)
Conclusion: the more Leucines, the less expensive
it gets to add a Glycine to the column...
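The conclusion above can be checked numerically with a small sketch. The scores used here (+5 for an identical pair, -4 for a Leucine/Glycine pair) are illustrative assumptions, and the function names are hypothetical. The sketch compares a fully conserved Leucine column with the same column where one Leucine is replaced by a Glycine.

```python
from itertools import combinations


def sum_of_pairs(column, match=5, mismatch=-4):
    """Sum-of-pairs score of one column: every pair of residues is scored once."""
    return sum(match if a == b else mismatch
               for a, b in combinations(column, 2))


def relative_penalty(n, match=5, mismatch=-4):
    """Relative score drop when one of n Leucines becomes a Glycine.

    Returns (conserved - mutated) / conserved, i.e. the penalty expressed
    as a fraction of the score of the fully conserved column.
    """
    conserved = sum_of_pairs("L" * n, match, mismatch)
    mutated = sum_of_pairs("L" * (n - 1) + "G", match, mismatch)
    return (conserved - mutated) / conserved


if __name__ == "__main__":
    for n in (3, 5, 10, 50):
        print(n, round(relative_penalty(n), 3))
```

With these assumed scores the relative penalty works out to 18 / (5 n), so it shrinks as sequences are added, whereas one would expect a lone Glycine to look more surprising in a larger conserved column.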
23
Entropy based Functions
Model: Minimize the entropy (variety) in each
Column
Score of column i: $S_i = -\sum_{a} c_{ia} \log p_{ia}$
$c_{ia}$: number of residues of type a (e.g. Alanine) in column i
$a$: a symbol of the alphabet
$p_{ia}$: frequency of a in column i; P can incorporate pseudocounts
$S_i = 0$ if the column is conserved
PROBLEM:
-Requires a simultaneous alignment
-Assumes independent sequences
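A minimal sketch of an entropy-style column score follows. It assumes the score takes the form S_i = -sum over a of c_ia * log(p_ia), with c_ia the count of residue a in column i and p_ia its (optionally pseudocounted) frequency; the function name, the default alphabet and the way the pseudocount is applied are assumptions made for this example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_entropy_score(column, alphabet=AMINO_ACIDS, pseudocount=0.0):
    """Entropy-style score of one alignment column (lower is better).

    Gaps and symbols outside the alphabet are ignored. With no pseudocount,
    a fully conserved column scores exactly 0.
    """
    counts = Counter(c for c in column if c in alphabet)
    total = sum(counts.values()) + pseudocount * len(alphabet)
    score = 0.0
    for symbol in alphabet:
        count = counts.get(symbol, 0)
        if count == 0:
            continue  # absent residues contribute nothing to the sum
        probability = (count + pseudocount) / total
        score -= count * math.log(probability)
    return score


if __name__ == "__main__":
    print(column_entropy_score("AAAA"))  # conserved column
    print(column_entropy_score("AACC"))  # two residue types, higher score
```

The score of an alignment is then the sum over its columns, which is why the slide notes that every column (and every sequence) is treated as independent.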
24
Consistency based Functions
Model: Maximise the consistency (agreement) with
a list of constraints (alignments)
k and l are sequences, i is a column
The score counts the pairs in which the two residues are found aligned in the list
of constraints
PROBLEM: requires a list of constraints
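One way to picture a consistency-based objective is sketched below. It is an assumption-laden illustration, not the formula from the slide: the constraint list is modelled as a dictionary mapping (sequence k, residue index in k, sequence l, residue index in l) to a weight, with k before l in alphabetical order, and all names are hypothetical.

```python
from itertools import combinations


def residue_indices(aligned_row):
    """For each column, give the 1-based residue index found there (None for a gap)."""
    indices, position = [], 0
    for char in aligned_row:
        if char == "-":
            indices.append(None)
        else:
            position += 1
            indices.append(position)
    return indices


def consistency_score(alignment, constraints):
    """Sum the weights of the aligned residue pairs that appear in `constraints`.

    alignment:   dict mapping sequence name -> gapped row (all rows same length)
    constraints: dict mapping (k, index_in_k, l, index_in_l) -> weight, k < l
    """
    rows = {name: residue_indices(row) for name, row in alignment.items()}
    n_columns = len(next(iter(rows.values())))
    score = 0
    for k, l in combinations(sorted(rows), 2):
        for column in range(n_columns):
            res_k, res_l = rows[k][column], rows[l][column]
            if res_k is not None and res_l is not None:
                score += constraints.get((k, res_k, l, res_l), 0)
    return score


if __name__ == "__main__":
    msa = {"A": "AC-G", "B": "ACTG"}
    library = {("A", 1, "B", 1): 1, ("A", 3, "B", 4): 1, ("A", 2, "B", 3): 1}
    print(consistency_score(msa, library))
```

In this toy case two of the three constraints are satisfied by the alignment; the third (residue 2 of A with residue 3 of B) is not, so it adds nothing.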
26
A few Multiple Sequence Alignment Algorithms
27
A Few Algorithms
MSA and DCA
POA
ClustalW
MAFFT
Dialign II
Prrp
SAGA
GIBBS Sampler
28
Simultaneous: MSA and DCA
29
Simultaneous Alignments: MSA
1) Set bounds on each pair of sequences (Carrillo
and Lipman)
2) Compute the multiple alignment (MALN) within the bounded hyperspace
-Few, Small, Closely Related Sequences.
-Memory and CPU hungry
-Do Well When They Can Run.
30
MSA: the Carrillo and Lipman bounds
  chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
  wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
Pairwise projection of sequences k and l
31
MSA: the Carrillo and Lipman bounds
$a(k,l)$: score of the projection of k and l in the optimal MSA
$\sum_{x<y} a(x,y)$: score of the complete multiple alignment
$\hat{a}(k,l)$: score of the optimal pairwise alignment of k and l
32
MSA: the Carrillo and Lipman bounds
$L_M$: a lower bound for the score of the complete MSA
$L_M \le \sum_{x<y} \hat{a}(x,y) - (\hat{a}(k,l) - a(k,l))$
$a(k,l) \ge L_M + \hat{a}(k,l) - \sum_{x<y} \hat{a}(x,y)$
33
MSA: the Carrillo and Lipman bounds
$L_M + \hat{a}(k,l) - \sum_{x<y} \hat{a}(x,y) \le a(k,l) \le \hat{a}(k,l)$
$L_M$ can be measured on ANY heuristic alignment $\ddot{a}$:
$L_M = \sum_{x<y} \ddot{a}(x,y)$
The better $L_M$, the tighter the bounds
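The bound on each pairwise projection can be computed directly once the optimal pairwise scores and the projections of a heuristic alignment are known. The sketch below assumes scores are maximised and that each pair of sequences is a dictionary key; the function name and the numbers in the example are hypothetical.

```python
def carrillo_lipman_bounds(optimal_pair_scores, heuristic_pair_scores):
    """Lower bound on the projection score a(k,l) of every sequence pair.

    optimal_pair_scores:   dict pair -> score of the optimal pairwise alignment
    heuristic_pair_scores: dict pair -> score of that pair's projection in a
                           heuristic multiple alignment (their sum is L_M)
    Returns dict pair -> L_M + optimal(k,l) - sum of all optimal scores.
    """
    total_optimal = sum(optimal_pair_scores.values())
    lower_bound_msa = sum(heuristic_pair_scores.values())  # L_M
    return {pair: lower_bound_msa + best - total_optimal
            for pair, best in optimal_pair_scores.items()}


if __name__ == "__main__":
    # Made-up scores for three sequences A, B and C.
    optimal = {("A", "B"): 100, ("A", "C"): 90, ("B", "C"): 80}
    heuristic = {("A", "B"): 95, ("A", "C"): 85, ("B", "C"): 78}
    for pair, bound in carrillo_lipman_bounds(optimal, heuristic).items():
        print(pair, "projection must score at least", bound,
              "and at most", optimal[pair])
```

A better heuristic alignment raises L_M and therefore every lower bound, which leaves a narrower band of pairwise alignments to explore.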
34
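MSA: the Carrillo and Lipman bounds
[Figure: forward and backward dynamic programming matrices with axes 0..M and 0..N; the forward pass gives Best(0-i, 0-j), the backward pass gives Best(M-i, N-j)]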
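Assuming the forward and backward matrices on this slide serve to score the best pairwise alignment passing through each cell (i, j), the idea can be sketched as below. The scoring scheme (match, mismatch, linear gap) and all function names are assumptions made for the example.

```python
def forward_matrix(x, y, match=1, mismatch=-1, gap=-2):
    """f[i][j] = best global alignment score of x[:i] against y[:j]."""
    m, n = len(x), len(y)
    f = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        f[i][0] = i * gap
    for j in range(1, n + 1):
        f[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            f[i][j] = max(f[i - 1][j - 1] + s,
                          f[i - 1][j] + gap,
                          f[i][j - 1] + gap)
    return f


def backward_matrix(x, y, **scores):
    """b[i][j] = best global alignment score of x[i:] against y[j:].

    Aligning two suffixes scores the same as aligning the reversed strings'
    prefixes, so the forward routine is reused on the reversed sequences.
    """
    m, n = len(x), len(y)
    r = forward_matrix(x[::-1], y[::-1], **scores)
    return [[r[m - i][n - j] for j in range(n + 1)] for i in range(m + 1)]


def best_through_each_cell(x, y, **scores):
    """Score of the best alignment forced to pass through every cell (i, j)."""
    f = forward_matrix(x, y, **scores)
    b = backward_matrix(x, y, **scores)
    return [[f[i][j] + b[i][j] for j in range(len(y) + 1)]
            for i in range(len(x) + 1)]
```

Cells whose forward plus backward score falls below the lower bound for that pair can be excluded from the search, which is how the bounds prune the hyperspace.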
35
Simultaneous Alignments: MSA
1) Set bounds on each pair of sequences (Carrillo
and Lipman)
2) Compute the multiple alignment (MALN) within the bounded hyperspace
-Few, Small, Closely Related Sequences.
-Memory and CPU hungry
-Do Well When They Can Run.
36
Simultaneous Alignments: DCA
37
Simultaneous, With a New Sequence
Representation: POA (Partial Order Graph)
40
POA
POA makes it possible to represent complex
relationships:
-domain deletion
-domain inversions
41
Progressive: ClustalW
42
Progressive Alignment: ClustalW
Feng and Doolittle, 1988; Taylor, 198ç
Clustering
43
Progressive Alignment: ClustalW
44
Tree based Alignment: Recursive Algorithm
Align (Node N)
{
    if (N->left_child is a Node) A1 = Align (N->left_child)
    else if (N->left_child is a Sequence) A1 = N->left_child
    if (N->right_child is a Node) A2 = Align (N->right_child)
    else if (N->right_child is a Sequence) A2 = N->right_child
    Return dp_alignment (A1, A2)
}
[Figure: guide tree with leaves A, D, F, G, C, E, B]
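The recursive algorithm on this slide can be sketched in runnable form. This is a simplified illustration, not ClustalW: the guide tree is assumed to be a nested tuple whose leaves are sequences, and dp_alignment is a plain Needleman-Wunsch over alignment columns with unweighted sum-of-pairs scoring and a linear gap cost. All scoring values are assumptions.

```python
def column_score(col_a, col_b, match=1, mismatch=-1, gap=-1):
    """Sum-of-pairs score between a column of one group and one of the other."""
    total = 0
    for a in col_a:
        for b in col_b:
            if a == "-" and b == "-":
                continue  # a gap facing a gap costs nothing
            total += gap if "-" in (a, b) else (match if a == b else mismatch)
    return total


def dp_alignment(rows_a, rows_b):
    """Align two groups of already aligned rows; returns the merged rows."""
    cols_a, cols_b = list(zip(*rows_a)), list(zip(*rows_b))
    m, n = len(cols_a), len(cols_b)
    gap_a, gap_b = ("-",) * len(rows_a), ("-",) * len(rows_b)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = score[i - 1][0] + column_score(cols_a[i - 1], gap_b)
    for j in range(1, n + 1):
        score[0][j] = score[0][j - 1] + column_score(gap_a, cols_b[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + column_score(cols_a[i - 1], cols_b[j - 1]),
                score[i - 1][j] + column_score(cols_a[i - 1], gap_b),
                score[i][j - 1] + column_score(gap_a, cols_b[j - 1]))
    merged, i, j = [], m, n  # trace back from the bottom-right corner
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == (
                score[i - 1][j - 1] + column_score(cols_a[i - 1], cols_b[j - 1])):
            merged.append(cols_a[i - 1] + cols_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == (
                score[i - 1][j] + column_score(cols_a[i - 1], gap_b)):
            merged.append(cols_a[i - 1] + gap_b)
            i -= 1
        else:
            merged.append(gap_a + cols_b[j - 1])
            j -= 1
    merged.reverse()
    return ["".join(row) for row in zip(*merged)]


def align(node):
    """Recursive traversal: a leaf is a sequence, a node is a (left, right) pair."""
    if isinstance(node, str):
        return [node]
    left, right = node
    return dp_alignment(align(left), align(right))


if __name__ == "__main__":
    for row in align((("ACGT", "ACT"), "AGT")):
        print(row)
```

Gaps introduced when two groups are merged are never revisited, which is why the order imposed by the tree matters so much in a progressive method.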
45
Progressive Alignment: ClustalW
-Depends on the CHOICE of the sequences.
-Depends on the ORDER of the sequences (Tree).
-Depends on the PARAMETERS:
  • Substitution Matrix.
  • Penalties (Gop, Gep).
  • Sequence Weight.
  • Tree making Algorithm.

46
Weighting Within ClustalW
47
Position Specific GOP
48
ClustalW is the most Popular Method
49
Progressive Alignment With a Heuristic DP: MAFFT
51
Progressive and Consistency Based: Dialign II
52
1) Identify the best chain of segments on each pair of sequences.
...
3) Assemble the alignment according to the
segment pairs.
53
-May Align Too Few Residues
-No Gap Penalty
-Does well with ESTs
54
Progressive and Consistency Based: T-COFFEE
55
Mixing Local and Global Alignments
[Figure: Local Alignment + Global Alignment -> Extension -> Multiple Sequence Alignment]
56
What is a library?
Three-sequence example:
  3
  Seq1 anotherseq
  Seq2 atsecondone
  Seq3 athirdone
  1 2
  1 1 25
  1 3
  3 8 70
  .
Two-sequence example:
  2
  Seq1 MySeq
  Seq2 MyotherSeq
  1 2
  1 1 25
  3 8 70
  .
Extension
T-Coffee
Library Based Multiple Sequence Alignment
57
Iterative
58
Iterative Methods
-HMMs, HMMER, SAM.
-Slow, Sometimes Inaccurate
-Good Profile Generators
59
Iterative Methods: Prrp
1) Initial Alignment
2) Outer Iteration: Tree and weights computation
3) Inner Iteration: Realign two sub-groups
4) Alignment converged? NO: back to 3). YES: go to 5)
5) Weights converged? NO: back to 2). YES: End
60
Iterative Stochastic: SAGA, The Genetic Algorithm
64
Automatic scheduling of the operators
66
Weighting Schemes
67
The Problem
The sequences Contain Correlated Information
Most scoring Schemes Ignore this Correlation
68
Weighting Sequence Pairs with a Tree: Carrillo
and Lipman Rationale I
69
QUESTION: Which Weight for a Pair of Sequences?
E = Edge
P = Evolutionary Path from A to X
E must contribute the same weight to every path P
that goes through it.
[Figure: tree labelled A, G, F, C, B, D, E]
All the weights using E must sum to 1: $\sum_{P} W_{P,E} = 1$.
70
USAGE
71
PROBLEM: Weight Depends only on the Tree topology
72
Weighting Sequences with a Tree: Clustal W Weights
73
QUESTION: Which Weight for Sequences?
[Figure: tree with leaves A, F, C, B, G, D, E]
74
USAGE
75
PROBLEM: Overweight of distant sequences
[Figure: tree with leaves G, F, D, E and C]
-C Will dominate the Alignment
-C Will be very Difficult to align
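The overweighting problem can be illustrated with a small sketch of tree-based sequence weights in the ClustalW style, where each branch length is shared equally among the leaves below it and a sequence's weight is the sum of its shares from root to leaf. The tree encoding (a leaf is a name, an internal node is a list of (child, branch_length) pairs), the function names and the branch lengths are assumptions made for this example.

```python
def leaves(node):
    """List the leaf names found under a node."""
    if isinstance(node, str):
        return [node]
    return [name for child, _ in node for name in leaves(child)]


def tree_weights(node, inherited=0.0, weights=None):
    """Weight of each leaf: sum over its root-to-leaf branches of
    (branch length) / (number of leaves sharing that branch)."""
    if weights is None:
        weights = {}
    if isinstance(node, str):
        weights[node] = inherited
        return weights
    for child, length in node:
        share = length / len(leaves(child))
        tree_weights(child, inherited + share, weights)
    return weights


if __name__ == "__main__":
    # Hypothetical tree: A and B are close relatives, C sits on a long branch.
    close_pair = [("A", 0.1), ("B", 0.1)]
    root = [(close_pair, 0.2), ("C", 0.8)]
    print(tree_weights(root))
```

In this toy tree A and B each end up with 0.2 while C receives 0.8, so the divergent sequence, which is also the hardest one to align, carries most of the weight.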
76
Performance Comparison Using Collections of
Reference Alignments: BaliBase and Ribosomal RNA
77
What Is BaliBase
BaliBase is a collection of reference Multiple
Alignments
The structures of the sequences are known and were
used to assemble the MALN.
Evaluation is carried out by comparing the
Structure Based Reference Alignment with its
Sequence Based Counterpart
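One common way to carry out such a comparison is to measure the fraction of residue pairs aligned in the reference that are also aligned in the alignment under test. The sketch below assumes that measure; it is not the BaliBase scoring program, and the function names and toy alignments are hypothetical.

```python
from itertools import combinations


def aligned_pairs(alignment):
    """Set of (seq_k, residue_k, seq_l, residue_l) pairs sharing a column.

    alignment: dict mapping sequence name -> gapped row (rows of equal length).
    Residue indices are 1-based and count only non-gap characters.
    """
    names = sorted(alignment)
    counters = {name: 0 for name in names}
    pairs = set()
    for column in range(len(alignment[names[0]])):
        present = []
        for name in names:
            if alignment[name][column] != "-":
                counters[name] += 1
                present.append((name, counters[name]))
        pairs.update((k, i, l, j) for (k, i), (l, j) in combinations(present, 2))
    return pairs


def sum_of_pairs_accuracy(test, reference):
    """Fraction of the reference's aligned residue pairs recovered by `test`."""
    reference_pairs = aligned_pairs(reference)
    return len(reference_pairs & aligned_pairs(test)) / len(reference_pairs)


if __name__ == "__main__":
    reference = {"s1": "AC-GT", "s2": "ACAGT"}
    candidate = {"s1": "ACG-T", "s2": "ACAGT"}
    print(sum_of_pairs_accuracy(candidate, reference))
```

Here the candidate misplaces one residue of s1, so it recovers three of the four reference pairs.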
78
What Is BaliBase
[Figure: the BaliBase reference alignment (DALI, Sap) is compared with the alignment produced by Method X]
What Is BaliBase
Source: BaliBase, Thompson et al., NAR, 1999
[Table: Description, PROBLEM]
80
Choosing The Right Method
81
Choosing The Right Method (POA Evaluation)
82
Choosing The Right Method (POA Evaluation)
83
Choosing The Right Method (MAFFT evaluation)
84
Choosing The Right Method (MAFFT evaluation)
85
Choosing The Right Method (MAFFT evaluation)
86
Conclusion
87
Which Method?
Source: BaliBase, Thompson et al., NAR, 1999
[Table: Strategy, PROBLEM]
88
Methods / Situations
1-Carrillo and Lipman
-MSA, DCA.
-Few, Small, Closely Related Sequences.
-Do Well When They Can Run.
2-Segment Based
-DIALIGN, MACAW.
-May Align Too Few Residues
-Good For Long Indels
3-Iterative
-HMMs, HMMER, SAM.
-Slow, Sometimes Inaccurate
-Good Profile Generators
4-Progressive
-ClustalW, Pileup, Multalign
-Fast and Sensitive
89
Addresses
MAFFT (Progressive): www.biophys.kyoto-u.jp/katoh
POA (Progressive/Simultaneous): www.bioinformatics.ucla.edu/poa
MUSCLE (Progressive/Iterative): www.drive5.com/muscle/