Title: Growing Trees on the Right Compost
1 Growing Trees on the Right Compost
- Cédric Notredame
- Comparative Bioinformatics Group
- Bioinformatics and Genomics Program
2 Mangel M., Samaniego F.J., Abraham Wald's Work
on Aircraft Survivability, J. American
Statistical Association 79, 259-270 (1984)
4 What's in a Multiple Sequence Alignment?
- Evolution, Inertia: Common Ancestry Shows Up in the Sequences
- Selection: Important Features Are Preserved
- Functional Constraint: Same Function, Same Sequence; Convergence
- Phylogenetic Footprint, Evolutionary Trace
5 Why So Much Interest in Multiple Alignments?
Extrapolation
Structure Prediction
Motifs/Patterns
SNP Analysis
Profiles
Regulatory Elements
Phylogeny
Reactivity Analysis
6 What's in a Multiple Alignment?
- The MSA contains what you put inside
- Structural Similarity
- Evolutionary Similarity
- Sequence Similarity
- You can view your MSA as
- A record of evolution
- A summary of a protein family
- A collection of experiments made for you by Nature
7 Producing The Right Alignment
- Multiple Sequence Alignments Influence Phylogenetic Trees
- Choice of Method is not Neutral
- Different Methods
- Different Alignments
- Different Trees
- Using the Right Models Ensures Producing the Right Tree
8 Model Based Alignments vs Naïve Alignments
- Naïve Alignment
- Lexicographic Alignment
- Maximizing the number of identities
- At best using a substitution matrix
- Model Based Alignments
- Using a model
- Protein structure information
- RNA Structure information
- Combining/Confronting Modeling methods
- Template based Alignments
- Model based Alignments through the use of Templates
9 T-Coffee and Model Based Alignments
- T-Coffee Algorithm
- Expresso: Aligning Protein Structures
- R-Coffee: Aligning RNA Structures
- M-Coffee: Combining Methods
10 T-Coffee: An Extension of the Progressive Alignment Algorithm
11 T-Coffee and Consistency
SeqA GARFIELD THE LAST FAT CAT
SeqB GARFIELD THE FAST CAT
SeqC GARFIELD THE VERY FAST CAT
SeqD THE FAT CAT

SeqA GARFIELD THE LAST FA-T CAT
SeqB GARFIELD THE FAST CA-T ---
SeqC GARFIELD THE VERY FAST CAT
SeqD -------- THE ---- FA-T CAT
12 T-Coffee and Consistency
SeqA GARFIELD THE LAST FAT CAT     Prim. Weight 88
SeqB GARFIELD THE FAST CAT ---

SeqA GARFIELD THE LAST FA-T CAT    Prim. Weight 77
SeqC GARFIELD THE VERY FAST CAT

SeqA GARFIELD THE LAST FAT CAT     Prim. Weight 100
SeqD -------- THE ---- FAT CAT

SeqB GARFIELD THE ---- FAST CAT    Prim. Weight 100
SeqC GARFIELD THE VERY FAST CAT

SeqC GARFIELD THE VERY FAST CAT    Prim. Weight 100
SeqD -------- THE ---- FA-T CAT
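The primary weights of the pairwise library feed the consistency step. Below is a minimal Python sketch of that library extension, assuming the triplet rule described in the T-Coffee publication: the extended weight of a residue pair is its direct weight plus, for every residue of a third sequence linked to both, the minimum of the two connecting weights. The `(sequence, position)` residue keys, the name `extend_library` and the toy positions are illustrative assumptions, not the T-Coffee code base.

```python
from collections import defaultdict


def extend_library(primary):
    """Extend a primary library by consistency (triplet rule).

    primary maps a pair of residues ((seq, pos), (seq, pos)) to a weight.
    Returns a dict with the same key shape, keyed with the smaller residue first.
    """
    # Symmetric lookup table: residue -> {partner residue: weight}
    neighbours = defaultdict(dict)
    for (a, b), weight in primary.items():
        neighbours[a][b] = weight
        neighbours[b][a] = weight

    extended = {}
    for a in neighbours:
        for b in neighbours:
            # Visit each unordered pair once, and only pairs from different sequences
            if a >= b or a[0] == b[0]:
                continue
            score = neighbours[a].get(b, 0)  # direct support
            for c, w_ac in neighbours[a].items():
                if c[0] in (a[0], b[0]):
                    continue  # the intermediate must come from a third sequence
                w_cb = neighbours[c].get(b)
                if w_cb is not None:
                    score += min(w_ac, w_cb)  # support through residue c
            if score:
                extended[(a, b)] = score
    return extended


# Toy library using the slide's weights (the positions are made up)
primary = {
    (("SeqA", 5), ("SeqB", 5)): 88,
    (("SeqA", 5), ("SeqC", 7)): 77,
    (("SeqB", 5), ("SeqC", 7)): 100,
}
extended = extend_library(primary)
print(extended[(("SeqA", 5), ("SeqB", 5))])  # 88 + min(77, 100) = 165
```

The all-against-all loop keeps the sketch short; a production implementation would only visit residues that share an intermediate.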
13 T-Coffee and Consistency
14 T-Coffee and Consistency
15 T-Coffee and Consistency
16 T-Coffee and Consistency
17 T-Coffee and Consistency
18 T-Coffee and Consistency
19 When Sequences Are Not Enough: 3D-Coffee and Expresso
20 3D-Coffee: Combining Sequences and Structures Within Multiple Sequence Alignments
21 3D-Coffee: Combining Sequences and Structures Within Multiple Sequence Alignments
22 Expresso: Finding the Right Structure
[Diagram labels: Sources; BLAST; SAP; Templates; Template Alignment; Source Template Alignment; Library; Remove Templates]
23 3D-Coffee: Combining Sequences and Structures Within Multiple Sequence Alignments
24 Incorporating RNA Information Within the T-Coffee Algorithm
25 ncRNAs Can Evolve Rapidly
CCAGGCAAGACGGGACGAGAGTTGCCTGG
CCTCCGTTCAGAGGTGCATAGAACGGAGG
-------------------
26 R-Coffee: Modifying T-Coffee at the Right Place
- Incorporation of Secondary Structure Information within the Library
- Two Extra Components for the T-Coffee Scoring Scheme
- A new Library
- A new Scoring Scheme
27 R-Coffee Extension
[Diagram: TC Library with base-paired G and C residues; G G: Score X; C C: Score Y]
- Goal: Embedding RNA Structures Within the T-Coffee Libraries
- The R-extension can be added on top of any existing method.
28 R-Coffee: Structural Aligners
Method       Avg Braliscore          Net Improv.
             direct   T      R       T      R
-------------------------------------------------
Stemloc      0.62     0.75   0.76     104    113
Mlocarna     0.66     0.69   0.71     101    133
Murlet       0.73     0.70   0.72    -132    -73
Pmcomp       0.73     0.73   0.73     142    145
T-Lara       0.74     0.74   0.69     -36     -8
Foldalign    0.75     0.77   0.77      72     73
-------------------------------------------------
Dynalign     ---      0.63   0.62     ---    ---
Consan       ---      0.79   0.79     ---    ---
-------------------------------------------------
Improvement = (R-Coffee wins) - (R-Coffee losses), over 170 test sets
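The net improvement column is a plain win/loss count over the test sets. A minimal sketch of that count, assuming one score per test set for the original method and for its R-Coffee counterpart (the function name and the toy scores are illustrative, not benchmark data):

```python
def net_improvement(baseline_scores, rcoffee_scores):
    """Number of test sets where R-Coffee scores higher, minus those where it scores lower."""
    if len(baseline_scores) != len(rcoffee_scores):
        raise ValueError("both methods must be scored on the same test sets")
    wins = sum(r > b for b, r in zip(baseline_scores, rcoffee_scores))
    losses = sum(r < b for b, r in zip(baseline_scores, rcoffee_scores))
    return wins - losses  # ties count for neither side


# Four hypothetical test sets: two wins, one tie, one loss
print(net_improvement([0.60, 0.70, 0.50, 0.80], [0.70, 0.70, 0.60, 0.75]))  # 1
```

Because ties are ignored, a value near zero can hide many small changes in both directions, which is why the tables report it next to the average score.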
29 R-Coffee: Regular Aligners
Method        Avg Braliscore          Net Improv.
              direct   T      R       T      R
--------------------------------------------------
Poa           0.62     0.65   0.70      48    154
Pcma          0.62     0.64   0.67      34    120
Prrn          0.64     0.61   0.66     -63     45
ClustalW      0.65     0.65   0.69      -7     83
Mafft_fftns   0.68     0.68   0.72      17     68
ProbConsRNA   0.69     0.67   0.71     -49     39
Muscle        0.69     0.69   0.73     -17     42
Mafft_ginsi   0.70     0.68   0.72     -49     39
--------------------------------------------------
Improvement = (R-Coffee wins) - (R-Coffee losses), over 388 test sets
30 Choosing the Right Modeling Method: M-Coffee
31 Combining Many MSAs into ONE
ClustalW
MAFFT
T-Coffee
MUSCLE
???????
32 Comparing Methods
MAFFT
35 Where to Trust Your Alignments
Most Methods Disagree
Most Methods Agree
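One way to make "agree" and "disagree" concrete is to count, for every pair of aligned residues, the fraction of alternative alignments that contain it. The sketch below does that under simple assumptions: each alignment is a dict from sequence name to gapped string, gaps are written `-`, and the helper names are illustrative. It is not the scoring used by M-Coffee itself.

```python
from collections import Counter
from itertools import combinations


def aligned_pairs(msa):
    """Set of residue pairs that share a column; residues are (name, ungapped index)."""
    names = sorted(msa)
    next_index = dict.fromkeys(names, 0)
    pairs = set()
    for col in range(len(msa[names[0]])):
        present = []
        for name in names:
            if msa[name][col] != "-":
                present.append((name, next_index[name]))
                next_index[name] += 1
        pairs.update(combinations(present, 2))
    return pairs


def pair_support(msas):
    """Fraction of the given alignments in which each residue pair is aligned."""
    counts = Counter()
    for msa in msas:
        counts.update(aligned_pairs(msa))
    return {pair: n / len(msas) for pair, n in counts.items()}


method_1 = {"A": "AC-G", "B": "ACTG"}
method_2 = {"A": "A-CG", "B": "ACTG"}
support = pair_support([method_1, method_2])
print(support[(("A", 0), ("B", 0))])  # 1.0: both methods agree
print(support[(("A", 1), ("B", 1))])  # 0.5: only one method aligns this pair
```

Pairs with support close to 1 mark the regions where most methods agree; low support flags the regions to treat with caution.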
36 What To Do Without Structures
37 Conclusion
- Model Based Alignments Give the best Accuracy
- Template based alignment is a very efficient way to turn Naïve aligners into model based aligners
- Sequence Alignments are not necessarily reliable over their entire lengths
38 www.tcoffee.org
- Fabrice Armougom (CNRS, FR)
- Sebastien Moretti (CNRS, FR)
- Olivier Poirot (CNRS, FR)
- Frederic Reinier (CRS4, IT)
- Karsten Suhre (CNRS, FR)
- Vladimir Saudek (Sanofi-Aventis, FR)
- Des Higgins (UCD, IE)
- Orla O'Sullivan (UCD, IE)
- Iain Wallace (UCD, IE)
- Victor Jongeneel (SIB/VitalIT, CH)
- Bruno Nyfler (VitalIT, CH)
- Roger Hersch (EPFL, CH)
- Pierre Dumas (EPFL, CH)
- Basile Schaeli (EPFL, CH)
www.tcoffee.org cedric.notredame_at_europe.com
41 Building and Using Models
35.67 Angstrom
42 Computing the Correct Alignment is a Complicated Problem
43 Stochastic Optimization
44 Stochastic Optimization
- Exploration of Complex Optimization Problems With Multiple Constraints
- Genomic Alignments
- RNA Alignments
- Generation of Population of Suboptimal Solutions
- Quality = f(optimality)
- Specification of Consistency Objective Function of T-Coffee
45 Three Types of Algorithms
- Progressive: ClustalW
- Iterative: Muscle
- Consistency Based: T-Coffee and Probcons
46 T-Coffee and Consistency
- Each Library Line is a Soft Constraint (a wish)
- You can't satisfy them all
- You must satisfy as many as possible (The easy ones)
47 Consistency Based Algorithms: T-Coffee
- Gotoh (1990)
- Iterative strategy using consistency
- Martin Vingron (1991)
- Dot Matrices Multiplications
- Accurate but too stringent
- Dialign (1996, Morgenstern)
- Consistency
- Agglomerative Assembly
- T-Coffee (2000, Notredame)
- Consistency
- Progressive algorithm
48 How Good Is My Method?
49 Structures Vs Sequences
50 Validation Using BaliBase
51 Too Many Methods for ONE Alignment: M-Coffee
53 Estimating the Accuracy of your MSA
54 What To Do Without Structures
55 3D-Coffee: Combining Sequences and Structures Within Multiple Sequence Alignments
56 Expresso: Finding the Right Structure
Why Not Use Structure Based Alignments?
57 Template Based Multiple Sequence Alignments
58 Template Based Multiple Sequence Alignments
[Diagram labels: Sources; -Structure -Profile -; Template Aligner; Templates; Template Alignment; Source Template Alignment; Library; Remove Templates]
59 Table 1
Method      Score        Templates   Prefab  Homstrad
------------------------------------------------------
ClustalW    Matrix       ----        61.80   ----
Kalign      Matrix       ----        63.00   ----
MUSCLE      Matrix       ----        68.00   45.0
------------------------------------------------------
T-Coffee    Consistency  ----        69.97   44.0
ProbCons    Consistency  ----        70.54   ----
Mafft       Consistency  ----        72.20   ----
M-Coffee    Consistency  ----        72.91   ----
MUMMALS     Consistency  ----        73.10   ----
------------------------------------------------------
Clustal-db  Matrix       Profiles    ----    ----
PRALINE     Matrix       Profiles    ----    50.2
PROMALS     Consistency  Profiles    79.00   ----
SPEM        Matrix       Profiles    77.00   ----
------------------------------------------------------
EXPRESSO    Consistency  Structures  ----    71.9
T-Lara      Consistency  Structures  ----    ----
------------------------------------------------------
Table 1. Summary of all the methods described in the review. Validation figures were
compiled from several sources and selected for their compatibility. Prefab refers to
validation made on Prefab Version 3. The HOMSTRAD validation was made on datasets
having less than 30% identity. The source of each figure is indicated by a reference.
The EXPRESSO figure comes from a slightly more demanding subset of HOMSTRAD (HOM39)
made of sequences less than 25% identical.
60 Improving The Evaluation
61 How Do We Perform In The Twilight Zone?
- Consistency Based Methods Have an Edge
- Hard to tell Methods Apart
- Sequence Alignment is NOT solved
62 More Than Structure based Alignments
- Structural Correctness Is Only the Easy Side of the Coin.
- In practice MSAs are intermediate models used to generate other models
Data         Model Type      Benchmark
Homology     Profile         Yes
Evolution    Trees           No
Structure    3D-Structure    CASP
Function     Annotation      No
63 Conclusion
- Template based Multiple Sequence Alignments
- Projecting any relevant information onto the sequences
- Using this Information
- Need for new evaluation procedures
- Functional Analysis
- Phylogenetic Analysis
- Homology Search (Profiles)
- Homology Modelling
- Integrating data → Making sure your bits of data can fight with one another
64 Turning Data into Models
- Data
- Columbus considered that the landmass occupied 225°, leaving only 135° of water (Marinus of Tyre, 70 AD).
- Columbus believed that 1° represented only 56 miles (Alfraganus, IXth century)
- He knew there was an island named Japan off the coast of China
- Model
- Circumference of the Earth as 25,255 km at most
- Canary Islands to Japan: 3,700 km (Reality: 12,000 km)
65 The More Structures The Merrier
[Plot: Average Improvement over T-Coffee vs. Struc/Seq Ratio]
66 The Right Mix of Methods
67 3D-Coffee: Combining Sequences and Structures Within Multiple Sequence Alignments
68 Applications
69 Looking Up The DNA Behind The Sequences: PROTOGENE
70 SAR Analysis
- Correlate Alignment Variations with Reactivity
- Application to the Human Kinome
- Collaboration with Sanofi-Aventis
- Main Issue
- Training problem → Proper Benchmarking
71 ncRNA Multiple Alignments with R-Coffee
- Laundering the Genome Dark Matter
- Cédric Notredame
- Comparative Bioinformatics Group
- Bioinformatics and Genomics Program
72 No Plane Today
73 ncRNAs Comparison
- And ENCODE said: "nearly the entire genome may be represented in primary transcripts that extensively overlap and include many non-protein-coding regions"
- Who Are They?
- tRNA, rRNA, snoRNAs
- microRNAs, siRNAs
- piRNAs
- long ncRNAs (Xist, Evf, Air, CTN, PINK)
- How Many of Them?
- Open question
- 30,000 is a common guess
- Harder to detect than proteins
74 ncRNAs Can Have Different Sequences and Similar Structures
75 ncRNAs are Difficult to Align
- Same Structure → Low Sequence Identity
- Small Alphabet, Short Sequences → Alignments often Non-Significant
76 Obtaining the Structure of a ncRNA is Difficult
- Hard to Align The Sequences Without the Structure
- Hard to Predict the Structures Without an Alignment
77 The Holy Grail of RNA Comparison: Sankoff Algorithm
78 The Holy Grail of RNA Comparison: Sankoff Algorithm
- Simultaneous Folding and Alignment
- Time Complexity: O(L^(2n))
- Space Complexity: O(L^(3n))
- In Practice, for Two Sequences
- 50 nucleotides: 1 min., 6 M.
- 100 nucleotides: 16 min., 256 M.
- 200 nucleotides: 4 hours, 4 G.
- 400 nucleotides: 3 days, 3 T.
- Forget about
- Multiple sequence alignments
- Database searches
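The running times quoted on this slide follow from the stated time complexity: for two sequences the exponent is 2n = 4, so doubling the length multiplies the time by 16. A minimal sketch of that arithmetic, taking the slide's complexity figure and its 50-nucleotide, 1-minute data point at face value (the function name and default arguments are illustrative assumptions):

```python
def extrapolate_minutes(length, ref_length=50, ref_minutes=1.0, n_sequences=2):
    """Scale a reference running time with length ** (2 * n_sequences)."""
    exponent = 2 * n_sequences
    return ref_minutes * (length / ref_length) ** exponent


for length in (100, 200, 400):
    minutes = extrapolate_minutes(length)
    print(length, minutes, round(minutes / 60, 1), round(minutes / 1440, 1))
# 100 nt ->   16 min
# 200 nt ->  256 min, about 4.3 hours
# 400 nt -> 4096 min, about 2.8 days
```

Raising `n_sequences` above 2 shows why the slide dismisses multiple alignments: with three sequences the same doubling costs a factor of 64.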
79 The Next Best Thing: Consan
- Consan = Sankoff + a few constraints
- Use of Stochastic Context Free Grammars
- Tree-shaped HMMs
- Made sparse with constraints
- The constraints are derived from the most confident positions of the alignment
- Equivalent of Banded DP
80 Going Multiple.
81 Game Rules
- Using Structural Predictions
- Produces better alignments
- Is Computationally expensive
- Use as much structural information as possible while doing as little computation as possible
82 Adapting T-Coffee To RNA Alignments
83 T-Coffee and Consistency
84 T-Coffee and Consistency
85 T-Coffee and Consistency
86 T-Coffee and Consistency
87 Consistency Conflicts and Information
[Diagram: residues X, Y, Z and W linked across sequences; conflicting cases are labelled "Y is unhappy" and "X is unhappy"]
Partly Consistent → Less Reliable
Fully Consistent → More Reliable
89 R-Coffee Scoring Scheme
R-Score(CC) = MAX(TC-Score(CC), TC-Score(GG))
[Diagram: two G-C base pairs]
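The scoring rule can be sketched in a few lines of Python. The reading assumed here is that CC is the pair of residues being aligned and GG the pair formed by their predicted base-pairing partners; the dictionary layout, the function name and the fallback to the plain TC-Score for unpaired residues are all assumptions made for illustration, not the R-Coffee implementation.

```python
def r_score(tc_scores, partner, res_a, res_b):
    """R-Score of aligning res_a with res_b.

    tc_scores: dict mapping (residue, residue) to a T-Coffee library score
    partner:   dict mapping a residue to its predicted base-pairing partner
    Residues are (sequence name, position) tuples.
    """
    direct = tc_scores.get((res_a, res_b), 0)
    mate_a, mate_b = partner.get(res_a), partner.get(res_b)
    if mate_a is None or mate_b is None:
        return direct  # assumption: unpaired residues keep their TC-Score
    return max(direct, tc_scores.get((mate_a, mate_b), 0))


# Hypothetical scores: the C/C pair is weak, its G/G partners are well supported
tc_scores = {
    (("s1", 0), ("s2", 0)): 40,   # C aligned with C
    (("s1", 8), ("s2", 9)): 90,   # their partners, G aligned with G
}
partner = {
    ("s1", 0): ("s1", 8), ("s1", 8): ("s1", 0),
    ("s2", 0): ("s2", 9), ("s2", 9): ("s2", 0),
}
print(r_score(tc_scores, partner, ("s1", 0), ("s2", 0)))  # max(40, 90) = 90
```

Taking the maximum lets a well-supported half of a base pair pull its partner into place even when the partner's own sequence signal is weak.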
90 Validating R-Coffee
91 RNA Alignments are Harder to Validate than Protein Alignments
- Protein Alignments → Use of Structure based Reference Alignments
- RNA Alignments → No Real structure based reference alignments
- The structures are mostly predicted from sequences
- Circularity
92 BraliBase and the BraliScore
- Database of Reference Alignments
- 388 multiple sequence alignments.
- Evenly distributed between 35 and 95 percent average sequence identity
- Contain 5 sequences selected from the RNA family database Rfam
- The reference alignment is based on a SCFG model based on the full Rfam seed dataset (100 sequences).
93 BraliBase: SPS Score
SPS = Number of Identically Aligned Pairs / Number of Aligned Pairs (RFam MSA)
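The SPS ratio can be computed directly from two alignments of the same sequences. Here is a minimal sketch, assuming each alignment is a dict from sequence name to gapped string with `-` for gaps, and that the denominator counts the aligned pairs of the reference (Rfam) alignment; the function names are illustrative.

```python
from itertools import combinations


def residue_pairs(msa):
    """Set of residue pairs sharing a column; residues are (name, ungapped index)."""
    names = sorted(msa)
    seen = dict.fromkeys(names, 0)
    pairs = set()
    for column in zip(*(msa[name] for name in names)):
        residues = []
        for name, char in zip(names, column):
            if char != "-":
                residues.append((name, seen[name]))
                seen[name] += 1
        pairs.update(combinations(residues, 2))
    return pairs


def sum_of_pairs_score(test_msa, reference_msa):
    """Fraction of the reference's aligned pairs that the test alignment reproduces."""
    reference = residue_pairs(reference_msa)
    if not reference:
        raise ValueError("reference alignment has no aligned pairs")
    return len(residue_pairs(test_msa) & reference) / len(reference)


reference = {"s1": "AC-G", "s2": "ACTG"}
test = {"s1": "A-CG", "s2": "ACTG"}
print(sum_of_pairs_score(test, reference))  # 2 of the 3 reference pairs are recovered
```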
94 BraliBase: SCI Score
RNApfold (one fold per sequence):
((()))((..)) ΔG Seq1
((()))((..)) ΔG Seq2
((()))((..)) ΔG Seq3
((()))((..)) ΔG Seq4
((()))((..)) ΔG Seq5
((()))((..)) ΔG Seq6
RNAalifold (alignment, with Covariance):
((()))((..)) ΔG ALN
SCI = ΔG ALN / Average ΔG Seq
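The structure conservation index is a ratio of folding energies: the energy of the consensus fold of the alignment divided by the mean energy of the individual folds. A small sketch of that ratio, assuming the energies (in kcal/mol, negative for stable folds) were already computed by external folding programs; no folding is performed here and the numbers are made up.

```python
def structure_conservation_index(alignment_energy, sequence_energies):
    """SCI = consensus folding energy / mean single-sequence folding energy."""
    if not sequence_energies:
        raise ValueError("at least one single-sequence energy is required")
    mean_energy = sum(sequence_energies) / len(sequence_energies)
    if mean_energy == 0:
        raise ValueError("mean single-sequence energy is zero; SCI is undefined")
    return alignment_energy / mean_energy


# Hypothetical energies for three sequences and their alignment
print(structure_conservation_index(-20.0, [-25.0, -15.0, -20.0]))  # 1.0
print(structure_conservation_index(-10.0, [-20.0, -20.0]))         # 0.5
```

Values near 1 indicate that the alignment preserves a fold about as stable as the individual ones; values near 0 indicate that no common structure survives the alignment.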
95 BRaliScore
96 RM-Coffee: Regular Aligners
Method        Avg Braliscore          Net Improv.
              direct   T      R       T      R
--------------------------------------------------
Poa           0.62     0.65   0.70      48    154
Pcma          0.62     0.64   0.67      34    120
Prrn          0.64     0.61   0.66     -63     45
ClustalW      0.65     0.65   0.69      -7     83
Mafft_fftns   0.68     0.68   0.72      17     68
ProbConsRNA   0.69     0.67   0.71     -49     39
Muscle        0.69     0.69   0.73     -17     42
Mafft_ginsi   0.70     0.68   0.72     -49     39
--------------------------------------------------
RM-Coffee4    0.71     /      0.74       /     84
97 How Best is the Best.
Method         vs. R-Coffee-Consan   vs. RM-Coffee4
Poa            241                   217
T-Coffee       241                   199
Prrn           232                   198
Pcma           218                   151
Proalign       216                   150
Mafft fftns    206                   148
ClustalW       203                   136
Probcons       192                   128
Mafft ginsi    170                   115
Muscle         169                   111
M-Locarna      234                   183
Stral          169                    62
FoldalignM     146                    61
Murlet         130                   -12
Rnasampler     129                   -27
T-Lara         125                   -30
98 Range of Performances
Effect of Compensated Mutations
99 Conclusion/Future Directions
- T-Coffee/Consan is currently the best MSA protocol for ncRNAs
- Testing how important the accuracy of the secondary structure prediction is
- Going deeper into Sankoff's territory: predicting and aligning simultaneously
100 Credits and Web Servers
- Andreas Wilm
- Des Higgins
- Sebastien Moretti
- Ioannis Xenarios
- Cedric Notredame
- CGR, SIB, UCD
www.tcoffee.org cedric.notredame_at_europe.com