Growing Trees on the Right Compost - PowerPoint PPT Presentation

About This Presentation

Title:

Growing Trees on the Right Compost

Description:

Title: Beyond ClustalW Author: Notredame Last modified by: Notredame Created Date: 6/12/2005 1:14:42 PM Document presentation format: On-screen Show – PowerPoint PPT presentation

Slides: 103

Provided by: Notr151

Learn more at: https://tcoffee.org

Transcript and Presenter's Notes

Title: Growing Trees on the Right Compost

1
Growing Trees on the Right Compost

Cédric Notredame
Comparative Bioinformatics Group
Bioinformatics and Genomics Program

2
Mangel M., Samaniego F.J., Abraham Wald's Work
on Aircraft Survivability, J. American
Statistical Association 79, 259-270 (1984)
3
(No Transcript)
4
What's in a Multiple Sequence Alignment
Evolution Inertia: Common Ancestry Shows Up in
the Sequences
Selection: Important Features Are Preserved
Functional Constraint: Same Function, Same
Sequence (Convergence)
Phylogenetic Footprint, Evolutionary Trace
5
Why So Much Interest For Multiple Alignments ?
Extrapolation
Structure Prediction
Motifs/Patterns
SNP Analysis
Profiles
Regulatory Elements
Phylogeny
Reactivity Analysis
6
What's in a Multiple Alignment?

The MSA contains what you put inside
Structural Similarity
Evolutive Similarity
Sequence Similarity
You can view your MSA as
A record of evolution
A summary of a protein family
A collection of experiments made for you by
Nature

7
Producing The Right Alignment

Multiple Sequence Alignments Influence
Phylogenetic Trees
Choice of Method is not Neutral
Different Methods
Different Alignments
Different Trees
Using the Right Models Ensures Producing the
Right Tree

8
Model Based Alignments vs Naïve Alignments

Naïve Alignment
Lexicographic Alignment
Maximizing the number of identities
At best using a substitution matrix
Model Based Alignments
Using a model
Protein structure information
RNA Structure information
Combining/Confronting Modeling methods
Template based Alignments
Model based Alignments through the use of
Templates

9
T-Coffee and Model Based Alignments

T-Coffee Algorithm
Expresso: Aligning Protein Structures
R-Coffee: Aligning RNA Structures
M-Coffee: Combining Methods

10
T-Coffee: An Extension of the Progressive
Alignment Algorithm
11
T-Coffee and Consistency
SeqA GARFIELD THE LAST FAT CAT
SeqB GARFIELD THE FAST CAT
SeqC GARFIELD THE VERY FAST CAT
SeqD THE FAT CAT

SeqA GARFIELD THE LAST FA-T CAT
SeqB GARFIELD THE FAST CA-T ---
SeqC GARFIELD THE VERY FAST CAT
SeqD -------- THE ---- FA-T CAT
12
T-Coffee and Consistency
SeqA GARFIELD THE LAST FAT CAT     Prim. Weight 88
SeqB GARFIELD THE FAST CAT ---

SeqA GARFIELD THE LAST FA-T CAT    Prim. Weight 77
SeqC GARFIELD THE VERY FAST CAT

SeqA GARFIELD THE LAST FAT CAT     Prim. Weight 100
SeqD -------- THE ---- FAT CAT

SeqB GARFIELD THE ---- FAST CAT    Prim. Weight 100
SeqC GARFIELD THE VERY FAST CAT

SeqC GARFIELD THE VERY FAST CAT    Prim. Weight 100
SeqD -------- THE ---- FA-T CAT
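One plausible reading of the primary weights on this slide is percent identity over the columns where both sequences carry a residue. The sketch below assumes that reading; the function name and the choice to ignore gaps and word separators are illustrative assumptions, not taken from the slides. It reproduces the 88 and 100 shown for SeqA/SeqB and SeqA/SeqD, while other values on the slide (such as the 77 for SeqA/SeqC) appear to follow a slightly different convention.

```python
def primary_weight(row_a: str, row_b: str) -> int:
    """Percent identity over columns where both rows carry a residue.

    Assumption: gaps ('-') and word separators (' ') are ignored,
    and the percentage is truncated to an integer.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    aligned = matches = 0
    for x, y in zip(row_a, row_b):
        if x in "- " or y in "- ":  # skip gapped columns and spaces
            continue
        aligned += 1
        matches += x == y
    return 100 * matches // aligned if aligned else 0


print(primary_weight("GARFIELD THE LAST FAT CAT", "GARFIELD THE FAST CAT ---"))
print(primary_weight("GARFIELD THE LAST FAT CAT", "-------- THE ---- FAT CAT"))
```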
13
T-Coffee and Consistency
14
T-Coffee and Consistency
15
T-Coffee and Consistency
16
T-Coffee and Consistency
17
T-Coffee and Consistency
18
T-Coffee and Consistency
19
When Sequences Are Not Enough: 3D-Coffee and
Expresso
20
3D-Coffee: Combining Sequences and Structures
Within Multiple Sequence Alignments
21
3D-Coffee: Combining Sequences and Structures
Within Multiple Sequence Alignments
22
Expresso: Finding the Right Structure
(Figure labels: Sources; BLAST; Templates; SAP;
Template Alignment; Source Template Alignment;
Remove Templates; Library)
23
3D-Coffee: Combining Sequences and Structures
Within Multiple Sequence Alignments
24
Incorporating RNA Information Within the T-Coffee
Algorithm
25
ncRNAs Can Evolve Rapidly
CCAGGCAAGACGGGACGAGAGTTGCCTGG
CCTCCGTTCAGAGGTGCATAGAACGGAGG
26
R-Coffee: Modifying T-Coffee at the Right Place

Incorporation of Secondary Structure information
within the Library
Two Extra Components for the T-Coffee Scoring
Scheme
A new Library
A new Scoring Scheme

27
R-Coffee Extension
TC Library: G-G Score X, C-C Score Y
(Figure: paired G and C residues in the aligned
sequences)

Goal: Embedding RNA Structures Within the
T-Coffee Libraries
The R-extension can be added on top of any
existing method.

28
R-Coffee Structural Aligners
               Avg Braliscore          Net Improv.
Method         direct   T      R       T      R
----------------------------------------------------
Stemloc        0.62     0.75   0.76     104    113
Mlocarna       0.66     0.69   0.71     101    133
Murlet         0.73     0.70   0.72    -132    -73
Pmcomp         0.73     0.73   0.73     142    145
T-Lara         0.74     0.74   0.69     -36     -8
Foldalign      0.75     0.77   0.77      72     73
----------------------------------------------------
Dyalign        ---      0.63   0.62     ---    ---
Consan         ---      0.79   0.79     ---    ---
----------------------------------------------------
Improvement = (R-Coffee wins) - (R-Coffee loses)
over 170 test sets
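The net improvement column can be illustrated with a short sketch, assuming it counts test sets where the combined method scores higher minus those where it scores lower. The function name and the toy scores below are hypothetical and are not figures from the slides.

```python
def net_improvement(direct_scores, combined_scores):
    """Wins minus losses of the combined method over paired test sets.

    Assumption: a 'win' is a strictly higher score on the same test set,
    a 'loss' a strictly lower one; ties count for neither.
    """
    if len(direct_scores) != len(combined_scores):
        raise ValueError("both methods must be scored on the same test sets")
    wins = sum(c > d for d, c in zip(direct_scores, combined_scores))
    losses = sum(c < d for d, c in zip(direct_scores, combined_scores))
    return wins - losses


# Hypothetical Braliscores on four test sets (not data from the slides).
direct = [0.60, 0.70, 0.80, 0.50]
combined = [0.65, 0.70, 0.75, 0.60]
print(net_improvement(direct, combined))
```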
29
R-Coffee Regular Aligners
               Avg Braliscore          Net Improv.
Method         direct   T      R       T      R
----------------------------------------------------
Poa            0.62     0.65   0.70      48    154
Pcma           0.62     0.64   0.67      34    120
Prrn           0.64     0.61   0.66     -63     45
ClustalW       0.65     0.65   0.69      -7     83
Mafft_fftns    0.68     0.68   0.72      17     68
ProbConsRNA    0.69     0.67   0.71     -49     39
Muscle         0.69     0.69   0.73     -17     42
Mafft_ginsi    0.70     0.68   0.72     -49     39
----------------------------------------------------
Improvement = (R-Coffee wins) - (R-Coffee loses)
over 388 test sets
30
Choosing the Right Modeling Method: M-Coffee
31
Combining Many MSAs into ONE
ClustalW
MAFFT
T-Coffee
MUSCLE
???????
32
Comparing Methods
MAFFT
33
(No Transcript)
34
(No Transcript)
35
Where to Trust Your Alignments
Most Methods Disagree
Most Methods Agree
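The idea that agreement between methods indicates where to trust an alignment can be sketched as a vote count over aligned residue pairs. This is a minimal illustration, not the M-Coffee implementation: the function names, the dictionary-based alignment format and the toy alignments are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def aligned_pairs(msa):
    """Return the set of residue pairs placed in the same column.

    msa maps a sequence name to its gapped row; a residue is identified
    by (name, index in the ungapped sequence).
    """
    names = sorted(msa)
    position = {name: 0 for name in names}
    pairs = set()
    for col in range(len(msa[names[0]])):
        present = []
        for name in names:
            if msa[name][col] != "-":
                present.append((name, position[name]))
                position[name] += 1
        pairs.update(combinations(present, 2))
    return pairs


def agreement(msas):
    """Count, for every residue pair, how many alignments contain it."""
    votes = Counter()
    for msa in msas:
        votes.update(aligned_pairs(msa))
    return votes


# Two hypothetical alignments of the same two sequences.
method_1 = {"A": "FAT", "B": "F-T"}
method_2 = {"A": "FAT", "B": "FT-"}
for pair, count in sorted(agreement([method_1, method_2]).items()):
    print(pair, count)
```

Pairs supported by every input alignment sit in the "most methods agree" regions; pairs with a single vote mark the regions where the methods disagree.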
36
What To Do Without Structures
37
Conclusion

Model Based Alignments Give the best Accuracy
Template based alignment is a very efficient way
to turn Naïve aligners into model based aligners
Sequence Alignments are not necessarily reliable
over their entire lengths

38
www.tcoffee.org

Fabrice Armougom (CNRS, FR)
Sebastien Moretti (CNRS, FR)
Olivier Poirot (CNRS, FR)
Frederic Reinier (CRS4, IT)
Karsten Suhre (CNRS, FR)
Vladimir Saudek (Sanofi-Aventis, FR)
Des Higgins (UCD, IE)
Orla O'Sullivan (UCD, IE)
Iain Wallace (UCD, IE)
Victor Jongeneel (SIB/VitalIT, CH)
Bruno Nyfler (VitalIT, CH)
Roger Hersch (EPFL, CH)
Pierre Dumas (EPFL, CH)
Basile Schaeli (EPFL, CH)

www.tcoffee.org cedric.notredame_at_europe.com
39
www.tcoffee.org cedric.notredame_at_europe.com
40
(No Transcript)
41
Building and Using Models
35.67 Angstrom
42
Computing the Correct Alignment is a Complicated
Problem
43
Stochastic Optimization
44
Stochastic Optimization

Exploration of Complex Optimization Problems With
Multiple Constraints
Genomic Alignments
RNA Alignments
Generation of Population of Suboptimal Solutions
Quality = f(optimality)
Specification of the Consistency Objective Function
of T-Coffee

45
Three Types of Algorithms

Progressive: ClustalW
Iterative: Muscle
Consistency Based: T-Coffee and Probcons

46
T-Coffee and Consistency

Each Library Line is a Soft Constraint (a wish)
You can't satisfy them all
You must satisfy as many as possible (The easy
ones)
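To make the notion of weighted library lines concrete, the sketch below applies the triplet extension described in the published T-Coffee method, where a pair of residues gains support from every third residue aligned to both, bounded by the weaker of the two links. The data layout, function name and toy weights are assumptions for illustration, not code from T-Coffee.

```python
from collections import defaultdict
from itertools import combinations


def extend_library(library):
    """Add triplet support to each residue pair.

    library maps (residue_x, residue_y) -> primary weight, where a residue
    is a (sequence_name, index) tuple. Assumption: each pair appears once.
    """
    neighbours = defaultdict(dict)
    for (x, y), weight in library.items():
        neighbours[x][y] = weight
        neighbours[y][x] = weight

    extended = defaultdict(int)
    for (x, y), weight in library.items():
        extended[tuple(sorted((x, y)))] += weight  # direct evidence
    for linked in neighbours.values():
        for x, y in combinations(sorted(linked), 2):
            if x[0] == y[0]:
                continue  # never pair two residues of the same sequence
            # x and y both align to a third residue: the weaker link
            # bounds the support this intermediate can give.
            extended[(x, y)] += min(linked[x], linked[y])
    return dict(extended)


# Hypothetical library lines for one residue in each of three sequences.
library = {
    (("A", 0), ("B", 0)): 88,
    (("A", 0), ("C", 0)): 77,
    (("B", 0), ("C", 0)): 100,
}
for pair, weight in sorted(extend_library(library).items()):
    print(pair, weight)
```

After extension, each line carries information from the whole library, so a progressive aligner that maximizes these weights tends to satisfy the constraints that most other constraints agree with.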

47
Consistency Based Algorithms: T-Coffee

Gotoh (1990)
Iterative strategy using consistency
Martin Vingron (1991)
Dot Matrices Multiplications
Accurate but too stringent
Dialign (1996, Morgenstern)
Consistency
Agglomerative Assembly
T-Coffee (2000, Notredame)
Consistency
Progressive algorithm

48
How Good Is My Method ?
49
Structures Vs Sequences
50
Validation Using BaliBase
51
Too Many Methods for ONE Alignment: M-Coffee
52
(No Transcript)
53
Estimating the Accuracy of your MSA
54
What To Do Without Structures
55
3D-Coffee: Combining Sequences and Structures
Within Multiple Sequence Alignments
56
Expresso: Finding the Right Structure
Why Not Use Structure Based Alignments?
57
Template Based Multiple Sequence Alignments
58
Template Based Multiple Sequence Alignments
(Figure labels: Sources; template types
(Structure, Profile, ...); Template Aligner
(Structure, Profile, ...); Templates; Template
Alignment; Source Template Alignment; Remove
Templates; Library)
59
Method       Score         Templates    Prefab   Homstrad
----------------------------------------------------------
ClustalW     Matrix        ----         61.80    ----
Kalign       Matrix        ----         63.00    ----
MUSCLE       Matrix        ----         68.00    45.0
----------------------------------------------------------
T-Coffee     Consistency   ----         69.97    44.0
ProbCons     Consistency   ----         70.54    ----
Mafft        Consistency   ----         72.20    ----
M-Coffee     Consistency   ----         72.91    ----
MUMMALS      Consistency   ----         73.10    ----
----------------------------------------------------------
Clustal-db   Matrix        Profiles     ----     ----
PRALINE      Matrix        Profiles     ----     50.2
PROMALS      Consistency   Profiles     79.00    ----
SPEM         Matrix        Profiles     77.00    ----
----------------------------------------------------------
EXPRESSO     Consistency   Structures   ----     71.9
T-Lara       Consistency   Structures   ----     ----
----------------------------------------------------------
Table 1. Summary of all the methods described in
the review. Validation figures were compiled from
several sources, and selected for their
compatibility. Prefab refers to some validation
made on Prefab Version 3. The HOMSTRAD validation
was made on datasets having less than 30%
identity. The source of each figure is indicated
by a reference. The EXPRESSO figure comes from a
slightly more demanding subset of HOMSTRAD (HOM39)
made of sequences less than 25% identical.
60
Improving The Evaluation
61
How Do We Perform In The Twilight Zone?

Consistency Based Methods Have an Edge
Hard to tell Methods Apart
Sequence Alignment is NOT solved

62
More Than Structure based Alignments

Structural Correctness Is Only the Easy Side of
the Coin.
In practice, MSAs are intermediate models used to
generate other models

Data        Model Type     Benchmark
Homology    Profile        Yes
Evolution   Trees          No
Structure   3D-Structure   CASP
Function    Annotation     No
63
Conclusion

Template based Multiple Sequence Alignments
Projecting any relevant information onto the
sequences
Using this Information
Need for new evaluation procedures
Functional Analysis
Phylogenetic Analysis
Homology Search (Profiles)
Homology Modelling
Integrating data -> Making sure your bits of data
can fight with one another

64
Turning Data into Models

Data
Columbus considered that the landmass occupied
225°, leaving only 135° of water (Marinus of
Tyre, 70 AD).
Columbus believed that 1° represented only 56
miles (Alfraganus, IXth century)
He knew there was an island named Japan off the
coast of China
Model
Circumference of the Earth as 25,255 km at most,
Canary Islands to Japan 3,700 km (Reality
12,000 km.)

65
The More Structures The Merrier
Average Improvement over T-Coffee
Struc/Seq Ratio
66
The Right Mix of Methods
67
3D-Coffee: Combining Sequences and Structures
Within Multiple Sequence Alignments
68
Applications
69
Looking Up the DNA Behind the Sequences: PROTOGENE
70
SAR Analysis

Correlate Alignment Variations with Reactivity
Application to the Human Kinome
Collaboration with Sanofi-Aventis
Main Issue
Training problem -> Proper Benchmarking

71
ncRNA Multiple Alignments with R-Coffee

Laundering the Genome Dark Matter
Cédric Notredame
Comparative Bioinformatics Group
Bioinformatics and Genomics Program

72
No Plane Today
73
ncRNAs Comparison

And ENCODE said:
"nearly the entire genome may be represented in
primary transcripts that extensively overlap and
include many non-protein-coding regions"
Who Are They?
tRNA, rRNA, snoRNAs,
microRNAs, siRNAs
piRNAs
long ncRNAs (Xist, Evf, Air, CTN, PINK)
How Many of Them?
Open question
30,000 is a common guess
Harder to detect than proteins

74
ncRNAs can have different sequences and Similar
Structures
75
ncRNAs are Difficult to Align

Same Structure -> Low Sequence Identity
Small Alphabet, Short Sequences -> Alignments
often Non-Significant

76
Obtaining the Structure of a ncRNA is difficult

Hard to Align The Sequences Without the Structure
Hard to Predict the Structures Without an
Alignment

77
The Holy Grail of RNA Comparison: Sankoff
Algorithm
78
The Holy Grail of RNA Comparison: Sankoff
Algorithm

Simultaneous Folding and Alignment
Time Complexity: O(L^(2n))
Space Complexity: O(L^(3n))
In Practice, for Two Sequences
50 nucleotides     1 min.     6 M.
100 nucleotides    16 min.    256 M.
200 nucleotides    4 hours    4 G.
400 nucleotides    3 days     3 T.
Forget about
Multiple sequence alignments
Database searches

79
The Next Best Thing: Consan

Consan = Sankoff + a few constraints
Use of Stochastic Context Free Grammars
Tree-shaped HMMs
Made sparse with constraints
The constraints are derived from the most
confident positions of the alignment
Equivalent of Banded DP

80
Going Multiple.

Structural Aligners

81
Game Rules

Using Structural Predictions
Produces better alignments
Is Computationally expensive
Use as much structural information as possible
while doing as little computation as possible

82
Adapting T-Coffee To RNA Alignments
83
T-Coffee and Consistency
84
T-Coffee and Consistency
85
T-Coffee and Consistency
86
T-Coffee and Consistency
87
Consistency Conflicts and Information
(Figure: library links among residues W, X, Y and
Z; where the links conflict, X is unhappy and Y is
unhappy)
Partly Consistent -> Less Reliable
Fully Consistent -> More Reliable
88
(No Transcript)
89
R-Coffee Scoring Scheme
R-Score(CC) = MAX(TC-Score(CC), TC-Score(GG))
(Figure: G-C base pairs in the two aligned
sequences)
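The scoring rule on this slide can be sketched as follows, assuming each residue may have a base-pairing partner from a predicted secondary structure: a pair of residues receives the larger of its own T-Coffee score and the score of its partners' pair. The function name, the dictionaries and the toy values are hypothetical.

```python
def r_score(pair, tc_score, partner):
    """R-Coffee style score for one aligned residue pair.

    pair: (res_a, res_b), one residue from each sequence.
    tc_score: maps residue pairs to their T-Coffee score (0 if absent).
    partner: maps a residue to its base-pairing partner, if it has one.
    """
    res_a, res_b = pair
    direct = tc_score.get((res_a, res_b), 0)
    mate_a, mate_b = partner.get(res_a), partner.get(res_b)
    if mate_a is None or mate_b is None:
        return direct  # unpaired residues keep their T-Coffee score
    return max(direct, tc_score.get((mate_a, mate_b), 0))


# Hypothetical stem: in each sequence a G at position 0 pairs with a C at 8.
g1, c1 = ("seq1", 0), ("seq1", 8)
g2, c2 = ("seq2", 0), ("seq2", 8)
partner = {g1: c1, c1: g1, g2: c2, c2: g2}
tc_score = {(g1, g2): 40, (c1, c2): 10}
print(r_score((c1, c2), tc_score, partner))
```

In this toy case the weakly supported C-C pair inherits the stronger score of the G-G pair it is base-paired with, which is how structural information can reinforce an alignment without a full folding computation.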
90
Validating R-Coffee
91
RNA Alignments are harder to validate than
Protein Alignments

Protein Alignments -> Use of Structure based
Reference Alignments
RNA Alignments -> No Real structure based reference
alignments
The structures are mostly predicted from
sequences
Circularity

92
BraliBase and the BraliScore

Database of Reference Alignments
388 multiple sequence alignments.
Evenly distributed between 35 and 95 percent
average sequence identity
Contain 5 sequences selected from the RNA family
database Rfam
The reference alignment is based on a SCFG model
based on the full Rfam seed dataset (100
sequences).

93
BraliBase SPS Score
SPS = (Number of Identically Aligned Pairs) /
(Number of Aligned Pairs in the RFam MSA)
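A sum-of-pairs score of this kind can be sketched as the fraction of residue pairs aligned in the reference that are also aligned in the test alignment. The alignment format (a dictionary from sequence name to gapped row) and the function names are assumptions for illustration.

```python
from itertools import combinations


def aligned_pairs(msa):
    """Return the set of residue pairs placed in the same column.

    A residue is identified by (sequence name, index in the ungapped
    sequence), so the same residue is recognized across alignments.
    """
    names = sorted(msa)
    position = {name: 0 for name in names}
    pairs = set()
    for col in range(len(msa[names[0]])):
        present = []
        for name in names:
            if msa[name][col] != "-":
                present.append((name, position[name]))
                position[name] += 1
        pairs.update(combinations(present, 2))
    return pairs


def sps(test_msa, reference_msa):
    """Fraction of reference pairs reproduced by the test alignment."""
    reference = aligned_pairs(reference_msa)
    if not reference:
        return 0.0
    return len(reference & aligned_pairs(test_msa)) / len(reference)


# Hypothetical reference and test alignments of the same two sequences.
reference = {"A": "FAT", "B": "F-T"}
test = {"A": "FAT", "B": "FT-"}
print(sps(test, reference))
```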
94
BraliBase SCI Score
Per sequence (RNApfold): ((()))((..)) and a DG
for each of Seq1 to Seq6
Whole alignment (RNAalifold, with covariance):
((()))((..)) and ALN DG
SCI = DG ALN / Average DG Seq
95
BraliScore

Braliscore = SCI × SPS
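As a worked illustration, the two BraliBase components can be combined as below, assuming the SCI is the folding energy of the alignment divided by the mean folding energy of the individual sequences, and the Braliscore is the product of SCI and SPS. The energies and the SPS value are made up for the example, and the function names are hypothetical.

```python
def sci(alignment_dg, sequence_dgs):
    """Structure conservation index from folding energies (kcal/mol).

    Assumption: alignment_dg is the consensus energy of the alignment and
    sequence_dgs are the energies of the sequences folded individually.
    """
    mean_dg = sum(sequence_dgs) / len(sequence_dgs)
    if mean_dg == 0:
        raise ValueError("mean folding energy is zero, SCI is undefined")
    return alignment_dg / mean_dg


def braliscore(sci_value, sps_value):
    """Combine structural and sequence-level agreement into one score."""
    return sci_value * sps_value


# Hypothetical energies for three sequences and their alignment.
sci_value = sci(-18.0, [-20.0, -22.0, -18.0])
print(round(sci_value, 2))
print(round(braliscore(sci_value, 0.8), 2))
```

Because the score is a product, an alignment has to do well on both the structural and the sequence criterion to rank highly.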

96
RM-Coffee Regular Aligners
               Avg Braliscore          Net Improv.
Method         direct   T      R       T      R
----------------------------------------------------
Poa            0.62     0.65   0.70      48    154
Pcma           0.62     0.64   0.67      34    120
Prrn           0.64     0.61   0.66     -63     45
ClustalW       0.65     0.65   0.69      -7     83
Mafft_fftns    0.68     0.68   0.72      17     68
ProbConsRNA    0.69     0.67   0.71     -49     39
Muscle         0.69     0.69   0.73     -17     42
Mafft_ginsi    0.70     0.68   0.72     -49     39
----------------------------------------------------
RM-Coffee4     0.71     /      0.74      /      84
97
How Best is the Best.

Method         vs. R-Coffee-Consan   vs. RM-Coffee4
Poa            241                   217
T-Coffee       241                   199
Prrn           232                   198
Pcma           218                   151
Proalign       216                   150
Mafft fftns    206                   148
ClustalW       203                   136
Probcons       192                   128
Mafft ginsi    170                   115
Muscle         169                   111
M-Locarna      234                   183
Stral          169                    62
FoldalignM     146                    61
Murlet         130                   -12
Rnasampler     129                   -27
T-Lara         125                   -30
98
Range of Performances
Effect of Compensated Mutations
99
Conclusion/Future Directions