Growing Trees on the Right Compost

Title: Beyond ClustalW
Author: Notredame
Last modified by: Notredame
Created Date: 6/12/2005 1:14:42 PM
Document presentation format: On-screen Show

Slides: 103
Learn more at: https://tcoffee.org

Transcript and Presenter's Notes

1
Growing Trees on the Right Compost
  • Cédric Notredame
  • Comparative Bioinformatics Group
  • Bioinformatics and Genomics Program

2
Mangel M., Samaniego F.J., Abraham Wald's Work
on Aircraft Survivability, J. American
Statistical Association 79, 259-270 (1984)
3
(No Transcript)
4
What's in a Multiple Sequence Alignment
Evolution Inertia: Common Ancestry Shows up in
the sequences
Selection: Important Features Are Preserved
Functional Constraint: Same Function, Same
Sequence (Convergence)
Phylogenetic Footprint, Evolutionary Trace
5
Why So Much Interest in Multiple Alignments?
Extrapolation
Structure Prediction
Motifs/Patterns
SNP Analysis
Profiles
Regulatory Elements
Phylogeny
Reactivity Analysis
6
What's in a Multiple Alignment?
  • The MSA contains what you put inside
  • Structural Similarity
  • Evolutionary Similarity
  • Sequence Similarity
  • You can view your MSA as
  • A record of evolution
  • A summary of a protein family
  • A collection of experiments made for you by
    Nature

7
Producing The Right Alignment
  • Multiple Sequence Alignments Influence
    Phylogenetic Trees
  • Choice of Method is not Neutral
  • Different Methods
  • Different Alignments
  • Different Trees
  • Using the Right Models Ensures Producing the
    Right Tree

8
Model Based Alignments vs Naïve Alignments
  • Naïve Alignment
  • Lexicographic Alignment
  • Maximizing the number of identities
  • At best using a substitution matrix
  • Model Based Alignments
  • Using a model
  • Protein structure information
  • RNA Structure information
  • Combining/Confronting Modeling methods
  • Template based Alignments
  • Model based Alignments through the use of
    Templates
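
A minimal sketch of what a naive alignment optimizes, assuming a plain dynamic-programming recursion in which a match scores 1 and mismatches and gaps score 0, so that the optimum is simply the largest possible number of identities. The function name `max_identities` and the scoring values are illustrative assumptions, not taken from any particular aligner.

```python
def max_identities(a: str, b: str) -> int:
    """Largest number of identical residue pairs over all global alignments of a and b."""
    # dp[i][j] holds the best identity count for aligning a[:i] with b[:j].
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            # Align a[i-1] with b[j-1]; it only counts if the residues are identical.
            diagonal = dp[i - 1][j - 1] + (1 if a[i - 1] == b[j - 1] else 0)
            # Otherwise place a gap in one of the two sequences (no penalty here).
            dp[i][j] = max(diagonal, dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]


print(max_identities("FAT CAT", "FAST CAT"))
```

Replacing the 1/0 identity test with a substitution-matrix lookup gives the "at best" variant listed above; a model based aligner would instead take its scores from structural or other external information.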

9
T-Coffee and Model Based Alignments
  • T-Coffee Algorithm
  • Expresso: Aligning Protein Structures
  • R-Coffee: Aligning RNA structures
  • M-Coffee: Combining methods

10
T-Coffee: An Extension of the Progressive
Alignment Algorithm
11
T-Coffee and Consistency
SeqA GARFIELD THE LAST FAT CAT
SeqB GARFIELD THE FAST CAT
SeqC GARFIELD THE VERY FAST CAT
SeqD THE FAT CAT

SeqA GARFIELD THE LAST FA-T CAT
SeqB GARFIELD THE FAST CA-T ---
SeqC GARFIELD THE VERY FAST CAT
SeqD -------- THE ---- FA-T CAT
12
T-Coffee and Consistency
SeqA GARFIELD THE LAST FAT CAT    Prim. Weight 88
SeqB GARFIELD THE FAST CAT ---

SeqA GARFIELD THE LAST FA-T CAT   Prim. Weight 77
SeqC GARFIELD THE VERY FAST CAT

SeqA GARFIELD THE LAST FAT CAT    Prim. Weight 100
SeqD -------- THE ---- FAT CAT

SeqB GARFIELD THE ---- FAST CAT   Prim. Weight 100
SeqC GARFIELD THE VERY FAST CAT

SeqC GARFIELD THE VERY FAST CAT   Prim. Weight 100
SeqD -------- THE ---- FA-T CAT
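
The primary weights can be turned into consistency-aware weights by library extension. The sketch below assumes the triplet rule from the published T-Coffee scheme: the extended weight of a residue pair is its primary weight plus, for every residue of a third sequence aligned to both, the minimum of the two primary weights. The data layout (residues as `(sequence name, position)` tuples) and the function name are illustrative assumptions.

```python
from collections import defaultdict


def extend_library(primary):
    """primary maps ((seq_a, i), (seq_b, j)) -> weight for pairs aligned in pairwise alignments."""
    # Store weights symmetrically so each residue can enumerate its partners.
    partners = defaultdict(dict)
    for (x, y), w in primary.items():
        partners[x][y] = w
        partners[y][x] = w

    extended = defaultdict(int)
    for x in partners:
        # Direct evidence from the primary library (each unordered pair counted once).
        for y, w in partners[x].items():
            if x < y:
                extended[(x, y)] += w
        # Indirect evidence: x and y are both aligned to a residue z of a third sequence.
        for z, w_xz in partners[x].items():
            for y, w_zy in partners[z].items():
                if y[0] != x[0] and y[0] != z[0] and x < y:
                    extended[(x, y)] += min(w_xz, w_zy)
    return dict(extended)


# One residue per sequence, with the A-B, A-C and B-C weights from the slide.
primary = {
    (("A", 0), ("B", 0)): 88,
    (("A", 0), ("C", 0)): 77,
    (("B", 0), ("C", 0)): 100,
}
print(extend_library(primary))
```

In this toy case the A-B pair ends up with 88 + min(77, 100), because sequence C supports the same match through both of its pairwise alignments.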
13
T-Coffee and Consistency
14
T-Coffee and Consistency
15
T-Coffee and Consistency
16
T-Coffee and Consistency
17
T-Coffee and Consistency
18
T-Coffee and Consistency
19
When Sequences Are Not Enough: 3D-Coffee and
Expresso
20
3D-Coffee: Combining Sequences and Structures
Within Multiple Sequence Alignments
21
3D-Coffee: Combining Sequences and Structures
Within Multiple Sequence Alignments
22
Expresso: Finding the Right Structure
[Diagram labels: Sources, BLAST (twice), SAP, Templates (twice), Template Alignment, Source Template Alignment, Library, Remove Templates]
23
3D-Coffee: Combining Sequences and Structures
Within Multiple Sequence Alignments
24
Incorporating RNA Information Within the T-Coffee
Algorithm
25
ncRNAs Can Evolve Rapidly
CCAGGCAAGACGGGACGAGAGTTGCCTGG
CCTCCGTTCAGAGGTGCATAGAACGGAGG
-------------------
26
R-Coffee: Modifying T-Coffee at the Right Place
  • Incorporation of Secondary Structure information
    within the Library
  • Two Extra Components for the T-Coffee Scoring
    Scheme
  • A new Library
  • A new Scoring Scheme

27
R-Coffee Extension
[Diagram: TC Library entries for paired G and C residues; G-G Score X, C-C Score Y]
  • Goal: Embedding RNA Structures Within the
    T-Coffee Libraries
  • The R-extension can be added on top of any
    existing method.

28
R-Coffee: Structural Aligners
             Avg Braliscore          Net Improv.
Method       direct   T      R       T      R
--------------------------------------------------
Stemloc      0.62     0.75   0.76    104    113
Mlocarna     0.66     0.69   0.71    101    133
Murlet       0.73     0.70   0.72    -132   -73
Pmcomp       0.73     0.73   0.73    142    145
T-Lara       0.74     0.74   0.69    -36    -8
Foldalign    0.75     0.77   0.77    72     73
--------------------------------------------------
Dynalign     ---      0.63   0.62    ---    ---
Consan       ---      0.79   0.79    ---    ---
--------------------------------------------------
Improvement = (R-Coffee wins) - (R-Coffee loses),
over 170 test sets
29
R-Coffee: Regular Aligners
               Avg Braliscore          Net Improv.
Method         direct   T      R       T      R
----------------------------------------------------
Poa            0.62     0.65   0.70    48     154
Pcma           0.62     0.64   0.67    34     120
Prrn           0.64     0.61   0.66    -63    45
ClustalW       0.65     0.65   0.69    -7     83
Mafft_fftns    0.68     0.68   0.72    17     68
ProbConsRNA    0.69     0.67   0.71    -49    39
Muscle         0.69     0.69   0.73    -17    42
Mafft_ginsi    0.70     0.68   0.72    -49    39
----------------------------------------------------
Improvement = (R-Coffee wins) - (R-Coffee loses),
over 388 test sets
30
Choosing the Right Modeling Method: M-Coffee
31
Combining Many MSAs into ONE
ClustalW
MAFFT
T-Coffee
MUSCLE
???????
32
Comparing Methods
MAFFT
33
(No Transcript)
34
(No Transcript)
35
Where to Trust Your Alignments
Most Methods Disagree
Most Methods Agree
36
What To Do Without Structures
37
Conclusion
  • Model Based Alignments Give the best Accuracy
  • Template based alignment is a very efficient way
    to turn Naïve aligners into model based aligners
  • Sequence Alignments are not necessarily reliable
    over their entire lengths

38
www.tcoffee.org
  • Fabrice Armougom (CNRS, FR)
  • Sebastien Moretti (CNRS, FR)
  • Olivier Poirot (CNRS, FR)
  • Frederic Reinier (CRS4, IT)
  • Karsten Suhre (CNRS, FR)
  • Vladimir Saudek (Sanofi-Aventis, FR)
  • Des Higgins (UCD, IE)
  • Orla O'Sullivan (UCD, IE)
  • Iain Wallace (UCD, IE)
  • Victor Jongeneel (SIB/VitalIT, CH)
  • Bruno Nyfler (VitalIT, CH)
  • Roger Hersch (EPFL, CH)
  • Pierre Dumas (EPFL, CH)
  • Basile Schaeli (EPFL, CH)

www.tcoffee.org cedric.notredame_at_europe.com
39
www.tcoffee.org
40
(No Transcript)
41
Building and Using Models
35.67 Angstrom
42
Computing the Correct Alignment is a Complicated
Problem
43
Stochastic Optimization
44
Stochastic Optimization
  • Exploration of Complex Optimization Problems With
    Multiple Constraints
  • Genomic Alignments
  • RNA Alignments
  • Generation of Population of Suboptimal Solutions
  • Quality = f(optimality)
  • Specification of the Consistency Objective
    Function of T-Coffee

45
Three Types of Algorithms
  • Progressive: ClustalW
  • Iterative: Muscle
  • Consistency Based: T-Coffee and Probcons

46
T-Coffee and Consistency
  • Each Library Line is a Soft Constraint (a wish)
  • You can't satisfy them all
  • You must satisfy as many as possible (The easy
    ones)
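
One way to picture "satisfying as many wishes as possible" is to score a finished alignment by the share of library weight it honours. This is a hedged sketch, not T-Coffee's actual objective code: the library is assumed to be a dictionary of weighted residue pairs, the alignment a mapping from each residue to its column, and the name `consistency_score` is hypothetical.

```python
def consistency_score(library, column_of):
    """Fraction of total library weight whose residue pairs share a column.

    library:   {(residue_x, residue_y): weight}, one entry per soft constraint
    column_of: {residue: column index in the multiple alignment}
    """
    total = sum(library.values())
    # A constraint is satisfied when both residues were placed in the same column.
    satisfied = sum(w for (x, y), w in library.items()
                    if column_of[x] == column_of[y])
    return satisfied / total if total else 0.0


library = {
    (("A", 0), ("B", 0)): 88,   # this wish is honoured below
    (("A", 0), ("C", 1)): 77,   # this one is not
}
column_of = {("A", 0): 0, ("B", 0): 0, ("C", 1): 1}
print(consistency_score(library, column_of))
```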

47
Consistency Based Algorithms: T-Coffee
  • Gotoh (1990)
  • Iterative strategy using consistency
  • Martin Vingron (1991)
  • Dot Matrices Multiplications
  • Accurate but too stringent
  • Dialign (1996, Morgenstern)
  • Consistency
  • Agglomerative Assembly
  • T-Coffee (2000, Notredame)
  • Consistency
  • Progressive algorithm

48
How Good Is My Method ?
49
Structures Vs Sequences
50
Validation Using BaliBase
51
Too Many Methods for ONE Alignment: M-Coffee
52
(No Transcript)
53
Estimating the Accuracy of your MSA
54
What To Do Without Structures
55
3D-Coffee: Combining Sequences and Structures
Within Multiple Sequence Alignments
56
Expresso: Finding the Right Structure
Why Not Use Structure Based Alignments?
57
Template Based Multiple Sequence Alignments
58
Template Based Multiple Sequence Alignments
[Diagram labels: Sources; "-Structure -Profile -" (twice); Template Aligner; Templates (twice); Template Alignment; Source Template Alignment; Library; Remove Templates]
59
Method       Score         Templates    Prefab   Homstrad
-----------------------------------------------------------
ClustalW     Matrix        ----         61.80    ----
Kalign       Matrix        ----         63.00    ----
MUSCLE       Matrix        ----         68.00    45.0
-----------------------------------------------------------
T-Coffee     Consistency   ----         69.97    44.0
ProbCons     Consistency   ----         70.54    ----
Mafft        Consistency   ----         72.20    ----
M-Coffee     Consistency   ----         72.91    ----
MUMMALS      Consistency   ----         73.10    ----
-----------------------------------------------------------
Clustal-db   Matrix        Profiles     ----     ----
PRALINE      Matrix        Profiles     ----     50.2
PROMALS      Consistency   Profiles     79.00    ----
SPEM         Matrix        Profiles     77.00    ----
-----------------------------------------------------------
EXPRESSO     Consistency   Structures   ----     71.9
T-Lara       Consistency   Structures   ----     ----
-----------------------------------------------------------
Table 1. Summary of all the methods described in the
review. Validation figures were compiled from several
sources and selected for their compatibility. Prefab
refers to validation made on Prefab Version 3. The
HOMSTRAD validation was made on datasets having less
than 30% identity. The source of each figure is
indicated by a reference. The EXPRESSO figure comes
from a slightly more demanding subset of HOMSTRAD
(HOM39) made of sequences less than 25% identical.
60
Improving The Evaluation
61
How Do We Perform In The Twilight Zone?
  • Consistency Based Methods Have an Edge
  • Hard to tell Methods Apart
  • Sequence Alignment is NOT solved

62
More Than Structure based Alignments
  • Structural Correctness Is Only the Easy Side of
    the Coin.
  • In practice, MSAs are intermediate models used to
    generate other models

Data        Model Type     Benchmark
Homology    Profile        Yes
Evolution   Trees          No
Structure   3D-Structure   CASP
Function    Annotation     No
63
Conclusion
  • Template based Multiple Sequence Alignments
  • Projecting any relevant information onto the
    sequences
  • Using this Information
  • Need for new evaluation procedures
  • Functional Analysis
  • Phylogenetic Analysis
  • Homology Search (Profiles)
  • Homology Modelling
  • Integrating data → Making sure your bits of data
    can fight with one another

64
Turning Data into Models
  • Data
  • Columbus considered that the landmass occupied
    225°, leaving only 135° of water (Marinus of
    Tyre, 70 AD).
  • Columbus believed that 1° represented only 56
    miles (Alfraganus, IXth century)
  • He knew there was an island named Japan off the
    coast of China
  • Model
  • Circumference of the Earth: 25,255 km at most
  • Canary Islands to Japan: 3,700 km (Reality:
    12,000 km)

65
The More Structures The Merrier
Average Improvement over T-Coffee
Struc/Seq Ratio
66
The Right Mix of Methods
67
3D-Coffee: Combining Sequences and Structures
Within Multiple Sequence Alignments
68
Applications
69
Looking Up the DNA Behind the Sequences: PROTOGENE
70
SAR Analysis
  • Correlate Alignment Variations with Reactivity
  • Application to the Human Kinome
  • Collaboration with Sanofi-Aventis
  • Main Issue
  • Training problem → Proper Benchmarking

71
ncRNA Multiple Alignments with R-Coffee
  • Laundering the Genome Dark Matter
  • Cédric Notredame
  • Comparative Bioinformatics Group
  • Bioinformatics and Genomics Program

72
No Plane Today
73
ncRNAs Comparison
  • And ENCODE said
  • nearly the entire genome may be represented in
    primary transcripts that extensively overlap and
    include many non-protein-coding regions
  • Who Are They?
  • tRNA, rRNA, snoRNAs,
  • microRNAs, siRNAs
  • piRNAs
  • long ncRNAs (Xist, Evf, Air, CTN, PINK)
  • How Many of them
  • Open question
  • 30,000 is a common guess
  • Harder to detect than proteins

74
ncRNAs can have different sequences and Similar
Structures
75
ncRNAs are Difficult to Align
  • Same Structure → Low Sequence Identity
  • Small Alphabet, Short Sequences → Alignments
    often Non-Significant

76
Obtaining the Structure of a ncRNA is difficult
  • Hard to Align The Sequences Without the Structure
  • Hard to Predict the Structures Without an
    Alignment

77
The Holy Grail of RNA Comparison: Sankoff
Algorithm
78
The Holy Grail of RNA Comparison: Sankoff
Algorithm
  • Simultaneous Folding and Alignment
  • Time Complexity O(L^(2n))
  • Space Complexity O(L^(3n))
  • In Practice, for Two Sequences
  • 50 nucleotides 1 min. 6 M.
  • 100 nucleotides 16 min. 256 M.
  • 200 nucleotides 4 hours 4 G.
  • 400 nucleotides 3 days 3 T.
  • Forget about
  • Multiple sequence alignments
  • Database searches
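
The running times listed for two sequences grow roughly 16-fold each time the length doubles (1 min, 16 min, 4 hours, 3 days), which is what a fourth-power law would give. The sketch below only illustrates that arithmetic: it assumes pure power-law scaling with no constant overheads, and the exponent is a free parameter chosen to fit the listed times, not a statement about the algorithm's formal analysis. Both function names are hypothetical.

```python
def growth_factor(length_ratio: float, exponent: float) -> float:
    """Factor by which cost grows when length is multiplied by length_ratio,
    assuming cost is proportional to L ** exponent."""
    return length_ratio ** exponent


def extrapolate(cost_at_base: float, base_len: int, new_len: int, exponent: float) -> float:
    """Predict the cost at new_len from a measured cost at base_len."""
    return cost_at_base * growth_factor(new_len / base_len, exponent)


# Starting from 1 minute at 50 nucleotides, with an assumed exponent of 4.
for length in (50, 100, 200, 400):
    minutes = extrapolate(1.0, 50, length, exponent=4)
    print(f"{length} nucleotides: about {minutes:.0f} min")
```

With these assumptions the 400-nucleotide case comes out at 4096 minutes, a little under three days, in line with the figure on the slide.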

79
The Next Best Thing: Consan
  • Consan = Sankoff + a few constraints
  • Use of Stochastic Context Free Grammars
  • Tree-shaped HMMs
  • Made sparse with constraints
  • The constraints are derived from the most
    confident positions of the alignment
  • Equivalent of Banded DP

80
Going Multiple.
  • Structural Aligners

81
Game Rules
  • Using Structural Predictions
  • Produces better alignments
  • Is Computationally expensive
  • Use as much structural information as possible
    while doing as little computation as possible

82
Adapting T-Coffee To RNA Alignments
83
T-Coffee and Consistency
84
T-Coffee and Consistency
85
T-Coffee and Consistency
86
T-Coffee and Consistency
87
Consistency Conflicts and Information
[Diagram: residues X, Y, Z and W; captions "Y is unhappy" and "X is unhappy"]
Partly Consistent → Less Reliable
Fully Consistent → More Reliable
88
(No Transcript)
89
R-Coffee Scoring Scheme
R-Score(CC) = MAX(TC-Score(CC), TC-Score(GG))
[Diagram: two G-C base pairs]
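
The scoring rule on this slide, R-Score(CC) = MAX(TC-Score(CC), TC-Score(GG)), can be sketched as follows: when two residues are each base-paired in their own predicted structure, the score for aligning them is the better of their own T-Coffee score and that of their pairing partners. The callable `tc_score`, the `partner` dictionary and the fallback for unpaired residues are assumptions made for illustration only.

```python
def r_score(x, y, tc_score, partner):
    """Score for aligning residue x with residue y under the R-Coffee rule.

    tc_score(a, b): T-Coffee library score for aligning residues a and b
    partner:        maps a residue to its predicted base-pairing partner;
                    unpaired residues are simply absent
    """
    direct = tc_score(x, y)
    px, py = partner.get(x), partner.get(y)
    if px is None or py is None:
        # Assumption: with no base pair on one side, only the direct score applies.
        return direct
    # Both residues are paired: let the partners' score rescue a weak direct match.
    return max(direct, tc_score(px, py))


scores = {("c1", "c2"): 10, ("g1", "g2"): 40}
tc = lambda a, b: scores.get((a, b), 0)
partner = {"c1": "g1", "c2": "g2", "g1": "c1", "g2": "c2"}
print(r_score("c1", "c2", tc, partner))
```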
90
Validating R-Coffee
91
RNA Alignments are harder to validate than
Protein Alignments
  • Protein Alignments → Use of Structure based
    Reference Alignments
  • RNA Alignments → No Real structure based reference
    alignments
  • The structures are mostly predicted from
    sequences
  • Circularity

92
BraliBase and the BraliScore
  • Database of Reference Alignments
  • 388 multiple sequence alignments.
  • Evenly distributed between 35 and 95 percent
    average sequence identity
  • Contain 5 sequences selected from the RNA family
    database Rfam
  • The reference alignment is based on a SCFG model
    based on the full Rfam seed dataset (100
    sequences).

93
BraliBase SPS Score
SPS = Number of Identically Aligned Pairs / Number of Aligned Pairs
Reference: Rfam MSA
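
A minimal sketch of the SPS ratio, assuming that "aligned pairs" means residue pairs sharing a column and that the denominator counts the pairs of the reference (Rfam) alignment. The helper names are hypothetical and alignments are given as lists of equal-length gapped strings.

```python
from itertools import combinations


def column_pairs(msa):
    """Set of residue pairs sharing a column; residues are (sequence index, ungapped position)."""
    position = [0] * len(msa)
    pairs = set()
    for col in range(len(msa[0])):
        residues = []
        for s, row in enumerate(msa):
            if row[col] != "-":
                residues.append((s, position[s]))
                position[s] += 1
        pairs.update(combinations(residues, 2))
    return pairs


def sps(test_msa, reference_msa):
    """Fraction of the reference's aligned pairs that are reproduced in the test alignment."""
    reference = column_pairs(reference_msa)
    if not reference:
        return 0.0
    return len(column_pairs(test_msa) & reference) / len(reference)


reference = ["AC-G", "ACTG"]
test = ["ACG-", "ACTG"]
print(sps(test, reference))
```

Here two of the three reference pairs are recovered, since the test alignment pairs the final G of the first sequence with T instead of G.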
94
BraliBase SCI Score
[Figure: each sequence (Seq1 to Seq6) is folded on its own ("R N A p f o l d" in the slide), giving a structure such as ((()))((..)) and a free energy DG; RNAlifold folds the whole alignment using covariance, giving ((()))((..)) and an alignment free energy ALN DG. Remaining labels: "Average DG Seq X Cov", "SCI", "DG ALN".]
95
BRaliScore
  • Braliscore = SCI × SPS

96
RM-Coffee: Regular Aligners
               Avg Braliscore          Net Improv.
Method         direct   T      R       T      R
----------------------------------------------------
Poa            0.62     0.65   0.70    48     154
Pcma           0.62     0.64   0.67    34     120
Prrn           0.64     0.61   0.66    -63    45
ClustalW       0.65     0.65   0.69    -7     83
Mafft_fftns    0.68     0.68   0.72    17     68
ProbConsRNA    0.69     0.67   0.71    -49    39
Muscle         0.69     0.69   0.73    -17    42
Mafft_ginsi    0.70     0.68   0.72    -49    39
----------------------------------------------------
RM-Coffee4     0.71     /      0.74    /      84
97
How Best is the Best.

Method         vs. R-Coffee-Consan   vs. RM-Coffee4
Poa            241                   217
T-Coffee       241                   199
Prrn           232                   198
Pcma           218                   151
Proalign       216                   150
Mafft fftns    206                   148
ClustalW       203                   136
Probcons       192                   128
Mafft ginsi    170                   115
Muscle         169                   111
M-Locarna      234                   183
Stral          169                   62
FoldalignM     146                   61
Murlet         130                   -12
Rnasampler     129                   -27
T-Lara         125                   -30
98
Range of Performances
Effect of Compensated Mutations
99
Conclusion/Future Directions
  • T-Coffee/Consan is currently the best MSA
    protocol for ncRNAs
  • Testing how important the accuracy of the
    secondary structure prediction is
  • Going deeper into Sankoff's territory: predicting
    and aligning simultaneously

100
Credits and Web Servers
  • Andreas Wilm
  • Des Higgins
  • Sebastien Moretti
  • Ioannis Xenarios
  • Cedric Notredame
  • CGR, SIB, UCD

www.tcoffee.org cedric.notredame_at_europe.com
101
(No Transcript)
102
(No Transcript)