Title: Konstantinos Mavrommatis
1Metagenomic Analysis Benchmarking Methods and
Tools
- Konstantinos Mavrommatis
- Genome Biology Group
- DOE Joint Genome Institute
- KMavrommatis_at_lbl.gov
2Metagenomes
- The vast majority of living microorganisms are
difficult to study in isolation because they fail
to grow under laboratory conditions or depend on
other organisms for critical processes. Isolating
and sequencing DNA from mixed communities of
organisms circumvents this obstacle and allows the
study of the organisms and the relations between them.
This new type of study led to the emergence of a
new field, referred to as metagenomics.
- The aggregate genome of a metagenome is derived
from a pool of cells, some of which are
genetically related and probably correspond to
different strains of the same species, and some of
which are genetically distinct.
3Background Microbial Community Metagenomes
- Gutless Worm
- Planktonic Archaea
- EBPR Sludge
- Groundwater
- AMD
- Nanoarchaea
- Alaskan Soil
- Termite Hindgut
- TA-degrading Bioreactor
- Antarctic Bacterioplankton
- Hypersaline Mats
- Soil Archaea
- Korarchaeota Enrichment
4Complexity
[Figure: Acid Mine Drainage, Sargasso Sea, Soil, Human Gut and Termite Hindgut communities placed along a species complexity axis (log scale: 1, 10, 100, 1000, 10000)]
5Metagenomic analysis
pyrosequencing
High throughput sequence
Methods developed for isolate genomes
3, 8, 40 kb
NO FINISHING!
6Outline
- Tools developed to facilitate study of
metagenomes
- Integrated Microbial Genomes with Microbiome
Samples (IMG/M)
- (Markowitz et al. Bioinformatics, 2006 Jul 15;
22(14))
- Evaluation of methods used to analyze
metagenomes
- Simulated datasets
- (Mavromatis et al. submitted for publication)
7IMG/M Synopsis
- Premise
- Effective metagenome analysis requires a
comprehensive data management system
- Valuable even with immature data processing
- Strategy
- Develop a system supporting systematic
collection, management, and maintenance of
metagenomic data in the context of integrated
isolate microbial genome data (IMG)
- Important for assessing/improving quality of data
processing
- Aims
- Initial
- Validate premise by developing experimental
system based on IMG
- Current
- Provide support for metagenomic projects at JGI
- http://img.jgi.doe.gov/cgi-bin/m/main.cgi
8IMG/M Data Content
Isolate Genomes
9Questions
- Metagenomic analysis
- Who is there? (find the organisms that are
present in a sample)
- Broad grouping
- Identify species
- Find strains
- What is the role of each organism?
- What is the metabolic capacity of the community?
- Why is this community different from another?
10IMG/M Data Exploration Metagenome Summaries
11Find organisms
12IMG/M SNP Analysis
13What is the role of an organism in the community ?
14IMG/M Data Analysis What is the metabolic potential?
Gene on pathway in context of other genes
Sludge/Accumulibacter (blue) and IMG/M (orange)
15IMG/M Function Abundance Overviews
16IMG/M Overview
- IMG/M integrates microbial community (microbiome)
genome data with IMG's isolate microbial genomes
and provides support for the comparative analysis
of the aggregate microbiome genomes (metagenomes)
in the context of isolate microbial genomes and
other metagenomes.
17Outline
- Tools developed to facilitate study of
metagenomes
- Integrated Microbial Genomes with Microbiome
Samples (IMG/M)
- (Markowitz et al. Bioinformatics, 2006 Jul 15;
22(14))
- Evaluation of methods used to analyze
metagenomes
- Simulated datasets
- (Mavromatis et al. submitted for publication)
18Benchmarking ?
- Methods for assembly and gene prediction were
developed for isolate genome analysis
- How accurate are they in metagenomes?
- Create simulated metagenomes from isolate genomes
- Assemble, predict genes, bin the metagenomes
- Track reads of individual organisms in the
assembled contigs and bins.
- Evaluate each assembly method
- Compare gene predictions against isolate genomes
- Evaluate binning methods
- Provide datasets for benchmarking of new methods
19Simulated datasets
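[Figure: EBPR sludge, farm soil and acid mine drainage placed along a species complexity axis (log scale: 1, 10, 100, 1000, 10000)]
- Acid mine drainage (Tyson et al. Nature 2004)
- Farm soil (Tringe et al. Science 2005)
- EBPR sludge (Garcia Martin et al. in review)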
20Dataset composition
All 3 simsets comprise 100,000 Sanger
reads. Where possible, we tried to match GC
content, genome size and read depth.
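A minimal sketch of how a simulated dataset of this kind could be drawn from isolate genomes, keeping the source genome of every read so that it can be tracked through assembly and binning. The function name, the default read length and the abundance weighting are illustrative assumptions, not the parameters used to build the simsets.

```python
import random

def simulate_reads(genomes, abundances, n_reads, read_len=800, seed=0):
    """Sample reads from isolate genomes, remembering each read's origin.

    genomes:    dict name -> DNA string
    abundances: dict name -> relative abundance (any positive weights)
    Returns a list of (read_id, source_genome, sequence) tuples.
    read_len=800 is an assumed, illustrative value.
    """
    rng = random.Random(seed)
    names = list(genomes)
    # Weight by abundance * genome length so that read depth per genome
    # is proportional to its abundance.
    weights = [abundances[n] * len(genomes[n]) for n in names]
    reads = []
    for i in range(n_reads):
        name = rng.choices(names, weights=weights, k=1)[0]
        seq = genomes[name]
        start = rng.randrange(0, max(1, len(seq) - read_len + 1))
        reads.append((f"read{i}", name, seq[start:start + read_len]))
    return reads
```

Because every read carries its source label, any contig or bin built from these reads can later be scored against the known composition.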
21 Assembly overview
- Assembly
- Phrap (Phil Green, U. Wash.)
- JAZZ (JGI)
- Arachne (Broad Institute)
22Assembly
- Phrap assembles the largest amount of sequence.
- Arachne and jazz are more conservative
23Assembly fidelity
24Assembly comparison
25Assembly summary
- Phrap is the greediest assembler.
- Arachne and jazz are more conservative and
produce better quality contigs.
- Small contigs (< 5000 nt) are more likely to be
chimeric.
- Sim. soil is practically unassembled.
- In sim. AMD, the presence of closely related
species creates chimeric contigs by mixing reads
from different strains.
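Since the origin of every simulated read is known, chimerism can be measured directly. The sketch below assumes two hypothetical inputs, a mapping from contig to its reads and a mapping from read to source genome, and a purity cutoff chosen only for illustration; it is not the scoring used in the benchmark.

```python
from collections import Counter

def contig_purity(contig_reads, read_source):
    """Return {contig: (dominant_source, purity)}.

    purity = fraction of the contig's reads that come from its most
    common source genome; 1.0 means all reads share one source.
    Assumes every contig has at least one read.
    """
    result = {}
    for contig, reads in contig_reads.items():
        counts = Counter(read_source[r] for r in reads)
        source, n = counts.most_common(1)[0]
        result[contig] = (source, n / len(reads))
    return result

def chimeric_contigs(contig_reads, read_source, min_purity=1.0):
    """Contigs whose purity falls below min_purity (hypothetical cutoff)."""
    purity = contig_purity(contig_reads, read_source)
    return sorted(c for c, (_, p) in purity.items() if p < min_purity)
```

Labelling sources at strain level rather than species level would expose the strain mixing described for sim. AMD.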
26Gene prediction overview
- Gene prediction
- fgenesb (Victor Solovyev, SoftBerry UK)
- ORNL pipeline (Critica, Glimmer)
27Gene prediction
Predicted genes
- fgenesb predicts genes more accurately than the
ORNL pipeline
- Quality of assembly affects gene prediction
- Greedy phrap yields a larger number of predicted
genes in contigs.
- Gene prediction is similar for all datasets when
singlets are included
[Figure: predicted gene counts for each assembler (jazz, phrap, Arachne) on each dataset (AMD, Sludge, soil), in two panels, "Genes in contigs" and "Genes in contigs and Singlets", shown against the genes of the original dataset. Values on the slide: 135924, 16968, 17390, 92941, 86083, 38192, 7687, 48734, 9834, 29839, 1882, 7676]
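Comparing predicted genes against the isolate annotation can be sketched as follows. The matching rule used here (a prediction counts as correct if its 3' end and strand agree with a reference gene) is an assumption made for illustration, not necessarily the criterion applied in the benchmark, and the sketch assumes predicted coordinates have already been mapped onto the isolate genome.

```python
def stop_position(gene):
    """3' end of a gene given as (seq_id, start, end, strand), start <= end."""
    seq_id, start, end, strand = gene
    return (seq_id, end if strand == "+" else start, strand)

def compare_predictions(reference, predicted):
    """Score predicted genes against reference genes by shared 3' end."""
    ref_stops = {stop_position(g) for g in reference}
    pred_stops = {stop_position(g) for g in predicted}
    matched = ref_stops & pred_stops
    return {
        "matched": len(matched),
        # fraction of reference genes that were recovered
        "sensitivity": len(matched) / len(ref_stops) if ref_stops else 0.0,
        # fraction of predictions that correspond to a reference gene
        "precision": len(matched) / len(pred_stops) if pred_stops else 0.0,
    }
```

Matching on the 3' end tolerates differences in the predicted start site, which is useful when genes are truncated at contig or read boundaries.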
28Binning overview
- Pattern discovery order (PhyloPythia).
- Sequence composition strain(7mer, 8mer)
- 7mer NNNNNNN, 8mer NNxNNxNN
- Assignment based on blast hits class (spdi)
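The sequence composition features can be illustrated with a short sketch that counts words read through a pattern. It assumes that "N" marks a position that is read and "x" a position that is skipped, which is one reading of the 8mer pattern NNxNNxNN; it only builds the composition profile and does not perform the binning itself.

```python
from collections import Counter

def masked_kmer_profile(sequence, pattern):
    """Relative frequencies of words read through a pattern such as
    'NNNNNNN' (contiguous 7mer) or 'NNxNNxNN' (8 positions, 6 read).

    'N' positions are kept, 'x' positions are skipped (assumed meaning).
    """
    keep = [i for i, c in enumerate(pattern) if c == "N"]
    width = len(pattern)
    seq = sequence.upper()
    counts = Counter()
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        word = "".join(window[i] for i in keep)
        if set(word) <= set("ACGT"):  # skip words with ambiguous bases
            counts[word] += 1
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()} if total else {}
```

Profiles computed this way for each contig can then be compared or clustered, so that contigs with similar composition end up in the same bin.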
29Binning assignment
- Does the majority of the contigs belong to the
predicted taxonomic group?
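With simulated data the question above can be answered per bin, because the true taxon of every contig is known. A minimal sketch, in which the three input mappings and the simple majority rule are assumptions made for illustration:

```python
from collections import Counter

def evaluate_bins(bin_contigs, predicted_taxon, true_taxon):
    """For each bin, return (fraction_in_predicted_taxon, majority_ok).

    bin_contigs:     dict bin_id -> list of contig ids (non-empty)
    predicted_taxon: dict bin_id -> taxon the method assigned to the bin
    true_taxon:      dict contig id -> taxon known from the simulation
    """
    report = {}
    for bin_id, contigs in bin_contigs.items():
        counts = Counter(true_taxon[c] for c in contigs)
        in_predicted = counts.get(predicted_taxon[bin_id], 0)
        fraction = in_predicted / len(contigs)
        # the bin counts as correctly assigned if more than half of its
        # contigs belong to the predicted taxon
        report[bin_id] = (fraction, fraction > 0.5)
    return report
```

Running the same check with taxa expressed at different ranks (for example order, class or strain) shows how accuracy changes with taxonomic level.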
30Bin assignment to taxonomic levels
Actual content
Predicted content
31Bin assignment to taxonomic levels
- Performance depends on the dataset
- PhyloPythia assigns bins more accurately
- OF has a high percentage of wrongly assigned bins
- No method produces highly accurate results
32Conclusions ???
- Assembly
- Fidelity: Arachne > Jazz > Phrap
- Volume: Phrap >> Jazz > Arachne
- Extensive chimerism of contigs < 5 kb
- Binning
- Isolate typically assigned to multiple bins
(unavoidable?)
- Individual bin typically contains more than one
isolate
- PhyloPythia is more consistent
- Gene calling
- fgenes > ORNL pipeline
- There is no gold standard for metagenomic
analysis. Each tool or combination of tools has
different advantages and disadvantages.
- Each dataset poses different challenges and
requires different handling.
- An iterative process or the use of multiple
methods could be the key.
33Thanks to...
- IMG/M
- Development
- Ernest Szeto, Krishna Palaniappan, Frank
Korzeniewski
- Inna Dubchak et al.
- Biology
- Nikos Kyrpides, Natalia Ivanova, Athanasios
Lykides, Iain Anderson, Kostas Mavromatis
- Phil Hugenholtz, Hector Garcia Martin, Victor
Kunin, Falk Warnecke
- Benchmark
- Assembly
- Eugene Goltzman, Kerrie Barry, Harris Shapiro
- Binning
- Frank Korzeniewski, Asaf Salamov, Isidore
Rigoutsos, Alice McHardy (IBM)
- Gene prediction
- Asaf Salamov, Miriam Land (Oak Ridge)
- Funding
- DOE/JGI, Berkeley National Lab RD Program