Konstantinos Mavrommatis - PowerPoint PPT Presentation

1
Metagenomic Analysis: Benchmarking Methods and Tools
  • Konstantinos Mavrommatis
  • Genome Biology Group
  • DOE Joint Genome Institute
  • KMavrommatis_at_lbl.gov

2
Metagenomes
  • The vast majority of living microorganisms are
    difficult to study in isolation because they fail
    to grow under laboratory conditions or depend on
    other organisms for critical processes. Isolating
    and sequencing DNA from mixed communities of
    organisms circumvents this obstacle and allows
    the study of organisms and the relations between
    them. This new type of study led to the emergence
    of a new field, referred to as metagenomics.
  • The aggregated genome of a metagenome is derived
    from a pool of cells, some of which are
    genetically related and probably correspond to
    different strains of the same species, and some
    of which are genetically distinct.

3
Background: Microbial Community Metagenomes
[Figure: sampled microbial communities. Labels: Gutless Worm, Planktonic Archaea, EBPR Sludge, Groundwater, AMD, Nanoarchaea, Alaskan Soil, Termite Hindgut, TA-degrading Bioreactor, Antarctic Bacterioplankton, Hypersaline Mats, Soil Archaea, Korarchaeota Enrichment]
4
Complexity
[Figure: communities placed on a log-scale species complexity axis (1 to 10,000). Labels: Acid Mine Drainage, Sargasso Sea, Soil, Human Gut, Termite Hindgut]
5
Metagenomic analysis
[Figure: analysis workflow. Labels: pyrosequencing; high-throughput sequencing; 3, 8, 40 kb; methods developed for isolate genomes; NO FINISHING!]
6
Outline
  • Tools developed to facilitate study of
    metagenomes
  • Integrated Microbial Genomes with Microbiome
    Samples (IMG/M)
  • (Markowitz et al., Bioinformatics, 2006 Jul 15;
    22(14))
  • Evaluation of methods used to analyze
    metagenomes: simulated datasets
  • (Mavromatis et al., submitted for publication)

7
IMG/M Synopsis
  • Premise
  • Effective metagenome analysis requires a
    comprehensive data management system
  • Valuable even with immature data processing
  • Strategy
  • Develop system supporting systematic
    collection, management, and maintenance of
    metagenomic data in the context of integrated
    isolate microbial genome data (IMG)
  • Important for assessing/improving quality of data
    processing
  • Aims
  • Initial
  • Validate premise by developing experimental
    system based on IMG
  • Current
  • Provide support for metagenomic projects at JGI
  • http://img.jgi.doe.gov/cgi-bin/m/main.cgi

8
IMG/M Data Content
Isolate Genomes
9
Questions
  • Metagenomic analysis
  • Who is there? (find the organisms that are
    present in a sample)
  • Broad grouping
  • Identify species
  • Find strains
  • What is the role of each organism?
  • What is the metabolic capacity of the community?
  • Why is this community different from another?

10
IMG/M Data Exploration Metagenome Summaries
11
Find organisms
12
IMG/M SNP Analysis
13
What is the role of an organism in the community?
14
IMG/M Data Analysis: What is the metabolic potential?
Gene on pathway in context of other Sludge/Accumulibacter (blue) and IMG/M (orange) genes
15
IMG/M Function Abundance Overviews
16
IMG/M Overview
  • IMG/M integrates microbial community (microbiome)
    genome data with IMG's isolate microbial genomes
    and provides support for the comparative analysis
    of the aggregate microbiome genomes (metagenomes)
    in the context of isolate microbial genomes and
    other metagenomes.

17
Outline
  • Tools developed to facilitate study of
    metagenomes
  • Integrated Microbial Genomes with Microbiome
    Samples (IMG/M)
  • (Markowitz et al., Bioinformatics, 2006 Jul 15;
    22(14))
  • Evaluation of methods used to analyze
    metagenomes: simulated datasets
  • (Mavromatis et al., submitted for publication)

18
Benchmarking?
  • Methods for assembly and gene prediction were
    developed for the analysis of isolate genomes
  • How accurate are they in metagenomes?
  • Create simulated metagenomes from isolate genomes
  • Assemble, predict genes, bin the metagenomes
  • Track reads of individual organisms in the
    assembled contigs and bins.
  • Evaluate each assembly method
  • Compare gene predictions against isolate genomes
  • Evaluate binning methods
  • Provide datasets for benchmarking of new methods
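
The workflow above depends on knowing which isolate genome each read came from. A minimal sketch of that idea follows, assuming reads are drawn uniformly along each genome and that an organism's share of reads is proportional to its relative abundance times its genome length. The function name `simulate_reads`, its parameters, and the toy sequences are hypothetical and are not taken from the presentation.

```python
import random


def simulate_reads(genomes, n_reads, read_len, abundances, seed=0):
    """Draw labelled reads from isolate genomes.

    genomes:    dict mapping organism name -> genome sequence (str)
    abundances: dict mapping organism name -> relative abundance (assumed input)
    Returns a list of (read_id, source, start, sequence) tuples, so the
    source organism of every read can be tracked after assembly or binning.
    """
    rng = random.Random(seed)
    names = sorted(genomes)
    for name in names:
        if len(genomes[name]) < read_len:
            raise ValueError(f"genome {name} is shorter than the read length")
    # Assumption: sampling weight = abundance * genome length, so that
    # abundance behaves like per-genome read depth.
    weights = [abundances[n] * len(genomes[n]) for n in names]
    reads = []
    for i in range(n_reads):
        source = rng.choices(names, weights=weights, k=1)[0]
        seq = genomes[source]
        start = rng.randrange(0, len(seq) - read_len + 1)
        reads.append((f"read{i}", source, start, seq[start:start + read_len]))
    return reads


# Toy usage with made-up sequences.
toy_genomes = {"orgA": "ACGT" * 50, "orgB": "GGCA" * 100}
toy_reads = simulate_reads(toy_genomes, 20, 50, {"orgA": 1.0, "orgB": 2.0})
```

Keeping the source label on every read is what later makes it possible to score contigs and bins against the known truth.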

19
Simulated datasets
  • Acid mine drainage (Tyson et al., Nature 2004)
  • EBPR sludge (Garcia Martin et al., in review)
  • Farm soil (Tringe et al., Science 2005)
[Figure: the three source communities placed on a log-scale species complexity axis, 1 to 10,000]
20
Dataset composition
All three simulated datasets (simsets) comprise 100,000 Sanger reads.
Where possible, we tried to match GC content, genome size, and read depth.
21
Assembly overview
  • Assembly
  • Phrap (Phil Green, U. Wash.)
  • JAZZ (JGI)
  • Arachne (Broad Institute)

22
Assembly
  • Phrap assembles the largest amount of sequence.
  • Arachne and JAZZ are more conservative.

23
Assembly fidelity
24
Assembly comparison
25
Assembly summary
  • Phrap is the greediest assembler.
  • Arachne and JAZZ are more conservative and
    produce better quality contigs.
  • Small contigs (< 5000 nt) are more likely to be
    chimeric.
  • The simulated soil dataset is practically
    unassembled.
  • In the simulated AMD dataset, the presence of
    closely related species creates chimeric contigs
    by mixing reads from different strains.
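
When the source genome of every read is known, chimerism can be measured per contig. The sketch below assumes a contig counts as chimeric when it contains reads from more than one source genome, and it reports the share of reads from the most common source as a purity score. The names `contig_purity` and `is_chimeric` and the `min_purity` threshold are illustrative choices, not the definitions used in the study.

```python
from collections import Counter


def contig_purity(read_sources):
    """Return (majority_source, fraction of reads from that source)."""
    counts = Counter(read_sources)
    majority, n = counts.most_common(1)[0]
    return majority, n / len(read_sources)


def is_chimeric(read_sources, min_purity=1.0):
    """Assumed definition: chimeric if purity falls below min_purity.

    With the default of 1.0, any contig mixing reads from two or more
    source genomes is flagged.
    """
    return contig_purity(read_sources)[1] < min_purity


# Toy usage: a contig built from four reads, one from a second strain.
print(contig_purity(["strainA", "strainA", "strainB", "strainA"]))
```

Lowering `min_purity` would tolerate a few stray reads, which may be useful when closely related strains are treated as one population.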

26
Gene prediction overview
  • Gene prediction
  • fgenesb (Victor Solovyev, SoftBerry UK)
  • ORNL pipeline (Critica, Glimmer)

27
Gene prediction
Predicted genes
  • fgenesb predicts genes more accurately.
  • Quality of assembly affects gene prediction.
  • The greedy Phrap assembly allows a larger number
    of predicted genes in contigs.
  • Gene prediction is similar for all datasets when
    singlets are included.

[Figure: predicted gene counts in two panels, "Genes in contigs" and "Genes in contigs and singlets", for each assembler (JAZZ, Phrap, Arachne) on each dataset (AMD, sludge, soil), shown against the genes of the original dataset. Value labels in transcript order, not assignable to individual bars: 135924, 16968, 17390, 92941, 86083, 38192, 7687, 48734, 9834, 29839, 1882, 7676]
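
Because the simulated reads come from annotated isolate genomes, predicted genes can be scored against the known annotation. The slides do not say how predictions were matched, so the sketch below assumes one common convention: a predicted gene counts as found when its strand and stop coordinate agree with a reference gene. The function name and the dictionary layout are hypothetical.

```python
def gene_prediction_scores(reference, predicted):
    """Return (sensitivity, precision) of a gene prediction.

    Each gene is a dict with "strand" and "stop" keys. Matching on
    (strand, stop) is an assumption made for this sketch; it ignores
    disagreements about the start codon.
    """
    ref_keys = {(g["strand"], g["stop"]) for g in reference}
    pred_keys = {(g["strand"], g["stop"]) for g in predicted}
    hits = len(ref_keys & pred_keys)
    sensitivity = hits / len(ref_keys) if ref_keys else 0.0
    precision = hits / len(pred_keys) if pred_keys else 0.0
    return sensitivity, precision


# Toy usage with made-up coordinates.
ref = [{"strand": "+", "stop": 900}, {"strand": "-", "stop": 1500},
       {"strand": "+", "stop": 2400}, {"strand": "+", "stop": 3300}]
pred = [{"strand": "+", "stop": 900}, {"strand": "+", "stop": 2400},
        {"strand": "-", "stop": 5000}]
print(gene_prediction_scores(ref, pred))
```

On fragmented assemblies many genes are truncated at contig ends, so a real comparison would also need a rule for partial genes.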
28
Binning overview
  • Pattern discovery order (PhyloPythia).
  • Sequence composition strain (7mer, 8mer)
  • 7mer: NNNNNNN; 8mer: NNxNNxNN
  • Assignment based on BLAST hits class (spdi)
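
The two composition patterns named above can be read as masks over a sliding window. The sketch below assumes that `N` marks a position that is kept and `x` marks a position that is ignored, so `NNNNNNN` yields contiguous 7-mers and `NNxNNxNN` yields six-letter words taken from an 8-base window. That reading of `x`, and the function name `pattern_counts`, are assumptions rather than details confirmed by the slides.

```python
from collections import Counter


def pattern_counts(seq, pattern):
    """Count words extracted from seq with a mask like 'NNxNNxNN'.

    'N' positions are kept; any other character (here 'x') is skipped.
    Windows whose kept positions include a non-ACGT base are ignored.
    """
    keep = [i for i, c in enumerate(pattern) if c == "N"]
    width = len(pattern)
    counts = Counter()
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        word = "".join(window[i] for i in keep)
        if set(word) <= set("ACGT"):
            counts[word] += 1
    return counts


# Toy usage on a made-up sequence.
print(pattern_counts("ACGTACGT", "NNxNNxNN"))
```

Normalising these counts into frequency vectors would give one composition profile per contig, which a binning method could then cluster or classify.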

29
Binning assignment
  • Does the majority of the contigs belong to the
    predicted taxonomic group?
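
The question above can be turned into a simple check per bin and per taxonomic level. The sketch below assumes each contig carries the known lineage of its source organism as a dictionary and that a bin is correct at a level when more than half of its contigs match the predicted taxon. The function name, the dictionary layout, and the 50% threshold are illustrative assumptions.

```python
def bin_correct_at_level(contig_lineages, predicted, level):
    """True if more than half of the contigs match `predicted` at `level`.

    contig_lineages: list of dicts such as {"class": ..., "order": ...}
    """
    matches = sum(1 for lin in contig_lineages if lin.get(level) == predicted)
    return matches * 2 > len(contig_lineages)


# Toy usage: a bin of three contigs from two orders of the same class.
rhodo = {"class": "Betaproteobacteria", "order": "Rhodocyclales"}
burk = {"class": "Betaproteobacteria", "order": "Burkholderiales"}
toy_bin = [rhodo, rhodo, burk]
print(bin_correct_at_level(toy_bin, "Betaproteobacteria", "class"))
print(bin_correct_at_level(toy_bin, "Burkholderiales", "order"))
```

Running the same check at several levels shows how a bin can be right at a broad rank while being mixed at a finer one.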

30
Bin assignment to taxonomic levels
Actual content
Predicted content
31
Bin assignment to taxonomic levels
  • Performance depends on the dataset
  • PhyloPythia assigns bins more accurately
  • OF has a high rate of wrongly assigned bins
  • No method produces highly accurate results

32
Conclusions ???
  • Assembly
  • Fidelity: Arachne > JAZZ > Phrap
  • Volume: Phrap >> JAZZ > Arachne
  • Extensive chimerism of contigs < 5 kb
  • Binning
  • Isolate typically assigned to multiple bins
    (unavoidable?)
  • Individual bin typically contains more than one
    isolate
  • PhyloPythia is more consistent
  • Gene calling
  • fgenesb > ORNL pipeline
  • There is no gold standard for metagenomic
    analysis. Each tool/ combination of tools has
    different advantages and disadvantages.
  • Each dataset poses different challenges and
    requires different handling.
  • Iterative process or use of multiple methods
    could be the key.

33
Thanks to...
  • IMG/M
  • Development
  • Ernest Szeto, Krishna Palaniappan, Frank
    Korzeniewski
  • Inna Dubchak et al.
  • Biology
  • Nikos Kyrpides, Natalia Ivanova, Athanasios
    Lykides, Iain Anderson, Kostas Mavromatis
  • Phil Hugenholtz, Hector Garcia Martin, Victor
    Kunin, Falk Warnecke
  • Benchmark
  • Assembly
  • Eugene Goltzman, Kerrie Barry, Harris Shapiro
  • Binning
  • Frank Korzeniewski, Asaf Salamov, Isidore
    Rigoutsos, Alice McHardy (IBM)
  • Gene prediction
  • Asaf Salamov, Miriam Land (Oak Ridge)
  • Funding
  • DOE/JGI, Berkeley National Lab R&D Program