Title: Creating the Genomic Encyclopedia for Bacteria and Archaea
1. Creating the Genomic Encyclopedia for Bacteria
and Archaea
- Rick Stevens, Argonne National Laboratory and
The University of Chicago
- Eddy Rubin, Joint Genome Institute, Berkeley Lab
Rob Edwards, Jonathan A. Eisen, Ross Overbeek,
George Garrity, Veronika Vonstein, Sveta Gerdes,
Folker Meyer, Kevin White, Tim Lilburn, Barney
Whitman, et al.
2. The Basic Idea of the Project
- To build an enterprise that can take advantage of
the expected exponential improvements of
sequencing capabilities to sequence all known
cultured and described prokaryotes
- Ride the expected Moore's law of sequencing
capability
- To develop a distributed high-throughput
industrial approach to the cultivation,
characterization, sequencing, annotation and
analysis of prokaryotic genomes
- Build a team from groups that have expertise and
track records
- To build and curate a database of genome
sequences, metabolic reconstructions, and
standardized phenotype assays associated with
each target organism
- Streamline the release of data, provide a
foundation for derivative projects
3. Concept of the Bergey's/GEBA Sequencing Project
- A fixed-cost annual investment
- Each year more can be sequenced as sequencing
costs decrease and as cultivation efficiencies
improve based on experience
- Leverage the expected improvement of sequencing
costs
- Address the overall scope within 5 to 6 years
- Increase the number of near-complete sequences
per year
- Optimize the choice of organisms to maximize
diversity at each stage
- Exploit the Bergey's Trust and International
Committee on Systematics of Prokaryotes for
taxonomic coverage (e.g. Garrity and Whitman)
- Involve the microbiology community for
prioritization
- Industrialize the pipeline
- Biological Resource Centers to produce and
characterize type material
- DOE JGI, NIAID/DMID Centers, NSF/USDA Centers for
sequencing
- Laboratories for bioinformatics (Argonne, JGI,
TIGR, ORNL, etc.)
- Universities and laboratories for modeling and
analysis
4. The Question is not if, but When and How?
- Why should we want to accelerate this transition?
- Why not just let it happen as a matter of course?
- What is in the current sequencing pipeline?

             Completed Genomes   Ongoing/In the Pipeline
  Archaeal          29                     56
  Bacterial        397                    991
  Eukaryal          44                    631

- The existing process of bottom-up selection of
organisms for sequencing is leaving many
important groups underrepresented; closure will
take a long time
- There are groups that are well represented in the
literature, but not in the sequencing databases
- Underrepresentation is also an issue in
environmental sequencing data
5. Tapping into prokaryotic biodiversity -
Industrial Biotechnology
- Rapidly growing field
  - by 2010 biocatalysis will be used in
    production of 60% of fine chemicals (McKinsey
    analysis)
  - In US coordinated by USDA Biobased Products
    and Bioenergy Coordination Council (BBCC)
- Applications
  - pharmaceuticals
  - food ingredients (sweeteners, vitamins)
  - feed additives and other agrochemicals
  - organic solvents
  - polymer raw materials
  - biofuels
- Advantages over chemical methods
  - exquisite substrate specificity
  - excellent chemo-, regio- and
    stereoselectivity
  - environmentally friendly green chemistry
    based on biorenewables
- Needed
  - novel enzymes and pathways
Straathof et al. 2002. Curr Opin Biotechnol
13:548-56
150 compounds are currently produced on
industrial scale using biocatalysts. Examples:
Hans E. Schoemaker et al. 2003. Science
299:1694-97
6. Analysis of 1000s of new bacterial genomes will
likely yield completely novel pathways and
enzymes for industrial applications
Examples of recently discovered biocatalytic
transformations of novel organic functional
groups
- Current approaches to discovery of new enzymes
  - Screening environmental samples by enrichment
    cultures (BUT only <<1% of prokaryotes are
    currently culturable)
  - Metagenome approach: cloning and expression of
    DNA samples in a surrogate host, then screening
    for desired function (BUT only known functions
    can be screened for, new biochemistry cannot be
    discovered)
  - Sequence-based discovery (growing
    explosively, generating knowledge base for basic
    sciences and biotechnological applications)
L.P. Wackett. 2004. Current Opinion in
Biotechnology, 15:280-284
Still to be discovered: enzymes involved in the
biosynthesis or catabolism of approximately 40
naturally occurring chemical functional groups
are not yet known
7. Building the Case
- There is a disparity between the literature and
the existing genomes
- We can't fully exploit the community's historical
knowledge and investments without closing this
gap
- There is a disparity between the rank/abundance
curves from 16S studies and from environmental
sequencing projects and the existing genomes
- We can't fully understand the new datasets
without closing this gap (i.e. lack of complete
sequence coverage of known culturables is holding
back future work)
- There are likely to be new biochemical pathways
and novel enzymes in the set of culturable but
unsequenced organisms; sequencing non-cultured
organisms would further expand diversity
- These represent the low-hanging fruit for
discovery since the investment has already been
made in determining culture conditions
- A comprehensive database produced under
controlled conditions that includes phenotype
data and genotype data will accelerate research
in understanding the genotype-phenotype
relationship
- Genome-scale reconstruction and modeling will be
dramatically accelerated by comprehensive
databases that include phenotype data
8. Estimated Sequencing Rates
Selection of Targets
Produce DNA
Sequencing Assembly
Rapid Annotation (24 Hours)
Metabolic Reconstruction
Model Generation
Phenotype Prediction
Database Repository
9. Technical Feasibility FAQ
- How many genomes would the project propose to
sequence?
  - About 5000 over 5-7 years
- Who would produce the biomass needed for DNA
extraction?
  - Type culture centers until enrichment and
    environmental methods mature
- Will the biomass/DNA be available for
distribution?
  - Yes, both the DNA and the libraries could be
    stored for distribution
- What throughput is needed for DNA production?
  - 300 taxa per year at the beginning of the
    project, rising to 2000 per year at the end
- What combinations of sequencing technologies need
to be employed?
  - Sanger and pyrosequencing initially, others as
    they come online
- What throughput is needed for annotation?
  - 24-hour turnaround from assembled sequence to
    initial availability; this has already been
    achieved at Argonne, TIGR and elsewhere
- Is it possible to have a standard set of
phenotype assays given the broad spectrum of
organisms and conditions?
  - We are considering Biolog as a model, but it is
    too limited
- How would the genomes be selected and
prioritized?
  - At each cycle we choose genomes (e.g. via 16S)
    to minimize the diversity gaps
  - Community input would be solicited to ensure
    the project is tracking the community's
    interests
- Is it necessary to close the genomes?
  - We think not. Libraries would be archived for
    groups that might be interested in closing.
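The throughput figures in the FAQ can be checked against the "about 5000 over 5-7 years" target with a short calculation. This is a minimal sketch that assumes throughput grows geometrically from 300 to 2000 taxa per year; the slides do not specify the shape of the ramp, and the function name `ramp` is illustrative.

```python
def ramp(start, end, years):
    """Yearly throughput growing geometrically from start to end.

    Assumption: a constant year-over-year growth factor. The slides
    give only the first-year and final-year rates.
    """
    if years < 2:
        raise ValueError("need at least two years to ramp")
    growth = (end / start) ** (1.0 / (years - 1))
    return [round(start * growth ** y) for y in range(years)]


for years in (5, 6, 7):
    plan = ramp(300, 2000, years)
    print(f"{years} years: {plan} -> total {sum(plan)}")
```

Under this assumption the cumulative totals run from roughly 4,800 genomes (5 years) to roughly 6,600 (7 years), which brackets the stated target.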
10. The Project Would Provide a Comprehensive Set
of Genome Sequences for
- Biofuels, and bioproduction of alternative
feedstocks
- Understanding and managing the microbial carbon
cycle
- Soil and subsurface microbial ecology
- Bioremediation and bioconversion of waste streams
- Evolution and microbial ecological dynamics
- Context for environmental sequencing and
metagenomics
- Basis for developing predictive models of
phenotypes
- Source of components for synthetic biology
- Improving our understanding of cultivability
- Dramatically improving the reliability and
quality of genome annotations
11. How Many Known Cultured Organisms?
- Latest version of the Prokaryotic Taxonomic
Outline will contain 7951 named species of
Bacteria and Archaea.
- Of these, 178 are non-cultivable or not
represented by viable type material.
- An additional 1222 are synonyms.
- Of the 6543 type strains for which viable
material is reportedly deposited, we have
assembled a minimal set of 6389 strains that are
available from 16 major public culture
collections or biological resource centers in the
US, Europe, and Asia.
- The remaining 154 are in minor or non-public
collections.
- This information is derived from Release 6.1 of
the Taxonomic Outline of the Prokaryotes, which
will be published in 2007 and is current through
May 2006.
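The counts on this slide can be reconciled with simple subtraction. The sketch below uses only the figures quoted above; the variable names are illustrative.

```python
named_species = 7951           # named species of Bacteria and Archaea
no_viable_type_material = 178  # non-cultivable or no viable type material
synonyms = 1222
deposited_type_strains = 6543  # viable material reportedly deposited
minimal_public_set = 6389      # available from 16 major public collections

# Strains held only in minor or non-public collections.
remaining = deposited_type_strains - minimal_public_set

# Named species left after removing the two excluded categories.
after_exclusions = named_species - no_viable_type_material - synonyms

print(remaining)                                  # 154
print(after_exclusions)                           # 6551
print(after_exclusions - deposited_type_strains)  # 8
```

The 154 figure follows directly from the two strain counts. Removing the 178 and 1222 from 7951 leaves 6551, eight more than the 6543 reported; the slide does not say how those eight are accounted for.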
12. What Has Been Sequenced or is In Play
- Of the 6400 strains available from public sources:
- About 380 are human, animal or plant pathogens
- On the order of 1/3-1/2 of the known pathogens
have been sequenced
- 360 complete prokaryotic genomes published
- 56 archaeal and 940 bacterial genomes in progress
- Of 897 prokaryotic genomes in progress in GOLD:
- 400 are pathogens (many duplicate taxa)
- 221 are supported by DOE (156 biotech, 51
environment)
- Approximately 5000 prokaryotes not yet in play
- We estimate about 4800 non-pathogen taxa
13. Strain Distribution in Collections
- US Collections / BRCs (number of strains)
  - American Type Culture Collection (ATCC): 4027
  - USDA ARS Collection (NRRL): 223
- European Collections
  - Deutsche Sammlung von Mikroorganismen (DSMZ): 1302
  - Culture Collection University of Gothenburg (CCUG): 183
  - Pasteur Institute (CIP): 170
  - Laboratory for Microbiology, Gent (LMG): 101
  - National Collection of Industrial and Marine Bacteria: 25
  - French Collection of Phytopathogens (CFBP): 15
  - National Collection of Type Cultures (NCTC): 12
  - National Collection of Phytopathogenic Bacteria: 11
- Asia
  - Japan Collection of Microorganisms (JCM): 185
  - Institute of Fermentation, Osaka (IFO): 34
  - Korean Collection of Type Cultures (KCTC): 28
  - Institute of Applied Microbiology, Tokyo (IAM): 26
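The per-collection counts above can be tallied by region to show how concentrated the type material is. This is a minimal sketch using the numbers as listed; the short dictionary keys are labels chosen here for brevity.

```python
strains = {
    "US": {"ATCC": 4027, "NRRL": 223},
    "Europe": {
        "DSMZ": 1302, "CCUG": 183, "CIP": 170, "LMG": 101,
        "Industrial and Marine Bacteria": 25,
        "French Phytopathogens": 15,
        "NCTC": 12,
        "Phytopathogenic Bacteria": 11,
    },
    "Asia": {"JCM": 185, "IFO": 34, "KCTC": 28, "IAM": 26},
}

# Sum the strain counts within each region, then overall.
region_totals = {region: sum(c.values()) for region, c in strains.items()}
grand_total = sum(region_totals.values())

for region, total in region_totals.items():
    print(f"{region:7s} {total:5d} {total / grand_total:6.1%}")
print(f"{'All':7s} {grand_total:5d}")
```

The 14 collections listed on this slide account for 6342 strains in total: 4250 in the US, 1819 in Europe and 273 in Asia.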
14. Distribution of Genome Sizes in the Pipeline
Average sequence: 4 Mbp
15. Getting Value from the Genomes
- Genomes would be assembled by the groups doing
the sequencing
- Assembled contigs would be sent to the initial
high-throughput annotation server for draft
annotations and immediately published on-line
- The accumulated (additional) genomes will be used
to improve annotations (gene calls, functional
coupling)
- Genomes will be integrated into databases to
support comparative analysis and evolutionary
analysis
- Annotated genomes can be used to
semi-automatically construct genome-scale models
which could be used to make metabolic phenotype
predictions
16. Background
- online at http://www.sequencingbergeys.org
- login required (just ask us)
- guest read-only access after the meeting?
- make maximum information available
- Bergey hierarchy, NCBI taxonomy, 16S rRNA, strain
collections, GOLD, SEED,
17. List of organisms for sequencing
- based on 16S clusters
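One way to choose targets that "minimize the diversity gaps" is a greedy farthest-first pass over 16S distances: at each step, pick the unsequenced taxon whose nearest already-covered neighbor is farthest away. The sketch below is an illustration of that idea, not the project's actual selection method; the taxon labels and distances are made-up toy values, and `pick_diverse` is a hypothetical name.

```python
def pick_diverse(distance, already_sequenced, n_picks):
    """Greedy farthest-first selection over a symmetric distance table.

    distance[a][b] is a 16S-based distance between taxa a and b.
    Each round picks the candidate whose closest covered taxon is
    farthest away, then treats that candidate as covered.
    """
    covered = list(already_sequenced)
    if not covered:
        raise ValueError("need at least one sequenced taxon as a seed")
    candidates = [t for t in distance if t not in covered]
    picks = []
    for _ in range(min(n_picks, len(candidates))):
        best = max(candidates,
                   key=lambda t: min(distance[t][c] for c in covered))
        picks.append(best)
        covered.append(best)
        candidates.remove(best)
    return picks


# Toy pairwise distances (hypothetical values, not real data).
pairs = {
    ("A", "B"): 0.02, ("A", "C"): 0.15, ("A", "D"): 0.30, ("A", "E"): 0.28,
    ("B", "C"): 0.14, ("B", "D"): 0.31, ("B", "E"): 0.29,
    ("C", "D"): 0.25, ("C", "E"): 0.22,
    ("D", "E"): 0.05,
}
distance = {}
for (a, b), d in pairs.items():
    distance.setdefault(a, {})[b] = d
    distance.setdefault(b, {})[a] = d

print(pick_diverse(distance, already_sequenced=["A"], n_picks=2))
# ['D', 'C']
```

With only A sequenced, D is chosen first because it is the most distant from A. C follows because, once D is covered, E sits close to D and B sits close to A, so neither adds much new diversity.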
18. Cluster Page
select strain for cluster
19. Bergey Browser
20. Species Page