Title: Integrated Microbial Community Genomes (IMCG) Data Management System
1. Integrated Microbial Community Genomes (IMCG) Data Management System
eGenomics Meeting, Sep 7-9 2005
Victor M. Markowitz (BDMTC, CRD)
Nikos Kyrpides (MGAP, JGI)
Natalia Ivanova (MGAP, JGI)
Phil Hugenholtz (MEP, JGI)
2. Synopsis
- Problem
  - Metagenomics needs new analysis methods, such as for
    - Determining gene functions and metabolic capacity of microbial communities and member species
      - Ex: metabolic pathways involved in biomass conversion and biofuel production in termite hindguts
    - Studying intra-population variants and their correlation with environmental parameters
  - Metagenome analysis is
    - Data and computation intensive
    - Iterative: involves evolving data sets and methods
- Solution
  - Integrated Microbial Community Genomes (IMCG) system
    - Collect and manage metagenome data
    - Support metagenome analysis in an integrated context
3. Significance
4. Metagenome Data Collection Issues
Sargasso Sea
Soil
Acid mine drainage
Data Quality, Precision
Gene classification
Environmental attributes
Existing data repositories not designed to record
metagenome data
5. Metagenome Data Analysis Issues
Hypothesis
Genome
- Challenges
  - Individual organism genomes are poorly characterized
    - Breadth, depth, quality, precision
  - Diversity of data sources required for analysis
    - Data integration
  - Microbial community genomes
    - May require different concepts for modeling, analysis
6. Example: Functional Characterization
7. Rationale
- Premise
  - Effective metagenome analysis requires a comprehensive data management system
- Strategy
  - Develop IMCG system supporting systematic collection, management, maintenance of metagenome data in context of
    - Integrated isolate microbial genome data (IMG)
    - Environmental, geographical, geochemical data
- Impact
  - Accelerate pace of metagenome projects at LBNL, JGI, etc.
  - Serve as community resource for metagenome data
8. Opportunity
- Context
  - A metagenome data management system, recognized as a critical resource for the past several years, is not available
- Expertise
  - Microbial genome analysis and microbial ecology (JGI)
  - Biological data management system development (BDMTC)
  - Large scale data storage and computing (NERSC)
- IMG System
  - High quality, evolving repository for microbial genome data
  - Critical foundation for a metagenome system
  - Provides glimpse into value of IMCG
9. IMCG and IMG Relationship
Data Analysis
Data Repository
Data Processing
Data Acquisition
- Community diversity
- Unique genes in community
- Marker genes
- Pathway analysis
- COG analysis
- Comparative analysis
- ...
- Location (long/lat)
- Morphology
- Env conditions
- Physiological cond.
- Temperature
- pH conditions
- Amt collected/used
- Est biomass
- ...
10. Metagenome Data in IMG/M
- Next
- New
- 2 sludge data sets
- gutless worm
- termite hindgut
- Existing
- AMD
- 4-7 Sargasso Sea data sets
- reassembled
- Soil
11. Metagenome Data in IMG
Genes in comparative context
Gene similarity wrt isolate genomes
12. Roadmap
- Start
  - Official start of project: LBNL RD funding, Oct 2005
- Milestones
  - Apr 2006: IMCG alpha/preview version
  - Oct 2006: first public version
- Strategy
  - Load selected data sets: gutless worm, termite hindgut, etc.
  - Emphasis on data quality, precision
  - System will evolve based on community feedback
- Funding
  - Grants, collaborations
- Challenge
  - Long term funding for maintaining a community resource
13. 16S Clone Library Sequence Submission Survey
14. 16S Clone Library Sequence Submission Survey
- Select Habitat Type
- Water
- Wastewater
- Soil
- Extreme Habitats
- Host-associated
- Anthropogenic
- Sediment
- New Habitat
15. Example
16. Survey - Response
17. Survey - Response
SAI group          Field                         Relevant for sample type   Yes votes   No votes
Submitter SAI      Contact details               all                        105         13
Submitter SAI      Contact email address         all                        118         6
Clone Library SAI  DNA / RNA extraction method   all                        85          26
Clone Library SAI  Forward primer name           all                        107         10
Clone Library SAI  Forward primer sequence       all                        102         12
Clone Library SAI  Reverse primer name           all                        105         9
Clone Library SAI  Reverse primer sequence       all                        102         12
Clone Library SAI  Annealing temperature         all                        92          20
Clone Library SAI  PCR cycle number              all                        81          29
Clone Library SAI  Cloning vector                all                        82          29
http://www.jgi.doe.gov/16s/
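The vote counts above can be turned into a rough ranking of how strongly respondents supported each field. The following is a minimal sketch, not part of the survey itself: the `yes_share` helper and the idea of ranking by the share of "yes" votes are illustrative assumptions, and only the counts come from the slide.

```python
# (yes, no) vote counts copied from the survey response table above.
votes = {
    "Contact details": (105, 13),
    "Contact email address": (118, 6),
    "DNA / RNA extraction method": (85, 26),
    "Forward primer name": (107, 10),
    "Forward primer sequence": (102, 12),
    "Reverse primer name": (105, 9),
    "Reverse primer sequence": (102, 12),
    "Annealing temperature": (92, 20),
    "PCR cycle number": (81, 29),
    "Cloning vector": (82, 29),
}


def yes_share(yes: int, no: int) -> float:
    """Fraction of cast votes in favour of recording the field (hypothetical metric)."""
    total = yes + no
    return yes / total if total else 0.0


# Sort fields from strongest to weakest support.
ranked = sorted(votes.items(), key=lambda item: yes_share(*item[1]), reverse=True)

for name, (yes, no) in ranked:
    print(f"{name:<30} {yes_share(yes, no):6.1%}  ({yes} yes / {no} no)")
```

Ranking by share rather than by raw "yes" count matters here because the number of respondents differs from field to field.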
18. Sequence-Associated Information (SAI)
SAI group    Field                               Relevant for sample type   Yes votes   No votes
Sample SAI   Latitude and Longitude              all                        66          9
Sample SAI   Sampling location description       all                        36          1
Sample SAI   Sample size                         all                        88          22
Sample SAI   URL for additional information      all                        92          18
Sample SAI   Sample type                         all                        169         0
Sample SAI   Temperature                         all                        84          10
Sample SAI   Sample treatment and preservation   all                        17          2
(Examples)   pH                                  Water                      31          8
(Examples)   Salinity                            Water                      35          5
(Examples)   Dissolved organic carbon            Water                      27          12
(Examples)   Moisture                            Soil                       32          4
(Examples)   Ground cover / vegetation           Soil                       16          3
(Examples)   Agricultural use                    Soil                       17          3
(Examples)   Host species                        Host-associated            21          1
(Examples)   Anatomical site                     Host-associated            22          2
(Examples)   Association type                    Host-associated            20          2
http://www.jgi.doe.gov/16s/
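The table distinguishes fields relevant to all samples from fields that only apply to one sample type. One way this could be modelled is sketched below. The `SampleSAI` class, its attribute names, the Celsius unit, and the `missing_fields` check are all hypothetical; only the field names and their sample types are taken from the table.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Sample-type-specific fields, copied from the "(Examples)" rows of the table.
HABITAT_FIELDS: Dict[str, List[str]] = {
    "Water": ["pH", "Salinity", "Dissolved organic carbon"],
    "Soil": ["Moisture", "Ground cover / vegetation", "Agricultural use"],
    "Host-associated": ["Host species", "Anatomical site", "Association type"],
}


@dataclass
class SampleSAI:
    """Hypothetical record for sample-level sequence-associated information."""

    sample_type: str                       # relevant for all sample types
    latitude: Optional[float] = None       # relevant for all sample types
    longitude: Optional[float] = None
    temperature_c: Optional[float] = None  # unit is an assumption
    # Sample-type-specific values, keyed by the field names in HABITAT_FIELDS.
    extra: Dict[str, str] = field(default_factory=dict)

    def missing_fields(self) -> List[str]:
        """List the sample-type-specific fields that have not been filled in."""
        expected = HABITAT_FIELDS.get(self.sample_type, [])
        return [name for name in expected if name not in self.extra]


sample = SampleSAI(sample_type="Water", latitude=31.5, longitude=-63.5,
                   extra={"pH": "8.1"})
print(sample.missing_fields())
```

Keeping the type-specific values in a separate mapping means a new sample type only needs a new entry in `HABITAT_FIELDS`, not a change to the record layout.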
19. Some quotes from the survey
- The best is to make as many menus as possible, each free text fields gives more chances for errors and inconsistencies.
- entisol, andisol, inceptisol, gelisol, histisol, aridisol, vertisol, alfisol, mollisol, ultisol, spodosol, oxisol
- Even Marine / Freshwater can be broken down further into Freshwater Lake, Stream, River, Pond, Artificial environment. My point is that it may be difficult to request all of this...
so this participant is for free text fields!!
-> Trade-off between versatility and increased chance of inconsistencies!!
http://www.jgi.doe.gov/16s/
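The trade-off between menus and free text can be illustrated with the habitat types from the submission survey. The sketch below is an assumption about how a submission form might check entries, not a description of the actual survey system; `HABITAT_MENU`, `normalize`, and `resolve_habitat` are hypothetical names, and the "New Habitat" option is left out because it stands for adding an entry rather than for a habitat.

```python
from typing import Optional

# Habitat types offered as a menu in the submission survey.
HABITAT_MENU = [
    "Water", "Wastewater", "Soil", "Extreme Habitats",
    "Host-associated", "Anthropogenic", "Sediment",
]


def normalize(text: str) -> str:
    """Lower-case the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


_LOOKUP = {normalize(name): name for name in HABITAT_MENU}


def resolve_habitat(free_text: str) -> Optional[str]:
    """Map a free-text entry to a menu term, or None if it does not match."""
    return _LOOKUP.get(normalize(free_text))


# Free-text entries that a human would all read as "soil".
entries = ["soil", "Soil ", "SOIL", "Soils", "forest soil"]
resolved = [resolve_habitat(entry) for entry in entries]

for entry, term in zip(entries, resolved):
    print(f"{entry!r:15} -> {term}")
```

Case and whitespace differences are easy to absorb, but "Soils" and "forest soil" fall through: a menu rejects them outright, while a free-text field would store them as distinct values. That is the inconsistency risk the first quote warns about, and the lost detail ("forest") is the versatility the last quote asks for.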