Title: Content, Format, and Standards in Genomics Scale Data
1Content, Format, and Standards in Genomics
Scale Data
- The ILSI EBI Collaboration
- Wm. B. Mattes, PhD, DABT
2Outline
- Why do we need a database for toxicogenomics
- How is it envisioned that this will be developed
- What are the issues for such a database
- Who is involved in such development
- The ILSI EBI Collaboration
3Traditional Biology
One tree at a time
4Omic Biology
Forests and Mountains
5Challenge of Genomics
- It's the informatics, period!
- And it's awfully tempting to take shortcuts!
6Why do we need a database?
- Volume of data
- Traditional endpoints per animal
- <20 histopathology observations
- <10 gross measurements (e.g. weights, food)
- <25 serum measurements
- <10 urinalysis measurements
- Genomic endpoints per animal
- 5,000-10,000 transcripts !!!
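To put these numbers side by side, here is a rough tally. It is only a sketch: it treats each "fewer than" figure above as an upper bound, and the variable names are illustrative.

```python
# Upper bounds on traditional endpoints per animal, taken from the list above.
traditional = {
    "histopathology observations": 20,
    "gross measurements": 10,
    "serum measurements": 25,
    "urinalysis measurements": 10,
}

# Range of transcripts measured per animal, from the list above.
genomic_low, genomic_high = 5_000, 10_000

traditional_total = sum(traditional.values())
ratio_low = genomic_low / traditional_total
ratio_high = genomic_high / traditional_total

print(f"Traditional endpoints per animal: fewer than {traditional_total}")
print(f"Genomic endpoints per animal: {genomic_low:,}-{genomic_high:,}")
print(f"At least {ratio_low:.0f}x to {ratio_high:.0f}x more data points per animal")
```

Because the traditional counts are upper bounds, the true ratio is at least this large.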
7Why do we need a database? (cont)
- Influence of technology details
- Influence of probe sequence
- Many genes are alternatively spliced; such events may not be detected unambiguously by a microarray
8Influence of Probe Sequence
Most arrays target this region of the mRNA!
9Why do we need a database? (cont)
- Influence of technology details
- Influence of probe sequence
- Many genes are alternatively spliced; such events may not be detected unambiguously by a microarray
- For cDNA arrays, probes may hybridize to more than one sequence
- A database that captures probe sequence is required to resolve discrepancies through automated bioinformatics
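As a minimal sketch of what such automated bioinformatics might look like once probe sequences are stored, the example below flags probes that match more than one transcript. It is a deliberate simplification: it uses exact substring matching rather than a real hybridization or alignment model, and all sequences and names are invented toy data.

```python
def transcripts_hit(probe, transcripts):
    """Return the names of transcripts that contain the probe sequence exactly."""
    return sorted(name for name, seq in transcripts.items() if probe in seq)


# Toy sequences, invented for illustration only.
transcripts = {
    "geneA_variant1": "ATGGCCTTAGGCTAACCGGTTAAGC",
    "geneA_variant2": "ATGGCCTTAGGCTTTTTGGTTAAGC",
    "geneB": "CCCGGGTTAAGCATATAT",
}

# Hypothetical probes: one in a shared 3' region, one in a variant-specific region.
probes = {
    "probe_3prime": "GGTTAAGC",
    "probe_internal": "CTAACCGG",
}

for probe_id, seq in probes.items():
    hits = transcripts_hit(seq, transcripts)
    status = "ambiguous" if len(hits) > 1 else "specific"
    print(probe_id, status, hits)
```

In this toy data the 3' probe matches both splice variants and an unrelated gene, while the internal probe matches a single variant. That is the kind of discrepancy that cannot be resolved without the probe sequence.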
10How are databases being developed?
- Microarray Gene Expression Data Society (MGED Society)
- MIAME: Minimum Information About a Microarray Experiment
- The minimum information that should be reported about a microarray experiment to enable its unambiguous interpretation and reproduction
- Essentially, what should go into the database
11How are databases being developed?
- MIAME Basic Areas
- Experiment Design
- Samples used, extract preparation and labeling
- Hybridization procedures and parameters
- Measurement data and specifications
- Array Design
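One way to picture "minimum information" is as a checklist that a submission either satisfies or does not. The sketch below uses the five areas listed above as required sections. The key names and the example submission are hypothetical, not part of the MIAME specification.

```python
# The five MIAME areas listed above, written as hypothetical section keys.
MIAME_AREAS = (
    "experiment_design",
    "samples_extraction_labeling",
    "hybridization",
    "measurements",
    "array_design",
)


def missing_areas(submission):
    """Return the MIAME areas that are absent or empty in a submission dict."""
    return [area for area in MIAME_AREAS if not submission.get(area)]


# Hypothetical submission; the field contents are placeholders.
submission = {
    "experiment_design": {"type": "dose response"},
    "samples_extraction_labeling": {"organism": "rat", "label": "Cy3"},
    "hybridization": {},
    "measurements": {"file": "raw_intensities.txt"},
}

print(missing_areas(submission))  # ['hybridization', 'array_design']
```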
12How are databases being developed? (cont)
- MGED Society
- MAGE
- Programming conventions and data structures to communicate Microarray Gene Expression data
- MAGE-OM Object Model
- MAGE-ML Markup Language
- Essentially, how the data is exchanged / how the database is constructed
13How are databases being developed? (cont)
- MGED Society
- Ontology working group
- Ontologies provide a vocabulary for representing and communicating knowledge about a topic, allowing interpretation and use by computers
- MGED Ontology will provide standard terms for the annotation of microarray experiments, allowing
- structured queries
- unambiguous descriptions of experiments
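A small sketch of why standard terms enable structured queries: free-text annotations from different submitters are mapped to one controlled term before querying. The synonym table below is invented for illustration and does not reproduce actual MGED Ontology terms.

```python
# Hypothetical synonym table; these are NOT actual MGED Ontology terms.
SYNONYMS = {
    "liver": "liver",
    "hepatic tissue": "liver",
    "kidney": "kidney",
    "renal tissue": "kidney",
}


def normalize(term):
    """Map a free-text annotation to its controlled term, or None if unknown."""
    return SYNONYMS.get(term.strip().lower())


def query_by_organ(records, controlled_term):
    """Return the ids of records whose organ annotation maps to the given term."""
    return [r["id"] for r in records if normalize(r["organ"]) == controlled_term]


# Invented records showing inconsistent free-text annotation.
experiments = [
    {"id": "E1", "organ": "Liver"},
    {"id": "E2", "organ": "hepatic tissue "},
    {"id": "E3", "organ": "Renal tissue"},
]

print(query_by_organ(experiments, "liver"))  # ['E1', 'E2']
```

Without the mapping step, a query for "liver" would miss E2 even though it describes the same tissue.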
14How are databases being developed? (cont)
- MGED Society
- Data Transformation and Normalization Working Group
- Standards for recording how microarray data are transformed and normalized.
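To illustrate what "recording how data are transformed" can mean in practice, the sketch below applies a log2 transform followed by median centering and returns a history of the steps alongside the values. Median centering is just one common choice, used here as an assumption. The history format is hypothetical and is not the working group's standard.

```python
import math
import statistics


def log2_median_center(intensities):
    """Log2-transform intensities, then subtract the median log value."""
    logged = [math.log2(x) for x in intensities]
    center = statistics.median(logged)
    values = [v - center for v in logged]
    # Record what was done so another site can reproduce or reverse it.
    history = [
        {"step": "log_transform", "base": 2},
        {"step": "median_center", "median_subtracted": center},
    ]
    return values, history


# Invented raw intensities for illustration.
values, history = log2_median_center([100.0, 200.0, 400.0, 800.0, 1600.0])

print([round(v, 6) for v in values])
for entry in history:
    print(entry)
```

Storing the history with the data lets a reader tell whether two datasets were processed the same way before comparing them.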
15What are the issues for a toxicogenomics database?
- Scope of the ILSI effort
- Genotoxicity Group
- 10 array platforms
- 11 compounds
- >2 time points, up to 10 doses / compound
- Nephrotoxicity Group
- 6 array platforms
- 3 compounds, 260 animals
16What are the issues for a toxicogenomics database?
- Scope of the ILSI effort
- Hepatotoxicity Group
- 8 array platforms
- 2 compounds, 144 animals
- 2 in-life studies / compound
- ALL Groups
- Analysis of each sample at multiple sites
17What are the issues for toxicogenomics databases? (cont)
- Traditional toxicology endpoints are not currently covered by MAGE, MIAME, or the MGED Ontologies!
- Organ weights
- Clinical pathology
- Histopathology
- Etc
18What are the issues for toxicogenomics databases?
- Traditional toxicology endpoints are not standardized in nomenclature
- Clinical pathology/chemistry
- AACC
- IUPAC
- Histopathology
- STP
- WHO/IARC/RITA
- NACAD
- SNOMED
- NTP, TDMS Database Pathology Code Table
19Who is involved in database development
- Private Companies
- Genelogic, Iconix, Curagen
- MSU - dbZach
- NIEHS - CEBS
- NCTR - ArrayTrack
- ILSI - EBI
20ILSI-HESI and EBI collaboration
- Establishment of database for toxicogenomics data
- Capture, store and analyse gene expression data produced from many different toxicogenomic experiments, conducted in several different laboratories worldwide by the ILSI-HESI members
- Interrogate the gene array data, integrating information from genomic, experimental and toxicological domains
- Gain knowledge of possible links between gene expression changes and toxicological endpoints
21ILSI-HESI and EBI collaboration
- Aims of the database and tools
- Provide a way to integrate the different domains
- Control the annotation to achieve data harmonization
- Centralize the information to ease data access and data sharing
- Improve array annotations as the genome assemblies are released
- ALLOW data comparison
22ILSI-HESI and EBI collaboration
- Main challenge
- Get internally consistent data to allow comparability among the experiments and run complex queries across and within domains
- Note: Experiments were conducted at 40 different sites, using different array platforms and terminologies, measuring parameters in different units and storing information in different formats!
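The unit problem alone shows why internal consistency takes work. As a minimal sketch, the example below converts organ weights reported in different mass units to grams before comparison. The site names and values are invented, and the record layout is an assumption.

```python
# Conversion factors from each reported unit to grams.
TO_GRAMS = {"mg": 0.001, "g": 1.0, "kg": 1000.0}


def to_grams(value, unit):
    """Convert a mass to grams, rejecting units that are not recognized."""
    if unit not in TO_GRAMS:
        raise ValueError(f"Unrecognized unit: {unit!r}")
    return value * TO_GRAMS[unit]


# Invented reports from three hypothetical sites.
reports = [
    {"site": "A", "liver_weight": 12.5, "unit": "g"},
    {"site": "B", "liver_weight": 0.0118, "unit": "kg"},
    {"site": "C", "liver_weight": 13100, "unit": "mg"},
]

harmonized = {
    r["site"]: round(to_grams(r["liver_weight"], r["unit"]), 3) for r in reports
}
print(harmonized)  # {'A': 12.5, 'B': 11.8, 'C': 13.1}
```

Rejecting unknown units, rather than guessing, keeps a mislabeled value from silently entering the database.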
23ILSI-HESI and EBI collaboration
- Simple question
- Does expression of gene X go up after treatment with compound Y, with biological endpoint Z, in experiments from ILSI-HESI members A and B?
- Not simple question
- Which are the most reproducible gene expression changes (and what is the quantitative measure of this reproducibility) for all experiments on the rat arrays with biological endpoint X, which functional categories do these genes belong to, and which are their human homologues?
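Once the data are harmonized, the simple question reduces to a filter over consistent records. The sketch below assumes a flat record layout with a log2 expression ratio per observation. The field names, the threshold and the values are all hypothetical.

```python
# Invented, already-harmonized records; field names are assumptions.
records = [
    {"member": "A", "gene": "X", "compound": "Y", "endpoint": "Z", "log2_ratio": 1.4},
    {"member": "B", "gene": "X", "compound": "Y", "endpoint": "Z", "log2_ratio": 0.9},
    {"member": "B", "gene": "X", "compound": "Q", "endpoint": "Z", "log2_ratio": -0.3},
]


def is_up_in_all(records, gene, compound, endpoint, members, threshold=0.0):
    """True if every listed member has data and all its log2 ratios exceed threshold."""
    for member in members:
        ratios = [
            r["log2_ratio"]
            for r in records
            if r["member"] == member
            and r["gene"] == gene
            and r["compound"] == compound
            and r["endpoint"] == endpoint
        ]
        # A member with no matching data cannot confirm the change.
        if not ratios or min(ratios) <= threshold:
            return False
    return True


print(is_up_in_all(records, "X", "Y", "Z", ["A", "B"]))  # True
print(is_up_in_all(records, "X", "Q", "Z", ["A", "B"]))  # False
```

The "not simple" question would additionally need a reproducibility statistic, functional annotation and homologue mapping, which is where the integrated domains come in.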
24MIAME/Tox
- An international effort aiming to
- Share expertise
- Encourage harmonization
- Promote standardization initiative
- A call for community participation!
25MIAME/Tox objectives
- Standard contextual information
- Establish worldwide scientific consensus on the minimal information descriptors for array-based toxicogenomics experiments
- Data harmonization
- Encourage use of controlled vocabularies for the toxicological assessments
- Data integration and data sharing
- Link data within a study
- Link several studies from one institution
- Exchange datasets among institutions
- Data storage
- Facilitate development of MIAME/Tox-compliant data management software and databases
- ArrayExpress at EBI and CEBS at NIEHS-NCT
26MIAME/Tox document
- Promote standard contextual information
- Defining the core common to most experiments
- Minimum/sufficient information
- Structured information
- Promote data harmonization, data capture and communication
- MIAME/Tox is based on MIAME
- Focus on toxicological domain
- Sample treatment and conventional toxicology information
- Clinical pathology, pathology, histopathology
27MIAME/Tox document
- Available at the MGED Society and ILSI-HESI web sites
- Circulate for consensus
- Toxicogenomics, pharmacogenomics and ecotoxicogenomics communities
- Regulatory bodies
- MGED Meeting (AAAS, Denver, Feb 2003; MGED6, France, Sept 2003)
- Toxicology societies (SOT Meeting, Salt Lake City, March 2003)
- Review and publish
- Work closely with the MGED working groups
- Ontology working group
- Identify controlled vocabularies for
toxicological metadata
28Data Input As a Key Step
- Capture data in a standard manner
- Tox-MIAMExpress
- Store information domains in database
- ArrayExpress
- Compare/query across and within domains
29Tox-MIAMExpress
30Tox-MIAMExpress
- Array designs
- A set of procedures for formatting the array design information into a standard referencing format (ADF)
- A set of procedures to re-annotate or update the array designs via a link to another database at EBI (EnsMart)
31Tox-MIAMExpress
- Experiment
- Experiment design, quality controls, publications
- Sample source and treatment
- Conventional toxicology test data
- Microarray hybridization data
32Tox-MIAMExpress
33Tox-MIAMExpress
34Tox-MIAMExpress
35ILSI-HESI and EBI collaboration
- Status
- Interface and database infrastructure developed
- Data input ongoing
36Acknowledgments
- Microarray Informatics Team at EBI, in particular
- Alvis Brazma (Team Leader and MGED Society President)
- Susanna-Assunta Sansone
- Philippe Rocca-Serra (Data Management)
- NIEHS-NCT and NTP
- ILSI-HESI EBI Steering Committee
- ILSI-HESI Genomics Committee