Title: Community Standards and Comparative Genomics
1Community Standards and Comparative Genomics
- Doreen Ware, USDA-ARS
- Cold Spring Harbor Laboratory
- Plant Ontology Workshop, May 30, 2006
2Outline
- Introduction to Gramene
- Data integration: why is it important?
- Example using genome sequencing and mapping
- Why are standards important for data management
and integration?
- Repositories, protocols, nomenclature,
vocabulary, formats, user interfaces
- Comparative resources
- PlexDB (www.plexdb.org), Pathway Tools in
Gramene, Reactome (www.reactome.org)
3Gramene www.gramene.org
4Gramene
- As an information resource, Gramene's purpose is
to provide added value to data sets available
within the public sector, which will facilitate
researchers' ability to leverage the rice genomic
sequence to identify and understand corresponding
genes, pathways and phenotypes in other crop
grasses.
- This is achieved by building automated and
curated relationships between rice and other
cereals for both sequence and biology. The
automated and curated relationships are queried
and displayed using controlled vocabularies and
web-based displays.
- The controlled vocabularies (ontologies)
currently in use include the Gene Ontology,
Plant Ontology, Trait Ontology, Environment
Ontology and Gramene Taxonomy Ontology.
- The web-based displays for phenotypes include the
Genes and Quantitative Trait Loci (QTL) modules.
Sequence-based relationships are displayed in the
Genomes module using the genome browser adapted
from Ensembl, in the Maps module using the
comparative map viewer (CMap) from GMOD, and in
the Proteins module displays. BLAST is used to
search for similar sequences. Literature
supporting all the above data is organized in the
Literature database.
5Why data integration?
- It is often the case that two data sets, when
integrated, are far more useful than the two data
sets taken individually.
6Genome Sequence and ESTs
7Genome Sequence and SNP data
8SNPs in the context of a protein sequence
9Automated Annotation
10Synteny view of rice chromosome 1 against the maize
genome
http://www.gramene.org/Oryza_sativa/syntenyview?otherspeciesZea_mayschr1x15y13
11Bioinformatic Food Chain
Plant Biology Databases: A Needs Assessment,
November 2005
12What are the appropriate repositories for the
data sets?
13Static repositories for long-term storage
- GenBank (Benson et al. 2004), a repository of
sequence submissions
- GEO (Barrett et al. 2005), a repository of
microarray expression data
- PDB (Westbrook et al. 2003), a repository of X-ray
crystallographic structures
14If no static repository exists, what should be kept
and why?
- Example: genotype data from recombinant inbred
(RI) lines
- Multiple labs use the same germplasm for trait
evaluation
- Analysis tools change over time
- Integrate data from multiple experiments
15What is the method for producing the alignments?
16Provide information on the data set, the method
used, and when the analysis was completed
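One way to keep this information attached to a result is a small provenance record. The sketch below is only an illustration: the class name, the field names, and every value in it are hypothetical placeholders, not taken from Gramene or from any particular analysis.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class AnalysisRecord:
    """Hypothetical provenance record for one analysis result."""
    dataset: str          # which data set was analyzed
    method: str           # which method or program produced the result
    method_version: str   # version of that method
    completed_on: date    # when the analysis was completed

    def describe(self) -> str:
        # One-line summary suitable for display next to the result.
        return (f"{self.dataset}: {self.method} {self.method_version}, "
                f"completed {self.completed_on.isoformat()}")


# Placeholder values for illustration only.
record = AnalysisRecord(
    dataset="example EST alignments",
    method="example-aligner",
    method_version="1.0",
    completed_on=date(2006, 5, 30),
)
print(record.describe())
```

Storing the three items together means a user looking at an alignment can tell which inputs and which tool version it reflects.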
17Standard Operating Protocols (SOPs)
- Data integration often requires additional
analysis, and this requires decisions.
- Document these decisions to allow end-users and the
individuals working on the project to understand
the process and what decisions were made along
the way.
- Establish quality assurance and control at each
step.
18Separate the technical infrastructure from the
human infrastructure
- Many automated computational tasks do not
require specialized species- or biology-specific
knowledge.
- In order to avoid redundant and inconsistent
efforts, encourage partnerships between groups
that can provide technical infrastructure for
automated annotation tasks and groups that are
knowledgeable about the underlying biology
associated with the data set.
19Standardize Nomenclature
20Standardize Nomenclature
- Example: genomic clones used as templates for the
physical map and genomic sequence
- 59P7
- c59P7
- c0059P07
- ZM0059P07
- ZMC0059P07
- ZMCUNK_0059P07
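The six spellings above can be reconciled mechanically once a canonical form is agreed on. The sketch below assumes that each name is an optional library prefix followed by a plate number, a row letter, and a column number, and that all six refer to the same clone. The zero-padded form "0059P07" is an arbitrary choice made for illustration, not an established standard. Discarding the prefix is only safe if plate coordinates are unique across libraries.

```python
import re

# Assumed layout: optional alphabetic prefix (with optional underscore),
# then plate number, row letter, column number.
CLONE_NAME = re.compile(
    r"^(?P<prefix>[A-Za-z]*_?)(?P<plate>\d+)(?P<row>[A-Za-z])(?P<col>\d+)$"
)


def canonical_clone_name(name: str) -> str:
    """Return a zero-padded plate/row/column key for a clone name."""
    match = CLONE_NAME.match(name.strip())
    if match is None:
        raise ValueError(f"unrecognized clone name: {name!r}")
    plate = int(match.group("plate"))
    row = match.group("row").upper()
    col = int(match.group("col"))
    return f"{plate:04d}{row}{col:02d}"


variants = ["59P7", "c59P7", "c0059P07", "ZM0059P07",
            "ZMC0059P07", "ZMCUNK_0059P07"]
canonical = {canonical_clone_name(v) for v in variants}
print(canonical)
```

All six variants collapse to a single key, which is what a join between the physical map and the genomic sequence needs.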
21Standardize Vocabulary
22Ontologies to describe attributes of the data set
- Ontologies are sets of vocabulary terms whose
meanings and relations with other terms are
explicitly stated in such a way as to be
comprehensible to humans and computer programs.
- Ontology-building has emerged as a major activity
of curated repositories because by annotating
data sets using a shared set of ontologies,
repositories can establish connections both
within the data sets they curate and across data
sets contained within different repositories.
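A minimal sketch of how shared terms connect records held in different repositories. The term graph, the relations, the repository names, and the record identifiers are all toy values invented for illustration; they are not copied from the Plant Ontology or from any real database.

```python
# Toy term graph: child term -> list of (relation, parent term).
PARENTS = {
    "stamen": [("part_of", "flower")],
    "petal": [("part_of", "flower")],
    "flower": [("is_a", "plant structure")],
}


def ancestors(term):
    """Collect every term reachable by following parent relations."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relation, parent in PARENTS.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical annotations: (repository, record) -> ontology term.
ANNOTATIONS = {
    ("repository_A", "gene_1"): "stamen",
    ("repository_B", "qtl_7"): "petal",
    ("repository_B", "qtl_9"): "guard cell",
}


def annotated_with(term):
    """Records annotated with the term itself or with any descendant of it."""
    return sorted(
        key for key, annotated in ANNOTATIONS.items()
        if annotated == term or term in ancestors(annotated)
    )


print(annotated_with("flower"))
```

A query for "flower" returns one record from each repository even though neither record was annotated with that exact term, because the stated relations carry the link.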
23GO Gene Ontology http://www.geneontology.org
- molecular function
- describes activities, such as catalytic or
binding activities, at the molecular level
- biological process
- a series of events accomplished by one or more
ordered assemblies of molecular functions
- cellular component
- a component of a cell, with the proviso that
it is part of some larger object, which may be an
anatomical structure (e.g. rough endoplasmic
reticulum or nucleus) or a gene product group
(e.g. ribosome, proteasome or a protein dimer)
24PO Plant Ontology www.plantontology.org
- Plant Structure
- A controlled vocabulary of botanical terms
describing morphological and anatomical
structures representing organ, tissue and cell
types and their relationships. Examples are
stamen, gynoecium, petal, parenchyma, guard cell,
etc.
- Growth and Developmental Stages
- A controlled vocabulary of terms describing
growth and developmental stages in model plant
species and their relationships. Examples are
embryo development stage, seedling stage,
flowering stage, etc.
25OBO Open Biomedical Ontologies http://obo.sourceforge.net
26Example ontologies currently available from OBO
27Evidence and Attribution tracking
- Evidence tracking links an assertion contained
within a repository to the underlying evidence
that supports that assertion.
- Attribution tracking links a data set and
annotations on the data set to the individual or
group that produced it
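The two kinds of tracking can be sketched as required fields on each stored assertion. The record layout, the field names, and the example values below are hypothetical and only illustrate the idea; a real repository would typically point to structured evidence codes and contributor records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assertion:
    subject: str        # object the assertion is about
    statement: str      # what is being asserted
    evidence: str       # pointer to the evidence supporting the assertion
    attributed_to: str  # individual or group that produced it


def untracked(assertions):
    """Return assertions that lack evidence or attribution."""
    return [
        a for a in assertions
        if not a.evidence.strip() or not a.attributed_to.strip()
    ]


# Placeholder records for illustration only.
assertions = [
    Assertion("gene_1", "expressed in stamen", "example reference 1", "curator_A"),
    Assertion("gene_2", "expressed in petal", "", "curator_B"),
]
flagged = untracked(assertions)
print([a.subject for a in flagged])
```

A check like this could run as part of quality control so that no assertion enters the repository without both links.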
28Standardize data formats and user interfaces
- The lack of standard file formats creates
friction that increases the cost and decreases
the pace of active curation and data integration.
- The lack of standardization of user interfaces
leads to frustration on the part of researchers
who cannot easily move from one repository to
another.
29Description of the fields in the database and
intended use
- Example: the database field was "location"
- Mexico
- Jalisco
- third field off of the first left on the main
road out of town
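The three values above were entered at three different levels of detail in one free-text field. A sketch of one possible remedy is to split the field so that each level has a documented slot. The class and field names here are hypothetical, and each value is stored exactly as it was recorded.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CollectionLocation:
    """Hypothetical replacement for a single free-text 'location' field."""
    country: Optional[str] = None
    state_or_province: Optional[str] = None
    locality_notes: Optional[str] = None  # free-text directions

    def finest_level(self) -> str:
        # Report the most specific level that was actually filled in.
        if self.locality_notes:
            return "locality"
        if self.state_or_province:
            return "state_or_province"
        if self.country:
            return "country"
        return "unknown"


records = [
    CollectionLocation(country="Mexico"),
    CollectionLocation(state_or_province="Jalisco"),
    CollectionLocation(
        locality_notes="third field off of the first left on the main road out of town"
    ),
]
levels = [r.finest_level() for r in records]
print(levels)
```

With the intended use of each field written down, a query for all records from one country no longer depends on how each submitter interpreted "location".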
30End-Users: Who are they and what are their needs?
- Naïve end-users require easy-to-use and intuitive
interfaces that nevertheless provide them with
access to the full data set. These users are
often satisfied with one-object-at-a-time
interfaces, such as those provided by almost all
biological databases.
- More sophisticated users require query interfaces
that allow them to integrate multiple data sets
within the current repository.
- The most sophisticated users wish to integrate
multiple data sets across multiple repositories.
31Generic Model Organism Database
- The Generic Model Organism Database (GMOD)
Project is a largely open source project to
develop a complete set of software for creating
and administering a model organism database.
Components of this project include genome
visualization and editing tools, literature
curation tools, a robust database schema,
biological ontology tools, and a set of standard
operating procedures.
32GMOD www.gmod.org
33Comparative Resources
34Comparative Resources
- Microarray Resource
- PlexDB
- MIAME Standard compliant
- Pathway Databases
- MetaCyc Pathway Tools
- Biochemical pathways
- Reactome
- Biological processes
36MIAME Standards
41Pathway Database
- Standardized formats for data exchange are
maturing
- User meetings are necessary to facilitate
development
- Existing resources provide useful cross-species
comparisons that leverage experimental validation
between organisms and identify gaps in pathways
and overlap between species
42Inference of a rice reaction set from Arabidopsis
with Pathway Tools in Gramene
Focus is on biochemical pathways
43SkyPainter view from Reactome
Inference of a rice reaction set from human with
Reactome / OrthoMCL
Focus is on biological processes
44http://dev.gramene.org/about/personnel.html
- PlexDB: Roger Wise, Julie Dickerson
- Pathway Tools: SRI, Peter Karp
- Reactome: Peter D'Eustachio, Guanming Wu, Lincoln Stein
- Funding: NSF and USDA-ARS