Title: OBO and OBD: Biomedical ontologies and data
1OBO and OBD Biomedical ontologies and data
- Chris Mungall
- Howard Hughes Medical Institute, UC Berkeley
- National Center for Biomedical Ontology
2Outline
- A brief history of OBO
- Overview of The National Center for Biomedical Ontologies
- The OBO Foundry
- OBD: The OBO Database
- Storing mutant phenotype and disease data in OBD
- Technology for OBD
3OBO History
- 1999: Gene Ontology
- 2003: Open Bio-Ontologies
- 2005: National Center for Biomedical Ontologies
- neo-OBO
- OBD
- 2006: OBO Foundry
4The Gene Ontology
- Application domain
- annotation of genes and gene products
- initially model organism focused
- 3 Orthogonal Ontologies, 20k terms
- molecular function
- biological process
- cellular component
- Formalism
- DAG - is_a and part_of relations
5Gene Ontology software and infrastructure
- Ontologies managed in CVS repository
- Editor software: OBO-Edit
- Native file format: OBO format
- Annotation data managed at distributed sites
- associations of genes and gene products to GO terms
- Daily uploads into central database
- GODB schema
- AmiGO browser
- Exports
- GO-RDF, Obo-XML, OWL, MySQL
6OBO Open Bio-Ontologies
- Offshoot of the GO
- Initial ontologies
- anatomical
- cell types
- fly anatomy
- mouse anatomy
- zebrafish anatomy
- plant anatomy
- Dictyostelium anatomy
- human anatomy
- developmental stages
- experimental conditions
- chemical
- phenotype and disease
- phenotypic attributes
- mammalian phenotype
- plant phenotype
- human disease
- relations
7Obol: integrating GO and OBO
None of these relationships are explicitly
encoded in the ontology
- GO Biological Process
- cysteine biosynthesis
- myoblast fusion
- snoRNA catabolism
- wing disc pattern formation
- epidermal cell differentiation
- regulation of flower development
- B-cell differentiation
- midbrain development
- Mammalian Phenotype Ontology
- increased activated B-cell number
- kidney hypoplasia
We are currently creating logic definitions for
these composite terms
8Problems with OBO mark 1
- No funding!
- Minimal infrastructure
- Hosted on SourceForge (http://obo.sourceforge.net)
- CVS
- Web Page, ontology summaries
- Periodic downloads and automated format conversions
- No APIs or database
- Minimal review of incoming ontologies
- Existing clinical ontologies and terminologies not included
- SNOMED, UMLS, etc
9National Center for Biomedical Ontologies
- Formed in 2005
- The goal of the Center is to support biomedical
researchers in their knowledge-intensive work, by
providing online tools and a Web portal enabling
them to access, review, and integrate disparate
information resources in all aspects of
biomedical investigation and clinical practice. A
major focus of our work involves the use of
biomedical ontologies to aid in the management
and analysis of data derived from complex
experiments
10NCBO 6 Cores
- Core 1: Ontologies and metadata
- Core 2: Biomedical data annotations
- Core 3: Driving Biological Projects and external research collaborations
- Core 4: Infrastructure
- Core 5: Education
- Core 6: Dissemination
12Core 1 Ontologies (Stanford, UVic, Mayo)
- Develop an ontology registry/library
- Provide ontology services: BioPortal
- search
- change management, ontology lifecycle
- peer review
- metadata
- ontology mapping (PROMPT)
- visualisation
- Technology
- Protégé
- Lexgrid
13Core 2 Data (Berkeley)
- Biomedical data annotation
- experimental results
- clinical trial data
- Tools to support annotation
- phenote
- OBD
- Data warehouse for data annotated using OBO
ontologies
14Core 3 Driving Biological Projects
- Linking model organism phenotypes to human disease genes
- Zebrafish
- (Westerfield, Univ of Oregon)
- Fruitfly
- (Ashburner, Univ of Cambridge, UK)
- Clinical trial data
- HIV Clinical trials
- (Sim, UCSF)
15Cores 5 and 6 Education and dissemination
(University at Buffalo)
- Principles of ontology design
- Outreach
- Organisation of meetings and workshops
16Part 2 OBD
- The OBD definition of an ontology
- OBO Foundry
- OBD - Annotations database
17An ontology is
- A formal representation of some aspect of reality
- what types of entity exist?
- what are the relationships between these entities?
[Figure: example graph with node labels sense organ, eye disc, eye, ommatidium and edge labels is_a, develops from, part_of]
18Types vs instances
- Instances
- what is particular in reality
- Types (universals, kinds)
- what is general in reality
- a potential object of investigation by science
19Ontologies vs Data
- Instances
- what is particular in reality
- represented in databases
- electronic health records
- experimental data
- Types (universals, kinds)
- what is general in reality
- a potential object of investigation by science
- represented in ontologies
20Relations
- Instance level relations
- often time-variant
- this particular ommatidium part_of this particular compound eye, right now
- Type level relations
- specify what is true for all instances of a type
- time-invariant
- all instances of ommatidium part_of some instance
of a compound eye, at all times
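The difference between the two levels can be sketched in code. The following is a minimal illustration, assuming a hypothetical set of time-indexed instance-level part_of facts; the instance names (om1, om2, eye1), the time points and the function name are invented for the example and are not part of the OBO Relation Ontology.

```python
from collections import namedtuple

# Instance-level fact: `part` is part_of `whole` at time `t`.
PartOf = namedtuple("PartOf", ["part", "whole", "t"])

# Hypothetical instance data: instance -> type.
instance_type = {
    "om1": "ommatidium",
    "om2": "ommatidium",
    "eye1": "compound eye",
}

# Hypothetical times at which each instance exists.
exists_at = {"om1": {1, 2}, "om2": {1, 2}, "eye1": {1, 2}}

facts = [
    PartOf("om1", "eye1", 1), PartOf("om1", "eye1", 2),
    PartOf("om2", "eye1", 1), PartOf("om2", "eye1", 2),
]

def type_level_part_of(part_type, whole_type, facts):
    """True if every instance of part_type is, at every time it exists,
    part_of some instance of whole_type."""
    for inst, typ in instance_type.items():
        if typ != part_type:
            continue
        for t in exists_at[inst]:
            if not any(f.part == inst and f.t == t
                       and instance_type[f.whole] == whole_type
                       for f in facts):
                return False
    return True

print(type_level_part_of("ommatidium", "compound eye", facts))
```

Dropping a single instance-level fact (for example om2 at time 2) is enough to make the type-level statement fail, which is why the type-level reading is the stronger, time-invariant claim.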
21The OBO Relation Ontology
- Foundational relations
- is_a
- part_of
- has_participant
- located_in
- adjacent_to
- transformation_of
- derives_from
- http://obo.sourceforge.net/relationship
- Smith B, Ceusters W, Klagges B, Kohler J, Kumar A, Lomax J, Mungall CJ, Neuhaus F, Rector A, Rosse C. Relations in Biomedical Ontologies. Genome Biology, 2005, 6:R46
22
- We want to encourage ontologies to be interoperable and follow certain principles
- this is vital for OBD
- We want the neo-OBO to be ecumenical
- cannot impose standards on outside ontologies
23The OBO Foundry
- Collaborative experiment amongst OBO ontology developers
- Establish high-standard orthogonal reference ontologies
24OBO Foundry principles
- http://www.obofoundry.org/
- intelligibility to biologist curators, annotators, users
- formal robustness
- stability
- compatibility
- interoperability
- support for logic-based reasoning
25OBO Foundry Criteria
- The ontology uses relations which are unambiguously defined following the pattern of definitions laid down in the OBO Relation Ontology.
- Assumption: if we are to create ontologies which support logical reasoning then we need to take time and instances into account
26OBD A Database for OBO
- About OBD
- A use case linking genes to disease
- On the correct representation of phenotypes
- Representation of instances in a database
- SQL/Relational DBs
- Semantic Web DBs
- Deductive DBs
27OBD A Database for OBO
- OBO is a repository of ontologies
- OBD is a repository of data annotated using these ontologies
- to integrate data from various sources
- to allow researchers to retrieve data and perform advanced queries using ontologies
- to help generate and explore hypotheses
- to allow for reasoning over data
28OBD and the NCBO Driving Biological Projects
- OBD is a general purpose data repository complementary to OBO
- generic schema
- Special care for generating, integrating and analysing data from DBPs
- OBD Foundry
- mutant phenotypes
- clinical trials
29Linking diseases to genes
- Humans share genes with other organisms
- orthology
- Mutations in orthologous genes give rise to similar phenotypes
- Understanding phenotypes helps us understand diseases and disorders
- Example: holoprosencephaly
30[Figure panels: SHH-/- and shh-/-]
31What is a phenotype?
- Represented using qualities
- Qualities are dependent entities
- Phenotype: a collection of one or more quality types inhering in some entity types (both continuants and occurrents)
- An instance of a phenotype
- One or more quality instances inhering in the entity instances that comprise a particular organism instance
- Example: osteoporosis
- The quality of being low mass inhering in bone
32Representing phenotypes in OBD
[Figure: diagram with labels types, instances, entity, quality]
A Formal Theory of Substances, Qualities and Universals. Neuhaus F, Smith B, Grenon P. Proceedings of FOIS 2004
33[Figure: graph with node labels quality, anatomical entity, organism, genotype, environment, eye, red, Drosophila melanogaster, y1 cn1 bw1 sp1 genotype, agar medium; edge labels is_a (indirect), instance_of]
34What ontologies does OBD require?
- Continuants
- Multiple species-centric ontologies (fly, fish, mouse, ..)
- OBO Cell ontology
- OBO/GO Cellular component
- Organism type: NCBI Taxonomy
- Occurrents
- OBO/GO Biological Process
- OBO Relation ontology
- Environment not yet required
- Phenotypic Qualities
- PATO
35Pre- vs post- composed
- Pre-composed phenotypes
- e.g. from Mammalian Phenotype Ontology (MP)
- osteoporosis
- syndactyly
- pink fur
- Post-composition of phenotypes
- e.g. anatomical ontology and PATO
- (loosely) Entity type + Quality type
- bone, low mass
- fingers, fused
- fur, pink
- OBD must support both
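One way to support both styles is to map each pre-composed term to its post-composed form before comparing annotations. The sketch below assumes a hypothetical lookup table built from the examples on this slide; it uses plain labels rather than ontology identifiers, and the names EQ, PRECOMPOSED, normalise and same_phenotype are invented for illustration.

```python
from typing import NamedTuple

class EQ(NamedTuple):
    """Post-composed phenotype: a quality type inhering in an entity type."""
    entity: str
    quality: str

# Hypothetical definitions for pre-composed terms (the slide's examples).
PRECOMPOSED = {
    "osteoporosis": EQ("bone", "low mass"),
    "syndactyly": EQ("fingers", "fused"),
    "pink fur": EQ("fur", "pink"),
}

def normalise(annotation):
    """Return the EQ form of an annotation, whether it is a
    pre-composed term name or already an EQ pair."""
    if isinstance(annotation, EQ):
        return annotation
    return PRECOMPOSED[annotation]

def same_phenotype(a, b):
    """Compare two annotations after normalising both to EQ form."""
    return normalise(a) == normalise(b)

print(same_phenotype("osteoporosis", EQ("bone", "low mass")))
```

With such a mapping, data annotated with a pre-composed term and data annotated with an entity and quality pair can be retrieved by the same query.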
36OBD Schema
- Minimal generic schema
- Ontological primitives only
- type
- instance
- relation
- Usable across biological domains
- Instance data only as good as the ontologies used to type instances
- Importance of OBO Foundry
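To make the idea of a minimal generic schema concrete, here is a small sketch using Python's built-in sqlite3 module. It is not the actual OBD schema: the table names (obd_type, obd_relation, obd_instance, obd_link), the column names and the identifiers are all assumptions chosen to mirror the three primitives listed above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE obd_type     (id TEXT PRIMARY KEY, label TEXT);
CREATE TABLE obd_relation (id TEXT PRIMARY KEY);
CREATE TABLE obd_instance (id TEXT PRIMARY KEY,
                           type_id TEXT NOT NULL REFERENCES obd_type(id));
-- One row per instance-level statement: subject --relation--> object
CREATE TABLE obd_link (subject_id  TEXT REFERENCES obd_instance(id),
                       relation_id TEXT REFERENCES obd_relation(id),
                       object_id   TEXT REFERENCES obd_instance(id));
""")

# Hypothetical example data: an eye instance that has a red quality instance.
conn.executemany("INSERT INTO obd_type VALUES (?, ?)",
                 [("T:eye", "eye"), ("T:red", "red")])
conn.execute("INSERT INTO obd_relation VALUES ('has_quality')")
conn.executemany("INSERT INTO obd_instance VALUES (?, ?)",
                 [("i:eye1", "T:eye"), ("i:red1", "T:red")])
conn.execute("INSERT INTO obd_link VALUES ('i:eye1', 'has_quality', 'i:red1')")

# Report each statement in terms of the types of its subject and object.
rows = conn.execute("""
SELECT et.label, l.relation_id, qt.label
FROM obd_link l
JOIN obd_instance s ON s.id = l.subject_id
JOIN obd_type et    ON et.id = s.type_id
JOIN obd_instance o ON o.id = l.object_id
JOIN obd_type qt    ON qt.id = o.type_id
""").fetchall()
print(rows)
```

Nothing in these tables is specific to anatomy or phenotypes; the domain knowledge lives entirely in the ontologies that supply the type rows, which is why the quality of those ontologies matters.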
37OBD and OBO
- How do we integrate them?
- Two physical databases, one portal
- requires mediation layer
- efficiency concerns?
- Replicate OBO in OBD
- Mechanisms for managing change and ontology
lifecycle
38Data vs ontologies
- How do we decide what goes in OBO and what goes in OBD?
- OBO: types
- OBD: instances
- ...but most curated scientific data concerns types (or at least representative instances)
- inferences about gene types, protein types, genotypes from multiple experiments on multiple instances
- where do we draw the line?
39Technological framework for OBD
- Representations must be encoded using some technological framework
- What are the choices?
- Choice 1: Traditional DBMS
- Choice 2: Semantic Web
- representation language: DL
- RDF triple store database
- Choice 3: Logic based
- representation language: FOPL or some fragment
- deductive database
- Note: XMDR exploring similar paths?
40SQL semantics
- Pros (+)
- Scalable, standard, etc
- Similarities to FOPL
- Cons (-)
- Fragment of FOPL too limited
- Limited deductive capabilities
- Mixed (+/-)
- Closed world assumption
41SQL DBs Prior experience
- The Gene Ontology Database (GODB)
- Predecessor to OBD
- More restricted domain
- genes and gene products
- GO
- Molecular Function
- Biological Process
- Cellular Component
- Technological framework
- SQL database, semi-generic schema
- Pre-computed basic inferences
- Difficult to extend to more general use cases for
OBD
42Semantic Web Technology
- Representation language
- RDF triple based
- Some fragment of OWL entailment
- Query language
- RDF based: SPARQL, SeRQL
- Database systems and tools
- Jena, Sesame, Kowari, SWI-SeRQL engine
43[Figure: graph with node labels quality, anatomical entity, organism, genotype, environment, eye, red, Drosophila melanogaster, y- cn b genotype, agar medium; edge labels is_a, instantiation]
44RDF triple representation
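As a rough illustration of a triple representation, the sketch below stores statements as (subject, predicate, object) tuples and answers simple pattern queries. The blank-node names (_:eye1, _:red1, _:fly1), the use of rdfs:subClassOf for is_a, and the match helper are assumptions made for the example; they are not taken from the OBD data itself.

```python
# Each statement is a (subject, predicate, object) triple.
triples = {
    ("_:eye1", "rdf:type", "eye"),
    ("_:red1", "rdf:type", "red"),
    ("_:fly1", "rdf:type", "Drosophila melanogaster"),
    ("_:eye1", "OBO_REL_part_of", "_:fly1"),
    ("_:eye1", "OBO_REL_has_quality", "_:red1"),
    ("eye", "rdfs:subClassOf", "anatomical entity"),
    ("red", "rdfs:subClassOf", "quality"),
}

def match(s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(t for t in triples
                  if (s is None or t[0] == s)
                  and (p is None or t[1] == p)
                  and (o is None or t[2] == o))

# Which qualities does the eye instance have?
print(match(s="_:eye1", p="OBO_REL_has_quality"))
```

Every fact, whether it links two instances or an instance to its type, takes the same three-part shape, which is what makes the representation generic.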
45OBD-Sesame
- Trial phenotype dataset from fruitfly and zebrafish
- converted to OWL instance representation
- Ontologies converted to OWL
- PATO - qualities (attributes)
- Anatomical ontologies (fly, fish)
- Cell ontology
- Gene ontology
- function, process, component
- NCBI Taxonomy
- OWL loaded into a Sesame DB
46Example query in SeRQL
Find mutations affecting the morphology of the wing vein:
SELECT DISTINCT EI, ET, OrgI, QI, QT, QN
FROM {EI} rdf:type {ET} rdfs:label {EN},
     {EI} OBO_REL_part_of {OrgI} rdf:type {Tax} rdfs:label {TaxN},
     {EI} OBO_REL_has_quality {QI} rdf:type {QT} rdfs:label {QN}
WHERE label(EN) = "wing vein"
  AND label(TaxN) = "Arthropoda"
  AND label(QN) = "morphology"
Results of query on OBD-Sesame: one instance of wing vein L2, branched, in a fruitfly
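The shape of this query can be mimicked in a few lines of Python to show why the result depends on type entailment. This is only a sketch: the subclass edges are simplified to a single direct parent each, and the instance names (v1, q1, fly1) and helper functions are hypothetical.

```python
# Hypothetical, simplified subclass edges (child -> parent).
SUBCLASS = {
    "wing vein L2": "wing vein",
    "branched": "morphology",
    "Drosophila melanogaster": "Arthropoda",
}

# Hypothetical instance data.
TYPE = {"v1": "wing vein L2", "q1": "branched", "fly1": "Drosophila melanogaster"}
PART_OF = {("v1", "fly1")}
HAS_QUALITY = {("v1", "q1")}

def types_of(instance):
    """Asserted type plus all superclasses (a stand-in for entailment)."""
    t = TYPE[instance]
    result = [t]
    while t in SUBCLASS:
        t = SUBCLASS[t]
        result.append(t)
    return result

def query(entity_label, taxon_label, quality_label):
    """Join part_of and has_quality on the entity instance, then
    filter on the (inferred) types of entity, organism and quality."""
    hits = []
    for ei, org in sorted(PART_OF):
        for ei2, qi in sorted(HAS_QUALITY):
            if ei != ei2:
                continue
            if (entity_label in types_of(ei)
                    and taxon_label in types_of(org)
                    and quality_label in types_of(qi)):
                hits.append((TYPE[ei], TYPE[qi], TYPE[org]))
    return hits

print(query("wing vein", "Arthropoda", "morphology"))
```

Without the superclass step in types_of, the instance typed as wing vein L2 would not match a query phrased in terms of wing vein.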
47Example query in SeRQL
Find mutations affecting bone mass:
SELECT DISTINCT EI, ET, OrgI, QI, QT, QN
FROM {EI} rdf:type {ET} rdfs:label {EN},
     {EI} OBO_REL_part_of {OrgI} rdf:type {Tax} rdfs:label {TaxN},
     {EI} OBO_REL_has_quality {QI} rdf:type {QT} rdfs:label {QN}
WHERE label(EN) = "bone"
  AND label(QN) = "reduced mass"
With OWL entailment, this should return mutants with phenotypes instantiating osteoporosis (if this term is defined in OWL)
48OBD-Sesame preliminary results
- Dependent upon storage layer
- memory-based
- FAST
- RDBMS-based
- SLOW
- Deductive queries implemented inefficiently
- Slower than GODB (SQL)
- Benchmark results
- analysis by Shengqiang Shu
- http://smi.stanford.edu/projects/cbio/mwiki-internal/index.php/RDF_Sesame_Demo_Benchmark
49Considerations with RDF/OWL
- More is needed to ensure interoperation
- Foundry ontologies can help
- Foundations in RDF triples
- 3-ary predicates (i.e. binary relations Rxy)
- n-ary relations must be reified or constructed from binary relations
- real world instance level relations are time-indexed
- Rxy forces us to use a time-slice representation
- multiplication of entities
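The cost of reification can be seen in a small sketch. Assuming a hypothetical time-indexed relation part_of_at(part, whole, time), one common encoding introduces a fresh statement node and hangs each argument off it with a binary triple; the role names and the reify helper are invented for this example.

```python
import itertools

_counter = itertools.count(1)

def reify(relation, args):
    """Encode an n-ary fact as binary triples about a fresh statement node.
    `args` maps a role name to its value."""
    node = f"_:stmt{next(_counter)}"
    triples = [(node, "rdf:type", relation)]
    triples += [(node, role, value) for role, value in args.items()]
    return triples

# One ternary fact: om1 is part of eye1 at time t1.
triples = reify("part_of_at", {"part": "om1", "whole": "eye1", "time": "t1"})
for t in triples:
    print(t)
```

A single three-place fact becomes four triples and one extra node, which illustrates the multiplication of entities noted above.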
50Alternative Deductive Databases
- Provides deductive abilities without RDF constraints
- Can represent n-ary relations and time
- Various kinds, different tradeoffs
- SQL+deduction (Datalog)
- Prolog
- Disjunctive databases
- Deductive object-oriented databases (FLORA2)
- FOL theorem provers
- Different OBD instantiations for different
applications?
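To give a flavour of the deductive approach, the sketch below evaluates one Datalog-style rule over ternary, time-indexed facts by naive forward chaining. The rule (transitivity of part_of within a single time point), the instance names and the function name are assumptions chosen for illustration; a real deductive database would use a far more efficient evaluation strategy.

```python
def infer_part_of(facts):
    """Naive forward chaining for the rule
         part_of(X, Z, T) :- part_of(X, Y, T), part_of(Y, Z, T).
    `facts` is a set of (part, whole, time) tuples."""
    derived = set(facts)
    while True:
        new = {(x, z, t)
               for (x, y, t) in derived
               for (y2, z, t2) in derived
               if y == y2 and t == t2} - derived
        if not new:
            return derived
        derived |= new

# Hypothetical time-indexed instance data.
facts = {
    ("om1", "eye1", "t1"),
    ("eye1", "head1", "t1"),
    ("eye1", "head1", "t2"),
}
result = infer_part_of(facts)
print(sorted(result - facts))
```

Because time is an ordinary argument of the relation, the rule derives that om1 is part of head1 at t1 but not at t2, with no reified statement nodes needed.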
51OBD-Foundry
- Fundamental tension/conflict
- Accept a variety of instance data typed by a variety of ontologies
- Ensure interoperability
- Solution: OBD Foundry
- parallels OBO Foundry experiment
- use driving biological projects as model of process
- use less restricted representation language
- core relations part of language
- deductive databases
52Future
- Integrate OBD with BioPortal
- Incorporate more clinical terminologies into OBO
- Collaborations with external groups and other National Centers
- Bring in more Driving Biological Projects