Title: The EXPO Ontology: Describing Scientific Experiments
1The EXPO Ontology: Describing Scientific Experiments
- Ross D. King
- Department of Computer Science
- University of Wales, Aberystwyth
2What is e-Science?
- "E-science is computationally intensive science. It is also the type of science that is carried out in highly distributed network environments, or science that uses immense data sets that require grid computing. Examples of this include social simulations, particle physics, earth sciences and bio-informatics. ..." (Wikipedia) I DISAGREE
- "eScience is about global collaboration in key areas of science and the next generation of infrastructure that will enable it." John Taylor (UK e-Science) I AGREE
3Standard e-Science Vision
- Research is done in Lab X.
- All results, metadata, programs, etc. are stored electronically in an internationally agreed standard format, and openly published.
- An open access paper is published with links to the results, metadata, programs, etc.
- Research lab Y wishes to replicate/build-on the published work of Lab X.
- This is easy because all the results, metadata, programs, etc. are publicly available.
4Standard e-Science Projects
- Most e-Science is based around building software infrastructure.
- Software: web services
- Databases: digital archiving, standards, etc.
- GRID computing: Globus, Condor
- Communication: Access Grid
- Open Access publishing: UK, NIH, etc.
- Services: text mining, bioinformatics, etc.
5My View of e-Science
- I am interested in the formalisation and automation of scientific research.
- My colleagues and I have two related projects in this area:
- EXPO: an ontology of scientific experiments.
- The Robot Scientist Project.
6Formalization of Science
- The goal of science is to increase our knowledge of the natural world through the performance of experiments.
- This knowledge should, ideally, be expressed in a formal logical language.
- Formal languages promote semantic clarity, which in turn supports the free exchange of scientific knowledge and simplifies scientific reasoning.
7Motivation
- The formal description of experiments for efficient analysis, annotation, and sharing of results is a fundamental objective of science.
- Ontologies are required to achieve this goal.
- A few subject-specific ontologies of experiments currently exist. However, despite the unity of science, there is no general ontology of scientific experiments.
- We propose the ontology EXPO to meet this need.
8Ontologies
- "An ontology is a concise and unambiguous description of what principal entities are relevant to an application domain and the relationship between them."
Schulze-Kremer, S., 2001, Computer and Information Sci. 6(21)
9The Unity of Science
- We aim to formalise generic knowledge about scientific experimental design, methodology, and results representation.
- Such a common ontology is both feasible and desirable because all the sciences follow the same experimental principles.
- Despite their different subject matters, they all organise, execute, and analyse experiments in similar ways.
- They use related instruments and materials, and they describe experimental results in identical formats, dimensional units, etc.
10Ontologies for Experiments
- The formal description of experiments for efficient analysis, annotation, and sharing of results is a fundamental objective of science.
- Ontologies are required to achieve this goal.
- A few subject-specific ontologies of experiments currently exist. The most notable of these is the MGED Ontology (MO). It was designed to provide the descriptors required by MIAME (Minimum Information About a Microarray Experiment).
- We have developed the ontology EXPO to meet the need for a general ontology of experiments.
- Soldatova & King (2005) Nature Biotechnology
- Soldatova & King (2006) Royal Society Interface
11Advantages of Ontologies
- The utilisation of a common standard ontology for the annotation of scientific experiments would:
- Make scientific knowledge more explicit.
- Help detect errors.
- Enable the sharing and reuse of common knowledge.
- Remove redundancies in domain-specific ontologies.
- Promote the interchange and reliability of experimental methods and conclusions.
12Our Approach to Ontology Building
- Explicitly list the principles of an ontology's design and its constraints, along with definitions and axioms.
- Provide compliance with a standard upper ontology (SUO) developed by IEEE P1600.1.
- Keep domain-dependent and domain-independent knowledge separate, as well as declarative and procedural knowledge.
- Build ontologies so that they are purpose-independent and therefore future-proof.
13The Position of EXPO
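[Diagram: the position of EXPO among related ontologies]
- Upper level: SUMO
- Generic level: EXPO, Measurement ontology, Bibliographic Data Ontology
- Domain level: Domain Model, PSI, MO, Plant ontology, MSI, FuGO, ChEBI
- Linking concepts shown in the diagram: BiblioReference, Mes.Unit, SubjectOfExp., ObjectOfExp.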
14Small Section of EXPO
15Generic ontology of experiments
e-Science:
- Controlled vocabulary of scientific experiments
- Formalized electronic representation of scientific experiments
- Unified standards for representation, annotation, storage, and access to experimental results
- Reasoning over experimental data and conclusions.
Ontology of science (formalization of scientific methods, technologies, infrastructure of science)
EXPO: Ontology of scientific experiments. Concepts: 218. Language: OWL.
Scientific Experiment:
- Experimental results
- Experimental goal
- Experimental action
- Classification of experiments
- Experimental design
- Admin info about experiment
- Experimental object
16EXPO description
- EXPO v.1
- Concepts: 200
- Language: OWL
- Tool: Hozo Ontology Editor
17Scientific Publication 1
- The traditional way of presenting scientific knowledge in scientific papers has many limitations.
- The most important and obvious of these is the use of natural language to describe knowledge, albeit augmented by various formalisms and mathematics.
- This is problematic because natural language is notorious for its imprecision and ambiguity.
18Scientific Publication 2
- Use of natural language is a great hindrance when using computers to store and analyse data, hence the growing importance of text-mining.
- We argue that the content of scientific papers should increasingly be expressed in formal languages.
- Is writing a scientific paper closer to writing poetry or a computer program?
19Applications of EXPO
- Phylogenetics
- Particle Physics
- Structural Biology
- Drug Screening and Design
- Physical Chemistry
- Robot Scientist
20Solenodons
Solenodons are endangered insectivores from
Hispaniola and Cuba.
21Phylogenetic Example
- Random paper selected from Nature: Roca, A.L., Bar-Gal, G.K., Eizirik, E., Helgen, M.K., Maria, R. et al. Mesozoic origin for West Indian insectivores. Nature, 429, 649-651 (2004).
- The paper investigates the phylogenetic status of the mammalian species Solenodon cubanus and Solenodon paradoxus, i.e. the evolutionary relationship of these animals with all others.
- Conclusion: Solenodons diverged in the Cretaceous.
22Solenodon Annotation
Scientific Experiment: Hypothesis-forming, Hypothesis-driven
Admin info about experiment
  Title: Mesozoic Origin of West Indian Insectivores
  Author: Roca, A.L., Bar-Gal, G.K., Eizirik, E., Helgen, M.K.,
  Organisation: 1. National Cancer Institute, Frederick, USA
  Status: public academic
  Reference: Roca, A.L., Bar-Gal, G.K., Eizirik, E., Helgen, M.K., Maria, R. et al. Mesozoic origin for West Indian insectivores. Nature, 429, 649-651 (2004).
Classification of experiment
  Taxonomy
    DDC (Dewey): 575 Evolution and Genetics
    Library of Congress: QH 367.5 molecular phylogenetics
  Zoology
    DDC (Dewey): 599 mammalogy
    Library of Congress: QL351-QL352 Zoology-Classification
Experimental goal: To discover the phylogeny of the species Solenodon paradoxus and Solenodon cubanus
Null hypothesis H01 (explicit)
  Representation style: text
  Linguistic expression: natural language
    "Some have suggested a close relationship to soricids (shrews) but not to talpids"
  Linguistic expression: artificial language
    predicate calculus
Experimental action 1.1.1: extraction and purification
  Object: sample of DNA
  Parent group: DNA from Solenodon paradoxus
  Sampling: random sampling
  Instrument: Qiagen DNA cleanup kit
Experimental action 1.1.2: DNA amplification
Experimental Conclusions (Formed Hypotheses)
C1) Hypothesis
  Representation style: text
  Linguistic expression: natural language
    "There existed a mammal that is the ancestor of Solenodons, Soricoidea, Talpoidea, Erinaceidea, and which is not the ancestor of any other mammal."
  Linguistic expression: artificial language
    predicate calculus
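The annotation headings above lend themselves to a structured record. The following is a minimal sketch in Python, assuming hypothetical class and field names (`ExperimentAnnotation`, `ExperimentalAction`, `missing_fields`) that are not part of EXPO itself; it only shows how an annotation tool might flag headings that are still empty.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ExperimentalAction:
    """One step of the experiment, e.g. 'extraction and purification'."""
    identifier: str
    name: str
    obj: str = ""
    instrument: str = ""


@dataclass
class ExperimentAnnotation:
    """Hypothetical container mirroring some top-level EXPO headings."""
    title: str
    experiment_types: List[str]
    goal: str
    actions: List[ExperimentalAction] = field(default_factory=list)
    conclusions: List[str] = field(default_factory=list)

    def missing_fields(self) -> List[str]:
        """Return the names of headings that were left empty."""
        return [name for name, value in vars(self).items() if not value]


# Values copied from the Solenodon annotation; conclusions left out on purpose.
solenodon = ExperimentAnnotation(
    title="Mesozoic origin for West Indian insectivores",
    experiment_types=["hypothesis-forming", "hypothesis-driven"],
    goal="To discover the phylogeny of Solenodon paradoxus and Solenodon cubanus",
    actions=[
        ExperimentalAction("1.1.1", "extraction and purification",
                           obj="sample of DNA",
                           instrument="Qiagen DNA cleanup kit"),
        ExperimentalAction("1.1.2", "DNA amplification"),
    ],
)

print(solenodon.missing_fields())  # ['conclusions']
```

A record of this kind makes omissions visible mechanically, which is the point the later "Problems Highlighted by Annotation" slides make about implicit hypotheses.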
EXPO: A scientific experiment is a research method which permits the investigation of cause-effect relations between known and unknown (target) variables of the field of study (domain). An experimental result cannot be known with certainty in advance.
EXPO: A classification of experiments is a hierarchical system of categories (types of experiments) according to their domains or used models of experiments.
Prolog:
  instantiation(solenodon, So),
  instantiation(soricoidea, Sh),
  instantiation(talpoidea, T),
  instantiation(mammalia, An),
  shared_ancestor(So, Sh, T, An).

  shared_ancestor(Shared, Not_shared).
  shared_ancestor([X], [Y], An) :- ancestor(An, X), not ancestor(An, Y).
  shared_ancestor([X|Lx], Ly, An) :- shared_ancestor(Lx, Ly, An), ancestor(An, X).
  shared_ancestor(Lx, [Y|Ly], An) :- shared_ancestor(Lx, Ly, An), not ancestor(An, Y).
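The intent of the Prolog fragment (an ancestor shared by every listed taxon and by no taxon outside the list) can be sketched in Python. This is an illustration only: the child-to-parent map and the node names `root` and `A` are hypothetical and simply encode conclusion C1 as a toy tree; they are not the phylogeny inferred in the paper.

```python
# Toy child -> parent map; "root" and "A" are hypothetical internal nodes.
PARENT = {
    "A": "root",
    "other_mammal": "root",
    "solenodon": "A",
    "soricoidea": "A",
    "talpoidea": "A",
    "erinaceidea": "A",
}


def ancestor(an, x):
    """True if `an` lies on the path from `x` up to the root."""
    node = PARENT.get(x)
    while node is not None:
        if node == an:
            return True
        node = PARENT.get(node)
    return False


def shared_ancestor(shared, not_shared, an):
    """`an` is an ancestor of every taxon in `shared` and of none in `not_shared`."""
    return (all(ancestor(an, x) for x in shared)
            and not any(ancestor(an, y) for y in not_shared))


ingroup = ["solenodon", "soricoidea", "talpoidea", "erinaceidea"]
print(shared_ancestor(ingroup, ["other_mammal"], "A"))     # True
print(shared_ancestor(ingroup, ["other_mammal"], "root"))  # False
```

The second call fails because `root` is also an ancestor of a taxon outside the group, which is the exclusivity condition stated in C1.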
EXPO: A null hypothesis is an experimental hypothesis that states that a known controlled variable or variables does not have a specified effect on the unknown (target) variable or variables of the domain.
XML:
  </rdfs:Class>
  <rdfs:Class rdf:ID="classification of experiments">
    <rdfs:label>classification of experiments</rdfs:label>
    <rdfs:subClassOf rdf:resource="classification" />
    <rdfs:comment> Def: A classification of experiments is a hierarchical system of categories - types of experiments - according to their domains or used models of experiments. Axiom </rdfs:comment>
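A class definition of this shape can be produced programmatically. The sketch below uses Python's standard `xml.etree.ElementTree`; the helper name `rdfs_class` is hypothetical, and the identifier is written with underscores and the parent as `#classification` as assumptions, not as the form used in the EXPO files.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
ET.register_namespace("rdf", RDF)
ET.register_namespace("rdfs", RDFS)


def rdfs_class(class_id, label, parent, comment):
    """Build an rdfs:Class element with a label, a superclass and a comment."""
    cls = ET.Element(f"{{{RDFS}}}Class", {f"{{{RDF}}}ID": class_id})
    ET.SubElement(cls, f"{{{RDFS}}}label").text = label
    ET.SubElement(cls, f"{{{RDFS}}}subClassOf", {f"{{{RDF}}}resource": parent})
    ET.SubElement(cls, f"{{{RDFS}}}comment").text = comment
    return cls


element = rdfs_class(
    class_id="classification_of_experiments",  # assumed identifier form
    label="classification of experiments",
    parent="#classification",                  # assumed reference form
    comment=("Def: A classification of experiments is a hierarchical system "
             "of categories - types of experiments - according to their "
             "domains or used models of experiments."),
)
xml_text = ET.tostring(element, encoding="unicode")
print(xml_text)
```

Keeping the natural-language definition in `rdfs:comment`, as the slide does, means the human-readable and machine-readable descriptions travel together.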
23Problems Highlighted by Annotation 1
- The use of EXPO makes explicit the different hypotheses described in the paper.
- What we have identified in the <research conclusion> are not mentioned as hypotheses in the text.
- This contrasts with what we identify as the seven null-hypotheses, which are mentioned explicitly in the main text. This is sub-optimal statistically.
24Problems Highlighted by Annotation 2
- Another aspect of the research that use of EXPO would have highlighted is that the DNA sequences produced during the experiment were stored in the EMBL database using the taxonomic term Insectivora.
- This taxon is now generally recognised to be polyphyletic, and its use contradicts the actual conclusions of the paper.
25Problems Highlighted by Annotation 3
- We formalised the knowledge behind the authors' argument that Cuban Solenodons should be classified in a distinct genus, Atopogale.
- Our analysis indicates that it would be more internally consistent for the authors to have classified Cuban Solenodons as a distinct family.
- etc.
26High-energy/particle physics
Another random paper selected from the same Nature issue: D0 Collaboration. A precision measurement of the mass of the top quark. Nature, 429, 639-642 (2004). (350 scientists)
27Experimental Equipment!
28EXPO D0 Example 1
- <scientific experiment> <computational experiment> <simulation>
- <admin info about experiment>
- <title> A precision measurement of the mass of the top quark
- <classification by domain>
- <domain of experiment> High Energy Physics / Particle Physics
- <DDC (Dewey) classification> 539.7 Atomic and nuclear physics
- <Library of Congress classification> QC 770-798 Atomic, Nuclear, Particle Physics
- <related domain> Computational Statistics
- <DDC (Dewey) classification> 519 Probabilities and Applied Mathematics
- <Library of Congress classification> QA 273-274 Probabilities
- <research hypothesis> <representation style> <text>
- <linguistic expression> <natural language>
- Given the same observed data, use of the new statistical method M1 will produce a more accurate estimate of Mtop than the original method M0.
- <linguistic expression> <artificial language>
- M0(? D0 observations ? ? relevant background knowledge) ? E0
- M1(? D0 observations ? ? relevant background knowledge) ? E1
- estimation_error(E0, Mtop) ? Error0
- estimation_error(E1, Mtop) ? Error1
- Error0 > Error1
29Problems Highlighted by Annotation 1
- Poor science, even though published in Nature!
- This annotation makes it explicit that the experiment was somewhat unusual in not generating any new observational data. Instead, it presents the results of applying a new statistical analysis method to existing data (a set of putative top-quark pair decay events involving e+jets and µ+jets).
30Problems Highlighted by Annotation 2
- No explicit hypothesis.
- We argue that the paper's implicit experimental hypothesis was: given the same observed data, use of the new statistical method will produce a more accurate estimate of Mtop than the original method.
- This is based on the authors' statement: "here we report a technique that extracts more information from each top-quark event and yields a greatly improved precision when compared to previous measurements".
- We prefer the term "accuracy" to "precision".
31Problems Highlighted by Annotation 3
- The Carnap principle: all relevant knowledge should be used to decide a scientific question.
- 91 candidate events were used to calculate the old value, but only 22 of these were used for the new value!
- The old method estimate of Mtop is 173.3 ± 5.6 (stat) ± 5.5 (sys) GeV/c².
- The new method estimate of Mtop is 180.1 ± 3.6 (stat) ± 3.9 (sys) GeV/c².
- The current (June 2005) best estimate for Mtop is 174.3 ± 3.4 GeV/c².
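The difference between precision and accuracy in these figures can be made concrete with a little arithmetic. The sketch below is illustrative only and rests on two assumptions that the slides do not state: that statistical and systematic uncertainties may be combined in quadrature, and that the June 2005 best estimate can serve as the reference value (it is itself an estimate with its own uncertainty).

```python
import math


def total_uncertainty(stat, syst):
    """Combine statistical and systematic errors in quadrature (assumption)."""
    return math.hypot(stat, syst)


# (central value, stat, sys) in GeV/c^2, as quoted on the slide.
estimates = {
    "old method": (173.3, 5.6, 5.5),
    "new method": (180.1, 3.6, 3.9),
}
reference = 174.3  # June 2005 best estimate, used here as a stand-in for Mtop

for name, (value, stat, syst) in estimates.items():
    total = total_uncertainty(stat, syst)
    offset = value - reference
    print(f"{name}: total uncertainty {total:.1f}, "
          f"offset from reference {offset:+.1f} GeV/c^2")
```

Under these assumptions the new method has the smaller combined uncertainty (about 5.3 against 7.8 GeV/c²) but lies further from the reference value (+5.8 against -1.0 GeV/c²), which is why the distinction between the two terms matters for stating the hypothesis.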
32Problems Highlighted by Annotation 4
- The paper concluded that Mtop is higher than previously estimated, which deductively implies a higher mass for the Higgs Boson. As the Higgs Boson has not yet been observed, even at energies above its previously predicted maximum likelihood mass, the newly inferred higher Mtop lent support to the existence of the Higgs Boson.
- However, it would have been possible to argue validly the other way: the Higgs Boson is thought highly likely to exist, therefore its non-observation makes a higher value of Mtop more probable.
- This argument was not explicit in the paper, but may have existed implicitly as a motivation.
- The paper would have benefited from making this argument explicit, even if not used.
33An Ontology for Drug Screening Design
- BBSRC-funded project, started in April 2007.
- Extend EXPO to formalise meta-data for drug screening and design.
- We are developing our own drug screening and drug design Robot Scientist: Eve.
- Collaborating with industry. Working with Pfizer to develop an ontology and experiment annotation system. This is especially important when merging data after corporate mergers.
34Structural Biology
- Structural Biology was once a leader in the development of standards for the preservation and sharing of data.
- This lead has been lost.
- The main data standard, mmCIF, does not meet state-of-the-art standards in biology for ontologies.
- The main database, PDB, is not relational although it is meant to be.
- We have proposed a way forward using EXPO.
- Nature Biotechnology (2007) 25, 437-442
35ART
- An Ontology Based Tool for the Translation of Papers into Semantic Web Format.
- Focussed on physical chemistry, which has very structured publications.
- Funded by JISC, in collaboration with the Royal Society of Chemistry, and UKOLN.
36ART 2
- Tool to add value to papers and data stored in a repository.
- The tool will lead authors through a process where experimental goals, hypotheses, methodologies, results, etc. are described and linked to the text and external data.
- The result will be an article in OWL format that can be archived with the original text version.
- The OWL version will be more formalised and useful for computer processing, e.g. text mining.
- SIG/ISMB07 Ontology Workshop / BMC Bioinformatics
37Input: free text article
[Diagram: ART processing steps, in order]
- Convert to SciXML article (domain independent)
- Markup paper metadata (title, author, ...) using DC, PRISM
- Named entity recognition (domain dependent): markup domain concepts (molecule, bond, ...) using ChEBI, FIX, REX
- Recognition of generic scientific concepts (goal, hypothesis, ...) using EXPO, OBI, ECO (domain independent)
- Request / Confirm / Explain: interaction with the user
- Markup generic scientific concepts
- Generate summary, RSS feed
- Output: xml / owl article
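The step in which generic scientific concepts are suggested and then confirmed by the user can be sketched as follows. This is a toy illustration, not the ART implementation: the cue patterns, the function names (`suggest_concepts`, `annotate`) and the sample article text are all invented for the example, and a real tool would rely on the ontologies named above rather than on keywords.

```python
import re

# Hypothetical keyword cues for two generic concepts.
GENERIC_CUES = {
    "goal": re.compile(r"\b(aim|goal|objective)\b", re.IGNORECASE),
    "hypothesis": re.compile(r"\bhypothes[ie]s\b", re.IGNORECASE),
}


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def suggest_concepts(sentence):
    """Return the generic concepts whose cue appears in the sentence."""
    return [name for name, cue in GENERIC_CUES.items() if cue.search(sentence)]


def annotate(text, confirm=lambda sentence, concept: True):
    """Suggest concept markup and keep only what the user confirms."""
    annotations = []
    for sentence in split_sentences(text):
        for concept in suggest_concepts(sentence):
            if confirm(sentence, concept):
                annotations.append((concept, sentence))
    return annotations


# Invented sample text, for illustration only.
article = ("Our aim is to measure the rate constant. "
           "The hypothesis is that the reaction is first order. "
           "Samples were heated to 300 K.")

for concept, sentence in annotate(article):
    print(f"<{concept}> {sentence}")
```

The `confirm` callback stands in for the Request / Confirm / Explain exchange: passing a function that rejects a suggestion removes that markup from the output.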
38The Concept of a Robot Scientist
We have developed the first computer system that
is capable of originating its own experiments,
physically doing them, interpreting the results,
and then repeating the cycle.
[Diagram labels: Background Knowledge, Analysis, Hypothesis Formation, Consistent Hypotheses, Experiment selection, Experiment, Robot, Results Interpretation, Final Theory]
King et al. (2004) Nature, 427, 247-252.
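The cycle described above (form hypotheses, select an experiment, run it, interpret the result, repeat) can be sketched as a simple elimination loop. This is a toy under stated assumptions, not the method of King et al. (2004): the function names, the gene labels `g1` to `g4`, and the rule of choosing the experiment that splits the remaining hypotheses most evenly are all hypothetical.

```python
def run_cycle(hypotheses, experiments, predict, perform):
    """Repeat: select an experiment, run it, drop inconsistent hypotheses."""
    consistent = set(hypotheses)
    remaining = list(experiments)
    while len(consistent) > 1 and remaining:
        # Experiment selection: minimise the largest group of hypotheses
        # that would survive any single outcome.
        def worst_case(exp):
            outcomes = [predict(h, exp) for h in consistent]
            return max(outcomes.count(o) for o in set(outcomes))

        experiment = min(remaining, key=worst_case)
        remaining.remove(experiment)
        observed = perform(experiment)  # the robot does the experiment
        # Results interpretation: keep hypotheses that predicted the outcome.
        consistent = {h for h in consistent
                      if predict(h, experiment) == observed}
    return consistent


# Toy domain: hypothesis "g" means "gene g encodes the enzyme";
# an experiment is the knockout of one gene.
genes = ["g1", "g2", "g3", "g4"]


def predict(hypothesis, knockout):
    return "no growth" if hypothesis == knockout else "growth"


TRUE_GENE = "g3"  # hidden from the loop, known only to the simulated lab


def perform(knockout):
    return predict(TRUE_GENE, knockout)


print(run_cycle(genes, genes, predict, perform))  # {'g3'}
```

Here `perform` is a simulation; in the real system that call is the point at which the physical robot carries out the selected experiment.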
39Motivation 1 Philosophical
- What is Science?
- The question whether it is possible to automate the scientific discovery process seems to me central to understanding science.
- There is a strong philosophical position which holds that we do not fully understand a phenomenon unless we can make a machine which reproduces it.
40Motivation 2 Technological
- In many areas of science our ability to generate data is outstripping our ability to analyse the data.
- One scientific area where this is true is functional genomics, where data is now being generated on an industrial scale.
- The analysis of scientific data needs to become as industrialised as its generation.
41The Application Domain
- Systems Biology
- Yeast (S. cerevisiae): the best understood eukaryotic organism.
- Strain libraries, e.g. EUROFAN 2 has knocked out each of the 6,000 genes.
- Task: to learn models of yeast metabolism using selected mutant strains and quantitative growth experiments.
42Movie
43Some Example Growth Curves
- Soldatova et al., CS Dept., Aberystwyth, UK
44The need for a Robot Scientist ontology (EXPO-RS)
- The robot requires a detailed and formalized description of domains, background knowledge, experiment methods, technologies, hypothesis formation and experiment design rules, etc.
- Integrity of data and metadata.
- Open access to the RS experimental data and metadata for the scientific community.
- Soldatova, Sparkes, Clare, King (2006) Bioinformatics
45EXPO-RS
- Formalization of the entities involved in Robot Scientist experiments.
- A controlled vocabulary for all the participants of the project.
- Identification of metadata essential for the experiment's description and repeatability.
- Coordination of the planning of experiments, their execution, access to the results, technical support of the robot, etc.
- Modelling a database for storing experiment data and tracking experiment execution.
46Conclusions
- The unity of science implies that an accepted general ontology of experiments is both possible and desirable.
- Such an ontology would promote the sharing of results within and between subjects, reducing both the duplication and loss of knowledge.
- It is also an essential step in formalising science, and fully exploiting computer reasoning in science.
- We propose EXPO as a general ontology for scientific experiments.
- We have demonstrated the utility of EXPO on applications in phylogenetics, high-energy physics, chemistry, and high-throughput Systems Biology.
47Acknowledgements
- Larisa Soldatova Aberystwyth
- Amanda Schierz Aberystwyth
- Ken Whelan Aberystwyth
- Amanda Clare Aberystwyth
- Mike Young Aberystwyth
- Jem Rowland Aberystwyth
- Andrew Sparkes Aberystwyth
- Wayne Aubrey Aberystwyth
- Emma Byrne Aberystwyth
- Magda Markham Aberystwyth
- Steve Oliver Manchester
- Riichiro Mizoguchi Osaka
- BBSRC, JISC