The EXPO Ontology: Describing Scientific Experiments

1
The EXPO Ontology: Describing Scientific Experiments
  • Ross D. King
  • Department of Computer Science
  • University of Wales, Aberystwyth

2
What is e-Science?
  • "E-science is computationally intensive science.
    It is also the type of science that is carried
    out in highly distributed network environments,
    or science that uses immense data sets that
    require grid computing. Examples of this include
    social simulations, particle physics, earth
    sciences and bio-informatics. ..." (Wikipedia)
    I DISAGREE
  • "eScience is about global collaboration in key
    areas of science and the next generation of
    infrastructure that will enable it." John Taylor
    (UK e-Science) I AGREE

3
Standard e-Science Vision
  • Research is done in Lab X.
  • All results, metadata, programs, etc. are stored
    electronically in an internationally agreed
    standard format, and openly published.
  • An open access paper is published with links to
    the results, metadata, programs, etc.
  • Research lab Y wishes to replicate/build-on the
    published work of Lab X.
  • This is easy because all the results, metadata,
    programs, etc. are publicly available.

4
Standard e-Science Projects
  • Most e-Science is based around building software
    infrastructure.
  • Software: web services
  • Databases: digital archiving, standards, etc.
  • GRID computing: Globus, Condor
  • Communication: Access Grid
  • Open Access publishing: UK, NIH, etc.
  • Services: text mining, bioinformatics, etc.

5
My View of e-Science
  • I am interested in the formalisation and
    automation of scientific research.
  • My colleagues and I have two related projects in
    this area:
  • EXPO: an ontology of scientific experiments.
  • The Robot Scientist Project.

6
Formalization of Science
  • The goal of science is to increase our knowledge
    of the natural world through the performance of
    experiments.
  • This knowledge should, ideally, be expressed in a
    formal logical language.
  • Formal languages promote semantic clarity, which
    in turn supports the free exchange of scientific
    knowledge and simplifies scientific reasoning.

7
Motivation
  • The formal description of experiments for
    efficient analysis, annotation, and sharing of
    results is a fundamental objective of science.
  • Ontologies are required to achieve this goal.
  • A few subject-specific ontologies of experiments
    currently exist. However, despite the unity of
    science, there is no general ontology of
    scientific experiments.
  • We propose the ontology EXPO to meet this need.

8
Ontologies
  • An ontology is a concise and unambiguous
    description of what principal entities are
    relevant to an application domain and the
    relationship between them.

Schulze-Kremer, S., 2001, Computer and
Information Sci. 6(21)
9
The Unity of Science
  • We aim to formalise generic knowledge about
    scientific experimental design, methodology, and
    results representation.
  • Such a common ontology is both feasible and
    desirable because all the sciences follow the
    same experimental principles.
  • Despite their different subject matters, they all
    organise, execute, and analyse experiments in
    similar ways.
  • They use related instruments and materials, and
    they describe experimental results in identical
    formats, dimensional units, etc.

10
Ontologies for Experiments
  • The formal description of experiments for
    efficient analysis, annotation, and sharing of
    results is a fundamental objective of science.
  • Ontologies are required to achieve this goal.
  • A few subject-specific ontologies of experiments
    currently exist. The most notable of these is
    the MGED Ontology (MO). It was designed to
    provide descriptors required by MIAME (Minimum
    Information About a Microarray Experiment).
  • However, there is no general ontology of
    scientific experiments. We have developed the
    ontology EXPO to meet this need.
  • Soldatova & King (2005) Nature Biotechnology
  • Soldatova & King (2006) Royal Society Interface

11
Advantages of Ontologies
  • The utilisation of a common standard ontology for
    the annotation of scientific experiments would:
  • Make scientific knowledge more explicit.
  • Help detect errors.
  • Enable the sharing and reuse of common knowledge.
  • Remove redundancies in domain-specific
    ontologies.
  • Promote the interchange and reliability of
    experimental methods and conclusions.

12
Our Approach to Ontology Building
  • Explicitly list the principles of an ontology's
    design, its constraints, along with definitions
    and axioms.
  • Provide compliance with a standard upper ontology
    (SUO) developed by IEEE P1600.1.
  • Keep domain-dependent and domain-independent
    knowledge separate, as well as declarative and
    procedural knowledge.
  • Build ontologies so that they are
    purpose-independent and therefore are
    future-proof.

13
The Position of EXPO
[Diagram: the position of EXPO among other ontologies]
  • Upper level: SUMO
  • Generic level: EXPO, Measurement ontology,
    Bibliographic Data Ontology (diagram labels:
    BiblioReference, Mes.Unit, SubjectOfExp.,
    ObjectOfExp.)
  • Domain level: Domain Model, PSI, MO, Plant
    ontology, MSI, FuGO, ChEBI
14
Small Section of EXPO
15
Generic ontology of experiments
e-Science
  • Controlled vocabulary of scientific experiments
  • Formalized electronic representation of
    scientific experiments
  • Unified standards for representation, annotation,
    storage, and access to experimental results
  • Reasoning over experimental data and conclusions.

Ontology of science (formalization of scientific
methods, technologies, infrastructure of science)
EXPO: Ontology of scientific experiments.
Concepts: 218. Language: OWL.
Scientific Experiment (diagram components):
  • Experimental results
  • Experimental goal
  • Experimental action
  • Classification of experiments
  • Experimental design
  • Admin info about experiment
  • Experimental object
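The components listed for a Scientific Experiment can be pictured as a nested record. The following is a minimal sketch only: the class and field names are assumptions chosen for illustration, not EXPO's actual class or property names, and the sample values are taken from the phylogenetic example in this talk.

```python
from dataclasses import dataclass, field


@dataclass
class Hypothesis:
    label: str
    natural_language: str
    formal_language: str = ""  # e.g. predicate calculus; empty if not yet formalised


@dataclass
class ExperimentalAction:
    name: str
    obj: str
    instrument: str = ""


@dataclass
class ScientificExperiment:
    # Field names are hypothetical stand-ins for the EXPO components.
    title: str
    goal: str
    classification: list = field(default_factory=list)
    hypotheses: list = field(default_factory=list)
    actions: list = field(default_factory=list)
    results: list = field(default_factory=list)

    def unformalised(self):
        """Labels of hypotheses that exist only in natural language."""
        return [h.label for h in self.hypotheses if not h.formal_language]


experiment = ScientificExperiment(
    title="Mesozoic origin for West Indian insectivores",
    goal="To discover the phylogeny of Solenodon paradoxus and Solenodon cubanus",
    classification=["molecular phylogenetics"],
    hypotheses=[
        Hypothesis(
            label="H01",
            natural_language="Some have suggested a close relationship "
                             "to soricids (shrews) but not to talpids",
        )
    ],
    actions=[
        ExperimentalAction(
            name="extraction and purification",
            obj="sample of DNA",
            instrument="Qiagen DNA cleanup kit",
        )
    ],
)

print(experiment.unformalised())
```

Holding the annotation as structured data is what makes simple checks possible, such as listing the hypotheses that still lack a formal expression.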
16
EXPO description
  • EXPO v.1
  • Concepts: 200
  • Language: OWL
  • Tool: Hozo Ontology Editor

17
Scientific Publication 1
  • The traditional way of presenting scientific
    knowledge in scientific papers has many
    limitations.
  • The most important and obvious of these is the
    use of natural language to describe knowledge -
    albeit augmented by various formalisms and
    mathematics.
  • This is problematic because natural language is
    notorious for its imprecision and ambiguity.

18
Scientific Publication 2
  • Use of natural language is a great hindrance when
    using computers to store and analyse data, hence
    the growing importance of text-mining.
  • We argue that the content of scientific papers
    should increasingly be expressed in formal
    languages.
  • Is writing a scientific paper closer to writing
    poetry or a computer program?

19
Applications of EXPO
  • Phylogenetics
  • Particle Physics
  • Structural Biology
  • Drug Screening and Design
  • Physical Chemistry
  • Robot Scientist

20
Solenodons
Solenodons are endangered insectivores from
Hispaniola and Cuba.
21
Phylogenetic Example
  • Random paper selected from Nature: Roca, A.L.,
    Bar-Gal, G.K., Eizirik, E., Helgen, M.K., Maria,
    R. et al. Mesozoic origin for West Indian
    insectivores. Nature, 429, 649-651 (2004).
  • Paper investigates the phylogenetic status of the
    mammalian species Solenodon cubanus and Solenodon
    paradoxus, i.e. the evolutionary relationship of
    these animals with all others.
  • Conclusion - Solenodons diverged in the
    Cretaceous.

22
Solenodon Annotation
Scientific Experiment: Hypothesis-forming,
Hypothesis-driven
Admin info about experiment
  Title: Mesozoic Origin of West Indian Insectivores
  Author: Roca, A.L., Bar-Gal, G.K., Eizirik, E.,
    Helgen, M.K., ...
  Organisation: 1. National Cancer Institute,
    Frederick, USA
  Status: public academic
  Reference: Roca, A.L., Bar-Gal, G.K., Eizirik, E.,
    Helgen, M.K., Maria, R. et al. Mesozoic origin
    for West Indian insectivores. Nature, 429,
    649-651 (2004).
Classification of experiment
  Taxonomy
    DDC (Dewey): 575 Evolution and Genetics
    Library of Congress: QH 367.5 molecular
      phylogenetics
  Zoology
    DDC (Dewey): 599 mammalogy
    Library of Congress: QL351-QL352
      Zoology-Classification
Experimental goal: To discover the phylogeny of the
species Solenodon paradoxus and Solenodon cubanus
Null hypothesis H01: explicit
  Representation style: text
  Linguistic expression: natural language
    "Some have suggested a close relationship to
    soricids (shrews) but not to talpids"
  Linguistic expression: artificial language,
    predicate calculus
Experimental action 1.1.1: extraction and
purification
  Object: sample of DNA
  Parent group: DNA from Solenodon paradoxus
  Sampling: random sampling
  Instrument: Qiagen DNA cleanup kit
Experimental action 1.1.2: DNA amplification
Experimental Conclusions (Formed Hypotheses)
C1) Hypothesis
  Representation style: text
  Linguistic expression: natural language
    There existed a mammal that is the ancestor of
    Solenodons, Soricoidea, Talpoidea, Erinaceidea,
    and which is not the ancestor of any other
    mammal.
  Linguistic expression: artificial language,
    predicate calculus
EXPO: A scientific experiment is a research
method which permits the investigation of
cause-effect relations between known and unknown
(target) variables of the field of study
(domain). An experimental result cannot be known
with certainty in advance.
EXPO: A classification of experiments is a
hierarchical system of categories (types of
experiments) according to their domains or used
models of experiments.
Prolog:
instantiation(solenodon, So),
instantiation(soricoidea, Sh),
instantiation(talpoidea, T),
instantiation(mammalia, An),
shared_ancestor([So, Sh, T], An).
% shared_ancestor(Shared, Not_shared).
shared_ancestor([X], [Y], An) :-
    ancestor(An, X), not ancestor(An, Y).
shared_ancestor([X|Lx], Ly, An) :-
    shared_ancestor(Lx, Ly, An), ancestor(An, X).
shared_ancestor(Lx, [Y|Ly], An) :-
    shared_ancestor(Lx, Ly, An), not ancestor(An, Y).
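The shared-ancestor rule can also be read as a plain set test: a candidate must be an ancestor of every taxon in the shared group and of no taxon outside it. The sketch below restates that idea in Python under loud assumptions: the parent map is a toy tree invented for illustration (it is not the phylogeny from the paper), and `clade_x` and `carnivora` are hypothetical placeholder nodes.

```python
# Toy parent map (child -> parent). The tree shape is illustrative only.
PARENT = {
    "solenodon": "clade_x",
    "soricoidea": "clade_x",
    "talpoidea": "clade_x",
    "erinaceidea": "clade_x",
    "clade_x": "mammalia",
    "carnivora": "mammalia",
}


def ancestors(taxon):
    """Return the set of all proper ancestors of a taxon."""
    found = set()
    while taxon in PARENT:
        taxon = PARENT[taxon]
        found.add(taxon)
    return found


def shared_ancestor(shared, not_shared, candidate):
    """True if candidate is an ancestor of every taxon in `shared`
    and of no taxon in `not_shared`."""
    return (all(candidate in ancestors(t) for t in shared)
            and not any(candidate in ancestors(t) for t in not_shared))


group = ["solenodon", "soricoidea", "talpoidea", "erinaceidea"]
print(shared_ancestor(group, ["carnivora"], "clade_x"))   # True
print(shared_ancestor(group, ["carnivora"], "mammalia"))  # False
```

In the toy tree, `mammalia` fails the test because it is also an ancestor of a taxon outside the group, which is the "not the ancestor of any other mammal" condition in the natural-language conclusion.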
EXPO: A null hypothesis is an experimental
hypothesis that states that a known controlled
variable or variables does not have a specified
effect on the unknown (target) variable or
variables of the domain.
XML:
</rdfs:Class>
<rdfs:Class rdf:ID="classification of experiments">
  <rdfs:label>classification of experiments</rdfs:label>
  <rdfs:subClassOf rdf:resource="classification" />
  <rdfs:comment> Def: A classification of
  experiments is a hierarchical system of
  categories - types of experiments - according to
  their domains or used models of
  experiments. Axiom: </rdfs:comment>
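One benefit of the RDF form is that the class definition can be read by a program rather than by a person. The sketch below parses a class definition of this shape with Python's standard `xml.etree.ElementTree`. The enclosing `rdf:RDF` element, the namespace declarations and the closing class tag are assumptions added so that the fragment is well-formed; they are not shown on the slide.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"

# Wrapper element and namespaces are assumed; the class body follows the slide.
DOC = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:rdfs="{RDFS}">
  <rdfs:Class rdf:ID="classification of experiments">
    <rdfs:label>classification of experiments</rdfs:label>
    <rdfs:subClassOf rdf:resource="classification" />
    <rdfs:comment>Def: A classification of experiments is a hierarchical
    system of categories - types of experiments - according to their
    domains or used models of experiments.</rdfs:comment>
  </rdfs:Class>
</rdf:RDF>"""

root = ET.fromstring(DOC)

# Map each class label to the label of its declared superclass.
classes = {}
for cls in root.findall(f"{{{RDFS}}}Class"):
    label = cls.findtext(f"{{{RDFS}}}label")
    parent = cls.find(f"{{{RDFS}}}subClassOf").get(f"{{{RDF}}}resource")
    classes[label] = parent

print(classes)
```

A dedicated RDF library would normally be preferred for real ontology files; the standard-library parser is used here only to keep the sketch self-contained.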
23
Problems Highlighted by Annotation 1
  • The use of EXPO makes explicit the different
    hypotheses described in the paper.
  • What we have identified as the <research
    conclusion> is not presented as a hypothesis in
    the text.
  • This contrasts with what we identify as the seven
    null-hypotheses, which are mentioned explicitly
    in the main text (sub-optimal statistically).

24
Problems Highlighted by Annotation 2
  • Another aspect of the research that use of EXPO
    would have highlighted is that the DNA
    sequences produced during the experiment were
    stored in the EMBL database using the taxonomic
    term Insectivora.
  • This taxon is now generally recognised to be
    polyphyletic, and its use contradicts the actual
    conclusions of the paper.

25
Problems Highlighted by Annotation 3
  • We formalised the knowledge behind the authors'
    argument that Cuban Solenodons should be
    classified in a distinct genus, Atopogale.
  • Our analysis indicates that it would be more
    internally consistent for the authors to have
    classified Cuban Solenodons as a distinct family.
  • etc.

26
High-energy/particle physics
Another random paper selected from the same Nature
issue: D0 Collaboration. A precision measurement
of the mass of the top quark. Nature, 429,
639-642 (2004). (350 scientists)
27
Experimental Equipment!
28
EXPO D0 Example 1
  • <scientific experiment> <computational
    experiment> <simulation>
  • <admin info about experiment>
  • <title> A precision measurement of the mass
    of the top quark
  • <classification by domain>
  • <domain of experiment> High Energy Physics /
    Particle Physics
  • <DDC(Dewey) classification> 539.7 Atomic and
    nuclear physics
  • <Library of Congress classification> QC 770-798
    Atomic, Nuclear, Particle Physics
  • <related domain> Computational Statistics
  • <DDC(Dewey) classification> 519 Probabilities
    and Applied Mathematics
  • <Library of Congress classification> QA 273-274
    Probabilities
  • <research hypothesis> <representation
    style> <text>
  • <linguistic expression> <natural language>
  • Given the same observed data, use of the new
    statistical method M1 will produce a more
    accurate estimate of Mtop than the original
    method M0.
  • <linguistic expression> <artificial language>
  • M0(? D0 observations ? ? relevant background
    knowledge) ? E0
  • M1(? D0 observations ? ? relevant background
    knowledge) ? E1
  • estimation_error(E0, Mtop) ? Error0
  • estimation_error(E1, Mtop) ? Error1
  • Error0 > Error1

29
Problems Highlighted by Annotation 1
  • Poor science, even though published in Nature!
  • This annotation makes it explicit that the
    experiment was somewhat unusual in not generating
    any new observational data. Instead, it presents
    the results of applying a new statistical
    analysis method to existing data (a set of
    putative top quark pair decay events involving
    e+jets and µ+jets).

30
Problems Highlighted by Annotation 2
  • No explicit hypothesis.
  • We argue that the paper's implicit experimental
    hypothesis was: "given the same observed data, use
    of the new statistical method will produce a more
    accurate estimate of Mtop than the original
    method."
  • This is based on the authors' statement: "here we
    report a technique that extracts more information
    from each top-quark event and yields a greatly
    improved precision when compared to previous
    measurements."
  • We prefer the term accuracy to precision.
31
Problems Highlighted by Annotation 3
  • The Carnap principle: all relevant knowledge
    should be used to decide a scientific question.
  • 91 candidate events were used to calculate the
    old value, but only 22 of these were used for the
    new value!
  • The old method estimate of Mtop is 173.3 ±
    5.6 (stat) ± 5.5 (sys) GeV/c².
  • The new method estimate of Mtop is 180.1 ± 3.6
    (stat) ± 3.9 (sys) GeV/c².
  • The current (June 2005) best estimate for Mtop is
    174.3 ± 3.4 GeV/c².
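The distinction between precision and accuracy can be checked with a little arithmetic on the figures above. This is a rough sketch under two assumptions that the slides do not state: statistical and systematic uncertainties are treated as independent and combined in quadrature, and the June 2005 value is used as the yardstick for accuracy rather than as the true mass.

```python
import math


def total_uncertainty(stat, syst):
    """Combine statistical and systematic uncertainties in quadrature
    (assumes the two are independent)."""
    return math.hypot(stat, syst)


# (central value, stat, syst) in GeV/c^2, as quoted in the talk.
estimates = {
    "old method M0": (173.3, 5.6, 5.5),
    "new method M1": (180.1, 3.6, 3.9),
}
reference = 174.3  # June 2005 best estimate, used here as a yardstick

for name, (value, stat, syst) in estimates.items():
    total = total_uncertainty(stat, syst)
    offset = abs(value - reference)
    print(f"{name}: {value} +/- {total:.1f} GeV/c^2, "
          f"offset from reference {offset:.1f}")
```

On these numbers the new method has the smaller combined uncertainty (about 5.3 against about 7.8), so it is more precise, while its central value lies further from the reference (5.8 against 1.0), which is why the choice between the words precision and accuracy matters.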

32
Problems Highlighted by Annotation 4
  • The paper concluded that Mtop is higher than
    previously estimated, which deductively implies a
    higher mass for the Higgs Boson. As the Higgs
    Boson has not yet been observed, even at energies
    above its previously predicted maximum likelihood
    mass, the newly inferred higher Mtop lent support
    to the existence of the Higgs Boson.
  • However, it would have been possible to argue
    validly the other way: the Higgs Boson is
    thought highly likely to exist, therefore its
    non-observation makes a higher value of Mtop
    more probable.
  • This argument was not explicit in the paper, but
    may have existed implicitly as a motivation.
  • The paper would have benefited from making this
    argument explicit, even if not used.

33
An Ontology for Drug Screening and Design
  • Funded BBSRC project started in April 2007.
  • Extend EXPO to formalise meta-data for drug
    screening and design.
  • We are developing our own Drug screening and Drug
    design Robot Scientist - Eve.
  • Collaborating with industry. Working with Pfizer
    to develop ontology and experiment annotation
    system. Especially important for merging the
    data that results from corporate mergers.

34
Structural Biology
  • Structural Biology was once a leader in the
    development of standards for the preservation and
    sharing of data.
  • This lead has been lost.
  • The main data standard, mmCIF, does not meet
    state-of-the-art standards in biology for
    ontologies.
  • The main database, PDB, is not relational
    although it is meant to be.
  • We have proposed a way forward using EXPO.
  • Nature Biotechnology (2007) 25, 437-442

35
ART
  • An Ontology Based Tool for the Translation of
    Papers into Semantic Web Format.
  • Focussed on physical chemistry: very structured
    publications.
  • Funded by JISC, in collaboration with the Royal
    Society of Chemistry, and UKOLN.

36
ART 2
  • Tool to add value to papers and data stored in a
    repository.
  • The tool will lead authors through a process
    where experimental goals, hypotheses,
    methodologies, results, etc. are described and
    linked to the text and external data.
  • The result will be an article in OWL format that
    can be archived with the original text version.
  • The OWL version will be more formalised and
    useful for computer processing, e.g. text mining.
  • SIG/ISMB07 Ontology Workshop / BMC Bioinformatics

37
[Workflow diagram: ART processing steps]
  • Input: free text article
  • Convert to SciXML article (domain independent)
  • Markup paper metadata (title, author, ...) using
    DC, PRISM
  • Named entity recognition (domain dependent)
  • Markup domain concepts (molecule, bond, ...) using
    ChEBI, FIX, REX
  • Recognition of generic scientific concepts (goal,
    hypothesis, ...) using EXPO, OBI, ECO (domain
    independent)
  • Request / Confirm / Explain with the user
  • Markup generic scientific concepts
  • Generate Summary, RSS feed
  • Output: xml/owl article
38
The Concept of a Robot Scientist
We have developed the first computer system that
is capable of originating its own experiments,
physically doing them, interpreting the results,
and then repeating the cycle.
[Diagram labels: Background Knowledge, Analysis,
Hypothesis Formation, Consistent Hypotheses,
Experiment selection, Experiment, Robot, Results
Interpretation, Final Theory]
King et al. (2004) Nature, 427, 247-252.
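The cycle of forming hypotheses, selecting an experiment, running it and interpreting the result can be sketched as a simple elimination loop. Everything in this sketch is a toy assumption made for illustration: the hypotheses, the prediction rule and especially the selection heuristic (pick the experiment that splits the remaining hypotheses most evenly) are not taken from the Robot Scientist itself, whose actual strategy is not described in these slides.

```python
def select_experiment(hypotheses, experiments, predict):
    """Pick the experiment whose predicted outcomes split the remaining
    hypotheses most evenly (an assumed heuristic)."""
    def imbalance(exp):
        positives = sum(predict(h, exp) for h in hypotheses)
        return abs(2 * positives - len(hypotheses))
    return min(experiments, key=imbalance)


def discovery_loop(hypotheses, experiments, predict, run_experiment):
    """Repeat: select an experiment, run it, keep only the hypotheses
    whose prediction matches the observed result."""
    hypotheses = list(hypotheses)
    experiments = list(experiments)
    while len(hypotheses) > 1 and experiments:
        exp = select_experiment(hypotheses, experiments, predict)
        experiments.remove(exp)
        observed = run_experiment(exp)
        hypotheses = [h for h in hypotheses if predict(h, exp) == observed]
    return hypotheses


# Toy domain: hypothesis k says "the knocked-out gene acts at pathway step k".
# Experiment n feeds the metabolite produced by step n; growth is predicted
# only if the blocked step is at or before n.
candidate_steps = [1, 2, 3, 4]
available_experiments = [1, 2, 3]


def predict(step, exp):
    return step <= exp


TRUE_STEP = 3  # hidden "ground truth" standing in for the laboratory


def run_experiment(exp):
    return predict(TRUE_STEP, exp)


print(discovery_loop(candidate_steps, available_experiments,
                     predict, run_experiment))
```

In this toy run two experiments are enough to leave a single consistent hypothesis, which illustrates why the choice of the next experiment is part of the loop rather than fixed in advance.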
39
Motivation 1 Philosophical
  • What is Science?
  • The question whether it is possible to automate
    the scientific discovery process seems to me
    central to understanding science.
  • There is a strong philosophical position which
    holds that we do not fully understand a
    phenomenon unless we can make a machine which
    reproduces it.

40
Motivation 2 Technological
  • In many areas of science our ability to generate
    data is outstripping our ability to analyse the
    data.
  • One scientific area where this is true is
    functional genomics, where data is now being
    generated on an industrial scale.
  • The analysis of scientific data needs to become
    as industrialised as its generation.

41
The Application Domain
  • Systems Biology
  • Yeast (S. cerevisiae): best understood
    eukaryotic organism.
  • Strain libraries, e.g. EUROFAN 2 has knocked out
    each of the 6,000 genes.
  • Task: to learn models of yeast metabolism using
    selected mutant strains and quantitative growth
    experiments.

42
Movie
43
Some Example Growth Curves
  • Soldatova et al., CS Dept., Aberystwyth, UK

44
The need for a Robot Scientist ontology (EXPO-RS)
  • The robot requires detailed and formalized
    descriptions of domains, background knowledge,
    experiment methods, technologies, hypothesis
    formation and experiment design rules, etc.
  • Integrity of data and metadata.
  • Open access of the RS experimental data and
    metadata to the scientific community.
  • Soldatova, Sparkes, Clare, King (2006)
    Bioinformatics

45
EXPO-RS
  • Formalization of the entities involved in Robot
    Scientist experiments.
  • A controlled vocabulary for all the participants
    of the project.
  • Identification of metadata essential for the
    experiment's description and repeatability.
  • Coordination of the planning of experiments,
    their execution, access to the results, technical
    support of the robot, etc.
  • Modelling a database for the storage of
    experiment data and the tracking of experiment
    execution.

46
Conclusions
  • The unity of science implies that an accepted
    general ontology of experiments is both possible
    and desirable.
  • Such an ontology would promote the sharing of
    results within and between subjects, reducing
    both the duplication and loss of knowledge.
  • It is also an essential step in formalising
    science, and fully exploiting computer reasoning
    in science.
  • We propose EXPO as a general ontology for
    scientific experiments.
  • We have demonstrated the utility of EXPO on
    applications in phylogenetics, high-energy
    physics, chemistry, and high-throughput Systems
    Biology.

47
Acknowledgements
  • Larisa Soldatova Aberystwyth
  • Amanda Schierz Aberystwyth
  • Ken Whelan Aberystwyth
  • Amanda Clare Aberystwyth
  • Mike Young Aberystwyth
  • Jem Rowland Aberystwyth
  • Andrew Sparkes Aberystwyth
  • Wayne Aubrey Aberystwyth
  • Emma Byrne Aberystwyth
  • Magda Markham Aberystwyth
  • Steve Oliver Manchester
  • Riichiro Mizoguchi Osaka
  • BBSRC, JISC