Title: Knowledge and Provenance: A knowledge model perspective
1 Knowledge and Provenance: A knowledge model perspective
- Carole Goble,
- University of Manchester, UK
2 Talk roadmap
What is this provenance about and for?
Knowledge for Provenance
The Provenance of Knowledge
Knowledge technologies
How do we represent knowledge for and about provenance?
Where do knowledge assertions come from?
3 my Context
- Knowledge-driven middleware for data-intensive in silico experiments in biology
- http://www.mygrid.org.uk
4 A real bio provenance log
5 Any and every experimental item attracts provenance (so long as you can ID it).
- Experimental design components
  - workflow specifications; query specifications; notes describing objectives; applications; databases; relevant papers; the web pages of important workers; services
- Experimental instances that are records of enacted experiments
  - data results; a history of services invoked by a workflow engine; instances of services invoked; parameters set for an application; notes commenting on the results
- Experimental glue that groups and links design and instance components
  - a query and its results; a workflow linked with its outcome; links between a workflow and its previous and subsequent versions; a group of all these things linked to a document discussing the conclusions of the biologist
6 Provenance is metadata
- intended for sharing, retrieving, integrating, aggregating and processing.
- generated with the hope that it is comprehensive enough to be future-proofed.
- recorded for those who we do not yet know will use the object and who will likely use it in a different way.
- machine-computable; free text is of limited help.
- Provenance is the knowledge that makes
  - an item interpretable and reusable within a context
  - an item reproducible or at least repeatable.
- It's part of the information model of any system
7 Question: What ATPase superfamily proteins are found in mouse?
1. Q9CQV8 O70468 143B_MOUSE from Swiss-Prot version 30, 05/11/02, 1645 GMT, EBI server.
2. O70455, P54775 143B_MOUSE from Swiss-Prot version 29, 05/11/02 1645 GMT, local copy.
3. P43686 and P54775 derived by a distributed query over DB1 and DB2.
4. InterPro (no particular version) is a pattern database for protein superfamilies and domains for GPCRs but you need an account.
5. The publicly available workflow mouse ATPase (http://www.somelab.edu/bio/carole/wf/3345.wsfl) will generate the result from data in your personal repository and you have permission to run the services it needs. Click to run it.
6. The Attwood lab expertise is in nucleotide binding proteins (ATPase superfamily proteins are nucleotide binding proteins).
7. Jones published a new paper on this in Nature Genetics two weeks ago, and you have an account to access it on-line.
8. Smith in your lab asked this question yesterday and the answer he got is annotated by a commentary in his e-Log Book.
9. P43686 (human) calculated by applying the algorithm ABC located at NCBI using data in database AAA
Provenance (know-wherefrom)
Database query (know-what)
Replicas (know-which)
Virtual data products (know-how)
Ontology and Inference (know-whether)
Workflow (know-how)
Authorisation, Authentication and Accounting (know-who)
Personalised profile (know-whom-to)
Collaboration community (know-where, know-when)
Explanation (know-why)
Digital archive (know-which)
Annotation notes (know-that)
8 Provenance is contextual metadata
- We look at the same things in different ways and different things in the same way
- Our data alone does not describe our work
- We have to capture this context.
Hero http://hero.geog.psu.edu/Hero_knowledge_management.pdf Downloaded 301103
9 Provenance forms
- Derivations
  - A path like a workflow, script or query.
  - Linking items, usually in a directed graph.
  - An explanation of when, by whom and how something was produced.
  - Execution; process-centric
- Annotations
  - Attached to items or collections of items, in a structured, semi-structured or free text form.
  - Annotations on one item or linking items.
  - An explanation of why, when, where, who, what, how.
  - Data-centric
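The derivation form above (items linked in a directed graph) can be sketched in a few lines. This is only an illustration under stated assumptions, not how myGrid stores derivations: the item names and the `derived_from` mapping are hypothetical.

```python
from collections import deque

# Hypothetical derivation graph: each item maps to the items it was derived from.
derived_from = {
    "result_table": ["blast_hits", "workflow_run_1"],
    "blast_hits": ["query_sequence"],
    "workflow_run_1": ["workflow_template"],
    "query_sequence": [],
    "workflow_template": [],
}


def lineage(item, graph):
    """Return every upstream item reachable from `item`, in breadth-first order."""
    seen = []
    queue = deque(graph.get(item, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited via another path
        seen.append(current)
        queue.extend(graph.get(current, []))
    return seen


print(lineage("result_table", derived_from))
```

Walking the graph backwards from an item answers the "where did this come from?" question; annotations would be extra properties hung on the same nodes.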
10 Workflows as in silico experiments
- Freefluo workflow enactment engine
  - WSFL
  - Scufl
- Semantic workflow discovery
  - Finding workflows that others have done, and that I have done myself
- Semantic service discovery
  - Finding classes of services
  - Guiding service composition
  - (We don't do automated composition)
- Dynamic workflow enactment service discovery and invocation
  - Choose service instances when running workflow
  - User involvement
11 Semantic discovery: services and workflows
- Services and workflows in registry have RDF and OWL descriptions
- Selection by the types of inputs they use, outputs they produce, the bioinformatics tasks they perform
- Querying using RDQL over RDF UDDI registry for operational metadata
- Matching using FaCT OWL classification for concept-based metadata
A registry browser
A workflow wizard
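Selection by input type, output type and task can be illustrated with a toy in-memory registry. This is a sketch only: the `ServiceDescription` fields, the registry entries and `find_services` are hypothetical, and it uses exact string matching rather than RDQL queries or FaCT classification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceDescription:
    """Hypothetical, simplified stand-in for a registry description."""
    name: str
    input_type: str
    output_type: str
    task: str


# Hypothetical registry entries, for illustration only.
REGISTRY = [
    ServiceDescription("blast_service", "protein_sequence", "alignment_report", "similarity_search"),
    ServiceDescription("snp_lookup", "gene_id", "snp_record", "retrieval"),
    ServiceDescription("translate", "nucleotide_sequence", "protein_sequence", "translation"),
]


def find_services(registry, input_type=None, output_type=None, task=None):
    """Return names of services that satisfy every criterion that is not None."""
    matches = []
    for service in registry:
        if input_type is not None and service.input_type != input_type:
            continue
        if output_type is not None and service.output_type != output_type:
            continue
        if task is not None and service.task != task:
            continue
        matches.append(service.name)
    return matches


print(find_services(REGISTRY, output_type="protein_sequence"))
```

Concept-based matching would replace the equality tests with a check against an ontology, so that a request for a broader type also finds services described with narrower ones.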
12 Provenance forms in myGrid
- Derivations
  - FreeFluo Workflow Enactment Engine provides a detailed provenance record stored in the myGrid Information Repository (mIR) describing what was done, with what services and when
  - XML document, soon to be an RDF model
- Annotations
  - Every mIR object has Dublin Core provenance properties described in an attribute value model
13 Provenance of data
- Operational execution trail
GeneAC005412.6
SNP000010197
input
output
process: start time, end time
run_for
by_service
urn Clare Jennings
lsid HGVBase_retrieve
14 Provenance of knowledge
- Declarative semantic execution trail
contains_single_nucleotide_polymorphism
GeneAC005412.6
SNP000010197
input
output
as stated by
process: start time, end time
run_for
by_service
urn Claire Jennings
lsid HGVBase_retrieve
15 Provenance of knowledge
urn Carole Goble
disputed by
contains_single_nucleotide_polymorphism
GeneAC005412.6
SNP000010197
input
output
as stated by
process: start time, end time
run_for
by_service
urn Claire Jennings
lsid HGVBase_retrieve
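The diagram labels above describe a knowledge assertion that itself carries provenance: who or what stated it, and who disputes it. A minimal sketch of that idea follows; the `Statement` class and `is_contested` helper are hypothetical, and only the identifiers and names are taken from the slide labels.

```python
from dataclasses import dataclass, field


@dataclass
class Statement:
    """A (subject, predicate, object) assertion plus provenance about the assertion."""
    subject: str
    predicate: str
    obj: str
    stated_by: list = field(default_factory=list)
    disputed_by: list = field(default_factory=list)


# The assertion from the slide, with its source and a dissenting voice attached.
claim = Statement(
    "GeneAC005412.6",
    "contains_single_nucleotide_polymorphism",
    "SNP000010197",
)
claim.stated_by.append("process by_service HGVBase_retrieve, run_for Claire Jennings")
claim.disputed_by.append("Carole Goble")


def is_contested(statement):
    """True when a statement has at least one source and at least one dispute."""
    return bool(statement.stated_by) and bool(statement.disputed_by)


print(is_contested(claim))
```

Keeping the source and the dispute on the statement, rather than on the gene or the SNP, is what distinguishes provenance of knowledge from provenance of data.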
16 Provenance of knowledge
- Aggregation and integration
process: start time, end time
run_for
by_service
urn Bill Jones
lsid BIGDbretrieve
as stated by
contains_single_nucleotide_polymorphism
GeneAC005412.6
SNP000010197
17 20,000 feet and ground level
- Top Down provenance
  - What is going on?
  - Unification and summaries of collective provenance knowledge.
  - Collaborative, Awareness, Experience base, Scientific Corporate memory.
  - What projects have something to do with human SNPs?
  - What experiments use the PSI-BLAST service regardless of version?
- Bottom Up provenance
  - Where did this data object http://doh.dah.ac.uk/ come from?
  - Which version of Swiss-Prot was run in workflow http://blah.ac.uk/?
Build up layers of provenance knowledge
18 Provenance for People and Machines
Subjective
People
Experiment
User
Manual/ semi-automated
Trust
Services
Domain
Objective
Data
Contextual
Execution
Workflow
Machines
Context-free
Automated
19 1. Explicitly capture Context
- Reuse methods and strategies (e.g., protocols)
- Make explicit the situational bias that is normally implicit
- Enable future generations of scientists to follow our work
- To capture meaning, we must devise a way of representing concepts and their relationships
Hero http://hero.geog.psu.edu/Hero_knowledge_management.pdf Downloaded 301103
20 1. Explicitly capture Context
- Using models and terms
  - that can be shared and interpreted
  - that are extensible and preclude premature restrictions
  - that are navigable and computationally processable
Hero http://hero.geog.psu.edu/Hero_knowledge_management.pdf Downloaded 301103
21 2. Bridge islands of exported provenance
Service 1
Service 2
Workflow 1
Experimental Investigation 1
Data 1
22 Not all exports are the same
Service 1
Service 2
Workflow 1
Experimental Investigation 1
Data 1
23 So we need to
- Uniquely identify items through URIs and Life Science Identifiers (GSH/GSR/Handle.net)
- Explicitly expose provenance by assertions in a common data model
- Publish and share consensually agreed ontologies so we can share the provenance metadata and add in background knowledge
- Then we can query, filter, integrate and aggregate the provenance metadata
- and reason over it to infer more provenance metadata using rules
- and attribute trust to the provenance
- Flexibly, so that we do not cast models and terms in stone, and so can cope with different degrees of description.
What's an Ontology?
- A common vocabulary of terms
- Some specification of the meaning of the terms
- Concepts, relationships, axioms
- A shared consensual understanding for people and machines
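Unique identification is the first requirement above. A Life Science Identifier is a URN of the form `urn:lsid:authority:namespace:object[:revision]`, and a small parser for that shape might look like the sketch below. The example identifiers are hypothetical, and the parser checks only the overall structure, not every rule of the LSID specification.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(text):
    """Split an LSID URN into its parts; raise ValueError if the shape is wrong."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    if any(part == "" for part in parts[2:5]):
        raise ValueError(f"empty LSID component in: {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return LSID(parts[2], parts[3], parts[4], revision)


# Hypothetical identifier, for illustration only.
print(parse_lsid("urn:lsid:example.org:workflow:3345:1"))
```

Because the identifier is independent of where the item is stored, assertions about the same workflow or data item made by different services can later be joined on it.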
24 W3C Metadata language/model: Resource Description Framework
- Common model for metadata
  - Assertions as triples (subject, predicate, object) forming graphs.
  - Associate URIs (LSIDs) with other URIs (LSIDs).
  - Associate URIs with OWL concepts (which are URIs).
- RDQL, repositories, integration tools, presentation tools
  - Query over, Link together, Aggregate, Integrate assertions.
- Avoids pre-commitment
  - Self-describing
  - Incremental
  - Extensible
  - Advantage and drawback.
Graphic based on Tim Berners-Lee http://www.w3.org/2003/Talks/0521-www-keynote-tbl/slide22-0.html
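The triple model is simple enough to imitate with plain tuples, which shows why aggregation and querying come almost for free. The sketch below is not an RDF library and not RDQL; the `urn:example:` identifiers, the predicate names and the `match` helper are hypothetical.

```python
def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern, sorted; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )


# Two independently produced sets of assertions about overlapping identifiers.
execution_log = {
    ("urn:example:run1", "by_service", "urn:example:HGVBase_retrieve"),
    ("urn:example:run1", "output", "urn:example:SNP000010197"),
}
annotations = {
    ("urn:example:run1", "run_for", "urn:example:person1"),
    ("urn:example:SNP000010197", "type", "urn:example:concept/SNP"),
}

# Aggregation is set union: shared identifiers link the two graphs together.
merged = execution_log | annotations

print(match(merged, subject="urn:example:run1"))
```

Neither source had to agree on a schema in advance, which is the "avoids pre-commitment" point; the drawback is that nothing stops the two sources from using different predicates for the same idea.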
25 Bridging islands
Service 1
Service 2
Workflow 1
Experimental Investigation 1
Data 1
26 Bridging islands: Concepts and LSID
Service 1
Service 2
Workflow 1
RDF
RDF
RDF
RDF
RDF
RDF
Experimental Investigation 1
Data 1
27 W3C Ontology language/model: OWL
- Continuum of expressivity
- Concepts, roles, individuals, axioms
- From simple frames to description logics
- Sound and complete formal semantics
- Compositional and property based
- Reasoning to infer classification
- Eas(ier) to extend and evolve and merge ontologies
- A web language
- Tools, tools, tools!
28 Bridging islands: Concepts and LSIDs
Service 1
Service 2
Workflow 1
RDF
RDF
RDF
RDF
RDF
RDF
Experimental Investigation 1
Data 1
29 Bridging islands: Concepts and LSIDs
Service 1
LSID
LSID
Service 2
LSID
Workflow 1
RDF
LSID
LSID
RDF
RDF
RDF
LSID
LSID
RDF
RDF
LSID
LSID
Experimental Investigation 1
Data 1
LSID
LSID
30 Layers of Knowledge Languages
Attribution
Explanation
Rules Inference
Ontologies
Metadata
Standard Syntax
Identity
Wedding cake courtesy of Tim Berners-Lee
31 myGrid: everything has a concept and LSID
Workflows
Provenance record of workflow runs
Notes
People
Data holdings
Services
32 Linking objects to objects via URIs and LSIDs
People to notify of the workflow status
Provenance of the workflow template. Related workflows.
Ontologies describing workflows
33 Lymphocyte and neutrophil are subsumed by the concept white blood cell
Generated link anchors
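Subsumption of this kind can be illustrated by walking asserted parent links. This is a sketch only: a description logic reasoner infers classification from concept definitions, whereas this example just follows a hand-written hierarchy. The `is_a` table and the `cell` parent are hypothetical additions.

```python
# Hypothetical asserted hierarchy: concept -> direct parent concept.
is_a = {
    "lymphocyte": "white_blood_cell",
    "neutrophil": "white_blood_cell",
    "white_blood_cell": "cell",  # assumed parent, not taken from the slide
}


def subsumed_by(concept, ancestor, hierarchy):
    """True if `ancestor` is `concept` itself or one of its ancestors."""
    current = concept
    while current is not None:
        if current == ancestor:
            return True
        current = hierarchy.get(current)  # None once the top is reached
    return False


print(subsumed_by("lymphocyte", "white_blood_cell", is_a))
```

With such a test in place, an item annotated as a lymphocyte can be linked automatically from anything that refers to white blood cells, which is how link anchors can be generated rather than authored by hand.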
34 Annotating a workflow log with concepts
1. Choose the ontology
2. Select an area to annotate with
3. Select the concept
4. Provide a description
5. Create the annotation
35 Generating provenance
Data and metadata from the run
startTime, endTime, service instances invoked
RDFOWL
Workflow execution Template
Scufl
RDFOWL
mIR
Identify workflow
Execution Provenance log
FreeFluo WFEE
Bind services
Input data parameters
Knowledge Provenance log
Workflow knowledge template
RDF registry
OWL descriptions
RDFOWL
Knowledge arising from workflow
36 P Afflard et al, The Grid(s)? @ Novartis, presented at PRISM PharmaGrid retreat, July 2003
37 William Pike, Ola Ahlqvist, Mark Gahegan, Sachin Oswal, Supporting Collaborative Science through a Knowledge and Data Management Portal, in 1st Semantic Web Conference (ISWC2003) Workshop on Retrieval of Scientific Data, Florida, USA, October 2003
38 Two views of a gravity model concept from the Hero CODEX web tool
William Pike, Ola Ahlqvist, Mark Gahegan, Sachin Oswal, Supporting Collaborative Science through a Knowledge and Data Management Portal, in 1st Semantic Web Conference (ISWC2003) Workshop on Retrieval of Scientific Data, Florida, USA, October 2003
- An ontological description shows how one geoscientist constructs a model
- A social network reveals which users favour different instances of the model, with edge length suggesting the degree of support.
39 Collaboratory for Multi-Scale Chemical Science
CMCS Pedigree Graph portlet showing provenance relationships between resources (colour coded by original relationship type).
CMCS Pedigree Browser showing the metadata and relationships of the selected data set.
40 Provenance dimensions connected by concepts and identifiers
project
Services
Workflow instances
Author
project
workflow template
Based on http://www.w3.org/2003/Talks/0521-www-keynote-tbl/slide22-0.html
41 Reflections: annotations
- Annotation metadata model for myGrid holdings is a graph
  - If it waddles like RDF and quacks like RDF, it's RDF
  - Experiments in RDF scalability
  - Co-existence of RDF and other data models (relational)
- Acquisition of annotations and adverts
  - Automated by mining WSDL docs, mining ws-info docs
  - Deep annotation works ok for bioinformatic service concepts (it's an EMBL record) but
  - Annotating with biologically meaningful concepts is harder
  - Data in the mIR (it's a lymphocyte)
  - Manual annotation cost is high!
  - Service/workflow publication tools
- Dealing with change
  - Ontology changes, service changes, annotations change.
42 Random Thoughts
- Where does the knowledge come from (see Luc)?
- How do we model trust (see Luc)?
- Scalability of Semantic Web technologies?
- Visualisation of knowledge (see Monica)?
- What's the lifecycle of provenance?
- Different knowledge models for different disciplines?
- Layers of provenance
  - Provenance that is domain knowledge
  - Provenance for context vs execution
  - People vs machine
  - Different models for different items but still needs to be integrated
  - Technologies for sharing and integrating that are flexible.
knowledge
workflow
provenance
43 Talk provenance
- myGrid http://www.mygrid.org.uk
  - Jun Zhao, Mark Greenwood, Chris Wroe, Phil Lord, Chris Greenhalgh, Luc Moreau, Robert Stevens
- Hero http://hero.geog.psu.edu/
  - William Pike, Ola Ahlqvist, Mark Gahegan, Sachin Oswal
- Collaboratory for Multi-Scale Chemical Science (CMCS)
  - James D. Myers, Carmen Pancerella, Carina Lansing, Karen L. Schuchardt, Brett Didier
- Chimera
  - Michael Wilde, Ian Foster
- Knowledge Space
  - Novartis
- And special thanks to Ian Cottam for heroic support when my laptop died yesterday. Afternoon.