Title: Provenance-based reasoning in e-Science
1Provenance-based reasoning in e-Science
- Professor Luc Moreau
- L.Moreau_at_ecs.soton.ac.uk
- University of Southampton
2Acknowledgements
- Simon Miles, Paul Groth, Miguel Branco (PASOA Southampton)
- Victor Tan, Liming Chen, Fenglian Xu (EU Provenance Southampton)
- Ian Wootten, Shrija Rajbhandari, Omer Rana, David Walker (PASOA Cardiff)
- Steven Willmott, Javier Vazquez, Laszlo Varga, Arpad Andics, John Ibbotson, Neil Hardman, Alexis Biller (EU Provenance)
3Overview
- Context
- Provenance Concept Definitions
- Architectural Design
- Provenance based Reasoning
- Protocol for P-Assertions Recording
- Provenance Queries
- Conclusions
4Context Importance of Past Processes
5Context (1)
- Bioinformatics: verification and auditing of experiments (e.g. for drug approval)
- High Energy Physics: tracking, analysing and verifying data sets in the ATLAS Experiment of the Large Hadron Collider (CERN)
6Context (2)
- Aerospace engineering: maintain a historical record of design processes, up to 99 years
- Organ transplant management: tracking of previous decisions, crucial to maximise the efficiency in matching and recovery rate of patients
7Concepts Definitions
8Provenance common sense definition
- Oxford English Dictionary
  - the fact of coming from some particular source or quarter; origin, derivation
  - the history or pedigree of a work of art, manuscript, rare book, etc.; concretely, a record of the ultimate derivation and passage of an item through its various owners
- Merriam-Webster Online dictionary
  - the origin, source
  - the history of ownership of a valued object or work of art or literature
- Concept vs representation
9Provenance Definition
- Our definition of provenance in the context of e-Science, for which process matters to end users
  - The provenance of a piece of data is the process that led to that piece of data
- Our aim is to conceive a computer-based representation of provenance that allows us to perform useful analysis and reasoning to support our use cases
10Provenance Lifecycle
[Diagram: core interfaces to the Provenance Store: query provenance of data; administer store and its contents]
11Nature of Documentation
- We represent the provenance of some data by documenting the process that led to the data
  - documentation can be complete or partial
  - it can be accurate or inaccurate
  - it can present conflicting or consensual views of the actors involved
  - it can provide operational details of execution or it can be abstract
12p-assertion
- A given element of process documentation will be referred to as a p-assertion
  - A p-assertion is an assertion that is made by an actor and pertains to a process
13Three views of provenance
- Provenance as a concept
- Set of all p-assertions recorded during execution
  - Since our goal is to support as many use cases as possible, we record as much as we can
  - Scalability concern, and optimisation of recording for specific queries
- A provenance query results in a set of p-assertions or other derived representation
  - Hence, we need clearly specified scoping mechanisms
14From Computer to Physical World
- Application producing electronic data
  - The application is composed of a set of services
  - During execution, p-assertions are recorded in the Provenance Store
  - The provenance of the data can be queried
- Physical actuators/sensors result in a physical artefact
  - Assuming a one-to-one mapping between services and physical actuators/sensors, e.g. a robot
  - What if the mapping is not one to one?
- Electronic data is a proxy for the physical artefact
  - Can we derive the provenance of the physical artefact from a query about the provenance of the electronic data?
15Architectural Design
16Service Oriented Architecture
- Broad definition of service as a component that takes some inputs and produces some outputs.
- Services are brought together to solve a given problem, typically via a workflow definition that specifies their composition.
- Interactions with services take place with messages that are constructed according to the services' interface specifications.
- The term actor denotes either a client or a service in a SOA.
- A process is defined as the execution of a workflow.
17Process Documentation (1)
[Diagram: Actor 1 and Actor 2 exchanging messages M1, M2, M3 and M4]
- From these p-assertions, we can derive that M3 was sent by Actor 1 and received by Actor 2 (and likewise for M4)
- If actors are black boxes, these assertions are not very useful because we do not know the dependencies between messages
18Process Documentation (2)
[Diagram: Actor 1 and Actor 2 exchanging messages M1, M2, M3 and M4]
- These assertions help identify the order of messages, but not how data were computed
19Process Documentation (3)
[Diagram: Actor 1 and Actor 2 exchanging messages M1, M2, M3 and M4]
- These assertions help identify how data is computed, but provide no information about non-functional characteristics of the computation (time, resources used, etc.)
20Process Documentation (4)
[Diagram: Actor 1 and Actor 2 exchanging messages M1, M2, M3 and M4]
21Types of p-assertions (1)
- An interaction p-assertion is an assertion of the contents of a message by an actor that has sent or received that message
22Types of p-assertions (2)
- A relationship p-assertion is an assertion, made by an actor, that describes how the actor obtained output data or messages by applying some function to some input data or messages
23Types of p-assertions (3)
- An actor state p-assertion is an assertion made by an actor about its internal state in the context of a specific interaction
  - "I used sparc processor"
  - "I used algorithm x version x.y.z"
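The three kinds of p-assertion can be pictured as three small record types. The sketch below is only an illustration: the class names, field names and sample values are assumptions made for this example and are not taken from the PASOA data model.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class InteractionPAssertion:
    asserter: str      # actor making the assertion
    message_id: str    # message that was sent or received
    view: str          # "sender" or "receiver"
    content: str       # asserted contents of the message


@dataclass(frozen=True)
class RelationshipPAssertion:
    asserter: str
    outputs: Tuple[str, ...]   # ids of the data or messages produced
    inputs: Tuple[str, ...]    # ids of the data or messages consumed
    function: str              # function applied to the inputs


@dataclass(frozen=True)
class ActorStatePAssertion:
    asserter: str
    message_id: str    # interaction that gives the state its context
    state: str         # free-form description of the internal state


# Hypothetical documentation of a single exchange between two actors.
docs = [
    InteractionPAssertion("Actor 1", "M1", "sender", "<request/>"),
    InteractionPAssertion("Actor 2", "M1", "receiver", "<request/>"),
    RelationshipPAssertion("Actor 2", outputs=("M2",), inputs=("M1",), function="f"),
    ActorStatePAssertion("Actor 2", "M1", "I used algorithm x version x.y.z"),
]
```

Keeping the asserter on every record reflects the point that each p-assertion is made by a specific actor, so two actors may document the same message differently.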
24Data flow
- Interaction p-assertions allow us to specify a flow of data between actors
- Relationship p-assertions allow us to characterise the flow of data inside an actor
- Overall data flow (internal + external) constitutes a directed acyclic graph (DAG)
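A minimal sketch of how such a data flow could be traversed, assuming that each relationship p-assertion has been reduced to an (output, inputs) pair over message identifiers. The identifiers M1 to M4 and the helper names are hypothetical and chosen only for this example.

```python
from collections import defaultdict

# Internal flow, taken from relationship p-assertions: (output, inputs).
# Messages themselves carry the external flow between actors.
relationships = [("M2", ["M1"]), ("M3", ["M2"]), ("M4", ["M3"])]


def build_dependencies(relationships):
    """Map each output to the set of items it was directly derived from."""
    deps = defaultdict(set)
    for output, inputs in relationships:
        deps[output].update(inputs)
    return deps


def provenance_of(item, deps):
    """Return every item that `item` transitively depends on."""
    seen = set()
    stack = [item]
    while stack:
        current = stack.pop()
        for parent in deps.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_acyclic(deps):
    """Depth-first check that the dependency graph contains no cycle."""
    state = {}  # node -> "visiting" or "done"

    def visit(node):
        if state.get(node) == "done":
            return True
        if state.get(node) == "visiting":
            return False  # reached a node that is still on the current path
        state[node] = "visiting"
        ok = all(visit(parent) for parent in deps.get(node, ()))
        state[node] = "done"
        return ok

    return all(visit(node) for node in list(deps))


deps = build_dependencies(relationships)
print(sorted(provenance_of("M4", deps)))
print(is_acyclic(deps))
```

The acyclicity check matters because a cycle would mean that some data item is documented as being derived from itself, which signals inconsistent documentation.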
25Provenance Modelling
26Interfaces to Provenance Store
[Diagram: interfaces to the Provenance Store: query provenance of data; administer store and its contents]
28Provenance-based reasoning
29The case (1)
- Static Validation
  - Operates on workflow source code
  - Programming language static analyses (e.g. type inference, escape analysis, etc.)
  - Workflow specific (concurrency analysis, graph-based partitioning, model checking, quality of service)
  - Workflow script may not be accessible or may be expressed in a language not supported by the analysis tool
- Dynamic Validation
  - Service based interface matching, runtime type checking
30The case (2)
- Provenance based reasoning
  - Allows for validation of experiments after execution
  - Third parties, such as reviewers and other scientists, may want to verify that the results obtained were computed correctly according to some criteria.
  - These criteria may not be known when the experiment was designed or run.
  - Important because science progresses (and models evolve!)
31Bioinformatics Scenario
- A biologist has a set of proteins, for each of which he/she wishes to determine a particular biological property
32Experiment Design
- The biologist designs a high-level plan of an experiment, describing each activity that must be performed
- Each activity determines new information from analysing the information discovered in previous steps
- The steps can be seen as a linked flow of data from the protein to the final property
[Diagram: high-level plan]
33Experiment Services
- For each activity in the plan, the biologist must decide on the concrete service they will perform
- Each service may be designed by the biologist him/herself or adopted from the work of another biologist
- For each service there is a description of that service stating
  - what the service does
  - what type of data it analyses (its inputs) and
  - what type of results it produces (its outputs)
- All the descriptions are stored in a registry
[Diagram: Registry holding Description of Service A: Function .., Inputs .., Outputs ..]
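A registry of service descriptions can be sketched as a lookup table keyed by service name. Everything named here (the `ServiceDescription` fields, the `Registry` methods and the sample values) is an assumption made for illustration, not the interface of an actual registry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceDescription:
    name: str
    function: str      # what the service does
    input_type: str    # type of data it analyses
    output_type: str   # type of results it produces


class Registry:
    """In-memory registry of service descriptions (hypothetical API)."""

    def __init__(self):
        self._descriptions = {}

    def publish(self, description):
        self._descriptions[description.name] = description

    def lookup(self, name):
        return self._descriptions[name]


registry = Registry()
# Sample values are placeholders, not descriptions of real services.
registry.publish(ServiceDescription("A", "encode", "Protein", "Encoded Sequence"))
print(registry.lookup("A").function)
```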
34Performing Experiment
- The biologist now performs the experiment as many times as they wish
- The details of each experimental process are documented in a provenance store
- This is achieved by each service documenting its own execution using the recording interface
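The idea of a service documenting its own execution can be sketched with an in-memory store. The `ProvenanceStore` class, its `record` and `query` methods and the `run_service` helper are hypothetical stand-ins written for this example; they are not the PReServ API.

```python
class ProvenanceStore:
    """In-memory stand-in for a provenance store (hypothetical API)."""

    def __init__(self):
        self._p_assertions = []

    def record(self, asserter, interaction_id, kind, content):
        """Recording interface: append one p-assertion."""
        self._p_assertions.append(
            {"asserter": asserter, "interaction": interaction_id,
             "kind": kind, "content": content}
        )

    def query(self, interaction_id):
        """Query interface: all p-assertions about one interaction."""
        return [p for p in self._p_assertions if p["interaction"] == interaction_id]


def run_service(store, name, interaction_id, payload):
    """A service that documents its own execution while doing its work."""
    # Interaction p-assertion: the message this service received.
    store.record(name, interaction_id, "interaction", payload)
    result = payload[::-1]  # placeholder for the real analysis
    # Relationship p-assertion: how the output was obtained from the input.
    store.record(name, interaction_id, "relationship",
                 {"output": result, "derived_from": payload})
    return result


store = ProvenanceStore()
run_service(store, "Service A", "exchange-1", "MKV")
for p_assertion in store.query("exchange-1"):
    print(p_assertion["kind"], p_assertion["content"])
```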
35Provenance Questions
- Later, the biologist examines the experiment results and has questions about the validity of the process that produced them
  - Did the services I used actually fulfil my high-level plan?
  - Two of the experiments were performed on the same protein but have different results: did I alter the services between these experiments?
  - Did I perform each service on the type of data that the service was intended to analyse, i.e. were the inputs and outputs of each activity compatible?
36Answering the Questions
- Using the documentation in the Provenance Store, we can reconstruct the process that led to each result
- Along with the high-level plan and the descriptions in the registry, we have all the information required to answer the questions
37Q1 Did the experiment follow the plan?
- Retrieve documentation of experiment that led to a result
- Retrieve procedure descriptions
- Compare procedure function to planned activity
[Diagram: Description of Service A: Function .., Inputs .., Outputs ..]
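The comparison for Q1 can be sketched as follows, assuming the plan is an ordered list of activity names, the documentation yields the ordered list of services that were performed, and the registry maps each service to its declared function. All names and sample values are hypothetical.

```python
def plan_mismatches(plan, performed, registry):
    """Compare each performed service's registered function with the plan."""
    mismatches = []
    for step, (activity, service) in enumerate(zip(plan, performed), start=1):
        function = registry[service]["function"]
        if function != activity:
            mismatches.append((step, activity, service, function))
    if len(plan) != len(performed):
        # Steps were skipped or added relative to the plan.
        mismatches.append(("length", len(plan), len(performed), None))
    return mismatches


# Hypothetical registry content and plan.
registry = {
    "A": {"function": "encode"},
    "B": {"function": "compress"},
    "C": {"function": "shuffle"},
}
plan = ["encode", "compress"]

print(plan_mismatches(plan, ["A", "B"], registry))
print(plan_mismatches(plan, ["A", "C"], registry))
```

An empty list means the documented process followed the plan; each tuple otherwise names the step, the planned activity, and the service and function actually recorded.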
38Q2 Do services differ between experiments?
- Retrieve documentation of experiments
- Highlight differences in services between experiments
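A minimal sketch of the Q2 comparison, assuming the documentation of each experiment has been summarised as a mapping from activity to the (service, version) pair that was recorded for it. The activity names, service names and versions are invented for the example.

```python
def service_differences(run_a, run_b):
    """Return the activities whose documented service differs between two runs."""
    differences = {}
    for activity in sorted(set(run_a) | set(run_b)):
        a, b = run_a.get(activity), run_b.get(activity)
        if a != b:
            differences[activity] = (a, b)
    return differences


# Hypothetical summaries of two experiments on the same protein.
run_1 = {"encode": ("A", "1.0"), "compress": ("B", "1.0")}
run_2 = {"encode": ("A", "1.0"), "compress": ("B", "1.1")}

print(service_differences(run_1, run_2))
```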
39Q3 Were the inputs and outputs compatible?
- Retrieve each pair of services performed in an experiment, where one service's output is the other's input
- Retrieve descriptions for each service
- Compare the output type of the first service with the input type of the second
[Diagram: Description of Service A and Description of Service B, each with Function .., Inputs .., Outputs ..]
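The Q3 check can be sketched with exact type matching, assuming the documentation yields (producer, consumer) pairs and the registry gives one input type and one output type per service. The service names and type names are placeholders.

```python
def incompatible_pairs(links, descriptions):
    """Return linked service pairs whose output and input types differ."""
    return [
        (producer, consumer)
        for producer, consumer in links
        if descriptions[producer]["output"] != descriptions[consumer]["input"]
    ]


# Hypothetical service descriptions and documented links.
descriptions = {
    "A": {"input": "Protein", "output": "Sequence"},
    "B": {"input": "Sequence", "output": "Compressed Sequence"},
    "C": {"input": "Image", "output": "Report"},
}

print(incompatible_pairs([("A", "B")], descriptions))
print(incompatible_pairs([("A", "B"), ("B", "C")], descriptions))
```

Exact string comparison is the simplest possible test and is deliberately strict; it rejects any pair whose declared types are not identical.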
40Ontological Reasoning
- In some cases, the high-level activity may be described in a more general way than the service which performs it
- Also, one service's input may be a generalisation of the preceding service's output
- Therefore, exact matching of types may produce a false negative: the biologist will wrongly be told the experiment was invalid
- By using an ontology describing how types are related, we can reason about types and determine whether they are truly compatible
[Diagram: ontology in which Protein is a generalisation of Human Protein]
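A minimal sketch of type compatibility with a generalisation hierarchy, using the Protein / Human Protein example. The ontology is reduced here to a child-to-parent dictionary, which is an assumption of this example (a real ontology language offers far more), and the hierarchy is assumed to be acyclic.

```python
# Each type maps to its direct generalisation (assumed acyclic).
GENERALISATION = {"Human Protein": "Protein"}


def is_subtype(specific, general, parent=GENERALISATION):
    """True if `general` equals `specific` or is one of its generalisations."""
    current = specific
    while current is not None:
        if current == general:
            return True
        current = parent.get(current)
    return False


def compatible(output_type, input_type):
    """An output may feed an input of the same or a more general type."""
    return is_subtype(output_type, input_type)


print(compatible("Human Protein", "Protein"))  # exact matching would reject this
print(compatible("Protein", "Human Protein"))
```

The check is directional: a service producing a Human Protein can feed one that accepts any Protein, but not the other way round.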
41P-assertions Recording
- Recording patterns
- Protocol: PReP
- Properties
- Implementation: PReServ
- Performance
42Separate Store Pattern (service)
43Separate Store Pattern (client)
44Context Passing Pattern
[Diagram: Client and Service, two Provenance Stores each reached through Record P-Assertions, and context]
45Shared Store Pattern
46Repeated Application of Patterns
[Diagram: Client Initiator, Workflow Enactment Engine, Service 1, and Provenance Stores 1, 2 and 3]
47PReP P-Assertion Recording Protocol
- Formalisation as abstract machine
- Properties
- Termination
- Liveness
- Safety
- Stateless
48PReServ Groth et al. 04
- Implementation of PReP protocol and Query Interface
- Provenance store implemented as a Web Service
- Client side libraries for using Provenance Store
- Axis Handler for automatically recording communication between Axis-based Web Services
49PReServ Implementation Diagram
50Evaluation in a Bioinformatics Application
- Bioinformatics workflow studying compressibility of biological sequences
- Implemented as a VDT workflow, scheduled by Condor
- Each service, script, command records p-assertions
HPDC05
51Bioinformatics Application (2)
52Query API
53Structure of Documentation
- The documentation of processes recorded by actors
can be categorised into a hierarchy
[Diagram: hierarchy with levels All documentation; Message exchange (repeated); Message sender's view and Message receiver's view; Message content, State of actor during exchange, and Relationships]
54Query Interface Miles et al. 05
- Purpose
  - Obtain the provenance of some specific data
  - Allow for navigation of the data structure representing provenance
- Abstract interface
  - Allows us to view the provenance store as if containing XML data structures
  - Independent of technology used for running application and internal store representation
  - Seamless navigation of application dependent and application independent provenance representation
55XML Query Languages
- Two existing query languages provide ways of navigating hierarchical data: XPath and XQuery
- For instance, we can use XPath to refer to
  - The message exchange with ID 345: //messageExchange[id='345']
  - The client's view of that exchange: /clientView
  - The body of the message exchanged: /messageContent
56Navigating Message Content
- If message content is in XML format, or can be mapped to it, then XPath and XQuery can be used to navigate into the message content
- For example, we can add application-specific navigation to the previous XPath
  - The SOAP envelope that encloses the message: /soap:envelope
  - The body of the message within the envelope: /soap:body
  - The customer name within the body: //customerName
//messageExchange[id='345']/clientView/messageContent/soap:envelope/soap:body//customerName
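The navigation can be tried out with Python's `xml.etree.ElementTree`, which supports a subset of XPath. The XML document below is invented for this sketch: the element names follow the path on the slide, but the layout of the store, the `id` child element, the namespace URI and the customer name are all assumptions.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML view of a provenance store holding one message exchange.
STORE_XML = """
<store xmlns:soap="http://example.org/soap">
  <messageExchange>
    <id>345</id>
    <clientView>
      <messageContent>
        <soap:envelope>
          <soap:body>
            <order><customerName>Alice</customerName></order>
          </soap:body>
        </soap:envelope>
      </messageContent>
    </clientView>
  </messageExchange>
</store>
""".strip()

NS = {"soap": "http://example.org/soap"}  # placeholder namespace URI
root = ET.fromstring(STORE_XML)

# Application-independent part: exchange 345, client's view, message content.
content = root.find(".//messageExchange[id='345']/clientView/messageContent")

# Application-dependent part: into the SOAP envelope and body, then any depth.
name = content.find("soap:envelope/soap:body//customerName", NS)
print(name.text)
```

Splitting the path in two mirrors the distinction between application-independent navigation (the structure of the documentation) and application-dependent navigation (the structure of the message itself).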
57Other Query Requirements
- Execution Filtering: include/exclude all p-assertions that are marked as part of an execution by a single actor.
- Functionality Filtering: include/exclude p-assertions that have one of a given set of operation types.
- Process Filtering: include/exclude p-assertions that belong to a given (set of) process(es).
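The three filters share one shape: select p-assertions by the value of one attribute, then either keep or drop the matches. The sketch below assumes each p-assertion carries `execution`, `operation` and `process` labels; these field names and the sample data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PAssertion:
    actor: str
    execution: str   # execution (by a single actor) the p-assertion is part of
    operation: str   # operation type
    process: str     # process the p-assertion belongs to


def filter_p_assertions(items, field, values, include=True):
    """Keep (include=True) or drop (include=False) items whose field is in values."""
    wanted = set(values)
    return [p for p in items if (getattr(p, field) in wanted) == include]


# Hypothetical recorded documentation.
recorded = [
    PAssertion("Actor 1", "exec-1", "encode", "process-A"),
    PAssertion("Actor 2", "exec-2", "compress", "process-A"),
    PAssertion("Actor 2", "exec-3", "compress", "process-B"),
]

by_execution = filter_p_assertions(recorded, "execution", ["exec-2"])
by_function = filter_p_assertions(recorded, "operation", ["compress"], include=False)
by_process = filter_p_assertions(recorded, "process", ["process-A"])
print(len(by_execution), len(by_function), len(by_process))
```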
58Conclusions
59Conclusions
- Mostly unexplored area that is crucial for developing trusted systems
  - Definition of provenance
  - Specification of provenance representation
  - Recording protocol
  - Querying interfaces
  - Reasoning based on provenance
- Presents lots of opportunities
60Conclusions
- Current work
  - System and protocol design, architecture specification, generic support for use cases
  - Pursue the deployment in concrete applications and performance evaluation
  - Work towards a standardisation proposal
- Download our software from www.pasoa.org
- Tell us about your use cases: we are keen to find new collaborations in this space!
61Publications
- Sylvia C. Wong, Simon Miles, Weijian Fang, Paul Groth, and Luc Moreau. Provenance-based Validation of E-Science Experiments. In Proceedings of the International Semantic Web Conference (ISWC'05), Nov 2005.
- Paul Groth, Simon Miles, Weijian Fang, Sylvia C. Wong, Klaus-Peter Zauner, and Luc Moreau. Recording and Using Provenance in a Protein Compressibility Experiment. In Proceedings of the 14th IEEE International Symposium on High Performance Distributed Computing (HPDC'05), July 2005.
- Paul T. Groth. Recording Provenance in Service-Oriented Architectures. 9 Month Report, University of Southampton, Faculty of Engineering, Science and Mathematics, School of Electronics and Computer Science, 2004.
- Paul Groth, Michael Luck, and Luc Moreau. A protocol for recording provenance in service-oriented Grids. In Proceedings of the 8th International Conference on Principles of Distributed Systems (OPODIS'04), Grenoble, France, December 2004.
- Paul Groth, Michael Luck, and Luc Moreau. Formalising a protocol for recording provenance in Grids. In Proceedings of the UK OST e-Science second All Hands Meeting 2004 (AHM'04), Nottingham, UK, September 2004.
- Simon Miles, Paul Groth, Miguel Branco, and Luc Moreau. The requirements of recording and using provenance in e-Science experiments. Technical report, University of Southampton, 2005.
- Paul Townend, Paul Groth, and Jie Xu. A Provenance-Aware Weighted Fault Tolerance Scheme for Service-Based Applications. In Proc. of the 8th IEEE International Symposium on Object-oriented Real-time distributed Computing (ISORC 2005), May 2005.