1
Provenance-based reasoning in e-Science
  • Professor Luc Moreau
  • L.Moreau_at_ecs.soton.ac.uk
  • University of Southampton

2
Acknowledgements
  • Simon Miles, Paul Groth, Miguel Branco (PASOA
    Southampton)
  • Victor Tan, Liming Chen, Fenglian Xu (EU
    Provenance Southampton)
  • Ian Wootten, Shrija Rajbhandari, Omer Rana, David
    Walker (PASOA Cardiff)
  • Steven Willmott, Javier Vazquez, Laszlo Varga,
    Arpad Andics, John Ibbotson, Neil Hardman, Alexis
    Biller (EU Provenance)

3
Overview
  • Context
  • Provenance Concept Definitions
  • Architectural Design
  • Provenance-based Reasoning
  • Protocol for P-Assertions Recording
  • Provenance Queries
  • Conclusions

4
Context: Importance of Past Processes
5
Context (1)
  • Bioinformatics: verification and auditing of
    experiments (e.g. for drug approval)
  • High Energy Physics: tracking, analysing,
    verifying data sets in the ATLAS Experiment of
    the Large Hadron Collider (CERN)
6
Context (2)
  • Aerospace engineering: maintain a historical
    record of design processes, up to 99 years.
  • Organ transplant management: tracking of previous
    decisions, crucial to maximise the efficiency in
    matching and recovery rate of patients
7
Concepts Definitions
8
Provenance: common sense definition
  • Oxford English Dictionary:
  • the fact of coming from some particular source or
    quarter; origin, derivation
  • the history or pedigree of a work of art,
    manuscript, rare book, etc.; concretely, a record
    of the ultimate derivation and passage of an item
    through its various owners.
  • Merriam-Webster Online dictionary:
  • the origin, source
  • the history of ownership of a valued object or
    work of art or literature
  • Concept vs representation

9
Provenance Definition
  • Our definition of provenance in the context of
    e-Science, for which process matters to end
    users:
  • The provenance of a piece of data is the process
    that led to that piece of data
  • Our aim is to conceive a computer-based
    representation of provenance that allows us to
    perform useful analysis and reasoning to support
    our use cases

10
Provenance Lifecycle
Core Interfaces to Provenance Store
Provenance Store
Query Provenance of Data
Administer Store and its contents
11
Nature of Documentation
  • We represent the provenance of some data by
    documenting the process that led to the data
  • documentation can be complete or partial
  • it can be accurate or inaccurate
  • it can present conflicting or consensual views of
    the actors involved
  • it can provide operational details of execution
    or it can be abstract.

12
p-assertion
  • A given element of process documentation will be
    referred to as a p-assertion
  • p-assertion is an assertion that is made by an
    actor and pertains to a process.

13
Three views of provenance
  • Provenance as a concept
  • Set of all p-assertions recorded during
    execution
  • Provenance query results in set of p-assertions
    or other derived representation

Since our goal is to support as many use cases as
possible, we record as much as we can. Hence, we need
clearly specified scoping mechanisms. Scalability is a
concern, as is the optimisation of recording for
specific queries.
14
From Computer to Physical World
  • Application producing electronic data
  • Application is composed of a set of services
  • During execution, p-assertions are recorded
  • Provenance of data can be queried
  • Physical actuators/sensors result in physical
    artefact
  • Assuming a one to one mapping between services
    and physical actuators/sensors, e.g. robot
  • What if mapping is not one to one?
  • Electronic data is a proxy for physical artefact
  • Can we derive the provenance of the physical
    artefact from a query about the provenance of the
    electronic data?

[Diagram labels: Application, Provenance Store, Data, Physical artefact]
15
Architectural Design
16
Service Oriented Architecture
  • Broad definition of service as component that
    takes some inputs and produces some outputs.
  • Services are brought together to solve a given
    problem typically via a workflow definition that
    specifies their composition.
  • Interactions with services take place with
    messages that are constructed according to
    the services' interface specifications.
  • The term actor denotes either a client or a
    service in a SOA.
  • A process is defined as the execution of a workflow.

17
Process Documentation (1)
From these p-assertions, we can derive that M3
was sent by Actor 1 and received by Actor 2 (and
likewise for M4).
If actors are black boxes, these assertions are
not very useful because we do not know
dependencies between messages.
[Diagram: Actor 1, Actor 2; messages M1, M2, M3, M4]
18
Process Documentation (2)
These assertions help identify order of
messages, but not how data were computed.
[Diagram: Actor 1, Actor 2; messages M1, M2, M3, M4]
19
Process Documentation (3)
These assertions help identify how data is
computed, but provide no information about
non-functional characteristics of the
computation (time, resources used, etc).
[Diagram: Actor 1, Actor 2; messages M1, M2, M3, M4]
20
Process Documentation (4)
[Diagram: Actor 1, Actor 2; messages M1, M2, M3, M4]
21
Types of p-assertions (1)
  • Interaction p-assertion is an assertion of the
    contents of a message by an actor that has sent
    or received that message

22
Types of p-assertions (2)
  • Relationship p-assertion is an assertion, made
    by an actor, that describes how the actor
    obtained output data or messages by applying some
    function to some input data or messages.

23
Types of p-assertions (3)
  • Actor state p-assertion: an assertion made by an
    actor about its internal state in the context of
    a specific interaction

Examples: "I used sparc processor"; "I used algorithm
x version x.y.z"
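
The three kinds of p-assertion could be modelled roughly as follows. This is only a sketch: the class names, field names and message identifiers are illustrative assumptions and are not taken from the PASOA data model.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class InteractionPAssertion:
    """An actor asserts the contents of a message it sent or received."""
    asserter: str
    message_id: str
    sender: str
    receiver: str
    content: str


@dataclass(frozen=True)
class RelationshipPAssertion:
    """An actor asserts how it obtained an output from some inputs."""
    asserter: str
    output_id: str
    input_ids: Tuple[str, ...]
    function: str


@dataclass(frozen=True)
class ActorStatePAssertion:
    """An actor asserts its internal state in the context of an interaction."""
    asserter: str
    message_id: str
    state: str


# Hypothetical documentation recorded by one service for one exchange.
documentation = [
    InteractionPAssertion("Service", "request-1", "Client", "Service", "<request/>"),
    RelationshipPAssertion("Service", "response-1", ("request-1",), "algorithm x"),
    ActorStatePAssertion("Service", "request-1", "I used algorithm x version x.y.z"),
]

kinds = [type(p).__name__ for p in documentation]
```

Every p-assertion carries its asserter, which reflects the definition that a p-assertion is always made by an actor.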
24
Data flow
  • Interaction p-assertions allow us to specify a
    flow of data between actors
  • Relationship p-assertions allow us to
    characterise the flow of data inside an actor
  • Overall data flow (internal and external)
    constitutes a directed acyclic graph (DAG)
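
If the data flow is a DAG, the provenance of a data item can be obtained by walking the graph backwards. The sketch below assumes a simplified edge list in which each edge is tagged as an interaction (between actors) or a relationship (inside an actor). The item names and the `provenance_of` helper are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# (kind, source item, derived item): hypothetical flow for one experiment.
flows: List[Tuple[str, str, str]] = [
    ("interaction", "sequence@client", "sequence@encoder"),
    ("relationship", "sequence@encoder", "encoded@encoder"),
    ("interaction", "encoded@encoder", "encoded@compressor"),
    ("relationship", "encoded@compressor", "size@compressor"),
    ("interaction", "unrelated@client", "unrelated@other"),
]


def build_graph(edges: List[Tuple[str, str, str]]) -> Dict[str, Set[str]]:
    """Map each item to the set of items it was directly derived from."""
    derived_from: Dict[str, Set[str]] = {}
    for _kind, source, target in edges:
        derived_from.setdefault(target, set()).add(source)
    return derived_from


def provenance_of(item: str, derived_from: Dict[str, Set[str]]) -> Set[str]:
    """Return every item that the given item transitively depends on."""
    seen: Set[str] = set()
    stack = [item]
    while stack:
        current = stack.pop()
        for parent in derived_from.get(current, set()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


graph = build_graph(flows)
ancestors = provenance_of("size@compressor", graph)
```

Here `ancestors` contains the four upstream items and excludes the unrelated message, which illustrates why scoping matters when querying a large store.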

25
Provenance Modelling
26
Interfaces to Provenance Store
Provenance Store
Query Provenance of Data
Administer Store and its contents
28
Provenance-based reasoning
29
The case (1)
  • Static Validation
  • Operates on workflow source code
  • Programming language static analyses (e.g. type
    inference, escape analysis, etc.)
  • Workflow specific (concurrency analysis,
    graph-based partitioning, model checking, quality
    of service)
  • Workflow script may not be accessible or may be
    expressed in a language not supported by analysis
    tool
  • Dynamic Validation
  • Service based interface matching, runtime type
    checking

30
The case (2)
  • Provenance-based reasoning
  • Allows for validation of experiments after
    execution
  • Third parties, such as reviewers and other
    scientists, may want to verify that the results
    obtained were computed correctly according to
    some criteria.
  • These criteria may not be known when the
    experiment was designed or run.
  • Important because science progresses (and models
    evolve!)

31
Bioinformatics Scenario
  • A biologist has a set of proteins, for each of
    which he/she wishes to determine a particular
    biological property

32
Experiment Design
  • The biologist designs a high-level plan of an
    experiment, describing each activity that must be
    performed
  • Each activity determines new information from
    analysing the information discovered in previous
    steps
  • The steps can be seen as a linked flow of data
    from the protein to the final property

[Diagram: high-level plan]
33
Experiment Services
  • For each activity in the plan, the biologist must
    decide on the concrete service they will perform
  • Each service may be designed by the biologist
    him/herself or adopted from the work of another
    biologist
  • For each service there is a description of that
    service stating:
  • what the service does
  • what type of data it analyses (its inputs) and
  • what type of results it produces (its outputs)
  • All the descriptions are stored in a registry

[Diagram: Services A, B and C; Registry holding "Description of Service A: Function .., Inputs .., Outputs .."]
34
Performing Experiment
  • The biologist now performs the experiment as many
    times as they wish
  • The details of each experimental process are
    documented in a provenance store
  • This is achieved by each service documenting its own
    execution using the recording interface

[Diagram: Services A, B, C, D and E]

35
Provenance Questions
  • Later, the biologist examines the experiment
    results and has questions about the validity of
    the process that produced them
  • Did the services I used actually fulfil my high
    level plan?
  • Two of the experiments were performed on the same
    protein but have different results: did I alter
    the services between these experiments?
  • Did I perform each service on the type of data
    that the service was intended to analyse, i.e.
    were the inputs and outputs of each activity
    compatible?

36
Answering the Questions
  • Using the documentation in the Provenance Store,
    we can reconstruct the process that led to each
    result
  • Along with the high level plan and the
    descriptions in the registry we have all the
    information required to answer the questions

37
Q1: Did the experiment follow the plan?
  • Retrieve documentation of experiment that led to
    a result
  • Retrieve procedure descriptions
  • Compare procedure function to planned activity

[Diagram: Description of Service A (Function .., Inputs .., Outputs ..)]
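
The Q1 comparison could be sketched as below. The plan, the registry entries and the executed service list are invented for illustration. In practice the executed list would be reconstructed from the provenance store, and the `plan_mismatches` helper is hypothetical.

```python
from itertools import zip_longest
from typing import Dict, List, Optional, Tuple

Mismatch = Tuple[int, Optional[str], Optional[str], Optional[str]]


def plan_mismatches(plan: List[str],
                    executed: List[str],
                    registry: Dict[str, Dict[str, str]]) -> List[Mismatch]:
    """Compare each executed service's registered function to the planned activity."""
    mismatches: List[Mismatch] = []
    for step, (activity, service) in enumerate(zip_longest(plan, executed), start=1):
        # A missing or unregistered service yields function None.
        function = registry.get(service, {}).get("function") if service else None
        if function != activity:
            mismatches.append((step, activity, service, function))
    return mismatches


# Hypothetical high-level plan and registry descriptions.
plan = ["encode", "compress", "measure size"]
registry = {
    "Service A": {"function": "encode"},
    "Service B": {"function": "shuffle"},
    "Service C": {"function": "measure size"},
}
executed = ["Service A", "Service B", "Service C"]

problems = plan_mismatches(plan, executed, registry)
```

An empty result means every documented step matched the plan. Here step 2 is reported because the registered function of Service B is not the planned activity.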
38
Q2: Do services differ between experiments?
  • Retrieve documentation of experiments
  • Highlight differences in services between
    experiments

[Diagram: two versions of Service A, one from each experiment]
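
A minimal sketch of the Q2 comparison, assuming that the documentation of each experiment can be reduced to a mapping from activity to the service (and a version label) that performed it. The version labels and the `service_differences` helper are assumptions made for the example.

```python
from typing import Dict, Optional, Tuple

Service = Tuple[str, str]  # (service name, version label)


def service_differences(
    run_a: Dict[str, Service],
    run_b: Dict[str, Service],
) -> Dict[str, Tuple[Optional[Service], Optional[Service]]]:
    """Return the activities whose service differs between two experiments."""
    differences = {}
    for activity in sorted(set(run_a) | set(run_b)):
        a, b = run_a.get(activity), run_b.get(activity)
        if a != b:
            differences[activity] = (a, b)
    return differences


# Hypothetical documentation retrieved for two runs on the same protein.
run_1 = {"encode": ("Service A", "1.0"), "compress": ("Service B", "1.0")}
run_2 = {"encode": ("Service A", "1.0"), "compress": ("Service B", "1.1")}

changed = service_differences(run_1, run_2)
```

Only the `compress` activity is reported, which points the biologist at the service that changed between the two runs.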
39
Q3: Were the inputs and outputs compatible?
  • Retrieve descriptions for each service
  • Retrieve each pair of services performed in an
    experiment, where one service's output is the
    other's input
  • Compare the output type of the first service with
    the input type of the second

[Diagram: Services A and B; Description of Service A (Function .., Inputs .., Outputs ..); Description of Service B (Function .., Inputs .., Outputs ..)]
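
The Q3 check can be sketched as an exact comparison of declared types. The registry contents, type names and the `incompatible_pairs` helper below are hypothetical.

```python
from typing import Dict, List, Tuple


def incompatible_pairs(
    pairs: List[Tuple[str, str]],
    registry: Dict[str, Dict[str, str]],
) -> List[Tuple[str, str, str, str]]:
    """List (producer, consumer, output type, input type) where types differ."""
    problems = []
    for producer, consumer in pairs:
        output_type = registry[producer]["outputs"]
        input_type = registry[consumer]["inputs"]
        if output_type != input_type:
            problems.append((producer, consumer, output_type, input_type))
    return problems


# Hypothetical registry descriptions.
registry = {
    "Service A": {"inputs": "Protein", "outputs": "Encoded Sequence"},
    "Service B": {"inputs": "Encoded Sequence", "outputs": "Compressed Sequence"},
    "Service C": {"inputs": "Sample Size", "outputs": "Compressibility"},
}
# Pairs found in the process documentation: the first service's output
# was the second service's input.
pairs = [("Service A", "Service B"), ("Service B", "Service C")]

mismatched = incompatible_pairs(pairs, registry)
```

Exact string comparison is the simplest possible test. It is deliberately strict, so it reports any pair whose declared types are not literally identical.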
40
Ontological Reasoning
  • In some cases, the high-level activity may be
    described in a more general way than the service
    which performs it
  • Also, one service's input may be a generalisation
    of the preceding service's output
  • Therefore, exact matching of types may produce a
    false negative: the biologist will wrongly be
    told the experiment was invalid
  • By using an ontology, describing how types are
    related, we can reason about types and determine
    whether they are truly compatible

Ontology example: Protein is a generalisation of Human Protein
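
A sketch of how a generalisation hierarchy avoids the false negative described above. The ontology is reduced to a child-to-parent mapping containing only the Protein / Human Protein example. A real ontology would typically be expressed in a dedicated language and queried with a reasoner, which this toy code does not attempt. The function names are hypothetical.

```python
from typing import Dict, Set

# Child type -> more general parent type (single inheritance assumed).
PARENT_OF: Dict[str, str] = {"Human Protein": "Protein"}


def is_generalisation_of(general: str, specific: str,
                         parent_of: Dict[str, str]) -> bool:
    """True if `general` equals `specific` or is one of its ancestors."""
    seen: Set[str] = set()
    current = specific
    while current is not None and current not in seen:
        if current == general:
            return True
        seen.add(current)  # guards against accidental cycles
        current = parent_of.get(current)
    return False


def compatible(output_type: str, input_type: str,
               parent_of: Dict[str, str]) -> bool:
    """An output can feed an input whose type is the same or more general."""
    return is_generalisation_of(input_type, output_type, parent_of)


exact_match = "Human Protein" == "Protein"
with_ontology = compatible("Human Protein", "Protein", PARENT_OF)
reverse = compatible("Protein", "Human Protein", PARENT_OF)
```

Exact matching rejects the pair, while the ontology-aware check accepts it. The reverse direction is still rejected, because a service expecting a Human Protein cannot safely accept an arbitrary Protein.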
41
P-assertions Recording
  • Recording patterns
  • Protocol: PReP
  • Properties
  • Implementation: PReServ
  • Performance

42
Separate Store Pattern (service)
43
Separate Store Pattern (client)
44
Context Passing Pattern
[Diagram labels: Client, Service, context, Record P-Assertions (twice), Provenance Store (twice)]
45
Shared Store Pattern
46
Repeated Application of Patterns
[Diagram labels: Client Initiator, Workflow Enactment Engine, Service 1, Provenance Store 1, Provenance Store 2, Provenance Store 3]
47
PReP: P-Assertion Recording Protocol
  • Formalisation as abstract machine
  • Properties
  • Termination
  • Liveness
  • Safety
  • Stateless

48
PReServ (Groth et al. 04)
  • Implementation of PReP protocol and Query
    Interface
  • Provenance store implemented as a Web Service
  • Client side libraries for using Provenance Store
  • Axis Handler for automatically recording
    communication between Axis-based Web Services

49
PReServ Implementation Diagram
50
Evaluation in a Bioinformatics Application
  • Bioinformatics workflow studying compressibility
    of biological sequences
  • Implemented as a VDT workflow, scheduled by
    Condor
  • Each service, script, command records
    p-assertions

HPDC05
51
Bioinformatics Application (2)
  • Recording Scalability
  • Querying Scalability

52
Query API
53
Structure of Documentation
  • The documentation of processes recorded by actors
    can be categorised into a hierarchy

All documentation
    Message exchange
    Message exchange
        Message sender's view
        Message receiver's view
            Message content
            State of actor during exchange
            Relationships
54
Query Interface (Miles et al. 05)
  • Purpose
  • Obtain the provenance of some specific data
  • Allow for navigation of the data structure
    representing provenance
  • Abstract interface
  • Allows us to view the provenance store as if
    containing XML data structures
  • Independent of technology used for running
    application and internal store representation
  • Seamless navigation of application dependent and
    application independent provenance representation

55
XML Query Languages
  • Two existing query languages provide ways of
    navigating hierarchical data: XPath and XQuery
  • For instance, we can use XPath to refer to:
  • The message exchange with ID 345
  • The client's view of that exchange
  • The body of the message exchanged
  • //messageExchange[id=345]/clientView/messageContent
56
Navigating Message Content
  • If message content is in XML format, or can be
    mapped to it, then XPath and XQuery can be used
    to navigate into the message content
  • For example, we can add application-specific
    navigation to the previous XPath
  • The SOAP envelope that encloses the message
  • The body of the message within the envelope
  • The customer name within the body

//messageExchange[id=345]/clientView/messageContent/soap:envelope/soap:body//customerName
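
The navigation above can be tried out with Python's standard `xml.etree.ElementTree`, which supports a subset of XPath. The sample document is invented for illustration: the element names follow the slide (including the lowercase `soap:envelope` and `soap:body`), and treating `id` as a child element is an assumption. It is not the actual PReServ store schema.

```python
import xml.etree.ElementTree as ET

SOAP_NS = {"soap": "http://schemas.xmlsoap.org/soap/envelope/"}

# Hypothetical XML view of a provenance store with two message exchanges.
STORE_XML = """
<provenanceStore xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <messageExchange>
    <id>345</id>
    <clientView>
      <messageContent>
        <soap:envelope>
          <soap:body>
            <order><customerName>Alice</customerName></order>
          </soap:body>
        </soap:envelope>
      </messageContent>
    </clientView>
  </messageExchange>
  <messageExchange>
    <id>346</id>
    <clientView><messageContent/></clientView>
  </messageExchange>
</provenanceStore>
""".strip()

root = ET.fromstring(STORE_XML)

# Application-independent part: select the client's view of exchange 345.
# ElementTree requires the predicate value to be quoted.
content = root.find(".//messageExchange[id='345']/clientView/messageContent")

# Application-dependent part: navigate into the message itself.
name = content.find("soap:envelope/soap:body//customerName", SOAP_NS)
customer = name.text
```

Splitting the path in two mirrors the distinction between application-independent and application-dependent navigation: the first query only relies on the structure of the documentation, the second on the format of the message.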
57
Other Query Requirements
  • Execution Filtering: include/exclude all
    p-assertions that are marked as part of an
    execution by a single actor.
  • Functionality Filtering: include/exclude
    p-assertions that have one of a given set of
    operation types.
  • Process Filtering: include/exclude p-assertions
    that belong to a given (set of) process(es).
58
Conclusions
59
Conclusions
  • Mostly unexplored area that is crucial to develop
    trusted systems
  • Definition of provenance
  • Specification of provenance representation
  • Recording protocol
  • Querying interfaces
  • Reasoning based on provenance
  • Presents lots of opportunities

60
Conclusions
  • Current work
  • System and protocol design, architecture
    specification, generic support for use cases
  • Pursue the deployment in concrete applications and
    performance evaluation
  • Work towards a standardisation proposal
  • Download our software from www.pasoa.org
  • Tell us about your use cases: we are keen to find
    new collaborations in this space!

61
Publications
  1. Sylvia C. Wong, Simon Miles, Weijian Fang, Paul
    Groth and Luc Moreau. Provenance-based Validation
    of E-Science Experiments. In Proceedings of the
    International Semantic Web Conference (ISWC05),
    Nov 2005.
  2. Paul Groth, Simon Miles, Weijian Fang, Sylvia C.
    Wong, Klaus-Peter Zauner, and Luc Moreau.
    Recording and Using Provenance in a Protein
    Compressibility Experiment. In Proceedings of the
    14th IEEE International Symposium on High
    Performance Distributed Computing (HPDC'05), July
    2005.
  3. Paul T. Groth. Recording Provenance in
    Service-Oriented Architectures. 9 Month Report,
    University of Southampton Faculty of
    Engineering, Science and Mathematics School of
    Electronics and Computer Science, 2004.
  4. Paul Groth, Michael Luck, and Luc Moreau. A
    protocol for recording provenance in
    service-oriented Grids. In Proceedings of the 8th
    International Conference on Principles of
    Distributed Systems (OPODIS'04), Grenoble,
    France, December 2004.
  5. Paul Groth, Michael Luck, and Luc Moreau.
    Formalising a protocol for recording provenance
    in Grids. In Proceedings of the UK OST e-Science
    second All Hands Meeting 2004 (AHM'04),
    Nottingham, UK, September 2004.
  6. Simon Miles, Paul Groth, Miguel Branco, and Luc
    Moreau. The requirements of recording and using
    provenance in e-Science experiments. Technical
    report, University of Southampton, 2005.
  7. Paul Townend, Paul Groth, and Jie Xu. A
    Provenance-Aware Weighted Fault Tolerance Scheme
    for Service-Based Applications. In Proc. of the
    8th IEEE International Symposium on
    Object-oriented Real-time distributed Computing
    (ISORC 2005), May 2005.