Title: Provenance: an open approach to experiment validation in eScience
1Provenance: an open approach to experiment validation in e-Science
- Professor Luc Moreau
- L.Moreau_at_ecs.soton.ac.uk
- University of Southampton
- www.ecs.soton.ac.uk/lavm
2Provenance PASOA Teams
- University of Southampton
- Luc Moreau, Paul Groth, Simon Miles, Victor Tan, Miguel Branco, Sofia Tsasakou, Sheng Jiang, Steve Munroe, Zheng Chen
- IBM UK (EU Project Coordinator)
- John Ibbotson, Neil Hardman, Alexis Biller
- University of Wales, Cardiff
- Omer Rana, Arnaud Contes, Vikas Deora, Ian Wootten
- Universitat Politècnica de Catalunya (UPC)
- Steven Willmott, Javier Vazquez
- SZTAKI
- Laszlo Varga, Arpad Andics
- German Aerospace
- Andreas Schreiber, Guy Kloss, Frank Danneman
3Contents
- Motivation
- Provenance Concepts
- Provenance Architecture
- Standardisation
- Provenance Queries
- Conclusions
4Motivation
5Scientific Research
Academic Peer Review
6Business Regulations
Accounting
Banking
7Accounting
Audit (Sarbanes-Oxley)
8Banking
Audit (Basel II)
9Health Care Management
European Recommendation R(97)5 on the protection
of medical data
10e-Science datasets
- How to undertake peer-reviewing and validation of
e-Scientific results?
11Sarbanes-Oxley
- The American Competitiveness and Corporate Accountability Act of 2002, commonly known as the Sarbanes-Oxley Act, was signed into law on July 30, 2002.
- The law is intended to protect investors by improving the accuracy and reliability of corporate disclosures.
- Sarbanes-Oxley also defines a higher level of responsibility, accountability, and financial reporting transparency; these changes are also intended to restore investor confidence.
12Food and Drug Administration
13Basel II
14Compliance with Regulations
- The next-compliance problem
- Can we be certain that, by ensuring compliance with a new regulation, we do not break compliance with previous ones?
15Current Solutions
- Proprietary, Monolithic
- Silos, Closed
- Do not inter-operate with other applications
- Not adaptable to new regulations
16Provenance
- Oxford English Dictionary
- the fact of coming from some particular source or quarter; origin, derivation
- the history or pedigree of a work of art, manuscript, rare book, etc.
- concretely, a record of the ultimate derivation and passage of an item through its various owners.
- Concept vs representation
17Provenance in Computer Systems
- Our definition of provenance in the context of applications for which process matters to end users:
- The provenance of a piece of data is the process that led to that piece of data
- Our aim is to conceive a computer-based representation of provenance that allows us to perform useful analysis and reasoning to support our use cases
18Our Approach
- Define core concepts pertaining to provenance
- Specify functionality required to become provenance-aware
- Define open data models and protocols that allow systems to inter-operate
- Standardise data models and protocols
- Provide a reference implementation
- Provide reasoning capability
19Context (1)
- Aerospace engineering: maintain a historical record of design processes for up to 99 years.
- Organ transplant management: tracking of previous decisions, crucial to maximise the efficiency in matching and the recovery rate of patients.
20Context (2)
- Bioinformatics: verification and auditing of experiments (e.g. for drug approval).
- High Energy Physics: tracking, analysing and verifying data sets in the ATLAS Experiment of the Large Hadron Collider (CERN).
21Provenance Concepts
22Provenance Lifecycle
Core Interfaces to Provenance Store
Provenance Store
Query and Reason over Provenance of Data
Administer Store and its contents
23Nature of Documentation
- We represent the provenance of some data by documenting the process that led to the data
- documentation can be complete or partial
- it can be accurate or inaccurate
- it can present conflicting or consensual views of the actors involved
- it can provide operational details of execution or it can be abstract.
24p-assertion
- A given element of process documentation will be referred to as a p-assertion
- A p-assertion is an assertion that is made by an actor and pertains to a process.
25Service Oriented Architecture
- Broad definition of a service as a component that takes some inputs and produces some outputs.
- Services are brought together to solve a given problem, typically via a workflow definition that specifies their composition.
- Interactions with services take place with messages that are constructed according to the services' interface specifications.
- The term actor denotes either a client or a service in a SOA.
- A process is defined as the execution of a workflow.
26Process Documentation (1)
(Diagram: Actor 1 and Actor 2 exchanging messages M1, M2, M3 and M4)
From these p-assertions, we can derive that M3 was sent by Actor 1 and received by Actor 2 (and likewise for M4).
If actors are black boxes, these assertions are not very useful because we do not know the dependencies between messages.
27Process Documentation (2)
(Diagram: Actor 1 and Actor 2 exchanging messages M1, M2, M3 and M4)
These assertions help identify the order of messages, but not how data was computed.
28Process Documentation (3)
(Diagram: Actor 1 and Actor 2 exchanging messages M1, M2, M3 and M4)
These assertions help identify how data is computed, but provide no information about non-functional characteristics of the computation (time, resources used, etc.).
29Process Documentation (4)
(Diagram: Actor 1 and Actor 2 exchanging messages M1, M2, M3 and M4)
30Types of p-assertions (1)
- An interaction p-assertion is an assertion of the contents of a message by an actor that has sent or received that message.
31Types of p-assertions (2)
- A relationship p-assertion is an assertion, made by an actor, that describes how the actor obtained output data, or the whole message sent in an interaction, by applying some function to input data or messages from other interactions.
32Types of p-assertions (3)
- An actor state p-assertion is an assertion made by an actor about its internal state in the context of a specific interaction.
- Examples: "I used a SPARC processor"; "I used algorithm x version x.y.z".
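The three kinds of p-assertion can be pictured as three small record types. The following is a minimal Python sketch, not the PASOA schema: the class names, the field names and the `exchanged` helper are assumptions made for illustration, and the interactions `req` and `resp` are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class InteractionPAssertion:
    asserter: str      # actor making the assertion
    interaction: str   # identifier of the message exchange
    view: str          # "sender" or "receiver"
    content: str       # asserted contents of the message


@dataclass(frozen=True)
class RelationshipPAssertion:
    asserter: str
    interaction: str            # interaction carrying the output data
    relation: str               # name of the function applied
    inputs: Tuple[str, ...]     # interactions that supplied the inputs


@dataclass(frozen=True)
class ActorStatePAssertion:
    asserter: str
    interaction: str
    state: str                  # free-form description of internal state


def exchanged(assertions, interaction):
    """Return (sender, receiver) if both views of an interaction are documented."""
    views = {
        a.view: a.asserter
        for a in assertions
        if isinstance(a, InteractionPAssertion) and a.interaction == interaction
    }
    if {"sender", "receiver"} <= set(views):
        return views["sender"], views["receiver"]
    return None


docs = [
    InteractionPAssertion("Actor 1", "req", "sender", "payload"),
    InteractionPAssertion("Actor 2", "req", "receiver", "payload"),
    RelationshipPAssertion("Actor 2", "resp", "functionOf", ("req",)),
    ActorStatePAssertion("Actor 2", "resp", "algorithm x version x.y.z"),
]
print(exchanged(docs, "req"))
```

Only `req` has both a sender view and a receiver view here, so it is the only interaction for which the helper can name both parties.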
33Data flow
- Interaction p-assertions allow us to specify a flow of data between actors
- Relationship p-assertions allow us to characterise the flow of data inside an actor
- The overall data flow (internal and external) constitutes a directed acyclic graph (DAG), which characterises the process that led to a result
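As a rough illustration of how the two kinds of flow combine into one DAG, the sketch below merges hypothetical edges of both kinds and orders them with the standard-library `graphlib` module (Python 3.9 or later). The item names and the helper functions are assumptions, not part of the architecture.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical flows. Each pair is (derived item, item it came from).
BETWEEN_ACTORS = [("input@B", "input@A")]    # from interaction p-assertions
INSIDE_ACTORS = [("result@B", "input@B")]    # from relationship p-assertions


def build_graph(*edge_lists):
    """Map every item to the set of items it was derived from."""
    graph = {}
    for edges in edge_lists:
        for derived, source in edges:
            graph.setdefault(derived, set()).add(source)
            graph.setdefault(source, set())
    return graph


def process_order(graph):
    """Order items so that sources come first; raises CycleError if not a DAG."""
    return list(TopologicalSorter(graph).static_order())


graph = build_graph(BETWEEN_ACTORS, INSIDE_ACTORS)
order = process_order(graph)
print(order)
```

If the documentation ever described a cycle, `process_order` would raise `CycleError`, which is one simple way to detect inconsistent assertions.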
34Provenance Architecture
35Interfaces to Provenance Store
Provenance Store
Query and Reason over Provenance Of Data
Administer Store and its contents
36(No Transcript)
37P-Assertion schemas
38The p-structure (1)
- The p-structure is a common logical structure of the provenance store shared by all asserting and querying actors
- Hierarchical
- Indexed by Interactions
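One way to picture a hierarchical structure indexed by interactions is a nested mapping. The sketch below assumes two levels, interaction key and then sender or receiver view; the class and method names are hypothetical and the real p-structure may well be organised differently.

```python
from collections import defaultdict


class PStructure:
    """Toy index: interaction key -> view -> list of p-assertions."""

    def __init__(self):
        self._records = defaultdict(lambda: defaultdict(list))

    def record(self, interaction_key, view, p_assertion):
        """File a p-assertion under its interaction and view."""
        self._records[interaction_key][view].append(p_assertion)

    def lookup(self, interaction_key, view=None):
        """Return the p-assertions for an interaction, optionally for one view."""
        record = self._records.get(interaction_key, {})
        if view is None:
            return [p for assertions in record.values() for p in assertions]
        return list(record.get(view, []))


store = PStructure()
store.record("M1", "sender", "Actor 1 sent M1")
store.record("M1", "receiver", "Actor 2 received M1")
print(store.lookup("M1"))
print(store.lookup("M1", "receiver"))
```

Because every asserting and querying actor addresses the same keys, neither side needs to know how the store lays the data out internally.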
39The p-structure (2)
40Recording Protocol (Groth04-06)
- Abstract machines
- DS Properties
- Termination
- Liveness
- Safety
- Statelessness
- Documentation Properties
- Immutability
- Attribution
- Datatype safety
- Foundation for adding necessary cryptographic
techniques
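The slide does not define the documentation properties, so the following sketch rests on an interpretation: immutability is read as "a recorded p-assertion is never overwritten" and attribution as "every p-assertion names the actor that made it". The `AppendOnlyStore` class and its identifiers are hypothetical and are not the recording protocol itself.

```python
class AppendOnlyStore:
    """Toy store that refuses anonymous or overwriting submissions."""

    def __init__(self):
        self._items = {}

    def record(self, asserter, interaction_key, local_id, content):
        if not asserter:
            # Attribution (as interpreted here): an asserter is mandatory.
            raise ValueError("p-assertion must name its asserter")
        key = (asserter, interaction_key, local_id)
        if key in self._items:
            # Immutability (as interpreted here): no overwriting.
            raise KeyError("p-assertion already recorded")
        self._items[key] = content

    def get(self, asserter, interaction_key, local_id):
        return self._items[(asserter, interaction_key, local_id)]


log = AppendOnlyStore()
log.record("Actor 1", "M1", 1, "sent M1")
print(log.get("Actor 1", "M1", 1))
```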
41Querying Functionality (Miles06)
- The Process Documentation Query Interface allows for navigation of the documentation of execution
- Allows us to view the provenance store (i.e. the p-structure) as if containing XML data structures
- Independent of the technology used for running the application and of the internal store representation
- Seamless navigation of application-dependent and application-independent provenance representation
42Querying Functionality (Miles06)
- The Provenance Query Interface allows us to obtain the provenance of some specific data
- A recognition that there is not one provenance for a piece of data, but there may be several, depending on the end user's interest
- Hence, provenance is seen as a query
- Identify a piece of data
- Scope of the process of interest
- Filter in/out p-assertions according to actors, process, types of relationships, etc.
43Available Software
- PReServ (Paul Groth and Simon Miles)
- Offers recording and querying interfaces
- Available from www.pasoa.org
- An OGSA-DAI based version will soon be available from www.gridprovenance.org
- Is being used in a bioinformatics application (cf. hpdc05, iswc05)
44Standardisation
45Standardisation Options
46Purpose of Standardisation
(Diagram: Application, Application, Provenance Stores)
Allow for multiple applications to document their execution. Applications may be running in different institutions.
47Purpose of Standardisation
(Diagram: Application, three Provenance Stores)
Allow for multiple stores from multiple IT providers
48Purpose of Standardisation
(Diagram: two Provenance Stores, Query Provenance Of Data)
Allow for multiple stores from multiple IT providers
49Purpose of Standardisation
Convert in standard data format
Allow for legacy, monolithic applications to
expose their contents (according to standard
schema)
50Purpose of Standardisation
(Diagram: Application)
Allow third parties to host provenance stores, which are trusted by application owners and also by auditors
51Compliance Oriented Architectures
(Diagram: Application, Provenance Store, Query Provenance Of Data)
52Compliance Oriented Architectures
- Separate execution documentation from compliance verification
- Allows for multiple compliance verifications
- Allows for validation to take place across multiple applications, possibly run by different institutions (in particular, allows for outsourcing and subcontracting).
- Approach is suitable for e-scientific peer-reviewing and business compliance verification
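The separation can be sketched as documentation recorded once and then examined by any number of independent checks. Everything in the Python sketch below is hypothetical: the record layout, the two rules and the `verify` helper are illustrations, not part of the architecture specification.

```python
from typing import Callable, Dict, List

# Hypothetical process documentation, recorded once by the applications.
DOCUMENTATION: List[dict] = [
    {"asserter": "Averager", "interaction": 2, "relation": "sumOf"},
    {"asserter": "Divider", "interaction": 3, "relation": "divisionOf"},
]


def every_assertion_attributed(docs: List[dict]) -> bool:
    """Hypothetical rule 1: each p-assertion names the actor that made it."""
    return all(d.get("asserter") for d in docs)


def only_approved_actors(docs: List[dict]) -> bool:
    """Hypothetical rule 2: only approved services took part."""
    approved = {"Averager", "Divider"}
    return all(d.get("asserter") in approved for d in docs)


def verify(docs: List[dict],
           checks: Dict[str, Callable[[List[dict]], bool]]) -> Dict[str, bool]:
    """Run each compliance check independently over the same documentation."""
    return {name: check(docs) for name, check in checks.items()}


report = verify(DOCUMENTATION, {
    "attribution": every_assertion_attributed,
    "approved-actors": only_approved_actors,
})
print(report)
```

Adding a check for a new regulation means adding one entry to the dictionary; the recorded documentation and the existing checks are left untouched, which is the point of keeping verification separate.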
53Provenance Queries (Miles06)
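54Example Application
(Diagram: GUI, Averager, Divider and Store exchanging the messages below)
1. average (7, 5): GUI to Averager
2. divide (12, 2): Averager to Divider
3. answer (6): Divider to Averager
4. answer (6): Averager to GUI
5. store (6, file1): GUI to Store
Averager(in1, in2) returns (in1 + in2)/2. Averager delegates the division operation to the service Divider.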
55Example Application
(Diagram as on the previous slide: GUI, Averager, Divider and Store, with messages 1. average (7, 5), 2. divide (12, 2), 3. answer (6) and 5. store (6, file1))
- Relationships
- 12 in msg 2 is sum of 7, 5 in msg 1
- 6 in msg 3 is division of 12, 2 in msg 2
- 6 in msg 4 is copy of 6 in msg 3
- 6 in msg 4 is average of 7, 5 in msg 1
- 6 in msg 5 is copy of 6 in msg 4
- Tracers
- are used to demarcate activities (aka sets of services)
- added by Averager in call to Divider
- returned by Divider in response
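The relationships listed on this slide are enough to trace the stored value back to the original inputs. The Python sketch below encodes them as data and walks them backwards; the tuple encoding and the `provenance` function are assumptions made for illustration, and the stored 6 is taken to be in message 5, the store request.

```python
# A data item is identified here as (value, message number).
RELATIONSHIPS = [
    ((12, 2), "sumOf", [(7, 1), (5, 1)]),
    ((6, 3), "divisionOf", [(12, 2), (2, 2)]),
    ((6, 4), "copyOf", [(6, 3)]),
    ((6, 4), "averageOf", [(7, 1), (5, 1)]),
    ((6, 5), "copyOf", [(6, 4)]),
]


def provenance(item, relationships=RELATIONSHIPS):
    """Collect every relationship reachable backwards from the given item."""
    found, stack, seen = [], [item], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        for target, relation, sources in relationships:
            if target == current:
                found.append((target, relation, tuple(sources)))
                stack.extend(sources)
    return found


trace = provenance((6, 5))
for target, relation, sources in trace:
    print(target, relation, sources)
```

Starting from the 6 in the store request, the walk reaches all five relationships, ending at the 7 and 5 supplied by the GUI.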
56Identifying what to Find the Provenance of
- Identify the event where the entity is documented
- In this case, the event is the receipt of a request to store the data in the file named file1
- Identify the data entity within that message
- In this case, the data of interest is the 6 stored in file1
(Diagram: Store, file1)
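57Provenance Graph
(Diagram: provenance graph of the 6 received by Store from GUI. That 6 is a "Copy of" the 6 sent by Averager to GUI, which is both a "Copy of" the 6 sent by Divider to Averager and the "Average of" the 7 and 5 sent by GUI to Averager. The 6 from Divider is the "Division of" 12 (Dividend) and 2 (Divisor), both sent by Averager to Divider, and 12 is the "Sum of" 7 and 5.)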
58Scoped Provenance Graph
(Diagram: the provenance graph of the previous slide, with the "Average of" relationships filtered out)
Allows us to take a given perspective on the provenance of a piece of data, e.g. looking at the restorations of a painting rather than its various owners.
Filter to exclude Average of relationships
59Scoped Provenance Graph
(Diagram: the same provenance graph, with the messages carrying the tracer filtered out)
Allows us to consider a given service (and all its nested invocations) as a black box, e.g. no detail should be provided about the internals of Averager.
Filter to exclude messages containing the tracer. This is equivalent to hiding the internal operation of Averager.
60Scoped Provenance Graph
(Diagram: the same provenance graph, with the Divisor parameter filtered out)
Allows us to scope the provenance graph according to types of data.
Filter to exclude Divisor parameters
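The three scopes shown on these slides can be approximated as predicates applied while walking the graph. In the Python sketch below the edge encoding, the `TRACED_MESSAGES` set (messages 2 and 3, assumed to carry the tracer added by Averager) and the function names are all illustrative assumptions.

```python
# Each edge: (target item, relation, source item, role of the source).
# An item is (value, message number).
EDGES = [
    ((6, 5), "copyOf", (6, 4), None),
    ((6, 4), "copyOf", (6, 3), None),
    ((6, 4), "averageOf", (7, 1), None),
    ((6, 4), "averageOf", (5, 1), None),
    ((6, 3), "divisionOf", (12, 2), "dividend"),
    ((6, 3), "divisionOf", (2, 2), "divisor"),
    ((12, 2), "sumOf", (7, 1), None),
    ((12, 2), "sumOf", (5, 1), None),
]
TRACED_MESSAGES = {2, 3}  # assumed: the call to Divider and its response


def scoped_provenance(item, keep):
    """Walk backwards from item, following only edges accepted by keep."""
    result, stack, seen = [], [item], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        for edge in EDGES:
            if edge[0] == current and keep(edge):
                result.append(edge)
                stack.append(edge[2])
    return result


def unscoped(edge):
    return True


def no_average(edge):
    return edge[1] != "averageOf"


def no_tracer(edge):
    return edge[2][1] not in TRACED_MESSAGES


def no_divisor(edge):
    return edge[3] != "divisor"


for scope in (unscoped, no_average, no_tracer, no_divisor):
    print(scope.__name__, len(scoped_provenance((6, 5), scope)))
```

With the tracer filter, the walk never enters messages 2 and 3, so the stored 6 is explained only as the average of 7 and 5: Averager is treated as a black box.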
61Provenance Query
62Practically
- Data Identification (here Event)
- //ps:interactionRecord[ps:interactionKey/ps:messageSink/wsa:EndpointReference/wsa:Address = "http://www.example.com/store"]
- Unscoped query
- /
- Exclude averageOf relation
- /pq:relationshipTarget[ps:relation != "http://www.example.com#averageOf"]
- Exclude tracer introduced by Averager
- /pq:relationshipTarget/ps:interactionPAssertion[not(ex:envelope/ph:header/ph:interactionMetaData/ph:tracer = "process://sub/1")]
63Conclusions
64To Sum Up
Standardising the documentation of Business Processes
(Diagram: sectors shown are Finance, Distribution, Aerospace, Healthcare, Automobile and Pharmaceutical)
Query
- Compliance check
- Rerun/Reproduce
- Analyse
Slide from John Ibbotson
65Conclusions
- Crucial topic for many applications
- Full architectural specification
- An implementation available for download
- Methodology to make applications provenance-aware
- www.pasoa.org
- www.gridprovenance.org
66www.ipaw.info/ipaw06
67Publications
- Paul Groth, Simon Miles, Weijian Fang, Sylvia C. Wong, Klaus-Peter Zauner, and Luc Moreau. Recording and Using Provenance in a Protein Compressibility Experiment. In Proceedings of the 14th IEEE International Symposium on High Performance Distributed Computing (HPDC'05), July 2005.
- Paul Groth, Michael Luck, and Luc Moreau. A protocol for recording provenance in service-oriented Grids. In Proceedings of the 8th International Conference on Principles of Distributed Systems (OPODIS'04), Grenoble, France, December 2004.
- Paul Groth, Michael Luck, and Luc Moreau. Formalising a protocol for recording provenance in Grids. In Proceedings of the UK OST e-Science second All Hands Meeting 2004 (AHM'04), Nottingham, UK, September 2004.
- Simon Miles, Paul Groth, Miguel Branco, and Luc Moreau. The requirements of recording and using provenance in e-Science experiments. Technical report, University of Southampton, 2005.
- Luc Moreau, Syd Chapman, Andreas Schreiber, Rolf Hempel, Omer Rana, Laszlo Varga, Ulises Cortes, and Steven Willmott. Provenance-based Trust for Grid Computing: Position Paper. 2003.
- Paul Townend, Paul Groth, and Jie Xu. A Provenance-Aware Weighted Fault Tolerance Scheme for Service-Based Applications. In Proc. of the 8th IEEE International Symposium on Object-oriented Real-time distributed Computing (ISORC 2005), May 2005.
- Paul Groth, Simon Miles, Victor Tan, and Luc Moreau. Architecture for Provenance Systems. Technical report, University of Southampton, October 2005.
68Questions
69(No Transcript)
70OTM Application