Title: Provenance: overview
1Provenance overview
- Professor Luc Moreau
- L.Moreau_at_ecs.soton.ac.uk
- University of Southampton
- www.ecs.soton.ac.uk/lavm
2Provenance PASOA Teams
- University of Southampton
- Luc Moreau, Paul Groth, Simon Miles, Victor Tan, Miguel Branco, Sofia Tsasakou, Sheng Jiang, Steve Munroe, Zheng Chen
- IBM UK (EU Project Coordinator)
- John Ibbotson, Neil Hardman, Alexis Biller
- University of Wales, Cardiff
- Omer Rana, Arnaud Contes, Vikas Deora, Ian Wootten, Shrija Rajbhandari
- Universitat Politècnica de Catalunya (UPC)
- Steven Willmott, Javier Vazquez
- SZTAKI
- Laszlo Varga, Arpad Andics, Tamas Kifor
- German Aerospace
- Andreas Schreiber, Guy Kloss, Frank Danneman
3Contents
- Motivation
- Provenance Concepts
- Provenance Architecture
- Standardisation
- Conclusions
4Motivation
5Scientific Research
Academic Peer Review
6Business Regulations
Accounting
Banking
7Health Care Management
European Recommendation R(97)5 on the protection
of medical data
8e-Science datasets
- How to undertake peer-reviewing and validation of
e-Scientific results?
9Compliance to Regulations
- The next-compliance problem
- Can we be certain that, by ensuring compliance with a new regulation, we do not break previous compliance?
10Current Solutions
- Proprietary, Monolithic
- Silos, Closed
- Do not inter-operate with other applications
- Not adaptable to new regulations
11Provenance
- Oxford English Dictionary
- the fact of coming from some particular source or quarter; origin, derivation
- the history or pedigree of a work of art, manuscript, rare book, etc.
- concretely, a record of the passage of an item through its various owners.
- Concept vs representation
12Provenance in Computer Systems
- Our definition of provenance in the context of applications for which process matters to end users
- The provenance of a piece of data is the process that led to that piece of data
- Our aim is to conceive a computer-based representation of provenance that allows us to perform useful analysis and reasoning to support our use cases
13Our Approach
- Define core concepts pertaining to provenance
- Specify functionality required to become provenance-aware
- Define open data models and protocols that allow systems to inter-operate
- Standardise data models and protocols
- Provide a reference implementation
- Provide reasoning capability
14Context (1)
- Aerospace engineering: maintain a historical record of design processes, for up to 99 years.
- Organ transplant management: tracking of previous decisions, crucial to maximise the efficiency in matching and the recovery rate of patients
15Context (2)
- Bioinformatics: verification and auditing of experiments (e.g. for drug approval)
- High Energy Physics: tracking, analysing and verifying data sets in the ATLAS Experiment of the Large Hadron Collider (CERN)
16Provenance Concepts
17Provenance Lifecycle
Core Interfaces to Provenance Store
Provenance Store
Query and Reason over Provenance of Data
Administer Store and its contents
18Nature of Documentation
- We represent the provenance of some data by documenting the process that led to the data
- documentation can be complete or partial
- it can be accurate or inaccurate
- it can present conflicting or consensual views of the actors involved
- it can provide operational details of execution or it can be abstract.
19p-assertion
- A given element of process documentation will be referred to as a p-assertion
- A p-assertion is an assertion that is made by an actor and pertains to a process.
20Service Oriented Architecture
- Broad definition of a service as a component that takes some inputs and produces some outputs.
- Services are brought together to solve a given problem, typically via a workflow definition that specifies their composition.
- Interactions with services take place with messages that are constructed according to the service's interface specification.
- The term actor denotes either a client or a service in a SOA.
- A process is defined as the execution of a workflow
21Process Documentation (1)
From these p-assertions, we can derive that M3
was sent by Actor 1 and received by Actor 2 (and
likewise for M4)
(Diagram: Actor 1 and Actor 2 exchanging messages M1, M2, M3 and M4)
If actors are black boxes, these assertions are
not very useful because we do not know
dependencies between messages
22Process Documentation (2)
These assertions help identify order of
messages, but not how data was computed
(Diagram: Actor 1 and Actor 2 exchanging messages M1, M2, M3 and M4)
23Process Documentation (3)
These assertions help identify how data is
computed, but provide no information about
non-functional characteristics of the
computation (time, resources used, etc)
(Diagram: Actor 1 and Actor 2 exchanging messages M1, M2, M3 and M4)
24Process Documentation (4)
(Diagram: Actor 1 and Actor 2 exchanging messages M1, M2, M3 and M4)
25Types of p-assertions (1)
- Interaction p-assertion is an assertion of the
contents of a message by an actor that has sent
or received that message
26Types of p-assertions (2)
- Relationship p-assertion is an assertion, made
by an actor, that describes how the actor
obtained an output message sent in an
interaction by applying some function to input
messages from other interactions (likewise for
data)
27Types of p-assertions (3)
- Actor state p-assertion is an assertion made by an actor about its internal state in the context of a specific interaction
I used sparc processor
I used algorithm x version x.y.z
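The three kinds of p-assertion can be sketched as plain records. This is a minimal illustration only: the class and field names are assumptions made for this example, not the project's actual schema, and the sample contents are hypothetical. The helper `exchanged` shows how matching sender and receiver views of one interaction lets us derive who sent and who received a message.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union


@dataclass(frozen=True)
class InteractionPAssertion:
    asserter: str      # actor making the assertion
    interaction: str   # message exchange being documented, e.g. "M3"
    view: str          # "sender" or "receiver"
    content: str       # message contents as seen by the asserter


@dataclass(frozen=True)
class RelationshipPAssertion:
    asserter: str
    output: str               # interaction carrying the output message
    inputs: Tuple[str, ...]   # interactions carrying the input messages
    function: str             # function the actor says it applied


@dataclass(frozen=True)
class ActorStatePAssertion:
    asserter: str
    interaction: str
    state: str                # free-text description of internal state


PAssertion = Union[InteractionPAssertion, RelationshipPAssertion, ActorStatePAssertion]


def exchanged(docs: List[PAssertion], interaction: str) -> Tuple[Optional[str], Optional[str]]:
    """Derive (sender, receiver) of an interaction from the two asserted views."""
    sender = receiver = None
    for d in docs:
        if isinstance(d, InteractionPAssertion) and d.interaction == interaction:
            if d.view == "sender":
                sender = d.asserter
            elif d.view == "receiver":
                receiver = d.asserter
    return sender, receiver


# Hypothetical documentation of one exchange between the two actors.
docs: List[PAssertion] = [
    InteractionPAssertion("Actor 1", "M3", "sender", "payload"),
    InteractionPAssertion("Actor 2", "M3", "receiver", "payload"),
    RelationshipPAssertion("Actor 1", output="M3", inputs=("M1",), function="f"),
    ActorStatePAssertion("Actor 1", "M3", "I used algorithm x version x.y.z"),
]
```

Keeping the asserter on every record means each statement stays attributable to the actor that made it, even when two actors document the same interaction.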
28Data flow
- Interaction p-assertions allow us to specify a flow of data between actors
- Relationship p-assertions allow us to characterise the flow of data inside an actor
- Overall data flow (internal and external) constitutes a DAG, which characterises the process that led to a result
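A small sketch of the DAG idea follows. It assumes, purely for illustration, that each relationship is reduced to an (output, inputs) pair over message names; the dependency chain used here is hypothetical and is not taken from the slides' diagrams.

```python
from collections import defaultdict


def build_graph(relationships):
    """Map each output item to the set of items it was derived from."""
    graph = defaultdict(set)
    for output, inputs in relationships:
        graph[output].update(inputs)
    return graph


def provenance(graph, item):
    """Return every item that `item` transitively depends on."""
    seen = set()
    stack = [item]
    while stack:
        current = stack.pop()
        # .get avoids creating empty entries for items with no recorded inputs
        for parent in graph.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical dependencies: M3 derived from M1, M4 from M3, M2 from M4.
relationships = [("M3", ["M1"]), ("M4", ["M3"]), ("M2", ["M4"])]
graph = build_graph(relationships)
ancestors_of_m2 = provenance(graph, "M2")
```

Walking the graph backwards from a result yields the set of data items that led to it, which is the sense in which the DAG characterises the process.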
29Provenance Architecture
30Interfaces to Provenance Store
Provenance Store
Query and Reason over Provenance of Data
Administer Store and its contents
32P-Assertion schemas
33The p-structure
- The p-structure is a common logical structure of the provenance store shared by all asserting and querying actors
- Hierarchical
- Indexed by interactions (interaction = 1 message exchange)
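One way to picture a hierarchical structure indexed by interaction is a nested mapping. The levels chosen here (interaction, then view, then kind of p-assertion) are an assumption for illustration and may differ from the real p-structure schema.

```python
from collections import defaultdict


def new_p_structure():
    # interaction id -> view ("sender"/"receiver") -> kind -> list of p-assertions
    return defaultdict(lambda: defaultdict(lambda: defaultdict(list)))


def record(store, interaction, view, kind, assertion):
    """File a p-assertion under its interaction, view and kind."""
    store[interaction][view][kind].append(assertion)


def lookup(store, interaction, view, kind):
    """Read p-assertions without creating empty entries for missing keys."""
    return list(store.get(interaction, {}).get(view, {}).get(kind, []))


store = new_p_structure()
record(store, "M3", "sender", "interaction", "contents of M3 as sent")
record(store, "M3", "receiver", "interaction", "contents of M3 as received")
record(store, "M3", "receiver", "actor_state", "I used sparc processor")
```

Because asserting and querying actors share the same layout, a query only needs an interaction identifier to reach everything documented about that message exchange.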
34Recording Protocol (Groth04-06)
- Abstract machines
- DS Properties
- Termination
- Liveness
- Safety
- Statelessness
- Documentation Properties
- Immutability
- Attribution
- Datatype safety
- Foundation for adding necessary cryptographic
techniques
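Two of the documentation properties, immutability and attribution, can be illustrated with an append-only store. This is not the recording protocol itself, only a sketch of what the two properties mean in practice; `Record`, `AppendOnlyStore` and their fields are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class Record:
    asserter: str   # attribution: who made the assertion
    local_id: str   # identifier chosen by the asserter
    content: str


class AppendOnlyStore:
    def __init__(self):
        self._records = {}

    def record(self, rec: Record) -> None:
        if not rec.asserter:
            raise ValueError("every p-assertion must name its asserter")
        key = (rec.asserter, rec.local_id)
        if key in self._records:
            # immutability: recorded documentation is never overwritten
            raise KeyError(f"{key} is already recorded")
        self._records[key] = rec

    def get(self, asserter: str, local_id: str) -> Record:
        return self._records[(asserter, local_id)]


store = AppendOnlyStore()
store.record(Record("Actor 1", "pa-1", "sent M3"))
```

A real deployment would add the cryptographic techniques mentioned above (for example signatures) on top of such a foundation; none are shown here.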
35Querying Functionality (Miles06)
- Process Documentation Query Interface allows for navigation of the documentation of execution
- Allows us to view the provenance store (i.e. the p-structure) as if containing XML data structures
- Independent of technology used for running application and internal store representation
- Seamless navigation of application dependent and application independent process documentation
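Viewing the store as XML means documentation can be navigated with path expressions. The sketch below uses Python's standard `xml.etree.ElementTree` and its limited XPath support; the element and attribute names are invented for this example and are not the actual p-structure schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML view of a store holding documentation for two interactions.
P_STRUCTURE_XML = """
<pstructure>
  <interaction id="M3">
    <view role="sender" asserter="Actor 1">
      <interactionPAssertion>payload</interactionPAssertion>
    </view>
    <view role="receiver" asserter="Actor 2">
      <interactionPAssertion>payload</interactionPAssertion>
      <actorStatePAssertion>I used sparc processor</actorStatePAssertion>
    </view>
  </interaction>
  <interaction id="M4">
    <view role="sender" asserter="Actor 1">
      <interactionPAssertion>other payload</interactionPAssertion>
    </view>
  </interaction>
</pstructure>
""".strip()

root = ET.fromstring(P_STRUCTURE_XML)


def asserters(root, interaction_id):
    """Actors that documented the given interaction, in document order."""
    path = f"./interaction[@id='{interaction_id}']/view"
    return [view.get("asserter") for view in root.findall(path)]


def actor_states(root):
    """All actor state p-assertions, wherever they occur in the store."""
    return [e.text for e in root.findall(".//actorStatePAssertion")]
```

The query code depends only on the XML view, so it stays the same whatever technology the store uses internally.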
36Querying Functionality (Miles06)
- Provenance Query Interface allows us to obtain the provenance of some specific data
- A recognition that there is not one provenance for a piece of data, but there may be different ones, depending on the end user's interest
- Hence, provenance is seen as the result of a query
- Identify a piece of data at a specific execution point
- Scope of the process of interest
- Filter in/out p-assertions according to actors, process, types of relationships, etc
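The idea that provenance is the result of a query can be sketched as a backwards traversal with a caller-supplied filter. The record layout, relation names and data items below are hypothetical; the point is only that different filters over the same documentation yield different provenance for the same data item.

```python
def provenance_query(assertions, start, keep=lambda a: True):
    """Collect relationship assertions reachable backwards from `start`,
    following only the assertions accepted by the `keep` filter."""
    result = []
    seen = {start}
    frontier = [start]
    while frontier:
        item = frontier.pop()
        for a in assertions:
            if a["output"] == item and keep(a):
                result.append(a)
                if a["input"] not in seen:
                    seen.add(a["input"])
                    frontier.append(a["input"])
    return result


# Hypothetical relationship p-assertions, one input per record for simplicity.
assertions = [
    {"asserter": "Actor 2", "output": "result", "input": "intermediate", "relation": "computed-from"},
    {"asserter": "Actor 2", "output": "result", "input": "config", "relation": "configured-by"},
    {"asserter": "Actor 1", "output": "intermediate", "input": "raw", "relation": "computed-from"},
]

everything = provenance_query(assertions, "result")
data_only = provenance_query(assertions, "result", keep=lambda a: a["relation"] == "computed-from")
actor2_only = provenance_query(assertions, "result", keep=lambda a: a["asserter"] == "Actor 2")
```

Here `start` plays the role of the identified data item, and `keep` combines scoping and filtering by actor or relationship type.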
37Standardisation
38Standardisation Options
39Purpose of Standardisation
Application
Application
Provenance Stores
Allow for multiple applications to document their
execution. Applications may be running in
different institutions.
40Purpose of Standardisation
Application
Provenance Store
Provenance Store
Provenance Store
Allow for multiple stores from multiple IT
providers
41Purpose of Standardisation
Provenance Store
Provenance Store
Query Provenance of Data
Allow for multiple stores from multiple IT
providers
42Purpose of Standardisation
Convert into standard data format
Allow for legacy, monolithic applications to
expose their contents (according to standard
schema)
43Purpose of Standardisation
Application
Allow third parties to host provenance stores,
which are trusted by application owners but also
auditors
44Compliance Oriented Architectures
- Separate execution documentation from compliance verification
- Allows for multiple compliance verifications
- Allows for validation to take place across multiple applications, possibly run by different institutions (in particular, allows for outsourcing and subcontracting).
- Approach is suitable for e-scientific peer-reviewing and business compliance verification
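Separating documentation from verification can be sketched as independent check functions run over the same recorded documentation. The two rules below are invented examples, not real regulations, and the dictionary layout is an assumption for illustration.

```python
def both_views_recorded(docs):
    """Hypothetical rule: each interaction is documented by sender and receiver."""
    views = {}
    for d in docs:
        if d["kind"] == "interaction":
            views.setdefault(d["interaction"], set()).add(d["view"])
    return all(v == {"sender", "receiver"} for v in views.values())


def algorithm_version_stated(docs):
    """Hypothetical rule: every computed output has a stated algorithm version."""
    computed = {d["interaction"] for d in docs if d["kind"] == "relationship"}
    versioned = {
        d["interaction"]
        for d in docs
        if d["kind"] == "actor_state" and "version" in d
    }
    return computed <= versioned


def verify(docs, checks):
    """Run any number of compliance checks over the same documentation."""
    return {check.__name__: check(docs) for check in checks}


# Documentation is recorded once, with no knowledge of the checks.
docs = [
    {"kind": "interaction", "interaction": "M3", "view": "sender"},
    {"kind": "interaction", "interaction": "M3", "view": "receiver"},
    {"kind": "interaction", "interaction": "M4", "view": "sender"},
    {"kind": "relationship", "interaction": "M4", "inputs": ["M3"]},
    {"kind": "actor_state", "interaction": "M4", "version": "x.y.z"},
]

report = verify(docs, [both_views_recorded, algorithm_version_stated])
```

A new regulation then becomes one more check function, run against documentation that already exists, without touching the applications that produced it.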
45Organ Transplant Scenario
Hospital
Electronic Healthcare Management Service
Testing Lab
46Hospital Actors
User Interface
Donor Data Collector
Brain Death Manager
47What's on the CD
- Documents relating to both PASOA and EU Provenance projects
- All the talks presented today
- Handouts
- Software
- PReServ (Paul Groth, Simon Miles)
- The EU Provenance client side library
48Conclusions
49To Sum Up
Finance
Aerospace
Distribution
Healthcare
Automobile
Pharmaceutical
- Compliance check
- Rerun/Reproduce
- Analyse
Query
Slide from John Ibbotson
50Overview of Today's Talks
- Provenance Data Structures
- Recording and Querying Provenance
- Break (30 minutes)
- Distribution and Scalability
- Security
- Methodology
51Questions