Provenance: an open approach to experiment validation in eScience - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Provenance: an open approach to experiment validation in eScience


1
Provenance: an open approach to experiment
validation in e-Science
  • Professor Luc Moreau
  • L.Moreau_at_ecs.soton.ac.uk
  • University of Southampton
  • www.ecs.soton.ac.uk/lavm

2
Provenance PASOA Teams
  • University of Southampton
  • Luc Moreau, Paul Groth, Simon Miles, Victor Tan,
    Miguel Branco, Sofia Tsasakou, Sheng Jiang, Steve
    Munroe, Zheng Chen
  • IBM UK (EU Project Coordinator)
  • John Ibbotson, Neil Hardman, Alexis Biller
  • University of Wales, Cardiff
  • Omer Rana, Arnaud Contes, Vikas Deora, Ian
    Wootten
  • Universitat Politècnica de Catalunya (UPC)
  • Steven Willmott, Javier Vazquez
  • SZTAKI
  • Laszlo Varga, Arpad Andics
  • German Aerospace
  • Andreas Schreiber, Guy Kloss, Frank Danneman

3
Contents
  • Motivation
  • Provenance Concepts
  • Provenance Architecture
  • Standardisation
  • Provenance Queries
  • Conclusions

4
Motivation
5
Scientific Research
Academic Peer Review
6
Business Regulations
Accounting
Banking
7
Accounting
Audit (Sarbanes-Oxley)
8
Banking
Audit (Basel II)
9
Health Care Management
European Recommendation R(97)5 on the protection
of medical data
10
e-Science datasets
  • How to undertake peer-reviewing and validation of
    e-Scientific results?

11
Sarbanes-Oxley
  • The American Competitiveness and Corporate
    Accountability Act of 2002, commonly known as the
    Sarbanes-Oxley Act, was signed into law on July
    30, 2002.
  • The law is intended to protect investors by
    improving the accuracy and reliability of
    corporate disclosures.
  • Sarbanes-Oxley also defines a higher level of
    responsibility, accountability, and financial
    reporting transparency - changes that are
    intended to return confidence to investors, as
    well.

12
Food and Drug Administration
13
Basel II
14
Compliance to Regulations
  • The next-compliance problem
  • Can we be certain that by ensuring compliance with
    a new regulation, we do not break previous
    compliance?

15
Current Solutions
  • Proprietary, Monolithic
  • Silos, Closed
  • Do not inter-operate with other applications
  • Not adaptable to new regulations

16
Provenance
  • Oxford English Dictionary
  • the fact of coming from some particular source or
    quarter; origin, derivation
  • the history or pedigree of a work of art,
    manuscript, rare book, etc.
  • concretely, a record of the ultimate derivation
    and passage of an item through its various owners.
  • Concept vs representation

17
Provenance in Computer Systems
  • Our definition of provenance, in the context of
    applications for which process matters to end
    users:
  • The provenance of a piece of data is the
    process that led to that piece of data
  • Our aim is to conceive a computer-based
    representation of provenance that allows us to
    perform useful analysis and reasoning to support
    our use cases

18
Our Approach
  • Define core concepts pertaining to provenance
  • Specify functionality required to become
    provenance-aware
  • Define open data models and protocols that allow
    systems to inter-operate
  • Standardise data models and protocols
  • Provide a reference implementation
  • Provide reasoning capability

19
Context (1)
  • Aerospace engineering: maintain a historical
    record of design processes, for up to 99 years.
  • Organ transplant management: tracking of previous
    decisions, crucial to maximise the efficiency in
    matching and recovery rate of patients
20
Context (2)
  • Bioinformatics: verification and auditing of
    experiments (e.g. for drug approval)
  • High Energy Physics: tracking, analysing and
    verifying data sets in the ATLAS Experiment of
    the Large Hadron Collider (CERN)
21
Provenance Concepts
22
Provenance Lifecycle
Core Interfaces to Provenance Store
Provenance Store
Query and Reason over Provenance of Data
Administer Store and its contents
23
Nature of Documentation
  • We represent the provenance of some data by
    documenting the process that led to the data
  • documentation can be complete or partial
  • it can be accurate or inaccurate
  • it can present conflicting or consensual views of
    the actors involved
  • it can provide operational details of execution
    or it can be abstract.

24
p-assertion
  • A given element of process documentation will be
    referred to as a p-assertion
  • A p-assertion is an assertion that is made by an
    actor and pertains to a process.

25
Service Oriented Architecture
  • Broad definition of a service as a component that
    takes some inputs and produces some outputs.
  • Services are brought together to solve a given
    problem, typically via a workflow definition that
    specifies their composition.
  • Interactions with services take place with
    messages that are constructed according to
    the services' interface specifications.
  • The term actor denotes either a client or a
    service in an SOA.
  • A process is defined as the execution of a workflow

26
Process Documentation (1)
[Figure: Actor 1 and Actor 2 exchanging messages M1, M2, M3 and M4]
  • From these p-assertions, we can derive that M3
    was sent by Actor 1 and received by Actor 2 (and
    likewise for M4)
  • If actors are black boxes, these assertions are
    not very useful because we do not know
    dependencies between messages
27
Process Documentation (2)
[Figure: Actor 1 and Actor 2 exchanging messages M1, M2, M3 and M4]
  • These assertions help identify order of
    messages, but not how data was computed
28
Process Documentation (3)
[Figure: Actor 1 and Actor 2 exchanging messages M1, M2, M3 and M4]
  • These assertions help identify how data is
    computed, but provide no information about
    non-functional characteristics of the
    computation (time, resources used, etc.)
29
Process Documentation (4)
[Figure: Actor 1 and Actor 2 exchanging messages M1, M2, M3 and M4]
30
Types of p-assertions (1)
  • Interaction p-assertion is an assertion of the
    contents of a message by an actor that has sent
    or received that message

31
Types of p-assertions (2)
  • Relationship p-assertion is an assertion, made
    by an actor, that describes how the actor
    obtained output data or the whole message sent in
    an interaction by applying some function to
    input data or messages from other interactions.

32
Types of p-assertions (3)
  • Actor state p-assertion: an assertion made by an
    actor about its internal state in the context of
    a specific interaction
  • Examples: "I used a SPARC processor", "I used
    algorithm x, version x.y.z"
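
The three kinds of p-assertion can be pictured as simple record types. The sketch below is only an illustration of the definitions on these slides: the class names, field names and example values are assumptions made for this example and are not taken from the PASOA schemas.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class InteractionPAssertion:
    asserter: str     # actor making the assertion
    interaction: str  # hypothetical message identifier, e.g. "M1"
    view: str         # "sender" or "receiver"
    content: str      # asserted contents of the message


@dataclass(frozen=True)
class RelationshipPAssertion:
    asserter: str
    effect: Tuple[str, str]              # (interaction, data item) that was obtained
    relation: str                        # name of the function that was applied
    causes: Tuple[Tuple[str, str], ...]  # (interaction, data item) inputs


@dataclass(frozen=True)
class ActorStatePAssertion:
    asserter: str
    interaction: str  # the interaction that gives the state its context
    state: str


# One assertion of each kind, all made by the same (hypothetical) actor.
sent = InteractionPAssertion("Actor 1", "M3", "sender", "result=6")
derived = RelationshipPAssertion(
    "Actor 1", ("M3", "result"), "sumOf", (("M1", "a"), ("M1", "b"))
)
state = ActorStatePAssertion("Actor 1", "M3", "algorithm x, version x.y.z")
documentation = [sent, derived, state]
```

The records are frozen so that an assertion cannot be altered once created, which is one simple way to mirror the idea that documentation is asserted rather than edited.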
33
Data flow
  • Interaction p-assertions allow us to specify a
    flow of data between actors
  • Relationship p-assertions allow us to
    characterise the flow of data inside an actor
  • Overall data flow (internal and external)
    constitutes a DAG, which characterises the
    process that led to a result
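
As a rough sketch of the DAG idea, the relationships asserted by actors can be turned into a dependency graph and ordered topologically. The data item names below are invented for illustration, and interaction p-assertions are simplified away by naming each item after the message that carries it.

```python
from graphlib import TopologicalSorter

# (effect, relation, causes): each entry stands for one relationship p-assertion.
# Item names such as "M2.sum" are hypothetical.
relationships = [
    ("M2.sum", "sumOf", ["M1.a", "M1.b"]),
    ("M3.quotient", "divisionOf", ["M2.sum", "M2.divisor"]),
    ("M4.result", "copyOf", ["M3.quotient"]),
]


def data_flow_graph(rels):
    """Map each data item to the set of items it was derived from."""
    graph = {}
    for effect, _relation, causes in rels:
        graph.setdefault(effect, set()).update(causes)
        for cause in causes:
            graph.setdefault(cause, set())
    return graph


graph = data_flow_graph(relationships)

# static_order() lists every item after the items it depends on, and raises
# graphlib.CycleError if the documented data flow is not acyclic.
order = list(TopologicalSorter(graph).static_order())
```

Reading `order` from left to right replays the process that led to the final result, inputs first.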

34
Provenance Architecture
35
Interfaces to Provenance Store
Provenance Store
Query and Reason over Provenance Of Data
Administer Store and its contents
36
(No Transcript)
37
P-Assertion schemas
38
The p-structure (1)
  • The p-structure is a common logical structure of
    the provenance store shared by all asserting and
    querying actors
  • Hierarchical
  • Indexed by Interactions

39
The p-structure (2)
40
Recording Protocol (Groth04-06)
  • Abstract machines
  • DS Properties
  • Termination
  • Liveness
  • Safety
  • Statelessness
  • Documentation Properties
  • Immutability
  • Attribution
  • Datatype safety
  • Foundation for adding necessary cryptographic
    techniques
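
A toy in-memory store can make two of these ideas concrete: a p-structure indexed by interaction, and documentation that is attributed to an asserter and never overwritten. This is a minimal sketch under assumed names (`ToyProvenanceStore`, `record`, `lookup`); it is not the PReServ interface and it ignores distribution, termination and security entirely.

```python
class ToyProvenanceStore:
    """Hypothetical store: interaction key -> view -> recorded p-assertions."""

    VIEWS = ("sender", "receiver")

    def __init__(self):
        self._records = {}

    def record(self, interaction_key, view, asserter, local_id, content):
        if view not in self.VIEWS:
            raise ValueError("view must be 'sender' or 'receiver'")
        views = self._records.setdefault(
            interaction_key, {name: {} for name in self.VIEWS}
        )
        key = (asserter, local_id)  # attribution: every entry names its asserter
        if key in views[view]:
            # Immutability: an existing p-assertion is never replaced.
            raise ValueError("p-assertion already recorded")
        views[view][key] = content

    def lookup(self, interaction_key, view):
        """Return a copy, so callers cannot modify what was recorded."""
        return dict(self._records.get(interaction_key, {}).get(view, {}))


store = ToyProvenanceStore()
store.record("M1", "sender", "Actor 1", "pa-1", "average(7, 5)")
store.record("M1", "receiver", "Actor 2", "pa-1", "average(7, 5)")
```

Keeping the sender's and receiver's views of the same interaction side by side is what would let a reader of the store compare the two accounts of one message.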

41
Querying Functionality (Miles06)
  • Process Documentation Query Interface allows for
    navigation of the documentation of execution
  • Allows us to view the provenance store (i.e. the
    p-structure) as if containing XML data structures
  • Independent of technology used for running
    application and internal store representation
  • Seamless navigation of application dependent and
    application independent provenance representation

42
Querying Functionality (Miles06)
  • Provenance Query Interface allows us to obtain
    the provenance of some specific data
  • A recognition that there is not one provenance
    for a piece of data, but there may be different
    ones, depending on the end user's interest
  • Hence, provenance is seen as a query
  • Identify a piece of data
  • Scope of the process of interest
  • Filter in/out p-assertions according to actors,
    process, types of relationships, etc

43
Available Software
  • PReServ (Paul Groth and Simon Miles)
  • Offers recording and querying interfaces
  • Available from www.pasoa.org
  • An OGSA-DAI based version will soon be available
    from www.gridprovenance.org
  • Is being used in a bioinformatics application
    (cf. HPDC'05, ISWC'05)

44
Standardisation
45
Standardisation Options
46
Purpose of Standardisation
Application
Application
Provenance Stores
Allow for multiple applications to document their
execution. Applications may be running in
different institutions.
47
Purpose of Standardisation
Application
Provenance Store
Provenance Store
Provenance Store
Allow for multiple stores from multiple IT
providers
48
Purpose of Standardisation
Provenance Store
Provenance Store
Query Provenance Of Data
Allow for multiple stores from multiple IT
providers
49
Purpose of Standardisation
Convert in standard data format
Allow for legacy, monolithic applications to
expose their contents (according to standard
schema)
50
Purpose of Standardisation
Application
Allow third parties to host provenance stores,
which are trusted by application owners but also
auditors
51
Compliance Oriented Architectures
Application
Provenance Store
Query Provenance Of Data
52
Compliance Oriented Architectures
  • Separate execution documentation from compliance
    verification
  • Allows for multiple compliance verifications
  • Allows for validation to take place across
    multiple applications, possibly run by different
    institutions (in particular, allows for
    outsourcing and subcontracting).
  • Approach is suitable for e-scientific
    peer-reviewing and business compliance
    verification
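
The separation argued for here can be sketched in a few lines: the application records its documentation once, and any number of independent checks are run over it afterwards. Everything in this example (the record fields, the two rules and their names) is invented for illustration and does not correspond to any real regulation.

```python
# Hypothetical process documentation, recorded once by the application.
documentation = [
    {"actor": "ServiceA", "operation": "compute", "approved_by": "reviewer-1"},
    {"actor": "ServiceB", "operation": "store", "approved_by": None},
]


def every_step_approved(docs):
    """Toy rule 1: every documented step names an approver."""
    return all(step["approved_by"] is not None for step in docs)


def only_known_actors(docs, known=frozenset({"ServiceA", "ServiceB"})):
    """Toy rule 2: every documented step was performed by a known actor."""
    return all(step["actor"] in known for step in docs)


# Compliance verifications are added without touching the application
# or the documentation it recorded.
checks = {"approval": every_step_approved, "known actors": only_known_actors}
report = {name: check(documentation) for name, check in checks.items()}
```

Adding a check for a new regulation means adding one entry to `checks`; the earlier checks keep running over the same documentation, which is the property the next-compliance question asks for.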

53
Provenance Queries (Miles06)
54
Example Application
  • Actors: GUI, Averager, Divider, Store
  • Messages
  • 1. average (7, 5)
  • 2. divide (12, 2)
  • 3. answer (6)
  • 4. answer (6)
  • 5. store (6, file1)
  • Averager(in1, in2) returns (in1 + in2)/2
  • Averager delegates the division operation to
    the service Divider
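
The example application is small enough to sketch directly. In the version below each service is a plain function and every message is appended to a log, standing in for interaction p-assertions. The `send` helper, the log layout and the sender/receiver pairing of each message are assumptions inferred from the slide, not part of the original application.

```python
# Hypothetical message log: (number, sender, receiver, operation, arguments).
log = []


def send(number, sender, receiver, operation, *args):
    """Record one message and hand its arguments to the receiver unchanged."""
    log.append((number, sender, receiver, operation, args))
    return args


def divider(dividend, divisor):
    # Message 3: Divider answers Averager.
    (quotient,) = send(3, "Divider", "Averager", "answer", dividend / divisor)
    return quotient


def averager(in1, in2):
    # Message 2: Averager delegates the division to Divider.
    dividend, divisor = send(2, "Averager", "Divider", "divide", in1 + in2, 2)
    result = divider(dividend, divisor)
    # Message 4: Averager answers the GUI.
    (answer,) = send(4, "Averager", "GUI", "answer", result)
    return answer


def gui(a, b, filename, files):
    # Message 1: the GUI asks for an average.
    in1, in2 = send(1, "GUI", "Averager", "average", a, b)
    answer = averager(in1, in2)
    # Message 5: the GUI asks Store to save the answer.
    value, name = send(5, "GUI", "Store", "store", answer, filename)
    files[name] = value
    return answer


files = {}
gui(7, 5, "file1", files)
```

After the call, `log` holds the five messages in the order shown on the slide and `files["file1"]` holds the stored average.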
55
Example Application
[Figure: the example application, with messages 1 to 5 exchanged between GUI, Averager, Divider and Store]
  • Relationships
  • 12 in msg 2 is sum of 7, 5 in msg 1
  • 6 in msg 3 is division of 12, 2 in msg 2
  • 6 in msg 4 is copy of 6 in msg 3
  • 6 in msg 4 is average of 7, 5 in msg 1
  • 6 in msg 5 is copy of 6 in msg 4
  • Tracers
  • are used to demarcate activities (aka sets of
    services)
  • added by Averager in call to Divider
  • returned by Divider in response

56
Identifying what to Find the Provenance of
  • Identify the event where the entity is
    documented
  • In this case, the event is the receipt of a
    request to store the data in file named file1
  • Identify the data entity within that message
  • In this case, the data of interest is the 6
    stored in file1

file1
Store
57
Provenance Graph
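[Figure: provenance graph of the 6 stored in file1. Nodes are the data items 7, 5, 12, 2 and 6, labelled with the actors that exchanged them (GUI, Averager, Divider, Store). Edges are the relationships Sum of, Division of (with Dividend and Divisor parameters), Average of and Copy of.]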
58
Scoped Provenance Graph
[Figure: the provenance graph of the 6 stored in file1, with the Average of relationships filtered out]
  • Allows us to take a given perspective on the
    provenance of a piece of data, e.g. looking at
    the restorations of a painting rather than its
    various owners
  • Filter to exclude Average of relationships
59
Scoped Provenance Graph
[Figure: the provenance graph of the 6 stored in file1, with messages containing the tracer filtered out]
  • Allows us to consider a given service (and all
    its inferior invocations) as a black box, e.g.
    no detail should be provided about the internals
    of Averager
  • Filter to exclude messages containing tracer
  • This is equivalent to hiding the internal
    operation of Averager
60
Scoped Provenance Graph
[Figure: the provenance graph of the 6 stored in file1, with Divisor parameters filtered out]
  • Allows us to scope the provenance graph according
    to types of data
  • Filter to exclude Divisor parameters
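
The unscoped and scoped graphs can be reproduced with a small backward traversal over the relationships of the example. This is a sketch only: the `value@msgN` naming, the role labels other than dividend and divisor, and the `provenance` function are assumptions made for this illustration, and the store request is taken to be message 5.

```python
# Each tuple stands for one relationship p-assertion:
# (effect, relation, [(cause, role), ...]).
RELATIONSHIPS = [
    ("12@msg2", "sumOf", [("7@msg1", "operand"), ("5@msg1", "operand")]),
    ("6@msg3", "divisionOf", [("12@msg2", "dividend"), ("2@msg2", "divisor")]),
    ("6@msg4", "copyOf", [("6@msg3", "source")]),
    ("6@msg4", "averageOf", [("7@msg1", "operand"), ("5@msg1", "operand")]),
    ("6@msg5", "copyOf", [("6@msg4", "source")]),
]
# Messages carrying the tracer added by Averager: its call to Divider
# and Divider's response.
TRACED_MESSAGES = {"msg2", "msg3"}


def message_of(item):
    return item.split("@")[1]


def provenance(item, keep=lambda effect, relation, cause, role: True):
    """Collect (effect, relation, cause) edges reachable backwards from item."""
    edges, todo, seen = [], [item], {item}
    while todo:
        current = todo.pop()
        for effect, relation, causes in RELATIONSHIPS:
            if effect != current:
                continue
            for cause, role in causes:
                if not keep(effect, relation, cause, role):
                    continue  # the scope filters this edge out
                edges.append((effect, relation, cause))
                if cause not in seen:
                    seen.add(cause)
                    todo.append(cause)
    return edges


start = "6@msg5"  # the 6 in the request to store file1
unscoped = provenance(start)
no_average = provenance(start, lambda e, rel, c, role: rel != "averageOf")
black_box = provenance(
    start,
    lambda e, rel, c, role: message_of(e) not in TRACED_MESSAGES
    and message_of(c) not in TRACED_MESSAGES,
)
no_divisor = provenance(start, lambda e, rel, c, role: role != "divisor")
```

With the tracer filter, only the copy into the store request and the Average of relationship remain, so Averager appears as a black box that turns 7 and 5 into 6.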
61
Provenance Query
62
Practically
  • Data Identification (here Event)
  • //ps:interactionRecord[ps:interactionKey/ps:messageSink/wsa:EndpointReference/wsa:Address = "http://www.example.com/store"]
  • Unscoped query
  • /
  • Exclude averageOf relation
  • /pq:relationshipTarget[ps:relation != "http://www.example.comaverageOf"]
  • Exclude tracer introduced by Averager
  • /pq:relationshipTarget/ps:interactionPAssertion[not(ex:envelope/ph:pheader/ph:interactionMetaData/ph:tracer = "process://sub/1")]
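
The data identification step selects the interaction record whose message sink is the Store service. The sketch below imitates that selection over a toy XML document using Python's standard library. The document layout, the `id` attributes and the namespace URIs are placeholders invented for this example; they are not the PASOA or WS-Addressing schemas, and the filtering is done in Python rather than with a full XPath engine.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs, chosen for this sketch only.
NS = {"ps": "http://example.org/ps", "wsa": "http://example.org/wsa"}

DOCUMENT = """
<ps:store xmlns:ps="http://example.org/ps" xmlns:wsa="http://example.org/wsa">
  <ps:interactionRecord id="r1">
    <ps:interactionKey>
      <ps:messageSink>
        <wsa:EndpointReference>
          <wsa:Address>http://www.example.com/averager</wsa:Address>
        </wsa:EndpointReference>
      </ps:messageSink>
    </ps:interactionKey>
  </ps:interactionRecord>
  <ps:interactionRecord id="r2">
    <ps:interactionKey>
      <ps:messageSink>
        <wsa:EndpointReference>
          <wsa:Address>http://www.example.com/store</wsa:Address>
        </wsa:EndpointReference>
      </ps:messageSink>
    </ps:interactionKey>
  </ps:interactionRecord>
</ps:store>
"""

# Path from an interaction record down to the address of its message sink.
SINK_PATH = "ps:interactionKey/ps:messageSink/wsa:EndpointReference/wsa:Address"


def records_received_by(root, address):
    """Return the interaction records whose message sink has the given address."""
    return [
        record
        for record in root.findall(".//ps:interactionRecord", NS)
        if record.findtext(SINK_PATH, namespaces=NS) == address
    ]


root = ET.fromstring(DOCUMENT.strip())
matches = records_received_by(root, "http://www.example.com/store")
```

Here `matches` contains only the record for the message received by the Store service, which is the event from which the provenance query starts.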

63
Conclusions
64
To Sum Up
Standardising the documentation of Business
Processes
  • Sectors: Finance, Distribution, Aerospace,
    Healthcare, Automobile, Pharmaceutical
  • Query
  • Compliance check
  • Rerun/Reproduce
  • Analyse

Slide from John Ibbotson
65
Conclusions
  • Crucial topic for many applications
  • Full architectural specification
  • An implementation available for download
  • Methodology to make application provenance-aware
  • www.pasoa.org
  • www.gridprovenance.org

66
www.ipaw.info/ipaw06
67
Publications
  • Paul Groth, Simon Miles, Weijian Fang, Sylvia C.
    Wong, Klaus-Peter Zauner, and Luc Moreau.
    Recording and Using Provenance in a Protein
    Compressibility Experiment. In Proceedings of the
    14th IEEE International Symposium on High
    Performance Distributed Computing (HPDC'05), July
    2005.
  • Paul Groth, Michael Luck, and Luc Moreau. A
    protocol for recording provenance in
    service-oriented Grids. In Proceedings of the 8th
    International Conference on Principles of
    Distributed Systems (OPODIS'04), Grenoble,
    France, December 2004.
  • Paul Groth, Michael Luck, and Luc Moreau.
    Formalising a protocol for recording provenance
    in Grids. In Proceedings of the UK OST e-Science
    second All Hands Meeting 2004 (AHM'04),
    Nottingham, UK, September 2004.
  • Simon Miles, Paul Groth, Miguel Branco, and Luc
    Moreau. The requirements of recording and using
    provenance in e-Science experiments. Technical
    report, University of Southampton, 2005.
  • Luc Moreau, Syd Chapman, Andreas Schreiber, Rolf
    Hempel, Omer Rana, Laszlo Varga, Ulises Cortes,
    and Steven Willmott. Provenance-based Trust for
    Grid Computing --- Position Paper. In , 2003.
  • Paul Townend, Paul Groth, and Jie Xu. A
    Provenance-Aware Weighted Fault Tolerance Scheme
    for Service-Based Applications. In Proc. of the
    8th IEEE International Symposium on
    Object-oriented Real-time distributed Computing
    (ISORC 2005), May 2005.
  • Paul Groth, Simon Miles, Victor Tan, and Luc
    Moreau. Architecture for Provenance Systems.
    Technical report, University of Southampton,
    October 2005.

68
Questions
69
(No Transcript)
70
OTM Application