Title: Provenance in myGrid and beyond
1- Provenance in myGrid and beyond
- www.mygrid.org.uk
- Luc Moreau,
- University of Southampton, UK
2- or the Provenance of
- my interest in Provenance
- Luc Moreau,
- University of Southampton, UK
3Overview
- Bioinformatics background
- myGrid facts
- Services and Workflows
- Provenance in myGrid
- Beyond myGrid Provenance
- Architectural vision
- Conclusions
5Large amounts of data
http://www3.ebi.ac.uk/Services/DBStats/
- EMBL July 2001
- 150 Gbytes
- Microarray
- 1 Petabyte per annum
- Sanger Centre
- 20 terabytes of data
- Genome sequences increase 4x per annum
6Heterogeneity
Genomic, proteomic, transcriptomic, metabolomic,
protein-protein interactions, regulatory
bio-networks, alignments, disease, patterns and
motifs, protein structure, protein
classifications, specialist proteins (enzymes,
receptors),
Proteome
7Heterogeneity
- Data types and forms
- Community
- Autonomy
- Over 500 different databases
- Different formats, structure, schemas, coverage
- Web interfaces, flat file distribution,
8Heterogeneous Data
- Multimedia
- Images and video
- Text annotations and literature
- Descriptive as well as numeric
- Knowledge-based
Text Extraction
9Bioinformatics Analysis
- Different algorithms
- BLAST, FASTA, pSW
- Different implementations
- WU-BLAST, NCBI-BLAST
- Different service providers
- NCBI, EBI, DDBJ
10Drug Discovery
11In silico experimentation
- Discovery of resources and tools, staging of operations, sharing of results
- Process is as important as outcome
- Science is dynamic: change happens
- Scientific discovery is personal and global
- Provenance and history
12Overview
- Bioinformatics background
- myGrid facts
- Services and Workflows
- Provenance in myGrid
- Beyond myGrid Provenance
- Architectural vision
- Conclusions
13myGrid
- EPSRC funded pilot project
- Generic middleware within application setting
- 36 months within a 42-month performance period
- Start 1st October 2001
- 16 full-time post docs altogether
- 6 DTA studentships
- 1 technical project manager
14myGrid consortium
- Scientific Team
- Biologists and Bioinformaticians
- GSK, AZ, Merck KGaA, Manchester, EBI
- Technical Team
- Manchester, Southampton, Newcastle, Sheffield, EBI, Nottingham
- IBM, SUN
- GeneticXchange
- Network Inference, Epistemics Ltd
15myGrid outcomes
- e-Scientists
- Bioinformatics demonstrator (Graves disease and Williams syndrome)
- Developers
- myGrid-in-a-Box developers kit
- (currently myGrid 0.4)
- Integrating some existing bioinformatics tools
with myGrid (EBI services)
16Overview
- Bioinformatics background
- myGrid facts
- Services and Workflows
- Provenance in myGrid
- Beyond myGrid Provenance
- Architectural vision
- Conclusions
17Graves disease
- Autoimmune disease of the thyroid in which the immune system of an individual attacks cells in the thyroid gland, resulting in hyperthyroidism
- Weight loss, trembling, muscle weakness, increased pulse rate, increased sweating and heat intolerance, goitre, exophthalmos
18The Biology
- GD is caused by the stimulation of the thyrotrophin receptor by thyroid-stimulating autoantibodies secreted by lymphocytes of the immune system.
- Why is the lymphocyte producing the antibodies that attack the thyroid cell?
19 Graves Disease Experimental Process
20Experiment life cycle
[Diagram: the experiment life cycle. Labels recovered from the figure, in transcript order:]
- Personalised registries, personalised workflows, info repository views, personalised annotations, personalised metadata, security
- Resource and service discovery, repository creation, workflow creation, database query formation
- Forming experiments
- Personalisation
- Discovering and reusing experiments and resources
- Executing experiments
- Workflow discovery and refinement, resource and service discovery, repository creation, provenance
- Workflow enactment, distributed query processing, job execution, provenance generation, single sign-on authorisation, event notification
- Providing services and experiments
- Managing experiments
- Service registration, workflow deposition, metadata annotation, third party registration
- Information repository, metadata management, provenance management, workflow evolution, event notification
21A work bench for demonstrating services
myView on the mIR
Workflow
Metadata about workflow
note about workflow
22Workflows
- A workflow represents an experiment that can be run on the Grid.
- A workflow takes data as input.
- It performs activities, which are steps involved in analysing the data, including using tools and services, querying databases and running other workflows.
- A workflow can be run on the user's local machine, or remotely, taking advantage of resources that are distributed.
- Data-intensive grid: having to deal with heterogeneity of the data and processes.
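The description of a workflow above can be pictured in code. The following is a minimal Python sketch, not myGrid's workflow language: it models a workflow as an ordered list of activities applied to input data, where an activity may itself be another workflow. All names (`Workflow`, `run`, the toy activities) are hypothetical.

```python
from typing import Any, Callable, List, Union

# An activity is any callable step; a nested Workflow is also allowed.
Activity = Union[Callable[[Any], Any], "Workflow"]


class Workflow:
    """Hypothetical model: an ordered list of activities applied to input data."""

    def __init__(self, name: str, activities: List[Activity]) -> None:
        self.name = name
        self.activities = activities

    def run(self, data: Any) -> Any:
        # Feed the output of each activity into the next one.
        for activity in self.activities:
            if isinstance(activity, Workflow):
                data = activity.run(data)   # running another workflow
            else:
                data = activity(data)       # calling a tool or service
        return data


# Toy activities standing in for tools, services and database queries.
def clean(seq: str) -> str:
    return seq.strip().upper()


def reverse(seq: str) -> str:
    return seq[::-1]


inner = Workflow("normalise", [clean])
outer = Workflow("experiment", [inner, reverse])
result = outer.run("  acgt ")
```

Here `outer` reuses `inner` as one of its steps, which mirrors the point that a workflow can run other workflows.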
23myGrid schematic
Graves disease scenario
Exemplars
Workbench
Workflow editor
Talisman
Generic Applications
Gateway
Event Notification
Workflow Enactment
Core components
Information repository
Service Registry
Knowledge management
SoapLab
Services
Bio services
Distributed query processing
Text services
24Service Oriented Architecture
[Architecture diagram. Component labels: Knowledge Services, Semantic registration, Registry, Ontology Server, Reasoner, Structural registration, UDDI, Matcher, Registry View, Notification Service, UDDI-M, Service Discovery, JMS, Provenance service, Workflow enactment engine, Build/Edit Workflow, mIR, Test Data, WSFL, Component Discovery, Information Extraction, Distributed Query Processor, Job Execution, mInfo Repository, Workflow templates, Workflow instances, PASTA, Service, Metadata, Concepts, Data, Provenance, SoapLab, DB2]
25myGrid Deployment
26myGrid 0.4 (Nov 2003)
- Describer (MAN): a tool for attaching semantic descriptions to Web Services and workflows
- Find Service (MAN): a component for classifying and discovering services and workflows via their semantic descriptions
- Ontology Server (MAN): the DAML+OIL reasoner
- Workbench (NOT): a NetBeans module for examining and updating the MIR and submitting workflows for enactment
- e-Science Gateway (NOT): an API giving access to myGrid core services
- MIR (myGrid information repository) (MAN/NEW): a Web Service accessing a repository that can hold data for an individual scientist or a team of scientists
- Notification Service (IAM): a general-purpose Web Service that supports a publish/subscribe model of event notification, based on JMS
- Registry View service (IAM): a Web Service supporting a registry of published Web Services and workflows annotated with metadata, including semantic descriptions
- Freefluo (ITI): workflow enactment engine
- Taverna (EBI): workflow editing environment
27Overview
- Bioinformatics background
- myGrid facts
- Services and Workflows
- Provenance in myGrid
- Beyond myGrid Provenance
- Architectural vision
- Conclusions
28Provenance definition
- Main Entry: provenance
- Pronunciation: 'präv-nən(t)s, 'prä-və-"nän(t)s
- Function: noun
- Etymology: French, from provenir to come forth, originate, from Latin provenire, from pro- forth + venire to come (more at PRO-, COME)
- Date: 1785
- 1: ORIGIN, SOURCE
- 2: the history of ownership of a valued object or work of art or literature
33Provenance
Provenance is related to:
- Experiment is repeatable, if not reproducible, and explained by provenance records
- Who, what, where, why, when, (w)how?
- The traceability of knowledge as it evolves and as it is derived.
- Immutable metadata
- Migration: travels with its data but may not be stored with it.
- Private vs shared provenance records.
- Credit.
34Early Provenance Capture
A full provenance record is linked with the
results. It is a log of execution.
35Kinds of Provenance
- Backward Derivation
- An explanation of when, by whom and how something was produced.
- Linking items, usually in a directed graph.
- Execution Process-centric
- To be contrasted with forward derivation, which
is a path like a workflow, script or query.
36Kinds of Provenance
- Annotations
- Attached to items or collections of items, in a structured, semi-structured or free text form.
- Annotations on one item or linking items.
- An explanation of why, when, where, who, what, how.
- Data-centric
37Kinds of Provenance in myGrid
- Derivations
- Workflow Enactment Engine provides a detailed provenance record stored in the myGrid Information Repository (mIR) describing what was done, with what services and when
- XML document, soon to be an RDF model
- Annotations
- Every mIR object has Dublin Core provenance properties described in an attribute-value model
38Provenance of data
- Operational execution trail
Gene: AC005412.6
SNP000010197
input
output
process: start time, end time
run_for
by_service
urn: Claire Jennings
lsid: HGVBase_retrieve
39From Provenance to Knowledge
- Declarative semantic execution trail
contains_single_nucleotide_polymorphism
Gene: AC005412.6
SNP000010197
input
output
as stated by
process: start time, end time
run_for
by_service
urn: Claire Jennings
lsid: HGVBase_retrieve
40From Provenance to Knowledge
urn: Carole Goble
disputed by
contains_single_nucleotide_polymorphism
Gene: AC005412.6
SNP000010197
input
output
as stated by
process: start time, end time
run_for
by_service
urn: Claire Jennings
lsid: HGVBase_retrieve
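The three diagrams above (slides 38 to 40) can be read as a growing set of subject-predicate-object statements. A minimal sketch in plain Python follows. The identifiers come from the slides, but the direction of each edge, the `process:1` and `stmt:1` node names, and the helper functions are assumptions made for illustration only.

```python
# Each statement is a (subject, predicate, object) triple.
triples = set()


def add(s, p, o):
    triples.add((s, p, o))


def objects(s, p):
    """Return all objects linked to subject s by predicate p."""
    return sorted(o for (s2, p2, o) in triples if s2 == s and p2 == p)


# Operational execution trail (slide 38).
add("process:1", "input", "Gene:AC005412.6")
add("process:1", "output", "SNP000010197")
add("process:1", "run_for", "urn:Claire Jennings")
add("process:1", "by_service", "lsid:HGVBase_retrieve")

# Declarative statement derived from the trail (slide 39).
add("stmt:1", "subject", "Gene:AC005412.6")
add("stmt:1", "predicate", "contains_single_nucleotide_polymorphism")
add("stmt:1", "object", "SNP000010197")
add("stmt:1", "as_stated_by", "process:1")

# A second party disputes the statement without erasing it (slide 40).
add("stmt:1", "disputed_by", "urn:Carole Goble")

# Who is the statement attributed to, and who disputes it?
stated_by = objects("stmt:1", "as_stated_by")
disputed_by = objects("stmt:1", "disputed_by")
```

Treating the statement itself as a node (`stmt:1`) is what lets "as stated by" and "disputed by" attach to the claim and not to the gene or the SNP.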
41Provenance vs
- Provenance vs Annotation
- Provenance of an annotation
- Annotation of Provenance
- Provenance vs Workflow
- Provenance describes past execution
- A workflow is a script for future execution
42What is Provenance?
- Annotations may be subject to interpretation (e.g. Alice believes annotation X, whereas Bob does not).
- Provenance should aim at recording an undisputed view of an execution.
43What is Provenance?
- Provenance traces execution
- Provenance must be generated automatically
- Annotations can be either generated automatically or created by the user
- Annotations can contain semantic augmentation, which can be derived automatically or supplied manually.
44Generating provenance
45Overview
- Bioinformatics background
- myGrid facts
- Services and Workflows
- Provenance in myGrid
- Beyond myGrid Provenance
- Architectural vision
- Conclusions
46Provenance in a Bioinformatics Grid
- myGrid builds a personalised problem-solving environment that helps bioinformaticians find, adapt, construct and execute in silico experiments
- Provenance in the drug discovery process
- FDA requirement on drug companies to keep a record of provenance of drug discovery as long as the drug is in use (sometimes up to 50 years).
47Provenance in Aerospace Engineering
- Provenance requirement to maintain a historical record of outputs from each sub-system involved in simulations.
- Aircraft provenance data need to be kept for up to 99 years when sold to some countries.
- Currently, little direct support is available for this.
48Provenance in Organ Transplant Management
- Decision support systems for organ and tissue transplant rely on a wide range of data sources, patient data, and doctors' and surgeons' knowledge
- Heavily regulated domain: European, national, regional and site-specific rules govern how decisions are made.
- Application of these rules must be ensured and auditable, and the rules may change over time
- Provenance allows tracking previous decisions, which is crucial to maximise the efficiency in matching and the recovery rate of patients
49The Grid and Virtual Organisations
- The Grid problem is defined as "coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organisations" [FKT01].
- Effort is required to allow users to place their trust in the data produced by such virtual organisations
- Understanding how a given service is likely to modify data flowing into it, and how this data has been generated, is crucial.
50Provenance and Virtual Organisations
- Given a set of services in an open grid environment that decide to form a virtual organisation with the aim of producing a given result
- How can we determine the process that generated the result, especially after the virtual organisation has been disbanded?
- The lack of information about the origin of results does not help users to trust such open environments.
51Provenance and Workflows
- Workflow enactment has become popular in the Grid and Web Services communities
- Workflow enactment can be seen as a scripted form of virtual organisation.
- The problem is similar: how can we determine the origin of enactment results?
52Provenance Definition
- Provenance is some data able to explain how a particular result has been derived.
- In a service-oriented architecture, provenance identifies what data is passed between services, what services are available, and what results are generated for particular sets of input values, etc.
- Using provenance, a user can trace the process that led to the aggregation of services producing a particular output.
53Overview
- Bioinformatics background
- myGrid facts
- Services and Workflows
- Provenance in myGrid
- Beyond myGrid Provenance
- Architectural vision
- Conclusions
54What is the problem?
- Provenance recording should be part of the infrastructure, so that users can elect to enable it when they execute their complex tasks over the Grid or in Web Services environments.
- Currently, the Web Services protocol stack and the Open Grid Services Architecture do not provide any support for recording provenance.
55Architectural Vision
56Architectural Vision
- Provenance gathering is a collaborative process that involves multiple entities, including the workflow enactment engine, the enactment engine's client, the service directory, and the invoked services.
- Provenance data will be submitted to one or more provenance repositories acting as storage for provenance data.
- Upon user request, some analysis, navigation and reasoning over provenance data can be undertaken.
57Architectural Vision
- Storage could be achieved by a provenance service.
- Provenance service would provide support for analysis, navigation or reasoning over provenance
- Client-side support for submitting provenance data to the provenance service.
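As one way to picture the vision above, here is a minimal in-memory sketch of a provenance service with a submit operation for clients and simple query operations for navigation and analysis. The class and method names (`ProvenanceService`, `submit`, `query`, `producers_of`) and the record fields are hypothetical. They are not the interface of any myGrid or PASOA component.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class InvocationRecord:
    """One service invocation as seen by a submitting party (hypothetical fields)."""
    session: str
    service: str
    inputs: Dict[str, str]
    outputs: Dict[str, str]
    submitted_by: str


@dataclass
class ProvenanceService:
    """Stores records and answers simple queries over them."""
    records: List[InvocationRecord] = field(default_factory=list)

    def submit(self, record: InvocationRecord) -> None:
        self.records.append(record)

    def query(self, session: str) -> List[InvocationRecord]:
        # Navigation: everything recorded for one workflow run.
        return [r for r in self.records if r.session == session]

    def producers_of(self, value: str) -> List[str]:
        # Analysis: which services produced a given output value?
        return [r.service for r in self.records if value in r.outputs.values()]


# Made-up example data: two runs submitted by different parties.
store = ProvenanceService()
store.submit(InvocationRecord("run-1", "blast", {"seq": "ACGT"}, {"hit": "P12345"}, "enactor"))
store.submit(InvocationRecord("run-1", "annotate", {"hit": "P12345"}, {"note": "kinase"}, "enactor"))
store.submit(InvocationRecord("run-2", "blast", {"seq": "TTGA"}, {"hit": "Q99999"}, "client"))
```

The `submitted_by` field reflects the idea that several parties (enactment engine, client, invoked services) may each contribute records for the same run.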
58A First Prototype (Szomszor, Moreau 03)
- A service-oriented architecture for provenance support in Grid and Web Services environments, based on the idea of a provenance service
- A client-side API for recording provenance data for Web Service invocation
- A data model for storing provenance data
- A server-side interface for querying provenance data
- Two components making use of provenance: provenance browsing and provenance validation.
59Prototype Overview
60Prototype Sequence Diagram
61Prototype Sequence Diagram
- To identify the interactions between provenance service, client-side library and enactment engine
- Creation of a session
- Need to be able to support the most complex workflows including conditional branching, iteration, recursion and parallel execution.
- Support asynchronous submission of provenance data so that provenance submission does not delay workflow execution.
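The last requirement, that provenance submission must not delay workflow execution, can be sketched with a background worker draining a queue. This is an illustrative Python sketch only: `AsyncRecorder`, `record` and `close` are hypothetical names, and the in-memory list stands in for a remote provenance service.

```python
import queue
import threading
from typing import List


class AsyncRecorder:
    """Client-side recorder: record() returns immediately, a worker submits later."""

    def __init__(self) -> None:
        self.session = "session-1"            # stands in for session creation
        self.stored: List[str] = []           # stands in for the remote service
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def record(self, event: str) -> None:
        # Non-blocking for the caller: just enqueue the event.
        self._queue.put(f"{self.session}:{event}")

    def _drain(self) -> None:
        while True:
            item = self._queue.get()
            if item is None:                  # sentinel: stop the worker
                break
            self.stored.append(item)          # a real client would send it here

    def close(self) -> None:
        # Flush everything queued so far, then stop the worker.
        self._queue.put(None)
        self._worker.join()


recorder = AsyncRecorder()
for step in ("invoke blast", "invoke annotate"):
    recorder.record(step)                     # the workflow carries on at once
recorder.close()
```

A single worker and a FIFO queue keep the events in submission order. A real client would also need a policy for failures and retries, which this sketch leaves out.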
62Prototype Provenance Data Model
63Prototype Provenance Data Model
- Must support recording of all information necessary to replay execution
- Must support all complex forms of workflows (recursion, iterations, parallel execution).
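One way to meet both requirements is to record each invocation as a node that points to its parent invocation, so that recursion, iteration and parallel branches all become a tree of nested records. The following sketch is an assumption about how such a model might look, not the prototype's actual schema. Every field and method name is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Invocation:
    """Hypothetical provenance node: enough detail to replay one call."""
    service: str
    inputs: dict
    output: object = None
    # Excluded from repr/eq so that parent <-> child links cannot recurse forever.
    parent: Optional["Invocation"] = field(default=None, repr=False, compare=False)
    children: List["Invocation"] = field(default_factory=list)

    def child(self, service: str, inputs: dict, output: object = None) -> "Invocation":
        node = Invocation(service, inputs, output, parent=self)
        self.children.append(node)
        return node

    def depth(self) -> int:
        # Nesting level: grows by one for each recursive or nested call.
        return 0 if self.parent is None else 1 + self.parent.depth()

    def replay_order(self) -> List[str]:
        # Depth-first listing of the services in recorded order.
        names = [self.service]
        for c in self.children:
            names.extend(c.replay_order())
        return names


root = Invocation("workflow", {"gene": "AC005412.6"})
loop = root.child("iterate", {"count": 2})
loop.child("align", {"i": 0})
loop.child("align", {"i": 1})          # second iteration of the same service
inner = root.child("recurse", {"n": 1}).child("recurse", {"n": 0})
```

Sibling nodes can stand for successive iterations or for parallel branches. Telling the two apart would need extra information such as timestamps, which this sketch does not record.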
64Prototype Provenance Browser
65Discussion
- In order for provenance data to be useful, we expect such a protocol to support some classical properties of distributed algorithms.
- Using mutual authentication, an invoked service can ensure that it submits data to a specific provenance server, and vice-versa, a provenance server can ensure that it receives data from a given service.
- With non-repudiation, we can retain evidence of the fact that a service has committed to executing a particular invocation and has produced a given result.
- We anticipate that cryptographic techniques will be useful to ensure such properties
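As a small illustration of the cryptographic techniques anticipated above, the sketch below authenticates a provenance submission with a keyed hash (HMAC) from the Python standard library. It rests on several assumptions: the shared key, the message layout and the function names are hypothetical, and a shared-key MAC only gives authentication and integrity between the two parties. Non-repudiation would need asymmetric digital signatures, which are not shown here.

```python
import hashlib
import hmac
import json

# Hypothetical key shared by one service and one provenance server.
SHARED_KEY = b"demo-key-not-for-production"


def sign(record: dict, key: bytes = SHARED_KEY) -> str:
    """Return a hex MAC over a canonical JSON encoding of the record."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify(record: dict, tag: str, key: bytes = SHARED_KEY) -> bool:
    """Constant-time check that the record matches its MAC."""
    return hmac.compare_digest(sign(record, key), tag)


record = {"service": "HGVBase_retrieve", "output": "SNP000010197"}
tag = sign(record)

# Changing any field after signing makes verification fail.
tampered = dict(record, output="SNP000000000")
ok = verify(record, tag)
bad = verify(tampered, tag)
```

Sorting the JSON keys gives both sides the same byte string to hash, so the check does not depend on dictionary ordering.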
66Towards Trust
67Towards Trust
- Using the provenance of data, trust metrics of the data can be derived from:
- Trust the user places in invoked services
- Trust the user places in the input data
- Trust the user places in the enacted workflow
- Trust the user places in the enactor
- Trust the user places in the provenance service.
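The slide does not say how these trust values are combined. Purely as an illustration, the sketch below assumes each factor is a number in [0, 1] and takes the product, so that any weakly trusted factor pulls the result down. The function name, the scale and the choice of product are all assumptions, not a metric proposed in the talk.

```python
from math import prod
from typing import Dict

# The five factors listed on the slide, under hypothetical key names.
FACTORS = ("services", "input_data", "workflow", "enactor", "provenance_service")


def data_trust(trust: Dict[str, float]) -> float:
    """Combine per-factor trust values in [0, 1] by multiplication (an assumption)."""
    for name in FACTORS:
        value = trust[name]                       # KeyError if a factor is missing
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return prod(trust[name] for name in FACTORS)


full = data_trust(dict.fromkeys(FACTORS, 1.0))
weak = data_trust({**dict.fromkeys(FACTORS, 1.0), "enactor": 0.5})
```

Other combinations (a minimum, or a weighted average) would be equally plausible. The point is only that provenance tells the user which services, inputs, workflow and enactor to look up trust values for.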
68- The purpose of the PASOA project is to investigate provenance in Grid architectures
- Funded by EPSRC under the fundamental computer science for e-Science call
- In collaboration with Cardiff
- www.pasoa.org
69Conclusion
- Provenance is a rather unexplored domain
- Strategic to bring trust in open environments
- Necessity to design a configurable architecture capable of supporting multiple requirements from very different application domains.
- Need to further investigate the algorithmic foundations of provenance, which will lead to scalable and secure industrial solutions.
70Publications
- [SM03] Martin Szomszor and Luc Moreau. Recording and reasoning over data provenance in web and grid services. In International Conference on Ontologies, Databases and Applications of SEmantics (ODBASE'03), volume 2888 of Lecture Notes in Computer Science, pages 603-620, Catania, Sicily, Italy, November 2003.
- [MCS03] Luc Moreau, Syd Chapman, Andreas Schreiber, Rolf Hempel, Omer Rana, Laszlo Varga, Ulises Cortes, and Steven Willmott. Provenance-based trust for grid computing - position paper. 2003.
- [GGS03] Mark Greenwood, Carole Goble, Robert Stevens, Jun Zhao, Matthew Addis, Darren Marvin, Luc Moreau, and Tom Oinn. Provenance of e-science experiments - experience from bioinformatics. In Proceedings of the UK OST e-Science second All Hands Meeting 2003 (AHM'03), pages 223-226, Nottingham, UK, September 2003.
71Acknowledgements
- The myGrid Southampton Team: Simon Miles, Juri Papay, Ananth Krishna, Michael Luck, David De Roure, Terry Payne
- Mark Greenwood, Carole Goble, Manchester
- Martin Szomszor, Southampton
- Syd Chapman, IBM
- Omer Rana, Cardiff
- Andreas Schreiber and Rolf Hempel, DLR
- Laszlo Varga, SZTAKI
- Ulises Cortes and Steven Willmott, UPC
72www.mygrid.org.uk