Title: An Identity Crisis in the Life Sciences
1An Identity Crisis in the Life Sciences
- Jun Zhao, Carole Goble and Robert Stevens
- The University of Manchester, UK
- Thanks to
- Tom Oinn, Matthew Pocock, Daniele Turi
- And our users
- And the EPSRC
2- UK e-Science project
- Middleware for in silico experiments by individual life scientists, stuck in under-resourced labs, who use other people's applications.
5Bioinformatics workflows
- Data pipelines
- Collect data
- Compute data
- Frequently updated public resources
- Open world
- The same data product turns up in different experimental contexts
Bioinformatician users
Taverna workflow workbench
collected metabolic pathway
computed BLAST report
computed BLAST report
6Workflow outcomes
- A record of outcome data and its provenance.
- Store each data outcome with a unique id, and link the outcomes together in a typed graph.
- In fact, store all provenance as a graph.
7Concept
Data
8Fusion between different data models using shared concepts and shared data
- Add assertions
- Add rules
- Reason over assertions
9Putting Provenance to Use
- Single workflow
- audit trail
- recipe
- Multiple workflow runs (versions)
- Aggregation - gathering
- Integration - merging
- Comparison - differencing
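A minimal sketch of the three multi-run operations, assuming each run's provenance is held as a set of (subject, predicate, object) triples. The node names, the edge direction, and the mapping of gathering, merging and differencing onto a dict, a set union and a set difference are illustrative assumptions, not the project's implementation.

```python
# Hypothetical provenance for two runs of the same workflow.
# Each graph is a set of (subject, predicate, object) triples.
run1 = {
    ("report-a", "derivedFrom", "seq-x"),
    ("list-a", "derivedFrom", "report-a"),
}
run2 = {
    ("report-a", "derivedFrom", "seq-x"),
    ("list-b", "derivedFrom", "report-a"),
}


def aggregate(runs):
    """Gathering: keep every run's graph side by side, keyed by run name."""
    return dict(runs)


def integrate(*graphs):
    """Merging: union the triples, so nodes with the same id join up."""
    merged = set()
    for graph in graphs:
        merged |= graph
    return merged


def compare(old, new):
    """Differencing: triples only in the old run, and only in the new run."""
    return old - new, new - old


gathered = aggregate({"Run1": run1, "Run2": run2})
merged = integrate(run1, run2)
removed, added = compare(run1, run2)
```

Note that the merge only joins two runs where both use the same identifier for the same data, which is why identity is the point of commonality these operations rely on.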
10Any idea?
Life Science Identifier
A ruddy great lump of RDF
11URIs for Data
urn:lsid:mygrid.ac.uk:data:498411
- Life Science Identifier
- Protocol for allocation and resolution
- Adopted by a range of data providers
- LSIDs in the data providers' databases, for the data we collect during workflow execution
- LSIDs for the data products we compute during workflow execution
http://www.omg.org/cgi-bin/doc?lifesci/2003-12-02
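As a small illustration of what an LSID carries, the sketch below splits one into its parts. It assumes the LSID syntax `urn:lsid:authority:namespace:object` with an optional trailing revision, and uses the slide's example identifier with its colon separators written out. The `LSID` type and `parse_lsid` function are hypothetical names, and the sketch covers only the syntax, not the allocation or resolution protocol.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text: str) -> LSID:
    """Split urn:lsid:authority:namespace:object[:revision] into its parts."""
    parts = text.split(":")
    # The "urn:lsid" prefix is matched case-insensitively.
    if len(parts) not in (5, 6) or [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        raise ValueError(f"not an LSID: {text!r}")
    if any(p == "" for p in parts[2:5]):
        raise ValueError(f"empty LSID component in {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return LSID(parts[2], parts[3], parts[4], revision)


lsid = parse_lsid("urn:lsid:mygrid.ac.uk:data:498411")
```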
12RDF provenance graph
my: http://www.mygrid.org.uk/provenance
tav: taverna.sourceforge.net
urnlsidtavbrpt1
my:derivedFrom
my:derivedFrom
urnlsidtavseqcollection1
urnlsidtavseq1
my:hasElement
A graph, with URIs for resources as the nodes,
and their provenance relationships as the edges
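One way to picture such a graph without an RDF library is a list of (subject, predicate, object) triples. The sketch below is illustrative only: the `urn:example:` node identifiers are stand-ins, and the direction chosen for each edge is an assumption, since the transcript does not preserve the arrows.

```python
from collections import defaultdict

# Illustrative triples; node ids and edge directions are assumptions.
triples = [
    ("urn:example:seqcollection1", "my:derivedFrom", "urn:example:brpt1"),
    ("urn:example:seqcollection1", "my:hasElement", "urn:example:seq1"),
    ("urn:example:brpt1", "my:derivedFrom", "urn:example:query1"),
]


def index_by_subject(triples):
    """Group outgoing (predicate, object) pairs under each subject node."""
    edges = defaultdict(list)
    for subject, predicate, obj in triples:
        edges[subject].append((predicate, obj))
    return edges


def lineage(node, triples):
    """Return every node reachable from `node` along my:derivedFrom edges."""
    edges = index_by_subject(triples)
    seen, stack = set(), [node]
    while stack:
        current = stack.pop()
        for predicate, obj in edges.get(current, []):
            if predicate == "my:derivedFrom" and obj not in seen:
                seen.add(obj)
                stack.append(obj)
    return seen


ancestors = lineage("urn:example:seqcollection1", triples)
```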
13Having a BLAST in every workflow!
Seq
database
score
BLAST
BlastReport
BLAST_simplifer
A list of Sequences
GenBank_retrieve
GenBank Report
14Alignment of sequence AC005089
15Data products in each run
- Computed data product
- BLAST report
- GenBank report
- Collected data product
- Sequence contained within the content of a BLAST report
- Sequence extracted by the simplifier service
- Collection and Atomic
[Diagram: a BLAST Report (computed data) contains 1..m Sequences (collected data); a SEQ is aListOf 1..m Sequences]
16Computed Collections and Collected data items
[Diagram labels: BLAST Report (x3), Sequence1 (x3), Sequence2 (x3), Sequence3 (x2), Sequence4, BLAST simplifer (x2), SEQ (x2), listOf (x2)]
17Data Co-references
[Diagram labels: BLAST Report (x2), Sequence1 (x2), Sequence2 (x2), Sequence3, Sequence4, BLAST simplifer, SEQ, listOf]
18Aggregation of repeated run
Run2
Run1
rdf:type
rdf:type
urnlsidtav57b6
DNASeq
urnlsidtavic531
derivedFrom
derivedFrom
rdf:type
BLASTReport
rdf:type
urnlsidtav57b13
urnlsidtavic537
derivedFrom
derivedFrom
urnlsidtav57b14
urnlsidtavic538
refersTo
refersTo
refersTo
rdf:type
DNASeq
AC005089
20External Duplicates
Sequence
gi15145617
Different providers
ac073846
A replica
urnlsidmygac073846
Different tool providers
mmu11423
21LSID Assignment Process
Taverna LSID Authority
Data service
Data storage group
BAKLAVA
MySQL relational stores
Customized DB
Customized DB
Workflow enactor
Provenance service
wfEvents
Equivalent data in repeated runs: duplicate ids for these data
KAVE
Jena/Sesame RDF store
External domain service
22Provenance from two repeated runs
No convergence
urnlsidtavbrpt1
my:derivedFrom
my:derivedFrom
urnlsidtavseqcollection1
urnlsidtavseq1
my:hasElement
Run1
urnlsidtavbrpt2
my:derivedFrom
my:derivedFrom
urnlsidtavseqcollection2
urnlsidtavseq2
my:hasElement
Run2
23Duplicated identities in these two runs
BLAST Report
SEQ
Sequence
24Execution duplicates
urngbseq1
Sequence1
Sequence1
urngbseq1
BLAST report
BLAST report
urnlsidtavbrpt1
urnlsidtavbrpt2
25Execution duplicates
A list of Seq
BLAST
BLAST_simplifer
BlastReport
GenBank_retrieve
SEQ1
Sequence1
listOf
urntavseqc1
urntavseq1
Sequence2
Sequence3
urngbseq1
urntavseqc2
urntavseq2
Sequence1
SEQ1
listOf
Sequence2
Sequence3
26Execution duplicates 3
- Collection data computed by iterations,
- e.g. a list of GenBank reports from GenBank_retrieve
- Nested collected data products,
- e.g. the species data object in the sequence data
aListOf
1
GBRPT
1..m
GBrpt
1..m
aListOf
1
SEQ
Seq
1
isOf
1
Species
27Migration duplicates
Seq1
C/MyDocuments/ WBS/Run2/
28Managing identity co-reference
- Identity co-reference
- Identifying duplicate identities that refer to the same object, while keeping their context
- An approach
- An IDSet entity
- Identity equivalence for collected data
- Identity correspondence for computed data
- An identity service
- Identity normalisation and cleansing activity
29IDSet entity
- IDSet(BLASTrpt) = {urntavbrpt1, urntavbrpt3}
urngbseq1
Sequence
Query by its content
urnlsidtavbrpt1
BLASTreport
Query by its identity
IDSet created by another organization
IDSet1
IDSet3
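A minimal sketch of what an IDSet entity might hold, assuming it groups identifiers that denote the same data object and remembers where each identifier was seen. The field names, the `relation` label ("equivalence" for collected data, "correspondence" for computed data) and the `urn:example:` identifiers are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class IDSet:
    """A group of identifiers taken to denote the same data object."""
    kind: str        # e.g. "BLASTrpt" or "Sequence"
    relation: str    # "equivalence" (collected) or "correspondence" (computed)
    ids: set = field(default_factory=set)
    context: dict = field(default_factory=dict)  # identifier -> where it was seen

    def add(self, identifier, seen_in):
        """Record an identifier together with the context it came from."""
        self.ids.add(identifier)
        self.context[identifier] = seen_in

    def __contains__(self, identifier):
        return identifier in self.ids


blast_reports = IDSet(kind="BLASTrpt", relation="correspondence")
blast_reports.add("urn:example:brpt1", seen_in="Run1")
blast_reports.add("urn:example:brpt3", seen_in="Run3")
```

Keeping the context per identifier means the duplicates can be recognised as one object without discarding which run or provider each one came from.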
30Extended Architecture
Data service
Data storage group
BAKLAVA
Taverna LSID Authority
MySQL relational stores
Customized DB
Customized DB
Workflow enactor
Provenance service
wfEvents
External domain service
31Identifying collected product
[Diagram labels: KAVE; Identity service; Identity store; IDSet 1 holding urngbseq1]
1. Receive an identity
2. Look for or create its IDSet
3. Store the id and the IDSet
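The three steps on this slide (receive an identity, look for or create its IDSet, store the id and the IDSet) can be sketched as an in-memory service. This is an assumption-laden illustration: the class and method names are hypothetical, the identifiers are stand-ins, and matching duplicates by a SHA-256 fingerprint of the content is one possible reading of "query by its content", not the documented mechanism.

```python
import hashlib


class IdentityService:
    """In-memory stand-in for the identity service and its identity store."""

    def __init__(self):
        self._idset_of = {}    # identifier -> IDSet key
        self._members = {}     # IDSet key -> set of identifiers
        self._by_content = {}  # content fingerprint -> IDSet key

    def register(self, identifier, content):
        # Step 1: receive an identity (plus the content used to spot duplicates).
        if identifier in self._idset_of:
            return self._idset_of[identifier]
        # Step 2: look for an IDSet with the same content, or create one.
        fingerprint = hashlib.sha256(content.encode("utf-8")).hexdigest()
        key = self._by_content.get(fingerprint)
        if key is None:
            key = f"IDSet{len(self._members) + 1}"
            self._by_content[fingerprint] = key
            self._members[key] = set()
        # Step 3: store the id and the IDSet.
        self._idset_of[identifier] = key
        self._members[key].add(identifier)
        return key

    def members(self, key):
        return frozenset(self._members[key])


service = IdentityService()
first = service.register("urn:example:gb:seq1", "ACGT")
replica = service.register("urn:example:replica:seq1", "ACGT")
other = service.register("urn:example:gb:seq2", "TTTT")
```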
32Identifying a collection product
[Diagram labels: KAVE; Identity service; Identity store; IDSet holding urnlsidseqc1 and urnlsidseqc2]
- Receive an identity
- Look for or create its IDSet
- Store the id and the IDSet
- Look for equivalent collection
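For the "look for equivalent collection" step, one possible test is sketched below: two collections are treated as equivalent when their members, taken in order, belong to the same IDSets. That criterion, the function names and the identifiers are assumptions made for illustration; the slides do not say how equivalence between collections is decided.

```python
def collection_signature(member_ids, idset_of):
    """Describe a collection by the IDSets of its members, in list order."""
    return tuple(idset_of[member] for member in member_ids)


def find_equivalent_collection(new_id, member_ids, idset_of, known):
    """Return the id of a stored collection with the same signature, if any.

    `known` maps a signature to the first collection id seen with it, and is
    updated when no equivalent collection is found.
    """
    signature = collection_signature(member_ids, idset_of)
    match = known.get(signature)
    if match is None:
        known[signature] = new_id
    return match


# Hypothetical member identities from two runs, already assigned to IDSets.
idset_of = {
    "run1:seq1": "IDSet1", "run1:seq2": "IDSet2",
    "run2:seq1": "IDSet1", "run2:seq2": "IDSet2",
}
known = {}
first = find_equivalent_collection("seqc1", ["run1:seq1", "run1:seq2"], idset_of, known)
second = find_equivalent_collection("seqc2", ["run2:seq1", "run2:seq2"], idset_of, known)
reordered = find_equivalent_collection("seqc3", ["run2:seq2", "run2:seq1"], idset_of, known)
```

Because the signature is order-sensitive, a reordered list counts as a different collection here; an unordered comparison would be an equally defensible choice.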
33Putting the Identity Service to Use
Provenance Integration
Run1
Run2
b1
s1
b2
Provenance Aggregation
s2
c1
c2
Provenance Normalization
Identity Management
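A small sketch of how identity management could feed normalization and integration, reusing the node labels b1, s1, c1 and b2, s2, c2 from the diagram. The edge directions, the IDSet groupings and the choice of the alphabetically smallest id as the canonical one are all assumptions for illustration.

```python
def normalise(triples, canonical):
    """Rewrite every node to its canonical id; unknown ids pass through."""
    return {
        (canonical.get(s, s), p, canonical.get(o, o))
        for s, p, o in triples
    }


# Hypothetical provenance from two repeated runs, with run-specific ids.
run1 = {("b1", "derivedFrom", "s1"), ("c1", "derivedFrom", "b1")}
run2 = {("b2", "derivedFrom", "s2"), ("c2", "derivedFrom", "b2")}

# Hypothetical IDSets as an identity service might report them.
idsets = [{"b1", "b2"}, {"s1", "s2"}, {"c1", "c2"}]
canonical = {identifier: min(ids) for ids in idsets for identifier in ids}

raw = run1 | run2  # no shared nodes, so the two runs stay disconnected
merged = normalise(run1, canonical) | normalise(run2, canonical)
```

Here `raw` keeps four unrelated triples, while `merged` collapses to the two triples of a single run. Whether corresponding computed products should be collapsed like this, rather than only linked, is a design choice this sketch does not settle.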
34Discussion
- Scalability issues
- Normalizing provenance graphs
- Building IDSet for collections with multiple hierarchies
- Open world, data type-free context
- Use experimental context more effectively: workflows are not independently executed.
- Granularity of identity
- Identity aware operations in workflow
- Multiple naming schemes
- Migration duplicates
- Compacting data results
35Conclusion
- Combining provenance depends on finding points of commonality, such as data identity.
- Duplicate identities will occur in an open world
- Hard to achieve uniqueness without community commitment
- Different types of equivalent objects
- How much can be avoided?
- And how much has to be repaired?