An Identity Crisis in the Life Sciences - PowerPoint PPT Presentation

1
An Identity Crisis in the Life Sciences
  • Jun Zhao, Carole Goble and Robert Stevens
  • The University of Manchester, UK
  • Thanks to
  • Tom Oinn, Matthew Pocock, Daniele Turi
  • And our users
  • And the EPSRC

2
  • UK e-Science project
  • Middleware for in silico experiments by
    individual life scientists, stuck in
    under-resourced labs, who use other people's
    applications.

5
Bioinformatics workflows
  • Data pipelines
  • Collect data
  • Compute data
  • Frequently updated public resources
  • Open world
  • The same data product turns up in different
    experimental contexts

Bioinformatician users
Taverna workflow workbench
collected metabolic pathway
computed BLAST report
computed BLAST report
6
Workflow outcomes
  • A record of outcome data and its provenance.
  • Store each data outcome under a unique id and
    link the outcomes together in a typed graph.
  • In fact, store all provenance as a graph

7
Concept
Data
8
Fusion between different data models using shared
concepts and shared data
Add assertions, add rules, reason over assertions
9
Putting Provenance to Use
  • Single workflow
  • audit trail
  • recipe
  • Multiple workflow runs (versions)
  • Aggregation - gathering
  • Integration - merging
  • Comparison - differencing
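
One way to picture the three multi-run operations is on provenance held as sets of (subject, predicate, object) statements. The sketch below is only an illustration under that assumption: the function names and the identifiers `report-a` and `report-b` are hypothetical, and it is not taken from Taverna or myGrid.

```python
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]            # (subject, predicate, object)
Graph = Set[Triple]


def aggregate(runs: Dict[str, Graph]) -> Set[Tuple[str, str, str, str]]:
    """Gathering: keep every statement, tagged with the run it came from."""
    return {(run, s, p, o) for run, graph in runs.items() for (s, p, o) in graph}


def integrate(runs: Dict[str, Graph]) -> Graph:
    """Merging: one graph in which identical statements collapse."""
    merged: Graph = set()
    for graph in runs.values():
        merged |= graph
    return merged


def compare(left: Graph, right: Graph) -> Tuple[Graph, Graph]:
    """Differencing: statements found only on the left, only on the right."""
    return left - right, right - left


# Hypothetical identifiers, for illustration only.
runs = {
    "Run1": {
        ("report-a", "rdf:type", "BLASTReport"),
        ("report-a", "refersTo", "AC005089"),
        ("AC005089", "rdf:type", "DNASeq"),
    },
    "Run2": {
        ("report-b", "rdf:type", "BLASTReport"),
        ("report-b", "refersTo", "AC005089"),
        ("AC005089", "rdf:type", "DNASeq"),
    },
}

only_run1, only_run2 = compare(runs["Run1"], runs["Run2"])
```

Only the statement about AC005089 collapses when the runs are merged. The two reports carry different identifiers, so they stay apart even if their content is the same.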

10
Any idea?
  • 30350027
  • 30350027
  • gi:30350027

Life Science Identifier
A ruddy great lump of RDF
11
URIs for Data: urn:lsid:mygrid.ac.uk:data:498411
  • Life Science Identifier
  • Protocol for allocation and resolution
  • Adopted by a range of data providers
  • LSIDs from the data providers' databases, for
    the data we collect during workflow execution
  • LSIDs for the data products we compute during
    workflow execution

http://www.omg.org/cgi-bin/doc?lifesci/2003-12-02
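
An LSID has the form `urn:lsid:<authority>:<namespace>:<object>`, with an optional trailing `:<revision>`. The sketch below only splits a string into those parts; it does not attempt the resolution protocol. The `LSID` type and `parse_lsid` function are hypothetical names, and the example identifier is the one from the slide title with its colons assumed to have been lost in transcription.

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str] = None


def parse_lsid(text: str) -> LSID:
    """Split 'urn:lsid:authority:namespace:object[:revision]' into its parts."""
    parts = text.split(":")
    if len(parts) not in (5, 6):
        raise ValueError(f"not an LSID: {text!r}")
    if parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {text!r}")
    if any(part == "" for part in parts[2:5]):
        raise ValueError(f"empty LSID component in {text!r}")
    revision = parts[5] if len(parts) == 6 else None
    return LSID(parts[2], parts[3], parts[4], revision)


# Identifier spelling assumed from the slide title.
example = parse_lsid("urn:lsid:mygrid.ac.uk:data:498411")
```

A bare number such as 30350027 carries none of this structure, which is why it cannot be interpreted without extra context.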
12
RDF provenance graph
my: http://www.mygrid.org.uk/provenance
tav: taverna.sourceforge.net
urn:lsid:tav:brpt1
my:derivedFrom
my:derivedFrom
urn:lsid:tav:seqcollection1
urn:lsid:tav:seq1
my:hasElement
A graph, with URIs for resources as the nodes,
and their provenance relationships as the edges
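
As a rough illustration of such a graph, the sketch below stores it as plain tuples and walks the derivation edges to answer "what was this derived from?". The edge directions are assumptions, since the transcript lists the labels without the arrows, and `lineage` is a hypothetical helper name.

```python
from typing import Set, Tuple

Triple = Tuple[str, str, str]

# Edge directions are assumed; the transcript gives only the labels.
graph: Set[Triple] = {
    ("urn:lsid:tav:seqcollection1", "my:derivedFrom", "urn:lsid:tav:brpt1"),
    ("urn:lsid:tav:seq1", "my:derivedFrom", "urn:lsid:tav:brpt1"),
    ("urn:lsid:tav:seqcollection1", "my:hasElement", "urn:lsid:tav:seq1"),
}


def lineage(graph: Set[Triple], node: str,
            predicate: str = "my:derivedFrom") -> Set[str]:
    """Return everything `node` was derived from, directly or transitively."""
    seen: Set[str] = set()
    stack = [node]
    while stack:
        current = stack.pop()
        for s, p, o in graph:
            if s == current and p == predicate and o not in seen:
                seen.add(o)
                stack.append(o)
    return seen


collection_sources = lineage(graph, "urn:lsid:tav:seqcollection1")
```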
13
Having a BLAST in every workflow!
Seq
database
score
BLAST
BlastReport
BLAST_simplifer
A list of Sequences
GenBank_retrieve
GenBank Report
14
Alignment of sequence AC005089
15
Data products in each run
  • Computed data product
  • BLAST report
  • GenBank report
  • Collected data product
  • Sequence contained within the content of a BLAST
    report
  • Sequence extracted by the simplifier service
  • Collection and Atomic

Computed data: BLAST Report, SEQ
Collected data: Sequence
BLAST Report (1) contains Sequence (1..m)
SEQ (1) aListOf Sequence (1..m)
16
Computed Collections and Collected data items
BLAST Report (three instances)
Sequence1 (three times), Sequence2 (three times), Sequence3 (twice), Sequence4 (once)
BLAST simplifier (two instances)
SEQ (two instances), each a listOf sequences
17
Data Co-references
BLAST Report (two instances)
Sequence1 (twice), Sequence2 (twice), Sequence3 (once), Sequence4 (once)
BLAST simplifier
SEQ, a listOf sequences
18
Aggregation of repeated run
Runs: Run1, Run2
Nodes: urn:lsid:tav:57b6, urn:lsid:tav:57b13, urn:lsid:tav:57b14,
urn:lsid:tav:ic531, urn:lsid:tav:ic537, urn:lsid:tav:ic538
Types (rdf:type): DNASeq, BLASTReport
Edges: derivedFrom, refersTo
Referenced sequence: AC005089 (rdf:type DNASeq)
20
External Duplicates
Sequence
gi:15145617
Different providers
ac073846
A replica
urn:lsid:myg:ac073846
Different tool providers
mmu11423
21
LSID Assignment Process
Taverna LSID Authority
Data service
Data storage group
BAKLAVA
MySQL relational stores
Customized DB
Customized DB
Workflow enactor
Provenance service
wfEvents
Equivalent data in repeated runs: duplicate ids
for these data
KAVE
Jena/Sesame RDF store
External domain service
22
Provenance from two repeated runs
No convergence
urn:lsid:tav:brpt1
my:derivedFrom
my:derivedFrom
urn:lsid:tav:seqcollection1
urn:lsid:tav:seq1
my:hasElement
Run1
urn:lsid:tav:brpt2
my:derivedFrom
my:derivedFrom
urn:lsid:tav:seqcollection2
urn:lsid:tav:seq2
my:hasElement
Run2
23
Duplicated identities in these two runs
BLAST Report
SEQ
Sequence
24
Execution duplicates
urn:gb:seq1
Sequence1
Sequence1
urn:gb:seq1
BLAST report
BLAST report
urn:lsid:tav:brpt1
urn:lsid:tav:brpt2
25
Execution duplicates
A list of Seq
BLAST
BLAST_simplifer
BlastReport
GenBank_retrieve
SEQ1
Sequence1
listOf
urn:tav:seqc1
urn:tav:seq1
Sequence2
Sequence3
urn:gb:seq1
urn:tav:seqc2
urn:tav:seq2
Sequence1
SEQ1
listOf
Sequence2
Sequence3
26
Execution duplicates 3
  • collection data computed by iterations,
  • e.g. a list of GenBank reports from
    GenBank_retrieve
  • nested collected data products,
  • e.g. the species data object in the sequence data

GBRPT (1) aListOf GBrpt (1..m)
SEQ (1) aListOf Seq (1..m)
Seq (1) isOf Species (1)
27
Migration duplicates
Seq1
C:/MyDocuments/WBS/Run2/
28
Managing identity co-reference
  • Identity co-reference
  • Identifying duplicate identities that refer to
    the same object, while keeping their context
  • An approach
  • An IDSet entity
  • Identity equivalence for collected data
  • Identity correspondence for computed data
  • An identity service
  • Identity normalisation and cleansing activity
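
A minimal sketch of what an IDSet could look like in code, assuming it is simply the set of identifiers known to refer to the same object. The class and method names are hypothetical, the identifier spellings are assumed, and the sketch ignores the distinction between equivalence and correspondence. The original identifiers are grouped rather than rewritten, so each keeps its context.

```python
from typing import Dict, FrozenSet, Set


class IDSetRegistry:
    """Groups identifiers that refer to the same object (hypothetical sketch)."""

    def __init__(self) -> None:
        # identifier -> the shared set of identifiers it belongs to
        self._set_of: Dict[str, Set[str]] = {}

    def add(self, identifier: str) -> Set[str]:
        """Return the IDSet of `identifier`, creating a singleton if needed."""
        return self._set_of.setdefault(identifier, {identifier})

    def link(self, first: str, second: str) -> None:
        """Record that two identifiers co-refer and merge their IDSets."""
        keep, absorb = self.add(first), self.add(second)
        if keep is absorb:
            return
        keep |= absorb
        for member in absorb:
            self._set_of[member] = keep

    def members(self, identifier: str) -> FrozenSet[str]:
        """All identifiers known to co-refer with `identifier`."""
        return frozenset(self._set_of.get(identifier, {identifier}))


registry = IDSetRegistry()
registry.link("urn:tav:brpt1", "urn:tav:brpt3")   # identifier spellings assumed
```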

29
IDSet entity
  • IDSet(BLASTrpt): urn:tav:brpt1,
    urn:tav:brpt3

urn:gb:seq1
Sequence
Query by its content
urn:lsid:tav:brpt1
BLASTreport
Query by its identity
IDSet created by another organization
IDSet1
IDSet3
30
Extended Architecture
Data service
Data storage group
BAKLAVA
Taverna LSID Authority
MySQL relational stores
Customized DB
Customized DB
Workflow enactor
Provenance service
wfEvents
External domain service
31
Identifying collected product
KAVE
Identity service
Identity store
IDSet 1: urn:gb:seq1
Receive an identity
Look for or create its IDSet
Store the id and the IDSet
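
The receive, look-up-or-create, and store steps might be sketched as follows. This assumes a collected product arrives with its provider's identifier, so the same identifier seen in a later run lands in the same IDSet. `IdentityService`, `register`, and the integer IDSet labels are hypothetical, and the identifier spellings are assumed.

```python
from typing import Dict


class IdentityService:
    """Look-up-or-create of IDSets for collected products (hypothetical sketch)."""

    def __init__(self) -> None:
        self.store: Dict[str, int] = {}   # identity store: identifier -> IDSet label
        self._next_label = 1

    def register(self, identifier: str) -> int:
        # Receive an identity, then look for its IDSet.
        label = self.store.get(identifier)
        if label is None:
            # None found: create a new IDSet and store the id with it.
            label = self._next_label
            self._next_label += 1
            self.store[identifier] = label
        return label


service = IdentityService()
first_seen = service.register("urn:gb:seq1")    # first run
seen_again = service.register("urn:gb:seq1")    # repeated run, same provider id
other = service.register("urn:gb:seq2")
```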
32
Identifying a collection product
KAVE
Identity service
Identity store
IDSet: urn:lsid:seqc1, urn:lsid:seqc2
Receive an identity
Look for or create its IDSet
Store the id and the IDSet
Look for equivalent collection
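
For a collection, the identifier is minted afresh in every run, so the look-up has to go through the elements. The sketch below assumes two collections are equivalent when their elements map to the same IDSets in the same order. That criterion is an assumption made here, not something the slide states, and all names are hypothetical.

```python
from typing import Dict, List, Set, Tuple


class CollectionIdentityStore:
    """Groups collection ids whose elements resolve to the same IDSets."""

    def __init__(self, element_idset: Dict[str, int]) -> None:
        self.element_idset = element_idset                  # element id -> IDSet label
        self.by_signature: Dict[Tuple[int, ...], Set[str]] = {}

    def register(self, collection_id: str, element_ids: List[str]) -> Set[str]:
        # The signature is the ordered tuple of the elements' IDSet labels.
        # An element with no IDSet raises KeyError.
        signature = tuple(self.element_idset[e] for e in element_ids)
        # Look for an equivalent collection, or create a new IDSet.
        idset = self.by_signature.setdefault(signature, set())
        idset.add(collection_id)
        return idset


elements = {"urn:gb:seq1": 1, "urn:gb:seq2": 2, "urn:gb:seq3": 3}   # assumed ids
store = CollectionIdentityStore(elements)
members = ["urn:gb:seq1", "urn:gb:seq2", "urn:gb:seq3"]
store.register("urn:lsid:seqc1", members)            # Run1
idset = store.register("urn:lsid:seqc2", members)    # Run2, same content
```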
33
Putting the Identity Service to Use
Provenance Integration
Provenance Aggregation
Provenance Normalization
Identity Management
Run1: b1, s1, c1
Run2: b2, s2, c2
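
To see why normalisation matters for integration, consider the sketch below. It assumes the identity service has already decided that b2, s2 and c2 co-refer with b1, s1 and c1, and it rewrites every node to a canonical representative before merging. The edge labels, the edge directions and the `normalise` helper are assumptions made for illustration.

```python
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]


def normalise(graph: Set[Triple], canonical: Dict[str, str]) -> Set[Triple]:
    """Replace each node by its canonical representative, if it has one."""
    return {(canonical.get(s, s), p, canonical.get(o, o)) for (s, p, o) in graph}


# b = BLAST report, s = sequence, c = collection (assumed reading of the labels).
run1 = {("c1", "derivedFrom", "b1"), ("c1", "hasElement", "s1")}
run2 = {("c2", "derivedFrom", "b2"), ("c2", "hasElement", "s2")}

# Assumed output of identity management: one representative per IDSet.
canonical = {"b2": "b1", "s2": "s1", "c2": "c1"}

merged_raw = run1 | run2
merged_normalised = normalise(run1, canonical) | normalise(run2, canonical)
```

Without normalisation the merged graph holds two disconnected copies of the same structure. With it, the two runs converge on shared nodes. The normalised graph is a derived view, and the original identifiers remain available in the IDSets.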
34
Discussion
  • Scalability issues
  • Normalizing provenance graphs
  • Building IDSet for collections with multiple
    hierarchies
  • Open world: data-type-free context
  • Use experimental context more effectively:
    workflows are not executed independently.
  • Granularity of identity
  • Identity aware operations in workflow
  • Multiple naming schemes
  • Migration duplicates
  • Compacting data results

35
Conclusion
  • Combining provenance depends on finding points
    of commonality, such as data identity.
  • Duplicate identities will occur in an open world
  • Hard to achieve uniqueness without community
    commitment
  • Different types of equivalent objects
  • How much can be avoided?
  • And how much has to be repaired?