Title:
1Tomorrow, and tomorrow, and tomorrow: the players on the curation stage
- Chris Rusbridge
- Presentation at OCLC
2- "To-morrow, and to-morrow, and to-morrow,
- Creeps in this petty pace from day to day,
- To the last syllable of recorded time
- And all our yesterdays have lighted fools
- The way to dusty death.
- Out, out, brief candle!
- Life's but a walking shadow, a poor player,
- That struts and frets his hour upon the stage,
- And then is heard no more: it is a tale
- Told by an idiot, full of sound and fury,
- Signifying nothing."
6Contents
- Curation and the Digital Curation Centre
- Science and Data Citations
- The poor players of data curation
- Sustainability of curated data
- Macbeth again
7Curation
- Data increasingly important as evidence
- Experimental verifiability (the basis of science)
- Unrepeatable observations & experiments (particularly environmental in broadest sense)
- Legal, compliance, transactions
- Cultural resources
- Preservation view vs Publishing view
8Lynch remarks
- Closing the Curation Conference
- 3 views of digital curation
- Finite process, handover to preservation
- Whole life process, evolving object(s)
- Collection as a living thing
9Digital curation?
For later use
Static
Digital preservation
10Digital curation?
For later use
In use now (and the future)
Static
Dynamic Long-term
Digital preservation
Digital curation
11Digital curation
For later use
In use now (and the future)
Static
Dynamic Long-term
Digital curation preservation
maintaining and adding value to a trusted body
of digital information for current and future
use
12Mission
- The over-riding purpose of the DCC is to
support and promote continuing improvement in the
quality of data curation, and of associated
digital preservation
13Organisation to Engage & Collaborate
curation organisations eg DPC
communities of practice users
community support outreach
service definition delivery
management admin support
Associates Network
research collaborators
research
development co-ordination
testbeds tools
Industry
standards bodies
14Organisation to Engage & Collaborate: Leads
curation organisations eg DPC
communities of practice users
Bath
Associates Network
research collaborators
Glasgow
Edinburgh
Edinburgh
CCLRC
testbeds tools
Industry
standards bodies
15Associated work
- DCC LOCKSS Technical Support Service
- (Lots of Copies Keep Stuff Safe)
- DCC SCARP Project
- Disciplinary approaches to sharing, curation, re-use and preservation
- EU projects associated
- CASPAR
- Digital Preservation Europe
- PLANETS
16Phase 2
- Externally-moderated, reflective self-evaluation completed
- Phase 2 proposal (2007/10) to JISC
- Accepted: focus on science data, reduced scale
- EPSRC-funded Research continues until 2007/8
172nd International Digital Curation Conference
- Research & invited presentations
- Glasgow, 21/22 November, 2006
- Please register at http://www.dcc.ac.uk/events/dcc-2006/
19Data resource stages
- Curated data is created
- Observations? Fixed!
- Or Acquired
- Data brought/bought from outside
- Ingest
- Development
- Derived, refined, combined, processed data
- Potentially many stages
20TWOMASS (Infrared)
SDSS (Visual)
Slide from Rajendra Bose
21Slide from Rajendra Bose
22New discovery
- National Virtual Observatory
- Johns Hopkins press release: Scientists working
to create the NVO, an online portal for
astronomical research unifying dozens of large
astronomical databases, confirmed discovery of
a new brown dwarf recently. The star emerged
from a computerized search of information on
millions of astronomical objects in two separate
astronomical databases. Thanks to an NVO
prototype, that search, formerly an endeavor
requiring weeks or months of human attention,
took approximately two minutes.
23Context
- Data meaningless without context
- Linkage
- Metadata of many kinds
- Workflow!
- Provenance
- Computational lineage
- Authenticity
24NASA
research group3
University research group1
local decision-making body
University research group2
Slide from Rajendra Bose
25Access and re-use
- Ethics and rights control access
- Weak in expressing this long-term
- Collaboration tools
- Annotation, discussion, review
- Re-use leading to change and development
- Publication
- Not just in print
- Underlying data should be published, too
- Citation
26CLADDIER citation investigation
- My last example was an MST data set held at the BADC, and I was suggesting something like this (for a citation)
- <Citation><Author> Natural Environment Research Council </Author>
- <Title> Mesosphere-Stratosphere-Troposphere Radar at Aberystwyth </Title>
- <Medium> Internet </Medium>
- <Publisher> British Atmospheric Data Centre (BADC) </Publisher>
- <PublicationDate status="ongoing"> 1990</PublicationDate>
- <Identifier> badc.nerc.ac.uk/data/mst/v3/upd15032006</Identifier>
- <Feature><FeatureType>http://featuretype.registry/verticalProfile</FeatureType><LocalID>200409031205</LocalID></Feature>
- <AccessDate> Sep 21 2006 </AccessDate>
- <AvailableAt><url>http://badc.nerc.ac.uk/data/mst/v3/</url></AvailableAt>
- </Citation>
- (Made up tags!)
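As a minimal sketch, the citation record above could be assembled and read back with Python's standard `xml.etree.ElementTree`. The tag names are the slide's own made-up tags, not a real schema, and the helper name `build_citation` is hypothetical.

```python
import xml.etree.ElementTree as ET


def build_citation() -> ET.Element:
    """Assemble the slide's illustrative citation (made-up tags, no real schema)."""
    cit = ET.Element("Citation")
    ET.SubElement(cit, "Author").text = "Natural Environment Research Council"
    ET.SubElement(cit, "Title").text = (
        "Mesosphere-Stratosphere-Troposphere Radar at Aberystwyth"
    )
    ET.SubElement(cit, "Medium").text = "Internet"
    ET.SubElement(cit, "Publisher").text = "British Atmospheric Data Centre (BADC)"
    ET.SubElement(cit, "PublicationDate", status="ongoing").text = "1990"
    ET.SubElement(cit, "Identifier").text = "badc.nerc.ac.uk/data/mst/v3/upd15032006"
    # The Feature element pins the citation to one part of the dataset.
    feature = ET.SubElement(cit, "Feature")
    ET.SubElement(feature, "FeatureType").text = (
        "http://featuretype.registry/verticalProfile"
    )
    ET.SubElement(feature, "LocalID").text = "200409031205"
    ET.SubElement(cit, "AccessDate").text = "Sep 21 2006"
    available = ET.SubElement(cit, "AvailableAt")
    ET.SubElement(available, "url").text = "http://badc.nerc.ac.uk/data/mst/v3/"
    return cit


citation = build_citation()
xml_text = ET.tostring(citation, encoding="unicode")

# Round-trip: parse the serialised text and pull out the parts a reader needs.
parsed = ET.fromstring(xml_text)
local_id = parsed.findtext("Feature/LocalID")
status = parsed.find("PublicationDate").get("status")
url = parsed.findtext("AvailableAt/url")
```

Keeping the identifier, the feature's local ID and the access location as separate elements is what lets a publisher commit to each of them independently.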
27CLADDIER 2: Version of record
- Role of Publisher: add value
- provision of catalogue metadata
- some commitment to maintenance of the resource at the AvailableAt url
- some commitment to the resource being conformant to the description of the Feature
- some commitment to the maintenance of the mapping between the identifier LocalID and the resource.
28CLADDIER 3: persistence
- Wayback Machine
- Only snapshots (eg only 2004 version of Bryan's home page!)
- WebCite
- allows the creator of content to submit URLs for archiving, thus ensuring when one writes an academic document, the material will be archived, and the citation will be persistent
- But no real help for data
- only allow data citation when we believe in the persistence of the organisation making the data available
30Citation
- Needs a stable resource to cite
OWL Web Ontology Language Reference, W3C Proposed Recommendation 15 December 2003
This version: http://www.w3.org/TR/2003/PR-owl-ref-20031215/
Latest version: http://www.w3.org/TR/owl-ref/
Previous version: http://www.w3.org/TR/2003/CR-owl-ref-2003081
- (FRBR works expressions?)
31Citation
- The date alone (as in common web citation approaches) is not enough!
- Cited object likely to have changed
- Citation should link to the cited object as it was!
- 6 The CIA World Factbook.
- www.cia.gov/cia/publications/factbook/.
- Retrieved on 8 Jan 2006.
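To make the point concrete: a citation carrying only a retrieval date cannot tell a later reader whether the object has changed. The sketch below assumes (this is not on the slide) that the citing author also records a SHA-256 digest of the bytes retrieved. `WebCitation`, `cite` and `still_as_cited` are hypothetical names, and the byte strings are placeholders standing in for page content.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class WebCitation:
    url: str
    retrieved: str  # retrieval date as written in the citation
    sha256: str     # digest of the bytes the author saw (an added assumption)


def cite(url: str, retrieved: str, content: bytes) -> WebCitation:
    """Record the citation together with a digest of the cited content."""
    return WebCitation(url, retrieved, hashlib.sha256(content).hexdigest())


def still_as_cited(citation: WebCitation, current: bytes) -> bool:
    """True only if the content now served matches what was cited."""
    return hashlib.sha256(current).hexdigest() == citation.sha256


seen = b"placeholder for the page as retrieved"
c = cite("www.cia.gov/cia/publications/factbook/", "8 Jan 2006", seen)
unchanged = still_as_cited(c, seen)
changed = still_as_cited(c, b"placeholder for a later revision")
```

A digest only detects that the cited object has changed; getting back the object as it was still needs an archive of past states.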
32Citation needs
- An efficient way to reference and access archived past states of a changing dataset (work in progress, Buneman et al)
- Not important for original observations
- Don't mess with those data
- Less important for incremental datasets
- Later stuff should not invalidate earlier
- Very important for revisable datasets
- Eg Genomics: datasets that result from the combined work of curators, or contain opinions or facts likely to change
- Eg Mapping: OS maps represent a huge database that changes on a daily basis
33XMLArch System Architecture
Pre-processor
Version Merger
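The idea of merging versions so that past states stay referenceable can be sketched as follows. This is an illustration only, not the XMLArch implementation: it assumes flat key/value records rather than XML, stores each value once together with the range of versions in which it held, and uses invented sample data. `MergedArchive`, `merge` and `state_at` are hypothetical names.

```python
from typing import Dict, List, Tuple


class MergedArchive:
    """Keeps every version of a keyed dataset in one merged structure.

    Each key maps to a list of (value, first_version, last_version) runs,
    so a value that is unchanged across versions is stored only once.
    """

    def __init__(self) -> None:
        self._runs: Dict[str, List[Tuple[str, int, int]]] = {}
        self.latest = 0

    def merge(self, snapshot: Dict[str, str]) -> int:
        """Merge a full snapshot as the next version and return its number."""
        self.latest += 1
        v = self.latest
        for key, value in snapshot.items():
            runs = self._runs.setdefault(key, [])
            if runs and runs[-1][0] == value and runs[-1][2] == v - 1:
                # Same value as in the previous version: extend the run.
                runs[-1] = (value, runs[-1][1], v)
            else:
                # New key, changed value, or key re-added after a gap.
                runs.append((value, v, v))
        return v

    def state_at(self, version: int) -> Dict[str, str]:
        """Reconstruct the dataset as it was at the given version."""
        out: Dict[str, str] = {}
        for key, runs in self._runs.items():
            for value, first, last in runs:
                if first <= version <= last:
                    out[key] = value
                    break
        return out


archive = MergedArchive()
v1 = archive.merge({"geneA": "function unknown", "geneB": "kinase"})
v2 = archive.merge({"geneA": "transporter", "geneB": "kinase"})
v3 = archive.merge({"geneA": "transporter"})  # geneB withdrawn (invented data)
```

A citation could then name a version number, and `archive.state_at(v1)` would return the records as they stood when cited, even after later revisions.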
34Who are the curation players?
35Curation: Individual
- Small science: 2-3 times more data than Big science, but much more at risk
- PhD student? RA? PI? Administrator? IT support?
- Data potentially on local hard drives, or at best shared network drives
- May be inadequately protected
- Liable for policy-led deletion on resignation
- Individual knows too much
- Documentation/metadata unlikely to be adequate
- Tomorrow: gone!
36Department: eCrystals
- Specialist department archive (and national service)
- Workflow recording of lab parameters (R4L)
- Public & private elements
- Trying to build eCrystals federation (eBank 3)
- But ReciprocalNet? French COD efforts? Fragmented discipline!
- Tomorrow: likely to continue
37Institution: Cambridge Chemistry
- 175,000 small molecule structures in CML
- Alongside Archaeology, Manuscripts, Learning Materials, etc
- No library curation skills: dependent on research group enthusiast
- Collection isolated from other Chemistry
- Tomorrow: assured
38Community: CDL
- Shared effort from group of institutions
- Comparison: OhioLink?
- Document tradition, not data
- Passive role re collections
- Rely on departmental domain expertise
- Tomorrow: assured
39Community: SDSC?
- Data specialists
- Multiple disciplines
- Distinct from domains: curation dependent on external expertise
- Research ethos
- Tomorrow: dependent on grant/contract income & research priorities
40Community: LOCKSS?
- Self-selected group of collectors: closest to genuine open activity (despite Alliance)?
- Traditionally libraries collecting eJournals
- Model respects IPR
- No domain expertise: rely on origins
- Data limitations
- Tomorrow: potentially very persistent (low cost, high reliability, attack resistance, distributed)
41Discipline: Archaeology
- Staffed by archaeologist curators
- Understand special legal issues
- Strong relationship with community peers
- Internationally still fragmented?
- Tomorrow: dependent on research council grants & deposit funding
42Discipline: Astronomy
- Part of major international effort
- Expensive shared facilities, global reach
- Well integrated into community
- Enable new science
- Tomorrow: assured by community (another large facility)
43Discipline: Atmosphere
- Strong believer in need for domain scientists as curators
- Significant participant in community proxy agenda-setting activities
- Internationally fragmented resources
- Tomorrow: mostly dependent on grant funding (but strong commitment)
44Discipline: Pharmacology
- International Scientific Union
- Attempting to build credit for data contributions
- DB ownership rotates
- Tomorrow: extremely limited funding
45Discipline: Social Sciences
- Mature!
- Staffed by Social Science curators
- Alert to opportunities
- Able to appraise material offered
- Strong relationship to discipline
- Tomorrow: assured through broad mix of funding streams
46Publisher: Crystallography
- Publisher and Scientific Union
- Created key domain crystallographic standard (CIF)
- Strong motivator for deposit of structure data
- Consistent quality checks
- DOIs used for structure data
- Tomorrow: publishing business model
47National bodies: British Library
- Serious and robust approach
- Legal deposit powers & responsibilities as driver
- Oriented primarily towards cultural heritage (broadly interpreted)
- Little data, no science domain experience
- Tomorrow: strong future commitment
48National bodies: TNA/NDAD
- Specialist archive for government datasets
- Understand government regulations, dynamics & requirements
- Subject generalists disconnected from associated science
- Technology specialists (understand databases)
- Tomorrow: likely to pass eventually to The National Archives
49National bodies: NOAA (etc)
- Government body making serious data available
- Domain scientists curate data
- Operates in current political context (!)
- Tomorrow: reasonably assured but some un-funded mandates?
503rd parties: OCLC?
- Should this be community?
- Demand driven
- No domain science expertise: rely on origins
- Tomorrow: business case
513rd parties: Portico
- Specific area: eJournals
- Depends on publisher agreements
- No data or domain science expertise
- Tomorrow: commitment from Mellon, publishers & subscriptions, good funding mix
523rd Parties: Iron Mountain
- Records management IS a curation problem
- Organisations like this very likely to branch out
- No domain science expertise
- Tomorrow: business case, viability, stock market
53Institutions & the network
- Institutions have some fundamental sustainability
- Disciplines live in the network: sustainability is an issue
- Can we get the best of both?
54Intersections
Institution 1 Institution 2 Institution 3 etc
Discipline 1 X X
Discipline 2 X X
Discipline 3 X X
etc
55Who are the curation players again?
56Project StORe findings
- Discipline commonality from survey (Miller, UKDA, 2006)
- 2-way links between data & publication useful
- Barriers to actual deposit of data/outputs
- Sharing data important, likely between colleagues
- Perceived inconsistency across repositories
- Most common searching: Google type
- Researchers favour self-reliance rather than library support
- Recognise need for common minimum metadata
- Aim for pilot linking middleware demonstrator
- "Creating small scale silos of information with institutional repositories is not a compelling information management strategy in the Google age" (Heery & Anderson for JISC, 2005)
57Sustainability: tomorrow is the emerging worry
- Sustainability work package in DCC (new grant!)
- JISC/NDIIPP meeting addressed it
- AHRC report draft soon
- Research Information Network report draft
- JISC study on sustainable IT systems for HE
- Recent ARL/NSF workshop, NSF strategy
58Sustainability of what?
- Repository as an organisation
- Repository as a service
- Repository as a system
- Repositories as a network (federation?)
- Collections and objects supported by repositories
- Commit to collection: contract the manager!
59Social factors
- Commitment essential: much more than anything else (cf persistent identifiers)
- Funder requirements express social determination
- Policy: grant application forms, selection criteria
- Monitoring essential
- Legal, ethical, IPR impacts all significant
- Public good questions
- Academic credit (citations?)
- Free-loaders (embargos?)
- Disciplines are different!
- Workforce skills: researcher, data librarian/scientist
60Sustainability: a function of...
- Commitment
- Goals
- Value and cost
- Business model
- Time
- Environment
- Domain knowledge and information
- Dimensions (how much stuff)
- Technical approaches
- Usage
61So, tomorrow
- Digital data repositories already sustained > 30 years
- How?
- Vision, leadership, commitment
- Libraries, archives, museums sustained 100s of years
- How?
- Aggregate value proposition
- Perception now under threat!
- Collectively we need to identify the next steps toward digital data sustainability, for tomorrow, and tomorrow, and tomorrow!
62Macbeth again
- "To-morrow, and to-morrow, and to-morrow,
- Creeps in this petty pace from day to day,
- To the last syllable of recorded time
- it is a tale
- Told by an idiot, full of sound and fury,
- Signifying nothing."
63Mission (impossible?)
- To that last syllable of recorded time
- Keep our tales forever full of significance!
- Thank you