Title: Digital Curation Centre www.dcc.ac.uk
1Digital Curation Centre www.dcc.ac.uk
JISC Programme Conference, Brighton, 6/7 July 2004
- Peter Burnhill, Michael Day, David Giaretta, Liz
Lyon, Robin Rice, Bridget Robinson and Seamus Ross
Funded by
2Session Overview
- 1. Introduction and Briefing
- 2. Towards a Technical Model of Digital Curation: our R&D
- 3. Planning and Delivery of Services: the Associates Network and User Requirements Study
- Leona Carpenter and Bridget Robinson
31. Introduction and Briefing
- Background story on the DCC
- So who's that new kid on the block?
- What is digital curation anyway?
- adding value and ensuring longevity
- Aims and objectives for the DCC
- improving the quality of what is done
- Our planning and our progress
- timelines and deliverables
- How does this relate to the JISC Programme?
- start of a beautiful and lasting relationship ...
4Background to the DCC (1)
- Two parallel policy concerns
- 1. Neglect of digital heritage, especially given investment in digitisation programmes
- JISC Continuing Access and Digital Preservation Strategy, 2002-2005
- eLib Programme, eLib3, Circular 5/97 Digital Preservation
- Digital Preservation Coalition formed in 2002
- 2. Differing data sharing practices in eScience, especially given huge data volumes
- Links between eScience Programme and JISC
- Report commissioned by JISC Cttee for Support of Research (Lord and Macdonald, May 2003)
- twin drivers: Digital Preservation and Continuing Access (e-Science)
- identified need for national digital curation centre
5Interpretation of JISC policy
- JISC plays 3 roles
- promotes, supports and develops management and preservation of institutional and community digital materials for UK benefit
- partner to Research Council/AHRB and other national/international bodies
- as organization, appropriate grant conditions for JISC-funded creation of digital resources and good practice for JISC created/managed materials
- escalating scale and complexity of digital resources to be curated and the subsequent urgency of developing a critical mass of expertise, shared services and tools, for long-term digital preservation require a step change in investment and approaches.
- Over the next three years a greater emphasis on development of production services and tools needed to build on previous research studies and projects.
- Digital preservation remains a challenging area in which techniques, costs, and skills are still in development: advocacy, dissemination and training, to embed preservation needs as appropriate in JISC funding programmes.
6Interpreting the implementation plan
- Risk assessment studies, eg ePrints
- Calls to implement studies' recommendations for services and integration of preservation activity and standards into repositories funded by JISC.
- Series of community calls to support records management and digital preservation in institutions - cf FOI compliance.
- Establish Digital Curation Centre to
- Provide central focus of skilled staff and research
- links to wider network of development activity, researchers, services
- Develop set of central services, standards, and tools
- for a range of distributed digital data centres and preservation services,
- across the Information Environment and Research Grid.
- JISC Partnership funding,
- eg Web-archiving study jointly funded by JCIE and Wellcome Trust
- Digital Preservation Coalition as an independent entity with JISC membership and sector activity supported by JISC.
- National preservation of e-journals, through RLN/RSLG
7Back to the DCC Background (2)
- JISC Circular 6/03, initially issued June 2003
- Call postponed, revised and re-issued with more significant research component
- Joint funding: JISC and e-Science Core Programme
- £750K pa (outreach, services and development) and £250K pa (research)
- Unlikely that any single organisation could do what's expected
- Expressions of Interest and Full Proposals from Consortia
- Final selection made in December 2003
- Negotiations and clarification in January 2004
8Designation of DCC
- Task entrusted to Consortium of four institutional partners
- Universities of Edinburgh (lead), Glasgow and Bath together with CCLRC (Rutherford Appleton and Daresbury Laboratories)
- brought together through the National eScience Centre
- jointly managed by Universities of Edinburgh and Glasgow
- Two 3-year awards made
- JISC funding started on 1st March 2004
- EPSRC grant funding starts on 1st September 2004
- Phase One set-up
- some early deliverables of website and helpdesk
- preparation for full operation and launch of services in October
- planning formal opening for early November 2004
9Responsibilities across the DCC
- Them with titles
- Peter Burnhill, Director (Phase One)
- with Robin Rice, Phase One Project Co-ordinator
- (from EDINA and Data Library, University of Edinburgh)
- Peter Buneman, Research Director (PI on EPSRC grant)
- Professor of Informatics, University of Edinburgh
- Liz Lyon, Associate Director (Community Support and Outreach)
- Director of UKOLN, University of Bath
- Seamus Ross, Associate Director (Service Definition and Delivery)
- Director of HATII and ERPANET, University of Glasgow
- David Giaretta, Associate Director (Development)
- Head of Astronomical Software Services, CCLRC
- Two significant and well known Ex Portfolio names
- Malcolm Atkinson, Director, NeSC
- Chris Rusbridge, Director, Information Services, UofGlasgow
10functional management and collaboration
[Diagram: DCC functions (management and admin support; community support and outreach; service definition and delivery; development co-ordination, testbeds and tools; research) and their external links (curation organisations eg DPC; communities of practice and users; Collaborative Associates Network of Data Organisations; research collaborators; industry; standards bodies)]
11What is this digital curation anyway?
- The term Digital Curation is a new invention.
- Digital Data Curation Task Force - Report of Strategy Discussion Day (2002)
- citing Tony Hey citing use by Dr John Taylor, Director General of the Research Councils, to distinguish the actions involved in caring for digital data beyond its original use, from digital preservation. The concept's reach extends beyond libraries.
- The e-Science Curation Report (2003) proposed the following distinctions
- Curation: managing and promoting the use of data from point of creation, to ensure fit-for-contemporary-purpose, available for discovery and re-use.
- For dynamic datasets this may mean continuous enrichment or updating to keep it fit for purpose.
- Higher levels of curation will involve maintaining links with annotation and with other published materials.
- Archiving: curation activity which ensures that data are properly selected, stored, can be accessed
- logical and physical integrity is maintained over time, including security and authenticity.
- Preservation: activity within archiving in which specific items of data are maintained over time so that they can still be accessed and understood through changes in technology.
12Digital curation redefined ...
- digital curation ... digital objects and data, over their life-cycle, for current and future generations of use ...
- f(data curation, digital preservation)
- data curation: when high current/ongoing interest
- actions needed to maintain and utilise digital data and research results over entire life-cycle
- data creation and management; adding value; generating new sources of information and knowledge, for use
- digital preservation: for longevity/fall off in interest
- long-run technological/legal accessibility and usability
- storage, maintenance and accessibility of information content in digital material over the long-term, for use
- OAIS concept of designated community
13Data curation in action
- Astronomy
- Integrating and analysing distributed data (AstroGrid)
- publishing multi-TB sky surveys (SuperCOSMOS, WFCAM)
- interoperability standards (IVO Alliance)
- BioInformatics
- data publishing: generic tools for XML export (EBI Biomart)
- annotation tools for massive data sets (Pubmed, VOTable)
- archiving tools for dynamic data sets (biological DBs)
- Environmental sciences
- spatio-temporal annotation (OS Mastermap/ Mouse Atlas)
- Document management
- Repository certification (RLG Task Force)
14Digital preservation approaches
- Migration and Refreshment
- Emulation and Encapsulation
- Digital Archaeology and Rescue
- Document Format Specification
Robin Rice and Najla Semple, http://www.lib.ed.ac.uk/sites/digpres/
15Communities of Practice: Social Sciences (IASSIST)
- History of sharing: economical in terms of both data collector and respondent
- Data about humans: problems of confidentiality confronted early on
- Mixed blessing of agreed proprietary formats (OSIRIS, SPSS, etc.): allows migration
- Future-proofing - 30 years of data advocacy!
- Tradition of data archiving and data citation
- Building new data standards out of common experience
- data archivists, data librarians: are they now digital curators?
- www.iassistdata.org
16Unifying Themes for D C C
- data as evidence
- for one or more designated communities
- archival responsibility
- at one or more institutional levels
- with institutional policies and individuals' competence
- engage/discover communities of practice, to invoke/provoke good practices
- appraisal and retention/disposal
- logical and physical integrity; authenticity/security
- research problems in productive research domains
- eg Informatics, Law School
17Aims and Objectives for the DCC
- quality improvement in data curation and digital preservation
- Initial focus: data as evidence for scholarly conclusions
- Wider remit: worlds of scholarly communication and eLearning
- twin aims: excellence in research and excellence in service
- need to bridge across communities
- universities and research institutes
- scientific data tradition and document tradition
- multi-sectoral, international
18We are all curators now ...
- The term curation builds on our understanding of the word curator
- who keeps something for the public good, value of which often needs to be brought out by the curator.
- 1. this open context implies more support for explicit policies with regard to data sharing, and it has major implications for structuring and tools.
- 2. the digital curator as store-keeper closely linked to promoting new science, looking forward to identify new ways to serve present and future researchers.
- digital curator should take an active role in promoting and adding value to holdings
- manage the value of collection
- adding links and annotation to provide context
- recording provenance of changes made
19Planning and Progress
- We must plan for the Long, with our 2020 Vision - 15yrs
- we have large territory, and large expectation
- multi-disciplinary, multi data type, multi tradition/profession
- national and international, but also local and hidden from view
- a lot is going on
- how to ensure that we do something sensible with the £s and the trust we have been given?
- who/what should we plan to affect/effect?
- policy-makers and responsible curators (researchers?)
- how do we wish to be judged, and when?
- collaboration and win-win-win scenarios
20Deliverables
21foci of attention in set-up phase
- Users: client, peer and policy communities
- outreach and community support; service definition/delivery; development co-ordination; research agenda
- user requirements analysis: Leona Carpenter (Focus Groups)
- Consortium: organisation from partner participation
- roles and commitment; norming/performing; operational communication; consortium agreement (IPR)
- Employers: institutional settings
- re-deployment/appointments; accommodation; commitment/reporting
- -> Project Plan, as living document
22Phase One Progress, March -
- weekly AccessGrid/telecon and two face2face meetings
- defining programme of deliverables; re-deploying and recruiting staff; planning appointment of full time director in time for Launch
- early deliverables
- www.dcc.ac.uk with links, presentations and progress updates
- digitalcuration_at_ed.ac.uk for contacts and offers of collaboration
- project plan submitted to JISC, late May 2004
- defining R&D programme and services for delivery
- eg curation architecture; repository of tools and technical information
- engaging curators in existing community of practice
24Towards a Technical Model of Digital Curation: our R&D
25Outline
- Fundamental questions for the DCC
- What are we doing, in detail?
- Where do we start?
- Our answers
- Identifying something useful, do-able, extensible
- OAIS fundamentals
- Initial development and associated services
- What we believe is going to be useful
- Collaborate with other groups - worldwide
- In parallel
- User Requirements
- Research and Development to support future
services
26Overview
- Look to the Long Term needs for overall guidance on our approach
- Keep e-Science in mind but aim for general applicability
- NB - not limited to e-Science
- Map the landscape: what is needed and what is already being tackled. Be guided by
- automatic processing
- interoperability
- Don't duplicate existing useful work
- Identify niche for immediate services
- Identify collaborators worldwide
27What can we rely on in the Long Term
- The bits - let us for the moment put to one side the issue of BIT PRESERVATION (but it is an issue)
- Paper documents that people can read
- ISO standards
- The information we collect either in the far future DCC or its successor
- Some kind of remote access
- Some kind of computers
- People?
- 10 fingers?
28Preservation vs Current Use
- There are already very many architectures to support immediate use of information
- Including JISC architecture
- Aim to support these
- Therefore chose to be guided by
- long-term preservation aspects
- try to ensure that components of the preservation architecture can supplement other current use architectures.
- to promote this we should emphasise interoperability and automated use as far as possible.
- based initially on OAIS Reference Model but add other ideas later
- bear e-Science in mind
29OAIS Reference Model Functional Model
30OAIS Preservation Planning - key aspects
- Representation Net
- Designated Communities Knowledge Base
31Representation Net
32Preservation Issues
- Given a file or a stream of bits, how does one know what Representation Information is needed? (This question applies to Representation Information itself as well as to the digital objects we are primarily interested in preserving and using.) How does one know, for example, if this thing is in FITS format?
- Someone may simply know what it is and how to deal with it, i.e. the bits are within the Knowledge Base
- One may be able to recognise the format by looking for various types of patterns.
- One may feed the bits into all available interpreters to see which accept the data as valid
- Other means.
- The only safe way: have an associated label which points to the appropriate Representation Information
- Note this does not exclude the other methods, e.g. for data rescue
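The contrast between pattern recognition and labelling can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `SIGNATURES`, `sniff_format` and `identify` are hypothetical and not part of any DCC tool, and the signature table covers just three formats whose leading bytes are well known (a FITS primary header begins with the keyword card `SIMPLE  =`).

```python
from typing import Optional, Tuple

# Leading byte patterns for a few well-known formats (illustrative subset).
# A FITS primary header starts with the keyword card "SIMPLE  =".
SIGNATURES = {
    "FITS": b"SIMPLE  =",
    "PDF": b"%PDF-",
    "PNG": b"\x89PNG\r\n\x1a\n",
}


def sniff_format(data: bytes) -> Optional[str]:
    """Guess a format from its leading bytes; return None if nothing matches."""
    for name, magic in SIGNATURES.items():
        if data.startswith(magic):
            return name
    return None


def identify(data: bytes, label: Optional[str] = None) -> Tuple[Optional[str], str]:
    """Prefer an explicit label; fall back to pattern matching (data rescue).

    The second element of the result records which method was used.
    """
    if label is not None:
        return label, "label"
    return sniff_format(data), "pattern"
```

Pattern matching can be fooled by coincidental byte sequences and says nothing about unknown formats, which is why the slide treats it as a fallback and the label as the only safe route.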
33High Level View
Example of use of Representation Information
Labelling
34Implications
- A label must be attached to each piece of digital object as a necessary (but not sufficient) condition for long-term preservation; logical attachment or packaging TBD by the DCC.
- The label should at least identify Representation Information. For long-term preservation this label must therefore be a DCC persistent identifier.
- allow some normalisation
- In order for the Representation Information to be persistent it should either be held with the data object itself or be part of a central repository that is part of the DCC. Thus the DCC needs a DCC Representation Information Repository. This repository would include
- a Format Repository (covering structural information); automated use would be supported by use of formal description languages such as EAST (ISO 15889, http://east.cnes.fr/) or DFDL (http://forge.gridforum.org/projects/dfdl-wg/)
- a Semantic Repository with, for example, Data Dictionaries and Ontologies
- Software Repository with appropriate emulation capabilities
- Each piece of digital RI is also a digital object which is understood either by the user's Knowledge Base OR by further Representation Information. Therefore each piece of RI also has a label pointing to further RI.
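The recursion in the last point (RI pointing to further RI until the user's Knowledge Base takes over) can be sketched as a graph walk. Everything here is an assumption made for illustration: the `RepInfo` record, the `representation_net` function and the `ri:...` identifiers are hypothetical, and a real repository would use DCC persistent identifiers rather than these strings.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class RepInfo:
    """One piece of Representation Information (hypothetical record layout)."""
    identifier: str
    description: str
    needs: Tuple[str, ...] = ()  # labels of further RI needed to understand this one


def representation_net(label: str,
                       repository: Dict[str, RepInfo],
                       knowledge_base: Set[str]) -> List[str]:
    """Return the RI identifiers a user must fetch, starting from a label.

    The walk stops at anything already in the user's Knowledge Base and
    visits each identifier once, so cyclic references cannot loop forever.
    A label missing from the repository raises KeyError.
    """
    needed: List[str] = []
    seen: Set[str] = set()
    stack = [label]
    while stack:
        current = stack.pop()
        if current in seen or current in knowledge_base:
            continue
        seen.add(current)
        needed.append(current)
        stack.extend(repository[current].needs)
    return needed


# Illustrative repository contents (identifiers are made up).
REPOSITORY = {
    "ri:fits-format": RepInfo("ri:fits-format", "FITS structure",
                              ("ri:fits-standard-pdf", "ri:ascii")),
    "ri:fits-standard-pdf": RepInfo("ri:fits-standard-pdf",
                                    "FITS standard document", ("ri:pdf-spec",)),
    "ri:pdf-spec": RepInfo("ri:pdf-spec", "PDF specification"),
    "ri:ascii": RepInfo("ri:ascii", "ASCII character encoding"),
}
```

A user whose Knowledge Base already covers ASCII and PDF needs only the two FITS entries; a user who knows nothing needs all four.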
35Designated Community
- Techniques must be created for
- defining a Knowledge Base
- linking a Knowledge Base to a Designated Community
- linking Representation Information to a Knowledge Base if possible
36Representation Information (1)
- Structure including Formats
- Distinguish
- formats which are used mainly for rendering to be followed by human inspection, and
- formats used for automated processing
- Implications
- Representation Information Repository should define selected file formats using EAST and DFDL
- The EAST and DFDL tools are themselves Representation Information which in due course will have to be fully defined; the closure of their Representation Nets will be the EAST standard and the DFDL documentation
- Definitions should include scientific objects and humanities objects
37Representation Information (2)
- Semantics
- Hard problem: start with Data Dictionaries
- Implications
- the Representation Information Repository should
include Data Dictionaries, followed by more
general semantics
38Representation Information (3)
- Time Dependent Information
- Many, perhaps most, datasets change over time and the state at each particular moment in time may be important. It may be useful to break the issue into separate parts.
- at each moment in time we could, in principle, take a snapshot and store it. That snapshot has its associated Representation Net.
- efficient storage of a series of snapshots may lead one to store differences or include time tags in the data (see for example P. Buneman, S. Khanna, and Wang-Chiew Tan. On the Propagation of Deletions and Annotations through Views. Proc. 21st ACM Symp. on Principles of Database Systems.).
- Additional Representation Information would be needed which describes how to get to a particular time's snapshot from the efficiently encoded version.
- Also applies to ANNOTATION: who said what and when did they say it
- Implications
- These are areas of active research within the consortium and the DCC should be able to provide
- advice and well tested tools for certain forms of efficient encoding of time dependent information
- advice on annotation
- identifiers and Representation Information, perhaps in the form of software, for the associated encodings
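As a minimal sketch of "store differences, then reconstruct a particular time's snapshot", the following Python keeps one base snapshot plus a list of timed deltas. It is a deliberately simple key/value delta encoding chosen for illustration, not the technique of the cited paper; the `DeltaArchive` class and its method names are hypothetical, and it assumes states are recorded in increasing time order.

```python
_MISSING = object()  # sentinel so that a stored value of None is handled correctly


def diff(old: dict, new: dict):
    """Return (changed, removed): what must be applied to old to obtain new."""
    changed = {k: v for k, v in new.items() if old.get(k, _MISSING) != v}
    removed = [k for k in old if k not in new]
    return changed, removed


class DeltaArchive:
    """Base snapshot plus differences; assumes record() is called in time order."""

    def __init__(self, base: dict, base_time: int):
        self.base = dict(base)
        self.base_time = base_time
        self.deltas = []            # list of (time, changed, removed)
        self._latest = dict(base)   # most recent full state, used only for diffing

    def record(self, time: int, new_state: dict) -> None:
        changed, removed = diff(self._latest, new_state)
        self.deltas.append((time, changed, removed))
        self._latest = dict(new_state)

    def snapshot_at(self, time: int) -> dict:
        """Rebuild the dataset as it stood at the given time."""
        if time < self.base_time:
            raise ValueError("no data recorded before the base snapshot")
        state = dict(self.base)
        for t, changed, removed in self.deltas:
            if t > time:
                break
            state.update(changed)
            for key in removed:
                state.pop(key, None)
        return state
```

The replay logic in `snapshot_at` is exactly the kind of "additional Representation Information" the slide mentions: without a description of it, the stored deltas cannot be turned back into a usable snapshot.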
39Representation Information (4)
- Actions and Processes (Behaviour?)
- Some information has, as an integral part of its content, an implicit or explicit process associated with it; this could be argued to be a type of semantics, however it is probably sufficiently different to need special classification. An example of this is a database or other time dependent or reactive system such as a Neural Net.
- Emulations: Universal Virtual Computer (UVC)
- Implications
- Support Software emulation via a UVC (possibly based on JVM)
- Support time dependent or reactive systems
40Persistent Ids
- Implications
- Use of existing, or creation of new, infrastructure (standards, protocols, servers etc) for persistent IDs with adequate flexibility and longevity
- as part of the succession planning, agreement would be needed with appropriate organisation to act as backup and inheritor of DCC data.
41Archival Information Package
42Preservation Description Info
43AIP implications PDI
- define standard Preservation Metadata based initially on OCLC work including Michael Day's work and also CCLRC work etc
- define adequate Packaging technique, almost certainly XML based
- define recommended tools and procedures for creating Fixity Information such as checksums and digests, together with associated Representation Information
- investigate authentication systems
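A fixity record of the kind mentioned above can be sketched with Python's standard `hashlib` module. The record layout (a dict with `algorithm` and `digest` keys) and the function names are assumptions for illustration, not a DCC or OAIS format; storing the algorithm name alongside the digest is one simple way of carrying the associated Representation Information.

```python
import hashlib
import hmac


def make_fixity(data: bytes, algorithm: str = "sha256") -> dict:
    """Compute a digest and record which algorithm produced it."""
    digest = hashlib.new(algorithm, data).hexdigest()
    return {"algorithm": algorithm, "digest": digest}


def verify_fixity(data: bytes, fixity: dict) -> bool:
    """Recompute the digest with the recorded algorithm and compare."""
    actual = hashlib.new(fixity["algorithm"], data).hexdigest()
    return hmac.compare_digest(actual, fixity["digest"])
```

For example, `verify_fixity(payload, make_fixity(payload))` is true for unchanged bytes and false once any byte has been altered. A digest on its own only detects change; it does not say who made the change, which is why the slide lists authentication systems as a separate item.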
44Audit and Certification
- Implications
- facilitate production of standard(s) on which a certification program can be based
- work to establish accreditation and certification body in preparation for offering audit and certification services
- audit, certification and accreditation are potential sources of long term funding for the DCC
- software certification will require testbeds and testing procedures.
- Hardware and software systems will need to be purchased, hired or borrowed. The DCC associates would be useful partners.
- We might expect Commercial software to be offered to us by the manufacturer for testing
- Testing commercial software could be fee based.
45Implications for Research
- Research needed on Representation Information (Structure and Semantics) e.g.
- Investigate fundamental limitations of bit-level descriptions and existing tools.
- Contribute to DFDL definition
- Investigate capabilities needed to describe rendered format (including Word, PDF etc)
- Data Virtualisation: define Science objects and Humanities objects
- Research is needed to
- Support Software emulation via a UVC (possibly based on JVM)
- Support time dependent or reactive systems
- Research is needed to provide a solid basis on which we can develop persistent IDs with adequate flexibility and longevity
- Research is needed to allow the DCC to
- define standard Preservation Metadata based initially on OCLC work
- define adequate Packaging technique, almost certainly XML based
- investigate authentication systems with a view to preparing recommendations for users and consider offering, for example, a (fee-based) key storage service.
- A rigorous theoretical basis must be put in place from which we can create techniques for
- defining a Knowledge Base
- linking a Knowledge Base to a Designated Community
- linking Representation Information to a Knowledge Base if possible
46Curation Manual
- Put in place quickly using international experts
- Updates annually
- Build to curation encyclopaedia
47Document format specification
- Borrowed from records management tradition: institutions create documents in standard or open formats, which are easier to preserve.
- Much easier to do in a strict records management environment with a published policy of retention schedules and a clear knowledge of internally produced records.
- Stipulating a specific file format is harder in a research environment where a wide range of digital materials are produced and have to be preserved.
- The move to DDI DTD in social science data world may be seen as an example of this preservation technique.
48Services Development
- Turns Research into Products for Research that our communities can use with confidence
- tracking and testing tools and standards
- that are correct, usable, reliable, well documented
- e.g. for ingest, repository management, data exchange, ontologies
- working with tool developers wherever possible
- developing testbeds and interworking with other testbeds
- aim to gain leverage
- formats
- working with other projects worldwide
- using generic tools and techniques
- to develop strategies for emerging digital formats
- Metadata standards
- long-term viability of metadata
- Registries underpin this work to provide basis of Advisory Service
49More definitions
- There does seem to be a lack of clarity. Some terms worth distinguishing are
- data preservation: a general term, probably equivalent to digital preservation in this context
- digital preservation: could be, and probably is, interpreted as simply ensuring the original bits and bytes are accessible
- digital information preservation: this is what is referred to in the OAIS standard - what is important is not the original "bits and bytes" but the content. An OAIS ensures that the content is accessible, understandable and usable.
- curation: general term - taking care of things
- if someone currently calls themselves a curator do we accept their definition?
- data curation: looking after and adding value to data
- digital curation: looking after and somehow "adding value" to digital data. This probably implies creating some new data from the existing, in order to make the latter more useful and "fit for purpose".
- information curation: not seen in the wild
- evidence: bit preservation plus authenticity and trust?
53Faith in the medium
?
54Faith in the technology