Digital Curation Centre www.dcc.ac.uk - PowerPoint PPT Presentation

Provided by: Nes66
1
Digital Curation Centre www.dcc.ac.uk
  • Peter Burnhill, Michael Day, David Giaretta, Liz
    Lyon, Robin Rice, Bridget Robinson and Seamus Ross

Funded by
2
Session Overview
  • 1. Introduction & Briefing
  • 2. Towards a Technical Model of Digital Curation:
    our R&D
  • 3. Planning Delivery of Services & the Associates
    Network

3
1. Introduction & Briefing
  • Background story on the DCC
  • So who's that new kid on the block?
  • What is digital curation anyway?
  • adding value & ensuring longevity
  • Aims & objectives for the DCC
  • improving the quality of what is done
  • Our planning & our progress
  • timelines & deliverables
  • How does this relate to the JISC Programme?

4
Background to the DCC (1)
  • Two parallel policy concerns
  • 1. Neglect of digital heritage, especially given
    investment in digitisation programmes
  • JISC Continuing Access and Digital Preservation
    Strategy, 2002-2005
  • eLib Programme, eLib3, Circular 5/97 Digital
    Preservation
  • Digital Preservation Coalition formed in 2002
  • 2. Differing data sharing practices in eScience,
    especially given huge data volumes
  • Links between eScience Programme and JISC
  • Report commissioned by JISC Committee for Support of
    Research (Lord & Macdonald, May 2003)
  • twin drivers: Digital Preservation & Continuing
    Access (e-Science)
  • identified need for national digital curation
    centre

5
Interpretation of JISC policy
  • JISC plays 3 roles:
  • promotes, supports & develops the management &
    preservation of institutional and community
    digital materials for UK benefit
  • partner to Research Council/AHRB & other
    national/international bodies
  • as organization: appropriate grant conditions for
    JISC-funded creation of digital resources & good
    practice for JISC created/managed materials
  • escalating scale and complexity of digital
    resources to be curated and the subsequent
    urgency of developing a critical mass of
    expertise, shared services and tools, for
    long-term digital preservation require a step
    change in investment and approaches.
  • Over the next three years a greater emphasis on
    development of production services and tools
    needed to build on previous research studies and
    projects.
  • Digital preservation remains a challenging area
    in which techniques, costs, and skills are still
    in development; advocacy, dissemination and
    training are needed to embed preservation needs as
    appropriate in JISC funding programmes.

6
Interpreting the implementation plan
  • Risk assessment studies, eg ePrints
  • Calls to implement studies' recommendations for
    services and integration of preservation activity
    standards into repositories funded by JISC.
  • Series of community calls to support records
    management and digital preservation in
    institutions - cf FOI compliance.
  • Establish Digital Curation Centre to:
  • Provide central focus of skilled staff & research
  • links to wider network of development activity,
    researchers, services
  • Develop set of central services, standards, and
    tools
  • for a range of distributed digital data centres &
    preservation services,
  • across the Information Environment & Research
    Grid.
  • JISC Partnership funding,
  • eg Web-archiving study jointly funded by JCIE
    and Wellcome Trust
  • Digital Preservation Coalition as an independent
    entity with JISC membership and sector activity
    supported by JISC.
  • National preservation of e-journals, through
    RLN/RSLG

7
Back to the DCC Background (2)
  • JISC Circular 6/03, initially issued June 2003
  • Call postponed, revised & re-issued with more
    significant research component
  • Joint funding: JISC and e-Science Core Programme
  • £750K pa (outreach, services & development) + £250K
    pa (research)
  • Unlikely that any single organisation could do
    what's expected
  • Expressions of Interest & Full Proposals from
    Consortia
  • Final selection made in December 2003
  • Negotiations & clarification in January 2004

8
Designation of DCC
  • Task entrusted to Consortium of four
    institutional partners
  • Universities of Edinburgh (lead), Glasgow & Bath
    together with CCLRC (Rutherford Appleton and
    Daresbury Laboratories)
  • brought together through the National eScience
    Centre
  • jointly managed by Universities of Edinburgh &
    Glasgow
  • Two 3-year awards made
  • JISC funding started on 1st March 2004
  • EPSRC grant funding starts on 1st September 2004
  • Phase One: set-up
  • some early deliverables of website & helpdesk
  • preparation for full operation & launch of
    services in October
  • planning formal opening for early November 2004

9
Responsibilities across the DCC
  • Them with titles
  • Peter Burnhill, Director (Phase One)
  • with Robin Rice, Phase One Project Co-ordinator
  • (from EDINA Data Library, University of
    Edinburgh)
  • Peter Buneman, Research Director (PI on EPSRC
    grant)
  • Professor of Informatics, University of Edinburgh
  • Liz Lyon, Associate Director (Community Support &
    Outreach)
  • Director of UKOLN, University of Bath
  • Seamus Ross, Associate Director (Service
    Definition & Delivery)
  • Director of HATII & ERPANET, University of
    Glasgow
  • David Giaretta, Associate Director (Development)
  • Head of Astronomical Software Services, CCLRC
  • Two significant & well known Ex Portfolio names
  • Malcolm Atkinson, Director, NeSC
  • Chris Rusbridge, Director, Information Services,
    University of Glasgow

10
(Diagram labels: functional management & collaboration; curation organisations, eg DPC; communities of practice: users; community support & outreach; service definition & delivery; Collaborative Associates Network of Data Organisations; management & admin support; research collaborators; research; development co-ordination; testbeds & tools; industry; standards bodies)
11
What is this digital curation anyway?
  • The term Digital Curation is a new invention.
  • Digital Data Curation Task Force - Report of
    Strategy Discussion Day (2002)
  • citing Tony Hey citing use by Dr John Taylor,
    Director General of the Research Councils, to
    distinguish the actions involved in caring for
    digital data beyond its original use, from
    digital preservation. The concept's reach extends
    beyond libraries.
  • The e-Science Curation Report (2003) proposed the
    following distinctions:
  • Curation: managing & promoting the use of data
    from point of creation, to ensure it is
    fit-for-contemporary-purpose and available for
    discovery & re-use.
  • For dynamic datasets this may mean continuous
    enrichment or updating to keep it fit for
    purpose.
  • Higher levels of curation will involve
    maintaining links with annotation & with other
    published materials.
  • Archiving: curation activity which ensures that
    data are properly selected, stored, can be
    accessed, & that logical and physical integrity
    is maintained over time, including security and
    authenticity.
  • Preservation: activity within archiving in which
    specific items of data are maintained over time
    so that they can still be accessed and understood
    through changes in technology.

12
Digital curation redefined ...
  • digital curation ... digital objects and data,
    over their life-cycle, for current & future
    generations of use ...
  • f(data curation, digital preservation)
  • data curation: when high current/ongoing
    interest
  • actions needed to maintain and utilise digital
    data & research results over entire life-cycle
  • data creation & management; adding value;
    generating new sources of information &
    knowledge, for use
  • digital preservation: for longevity & fall-off in
    interest
  • long-run technological/legal accessibility &
    usability
  • storage, maintenance & accessibility of
    information content in digital material over the
    long-term, for use
  • OAIS concept of designated community

13
Data curation in action
  • Astronomy
  • Integrating and analysing distributed data
    (AstroGrid)
  • publishing multi-TB sky surveys (SuperCOSMOS &
    WFCAM)
  • interoperability standards (IVO Alliance)
  • BioInformatics
  • data publishing: generic tools for XML export
    (EBI Biomart)
  • annotation tools for massive data sets (Pubmed,
    VOTable)
  • archiving tools for dynamic data sets (biological
    DBs)
  • Environmental sciences
  • spatio-temporal annotation (OS Mastermap/ Mouse
    Atlas)
  • Document management
  • Repository certification (RLG Task Force)

14
Digital preservation approaches
  • Migration Refreshment
  • Emulation Encapsulation
  • Digital Archaeology Rescue
  • Document Format Specification

Robin Rice & Najla Semple, http://www.lib.ed.ac.uk/sites/digpres/
15
Communities of Practice: Social Sciences
(IASSIST)
  • History of sharing: economical in terms of both
    data collector and respondent
  • Data about humans: problems of confidentiality
    confronted early on
  • Mixed blessing of agreed proprietary formats
    (OSIRIS, SPSS, etc.): allows migration
  • Future-proofing - 30 years of data advocacy!
  • Tradition of data archiving & data citation
  • Building new data standards out of common
    experience
  • data archivists, data librarians: the new
    digital curators?
  • www.iassistdata.org

16
Unifying Themes for D C C
  • data as evidence
  • for one or more designated communities
  • archival responsibility
  • at one or more institutional levels
  • with institutional policies & individuals'
    competence
  • engage/discover communities of practice, to
    invoke/provoke good practices
  • appraisal & retention/disposal
  • logical & physical integrity;
    authenticity/security
  • research problems in productive research domains
  • eg Informatics, Law School

17
Aims & Objectives for the DCC
  • quality improvement in data curation & digital
    preservation
  • Initial focus: data as evidence for scholarly
    conclusions
  • Wider remit: worlds of scholarly communication &
    eLearning
  • twin aims: excellence in research & excellence in
    service
  • need to bridge across communities
  • universities & research institutes
  • scientific data tradition & document tradition
  • multi-sectoral, international

18
We are all curators now ...
  • The term curation builds on our understanding
    of the word curator
  • who keeps something for the public good, value
    of which often needs to be brought out by the
    curator.
  • 1. this open context implies more support for
    explicit policies with regard to data sharing,
    and it has major implications for structuring and
    tools.
  • 2. the digital curator as store-keeper closely
    linked to promoting new science, looking forward
    to identify new ways to serve present and future
    researchers.
  • digital curator should take an active role in
    promoting and adding value to holdings
  • manage the value of collection
  • adding links and annotation to provide context
  • recording provenance of changes made

19
Planning & Progress
  • We must plan for the Long Term, with our 2020
    Vision - 15yrs
  • we have large territory, and large expectation
  • multi-disciplinary, multi data type, multi
    tradition/profession
  • national and international, but also local and
    hidden from view
  • a lot is going on
  • how to ensure that we do something sensible with
    the £s and the trust we have been given?
  • who/what should we plan to affect/effect?
  • policy-makers & responsible curators
    (researchers?)
  • how do we wish to be judged, and when?
  • collaboration: win-win-win scenarios

20
foci of attention in set-up phase
  • Users: client, peer and policy communities
  • outreach & community support; service
    definition/delivery; development co-ordination;
    research agenda
  • user requirements analysis: Leona Carpenter
    (Focus Groups)
  • Consortium: organisation from partner
    participation
  • roles & commitment; norming/performing;
    operational communication; consortium agreement
    (IPR)
  • Employers: institutional settings
  • re-deployment/appointments; accommodation;
    commitment/reporting
  • -> Project Plan, as living document

21
Phase One Progress, March -
  • weekly AccessGrid/telecon & two face2face meetings
  • defining programme of deliverables; re-deploying &
    recruiting staff; planning appointment of full
    time director in time for Launch
  • early deliverables
  • www.dcc.ac.uk with links, presentations &
    progress updates
  • digitalcuration_at_ed.ac.uk for contacts & offers of
    collaboration
  • project plan submitted to JISC, late May 2004
  • defining R&D programme & services for delivery
  • eg curation architecture; repository of tools &
    technical information
  • engaging curators in existing community of
    practice

22
Towards a Technical Model of Digital Curation:
our R&D
  • David Giaretta

Funded by
23
What can we rely on in the Long Term
  • The bits - BIT PRESERVATION
  • Paper documents that people can read
  • ISO standards
  • The information we collect either in the far
    future DCC or its successor
  • Some kind of remote access
  • Some kind of computers
  • People?

24
Preservation vs Current Use
  • There are already very many architectures to
    support immediate use of information
  • Including JISC architecture
  • Aim to support these
  • Therefore chose to be guided by
  • long-term preservation aspects
  • to promote this we should emphasise
    interoperability and automated use as far as
    possible.
  • based initially on OAIS Reference Model but add
    other ideas later
  • bear e-Science in mind

25
OAIS Reference Model: Functional Model
26
OAIS Preservation Planning - key aspects
  • Representation Net
  • Designated Communities Knowledge Base

27
Representation Net
28
Preservation Issues
  • Given a file or a stream of bits, how does one
    know what Representation Information is needed?
    (This question applies to Representation
    Information itself as well as to the digital
    objects we are primarily interested in preserving
    and using.) How does one know, for example, if
    this thing is in FITS format?
  • Someone may simply know what it is and how to
    deal with it, i.e. the bits are within the
    Knowledge Base
  • One may be able to recognise the format by
    looking for various types of patterns.
  • One may feed the bits into all available
    interpreters to see which accept the data as
    valid
  • Other means.
  • The only safe way: have an associated label which
    points to the appropriate Representation
    Information
  • Note: this does not exclude the other methods, e.g.
    for data rescue
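
The options above can be sketched in a few lines of Python. This is only an illustration, not a DCC tool: the signature table, the function names and the example label "dcc:ri/fits" are all hypothetical. It prefers an attached label and falls back to pattern recognition (useful for data rescue) only when no label is present.

```python
from typing import Optional, Tuple

# Hypothetical signature table: leading bytes of a few well-known formats.
SIGNATURES = {
    b"SIMPLE  =": "FITS",        # a FITS primary header starts with the SIMPLE keyword
    b"%PDF-": "PDF",
    b"\x89PNG\r\n\x1a\n": "PNG",
}


def sniff_format(data: bytes) -> Optional[str]:
    """Guess a format by looking for a known pattern at the start of the bits."""
    for magic, name in SIGNATURES.items():
        if data.startswith(magic):
            return name
    return None


def identify(data: bytes, label: Optional[str] = None) -> Tuple[str, Optional[str]]:
    """Return (how we know, what we know) for a stream of bits.

    An attached label is trusted first; pattern recognition is a fallback.
    """
    if label is not None:
        return ("label", label)
    guess = sniff_format(data)
    if guess is not None:
        return ("pattern", guess)
    return ("unknown", None)
```

A guess from patterns can be wrong or ambiguous, which is why the label is treated as the primary route and the sniffing result is reported separately.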

29
High Level View
Example of use of Representation Information
Labelling
30
Implications
  • A label must be attached to each digital
    object as a necessary (but not sufficient)
    condition for long-term preservation; logical
    attachment or packaging TBD by the DCC.
  • The label should at least identify Representation
    Information. For long-term preservation this
    label must therefore be a DCC persistent
    identifier.
  • allow some normalisation
  • In order for the Representation Information to be
    persistent, it should either be held with the
    data object itself or be part of a central
    repository that is part of the DCC. Thus the DCC
    needs a DCC Representation Information Repository.
    This repository would include:
  • a Format Repository (covering structural
    information); automated use would be supported by
    use of formal description languages such as EAST
    (ISO 15889, http://east.cnes.fr/ ) or DFDL
    (http://forge.gridforum.org/projects/dfdl-wg/)
  • a Semantic Repository with, for example, Data
    Dictionaries and Ontologies
  • a Software Repository with appropriate emulation
    capabilities
  • Each piece of digital RI is also a digital object
    which is understood either by the user's
    Knowledge Base OR by further Representation
    Information. Therefore each piece of RI also has
    a label pointing to further RI.
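
The recursion in the last point (RI labelled with further RI until the user's Knowledge Base takes over) can be sketched as a simple graph walk. This is a minimal sketch under stated assumptions: the `RepInfo` class, the `resolve_net` function and the "ri:..." identifiers are invented for illustration and stand in for whatever persistent identifiers and repository interface the DCC would define.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class RepInfo:
    identifier: str                                   # stands in for a persistent identifier
    description: str
    labels: List[str] = field(default_factory=list)   # further RI this piece depends on


def resolve_net(label: str,
                repository: Dict[str, RepInfo],
                knowledge_base: Set[str]) -> List[str]:
    """Follow labels from one piece of RI until everything remaining
    is already understood by the Knowledge Base.

    Returns the identifiers of the RI that must be supplied to the user.
    A KeyError means a label points at RI the repository does not hold.
    """
    needed: List[str] = []
    seen: Set[str] = set()
    stack = [label]
    while stack:
        current = stack.pop()
        if current in seen or current in knowledge_base:
            continue                                  # already handled, or already understood
        seen.add(current)
        ri = repository[current]
        needed.append(ri.identifier)
        stack.extend(reversed(ri.labels))             # visit dependencies in listed order
    return needed


# Hypothetical repository contents.
REPOSITORY = {
    "ri:fits": RepInfo("ri:fits", "FITS structure", ["ri:ascii", "ri:fits-dict"]),
    "ri:ascii": RepInfo("ri:ascii", "ASCII character encoding"),
    "ri:fits-dict": RepInfo("ri:fits-dict", "FITS keyword dictionary", ["ri:english"]),
    "ri:english": RepInfo("ri:english", "English language"),
}
```

With a Knowledge Base that already covers "ri:ascii" and "ri:english", only the FITS structure and its dictionary need to be delivered; with an empty Knowledge Base the whole net is returned.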

31
Designated Community
  • Techniques must be created for
  • defining a Knowledge Base
  • linking a Knowledge Base to a Designated
    Community
  • linking Representation Information to a Knowledge
    Base if possible

32
Representation Information (1)
  • Structure including Formats
  • Distinguish
  • formats which are used mainly for rendering to
    be followed by human inspection, and
  • formats used for automated processing
  • Implications
  • Representation Information Repository should
    define selected file formats using EAST and DFDL
  • Definitions should include scientific objects and
    humanities objects

33
Representation Information (2)
  • Semantics
  • Hard problem
  • start with Data Dictionaries
  • Implications
  • the Representation Information Repository should
    include Data Dictionaries, followed by more
    general semantics

34
Representation Information (3)
  • Time Dependent Information
  • Many, perhaps most, datasets change over time and
    the state at each particular moment in time may
    be important. It may be useful to break the issue
    into separate parts.
  • at each moment in time we could, in principle,
    take a snapshot and store it. That snapshot has
    its associated Representation Net.
  • efficient storage of a series of snapshots may
    lead one to store differences or include time
    tags in the data (see for example P. Buneman, S.
    Khanna, and Wang-Chiew Tan. On the Propagation of
    Deletions and Annotations through Views.
    Proc. 21st ACM Symp. on Principles of Database
    Systems.).
  • Additional Representation Information would be
    needed which describes how to get to a particular
    time's snapshot from the efficiently encoded
    version.
  • Also applies to ANNOTATION: who said what and
    when did they say it
  • Implications
  • These are areas of active research within the
    consortium and the DCC should be able to provide
  • advice and well tested tools for certain forms of
    efficient encoding of time dependent information
  • advice on annotation
  • identifiers and Representation, perhaps in the
    form of software, for the associated encodings
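
As a rough illustration of "how to get to a particular time's snapshot from the efficiently encoded version", the sketch below stores one base snapshot plus time-tagged differences and replays them. It is a generic change log written for this example, not the technique of the paper cited above; the `Change` tuple layout and the function name `snapshot_at` are assumptions.

```python
from typing import Dict, List, Optional, Tuple

# A change is (timestamp, key, new_value); new_value None means the key was deleted.
Change = Tuple[int, str, Optional[str]]


def snapshot_at(base: Dict[str, str], changes: List[Change], when: int) -> Dict[str, str]:
    """Rebuild the state of a dataset at time `when` from a base snapshot
    and a list of time-tagged differences."""
    state = dict(base)                                # never modify the stored base
    for timestamp, key, value in sorted(changes, key=lambda c: c[0]):
        if timestamp > when:
            break                                     # later changes are not yet visible
        if value is None:
            state.pop(key, None)                      # deletion
        else:
            state[key] = value                        # insertion or update
    return state


# Hypothetical example data.
BASE = {"gene1": "v1"}
CHANGES = [(10, "gene2", "v1"), (20, "gene1", "v2"), (30, "gene2", None)]
```

The replay function is itself a piece of Representation Information in the sense used here: without it, the encoded differences cannot be turned back into a usable snapshot.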

35
Representation Information (4)
  • Actions and Processes (Behaviour?)
  • Some information has, as an integral part of its
    content, an implicit or explicit process
    associated with it; this could be argued to be a
    type of semantics, however it is probably
    sufficiently different to need special
    classification. An example of this is a database
    or other time dependent or reactive system such
    as a Neural Net.
  • Emulations: Universal Virtual Computer (UVC)
  • Implications
  • Support Software emulation via a UVC (possibly
    based on JVM)
  • Support time dependent or reactive systems

36
Persistent Ids
  • Implications
  • Use of existing, or creation of new,
    infrastructure (standards, protocols, servers
    etc) for persistent IDs with adequate flexibility
    and longevity
  • as part of the succession planning, agreement
    would be needed with appropriate organisation to
    act as backup and inheritor of DCC data.

37
Archival Information Package
38
Preservation Description Info
39
AIP implications: PDI
  • define standard Preservation Metadata based
    initially on OCLC work including Michael Day's
    work and also CCLRC work etc
  • define adequate Packaging technique almost
    certainly XML based
  • define recommended tools and procedures for
    creating Fixity Information such as checksums and
    digests, together with associated Representation
    Information
  • investigate authentication systems
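
For the Fixity Information point, a minimal sketch using Python's standard `hashlib` module is shown below. The record layout (a dictionary naming the algorithm alongside the digest) and the function names are assumptions made for this example; the algorithm name is kept with the digest because it is part of the Representation Information needed to re-check the value later.

```python
import hashlib
from typing import Dict


def fixity_info(data: bytes, algorithm: str = "sha256") -> Dict[str, str]:
    """Compute a digest and record which algorithm produced it."""
    digest = hashlib.new(algorithm, data).hexdigest()
    return {"algorithm": algorithm, "digest": digest}


def verify_fixity(data: bytes, recorded: Dict[str, str]) -> bool:
    """Recompute the digest with the recorded algorithm and compare."""
    return fixity_info(data, recorded["algorithm"])["digest"] == recorded["digest"]
```

A digest only detects change; it does not say who made the change or whether the recorded value itself is trustworthy, which is where the authentication systems mentioned above would come in.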

40
Audit and Certification
  • Implications
  • facilitate production of standard(s) on which a
    certification program can be based
  • work to establish accreditation and certification
    body in preparation for offering audit and
    certification services
  • audit, certification and accreditation are
    potential sources of long term funding for the
    DCC
  • software certification will require testbeds and
    testing procedures.
  • Hardware and software systems will need to be
    purchased, hired or borrowed. The DCC associates
    would be useful partners.
  • We might expect Commercial software to be offered
    to us by the manufacturer for testing
  • Testing commercial software could be fee based.

41
Implications for Research
  • Research needed on Representation Information
    (Structure and Semantics) e.g.
  • Investigate fundamental limitations of bit-level
    descriptions and existing tools.
  • Contribute to DFDL definition
  • Investigate capabilities needed to describe
    rendered format (including Word, PDF etc)
  • Data Virtualisation: define Science objects and
    Humanities objects
  • Research is needed to
  • Support Software emulation via a UVC (possibly
    based on JVM)
  • Support time dependent or reactive systems
  • Research is needed to provide a solid basis on
    which we can develop persistent IDs with adequate
    flexibility and longevity
  • Research is needed to allow the DCC to
  • define standard Preservation Metadata based
    initially on OCLC work
  • define adequate Packaging technique almost
    certainly XML based
  • investigate authentication systems with a view to
    preparing recommendations for users and consider
    offering, for example, a (fee-based) key storage
    service.
  • A rigorous theoretical basis must be put in place
    from which we can create techniques for
  • defining a Knowledge Base
  • linking a Knowledge Base to a Designated
    Community
  • linking Representation Information to a Knowledge
    Base if possible

42
Curation Manual
  • Put in place quickly using international experts
  • Updates annually
  • Build to curation encyclopaedia

43
Document format specification
  • Borrowed from the records management tradition -
    institutions create documents in standard or
    open formats, which are easier to preserve.
  • Much easier to do in a strict records management
    environment with a published policy of retention
    schedules and a clear knowledge of internally
    produced records.
  • Stipulating a specific file format is harder in a
    research environment where a wide range of
    digital materials are produced and have to be
    preserved.
  • The move to DDI DTD in social science data world
    may be seen as an example of this preservation
    technique.

44
Services Development
  • Turns Research into Products for Research that
    our communities can use with confidence
  • tracking and testing tools and standards
  • that are correct, usable, reliable, well
    documented
  • e.g. for ingest, repository management, data
    exchange, ontologies
  • working with tool developers wherever possible
  • developing testbeds & interworking with other
    testbeds
  • aim to gain leverage formats
  • working with other projects worldwide
  • using generic tools and techniques
  • to develop strategies for emerging digital
    formats
  • Metadata standards
  • long-term viability of metadata
  • Registries underpin, to provide basis of Advisory
    Service

48
Faith in the medium
?
49
Faith in the technology