Title: ERA Project: Research Testbed and Related Outcomes
1 ERA Project: Research Testbed and Related Outcomes
- Joseph JaJa, Mike Smorul, and Sangchul Song
- Institute for Advanced Computer Studies
- Department of Electrical and Computer Engineering
- University of Maryland, College Park
2 Background
- Started as an ERA project focusing on setting up and testing a distributed archiving infrastructure.
- Outcomes include the development of archiving tools and services that are scalable and platform independent.
- Complementary research efforts have more recently been supported by NSF, the Library of Congress, and the Mellon Foundation.
3 Transcontinental Persistent Archive Prototype (TPAP): ERA Project
- Partnership between NARA, the San Diego Supercomputer Center, and the University of Maryland.
- A distributed testbed built on a set of heterogeneous grid bricks linked by the SRB data grid technology.
- Outcomes: scalable, platform-independent tools and technologies tested and evaluated over TPAP.
4 Outcomes of the ERA Research Testbed
- Empirical testing and evaluation of technologies using extensive NARA-selected collections.
- Flexible software environment for ingestion and for handling producer-archive interactions (PAWN). Developed in extensive collaboration with NARA.
- Release of Version 1.0 of ACE: a tool for policy-driven auditing to ensure the long-term authenticity of the digital holdings of an archive.
- Tracking and monitoring tool for the digital holdings of an archive, part of the ERA TPAP Project.
5 Overall Methodology: ADAPT
- Layered digital object architecture and a set of modular tools built using open standards and web technologies.
- Can easily accommodate emerging standards and policies.
- Will evolve gracefully as the underlying technologies change.
- Evaluation and demonstration of tools on widely different collections.
6 Software Developed and Tested on TPAP
7 PAWN (Producer Archive Workflow Network): A Collaborative Effort with NARA
- Software that provides a flexible and customizable ingestion framework.
- Handles the process in a reliable and secure fashion, from package assembly to archival storage.
- Simple interface for end-users.
- Flexible interface for archive managers.
- Designed for use in multiple contexts.
8 Overall Organization
- Producers are organized into domains; each domain contains a transfer agreement negotiated with the archive.
- Each domain contains a hierarchical organization of data grouped into record sets/templates (convenient groupings from the transfer agreement).
- An end-user operates within a domain, with record sets associated with the account.
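The organization on this slide can be pictured as a small data model. The following is a minimal sketch only: the class names, the slash-separated paths, and the `grant`/`can_use` helpers are illustrative assumptions and are not taken from PAWN itself.

```python
from dataclasses import dataclass, field


@dataclass
class RecordSet:
    """A named grouping of data taken from the transfer agreement (hypothetical model)."""
    name: str
    children: list["RecordSet"] = field(default_factory=list)

    def walk(self, prefix=""):
        """Yield the slash-separated path of this record set and of every descendant."""
        path = f"{prefix}/{self.name}"
        yield path
        for child in self.children:
            yield from child.walk(path)


@dataclass
class Domain:
    """A producer domain: one transfer agreement plus a hierarchy of record sets."""
    name: str
    transfer_agreement: str
    record_sets: list[RecordSet] = field(default_factory=list)
    # Maps an end-user account to the record-set paths associated with it.
    accounts: dict[str, set[str]] = field(default_factory=dict)

    def paths(self):
        """Return every record-set path defined in this domain."""
        return [p for rs in self.record_sets for p in rs.walk()]

    def grant(self, user, path):
        """Associate an existing record set with a user account."""
        if path not in self.paths():
            raise KeyError(path)
        self.accounts.setdefault(user, set()).add(path)

    def can_use(self, user, path):
        """Check whether the record set is associated with the user's account."""
        return path in self.accounts.get(user, set())
```

In this sketch an end-user only ever sees the record sets granted within one domain, which mirrors the account-to-record-set association described on the slide.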
9 Producer-Archive Agreement
10 Package Workflow Overview
- Create Producer-Archive Agreement and client package template.
- Create package based on template.
- Once approved, packages can be archived.
- Rejected packages can be held until rectified or deleted for resubmission.
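The package lifecycle on this slide reads naturally as a small state machine. The sketch below is an illustration under stated assumptions: the state names, the explicit "submitted" step, and the `Package` class are hypothetical and do not come from the PAWN code base.

```python
from enum import Enum


class State(Enum):
    DRAFT = "draft"          # package created from the template
    SUBMITTED = "submitted"  # assumed intermediate step, awaiting review
    APPROVED = "approved"
    REJECTED = "rejected"    # held until rectified or deleted
    ARCHIVED = "archived"
    DELETED = "deleted"


# Allowed moves between states (assumed; the slide names only the outcomes).
TRANSITIONS = {
    State.DRAFT: {State.SUBMITTED},
    State.SUBMITTED: {State.APPROVED, State.REJECTED},
    State.APPROVED: {State.ARCHIVED},
    State.REJECTED: {State.SUBMITTED, State.DELETED},
    State.ARCHIVED: set(),
    State.DELETED: set(),
}


class Package:
    """A package built from a template that moves through the workflow."""

    def __init__(self, template):
        self.template = template
        self.state = State.DRAFT
        self.history = [State.DRAFT]

    def move_to(self, new_state):
        """Apply a transition, refusing any move the workflow does not allow."""
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(
                f"cannot go from {self.state.value} to {new_state.value}"
            )
        self.state = new_state
        self.history.append(new_state)
```

A rejected package stays in `REJECTED` until it is either resubmitted after being rectified or deleted, and only an approved package can reach `ARCHIVED`.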
11 Customizable Components
- Definable roles: actions in PAWN can be grouped to create arbitrary types of users.
- Flexible approval requirements: signature requirements can be placed on parts of a package.
- Automated processing: an API for creating processes to validate, transform, approve, or publish items in a package. Processes can be invoked manually or automatically, and may have dependencies on item approval.
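Two of these ideas, roles as groups of actions and processes that depend on item approval, can be sketched in a few lines. Everything named here (the role and action names, `Process`, `run_processes`) is a hypothetical illustration, not the PAWN API.

```python
from dataclasses import dataclass
from typing import Callable

# A role is just a named group of permitted actions (names are made up).
ROLES = {
    "submitter": {"create_package", "upload_item"},
    "reviewer": {"view_package", "approve_item", "reject_item"},
}


def allowed(role, action):
    """Return True if the role's action group contains the action."""
    return action in ROLES.get(role, set())


@dataclass
class Process:
    """A processing step applied to an item in a package."""
    name: str
    run: Callable[[str], str]
    requires_approval: bool = False  # models a dependency on item approval


def run_processes(item, approved, processes):
    """Run each process in order, skipping those that still wait on approval.

    Returns the transformed item and the names of the processes that ran.
    """
    ran = []
    for process in processes:
        if process.requires_approval and not approved:
            continue
        item = process.run(item)
        ran.append(process.name)
    return item, ran
```

Defining a new type of user then amounts to adding another entry to the action groups, and a process such as publishing can be held back until the item is approved.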
12 PAWN Summary
- Flexible environment to handle ingestion between many producers and an archive.
- Very little effort for producers to push their data into the archive.
- Granular workflow definition, ranging from fully automated to completely manual.
- Easy to include new standards (metadata, packaging, etc.).
- Tested extensively in the TPAP environment.
- Interest from different communities, including NDIIPP.
13 ACE: Auditing Control Environment
- Software to protect the integrity of digital assets in the long term, in the face of hardware/media degradation, security breaches and malicious alterations, infrequent access to most data, and the evolution of cryptographic schemes.
- Underpinnings are based on rigorous cryptographic techniques.
- Scalable, cost-effective, and can interoperate with any archiving architecture.
14 ACE Basic Methodology
- Three-tiered cryptographic information:
- An integrity token (IT) for each digital object is generated upon its deposit into the archive (1 KB per object).
- Cryptographic summary information (CSI) is periodically computed over the generated integrity tokens (100 MB/year).
- Very compact cryptographic summaries (witnesses) are generated periodically (2-3 KB/year).
- Each tier is periodically audited separately according to policies set by managers.
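The three tiers can be illustrated with plain hashing. The slide does not say how tokens, summaries, or witnesses are computed, so the sketch below makes its own assumptions: SHA-256 as the hash, a Merkle-style tree to fold many digests into one, and the function names. Real integrity tokens would also carry more than a bare digest, which the 1 KB figure suggests.

```python
import hashlib


def sha256(data: bytes) -> bytes:
    """Return the raw SHA-256 digest of data."""
    return hashlib.sha256(data).digest()


def merkle_root(leaves):
    """Fold a non-empty list of digests into one digest, pairing left to right."""
    if not leaves:
        raise ValueError("need at least one leaf")
    level = list(leaves)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last digest on odd-sized levels
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


def integrity_token(obj: bytes) -> bytes:
    """Tier 1 (assumed form): a digest taken when the object is deposited."""
    return sha256(obj)


def round_summary(tokens):
    """Tier 2 (assumed form): one summary over the tokens issued in a period."""
    return merkle_root(tokens)


def witness(summaries):
    """Tier 3 (assumed form): one compact value over a period's summaries."""
    return merkle_root(summaries)
```

Because each tier is a digest of the tier below, changing any object changes its token, the summary covering that token, and the witness covering that summary. That dependency is what lets each tier be audited separately.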
15 ACE System Architecture
16 Software Developed in Version 1.0
- Audit Local Files: Audit Manager software periodically audits files as specified by the archive manager.
- Audit Local Manager: an independent IMS can verify the correctness of the local audit manager.
- Independent Auditing: any third party can audit the IMS using the published witness values.
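The first of these functions, auditing local files, boils down to recomputing a digest and comparing it with the value recorded at deposit. The sketch below covers only that step and is not the Audit Manager's code: the `registered` mapping, the status strings, and the choice of SHA-256 are assumptions for illustration.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks to bound memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def audit(registered):
    """Audit files against digests recorded at deposit.

    registered maps a file path to its recorded hex digest (assumed layout).
    Returns a dict mapping each path to "ok", "corrupt", or "missing".
    """
    report = {}
    for path, expected in registered.items():
        if not Path(path).is_file():
            report[path] = "missing"
        elif file_digest(path) != expected:
            report[path] = "corrupt"
        else:
            report[path] = "ok"
    return report
```

A scheduler could call `audit` at whatever interval the archive manager's policy specifies; the other two functions on this slide would then check the auditor itself rather than the files.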
17 ACE Summary
- Third-party auditable
- Cryptographically rigorous yet cost-effective
- Update-aware
- Highly interoperable
- Scalable
- High Performance
- Easily configured
- Version 1.0 just released after extensive testing on large collections using the ERA research testbed.
- Currently running on the Chronopolis testbed.
18 Tracking and Replication Monitoring: ERA Project in Support of TPAP
- Portal that provides an overview of a collection's status across different zones.
- Ensures that new objects are replicated to relevant sites.
- Tracks files at master locations and periodically copies new files to replica sites.
- Logs actions on a collection and errors during replication.
- Currently in use on TPAP and Chronopolis.
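The track-and-copy behavior can be sketched with local directories standing in for master and replica sites. This is an assumption made to keep the example self-contained: the actual tool works over a data grid, and the function name and return values here are hypothetical.

```python
import logging
import shutil
from pathlib import Path

log = logging.getLogger("replication")


def replicate_new_files(master, replicas):
    """Copy files present under master but absent from each replica directory.

    Returns (copied, errors): copied holds (relative path, replica) pairs,
    errors holds (relative path, replica, message) triples.
    """
    master = Path(master)
    copied, errors = [], []
    for src in sorted(p for p in master.rglob("*") if p.is_file()):
        rel = src.relative_to(master)
        for replica in map(Path, replicas):
            dst = replica / rel
            if dst.exists():
                continue  # already replicated at this site
            try:
                dst.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(src, dst)
                copied.append((rel.as_posix(), str(replica)))
                log.info("copied %s to %s", rel, replica)
            except OSError as exc:
                errors.append((rel.as_posix(), str(replica), str(exc)))
                log.error("failed to copy %s to %s: %s", rel, replica, exc)
    return copied, errors
```

Running the function on a schedule gives the periodic behavior described on the slide, and a second run with nothing new copies nothing. The log records both the actions taken and any replication errors.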
19 Other Technologies
- PAWN related:
- APIs for different packaging technologies (METS and XFDU).
- ICDL Book Builder: interface to enable bulk ingestion of digital objects already managed by a database.
- FOCUS (FOrmat CUration Service): a scalable and secure registry for persistent information and services applied to formats.
20 Conclusion
- Partnership with NARA has been critical in enabling an extensive research program.
- Focus has been on empirical testing on a distributed research testbed using a wide variety of NARA collections.
- Outcomes include the development of tools and services in support of ingestion and preservation for long-term archives.
- Recent expansion into new areas such as web archiving, information discovery, and access.