Title: Multi-Archival Syndicated Storage Platform
1Multi-Archival Syndicated Storage Platform
Bryan Beecher, University of Michigan
Director, Computing Network Services
E: bryan_at_umich.edu
W: http://www.icpsr.umich.edu/ICPSR/staff/beecher.html/
Micah Altman, Harvard University
Archival Director, Henry A. Murray Research Archive
Associate Director, Harvard-MIT Data Center
Senior Research Scientist, Institute for Quantitative Social Sciences
E: micah_altman_at_harvard.edu
W: http://maltman.hmdc.harvard.edu/
2This talk
- Roadmap
- Why replicate for preservation?
- What institutional model for replication does Data-PASS use?
- How do we build on LOCKSS to support these institutional needs?
- Collaborators and Conspirators
- Leonid Andreev, IQSS; Steve Burling, ICPSR; Jonathan Crabtree, Odum; Marc Maynard, Roper; Nancy McGovern, ICPSR
3Nexuses for Preservation Failure
- Technical
- Media failure: storage conditions, media characteristics
- Format obsolescence
- Preservation infrastructure software failure
- Storage infrastructure software failure
- Storage infrastructure hardware failure
- External Threats to Institutions
- Third party attacks
- Institutional funding
- Change in legal regimes
- Quis custodiet ipsos custodes?
- Unintentional curatorial modification
- Loss of institutional knowledge and skills
- Intentional curatorial deaccessioning
- Change in institutional mission
4Replication as Part of a Multi-Institutional Preservation Strategy
- There are potential single points of failure in technology, organization, and legal regimes
- Diversify your portfolio: multiple software systems, hardware, organizations
- Find diverse partners: diverse business models, legal regimes
http://failblog.org/2008/02/08/floppy-fail/
5Preservation is impossible to demonstrate conclusively
- Consider organizational credentials
- No organization is absolutely certain to be reliable
- Consider the trust relationships across institutions
http://flickr.com/photos/phauly/35555985/
6Data-PASS Requirements for SSP
- Policy Driven
- Institutional policy creates formal replication commitments
- Replication commitments are described in metadata, using a schema
- Metadata drives
- Configuration of replication network
- Auditing of replication network
7Requirements (more)
- Asymmetric Commitments: partners vary in
- storage commitments to replication
- size of holdings being replicated
- what holdings of other partners they replicate
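One way to picture asymmetric commitments is as a small data model in which each partner pledges a different amount of storage, holds a different amount of content, and hosts a different subset of the other partners. The sketch below is illustrative only: the `Partner` class, its field names, and the gigabyte units are assumptions made for this example, not part of the SSP design.

```python
from dataclasses import dataclass, field


@dataclass
class Partner:
    """Hypothetical record of one partner's side of the replication agreement."""
    name: str
    storage_committed_gb: int   # storage this partner pledges to replication
    holdings_gb: int            # size of its own holdings to be replicated
    replicates: set = field(default_factory=set)  # partners whose holdings it hosts


def storage_needed(partner, partners):
    """Total size of the other partners' holdings that `partner` agreed to host."""
    return sum(partners[name].holdings_gb for name in partner.replicates)


def overcommitted(partners):
    """Names of partners whose hosting obligations exceed their pledged storage."""
    return sorted(
        p.name
        for p in partners.values()
        if storage_needed(p, partners) > p.storage_committed_gb
    )
```

A check like `overcommitted` would flag a small partner that agreed to host a large partner's holdings without pledging enough storage, which is the kind of mismatch an asymmetric network has to surface.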
8Requirements (more)
- Completeness
- Complete public holdings of each partner
- Retain previous versions of holdings
- Include
- metadata
- data
- documentation
- legal agreements
9Requirements (more)
- Restoration guarantees
- Restore groups of versioned content
- to owning archive
- to replication hosts
- Institutional failure restoration: transfer entire holdings of an archive to another
10Requirements (more)
- Trust Verification
- Each partner is trusted
- to hold the public content of others (not to disseminate improperly)
- to add units to be harvested
- No partner is trusted to be super-user
- No deletion (or direct manipulation) of replication storage owned by another partner
- Legal agreements reinforce trust model
- Schema-based auditing is used to
- verify replication guarantees are met
- record replication and storage commitments
- document related TRAC criteria
11SSP Commitment Schema
- Network level
- Identification: name, description, contact, access point URI
- Capabilities: protocol version, number of replicates maintained, replication frequency, versioning/deletion support
- Human readable documentation: restrictions on content that may be placed in the network; services guaranteed by the network; Virtual Organization policies relating to network maintenance
- Host level
- Identification: name, description, contact, access point URI
- Capabilities: protocol version, storage available
- Human readable terms of use; documentation of hardware, software and operating personnel in support of TRAC criteria
- Archival unit level
- Identification: name, description, contact, access point URI
- Attributes: update frequency, plugin required for harvesting, storage required
- Terms of use: required statement of content compliance with network terms; dissemination terms and conditions
- TRAC Integration
- A number of elements comprise documentation showing how the replication system itself supports relevant TRAC criteria
- Other elements may be used to include text, or reference external text, that documents evidence of compliance with TRAC criteria
- Specific TRAC criteria are identified implicitly, and can be explicitly identified with attributes
- Schema documentation describes each element's relevance to TRAC, and mapping to particular TRAC criteria
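The three schema levels can be sketched as plain record types, which makes the shared identification block and the level-specific attributes easier to see. This is a rough illustration, not the actual schema: every class and field name below is an assumption derived from the bullet labels on this slide, and the validation rule only mirrors the "required statement of content compliance" item.

```python
from dataclasses import dataclass, fields


@dataclass
class Identification:
    """Identification block shared by network, host and archival unit (AU)."""
    name: str = ""
    description: str = ""
    contact: str = ""
    access_point_uri: str = ""


@dataclass
class Network:
    ident: Identification
    protocol_version: str
    replicates_maintained: int
    replication_frequency: str
    versioning_support: bool


@dataclass
class Host:
    ident: Identification
    protocol_version: str
    storage_available_gb: int   # unit is an assumption


@dataclass
class ArchivalUnit:
    ident: Identification
    update_frequency: str
    plugin: str                 # plugin required for harvesting
    storage_required_gb: int    # unit is an assumption
    compliance_statement: str   # required statement of compliance with network terms


def missing_fields(ident):
    """Names of identification fields that were left empty."""
    return [f.name for f in fields(ident) if not getattr(ident, f.name)]


def au_problems(au):
    """Human-readable list of gaps in one archival unit record."""
    problems = [f"missing {name}" for name in missing_fields(au.ident)]
    if not au.compliance_statement.strip():
        problems.append("missing compliance statement")
    return problems
```

An auditing tool could run a check like `au_problems` over every archival unit in a schema instance before the network is configured from it.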
12Main SSP Use Cases
- Initialization: given a schema instance, distribute AU harvesting responsibility to hosts
- Auditing: does the current host harvesting allocation history match the replication commitment in the schema?
- Recovery of hosts
- Deliver AU content to source archive
- Addition of AUs, hosts
- Growth of AUs over initial commitment
- Assumptions
- Nothing is deleted
- Resources in network grow monotonically
- Off-the-path behavior is detected automatically, resolved manually
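The initialization and auditing use cases can be sketched together. The code below assumes a simple greedy placement (largest AU first, onto the hosts with the most free storage); this strategy is chosen for the illustration and is not taken from the SSP design. The function names and the dictionary shapes are hypothetical.

```python
def allocate(aus, hosts, replicas):
    """Assign each AU to `replicas` distinct hosts.

    aus:   {au_name: size}        storage required per archival unit
    hosts: {host_name: capacity}  storage available per host
    Returns {au_name: [host_name, ...]}.
    """
    free = dict(hosts)
    allocation = {}
    # Place the largest AUs first so they are not squeezed out later.
    for au, size in sorted(aus.items(), key=lambda kv: (-kv[1], kv[0])):
        candidates = sorted(
            (h for h in free if free[h] >= size),
            key=lambda h: (-free[h], h),
        )
        if len(candidates) < replicas:
            raise ValueError(f"cannot place {replicas} replicas of {au}")
        chosen = candidates[:replicas]
        for h in chosen:
            free[h] -= size
        allocation[au] = chosen
    return allocation


def audit(allocation, aus, replicas):
    """Return the AUs whose current allocation falls short of the commitment.

    Shortfalls are only reported; in line with the assumptions above,
    resolving them is left to a person.
    """
    return sorted(
        au for au in aus if len(set(allocation.get(au, []))) < replicas
    )
```

For example, `audit` returns an empty list right after `allocate` succeeds, and names an AU as soon as one of its hosts disappears from the recorded allocation.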
13DRAFT Use Case: Initialization
14DRAFT Use Case: Auditing
15Our approach: LOCKSS
- Very easy to build and deploy
- 5 minutes
- Very easy to plug into public LOCKSS network
- 5 minutes
- Very easy to manage thereafter: it is basically an appliance
- Also easy to set up
- Grouping your CLOCKSS devices into a private network and paring the 20k-line configuration file into the right 200-line configuration file is not
- Managing a network now, not a device
16SSP LOCKSS Technology Integration
- Standard LOCKSS used for
- Harvesting
- Recovery
- New LOCKSS bulk update mechanism used for
- Initial configuration
- Adding AUs
- CLOCKSS mechanisms (certificates, cache monitor)
- Content delivery
- Optimize recovery
- Auditing
- Data-PASS customizations for schema processing
- Translating schema instance into bulk update requests
- Reporting on compliance based on cache monitor database
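The translation step can be pictured as a diff between what the schema instance asks each host to hold and what the host already holds. The sketch below is an assumption-laden illustration: the request shape (a list of AU names per host) is invented for this example and is not the LOCKSS bulk update format, and `bulk_add_requests` is a hypothetical helper name. It relies on the stated assumption that nothing is deleted, so any change that would drop an AU is rejected rather than translated.

```python
def bulk_add_requests(desired, current):
    """Compute per-host additions needed to reach the desired allocation.

    desired, current: {host_name: set of AU names}
    Returns {host_name: sorted list of AU names to add}.
    Hypothetical request shape; not the LOCKSS bulk update format.
    """
    requests = {}
    for host in sorted(set(desired) | set(current)):
        wanted = set(desired.get(host, ()))
        held = set(current.get(host, ()))
        dropped = held - wanted
        if dropped:
            # The network assumes nothing is deleted, so flag this for manual review.
            raise ValueError(f"{host} would drop {sorted(dropped)}")
        to_add = sorted(wanted - held)
        if to_add:
            requests[host] = to_add
    return requests
```

Producing only additions keeps the generated requests consistent with a network whose resources grow monotonically.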
17Progress so far
- Summer 2007: Attended the MetaArchive LOCKSS tutorial
- Very good overview of LOCKSS
- Summer 2007: SSP System Requirements Developed and Approved
- Winter 2007: First public LOCKSS network nodes built at two Data-PASS sites
- Winter 2007: SSP Replication Commitment Schema Developed
- Spring 2008: Completed test harvest of MRA collection into LOCKSS
- Spring 2008: SSP System Use Cases Developed
- Spring 2008: Prototype plugin developed to harvest Dataverse Networks
- Spring 2008: Data-PASS sites are joined into a single Private LOCKSS Network (PLN)
- Spring 2008: Met with LOCKSS developers to review use cases
- SSP will leverage functionality in the works by the LOCKSS team
18Data-PASS PLN as of June 2008
19Summary
- Replication ameliorates institutional risks to preservation
- Data-PASS requires policy-based, auditable, asymmetric replication commitments
- Formalize policy in schema
- (Re)Configure and audit LOCKSS using schema
- Replication uses standard LOCKSS mechanisms