Title: Martha Anderson
1 LC Perspective Preservation Partnerships
- Martha Anderson
- Program Officer, NDIIPP
- Office of Strategic Initiatives
- Library of Congress
- April 2005
2 Born Digital At-Risk Web Sites
- http://www.loc.gov/minerva/collect/elec2000
- http://www.loc.gov/minerva/collect/sept11
3 NDIIPP Strategic Direction
Take actions that are:
- Catalytic
  - Invest in existing strengths
- Collaborative
  - Engage partners in areas of mutual interest and expertise
- Iterative
  - Learn by doing
- Strategic
  - Broad spectrum of balanced short-term investments
4 Web of projects
Diagram labels: NARA, GPO, LC Web Projects, UIUC, IA, Preservation Partners, IIPC, AIHT, NDIIPP, CDL, States Initiative
5 Library of Congress Web Archiving Strategy
- Collaborate with partners working on the same preservation issues
- Develop collection strategies to leverage available resources
- Learn by doing
6 Collaborate with partners working on the same preservation issues
- Membership in the International Internet Preservation Consortium (IIPC)
- Cooperative projects with NDIIPP Preservation Partners
  - California Digital Library
  - University of Illinois at Urbana-Champaign
- Technical information sharing with other US government agencies
  - Government Printing Office
  - National Archives and Records Administration
7 Develop collection strategies to leverage available resources
- Collect thematically both by crawling and by acquiring collections gathered by others
Learn by doing
- Case studies and regular collection of theme-based collections
- Participate in tools development with IIPC
- Archive Ingest Handling Project
8 Challenges of collecting from the Web
- Characteristics of the resource: dynamic, deep, linked
- Intellectual property laws and regulations
- Tension of preservation vs access goals
- Degree of alignment with current collection policies for other media
- Curation strategy
- Tools for identification and selection
- Tools for collection, curation, and archiving of large web collections
9 Average Web Collection
- Begins with a theme or event
- Usually does not include commercial sites
- Starts with a list of about 200 URLs
- Is crawled by a vendor
- Yields about 1 TB of data per month
- Has a frequency of once a week
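The figures on this slide can be turned into rough per-crawl numbers. The sketch below is back-of-the-envelope arithmetic only: it assumes decimal units (1 TB = 1000 GB) and treats "once a week" as 52/12 crawls per month, neither of which is stated on the slide, and the function names are illustrative.

```python
# Rough arithmetic on the "average web collection" figures from the slide.
SEED_URLS = 200             # from the slide: list of about 200 URLs
TB_PER_MONTH = 1.0          # from the slide: about 1 TB of data per month
CRAWLS_PER_MONTH = 52 / 12  # assumption: weekly crawls averaged over a year


def per_crawl_gb(tb_per_month=TB_PER_MONTH, crawls_per_month=CRAWLS_PER_MONTH):
    """Approximate data volume of one weekly crawl, in GB (1 TB = 1000 GB assumed)."""
    return tb_per_month * 1000 / crawls_per_month


def per_seed_gb(seed_urls=SEED_URLS):
    """Approximate data volume per seed URL in one crawl, in GB."""
    return per_crawl_gb() / seed_urls


if __name__ == "__main__":
    print(f"about {per_crawl_gb():.0f} GB per weekly crawl")
    print(f"about {per_seed_gb():.2f} GB per seed URL per crawl")
```

Under these assumptions a single weekly crawl comes to roughly 230 GB, or a little over 1 GB per seed URL, which gives a sense of why tools for large web collections appear among the challenges.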
10 Web Collections to date at LC
- Event-based
  - US National Elections: 2000, 2002, 2004
  - War in Iraq
  - September 11
- Public Policy Topics
  - Health Care
  - Legislative Branch
  - Terrorism
- 26 TB
11 Archive Ingest Handling Test
- AIHT is a first test of the proposed NDIIPP preservation architecture.
- The test is conducted with a common data set.
  - George Mason University 9/11 Archive
- Phase I tests ingest and data handling in local systems.
- Phase II tests export and import between institutions.
- Phase III explores format migration.
12 GMU 9/11 Archive
Participants exchange archive
Participants demonstrate capabilities
13 Participants
- Old Dominion University, Department of Computer Science
- Stanford University Libraries and Academic Information Resources
- The Johns Hopkins University, Sheridan Libraries
- Harvard University Library
14 George Mason University 9/11 Archive Breakdown by File Types
57,450 files, 12 GB. Originally stored in a Linux environment.
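A breakdown by file type like the one on this slide can be approximated by tallying file extensions. The following is a minimal sketch, not the method the AIHT participants used: the function names and the sample paths are hypothetical, and extension is only a rough proxy for format. The average-size helper assumes decimal units (1 GB = 1,000,000 KB).

```python
from collections import Counter
from pathlib import PurePosixPath


def breakdown_by_extension(paths):
    """Count files per lower-cased extension; files without one go under '(none)'."""
    counts = Counter()
    for path in paths:
        suffix = PurePosixPath(path).suffix.lower()
        counts[suffix or "(none)"] += 1
    return counts


def average_file_size_kb(total_gb, file_count):
    """Mean file size in KB, assuming decimal units (1 GB = 1,000,000 KB)."""
    return total_gb * 1_000_000 / file_count


if __name__ == "__main__":
    # Hypothetical sample paths, for illustration only.
    sample = ["a/story1.html", "a/story2.HTML", "img/photo.jpg", "notes"]
    print(breakdown_by_extension(sample))
    # Figures from the slide: 57,450 files, 12 GB.
    print(f"{average_file_size_kb(12, 57_450):.1f} KB per file on average")
```

With the slide's figures, the average works out to roughly 209 KB per file, which suggests an archive dominated by many small files rather than a few large ones.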
15 Goals of AIHT
- Gain practical experience with multiple institutions
- Document transfer and ingest processes for multiple systems
- Determine next set of tasks for developing interfaces between layers and institutions
16 Status of AIHT
- All phases completed.
- Imports focused on technical assessment of the archive and developing tools to examine the archive
- Exports included METS and MPEG-21 DID objects
- Migrations included transforms to JPEG 2000, TIFF, and some exploration of HTML to XML and AVI to MPG
- Full report expected by early summer.
17 For more information
- NDIIPP Technical Architecture version 0.2
  - http://www.digitalpreservation.gov
- International Internet Preservation Consortium: http://netpreserve.org/about/index.php
- MINERVA (Mapping the INternet Electronic Resources Virtual Archive): http://www.loc.gov/minerva/
18
- Martha Anderson
- NDIIPP Program Officer
- Office of Strategic Initiatives
- The Library of Congress
- Washington, DC
- mande_at_loc.gov