Title: Planning to Maximize Longevity of Digital Information
1. Planning to Maximize Longevity of Digital Information
- Howard Besser
- UCLA School of Education & Information
- http://www.gseis.ucla.edu/howard
2. Planning to Maximize Longevity of Digital Info
- Access and Preservation
- Why are you Managing this Information?
- Key Considerations for Imaging Projects
- Important Planning Considerations
- Models for Digital Collections
- Importance of Metadata Standards
- Digital Longevity Issues
- More Planning Issues
3. Access and Preservation
- Digitizing can serve both Access and
Preservation
- E.g. Access to digital surrogates saves wear and
tear on originals
- But Digitization for Access can be quite
different from Digitization for Preservation
- Level of detail, scanning quality, extensiveness
of resources
- And long-term retention of digital works is
still an open issue
4. Why are you Managing this Information?
- Organizational mission type
- Users
- Uses
5. Key Considerations for Imaging Projects
- Users' Needs
- Image Quality
- Intellectual Property
- Standards
- Topology
- Tools & Processes
6. Key Considerations for Imaging Projects (1 of 3)
- Users' Needs
- Quality of Digital Surrogate
- Interoperable desktop applications
- Image Quality
- Archival
- Current online delivery
7. Key Considerations for Imaging Projects (2 of 3)
- Intellectual Property
- Standards
- Modular and Layered Architecture
- Terminology
- Technical imaging information
- Topology
8. Key Considerations for Imaging Projects (3 of 3)
- Tools & Processes
- Scanners
- Compression techniques
- Linking files
- Workflow
- Interoperable desktop applications
9. Some nuts-and-bolts Planning Considerations
- Think about users (and potential users), uses,
and type of material/collection
- Scan at the highest quality that does not exceed
what the likely potential users, uses, and material require
- Do not let today's delivery limitations influence
your scanning file sizes; understand the
difference between digital masters and derivative
files used for delivery
- Many documents that appear to be bitonal
are actually better represented with greyscale
scans
- Include color bar and ruler in the scan
- Use objective measurements to determine scanner
settings (do NOT attempt to make the image look good
on your particular monitor or use image
processing to color correct)
- Don't use lossy compression
- Store in a common (standardized) file format
- Capture as much metadata as is reasonably
possible (including metadata about the scanning
process itself)
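The advice above about masters versus derivatives is easier to weigh with a rough size estimate in hand. The following is a minimal sketch, not part of the original slides: the helper name `uncompressed_bytes` and the sample values (an 8 x 10 inch original scanned at 600 dpi) are illustrative assumptions, and real file formats add headers and row padding that this estimate ignores.

```python
def uncompressed_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Estimate the uncompressed size of a scan in bytes (hypothetical helper)."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    # Round up to whole bytes (matters for 1-bit bitonal scans).
    return (pixels * bits_per_pixel + 7) // 8


# Illustrative values only: an 8 x 10 inch page at 600 dpi.
for label, bpp in [("bitonal", 1), ("8-bit greyscale", 8), ("24-bit color", 24)]:
    size = uncompressed_bytes(8, 10, 600, bpp)
    print(f"{label}: {size / 1_000_000:.1f} MB")
```

Under these assumptions a greyscale master is eight times the size of a bitonal one, which is a storage cost to plan for rather than a reason to scan at lower quality.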
10. Why Scale is important
11. Important Planning Considerations
- File Formats
- Choosing Interoperable Systems
- Adhere to standards
- Vendors with large installed base
- Refreshing and/or Migration
12. Key problems we're facing
- Discovery
- Longevity
- Interoperability
13. Serious Longevity Problems
- What we know from prior widespread digital file
formats
- Images separating from their metadata
- Inaccessibility of software needed to view an
image
- Inability to even decode the file format of an
image
- (We return to the Longevity problem later)
14. Traditional Digital Library Model
15. Ideal Digital Library Model
16. For Interoperability, Digital Libraries Need
Standards
- Descriptive Metadata for consistent description
- Discovery Metadata for finding
- Administrative Metadata for viewing and
maintaining
- Structural Metadata for navigation
- ... Terms & Conditions Metadata for controlling
access ...
17. Why are Standards and Metadata consensus
important?
- Managing digital files over time
- Longevity
- Interoperability
- Veracity
- Recording in a consistent manner
- Will give vendors incentive to create
applications that support this
18. Why Standards?
- Why do we need standards?
- To make information universally available to
users
- To facilitate sharing and interchange of
information
- To preserve information (make it safe from
changes in hardware and software)
- Standards only work if communities widely accept
them, but they're necessary for communities to
work together
19. Questions to Ask
- What communities is this standard designed for?
- What type of information is this standard
designed to handle?
- What functions is this standard designed to
serve?
- What previous standards is it built upon?
- Does the standard prescribe how to create new
records (or parts of records), or how to map from
existing records?
- How far does the standard go? Does it
define element sets? Semantics? Rules? Syntax?
20. Semantics/Syntax/Structure
- Semantics
- meaning, as defined by a community to meet their
particular needs (DC)
- Syntax
- a systematic arrangement of data elements for
machine processing
- facilitates the exchange and use of metadata
among various applications (HTML, XML, RDF)
- Structure
- a formal arrangement of the syntax with the goal
of consistent representation of the semantics
(rules defining field contents like 1/11/99)
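The three layers on this slide can be illustrated in a few lines. This is a minimal sketch under stated assumptions, not something from the original presentation: the `record` wrapper element and the `build_record` helper are hypothetical, the sample values are invented, and "1/11/99" is read here as 11 January 1999 purely for illustration. Dublin Core element names supply the semantics, XML supplies the syntax, and a date-format rule (ISO 8601, YYYY-MM-DD) supplies the structure.

```python
import xml.etree.ElementTree as ET
from datetime import date

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace
ET.register_namespace("dc", DC_NS)


def build_record(title, creator, created):
    """Return a small XML record using Dublin Core element names."""
    record = ET.Element("record")  # hypothetical wrapper element
    fields = [
        ("title", title),
        ("creator", creator),
        # Structure rule: dates are always written as YYYY-MM-DD.
        ("date", created.isoformat()),
    ]
    for name, value in fields:
        ET.SubElement(record, f"{{{DC_NS}}}{name}").text = value
    return ET.tostring(record, encoding="unicode")


xml_text = build_record("Sample photograph", "Unknown", date(1999, 1, 11))
print(xml_text)
```

Writing the date through one rule removes the ambiguity of a value like 1/11/99, which different communities would read as January 11 or November 1.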
21. The Short Life of Digital Info: Digital Longevity
Problems
- Disappearing Information
- The Viewing Problem
- The Scrambling Problem
- The Inter-relation Problem
- The Custodial Problem
- The Translation Problem
22. The Viewing Problem
- Digital Info requires a whole infrastructure to
view it
- Each piece of that infrastructure is changing at
an incredibly rapid rate
- How can we ever hope to deal with all the
permutations and combinations?
23. The Scrambling Problem: Dangers from
- Compression to ease storage & delivery
- Container Architecture to enhance digital commerce
24. The Inter-relation Problem
- Info is increasingly inter-related to other
info
- How do we make our own Info persist when it
points to and integrates with Info owned by
others?
- What is the boundary of a set of information (or
even of a digital object)?
25. The Custodial Problem
- How do we decide what to save?
- Who should save it?
- How should they save it?
- methods for later access: emulation, migration,
etc.
- issues of authenticity and evidence
26. The Translation Problem
- Content translated into new delivery devices
changes meaning
- A photo vs. a painting
- If Info is produced originally in digital form
in one encoded format, will it be the same when
translated into another format?
- Behaviors
27. Pieces of the Solution (1/2)
- We need to insist upon clearly readable
standardized ways for digital objects to
self-identify their formats
- We should discourage scrambling
- We need to better understand how information
inter-relates to other Info, and what constitutes
boundaries of Info objects
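Many existing formats already self-identify in a limited way through a fixed byte signature at the start of the file. The sketch below is an illustration of that idea, not the author's proposal: the function names `sniff_format` and `sniff_file` are hypothetical, and the table covers only a handful of well-known image signatures.

```python
# Leading byte signatures ("magic numbers") for a few common image formats.
SIGNATURES = [
    (b"II*\x00", "TIFF (little-endian)"),
    (b"MM\x00*", "TIFF (big-endian)"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
]


def sniff_format(header: bytes) -> str:
    """Guess a format from the first bytes of a file."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"


def sniff_file(path) -> str:
    """Read just enough of a file to check its signature."""
    with open(path, "rb") as f:
        return sniff_format(f.read(16))


print(sniff_format(b"II*\x00\x08\x00\x00\x00"))
```

A signature says nothing about compression scheme or required software, which is why the slide asks for clearer, standardized self-identification.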
28. Pieces of the Solution (2/2)
- People and organizations wishing to make
information persist need guidelines of how to go
about doing it
- We need to better understand how translating
from one storage or display format to another
affects the meaning of a work
- We need to save the behaviors of a digital
object, not just its contents
29. Conceptual Approaches to Digital Preservation
- Refreshing: always necessary due to volatility of
physical strata
- Impact on evidential value
- Migration: advantages & disadvantages
- Emulation: advantages & disadvantages
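Refreshing copies the same bits onto new media, so one way to address the evidential-value concern is to show that the copy is bit-identical to the source. The following is a minimal sketch assuming a SHA-256 checksum as the fixity measure; the function names `sha256_of` and `refresh` are hypothetical, and the slides do not prescribe any particular method.

```python
import hashlib
import shutil


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def refresh(src, dst):
    """Copy src to dst and verify that the copy is bit-identical."""
    before = sha256_of(src)
    shutil.copyfile(src, dst)
    after = sha256_of(dst)
    if before != after:
        raise OSError(f"fixity check failed for {dst}")
    return after  # record this digest alongside the file's other metadata
```

Migration cannot be verified this way, because the bits change by design; that difference is part of why its impact on evidential value is harder to assess.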
30. Metadata can be the first line of defense
- Can tell you
- where the file is (if you can't find the file)
- where more info about the file is (if you have
the file but most other metadata has become
separated)
- what the file format is
- what the compression scheme is
- what application program and version is needed
for the file
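The items on this slide map naturally onto a small record kept with, or inside, each file. This is a minimal sketch: the `FileMetadata` class, its field names, the JSON serialization, and all sample values are assumptions made for illustration, not a published metadata schema.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class FileMetadata:
    """Hypothetical record covering the items listed on the slide."""
    location: str             # where the file is
    more_info: str            # where more info about the file is
    file_format: str          # what the file format is
    compression: str          # what the compression scheme is
    application: str          # application program needed for the file
    application_version: str  # version of that application


def to_sidecar(meta: FileMetadata) -> str:
    """Serialize the record as JSON text for storage next to the file."""
    return json.dumps(asdict(meta), indent=2, sort_keys=True)


# Invented sample values, for illustration only.
example = FileMetadata(
    location="masters/0001.tif",
    more_info="catalog/0001.xml",
    file_format="TIFF 6.0",
    compression="none",
    application="ExampleViewer",
    application_version="1.0",
)
print(to_sidecar(example))
```

Keeping a copy of this record both with the file and in a separate catalog guards against the case where images and their metadata become separated.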
31. Groups Working on the Big Problem (http://sunsite.berkeley.edu/Longevity/)
- CPA Task Force
- Getty Time & Bits Conference Follow-ups
- Emulation experiments in US and Europe
- NEDLIB, CURL, Michigan
- Mellon-funded E-Journal Archive experiments
- Internet Archive
- Long Now
32. Time & Bits
33. Time & Bits Participants
- Stewart Brand
- Howard Besser
- Brian Eno
- Danny Hillis
- Peter Lyman
- Brewster Kahle
- Kevin Kelly
- Jaron Lanier
- Doug Carlston
- John Heilemann
- Ben Davis
- Margaret MacLean
- Bruce Sterling
- Paul Saffo
34. Groups Working on Pieces of the Big Problem (http://sunsite.berkeley.edu/Longevity/)
- Internet Archive
- Long Now
- Emulation experiments in US and Europe
- NEDLIB, CURL, Michigan
35. Journal Archiving
- License, don't own; may not even be able to
obtain the right to make an archival copy
- Increasingly no paper back-up at all
- Usually we don't have the important redundancy
factor
- Stanford's LOCKSS Project (Lots of Copies Keep
Stuff Safe) and its problems (http://lockss.stanford.edu)
36. Migration/Refreshing
- Impact on evidential value
37. More Planning Issues
- Image Families
- Behaviors
- Persistent Identification
38. Identification/Provenance (Images)
- The number of variant forms of a work can be
enormous
- Image Families
- A digital image frequently has many layers of
parentage
- Information about the parentage that can indicate
the quality and veracity of the image (Dublin
Core "Source" and "Relation")
- how to deal with different versions derived from
the same scan or different encoding schemes
- Vocabulary Standards to express this
39. The number of variant forms of a work can be
enormous
- different views of the same object
- different scans of the same photo
- different resolutions
- different compression schemes
- different compression ratios
- different file storage formats
- different details of the same image
- ...
40. Image Families
41. Identification/Provenance
- how to deal with different versions (browse,
hi-res, medium res) derived from the same scan or
different encoding schemes (TIFF, PICT, JFIF)
- Vocabulary Standards to express this
- VRA Surrogate Categories
- CIMI's "Image Elements"
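One way to picture an image family is as a set of versions, each recording the version it was derived from. The sketch below is illustrative only: the `ImageVersion` class, its fields, and the sample identifiers are hypothetical and are not taken from the VRA or CIMI vocabularies. The `source` field plays a role loosely similar to Dublin Core "Source".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ImageVersion:
    identifier: str
    role: str                     # e.g. "master", "hi-res", "browse"
    file_format: str              # e.g. "TIFF", "JFIF"
    source: Optional[str] = None  # identifier of the parent version, if any


def lineage(versions, identifier):
    """Return identifiers from the given version back to its oldest ancestor."""
    by_id = {v.identifier: v for v in versions}
    chain = []
    current = by_id.get(identifier)
    # Stop at a missing parent, and guard against accidental cycles.
    while current is not None and current.identifier not in chain:
        chain.append(current.identifier)
        current = by_id.get(current.source) if current.source else None
    return chain


# Invented sample family: one master scan and two derivatives.
family = [
    ImageVersion("scan-001", "master", "TIFF"),
    ImageVersion("scan-001-hi", "hi-res", "JFIF", source="scan-001"),
    ImageVersion("scan-001-browse", "browse", "JFIF", source="scan-001-hi"),
]
print(lineage(family, "scan-001-browse"))
```

Walking the chain back to the master gives a user some basis for judging the quality and veracity of the derivative in front of them.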
42. MOA II Behaviors
43. MOA II Best practices
- Use/Users/Collection
- Benchmarking
- Masters vs. Derivatives
- Scanning
- Administrative Metadata
- Structural Metadata
44. To deal with Immediately
45. Persistent IDs--the Problem
- Need to separate work ID from work location
- URNs probably won't be ready until 2003
- Becomes a business process issue when one
organization maintains the resource and another
organization references it (i.e., licensed from
vendors or managed by separate administrative
structures)
46. More Persistent IDs--the Approach for today
- PURLs
- Handles
- HTTP redirects
- And worry about costs now and conversion costs
when URNs become feasible
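PURLs, Handles, and HTTP redirects share one idea: citations carry a stable identifier, and a lookup table maps it to wherever the resource currently lives. The toy resolver below sketches that indirection only; the `Resolver` class, the identifier, and the example.org URLs are invented for illustration, and this is not how PURL or Handle servers are actually implemented.

```python
class Resolver:
    """Toy resolver: maps a persistent ID to its current location."""

    def __init__(self):
        self._locations = {}

    def register(self, pid, url):
        """Set or update the current location for a persistent ID."""
        self._locations[pid] = url

    def resolve(self, pid):
        """Return (status, url) in the style of an HTTP redirect."""
        url = self._locations.get(pid)
        if url is None:
            return 404, None
        return 302, url


resolver = Resolver()
resolver.register("id/photo-0001", "http://old.example.org/photos/0001.tif")
# The resource moves: only the resolver table changes, not the citations.
resolver.register("id/photo-0001", "http://new.example.org/archive/0001.tif")
print(resolver.resolve("id/photo-0001"))
```

The ongoing cost the slide warns about is visible even here: someone has to keep the table current every time a resource moves.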
47. Data Set Management: More issues with referencing
IDs
- References for mirror sites
- References for back-up sites when main site is
down or bottle-necked
- References for off-site copies and archival
copies
48. One Final Question: Who will collect the digital
works of today that should become the Special
Collections of tomorrow?
- web sites
- zines
- electronic journals
- listserv and email discussions
- drafts of works that later become famous
49. Planning to Maximize Longevity of Digital Information
- Howard Besser
- UCLA School of Education & Information
- http://sunsite.berkeley.edu/Longevity/
- http://www.gseis.ucla.edu/howard
- http://sunsite.berkeley.edu/moa2
- http://lockss.stanford.edu
- http://www.longnow.com/10klibrary/TimeBitsDisc/
- http://www.archive.org/