Title: Global Digital Format Registry
1Global Digital Format Registry
Archiving Web Resources: Issues for Cultural
Heritage Institutions
Information Day, National Library of Australia,
Canberra, November 12, 2004
- Stephen L. Abrams
- Digital Library Program Manager
- Harvard University Library
2Introduction
- Almost all aspects of repository operation are
conditioned by the format of the objects in the
repository
- Without proper characterization of digital
objects (format typing and technical metadata),
effective long-term preservation is difficult, if
not impossible
- Repositories need to ensure that:
- Digital object content streams are valid with
respect to their format
- Metadata encapsulated within object content
streams are consistent with externally supplied
metadata
- Formatted content streams remain accessible over
time
3Use Cases
- Identification
- I have an object; what format is it?
- Validation
- I have an object purportedly of format F; is
it?
- Characterization
- I have an object of format F; what are its
salient properties?
- Assessment
- I have an object of format F; is it at risk of
obsolescence?
- Processing
- I have an object of format F; how can I perform
operation X on it?
4Repository Format Dependencies
Based on Open Archival Information System (OAIS)
Reference Model, ISO 14721
5Characteristics of a Format Registry
- Predictable data
- Arbitrary granularity
- Inclusive
- Trustworthy
- Authoritative
- Honest broker with regard to proprietary
information
- Machine actionable discovery
- Interoperable
- Informative, not evaluative
6Global Digital Format Registry
- DLF funded two invitational workshops in 2002 to
investigate issues surrounding the establishment
of a GDFR
- National Archives, UK - NARA - National
Archives of Canada - New York University - NIST -
Online Computer Library Center - Research
Libraries Group - Stanford University -
University of Pennsylvania
- Bibliothèque nationale de France - California
Digital Library - Digital Library Federation -
Harvard University - Internet Engineering Task
Force - JISC - JSTOR - Library of Congress - MIT
7GDFR Scope
- The registry will maintain persistent,
unambiguous bindings between public identifiers
for digital formats and representation
information for those formats
8What is a Format, Anyway?
- A reversible byte-serialized encoding of an
information model
- A set of syntactic and semantic rules that:
- Map from abstract content to a sequence of bytes
- Map back from a sequence of bytes to the abstract
content represented by those bytes
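The two mappings can be illustrated with a small round trip. This is only a sketch: it picks IEEE 754 binary64 (a format named on the next slide) as the example, and the function names `encode` and `decode` are illustrative, not part of any GDFR specification.

```python
import struct


def encode(value: float) -> bytes:
    """Map abstract content (a number) to a sequence of bytes.

    Uses big-endian IEEE 754 binary64 as the example format.
    """
    return struct.pack(">d", value)


def decode(data: bytes) -> float:
    """Map a sequence of bytes back to the abstract content it represents."""
    (value,) = struct.unpack(">d", data)
    return value


# Reversibility: decoding the encoded bytes recovers the original content.
original = 2.5
serialized = encode(original)
assert len(serialized) == 8
assert decode(serialized) == original
```

Without knowledge of the rules (here, the `">d"` layout), the eight bytes are uninterpretable, which is the gap representation information is meant to close.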
9Almost Anything is a Format
- ASCII-encoded text, Excel spreadsheet, PDF
- IEEE 754 floating point number
- XML schema (and XML Schema)
- LZW compression
- ARC file, Tar archive
- Windows Portable Executable (.exe)
- NTFS file system
13When Is a Format Not a Format?
- How many words are inside the box?
- It depends
- There are two tokens of one type (or two species
of one genus, or two instances of one class, etc.)
CAT
CAT
15When Is a Format Not a Format?
- How many formats are inside the box?
- Three subtypes of one format family
- Inter-familial relationships
TIFF 4.0
TIFF/EP
TIFF/IT
16Format Family Tree
17Formal classification
- Ontological CLASSES, abstract families, concrete
formats, and relationships
- BYTESTREAM
  - IMAGE
    - STILL
      - RASTER
        - GIF
          - GIF87a
          - GIF89a new-version-of GIF87a
        - JPEG
          - ISO 10918
          - JFIF subtype-of ISO 10918
        - TIFF
          - TIFF 4.0
          - TIFF 5.0 new-version-of TIFF 4.0
          - TIFF 6.0 new-version-of TIFF 5.0
          - TIFF/EP subtype-of TIFF 6.0
          - TIFF/IT subtype-of TIFF 6.0
          - TIFF/IT/CT subtype-of TIFF/IT
          - TIFF/IT/CT/P1 subtype-of TIFF/IT/CT
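The typed relationships in this classification could be held as simple triples and traversed mechanically. The sketch below transcribes the relationships from the slide; the names `RELATIONSHIPS` and `chain` are hypothetical, and it assumes each format has at most one parent per relationship type.

```python
# (format, relationship, related format) triples transcribed from the slide.
RELATIONSHIPS = [
    ("GIF89a", "new-version-of", "GIF87a"),
    ("JFIF", "subtype-of", "ISO 10918"),
    ("TIFF 5.0", "new-version-of", "TIFF 4.0"),
    ("TIFF 6.0", "new-version-of", "TIFF 5.0"),
    ("TIFF/EP", "subtype-of", "TIFF 6.0"),
    ("TIFF/IT", "subtype-of", "TIFF 6.0"),
    ("TIFF/IT/CT", "subtype-of", "TIFF/IT"),
    ("TIFF/IT/CT/P1", "subtype-of", "TIFF/IT/CT"),
]


def chain(fmt, relationship):
    """Follow one relationship type upward from fmt, nearest relative first."""
    parents = {child: parent
               for child, rel, parent in RELATIONSHIPS
               if rel == relationship}
    result = []
    while fmt in parents:
        fmt = parents[fmt]
        result.append(fmt)
    return result


print(chain("TIFF/IT/CT/P1", "subtype-of"))
print(chain("TIFF 6.0", "new-version-of"))
```

Keeping the relationship type explicit lets a client ask separate questions, such as "what is this a subtype of?" versus "what did this version supersede?".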
18Format Subtyping
- Substitutability
- Can the subtype be substituted for its parent in
all contexts without detection or loss of
function?
- All TIFF/ITs are TIFF 6.0s, but not all TIFF
6.0s are TIFF/ITs
- Arbitrary granularity of subtype
- TIFF
- TIFF 6.0
- Baseline bitonal
- DLF Benchmark for Faithful Digital
Reproductions of Monographs and Serials
- Harvard archival master specifications
- Harvard Open Collection Program
specifications
19Format Subtyping
- MIME types represent formats at the coarsest
possible granularity
- May not be sufficient for characterizing digital
objects for purposes of preservation workflows
- Representation information for a subtype need
only detail the properties that distinguish it
from its parent; all others are inherited
- Permits selection of tools appropriate to the
task at hand
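Inheritance of representation information can be sketched as a walk up the subtype chain, with each subtype recording only what distinguishes it from its parent. The `ENTRIES` table, its property names, and `effective_properties` are hypothetical illustrations, not a proposed registry schema.

```python
# Each entry: name -> (parent name or None, properties stated at this level).
# Property names and values are illustrative placeholders.
ENTRIES = {
    "TIFF 6.0": (None, {"byte_order": "II or MM", "magic": 42}),
    "Baseline bitonal": ("TIFF 6.0", {"bits_per_sample": 1}),
}


def effective_properties(name):
    """Merge properties from the root of the subtype chain down to name.

    A subtype's own values override anything inherited from its ancestors.
    """
    lineage = []
    while name is not None:
        parent, props = ENTRIES[name]
        lineage.append(props)
        name = parent
    merged = {}
    for props in reversed(lineage):  # root first, most specific last
        merged.update(props)
    return merged


print(effective_properties("Baseline bitonal"))
```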
20Format Relationships
- Subtyping
- US-ASCII is a subtype of UTF-8
- Version
- PDF 1.0 through 1.5
- Encapsulation
- WAVE can contain A-law and μ-law audio content
streams
- Tar archive can contain anything
- Affinity
- ISO 10918-1 (JPEG) vs. ISO 10918-3 (SPIFF) vs.
ISO 14495 (JPEG-LS)
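The subtyping claim for US-ASCII and UTF-8 can be checked directly: every valid US-ASCII byte stream is also a valid UTF-8 stream with the same meaning, but not the reverse. The helper name `is_valid` is an assumption made for this sketch.

```python
def is_valid(data: bytes, encoding: str) -> bool:
    """Return True if data decodes without error under the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


ascii_stream = "Global Digital Format Registry".encode("ascii")
utf8_stream = "Bibliothèque nationale de France".encode("utf-8")

# The subtype can stand in for its parent...
assert is_valid(ascii_stream, "utf-8")
assert ascii_stream.decode("utf-8") == ascii_stream.decode("ascii")

# ...but the parent cannot always stand in for the subtype.
assert is_valid(utf8_stream, "utf-8")
assert not is_valid(utf8_stream, "ascii")
```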
21Format Representation Information
- Information that maps formatted content to more
meaningful concepts
- Syntax
- A TIFF header is composed of a two-byte string,
"II" or "MM"; a two-byte string, 0x2A00 or
0x002A; and an unsigned 32-bit integer
- Semantics
- "II" indicates little-endian byte order; "MM",
big-endian
- The two-byte string is the decimal value 42 in
the indicated byte order
- The integer is the byte offset of the first IFD
structure
- Assessment
- Factors bearing on a format's amenability to
long-term preservation
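The TIFF header rules above are concrete enough to turn into a small validator. The sketch below follows the TIFF 6.0 specification, in which "II" denotes little-endian and "MM" big-endian byte order; the function name `parse_tiff_header` and the shape of its return value are assumptions made for illustration.

```python
import struct


def parse_tiff_header(data: bytes) -> dict:
    """Apply the syntax and semantics of the 8-byte TIFF header."""
    if len(data) < 8:
        raise ValueError("a TIFF header is 8 bytes long")

    # Syntax: a two-byte byte-order mark. Semantics: it selects the
    # byte order for every multi-byte value that follows.
    order = data[:2]
    if order == b"II":
        prefix, label = "<", "little-endian"
    elif order == b"MM":
        prefix, label = ">", "big-endian"
    else:
        raise ValueError("byte-order mark must be II or MM")

    # Syntax: a two-byte value and an unsigned 32-bit integer.
    magic, first_ifd_offset = struct.unpack(prefix + "HI", data[2:8])

    # Semantics: the two-byte value must be 42 in the indicated order.
    if magic != 42:
        raise ValueError("magic number is not 42")

    return {"byte_order": label, "first_ifd_offset": first_ifd_offset}


print(parse_tiff_header(b"II\x2a\x00\x08\x00\x00\x00"))
print(parse_tiff_header(b"MM\x00\x2a\x00\x00\x00\x08"))
```

Both sample headers describe the same abstract content (first IFD at offset 8) with different byte serializations.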
22GDFR Architecture
- Not a single monolithic registry
- A distributed network of cooperating registries
- Standard protocol
- Standard abstract data model
- Implementation of the participating registries is
not prescribed
- Conformance is at the level of the protocol
23Distributed Network of Cooperating Registries
24Data Model
- General descriptive properties, including
canonical and alias identifiers for formats
- Characterization properties, detailing the
syntactic and semantic properties for formats
- Processing properties, describing systems and
services for which registered formats are inputs
or outputs
- Administrative properties, capturing important
events in a registration's provenance
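One way to picture the four property groups is as fields of a single registry record. This is a rough sketch only: the class name `FormatRecord`, every field name, and the identifier `example:tiff-6.0` are hypothetical and are not taken from the GDFR data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class FormatRecord:
    # Descriptive properties: canonical and alias identifiers.
    canonical_id: str
    aliases: List[str] = field(default_factory=list)
    # Characterization properties: syntactic and semantic details.
    characterization: Dict[str, str] = field(default_factory=dict)
    # Processing properties: systems that read or write the format.
    processing: List[str] = field(default_factory=list)
    # Administrative properties: (date, event) pairs in the provenance.
    events: List[Tuple[str, str]] = field(default_factory=list)

    def matches(self, identifier: str) -> bool:
        """True if identifier is the canonical ID or one of the aliases."""
        return identifier == self.canonical_id or identifier in self.aliases


record = FormatRecord(
    canonical_id="example:tiff-6.0",  # hypothetical identifier scheme
    aliases=["TIFF 6.0", "image/tiff"],
)
record.events.append(("2004-11-12", "registered"))
print(record.matches("image/tiff"))
```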
25Data Model
26Data Model Sources
- ISO 14721, Open archival information system --
Reference model
- OCLC/RLG Preservation Metadata Framework
- Incorporates CEDARS, NEDLIB, NLA, OAIS, OCLC,
etc.
- JISC File Format Representation and Rendering
Project
- PRONOM
- ISO/IEC 11179, Specification and standardization
of data elements
- OASIS/ebXML Registry Information Model
27Descriptive Properties
- Identifiers
- Canonical
- Alias
- Author
- Owner
- Maintainer
- Standard agent properties
- Name, title, affiliation, type, contact
information
- Ontological classification
- Relationships
- Status
28Characterization Properties
- Family
- Specification
- Bibliographic description
- Title, edition, author, publisher, date
- Identifiers
- Type
- Reference manual, technical report, standards
document
- Access regime
- Signature
- External: nominal file extension, Mac OS file
type
- Internal: magic number
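Signatures support the identification use case ("I have an object; what format is it?"). The sketch below tries internal signatures first and falls back to the external one; the tables cover only a few well-known formats, and the names `INTERNAL`, `EXTERNAL`, and `identify` are assumptions for illustration rather than a registry interface.

```python
import os

# Internal signatures: magic numbers at the start of the content stream.
INTERNAL = [
    (b"GIF87a", "GIF87a"),
    (b"GIF89a", "GIF89a"),
    (b"II*\x00", "TIFF"),   # little-endian TIFF header
    (b"MM\x00*", "TIFF"),   # big-endian TIFF header
    (b"%PDF-", "PDF"),
]

# External signatures: nominal file extensions.
EXTERNAL = {".gif": "GIF", ".tif": "TIFF", ".tiff": "TIFF", ".pdf": "PDF"}


def identify(filename: str, head: bytes):
    """Return (format, evidence), preferring internal over external signatures."""
    for magic, name in INTERNAL:
        if head.startswith(magic):
            return name, "internal"
    extension = os.path.splitext(filename)[1].lower()
    if extension in EXTERNAL:
        return EXTERNAL[extension], "external"
    return None, None


print(identify("scan.dat", b"II*\x00\x08\x00\x00\x00"))
print(identify("report.pdf", b""))
```

Note that the internal signature can identify at a finer granularity (GIF87a versus GIF89a) than the extension alone.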
29Characterization Properties
- Assessment
- Library of Congress
- Sustainability
- Disclosure
- Adoption
- Transparency
- Self-documentation
- External dependencies
- DRM
- Quality and functionality
- Cornell VRC, OCLC INFORM
30Processing Properties
- Systems and services that use formats as inputs
or outputs
- Name and version
- Vendor
- Function
- Hardware/software dependencies
31Service Model
- Interoperation services, for communication
between registries conforming to the GDFR
protocol
- Local services, which can include
- Access services, providing discovery and delivery
of format representation information
- Management services, providing mechanisms for
maintenance, technical review, and notification
- Human and machine service interfaces
32Service Model Sources
- ANSI X3.285, Metamodel for Management of
Shareable Data
- OASIS/ebXML Registry Services Specification
33Interoperation
- Registration
- Review
- IETF RFC process
- Synchronization
- OAI
- LOCKSS
- Some information may not be replicated, either as
a matter of local policy or due to access
restrictions
34Access Services
- Discovery
- Local
- Global
- Delivery
35Management Services
- Maintenance
- Create, update, delete
- Notification
- Tell me when an event of interest to me occurs
- Introspection
- Public exposure of local services, policies, and
practices
36What Happens Next?
- Prototype system demonstrating provisional data
model
- Multi-year, two-track project
37FRED: A Format Registry Demonstration
38GDFR Technical Track
- Deliverables
- Data model
- Network protocol
- Reference implementation
- Initial population
- Schedule
- Year one: Analysis, design, and prototype
- Year two: Development and deployment
- Year three: Production operation and integration
with repository workflows
39GDFR Administrative Track
- Deliverables
- Recommendations for sustainable governance
structure and business model
- Schedule
- Year one: Analysis and consultation
- Year two: White papers and consultation
- Year three: Final recommendations
40Why is This Important to You?
- The GDFR is an enabling technology underlying
digital repository operations and preservation
activities
- It permits typing of digital objects at an
appropriate level of granularity
- It enables the future recovery of the syntax and
semantics associated with typed digital objects
- It provides a mechanism to pool and redistribute
the expertise of the digital preservation
community
41More Information
hul.harvard.edu/gdfr/
tom.library.upenn.edu/fred/
stephen_abrams_at_harvard.edu