Title: Digital Preservation at HUL
1Digital Preservation at HUL & DRS 2
- HMS Countway Library
- Andrea Goethals
- July 20, 2009
2Agenda
- The problem
- What are we doing about it?
- DRS 2
- Open for questions
31. The problem
4The problem is twofold
1. Keeping the bits safe
2. Keeping the bits useful to people
5Keeping the bits safe
- Digital things are amazingly easy to destroy
- Bad people
- Software or hardware failure
- Human mistakes
- Destruction is not always apparent
- Data not used frequently is at risk of unnoticed damage
- Some damage is not noticeable to human eyes and ears
6Keeping the bits useful to people
- Digital material is fragile
- Humans are dependent on technology to interpret the content...
- Technologies must understand the format of the content
- Technologies age and disappear!
7Using information content
- Analog book: unmediated use
- Digital book: technology-mediated use
8Formats are key to determining usability
Formats are the bridge between the content we want to preserve and supporting technologies
[Diagram labels: digital content, supporting technologies]
92. What are we doing about it?
10Keeping the bits safe
- Store the bits in multiple copies, in multiple places
- Make sure the bits are not corrupt
- Replace media periodically
- Restrict who can access the bits
- Be able to recover the bits!
11Keeping the bits safe at HUL
- 3-4 copies of each file, 2 different media
- 1-2 (tape and sometimes disk): 60 Oxford Street, Cambridge
- 1 (disk): Summer Street, Boston
- 1 (tape): Southborough
12Keeping the bits safe at HUL
- Automated integrity monitoring
- Drscheck script
- Compares the MD5 of each file at the Summer Street location to the MD5 stored in a database
- Also checks the 60 Oxford Street disk copy
- A copy of each file checked every 2 weeks
- Recent enhancement: trigger on database update of MD5
- Storage media replaced every 4-5 years
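The integrity check described on this slide can be sketched roughly as follows. This is a minimal illustration of comparing a freshly computed MD5 with a recorded one, not the actual Drscheck script. The function names and the dictionary standing in for the checksum database are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_fixity(expected_md5_by_path):
    """Compare each file's current MD5 with the recorded one.

    `expected_md5_by_path` is a hypothetical stand-in for the database
    of stored checksums: a dict mapping file path -> hex MD5 string.
    Returns a list of (path, problem) tuples; an empty list means every
    file is present and matches its recorded checksum.
    """
    problems = []
    for path, expected in expected_md5_by_path.items():
        if not Path(path).is_file():
            problems.append((path, "missing"))
        elif md5_of_file(path) != expected.lower():
            problems.append((path, "checksum mismatch"))
    return problems
```

A scheduled job could call `check_fixity` on a slice of the holdings each night so that every copy is revisited within the chosen interval. A non-empty result is the signal to restore the file from one of the other copies.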
13Keeping the bits safe at HUL
- Overseen by OIS and UIS IT staff
- Just-in-case plans
- Disaster recovery
- Server fail-overs
- Software failure
- Tape libraries
- Fabric switches
- Lost or damaged tapes
- Data recovery (corruption)
14It's safe - but is it usable?
- It's not enough to preserve the bits if the format of the bits is obsolete!
- WordStar? AppleWorks? Excel 1.0?
- For digital content we are dependent on software that can understand the format
15The importance of format
- Understanding formats is fundamental to
preservation
ffd8ffe000104a46494600010201 008300830000ffed0fb05
0686f74 6f73686f7020332e30003842494d 03e90a5072696
e7420496e666f00 0000007800000000004800480000 00000
2f40240ffeeffee03060252 0347052803fc00020000004800
48 0000000002d80228000100000064 000000010003030300
000001270f 0001000100000000000000000000 0000600800
190190000000000000 0000000000000000000000000000 00
00000000000000000000003842 494d03ed0a5265736f6c757
4696f 6e0000000010008313a3000200 ...
SOI APP0 JFIF 1.2 APP13 IPTC APP2 ICC DQT SOF0
183x512 DRI DHT SOS ECS0 RST0 ECS1 RST1 ECS2 ...
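To make the step from raw hex to labelled segments concrete, here is a small sketch that walks the leading JPEG segments of a byte string. It is a simplified illustration, not a validator: it ignores fill bytes and stops at SOS, and the function name and marker table are choices made for this example. The input is the first 42 bytes of the hex dump above with the whitespace removed.

```python
import struct

# Names for a few JPEG markers seen on the slide (not a complete table).
MARKER_NAMES = {
    0xD8: "SOI", 0xE0: "APP0", 0xE2: "APP2", 0xED: "APP13",
    0xDB: "DQT", 0xC0: "SOF0", 0xDD: "DRI", 0xC4: "DHT", 0xDA: "SOS",
}


def list_segments(data):
    """Yield (marker_name, payload) for the leading JPEG segments.

    Stops at SOS (entropy-coded data follows) or when the bytes run out.
    A payload is shorter than its declared length if `data` is truncated.
    """
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG: missing SOI marker")
    yield "SOI", b""
    pos = 2
    while pos + 4 <= len(data) and data[pos] == 0xFF:
        marker = data[pos + 1]
        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        payload = data[pos + 4:pos + 2 + length]
        yield MARKER_NAMES.get(marker, "0x%02X" % marker), payload
        if marker == 0xDA:
            break
        pos += 2 + length


# First bytes of the hex dump on the slide (whitespace removed, truncated).
excerpt = bytes.fromhex(
    "ffd8ffe000104a46494600010201008300830000"
    "ffed0fb050686f746f73686f7020332e30003842494d"
)
segments = list(list_segments(excerpt))
```

On this excerpt the walk yields SOI, then APP0 with a payload that starts with `JFIF` followed by version bytes 1 and 2, then APP13 with a payload that starts with `Photoshop 3.0`. That agrees with the first labels on the slide. The APP13 segment declares more bytes than the excerpt contains, so the walk ends there.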
18Keeping the bits useful to people
- Know what formats you have
- Make sure there's technology to support the formats!
- Provide ways for people to find it
- Provide ways for curators to manage it
- Keep records of significant events
- Repair, replace
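As a rough illustration of "know what formats you have", the sketch below identifies a file by its leading byte signature and tallies a collection by format. It is only a first-pass approach: the signature table is a small, illustrative subset, and the function names are made up for this example. Dedicated characterization tools go much further than a signature match.

```python
from collections import Counter

# Leading byte signatures ("magic numbers") for a few common formats.
# This table is illustrative and far from complete.
SIGNATURES = [
    (b"\xff\xd8\xff", "JPEG"),
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"II*\x00", "TIFF (little-endian)"),
    (b"MM\x00*", "TIFF (big-endian)"),
    (b"%PDF-", "PDF"),
    (b"PK\x03\x04", "ZIP container"),
]


def identify_format(header):
    """Return a format name for the given leading bytes, or 'unknown'."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"


def inventory(paths):
    """Count files per identified format, reading only each file's header."""
    counts = Counter()
    for path in paths:
        with open(path, "rb") as handle:
            counts[identify_format(handle.read(16))] += 1
    return counts
```

Files that come back as "unknown" are the ones to look at first, since nothing can be planned for a format that has not been identified.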
19Can we approach the problem differently?
- In a way that's more proactive?
- And more efficient?
- And less expensive?
Yes
20The content production matters!
- The least expensive and most effective preservation measure is to think about the future when digital content is created!
- It makes good sense to try to influence the content creation process
21Preservation lifecycle
- Create digital content
- Ingest into a preservation repository
- Continuous cycle of
- Monitoring
- Planning
- Intervention
- Subject to collection management decisions
- Transfer to next generation of the repository or
to a different repository
22Keeping the bits useful to people at HUL
- Guidelines
- More preservable files
- Formats: standard, well-understood, well-supported, open
- Recommended supplementary documentation (metadata)
- Tools
- FITS, JHOVE: check quality of files, automated metadata extraction
- Staff available to consult
23Keeping the bits useful to people at HUL
- Collection management applications
- Discoverable content
- Catalogs
- Persistent names
- Search engines
- Extensive metadata
- Administrative, Technical, Structural, Provenance
- Suite of delivery applications
24Keeping the bits useful to people at HUL
- Suite of delivery services
- Delivery applications created and maintained at OIS
- IDS, PDS, SDS, ADS, FTS
- Third-party middleware maintained at OIS
- RealServer, Luratech JPEG 2000 Server
- Third-party rendering applications on users' desktops
- Web browsers, RealAudio Players, TIFF viewers, ZIP utilities
25Involvement in broader preservation community efforts
- E-journal archiving
- Technical metadata
- Still images, audio, documents
- METS (package for metadata and digital objects)
- PDF/A
- PREMIS (preservation metadata)
- AIHT (repository interaction demonstration)
- Registry of digital masters
- Repository certification
- Formats registry (UDFR)
264. DRS 2
27DRS 2 changes
- Why?
- To better support digital preservation
- To better support needs of DRS depositors,
curators and collection managers
28DRS 2 changes
- New conceptual foundation
- Objects, content models
- User improvements
- Opaque objects, new file formats, tools, guidance
- A new approach to metadata
- Increased preservation planning and activities
29Objects
- Currently only a file level in the DRS
- All management has to be done at the individual file level
- Objects are aggregations of files
- Page-turned object
- Still image object
- More intuitive unit for management, reporting and searching
- Example: How many Page-turned objects do I have in the DRS?
30Content models
- Types of objects
- Example: audio content model
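The idea of objects as typed aggregations of files can be sketched as a small data structure. This is only an illustration of the concept, not the DRS 2 data model. The class names, field names and content model labels are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class StoredFile:
    """One file belonging to an object (hypothetical fields)."""
    path: str
    format_name: str


@dataclass
class DigitalObject:
    """An aggregation of files, typed by its content model."""
    object_id: str
    content_model: str  # e.g. "page-turned", "still image", "audio"
    files: list = field(default_factory=list)


def count_by_content_model(objects):
    """Answer questions like 'how many page-turned objects do I have?'"""
    return Counter(obj.content_model for obj in objects)


# A tiny made-up collection: two page-turned objects and one still image.
collection = [
    DigitalObject("obj-1", "page-turned", [
        StoredFile("obj-1/page1.jp2", "JPEG 2000"),
        StoredFile("obj-1/page2.jp2", "JPEG 2000"),
    ]),
    DigitalObject("obj-2", "page-turned", [
        StoredFile("obj-2/page1.tif", "TIFF"),
    ]),
    DigitalObject("obj-3", "still image", [
        StoredFile("obj-3/master.tif", "TIFF"),
    ]),
]
counts = count_by_content_model(collection)
```

With files grouped this way, the count is taken over objects rather than over individual files, so a hundred-page book counts once.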
31Support for opaque objects
- A special content model
- Allows files in any format
- Digital equivalent of buying time at HD
- Content can be minimally processed, or can be fully processed by depositors but not yet supported by the DRS
- Must be intended for long-term preservation
- Will receive some preservation services
- Will be on a path to fuller DRS preservation
32Support for new file formats
- PDF
- Audio
- MP3, MP4/AAC
- Drawings
- AutoCAD
- Adobe Illustrator
- Video
- What's next?
33Deposit, management & delivery tools
- Enhanced Batch Builder
- Integrated with File Information Tool Set (FITS)
- Enhanced DRS Web Admin
- Better searching
- Richer management and reporting
- Ability to perform batch updates
- File Delivery Service (FDS)
- Created for PDF delivery
- Delivers a file to the user's web browser
34Future of http://hul.harvard.edu/ois/
35Guidance & user community
- New website for digital preservation
- Formats central
- Content models
- DRS practices
- HUL digital preservation projects
- Emerging standards and best practices
- Tools, services, registries
- Resources & experts
36A new approach to metadata
- Moving towards community-standard schemas
- PREMIS, MODS, MIX, textMD, etc.
- Metadata files on the file system alongside content files
- Object descriptor files
- Preservation, rights, descriptive metadata
- More reliance on embedded metadata
- Automatic extraction at deposit time by FITS
- Third party delivery applications are becoming
aware of file-embedded metadata
37Increased preservation planning and activities
- More granular format identification
- Sub-file characterization
- Preservation plans per content model
- Digital first aid (content & metadata)
- Localization, migrations, normalizations
- Technology watch
- Virus checking
385. Open questions