Title: Practical Approaches to Future-Proofing Institutional Web Sites
1. Practical Approaches to Future-Proofing Institutional Web Sites
- 19 January 2006
- John Kunze, California Digital Library
2. Future-Proofing in a Nutshell
- Loss happens
- The 3 Rs
- Reduce narrow, complex dependencies
- Redirect URLs; consider opaque ids
- Replicate
- Think consortially, act locally
- Apply "Not Bad Practices"
- Requirements Aren't
3. Why Try to Future-Proof?
- Your institutional web sites hold content of great scholarly and cultural value
- Much of that content is online only
- The amount of data is growing rapidly
- Site technology and construction techniques evolve rapidly
- Digital loss is commonplace, imminent, and devastating
4. Kinds of Loss
- Hard loss - some or all data bits are missing
- Soft loss - bits still somewhere, we think
- Syntactic loss - bits there but format cannot be rendered by software
- Semantic loss - data renderable without error, but not understandably
- Legal loss - data format or data itself is legally encumbered
5. Complex, Expensive Problem
- Future-proofing web sites is a kind of digital preservation, and shares many of its woes
- They're hard, with no widely accepted meaning
- They're expensive and based on guesswork
- Some loss is unavoidable
- Prioritize
- What content do you care about most?
- Where will your money go furthest?
- When to be perfect? When to do triage?
6. Uneasy Definitions
- DCC: "Digital curation is about maintaining and adding value to a trusted body of digital information for current and future use"
- Digital preservation is only a subset of digital curation, and is much more than just about bits
- Try this: Digital preservation is the ongoing manipulation of a digital object with the intent of keeping its bitstreams usable to some
- Fine, but words in italics need major work
- Some efforts: DPC, PANDORA, NDIIPP
7. Towards Defining Object, Page, Document, and Site
- Web units match poorly with curatorial units
- Digital Object - a set of bit streams deemed to be logically related by ?
- Web Page - the set of HTTP GETs triggered by an initial browser fetch?
- Web Document - no automatic detection
- Web Site - no automatic detection
- Didn't get very far
8. Preservation of Abstraction
- All our experience of digital data is intermediated by software, which produces different results depending on factors such as
- Algorithms, browser configurations, window size, local fonts, network traffic, client network location, randomized ads, time of day, etc.
- Do we know which of those is relevant or important to save?
- Can we even formulate Best Practices?
9. Downgrading Expectations
- Future-proofing? OK, future-improving
- Consider looking to minimize the worst potential loss rather than seeking the best solution for each next set of data
- What's the simplest thing you could do to have the biggest possible impact?
- Example: why not snapshot the internet every six weeks, like the Internet Archive (IA)?
- The uncool (bit-level preservation) became cool
10. Web Archiving
- Preservation can be split into the "Our Stuff Problem" and the "Their Stuff Problem"
- We take responsibility for stuff on our institutional site
- Much as we like their stuff, on their site we can't take responsibility for it until we get a copy
- We therefore do web archiving, the automated harvest of web content using crawler programs
- Some web archiving players
- IA (WayBack Machine, Heritrix crawler), IIPC (WARC)
- NWA (WERA), PANDAS, UKWAC, TNA, HTTrack, IWAW
11. Web Archiving Meets Web Site Future-Improving
- Odd fit: your web archive is not a way to improve your site's future, but the future of the site that you harvest
- How? One of the 3 Rs: Replication
- Their site is a passive beneficiary of your archiving efforts, but they can make it easier
- By tailoring the site to be crawler-friendly
- Perhaps by consortial pre-arrangement (e.g., you harvest my site and I'll harvest yours)
12. Back to Your Site's Future
- A direct management strategy, possibly with
- An indirect, passive strategy ranging from
- Loosely coupled, e.g., IA crawler, to
- Tightly coupled, e.g., mutual cross-copying with link checking, checksum validation (see the manifest sketch after this slide), and detailed sitemaps to allow capturing entire databases normally hidden to crawlers
- Agreements are vital to indirect strategies
- E.g., Google Sitemaps, maybe OAI-PMH
- Mirroring software preferred among partners
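A minimal sketch of the checksum-validation side of such a partnership, assuming each partner simply walks its document root and exchanges a per-file SHA-256 manifest; the manifest filename and layout here are illustrative, not any standard.

    import hashlib
    from pathlib import Path

    def make_manifest(site_root, manifest_path="MANIFEST.sha256"):
        """Record a SHA-256 digest for every file under the site's document root."""
        root = Path(site_root)
        with open(manifest_path, "w", encoding="utf-8") as out:
            for path in sorted(p for p in root.rglob("*") if p.is_file()):
                digest = hashlib.sha256(path.read_bytes()).hexdigest()
                out.write(f"{digest}  {path.relative_to(root)}\n")

    def verify_manifest(mirror_root, manifest_path="MANIFEST.sha256"):
        """Return files in the mirror that are missing or no longer match the manifest."""
        root = Path(mirror_root)
        problems = []
        for line in open(manifest_path, encoding="utf-8"):
            digest, name = line.rstrip("\n").split("  ", 1)
            target = root / name
            if not target.exists():
                problems.append("missing: " + name)
            elif hashlib.sha256(target.read_bytes()).hexdigest() != digest:
                problems.append("altered: " + name)
        return problems

Under the agreement, one partner might run make_manifest after each site update and the other run verify_manifest on its copy after each harvest, reporting any differences back.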
13. Letting Yourself Be Saved
- In future-improving your site, think about the crawler view (if you want) as you will hear presented in this workshop
- In some cases, the IA's crawlers may be adequate backup, but it wouldn't hurt to check their coverage if you're counting on it (see the sketch after this slide)
- In most cases, it won't be adequate
- Often too infrequent, shallow, or spotty
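One way to spot-check that coverage, sketched against the Internet Archive's Wayback availability endpoint (https://archive.org/wayback/available); that service is a later addition and the page list below is made up, so treat this purely as an illustration of the idea.

    import json
    import urllib.parse
    import urllib.request

    def wayback_snapshot(url):
        """Return the timestamp of the closest Wayback capture of url, or None."""
        query = urllib.parse.urlencode({"url": url})
        with urllib.request.urlopen(
                "https://archive.org/wayback/available?" + query) as resp:
            data = json.load(resp)
        closest = data.get("archived_snapshots", {}).get("closest")
        return closest["timestamp"] if closest else None

    # Spot-check a few pages you are counting on the IA to hold (hypothetical URLs).
    for page in ("https://www.example.org/",
                 "https://www.example.org/reports/2005/annual.pdf"):
        stamp = wayback_snapshot(page)
        print(page, "->", stamp or "NOT CAPTURED")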
14. Saving Yourself: Policy
- Different flavors of preservation exist, not just between organizations, but for different kinds of objects within one collection
- Preservation is nuanced, not on or off
- What's your policy? Standards needed
- Commitment statements
- Permanence ratings (e.g., US NLM)
- Rights declarations
- Trusted Digital Repositories: Attributes and Responsibilities (RLG)
- Organizations are surprisingly cautious about commitments to preservation
15. Saving Yourself: Short is Long
- Best indicator of the future is the recent past
- E.g., exactly how OS job scheduling works
- Your skilled computer sysadmin is a great adviser on practical long-term preservation
- Maintaining, monitoring, converting, upgrading, and migrating user files and software on multiple hardware/software platforms several times a year develops killer insights into long-term maintenance
- Long-term is like short-term, only more so
- Reduce dependencies to simplify maintenance
16. Saving Yourself: Depend Less
- Technical dependencies won't all go away, as we can't experience the bits without them
- Why paper lives a thousand years
- Why plain text has lasted 30 years
- But which technical dependencies to lose first?
- Narrow, complex tools (e.g., Handle) are risky
- Diffuse tools will be fixed before you notice (e.g., browser, OS, DNS, Internet, Acrobat)
- Narrow, simple tools fixable in the community
17. Saving Yourself: Stable Identifiers
- URL-based identifiers are shorthand for most direct online access in the world
- Whatever id you pick (URN, PURL, ARK, DOI, Handle), publish the URL form of it
- Redirect, redirect, redirect (see the resolver sketch after this slide)
- Prevent headaches by avoiding semantics and using opaque ids and hostnames
- Semantics create unstable dependencies that will force links to be changed for political reasons
- Think consortially to stabilize small hostnames
18. Some Identifier Events
- 10 April 2006: DLF Developer Group is putting on a special DLF Forum session on scheme-agnostic global id resolution, held in Austin, Texas
- 13 March 2006: NISO is putting on an identifier roundtable in Washington, DC, following on the DCC identifier workshop last summer in Glasgow
19. A Glimpse at Migration and Emulation
- Migration problems
- Unknown costs, human review, format errors
- Emulation problems
- Unknown costs, human review, software IP
- Both try to keep up with or preserve an object's technical context
- Desiccation to reduce that context
20. A Desiccated Data Detour
- Remarkable lesson from the longest-lived online digital format
- Plain text archives of IETF internet RFCs
- High in value, low in features
- Preservation through desiccation
- No fonts, graphics, colors, diacritics, etc.
- But essential cultural value retained
21. Hedging our Bets
- Always save the original format
- In addition, derive desiccated formats in case the original format ever fails (see the ingest sketch after this slide)
- Raster image as alternate desiccated format
- Rectangular grid of picture elements
- Rendering tools will be best at the peak of a format's popularity
- Very common: malformed format instances
- Additional fallback format in case the original and plain text versions fail
- May never have money to touch the objects again
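A sketch of that hedge at ingest time: keep the original bytes untouched and derive desiccated companions beside them. The text step handles only HTML via the standard library, and the raster step assumes the third-party Pillow package and an image original, so both are stand-ins for whatever converters suit your formats.

    import shutil
    from html.parser import HTMLParser
    from pathlib import Path

    class TextOnly(HTMLParser):
        """Crude HTML-to-text: keep character data, drop all markup."""
        def __init__(self):
            super().__init__()
            self.chunks = []
        def handle_data(self, data):
            self.chunks.append(data)

    def ingest(original, vault="vault"):
        """Store the untouched original plus desiccated derivatives beside it."""
        src = Path(original)
        obj_dir = Path(vault) / src.stem
        obj_dir.mkdir(parents=True, exist_ok=True)

        # 1. Always save the original format, unmodified.
        shutil.copy2(src, obj_dir / src.name)

        # 2. Plain-text derivative (here only for HTML originals).
        if src.suffix.lower() in (".html", ".htm"):
            parser = TextOnly()
            parser.feed(src.read_text(errors="replace"))
            (obj_dir / (src.stem + ".txt")).write_text("".join(parser.chunks))

        # 3. Raster fallback (image originals only; requires Pillow).
        if src.suffix.lower() in (".gif", ".tif", ".tiff", ".bmp"):
            from PIL import Image
            Image.open(src).save(obj_dir / (src.stem + ".png"))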
22. A Glimpse at Metadata
- Metadata is important for managed collections, especially in libraries
- Controversial whether to require it
- In the end, requirements waived for good stuff
- PREMIS Preservation Metadata
- Large metadata dictionary for info on format, fixity, provenance, relationships, environment, etc.
- Great checklist for preservation process planning
- Rule: if you get metadata, don't destroy it
- Consider HTML meta tags, RDF, and Dublin Core (see the sketch after this slide)
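Following the "don't destroy it" rule, a small sketch that harvests whatever meta tags arrive with a page (Dublin Core elements typically appear under names like DC.title or DC.creator) and parks them in a sidecar file; the .meta.json layout is just an illustrative choice, not PREMIS.

    import json
    from html.parser import HTMLParser

    class MetaTags(HTMLParser):
        """Collect name/content pairs from <meta> tags in an HTML page."""
        def __init__(self):
            super().__init__()
            self.found = {}
        def handle_starttag(self, tag, attrs):
            if tag == "meta":
                a = dict(attrs)
                if "name" in a and "content" in a:
                    self.found.setdefault(a["name"], []).append(a["content"])

    def save_page_metadata(html_path):
        """Keep whatever embedded metadata came with the page, e.g. DC.title."""
        parser = MetaTags()
        with open(html_path, encoding="utf-8", errors="replace") as f:
            parser.feed(f.read())
        with open(html_path + ".meta.json", "w", encoding="utf-8") as out:
            json.dump(parser.found, out, indent=2)
        return parser.found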
23. A Glimpse at Formats
- Formats create rendering dependencies
- Global Digital Format Registry (GDFR) just received Mellon funding to go forward
- JHOVE tool used very commonly
- Requirements frequently softened on ingest, especially in web archiving with no recourse
- "Not Bad Practices" on ingest (see the sketch after this slide)
- Standards and Best Practices apply well if you control or can influence production
- IETF adage: be conservative in the outputs you produce and liberal in the inputs you accept
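A sketch of one such "Not Bad Practice" on ingest: accept everything, but record what a few magic-number checks can tell you so that later validation or migration (with JHOVE or its successors) has something to start from. Only signatures this sketch's author is sure of are listed; a real deployment would lean on a fuller identification tool.

    # Magic-number prefixes for a few common formats (deliberately incomplete).
    SIGNATURES = [
        (b"%PDF-", "application/pdf"),
        (b"\x89PNG\r\n\x1a\n", "image/png"),
        (b"GIF87a", "image/gif"),
        (b"GIF89a", "image/gif"),
        (b"\xff\xd8\xff", "image/jpeg"),
    ]

    def identify(path):
        """Best-effort format guess; unidentified files are still accepted."""
        with open(path, "rb") as f:
            head = f.read(16)
        for prefix, mime in SIGNATURES:
            if head.startswith(prefix):
                return mime
        return "application/octet-stream"   # accept liberally, label honestly

    def ingest_record(path):
        """What to log at ingest instead of rejecting the file."""
        return {"path": path, "format_guess": identify(path)}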
24. Future-Improving in a Nutshell
- Loss happens
- The 3 Rs
- Reduce narrow, complex dependencies
- Redirect URLs; consider opaque ids
- Replicate
- Think consortially, act locally
- Apply "Not Bad Practices"
- Requirements Aren't