Title: Practical Approaches to Future-Proofing Institutional Web Sites
1. Practical Approaches to Future-Proofing Institutional Web Sites
- 19 January 2006
- John Kunze, California Digital Library
2. Future-Proofing in a Nutshell
- Loss happens
- The 3 Rs
- Reduce narrow, complex dependencies
- Redirect URLs; consider opaque ids
- Replicate
- Think consortially, act locally
- Apply "Not Bad Practices"
- Requirements Aren't
3. Why Try to Future-Proof?
- Your institutional web sites hold content of great scholarly and cultural value
- Much of that content is online only
- The amount of data is growing rapidly
- Site technology and construction techniques evolve rapidly
- Digital loss is commonplace, imminent, and devastating
4. Kinds of Loss
- Hard loss - some or all data bits are missing
- Soft loss - bits still somewhere, we think
- Syntactic loss - bits there but format cannot be rendered by software
- Semantic loss - data renderable without error, but not understandably
- Legal loss - data format or data itself is legally encumbered
5. Complex, Expensive Problem
- Future-proofing web sites is a kind of digital preservation, and shares many of its woes
- They're hard, with no widely accepted meaning
- They're expensive and based on guesswork
- Some loss is unavoidable
- Prioritize
- What content do you care about most?
- Where will your money go furthest?
- When to be perfect? When to do triage?
6. Uneasy Definitions
- DCC: "Digital curation is about maintaining and adding value to a trusted body of digital information for current and future use"
- Digital preservation is only a subset of digital curation, and is much more than just about bits
- Try this: Digital preservation is the ongoing manipulation of a digital object with the intent of keeping its bitstreams usable to some
- Fine, but words in italics need major work
- Some efforts: DPC, PANDORA, NDIIPP
7. Towards Defining Object, Page, Document, and Site
- Web units match poorly with curatorial units
- Digital Object - a set of bit streams deemed to be logically related by ?
- Web Page - the set of HTTP GETs triggered by an initial browser fetch?
- Web Document - no automatic detection
- Web Site - no automatic detection
- Didn't get very far
8. Preservation of Abstraction
- All our experience of digital data is intermediated by software, which produces different results depending on factors such as
- Algorithms, browser configurations, window size, local fonts, network traffic, client network location, randomized ads, time of day, etc.
- Do we know which of those is relevant or important to save?
- Can we even formulate Best Practices?
9. Downgrading Expectations
- Future-proofing? OK, future-improving
- Consider looking to minimize the worst potential loss rather than seeking the best solution for each next set of data
- What's the simplest thing you could do to have the biggest possible impact?
- Example: why not snapshot the internet every six weeks, like the Internet Archive (IA)?
- The uncool (bit-level preservation) became cool
10. Web Archiving
- Preservation can be split into the "Our Stuff Problem" and the "Their Stuff Problem"
- We take responsibility for stuff on our institutional site
- Much as we like their stuff, on their site we can't take responsibility for it until we get a copy
- We therefore do web archiving, the automated harvest of web content using crawler programs
- Some web archiving players
- IA (WayBack Machine, Heritrix crawler), IIPC (WARC)
- NWA (WERA), PANDAS, UKWAC, TNA, HTTrack, IWAW
11. Web Archiving Meets Web Site Future-Improving
- Odd fit: your web archive is not a way to improve your site's future, but the future of the site that you harvest
- How? One of the 3 Rs: Replication
- Their site is a passive beneficiary of your archiving efforts, but they can make it easier
- By tailoring the site to be crawler-friendly
- Perhaps by consortial pre-arrangement (e.g., you harvest my site and I'll harvest yours)
12. Back to Your Site's Future
- A direct management strategy, possibly with
- An indirect, passive strategy ranging from
- Loosely coupled, e.g., IA crawler, to
- Tightly coupled, e.g., mutual cross-copying with link checking, checksum validation (see the manifest sketch after this slide), and detailed sitemaps to allow capturing entire databases normally hidden to crawlers
- Agreements are vital to indirect strategies
- E.g., Google Sitemaps, maybe OAI-PMH
- Mirroring software preferred among partners
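A minimal sketch of the checksum-validation side of such a partnership, assuming each partner simply walks its document root and exchanges a per-file SHA-256 manifest; the manifest filename and layout here are illustrative, not any standard.

    import hashlib
    from pathlib import Path

    def make_manifest(site_root, manifest_path="MANIFEST.sha256"):
        """Record a SHA-256 digest for every file under the site's document root."""
        root = Path(site_root)
        with open(manifest_path, "w", encoding="utf-8") as out:
            for path in sorted(p for p in root.rglob("*") if p.is_file()):
                digest = hashlib.sha256(path.read_bytes()).hexdigest()
                out.write(f"{digest}  {path.relative_to(root)}\n")

    def verify_manifest(mirror_root, manifest_path="MANIFEST.sha256"):
        """Return files in the mirror that are missing or no longer match the manifest."""
        root = Path(mirror_root)
        problems = []
        for line in open(manifest_path, encoding="utf-8"):
            digest, name = line.rstrip("\n").split("  ", 1)
            target = root / name
            if not target.exists():
                problems.append("missing: " + name)
            elif hashlib.sha256(target.read_bytes()).hexdigest() != digest:
                problems.append("altered: " + name)
        return problems

Under the agreement, one partner might run make_manifest after each site update and the other run verify_manifest on its copy after each harvest, reporting any differences back.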
13. Letting Yourself Be Saved
- In future-improving your site, think about the crawler view (if you want) as you will hear presented in this workshop
- In some cases, the IA's crawlers may be adequate backup, but it wouldn't hurt to check their coverage if you're counting on it (see the sketch after this slide)
- In most cases, it won't be adequate
- Often too infrequent, shallow, or spotty
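One way to spot-check that coverage, sketched against the Internet Archive's Wayback availability endpoint (https://archive.org/wayback/available); that service is a later addition and the page list below is made up, so treat this purely as an illustration of the idea.

    import json
    import urllib.parse
    import urllib.request

    def wayback_snapshot(url):
        """Return the timestamp of the closest Wayback capture of url, or None."""
        query = urllib.parse.urlencode({"url": url})
        with urllib.request.urlopen(
                "https://archive.org/wayback/available?" + query) as resp:
            data = json.load(resp)
        closest = data.get("archived_snapshots", {}).get("closest")
        return closest["timestamp"] if closest else None

    # Spot-check a few pages you are counting on the IA to hold (hypothetical URLs).
    for page in ("https://www.example.org/",
                 "https://www.example.org/reports/2005/annual.pdf"):
        stamp = wayback_snapshot(page)
        print(page, "->", stamp or "NOT CAPTURED")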
14. Saving Yourself: Policy
- Different flavors of preservation exist, not just between organizations, but for different kinds of objects within one collection
- Preservation is nuanced, not on or off
- What's your policy? Standards needed
- Commitment statements
- Permanence ratings (e.g., US NLM)
- Rights declarations
- Trusted Digital Repositories: Attributes and Responsibilities (RLG)
- Organizations are surprisingly cautious about commitments to preservation
15. Saving Yourself: Short is Long
- Best indicator of the future is the recent past
- E.g., exactly how OS job scheduling works
- Your skilled computer sysadmin is a great adviser on practical long-term preservation
- Maintaining, monitoring, converting, upgrading, and migrating user files and software on multiple hardware/software platforms several times a year develops killer insights into long-term maintenance
- Long-term is like short-term, only more so
- Reduce dependencies to simplify maintenance
16. Saving Yourself: Depend Less
- Technical dependencies won't all go away, as we can't experience the bits without them
- Why paper lives a thousand years
- Why plain text has lasted 30 years
- But which technical dependencies to lose first?
- Narrow, complex tools (e.g., Handle) are risky
- Diffuse tools will be fixed before you notice (e.g., browser, OS, DNS, Internet, Acrobat)
- Narrow, simple tools fixable in the community
17. Saving Yourself: Stable Identifiers
- URL-based identifiers are shorthand for most direct online access in the world
- Whatever id you pick (URN, PURL, ARK, DOI, Handle), publish the URL form of it
- Redirect, redirect, redirect (see the resolver sketch after this slide)
- Prevent headaches by avoiding semantics and using opaque ids and hostnames
- Semantics create unstable dependencies that will force links to be changed for political reasons
- Think consortially to stabilize small hostnames
18. Some Identifier Events
- 10 April 2006: DLF Developer Group is putting on a special DLF Forum session on scheme-agnostic global id resolution, held in Austin, Texas
- 13 March 2006: NISO is putting on an identifier roundtable in Washington, DC, following on the DCC identifier workshop last summer in Glasgow
19. A Glimpse at Migration and Emulation
- Migration problems
- Unknown costs, human review, format errors
- Emulation problems
- Unknown costs, human review, software IP
- Both try to keep up with or preserve an object's technical context
- Desiccation to reduce that context
20. A Desiccated Data Detour
- Remarkable lesson from the longest-lived online digital format
- Plain text archives of IETF internet RFCs
- High in value, low in features
- Preservation through desiccation
- No fonts, graphics, colors, diacritics, etc.
- But essential cultural value retained
21. Hedging our Bets
- Always save the original format
- In addition, derive desiccated formats in case the original format ever fails (see the ingest sketch after this slide)
- Raster image as alternate desiccated format
- Rectangular grid of picture elements
- Rendering tools will be best at the peak of a format's popularity
- Very common: malformed format instances
- Additional fallback format in case the original and plain text versions fail
- May never have money to touch the objects again
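A sketch of that hedge at ingest time: keep the original bytes untouched and derive desiccated companions beside them. The text step handles only HTML via the standard library, and the raster step assumes the third-party Pillow package and an image original, so both are stand-ins for whatever converters suit your formats.

    import shutil
    from html.parser import HTMLParser
    from pathlib import Path

    class TextOnly(HTMLParser):
        """Crude HTML-to-text: keep character data, drop all markup."""
        def __init__(self):
            super().__init__()
            self.chunks = []
        def handle_data(self, data):
            self.chunks.append(data)

    def ingest(original, vault="vault"):
        """Store the untouched original plus desiccated derivatives beside it."""
        src = Path(original)
        obj_dir = Path(vault) / src.stem
        obj_dir.mkdir(parents=True, exist_ok=True)

        # 1. Always save the original format, unmodified.
        shutil.copy2(src, obj_dir / src.name)

        # 2. Plain-text derivative (here only for HTML originals).
        if src.suffix.lower() in (".html", ".htm"):
            parser = TextOnly()
            parser.feed(src.read_text(errors="replace"))
            (obj_dir / (src.stem + ".txt")).write_text("".join(parser.chunks))

        # 3. Raster fallback (image originals only; requires Pillow).
        if src.suffix.lower() in (".gif", ".tif", ".tiff", ".bmp"):
            from PIL import Image
            Image.open(src).save(obj_dir / (src.stem + ".png"))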
22. A Glimpse at Metadata
- Metadata is important for managed collections, especially in libraries
- Controversial whether to require it
- In the end, requirements waived for good stuff
- PREMIS Preservation Metadata
- Large metadata dictionary for info on format, fixity, provenance, relationships, environment, etc.
- Great checklist for preservation process planning
- Rule: if you get metadata, don't destroy it
- Consider HTML meta tags, RDF, and Dublin Core (see the sketch after this slide)
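Following the "don't destroy it" rule, a small sketch that harvests whatever meta tags arrive with a page (Dublin Core elements typically appear under names like DC.title or DC.creator) and parks them in a sidecar file; the .meta.json layout is just an illustrative choice, not PREMIS.

    import json
    from html.parser import HTMLParser

    class MetaTags(HTMLParser):
        """Collect name/content pairs from <meta> tags in an HTML page."""
        def __init__(self):
            super().__init__()
            self.found = {}
        def handle_starttag(self, tag, attrs):
            if tag == "meta":
                a = dict(attrs)
                if "name" in a and "content" in a:
                    self.found.setdefault(a["name"], []).append(a["content"])

    def save_page_metadata(html_path):
        """Keep whatever embedded metadata came with the page, e.g. DC.title."""
        parser = MetaTags()
        with open(html_path, encoding="utf-8", errors="replace") as f:
            parser.feed(f.read())
        with open(html_path + ".meta.json", "w", encoding="utf-8") as out:
            json.dump(parser.found, out, indent=2)
        return parser.found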
23. A Glimpse at Formats
- Formats create rendering dependencies
- Global Digital Format Registry (GDFR) just received Mellon funding to go forward
- JHOVE tool used very commonly
- Requirements frequently softened on ingest, especially in web archiving with no recourse
- "Not Bad Practices" on ingest (see the sketch after this slide)
- Standards and Best Practices apply well if you control or can influence production
- IETF adage: be conservative in the outputs you produce and liberal in the inputs you accept
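A sketch of one such "Not Bad Practice" on ingest: accept everything, but record what a few magic-number checks can tell you so that later validation or migration (with JHOVE or its successors) has something to start from. Only signatures this sketch's author is sure of are listed; a real deployment would lean on a fuller identification tool.

    # Magic-number prefixes for a few common formats (deliberately incomplete).
    SIGNATURES = [
        (b"%PDF-", "application/pdf"),
        (b"\x89PNG\r\n\x1a\n", "image/png"),
        (b"GIF87a", "image/gif"),
        (b"GIF89a", "image/gif"),
        (b"\xff\xd8\xff", "image/jpeg"),
    ]

    def identify(path):
        """Best-effort format guess; unidentified files are still accepted."""
        with open(path, "rb") as f:
            head = f.read(16)
        for prefix, mime in SIGNATURES:
            if head.startswith(prefix):
                return mime
        return "application/octet-stream"   # accept liberally, label honestly

    def ingest_record(path):
        """What to log at ingest instead of rejecting the file."""
        return {"path": path, "format_guess": identify(path)}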
24. Future-Improving in a Nutshell
- Loss happens
- The 3 Rs
- Reduce narrow, complex dependencies
- Redirect URLs; consider opaque ids
- Replicate
- Think consortially, act locally
- Apply "Not Bad Practices"
- Requirements Aren't