Practical Approaches to Future-Proofing Institutional Web Sites

1
Practical Approaches to Future-Proofing
Institutional Web Sites
  • 19 January 2006
  • John Kunze, California Digital Library

2
Future-Proofing in a Nutshell
  • Loss happens
  • The 3 Rs
  • Reduce narrow, complex dependencies
  • Redirect URLs; consider opaque ids
  • Replicate
  • Think consortially, act locally
  • Apply Not Bad Practices
  • Requirements Aren't

3
Why Try to Future-Proof?
  • Your institutional web sites hold content of
    great scholarly and cultural value
  • Much of that content is online only
  • The amount of data is growing rapidly
  • Site technology and construction techniques
    evolve rapidly
  • Digital loss is commonplace, imminent, and
    devastating

4
Kinds of Loss
  • Hard loss - some or all data bits are missing
  • Soft loss - bits still somewhere, we think
  • Syntactic loss - bits there but format cannot be
    rendered by software
  • Semantic loss - data renderable without error,
    but not understandably
  • Legal loss - data format or data itself is
    legally encumbered

5
Complex, Expensive Problem
  • Future-proofing web sites is a kind of digital
    preservation, and shares many of its woes
  • They're hard, with no widely accepted meaning
  • They're expensive and based on guesswork
  • Some loss is unavoidable
  • Prioritize
  • What content do you care about most?
  • Where will your money go furthest?
  • When to be perfect? When to do triage?

6
Uneasy Definitions
  • DCC: Digital curation is about maintaining and
    adding value to a trusted body of digital
    information for current and future use
  • Digital preservation is only a subset of digital
    curation, and is about much more than just bits
  • Try this: Digital preservation is the ongoing
    manipulation of a digital object with the intent
    of keeping its bitstreams usable to some
  • Fine, but words in italics need major work
  • Some efforts: DPC, PANDORA, NDIIPP

7
Towards Defining Object, Page, Document, and Site
  • Web units match poorly with curatorial units
  • Digital Object: a set of bit streams deemed to be
    logically related by ?
  • Web Page: the set of HTTP GETs triggered by an
    initial browser fetch?
  • Web Document: no automatic detection
  • Web Site: no automatic detection
  • Didn't get very far

8
Preservation of Abstraction
  • All our experience of digital data is
    intermediated by software, which produces
    different results depending on things like:
  • Algorithms, browser configurations, window size,
    local fonts, network traffic, client network
    location, randomized ads, time of day, etc.
  • Do we know which of those is relevant or
    important to save?
  • Can we even formulate Best Practices?

9
Downgrading Expectations
  • Future-proofing? OK, future-improving
  • Consider looking to minimize the worst potential
    loss rather than seeking the best solution for
    each next set of data
  • What's the simplest thing you could do to have
    the biggest possible impact?
  • Example: why not snapshot the internet every six
    weeks, like the Internet Archive (IA)?
  • The uncool (bit-level preservation) became cool

10
Web Archiving
  • Preservation can be split into the "Our Stuff"
    Problem and the "Their Stuff" Problem
  • We take responsibility for stuff on our
    institutional site
  • Much as we like their stuff on their site, we
    can't take responsibility for it until we get a
    copy
  • We therefore do web archiving, the automated
    harvest of web content using crawler programs
  • Some web archiving players:
  • IA (WayBack Machine, Heritrix crawler), IIPC
    (WARC)
  • NWA (WERA), PANDAS, UKWAC, TNA, HTTrack, IWAW

11
Web Archiving Meets Web Site Future-Improving
  • Odd fit: your web archive is not a way to
    improve your site's future, but the future of the
    site that you harvest
  • How? One of the 3 Rs: Replication
  • Their site is a passive beneficiary of your
    archiving efforts, but they can make it easier
  • By tailoring the site to be crawler-friendly
  • Perhaps by consortial pre-arrangement (e.g.,
    you harvest my site and I'll harvest yours)

12
Back to Your Site's Future
  • A direct management strategy, possibly with
  • An indirect, passive strategy ranging from
  • Loosely coupled, e.g., IA crawler, to
  • Tightly coupled, e.g., mutual cross copying with
    link checking, checksum validation, and detailed
    sitemaps to allow capturing entire databases
    normally hidden to crawlers
  • Agreements are vital to indirect strategies
  • E.g., Google Sitemaps, maybe OAI-PMH
  • Mirroring software preferred among partners

13
Letting Yourself Be Saved
  • In future-improving your site, think about the
    crawler view (if you want) as you will hear
    presented in this workshop
  • In some cases, the IA's crawlers may be adequate
    backup, but it wouldn't hurt to check their
    coverage if you're counting on it
  • In most cases, it won't be adequate
  • Often too infrequent, shallow, or spotty

14
Saving Yourself: Policy
  • Different flavors of preservation exist, not just
    between organizations, but for different kinds of
    objects within one collection
  • Preservation is nuanced, not on or off
  • What's your policy? Standards needed:
  • Commitment statements
  • Permanence ratings (e.g., US NLM)
  • Rights declarations
  • Trusted Digital Repositories: Attributes and
    Responsibilities (RLG)
  • Organizations are surprisingly cautious about
    commitments to preservation

15
Saving Yourself: Short is Long
  • Best indicator of the future is the recent past
  • E.g., exactly how OS job scheduling works
  • Your skilled computer sysadmin is a great adviser
    on practical long-term preservation
  • Maintaining, monitoring, converting, upgrading,
    and migrating user files and software on multiple
    hardware/software platforms several times a year
    develops killer insights into long-term
    maintenance
  • Long-term is like short-term, only more so
  • Reduce dependencies to simplify maintenance

16
Saving Yourself: Depend Less
  • Technical dependencies won't all go away, as we
    can't experience the bits without them
  • Why paper lives 1000 years
  • Why plain text has lasted 30 years
  • But which technical dependencies to lose first?
  • Narrow, complex tools (e.g., Handle) are risky
  • Diffuse tools will be fixed before you notice
    (e.g., browser, OS, DNS, Internet, Acrobat)
  • Narrow, simple tools fixable in the community

17
Saving Yourself: Stable Identifiers
  • URL-based identifiers are shorthand for most
    direct online access in the world
  • Whatever id you pick (URN, PURL, ARK, DOI,
    Handle), publish the URL form of it
  • Redirect, redirect, redirect
  • Prevent headaches by avoiding semantics and using
    opaque ids and hostnames
  • Semantics create unstable dependencies that will
    force links to be changed for political reasons
  • Think consortially to stabilize small hostnames
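
The "redirect, redirect, redirect" advice can be sketched as a small lookup from opaque ids to current locations. This is a hedged illustration only: the ids, the `example.org` URLs, and the `resolve` function are hypothetical, and returning 302 rather than another redirect status is an assumption of this sketch, chosen so that clients keep using the stable id instead of caching the target.

```python
# Hypothetical redirect table: opaque id path -> current location.
# Only this table changes when content moves; published links stay the same.
REDIRECTS = {
    "/id/b7k2x9": "https://www.example.org/library/reports/2006/annual.html",
    "/id/q4m8t1": "https://www.example.org/archive/datasets/survey.csv",
}


def resolve(path: str, table: dict[str, str] = REDIRECTS) -> tuple[int, dict[str, str]]:
    """Return an HTTP status code and response headers for a requested id path."""
    target = table.get(path)
    if target is None:
        return 404, {}
    # A temporary redirect keeps the opaque id as the link people bookmark.
    return 302, {"Location": target}
```

Because the ids carry no department names, dates, or file types, reorganizing the site means editing one table entry rather than changing any published link.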

18
Some Identifier Events
  • 10 April 2006: DLF Developer Group is putting on a
    special DLF Forum session on scheme-agnostic
    global id resolution, held in Austin, Texas
  • 13 March 2006: NISO is putting on an identifier
    roundtable in Washington, DC, following on the DCC
    identifier workshop last summer in Glasgow

19
A Glimpse at Migration and Emulation
  • Migration problems
  • Unknown costs, human review, format errors
  • Emulation problems
  • Unknown costs, human review, software IP
  • Both try to keep up with or preserve an object's
    technical context
  • Desiccation to reduce that context

20
A Desiccated Data Detour
  • Remarkable lesson from the longest-lived online
    digital format
  • Plain text archives of IETF internet RFCs
  • High in value, low in features
  • Preservation through desiccation
  • No fonts, graphics, colors, diacritics, etc.
  • But essential cultural value retained

21
Hedging our Bets
  • Always save the original format
  • In addition, derive desiccated formats in case
    the original format ever fails
  • Raster image as alternate desiccated format
  • Rectangular grid of picture elements
  • Rendering tools will be best at peak of format's
    popularity
  • Very common malformed format instances
  • Additional fall back format in case the original
    and plain text versions fail
  • May never have money to touch the objects again
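
Deriving a desiccated plain-text version alongside the original can be sketched with Python's standard `html.parser` module. This is a rough illustration under stated assumptions: the `TextExtractor` class and `desiccate` function are names invented for the example, and the sketch flattens all visible text into a single line, discarding layout, fonts, and images, much as the plain-text RFC archive does.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of script and style elements."""

    SKIPPED = {"script", "style"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text outside skipped elements, whitespace collapsed.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(" ".join(data.split()))

    def text(self) -> str:
        return " ".join(self._chunks)


def desiccate(html: str) -> str:
    """Return a plain-text rendering of an HTML document (original is untouched)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

The original file is never modified; the derived text would be stored next to it as a fallback in case the original format ever fails.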

22
A Glimpse at Metadata
  • Metadata is important for managed collections,
    especially in libraries
  • Controversial whether to require it
  • In the end, requirements waived for good stuff
  • PREMIS: Preservation Metadata
  • Large metadata dictionary for info on format,
    fixity, provenance, relationships, environment,
    etc.
  • Great checklist for preservation process planning
  • Rule: if you get metadata, don't destroy it
  • Consider HTML meta tags, RDF, and Dublin Core
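
As one way to act on the suggestion of HTML meta tags and Dublin Core, the sketch below renders a small metadata record as `<meta>` elements using the common `DC.` name prefix. The `dublin_core_meta` function and the shape of the record are assumptions made for illustration; the sample values are taken from this presentation's own title slide.

```python
from html import escape


def dublin_core_meta(record: dict[str, str]) -> str:
    """Render a flat record of Dublin Core elements as HTML head markup."""
    lines = ['<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/">']
    for element, value in record.items():
        # Escape values so quotes and ampersands cannot break the attribute.
        lines.append(
            f'<meta name="DC.{element}" content="{escape(value, quote=True)}">'
        )
    return "\n".join(lines)


# Sample record built from the title slide of this presentation.
record = {
    "title": "Practical Approaches to Future-Proofing Institutional Web Sites",
    "creator": "John Kunze",
    "date": "2006-01-19",
}
print(dublin_core_meta(record))
```

Embedding the metadata in the page itself means a crawler that captures the page captures the metadata with it.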

23
A Glimpse at Formats
  • Formats create rendering dependencies
  • Global Digital Format Registry (GDFR) just
    received Mellon funding to go forward
  • JHOVE tool used very commonly
  • Requirements frequently softened on ingest,
    especially in web archiving with no recourse
  • Not Bad Practices on ingest
  • Standards and Best Practices apply well if you
    control or can influence production
  • IETF adage: be conservative in the outputs you
    produce and liberal in the inputs you accept
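
The IETF adage can be illustrated with a small date normalizer that accepts several input spellings but always emits one strict form. This is only a sketch: the `normalize_date` function and the list of accepted formats are assumptions for the example, and the month-name formats depend on an English locale.

```python
from datetime import datetime

# Liberal on input: several spellings are accepted (assumed list, easily extended).
ACCEPTED_FORMATS = ("%Y-%m-%d", "%d %B %Y", "%d %b %Y", "%B %d, %Y")


def normalize_date(raw: str) -> str:
    """Parse a loosely formatted date and return it as strict ISO 8601 (YYYY-MM-DD)."""
    cleaned = " ".join(raw.split())  # tolerate stray or repeated whitespace
    for fmt in ACCEPTED_FORMATS:
        try:
            # Conservative on output: exactly one form is ever produced.
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {raw!r}")
```

On ingest, the same idea applies to formats more broadly: accept what arrives, record what it was, and produce well-formed output for whoever harvests the site next.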

24
Future-Improving in a Nutshell
  • Loss happens
  • The 3 Rs
  • Reduce narrow, complex dependencies
  • Redirect URLs; consider opaque ids
  • Replicate
  • Think consortially, act locally
  • Apply Not Bad Practices
  • Requirements Aren't