Title: Web site archiving by capturing all unique responses


1
Web site archiving by capturing all unique responses
Archiving the Web Conference Information Day
National Library of Australia, 12 November 2004
  • Kent Fitch, Project Computing Pty Ltd

2
Reasons for archiving web sites
  • They are important
  • Main public and internal communication mechanism
  • Australian "Government Online", US Government
    Paperwork Elimination Act
  • Legal
  • Act of publication
  • Context as well as content
  • Reputation, Community Expectations
  • Commercial Advantage
  • Provenance

3
Web site characteristics
  • Increasing dynamic content
  • Content changes relatively slowly as a % of total
  • Small set of pages account for most hits
  • Most responses have been seen before

4
Web site characteristics
5
Desirable attributes of an archiving methodology
  • Coverage
  • Temporal
  • Responses to searches, forms, scripted links
  • Robustness
  • Simple
  • Adaptable
  • Cost
  • Feasible
  • Scalable
  • Ability to recreate web site at a point in time
  • Exactly as originally delivered
  • Support analysis, recovery

6
Approaches to archiving
  • Content archiving
  • Input side: capture all changes
  • Snapshot
  • crawl
  • backup
  • Response archiving
  • Output side: capture all unique request/responses

7
Coverage
  • Response archiving: address space and temporally complete; not
    subvertable; too complete!
  • Content archiving: dynamic content hard to capture; subvertable
  • Snapshot: incomplete (no forms, scripts..); gap between crawls
Robust
  • Response archiving: conceptually simple; independent of content
    type; in the critical path
  • Content archiving: assumes all content is perfectly managed
  • Snapshot: simple
Cost
  • Response archiving: small volumes; collection overhead
  • Content archiving: often part of CMS; small volumes
  • Snapshot: complete crawl is large
Recreate web site
  • Response archiving: faithful and complete
  • Content archiving: requires live site hardware, software,
    content, data, authentication, ...
  • Snapshot: faithful but incomplete

8
Is response archiving feasible?
  • Yes, because
  • Only a small % of responses are unique
  • Overhead and robustness can be addressed by
    design
  • Non-material changes can be defined
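
The first point can be illustrated with a minimal sketch of checksum-based deduplication. This is not pageVault's code: the class name, the in-memory dictionaries and the choice of SHA-256 are assumptions made for illustration, as is the rule that a response counts as a new version when its checksum differs from the latest one recorded for that URL.

```python
import hashlib


class ResponseArchive:
    """Toy in-memory archive (hypothetical): one stored copy per distinct body."""

    def __init__(self):
        self.history = {}  # url -> list of checksums, oldest first
        self.bodies = {}   # checksum -> response body, stored once

    def record(self, url: str, body: bytes) -> bool:
        """Return True if the response differs from the latest version for url."""
        digest = hashlib.sha256(body).hexdigest()  # SHA-256 is an assumption
        versions = self.history.setdefault(url, [])
        if versions and versions[-1] == digest:
            return False  # seen before: nothing to archive
        versions.append(digest)
        self.bodies.setdefault(digest, body)
        return True


archive = ResponseArchive()
results = [
    archive.record("/index.html", b"<h1>Hello</h1>"),
    archive.record("/index.html", b"<h1>Hello</h1>"),   # repeat hit
    archive.record("/index.html", b"<h1>Hello!</h1>"),  # changed content
]
print(results)
```

Because repeat hits return early, archive growth follows the rate of content change rather than the hit rate.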

9
Approaches to response archiving
  • Network sniffer
  • Not in the critical path
  • Cannot support HTTPS
  • Proxy
  • End to end problems (HTTPS, client IP addr)
  • Extra latency (TCP/IP session)
  • Filter
  • Runs within web server
  • Full access to req/resp
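
As a rough analogy for the filter approach, the sketch below uses Python WSGI middleware, which also runs inside the serving process and sees every request and response. It is not how an Apache 2 or IIS filter is written; the class name `CaptureFilter` and the `on_response` callback are hypothetical, and SHA-256 stands in for whatever checksum a real filter would use.

```python
import hashlib


class CaptureFilter:
    """WSGI middleware that observes each response without altering it."""

    def __init__(self, app, on_response):
        self.app = app
        # Hypothetical callback taking (url, status, checksum).
        self.on_response = on_response

    def __call__(self, environ, start_response):
        captured = {}

        def recording_start_response(status, headers, exc_info=None):
            captured["status"] = status
            return start_response(status, headers, exc_info)

        digest = hashlib.sha256()
        body = self.app(environ, recording_start_response)
        try:
            for chunk in body:
                digest.update(chunk)  # checksum computed as data streams out
                yield chunk           # response bytes pass through unchanged
        finally:
            if hasattr(body, "close"):
                body.close()
        # Hand off only after the whole response has been sent.
        self.on_response(
            environ.get("PATH_INFO", ""),
            captured.get("status"),
            digest.hexdigest(),
        )


def demo_app(environ, start_response):
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello ", b"world"]


seen = []
wrapped = CaptureFilter(demo_app, lambda url, status, cs: seen.append((url, status, cs)))
output = b"".join(wrapped({"PATH_INFO": "/greet"}, lambda s, h, e=None: None))
print(output, seen[0][:2])
```

The filter does nothing but stream, hash and hand off, which keeps the work done in the request path small.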

10
A filter implementation: pageVault
  • Simple filter gatherer
  • Uses Apache 2 or IIS server architecture
  • Big problems with Apache 1
  • Does as little as possible within the server

11
pageVault Architecture
12
pageVault design goals
  • Filter must be simple, efficient, robust
  • Negligible impact on server performance
  • No changes to web applications
  • Selection of responses to archive based on
    URL, content type
  • Support definition of non-material
    differences
  • Flexible archiving
  • Union archives, split archives
  • Complete point in time viewing experience
  • Plus support for analysis
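
Two of these goals, selection by URL and content type and the definition of non-material differences, can be sketched as follows. The configuration names, glob patterns and regular expressions are all hypothetical and do not reflect pageVault's actual configuration format; SHA-256 is likewise an assumption.

```python
import hashlib
import re
from fnmatch import fnmatchcase

# Hypothetical configuration; patterns are illustrative only.
INCLUDE_URLS = ["/news/*", "/*.html"]
INCLUDE_TYPES = ["text/html", "image/*"]
NON_MATERIAL = [
    re.compile(rb"Page generated at [^<]*"),       # timestamp footer
    re.compile(rb'name="session" value="[^"]*"'),  # per-visitor token
]


def should_archive(url: str, content_type: str) -> bool:
    """Select a response only if both its URL and media type are included."""
    media_type = content_type.split(";")[0].strip().lower()
    return any(fnmatchcase(url, p) for p in INCLUDE_URLS) and any(
        fnmatchcase(media_type, p) for p in INCLUDE_TYPES
    )


def material_checksum(body: bytes) -> str:
    """Checksum of the body with non-material strings removed first."""
    for pattern in NON_MATERIAL:
        body = pattern.sub(b"", body)
    return hashlib.sha256(body).hexdigest()


first = b"<p>Hi</p><!-- Page generated at 09:30:01 -->"
second = b"<p>Hi</p><!-- Page generated at 09:30:02 -->"
print(should_archive("/news/2004/release.html", "text/html; charset=utf-8"))
print(material_checksum(first) == material_checksum(second))
```

In this sketch the excluded strings only affect the comparison; the archived copy would still be the response exactly as delivered.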

13
Sample pageVault archive capabilities
  • What did this page/this site look like at 9:30 on
    4th May last year?
  • How many times and exactly how has this page
    changed over the last 6 months?
  • Which images in the "logos" directory have
    changed this week?
  • Show these versions of this URL side-by-side
  • Which media releases on our site have mentioned
    John Laws, however briefly available?
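
The first two questions reduce to lookups over a per-URL version history. The sketch below assumes a hypothetical history of (first-seen time, checksum) pairs sorted by time; the dates and checksums are invented, and pageVault's real index (a JDBM BTree, per the acknowledgements) is not modelled here.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical version history for one URL, sorted by first-seen time.
history = [
    (datetime(2003, 1, 10, 8, 0), "aaa111"),
    (datetime(2003, 5, 4, 9, 15), "bbb222"),
    (datetime(2003, 5, 4, 11, 0), "ccc333"),
]


def version_at(history, when):
    """Checksum of the version being served at `when`, or None if none yet."""
    times = [seen for seen, _ in history]
    index = bisect_right(times, when)
    return history[index - 1][1] if index else None


def changes_between(history, start, end):
    """All versions first seen within [start, end]."""
    return [(seen, cs) for seen, cs in history if start <= seen <= end]


print(version_at(history, datetime(2003, 5, 4, 9, 30)))
print(len(changes_between(history, datetime(2003, 5, 4), datetime(2003, 5, 5))))
```

The version in force at a given moment is the latest one first seen at or before that moment, which is what the binary search finds.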

14
Performance impact
  • Determining "uniqueness" requires calculation of
    checksum
  • 0.2ms per 10KB
  • pageVault adds 0.3 - 0.4 ms to service a typical
    request
  • a minimal static page takes 1.1ms,
  • typical scripted pages take 5-100 ms...
  • performance impact of determining strings to
    exclude for non-material purposes is
    negligible
  • - Apache 2.0.40, Sparc 750MHz processor,
    Solaris 8
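
To put these figures in proportion, the short calculation below turns them into relative overheads. It assumes the 1.1 ms and 5-100 ms service times are baselines measured without pageVault, which the slide does not state explicitly.

```python
# Figures quoted on the slide (Apache 2.0.40, Sparc 750MHz, Solaris 8).
CHECKSUM_MS_PER_10KB = 0.2
ADDED_MS = (0.3, 0.4)            # extra time per typical request
STATIC_PAGE_MS = 1.1             # minimal static page
SCRIPTED_PAGE_MS = (5.0, 100.0)  # typical scripted pages


def overhead_percent(added_ms: float, base_ms: float) -> float:
    """Added time as a percentage of the assumed baseline service time."""
    return 100.0 * added_ms / base_ms


static_overhead = tuple(overhead_percent(a, STATIC_PAGE_MS) for a in ADDED_MS)
scripted_overhead = (
    overhead_percent(min(ADDED_MS), max(SCRIPTED_PAGE_MS)),  # best case
    overhead_percent(max(ADDED_MS), min(SCRIPTED_PAGE_MS)),  # worst case
)
checksum_kb_per_ms = 10 / CHECKSUM_MS_PER_10KB

print(f"static page:   {static_overhead[0]:.1f}% to {static_overhead[1]:.1f}%")
print(f"scripted page: {scripted_overhead[0]:.1f}% to {scripted_overhead[1]:.1f}%")
print(f"checksum rate: {checksum_kb_per_ms:.0f} KB per ms")
```

Under that assumption the added time is roughly 27% to 36% of a minimal static page but only about 0.3% to 8% of a typical scripted page, and the quoted checksum cost corresponds to about 50 KB per millisecond on the test hardware.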

15
Comparison with Vignette's WebCapture
WebCapture
  • Enterprise-sized, integrated, strategic
  • Large investment
  • Focused on transactions
  • Aims to be able to replay transactions
pageVault
  • Simple, standalone, lightweight
  • Inexpensive
  • Targets all responses
  • Aims to recreate all responses on the entire
    website

16
pageVault applicability
  • Simple web site archives
  • Notary service
  • Independent archive of delivered responses
  • Union archive
  • Organisation-wide (multiple sites)
  • National archive
  • Thematic collection

17
Summary
  • Effective web site archiving is an unmet need
  • Legal
  • Reputation, community expectation
  • Provenance
  • Complete archiving with input-side and snapshot
    approaches is impractical
  • An output-side approach can be scalable,
    complete, inexpensive

18
Thanks to...
  • Russell McCaskie, Records Manager, CSIRO
  • Russell was responsible for bringing the
    significant issues with the preservation and
    management of web-site content to our attention
    in 1999
  • The JDBM team
  • An open source BTree implementation used by
    pageVault

19
More information
  • http://www.projectcomputing.com
  • kent.fitch_at_projectcomputing.com