Title: Web site archiving by capturing all unique responses
1 Web site archiving by capturing all unique responses
Archiving the Web Conference Information Day, National Library of Australia, 12 November 2004
- Kent Fitch, Project Computing Pty Ltd
2 Reasons for archiving web sites
- They are important
- Main public and internal communication mechanism
- Australian "Government Online", US Government Paperwork Elimination Act
- Legal
- Act of publication
- Context as well as content
- Reputation, Community Expectations
- Commercial Advantage
- Provenance
3 Web site characteristics
- Increasing dynamic content
- Content changes relatively slowly as a % of the total
- Small set of pages account for most hits
- Most responses have been seen before
4 Web site characteristics
5 Desirable attributes of an archiving methodology
- Coverage
- Temporal
- Responses to searches, forms, scripted links
- Robustness
- Simple
- Adaptable
- Cost
- Feasible
- Scalable
- Ability to recreate web site at a point in time
- Exactly as originally delivered
- Support analysis, recovery
6 Approaches to archiving
- Content archiving
- Input side: capture all changes
- Snapshot
- crawl
- backup
- Response archiving
- Output side: capture all unique requests/responses
7 Response archiving vs content archiving vs snapshot
- Coverage
- Response archiving: Address space temporally complete. Not subvertable. Too complete!
- Content archiving: Dynamic content hard to capture. Subvertable
- Snapshot: Incomplete (no forms, scripts...). Gap between crawls
- Simple
- Response archiving: Conceptually simple. Independent of content type
- Robust
- Response archiving: In the critical path
- Content archiving: Assumes all content is perfectly managed
- Cost
- Response archiving: Small volumes. Collection overhead
- Content archiving: Often part of CMS. Small volumes
- Snapshot: Complete crawl is large
- Recreate web site
- Response archiving: Faithful and complete
- Content archiving: Requires live site hardware, software, content, data, authentication, ...
- Snapshot: Faithful but incomplete
8 Is response archiving feasible?
- Yes, because
- Only a small % of responses are unique (see the sketch below)
- Overhead and robustness can be addressed by design
- Non-material changes can be defined
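The slides don't show pageVault's internals, but the core idea is easy to sketch: checksum each delivered response and archive the body only when that checksum has not been seen before. The FNV-1a hash, the in-memory table, and the is_new_response name below are illustrative assumptions for this sketch, not pageVault's actual design.

/* Minimal sketch of "archive only unique responses".
 * The 64-bit FNV-1a hash and the in-memory table are illustrative only;
 * a real archive needs a persistent, disk-backed index. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define TABLE_SIZE 65536            /* demo capacity */

static uint64_t seen[TABLE_SIZE];   /* 0 = empty slot */

/* FNV-1a: a cheap, well-known hash, standing in for pageVault's checksum */
static uint64_t fnv1a(const char *data, size_t len) {
    uint64_t h = 1469598103934665603ULL;
    for (size_t i = 0; i < len; i++) {
        h ^= (unsigned char)data[i];
        h *= 1099511628211ULL;
    }
    return h;
}

/* Returns 1 if this (url, body) pair has not been seen before, and records it. */
static int is_new_response(const char *url, const char *body, size_t len) {
    uint64_t h = fnv1a(url, strlen(url)) ^ fnv1a(body, len);
    if (h == 0) h = 1;                       /* keep 0 reserved for "empty" */
    size_t slot = (size_t)(h % TABLE_SIZE);
    while (seen[slot] != 0) {
        if (seen[slot] == h) return 0;       /* duplicate: do not archive */
        slot = (slot + 1) % TABLE_SIZE;      /* linear probing */
    }
    seen[slot] = h;
    return 1;                                /* unique: archive this response */
}

int main(void) {
    const char *body = "<html><body>hello</body></html>";
    printf("%d\n", is_new_response("/index.html", body, strlen(body))); /* 1: new */
    printf("%d\n", is_new_response("/index.html", body, strlen(body))); /* 0: seen */
    return 0;
}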
9 Approaches to response archiving
- Network sniffer
- Not in the critical path
- Cannot support HTTPS
- Proxy
- End-to-end problems (HTTPS, client IP addr)
- Extra latency (TCP/IP session)
- Filter
- Runs within web server
- Full access to req/resp
10 A filter implementation: pageVault
- Simple filter + gatherer
- Uses Apache 2 or IIS server architecture
- Big problems with Apache 1
- Does as little as possible within the server
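pageVault's filter source isn't reproduced in the slides. As a rough illustration of the approach, here is a minimal Apache 2 output filter in C that hashes each response body as the buckets stream past and passes the content through unchanged; handing the digest and a copy of the response to the gatherer is only indicated by a comment. The module and filter names and the FNV-1a hash are assumptions for this sketch, not pageVault's actual code.

/* Sketch of an Apache 2 output filter that checksums each response as it
 * streams past, doing as little as possible in the request path.
 * Names (unique_module, UNIQUE) and the hash are illustrative.
 * Build with: apxs -c mod_unique.c */
#include "httpd.h"
#include "http_config.h"
#include "http_log.h"
#include "util_filter.h"
#include "apr_buckets.h"

module AP_MODULE_DECLARE_DATA unique_module;

static apr_status_t unique_out_filter(ap_filter_t *f, apr_bucket_brigade *bb)
{
    apr_uint64_t *h = f->ctx;
    apr_bucket *b;

    if (h == NULL) {                       /* first call for this response */
        h = apr_palloc(f->r->pool, sizeof(*h));
        *h = APR_UINT64_C(1469598103934665603);   /* FNV-1a offset basis */
        f->ctx = h;
    }

    for (b = APR_BRIGADE_FIRST(bb);
         b != APR_BRIGADE_SENTINEL(bb);
         b = APR_BUCKET_NEXT(b)) {
        if (APR_BUCKET_IS_EOS(b)) {
            /* Response complete: this is where the digest and a copy of the
             * response would be queued for the gatherer, off the critical path */
            ap_log_rerror(APLOG_MARK, APLOG_DEBUG, 0, f->r,
                          "response digest %" APR_UINT64_T_HEX_FMT, *h);
            continue;
        }
        if (!APR_BUCKET_IS_METADATA(b)) {
            const char *data;
            apr_size_t len, i;
            apr_bucket_read(b, &data, &len, APR_BLOCK_READ);
            for (i = 0; i < len; i++) {    /* FNV-1a over the body bytes */
                *h ^= (unsigned char)data[i];
                *h *= APR_UINT64_C(1099511628211);
            }
        }
    }
    return ap_pass_brigade(f->next, bb);   /* deliver content unchanged */
}

static void register_hooks(apr_pool_t *p)
{
    ap_register_output_filter("UNIQUE", unique_out_filter, NULL,
                              AP_FTYPE_CONTENT_SET);
}

module AP_MODULE_DECLARE_DATA unique_module = {
    STANDARD20_MODULE_STUFF,
    NULL, NULL, NULL, NULL, NULL,
    register_hooks
};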
11 pageVault Architecture
12 pageVault design goals
- Filter must be simple, efficient, robust
- Negligible impact on server performance
- No changes to web applications
- Selection of responses to archive based on URL, content type
- Support definition of non-material differences (see the sketch below)
- Flexible archiving
- Union archives, split archives
- Complete point-in-time viewing experience
- Plus support for analysis
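The slides don't define how non-material differences are expressed. One plausible mechanism, sketched below under that assumption, is to skip spans between configured begin/end markers (wrapping, say, a generation timestamp) when computing the checksum, so two responses differing only inside those spans count as the same version. The marker-pair syntax is an assumption, not pageVault's actual configuration format.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hash body bytes (FNV-1a), skipping any span between begin_mark and end_mark */
static uint64_t material_hash(const char *body,
                              const char *begin_mark, const char *end_mark) {
    uint64_t h = 1469598103934665603ULL;
    const char *p = body;
    while (*p) {
        const char *skip = strstr(p, begin_mark);
        const char *stop = skip ? skip : p + strlen(p);
        for (; p < stop; p++) {               /* hash only the material bytes */
            h ^= (unsigned char)*p;
            h *= 1099511628211ULL;
        }
        if (skip) {                           /* jump over the excluded span */
            const char *end = strstr(skip + strlen(begin_mark), end_mark);
            if (!end) break;                  /* unterminated: ignore the rest */
            p = end + strlen(end_mark);
        }
    }
    return h;
}

int main(void) {
    /* Same page served twice with different generation timestamps */
    const char *a = "<p>News</p><!--ts-->10:00<!--/ts-->";
    const char *b = "<p>News</p><!--ts-->10:05<!--/ts-->";
    printf("%s\n", material_hash(a, "<!--ts-->", "<!--/ts-->") ==
                   material_hash(b, "<!--ts-->", "<!--/ts-->")
                   ? "not materially different" : "different");
    return 0;
}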
13 Sample pageVault archive capabilities
- What did this page/this site look like at 9:30 on 4th May last year? (see the lookup sketch below)
- How many times and exactly how has this page changed over the last 6 months?
- Which images in the "logos" directory have changed this week?
- Show these versions of this URL side-by-side
- Which media releases on our site have mentioned John Laws, however briefly they were available?
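The first question above amounts to a point-in-time lookup: the page "as at" time T is the latest archived version of the URL whose capture time is at or before T. pageVault keeps its index in a JDBM BTree; the sorted array below, with made-up timestamps, is an illustrative stand-in for that index.

#include <stdio.h>
#include <string.h>
#include <time.h>

typedef struct {
    const char *url;
    time_t      captured;   /* when this unique version first appeared */
    const char *content;    /* archived response body (elided) */
} version_t;

/* Index entries sorted by (url, captured) ascending, as a BTree would keep them */
static const version_t archive[] = {
    { "/index.html", 1000, "version A" },
    { "/index.html", 2000, "version B" },
    { "/index.html", 3500, "version C" },
};
static const size_t n_versions = sizeof(archive) / sizeof(archive[0]);

/* Latest version of url captured at or before t, or NULL if none existed yet */
static const version_t *as_at(const char *url, time_t t) {
    const version_t *best = NULL;
    for (size_t i = 0; i < n_versions; i++) {
        if (strcmp(archive[i].url, url) == 0 && archive[i].captured <= t)
            best = &archive[i];   /* ascending order: keep the latest match */
    }
    return best;
}

int main(void) {
    const version_t *v = as_at("/index.html", 2500);
    printf("%s\n", v ? v->content : "not yet published");  /* "version B" */
    return 0;
}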
14 Performance impact
- Determining "uniqueness" requires calculation of a checksum
- 0.2ms per 10KB
- pageVault adds 0.3 - 0.4 ms to service a typical request
- a minimal static page takes 1.1ms
- typical scripted pages take 5 - 100ms...
- performance impact of determining strings to exclude for non-material purposes is negligible
- Apache 2.0.40, Sparc 750MHz processor, Solaris 8
15 Comparison with Vignette's WebCapture
- pageVault
- Simple, standalone, lightweight
- Inexpensive
- Targets all responses
- Aims to recreate all responses on the entire website
- WebCapture
- Enterprise-sized, integrated, strategic
- Large investment
- Focused on transactions
- Aims to be able to replay transactions
16 pageVault applicability
- Simple web site archives
- Notary service
- Independent archive of delivered responses
- Union archive
- Organisation-wide (multiple sites)
- National archive
- Thematic collection
17 Summary
- Effective web site archiving is an unmet need
- Legal
- Reputation, community expectation
- Provenance
- Complete archiving with input-side and snapshot approaches is impractical
- An output-side approach can be scalable, complete, inexpensive
18 Thanks to...
- Russell McCaskie, Records Manager, CSIRO
- Russell was responsible for bringing the significant issues with the preservation and management of web-site content to our attention in 1999
- The JDBM team
- An open source BTree implementation used by pageVault
19 More information
- http://www.projectcomputing.com
- kent.fitch@projectcomputing.com