Title: Web site archiving by capturing all unique responses
1 Web site archiving by capturing all unique responses
Archiving the Web Conference Information Day, National Library of Australia, 12 November 2004
- Kent Fitch, Project Computing Pty Ltd
2 Reasons for archiving web sites
- They are important
- Main public and internal communication mechanism
- Australian "Government Online", US Government Paperwork Elimination Act
- Legal
- Act of publication
- Context as well as content
- Reputation, Community Expectations
- Commercial Advantage
- Provenance
3 Web site characteristics
- Increasing dynamic content
- Content changes relatively slowly as a % of the total
- Small set of pages account for most hits
- Most responses have been seen before
4 Web site characteristics
5 Desirable attributes of an archiving methodology
- Coverage
- Temporal
- Responses to searches, forms, scripted links
- Robustness
- Simple
- Adaptable
- Cost
- Feasible
- Scalable
- Ability to recreate web site at a point in time
- Exactly as originally delivered
- Support analysis, recovery
6 Approaches to archiving
- Content archiving
- Input side: capture all changes
- Snapshot
- crawl
- backup
- Response archiving
- Output side: capture all unique requests/responses
7 Response archiving vs content archiving vs snapshot
- Coverage
- Response archiving: Address space temporally complete. Not subvertable. Too complete!
- Content archiving: Dynamic content hard to capture. Subvertable
- Snapshot: Incomplete (no forms, scripts...). Gap between crawls
- Simple
- Response archiving: Conceptually simple. Independent of content type
- Robust
- Response archiving: In the critical path
- Content archiving: Assumes all content is perfectly managed
- Cost
- Response archiving: Small volumes. Collection overhead
- Content archiving: Often part of CMS. Small volumes
- Snapshot: Complete crawl is large
- Recreate web site
- Response archiving: Faithful and complete
- Content archiving: Requires live site hardware, software, content, data, authentication, ...
- Snapshot: Faithful but incomplete
8 Is response archiving feasible?
- Yes, because
- Only a small % of responses are unique (see the sketch below)
- Overhead and robustness can be addressed by design
- Non-material changes can be defined
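The slides don't show pageVault's internals, but the core idea is easy to sketch: checksum each delivered response and archive the body only when that checksum has not been seen before. The FNV-1a hash, the in-memory table, and the is_new_response name below are illustrative assumptions for this sketch, not pageVault's actual design.

/* Minimal sketch of "archive only unique responses".
 * The 64-bit FNV-1a hash and the in-memory table are illustrative only;
 * a real archive needs a persistent, disk-backed index. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define TABLE_SIZE 65536            /* demo capacity */

static uint64_t seen[TABLE_SIZE];   /* 0 = empty slot */

/* FNV-1a: a cheap, well-known hash, standing in for pageVault's checksum */
static uint64_t fnv1a(const char *data, size_t len) {
    uint64_t h = 1469598103934665603ULL;
    for (size_t i = 0; i < len; i++) {
        h ^= (unsigned char)data[i];
        h *= 1099511628211ULL;
    }
    return h;
}

/* Returns 1 if this (url, body) pair has not been seen before, and records it. */
static int is_new_response(const char *url, const char *body, size_t len) {
    uint64_t h = fnv1a(url, strlen(url)) ^ fnv1a(body, len);
    if (h == 0) h = 1;                       /* keep 0 reserved for "empty" */
    size_t slot = (size_t)(h % TABLE_SIZE);
    while (seen[slot] != 0) {
        if (seen[slot] == h) return 0;       /* duplicate: do not archive */
        slot = (slot + 1) % TABLE_SIZE;      /* linear probing */
    }
    seen[slot] = h;
    return 1;                                /* unique: archive this response */
}

int main(void) {
    const char *body = "<html><body>hello</body></html>";
    printf("%d\n", is_new_response("/index.html", body, strlen(body))); /* 1: new */
    printf("%d\n", is_new_response("/index.html", body, strlen(body))); /* 0: seen */
    return 0;
}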
9 Approaches to response archiving
- Network sniffer
- Not in the critical path
- Cannot support HTTPS
- Proxy
- End-to-end problems (HTTPS, client IP addr)
- Extra latency (TCP/IP session)
- Filter
- Runs within web server
- Full access to req/resp
10 A filter implementation: pageVault
- Simple filter + gatherer
- Uses Apache 2 or IIS server architecture
- Big problems with Apache 1
- Does as little as possible within the server
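pageVault's filter source isn't reproduced in the slides. As a rough illustration of the approach, here is a minimal Apache 2 output filter in C that hashes each response body as the buckets stream past and passes the content through unchanged; handing the digest and a copy of the response to the gatherer is only indicated by a comment. The module and filter names and the FNV-1a hash are assumptions for this sketch, not pageVault's actual code.

/* Sketch of an Apache 2 output filter that checksums each response as it
 * streams past, doing as little as possible in the request path.
 * Names (unique_module, UNIQUE) and the hash are illustrative.
 * Build with: apxs -c mod_unique.c */
#include "httpd.h"
#include "http_config.h"
#include "http_log.h"
#include "util_filter.h"
#include "apr_buckets.h"

module AP_MODULE_DECLARE_DATA unique_module;

static apr_status_t unique_out_filter(ap_filter_t *f, apr_bucket_brigade *bb)
{
    apr_uint64_t *h = f->ctx;
    apr_bucket *b;

    if (h == NULL) {                       /* first call for this response */
        h = apr_palloc(f->r->pool, sizeof(*h));
        *h = APR_UINT64_C(1469598103934665603);   /* FNV-1a offset basis */
        f->ctx = h;
    }

    for (b = APR_BRIGADE_FIRST(bb);
         b != APR_BRIGADE_SENTINEL(bb);
         b = APR_BUCKET_NEXT(b)) {
        if (APR_BUCKET_IS_EOS(b)) {
            /* Response complete: this is where the digest and a copy of the
             * response would be queued for the gatherer, off the critical path */
            ap_log_rerror(APLOG_MARK, APLOG_DEBUG, 0, f->r,
                          "response digest %" APR_UINT64_T_HEX_FMT, *h);
            continue;
        }
        if (!APR_BUCKET_IS_METADATA(b)) {
            const char *data;
            apr_size_t len, i;
            apr_bucket_read(b, &data, &len, APR_BLOCK_READ);
            for (i = 0; i < len; i++) {    /* FNV-1a over the body bytes */
                *h ^= (unsigned char)data[i];
                *h *= APR_UINT64_C(1099511628211);
            }
        }
    }
    return ap_pass_brigade(f->next, bb);   /* deliver content unchanged */
}

static void register_hooks(apr_pool_t *p)
{
    ap_register_output_filter("UNIQUE", unique_out_filter, NULL,
                              AP_FTYPE_CONTENT_SET);
}

module AP_MODULE_DECLARE_DATA unique_module = {
    STANDARD20_MODULE_STUFF,
    NULL, NULL, NULL, NULL, NULL,
    register_hooks
};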
11 pageVault Architecture
12 pageVault design goals
- Filter must be simple, efficient, robust
- Negligible impact on server performance
- No changes to web applications
- Selection of responses to archive based on URL, content type
- Support definition of non-material differences (see the sketch below)
- Flexible archiving
- Union archives, split archives
- Complete point-in-time viewing experience
- Plus support for analysis
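The slides don't define how non-material differences are expressed. One plausible mechanism, sketched below under that assumption, is to skip spans between configured begin/end markers (wrapping, say, a generation timestamp) when computing the checksum, so two responses differing only inside those spans count as the same version. The marker-pair syntax is an assumption, not pageVault's actual configuration format.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hash body bytes (FNV-1a), skipping any span between begin_mark and end_mark */
static uint64_t material_hash(const char *body,
                              const char *begin_mark, const char *end_mark) {
    uint64_t h = 1469598103934665603ULL;
    const char *p = body;
    while (*p) {
        const char *skip = strstr(p, begin_mark);
        const char *stop = skip ? skip : p + strlen(p);
        for (; p < stop; p++) {               /* hash only the material bytes */
            h ^= (unsigned char)*p;
            h *= 1099511628211ULL;
        }
        if (skip) {                           /* jump over the excluded span */
            const char *end = strstr(skip + strlen(begin_mark), end_mark);
            if (!end) break;                  /* unterminated: ignore the rest */
            p = end + strlen(end_mark);
        }
    }
    return h;
}

int main(void) {
    /* Same page served twice with different generation timestamps */
    const char *a = "<p>News</p><!--ts-->10:00<!--/ts-->";
    const char *b = "<p>News</p><!--ts-->10:05<!--/ts-->";
    printf("%s\n", material_hash(a, "<!--ts-->", "<!--/ts-->") ==
                   material_hash(b, "<!--ts-->", "<!--/ts-->")
                   ? "not materially different" : "different");
    return 0;
}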
13 Sample pageVault archive capabilities
- What did this page/this site look like at 9:30 on 4th May last year? (see the lookup sketch below)
- How many times and exactly how has this page changed over the last 6 months?
- Which images in the "logos" directory have changed this week?
- Show these versions of this URL side-by-side
- Which media releases on our site have mentioned John Laws, however briefly they were available?
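The first question above amounts to a point-in-time lookup: the page "as at" time T is the latest archived version of the URL whose capture time is at or before T. pageVault keeps its index in a JDBM BTree; the sorted array below, with made-up timestamps, is an illustrative stand-in for that index.

#include <stdio.h>
#include <string.h>
#include <time.h>

typedef struct {
    const char *url;
    time_t      captured;   /* when this unique version first appeared */
    const char *content;    /* archived response body (elided) */
} version_t;

/* Index entries sorted by (url, captured) ascending, as a BTree would keep them */
static const version_t archive[] = {
    { "/index.html", 1000, "version A" },
    { "/index.html", 2000, "version B" },
    { "/index.html", 3500, "version C" },
};
static const size_t n_versions = sizeof(archive) / sizeof(archive[0]);

/* Latest version of url captured at or before t, or NULL if none existed yet */
static const version_t *as_at(const char *url, time_t t) {
    const version_t *best = NULL;
    for (size_t i = 0; i < n_versions; i++) {
        if (strcmp(archive[i].url, url) == 0 && archive[i].captured <= t)
            best = &archive[i];   /* ascending order: keep the latest match */
    }
    return best;
}

int main(void) {
    const version_t *v = as_at("/index.html", 2500);
    printf("%s\n", v ? v->content : "not yet published");  /* "version B" */
    return 0;
}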
14 Performance impact
- Determining "uniqueness" requires calculation of a checksum
- 0.2ms per 10KB
- pageVault adds 0.3 - 0.4 ms to service a typical request
- a minimal static page takes 1.1ms
- typical scripted pages take 5 - 100ms...
- performance impact of determining strings to exclude for non-material purposes is negligible
- Apache 2.0.40, Sparc 750MHz processor, Solaris 8
15 Comparison with Vignette's WebCapture
- pageVault
- Simple, standalone, lightweight
- Inexpensive
- Targets all responses
- Aims to recreate all responses on the entire website
- WebCapture
- Enterprise-sized, integrated, strategic
- Large investment
- Focused on transactions
- Aims to be able to replay transactions
16 pageVault applicability
- Simple web site archives
- Notary service
- Independent archive of delivered responses
- Union archive
- Organisation-wide (multiple sites)
- National archive
- Thematic collection
17 Summary
- Effective web site archiving is an unmet need
- Legal
- Reputation, community expectation
- Provenance
- Complete archiving with input-side and snapshot approaches is impractical
- An output-side approach can be scalable, complete, inexpensive
18 Thanks to...
- Russell McCaskie, Records Manager, CSIRO
- Russell was responsible for bringing the significant issues with the preservation and management of web-site content to our attention in 1999
- The JDBM team
- An open source BTree implementation used by pageVault
19 More information
- http://www.projectcomputing.com
- kent.fitch@projectcomputing.com