Title: Lazy Preservation: Reconstructing Websites by Crawling the Crawlers
1. Lazy Preservation: Reconstructing Websites by Crawling the Crawlers
- Frank McCown, Joan A. Smith, Michael L. Nelson, Johan Bollen
- Old Dominion University, Norfolk, Virginia, USA
- Arlington, Virginia, November 10, 2006
- Web page threats
- Web Infrastructure
- Web caching experiment
- Web repository crawling
- Website reconstruction experiment
3. Image sources
- Black hat: http://img.webpronews.com/securitypronews/110705blackhat.jpg
- Virus image: puter.virus_1137794805.jpg
- Hard drive
4. How Much of the Web Is Indexed?
Estimates from "The Indexable Web is More than 11.5 Billion Pages" by Gulli and Signorini
7. Cached Image
8. Cached PDF
MSN version, Yahoo version, Google version
9. Web Repository Characteristics
- C: Canonical version is stored
- M: Modified version is stored (modified images are thumbnails; all others are HTML conversions)
- R: Indexed but not retrievable
- S: Indexed but not stored
10. Timeline of Web Resource
11. Web Caching Experiment
- Create 4 websites composed of HTML, PDF, and images
  - http://www.owenbrau.com/
  - http://www.cs.odu.edu/~fmccown/lazy/
  - http://www.cs.odu.edu/~jsmit/
  - http://www.cs.odu.edu/~mln/lazp/
- Remove pages each day
- Query Google, MSN, and Yahoo (GMY) each day using identifiers
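The daily queries described above yield, for each resource and repository, a yes/no observation per day. A minimal sketch of how such observations might be summarized follows. The `CacheSummary` type, the `summarize` function, and the sample observations are hypothetical illustrations, not the experiment's actual tooling or data.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class CacheSummary:
    first_cached_day: Optional[int]  # first day a cached copy was returned
    last_cached_day: Optional[int]   # last day a cached copy was returned
    days_lingering: Optional[int]    # cached days on or after removal from the site


def summarize(observations: List[bool], removed_on: Optional[int] = None) -> CacheSummary:
    """observations[d] is True if the query on day d found a cached copy."""
    cached_days = [day for day, hit in enumerate(observations) if hit]
    if not cached_days:
        return CacheSummary(None, None, None)
    lingering = None
    if removed_on is not None:
        # Count the days the copy stayed cached once the page was gone.
        lingering = sum(1 for day in cached_days if day >= removed_on)
    return CacheSummary(cached_days[0], cached_days[-1], lingering)


# Made-up observations for one resource that was removed from its site on day 3.
daily_hits: Dict[str, List[bool]] = {
    "Google": [False, False, True, True, True, True, False],
    "MSN": [False, True, True, True, False, False, False],
    "Yahoo": [False, False, False, False, False, False, False],
}
summaries = {repo: summarize(hits, removed_on=3) for repo, hits in daily_hits.items()}
```

Passing the removal day explicitly keeps the summary independent of how pages were scheduled for removal.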
16. Crawling the Web and Web Repositories
17. Warrick
- First developed in fall of 2005
- Available for download at http://www.cs.odu.edu/~fmccown/warrick/
- www2006.org: first lost website reconstructed (Nov 2005)
- DCkickball.org: first website someone else reconstructed without our help (late Jan 2006)
- www.iclnet.org: first website we reconstructed for someone else (mid Mar 2006)
- Internet Archive officially endorses Warrick (mid Mar 2006)
18. How Much Did We Reconstruct?
Figure labels: Lost web site, Reconstructed web site
Four categories of recovered resources:
1) Identical: A, E
2) Changed: B, C
3) Missing: D, F
4) Added: G
Missing link to D; points to old resource G
F can't be found
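The four categories can be illustrated with a small sketch that compares a lost site against its reconstruction. It assumes each site is represented as a mapping from resource name to a content digest. That representation, the `classify` function, and the digest strings are hypothetical and used only for illustration.

```python
from typing import Dict, Set


def classify(original: Dict[str, str], reconstructed: Dict[str, str]) -> Dict[str, Set[str]]:
    """Sort resources into identical, changed, missing, and added.

    Both arguments map a resource name to a content digest (an assumed
    representation; any stable content fingerprint would do).
    """
    orig_names, recon_names = set(original), set(reconstructed)
    shared = orig_names & recon_names
    return {
        "identical": {n for n in shared if original[n] == reconstructed[n]},
        "changed": {n for n in shared if original[n] != reconstructed[n]},
        "missing": orig_names - recon_names,   # in the lost site only
        "added": recon_names - orig_names,     # in the reconstruction only
    }


# The lettered resources from the slide, with made-up digests.
lost = {"A": "a1", "B": "b1", "C": "c1", "D": "d1", "E": "e1", "F": "f1"}
recovered = {"A": "a1", "B": "b0", "C": "c0", "E": "e1", "G": "g0"}
result = classify(lost, recovered)
```

Comparing digests rather than full content keeps the check cheap, at the cost of treating any byte-level difference as a change.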
19. Reconstruction Diagram
added: 20%
changed: 33%
missing: 17%
identical: 50%
20. Reconstruction Experiment
- Crawl and reconstruct 24 sites of various sizes
  - 1. small (1-150 resources)
  - 2. medium (151-499 resources)
  - 3. large (500 or more resources)
- Perform 5 reconstructions for each website
  - One using all four repositories together
  - Four using each repository separately
- Calculate reconstruction vector for each reconstruction (changed, missing, added)
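A minimal sketch of the size buckets and the reconstruction vector follows. The slide names only the vector's three components, so the normalization used here is an assumption: changed and missing are divided by the size of the original site, and added by the size of the reconstructed site. The function and class names are hypothetical.

```python
from typing import NamedTuple


class ReconstructionVector(NamedTuple):
    changed: float
    missing: float
    added: float


def size_category(resource_count: int) -> str:
    """Bucket a site by resource count, using the ranges from the slide."""
    if resource_count < 1:
        raise ValueError("a site needs at least one resource")
    if resource_count <= 150:
        return "small"
    if resource_count <= 499:
        return "medium"
    return "large"


def reconstruction_vector(identical: int, changed: int, missing: int, added: int) -> ReconstructionVector:
    """Compute (changed, missing, added) as fractions.

    Assumed normalization: changed and missing over the original site's
    size, added over the reconstructed site's size.
    """
    original_size = identical + changed + missing
    reconstructed_size = identical + changed + added
    if original_size == 0:
        raise ValueError("original site has no resources")
    added_fraction = added / reconstructed_size if reconstructed_size else 0.0
    return ReconstructionVector(changed / original_size, missing / original_size, added_fraction)


# Counts for a site with 2 identical, 2 changed, 2 missing, and 1 added resource.
vector = reconstruction_vector(identical=2, changed=2, missing=2, added=1)
```

Under this assumed normalization, a perfect reconstruction gives the vector (0, 0, 0).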
21. Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. Reconstructing Websites for the Lazy Webmaster. Technical Report, arXiv cs.IR/0512069, 2005.
22. Recovery Success by MIME Type
23. Repository Contributions
24. Current and Future Work
- Building a web interface for Warrick
- Currently crawling and reconstructing 300 randomly sampled websites each week
- Move from a descriptive model to a prescriptive and predictive model
- Injecting server-side functionality into the Web Infrastructure (WI)
  - Recover the PHP code, not just the HTML
25. Time and Queries
26. Traditional Web Crawler
27. Web-Repository Crawler
- Web crawling
  - Limit hit rate per host
  - Websites periodically unavailable
  - Portions of website off-limits (robots.txt, passwords)
  - Deep web
  - Spam
  - Duplicate content
  - Flash and JavaScript interfaces
  - Crawler traps
- Web-repo crawling
  - Limit hit rate per repo
  - Limited hits per day (API query quotas)
  - Repos periodically unavailable
  - Flash and JavaScript interfaces
  - Can only recover what repos have stored
  - Lossy format conversions (thumbnail images, HTMLized PDFs, etc.)
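Two of the web-repo constraints above, a limited hit rate per repo and a limited number of hits per day, can be sketched as a small per-repository budget. The `RepoBudget` class and the quota and interval values are hypothetical and do not reflect the actual limits of any repository or how Warrick enforces them.

```python
from dataclasses import dataclass


@dataclass
class RepoBudget:
    daily_quota: int                  # assumed maximum queries per day
    min_interval: float               # assumed minimum seconds between queries
    used_today: int = 0
    last_hit: float = float("-inf")   # no query has been sent yet

    def try_acquire(self, now: float) -> bool:
        """Return True and record the hit if a query is allowed at time `now`."""
        if self.used_today >= self.daily_quota:
            return False              # daily quota exhausted
        if now - self.last_hit < self.min_interval:
            return False              # too soon after the previous query
        self.used_today += 1
        self.last_hit = now
        return True

    def reset_day(self) -> None:
        """Start a new day with a fresh quota."""
        self.used_today = 0


# Illustrative values only: 2 queries per day, at least 1 second apart.
budget = RepoBudget(daily_quota=2, min_interval=1.0)
decisions = [budget.try_acquire(t) for t in (0.0, 0.5, 1.0, 5.0)]
```

Taking the current time as a parameter keeps the sketch deterministic. A crawler would typically pass `time.monotonic()` and keep one budget per repository.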