Title: Lazy Preservation: Reconstructing Websites by Crawling the Crawlers
1. Lazy Preservation: Reconstructing Websites by Crawling the Crawlers
- Frank McCown, Joan A. Smith, Michael L. Nelson, Johan Bollen
- Old Dominion University, Norfolk, Virginia, USA
- Arlington, Virginia, November 10, 2006
2. Outline
- Web page threats
- Web Infrastructure
- Web caching experiment
- Web repository crawling
- Website reconstruction experiment
3. Web Page Threats
[Slide images: a black hat hacker, a computer virus, and a hard drive head crash]
4. How much of the Web is indexed?
- Estimates from "The Indexable Web is More than 11.5 billion pages" by Gulli and Signorini (WWW'05)
7. Cached Image
8. Cached PDF
[Slide images: the canonical PDF at http://www.fda.gov/cder/about/whatwedo/testtube.pdf beside the MSN, Yahoo, and Google versions]
9. Web Repository Characteristics
- C: canonical version is stored
- M: modified version is stored (modified images are thumbnails; all others are HTML conversions)
- R: indexed but not retrievable
- S: indexed but not stored
10. Timeline of a Web Resource
11. Web Caching Experiment
- Create 4 websites composed of HTML, PDF, and image resources:
- http://www.owenbrau.com/
- http://www.cs.odu.edu/~fmccown/lazy/
- http://www.cs.odu.edu/~jsmit/
- http://www.cs.odu.edu/~mln/lazp/
- Remove pages each day
- Query Google, MSN, and Yahoo (GMY) each day using the unique identifiers embedded in each page (a probe sketch follows)
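A minimal sketch of such a daily probe, assuming each test page carries a unique identifier string that matches nothing else on the Web; the search URL, page list, and identifier values below are illustrative stand-ins, not the experiment's actual setup (Python):

import datetime, urllib.parse, urllib.request

# Hypothetical mapping of test pages to the unique IDs they embed.
PAGE_IDS = {
    "http://www.cs.odu.edu/~fmccown/lazy/page1.html": "lazyexp0001xq",
}

def is_indexed(search_url, identifier):
    # Ask a search engine for the identifier; any hit means the page is
    # indexed. A real probe would go through each engine's API (Google,
    # MSN, Yahoo) and respect its daily query quota.
    url = search_url + urllib.parse.quote(identifier)
    with urllib.request.urlopen(url) as resp:
        return identifier.encode() in resp.read()

for page, ident in PAGE_IDS.items():
    hit = is_indexed("http://search.example.com/?q=", ident)
    print(datetime.date.today(), page, "indexed" if hit else "not indexed")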
16. Crawling the Web and Web Repositories
17. Warrick
- First developed in fall of 2005
- Available for download at http://www.cs.odu.edu/~fmccown/warrick/
- www2006.org: first lost website reconstructed (Nov 2005)
- DCkickball.org: first website someone else reconstructed without our help (late Jan 2006)
- www.iclnet.org: first website we reconstructed for someone else (mid Mar 2006)
- Internet Archive officially endorses Warrick (mid Mar 2006)
18. How Much Did We Reconstruct?
[Slide diagram: the lost website's resources A-F beside the reconstructed website's resources A, B, C, E, G]
- Four categories of recovered resources:
- 1) Identical: A, E
- 2) Changed: B, C
- 3) Missing: D, F
- 4) Added: G
- Missing link to D; points to old resource G
- F can't be found
19. Reconstruction Diagram
[Slide diagram: identical 50%, changed 33%, missing 17%, added 20%]
20. Reconstruction Experiment
- Crawl and reconstruct 24 websites of various sizes:
- 1. small (1-150 resources)
- 2. medium (151-499 resources)
- 3. large (500+ resources)
- Perform 5 reconstructions for each website:
- One using all four repositories together
- Four using each repository separately
- Calculate a reconstruction vector (changed, missing, added) for each reconstruction; a sketch of the calculation follows
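One way to compute the reconstruction vector from the lost and recovered resource sets; a sketch assuming changed and missing are measured against the lost site and added against the reconstructed site, with a plain equality test standing in for whatever content comparison is actually used (Python):

def reconstruction_vector(lost, recovered, same_content):
    # lost and recovered map URIs to content. Denominators here are an
    # assumption: changed/missing relative to the lost site, added
    # relative to the reconstructed site.
    changed = sum(1 for u in lost
                  if u in recovered and not same_content(lost[u], recovered[u]))
    missing = sum(1 for u in lost if u not in recovered)
    added = sum(1 for u in recovered if u not in lost)
    return (changed / len(lost), missing / len(lost), added / len(recovered))

# Toy run with slide 18's example: A, E identical; B, C changed;
# D, F missing; G added.
lost = {"A": "a", "B": "b", "C": "c", "D": "d", "E": "e", "F": "f"}
recovered = {"A": "a", "B": "b2", "C": "c2", "E": "e", "G": "g"}
print(reconstruction_vector(lost, recovered, lambda x, y: x == y))
# (0.33, 0.33, 0.20) for this toy example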
21. Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. "Reconstructing Websites for the Lazy Webmaster." Technical Report, arXiv cs.IR/0512069, 2005.
22. Recovery Success by MIME Type
23. Repository Contributions
24. Current & Future Work
- Building a web interface for Warrick
- Currently crawling and reconstructing 300 randomly sampled websites each week
- Move from a descriptive model to a prescriptive and predictive model
- Injecting server-side functionality into the WI (Web Infrastructure)
- Recover the PHP code, not just the HTML
25. Time Queries
26. Traditional Web Crawler
27. Web-Repository Crawler
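The contrast between slides 26 and 27 can be sketched in a few lines: both crawlers run the same frontier loop, but a traditional crawler dereferences each URI on the live Web, while a web-repository crawler asks repositories (Google, MSN, Yahoo, Internet Archive) for their stored copies of a site that no longer exists. The lookup interface below is a placeholder, not any engine's real API (Python):

from collections import deque

def crawl(seed, fetch, extract_links):
    # Generic frontier-based crawl shared by both crawler types; only
    # the fetch function differs.
    frontier, seen, results = deque([seed]), {seed}, {}
    while frontier:
        uri = frontier.popleft()
        content = fetch(uri)
        if content is None:
            continue  # gone from the live Web / held by no repository
        results[uri] = content
        for link in extract_links(uri, content):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return results

def repo_fetch(uri, repositories):
    # Web-repository fetch: try each repository's lookup function in
    # preference order (e.g. a repository holding the canonical format
    # before one that stores only an HTML conversion).
    for lookup in repositories:
        copy = lookup(uri)
        if copy is not None:
            return copy
    return None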
28. Limitations
- Web crawling:
- Limit hit rate per host
- Websites periodically unavailable
- Portions of websites off-limits (robots.txt, passwords)
- Deep web
- Spam
- Duplicate content
- Flash and JavaScript interfaces
- Crawler traps
- Web-repository crawling:
- Limit hit rate per repo
- Limited hits per day (API query quotas); see the limiter sketch after this list
- Repos periodically unavailable
- Flash and JavaScript interfaces
- Can only recover what repos have stored
- Lossy format conversions (thumbnail images, HTMLized PDFs, etc.)
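Given the per-repository hit-rate limits and daily API quotas listed above, a web-repository crawler has to meter its own requests; a minimal limiter sketch, with made-up quota and delay values (Python):

import time

class RepoLimiter:
    # Caps requests per day to one repository and enforces a minimum
    # delay between hits. The quota and delay values are illustrative.
    def __init__(self, daily_quota=1000, min_delay=1.0):
        self.daily_quota = daily_quota
        self.min_delay = min_delay
        self.used = 0
        self.last_hit = 0.0

    def acquire(self):
        if self.used >= self.daily_quota:
            return False  # quota exhausted; defer remaining URIs to tomorrow
        wait = self.min_delay - (time.monotonic() - self.last_hit)
        if wait > 0:
            time.sleep(wait)
        self.used += 1
        self.last_hit = time.monotonic()
        return True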