Title: Lazy Preservation: Reconstructing Websites by Crawling the Crawlers
1. Lazy Preservation: Reconstructing Websites by Crawling the Crawlers
- Frank McCown, Joan A. Smith, Michael L. Nelson, Johan Bollen
- Old Dominion University, Norfolk, Virginia, USA
- Arlington, Virginia, November 10, 2006
2. Outline
- Web page threats
- Web Infrastructure
- Web caching experiment
- Web repository crawling
- Website reconstruction experiment
3. Web Page Threats
[Slide images: a black hat hacker, a computer virus, and a hard drive head crash]
4. How much of the Web is indexed?
- Estimates from "The Indexable Web is More than 11.5 billion pages" by Gulli and Signorini (WWW'05)
7. Cached Image
8. Cached PDF
[Slide images: the canonical PDF at http://www.fda.gov/cder/about/whatwedo/testtube.pdf beside the MSN, Yahoo, and Google versions]
9. Web Repository Characteristics
- C: canonical version is stored
- M: modified version is stored (modified images are thumbnails; all others are HTML conversions)
- R: indexed but not retrievable
- S: indexed but not stored
10. Timeline of a Web Resource
11. Web Caching Experiment
- Create 4 websites composed of HTML, PDF, and image resources:
- http://www.owenbrau.com/
- http://www.cs.odu.edu/~fmccown/lazy/
- http://www.cs.odu.edu/~jsmit/
- http://www.cs.odu.edu/~mln/lazp/
- Remove pages each day
- Query Google, MSN, and Yahoo (GMY) each day using the unique identifiers embedded in each page (a probe sketch follows)
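A minimal sketch of such a daily probe, assuming each test page carries a unique identifier string that matches nothing else on the Web; the search URL, page list, and identifier values below are illustrative stand-ins, not the experiment's actual setup (Python):

import datetime, urllib.parse, urllib.request

# Hypothetical mapping of test pages to the unique IDs they embed.
PAGE_IDS = {
    "http://www.cs.odu.edu/~fmccown/lazy/page1.html": "lazyexp0001xq",
}

def is_indexed(search_url, identifier):
    # Ask a search engine for the identifier; any hit means the page is
    # indexed. A real probe would go through each engine's API (Google,
    # MSN, Yahoo) and respect its daily query quota.
    url = search_url + urllib.parse.quote(identifier)
    with urllib.request.urlopen(url) as resp:
        return identifier.encode() in resp.read()

for page, ident in PAGE_IDS.items():
    hit = is_indexed("http://search.example.com/?q=", ident)
    print(datetime.date.today(), page, "indexed" if hit else "not indexed")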
16. Crawling the Web and Web Repositories
17. Warrick
- First developed in fall of 2005
- Available for download at http://www.cs.odu.edu/~fmccown/warrick/
- www2006.org: first lost website reconstructed (Nov 2005)
- DCkickball.org: first website someone else reconstructed without our help (late Jan 2006)
- www.iclnet.org: first website we reconstructed for someone else (mid Mar 2006)
- Internet Archive officially endorses Warrick (mid Mar 2006)
18. How Much Did We Reconstruct?
[Slide diagram: the lost website's resources A-F beside the reconstructed website's resources A, B, C, E, G]
- Four categories of recovered resources:
- 1) Identical: A, E
- 2) Changed: B, C
- 3) Missing: D, F
- 4) Added: G
- Missing link to D; points to old resource G
- F can't be found
19. Reconstruction Diagram
[Slide diagram: identical 50%, changed 33%, missing 17%, added 20%]
20. Reconstruction Experiment
- Crawl and reconstruct 24 websites of various sizes:
- 1. small (1-150 resources)
- 2. medium (151-499 resources)
- 3. large (500+ resources)
- Perform 5 reconstructions for each website:
- One using all four repositories together
- Four using each repository separately
- Calculate a reconstruction vector (changed, missing, added) for each reconstruction; a sketch of the calculation follows
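One way to compute the reconstruction vector from the lost and recovered resource sets; a sketch assuming changed and missing are measured against the lost site and added against the reconstructed site, with a plain equality test standing in for whatever content comparison is actually used (Python):

def reconstruction_vector(lost, recovered, same_content):
    # lost and recovered map URIs to content. Denominators here are an
    # assumption: changed/missing relative to the lost site, added
    # relative to the reconstructed site.
    changed = sum(1 for u in lost
                  if u in recovered and not same_content(lost[u], recovered[u]))
    missing = sum(1 for u in lost if u not in recovered)
    added = sum(1 for u in recovered if u not in lost)
    return (changed / len(lost), missing / len(lost), added / len(recovered))

# Toy run with slide 18's example: A, E identical; B, C changed;
# D, F missing; G added.
lost = {"A": "a", "B": "b", "C": "c", "D": "d", "E": "e", "F": "f"}
recovered = {"A": "a", "B": "b2", "C": "c2", "E": "e", "G": "g"}
print(reconstruction_vector(lost, recovered, lambda x, y: x == y))
# (0.33, 0.33, 0.20) for this toy example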
21. Frank McCown, Joan A. Smith, Michael L. Nelson, and Johan Bollen. "Reconstructing Websites for the Lazy Webmaster." Technical Report, arXiv cs.IR/0512069, 2005.
22. Recovery Success by MIME Type
23. Repository Contributions
24. Current & Future Work
- Building a web interface for Warrick
- Currently crawling and reconstructing 300 randomly sampled websites each week
- Move from a descriptive model to a prescriptive and predictive model
- Injecting server-side functionality into the WI (Web Infrastructure)
- Recover the PHP code, not just the HTML
25. Time Queries
26. Traditional Web Crawler
27. Web-Repository Crawler
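The contrast between slides 26 and 27 can be sketched in a few lines: both crawlers run the same frontier loop, but a traditional crawler dereferences each URI on the live Web, while a web-repository crawler asks repositories (Google, MSN, Yahoo, Internet Archive) for their stored copies of a site that no longer exists. The lookup interface below is a placeholder, not any engine's real API (Python):

from collections import deque

def crawl(seed, fetch, extract_links):
    # Generic frontier-based crawl shared by both crawler types; only
    # the fetch function differs.
    frontier, seen, results = deque([seed]), {seed}, {}
    while frontier:
        uri = frontier.popleft()
        content = fetch(uri)
        if content is None:
            continue  # gone from the live Web / held by no repository
        results[uri] = content
        for link in extract_links(uri, content):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return results

def repo_fetch(uri, repositories):
    # Web-repository fetch: try each repository's lookup function in
    # preference order (e.g. a repository holding the canonical format
    # before one that stores only an HTML conversion).
    for lookup in repositories:
        copy = lookup(uri)
        if copy is not None:
            return copy
    return None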
28. Limitations
- Web crawling:
- Limit hit rate per host
- Websites periodically unavailable
- Portions of websites off-limits (robots.txt, passwords)
- Deep web
- Spam
- Duplicate content
- Flash and JavaScript interfaces
- Crawler traps
- Web-repository crawling:
- Limit hit rate per repo
- Limited hits per day (API query quotas); see the limiter sketch after this list
- Repos periodically unavailable
- Flash and JavaScript interfaces
- Can only recover what repos have stored
- Lossy format conversions (thumbnail images, HTMLized PDFs, etc.)
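Given the per-repository hit-rate limits and daily API quotas listed above, a web-repository crawler has to meter its own requests; a minimal limiter sketch, with made-up quota and delay values (Python):

import time

class RepoLimiter:
    # Caps requests per day to one repository and enforces a minimum
    # delay between hits. The quota and delay values are illustrative.
    def __init__(self, daily_quota=1000, min_delay=1.0):
        self.daily_quota = daily_quota
        self.min_delay = min_delay
        self.used = 0
        self.last_hit = 0.0

    def acquire(self):
        if self.used >= self.daily_quota:
            return False  # quota exhausted; defer remaining URIs to tomorrow
        wait = self.min_delay - (time.monotonic() - self.last_hit)
        if wait > 0:
            time.sleep(wait)
        self.used += 1
        self.last_hit = time.monotonic()
        return True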