Web Harvesting and Mirroring Tools

1
Web Harvesting and Mirroring Tools

A Comparison

Ignacio Garcia del Campo
Adviser: Dr. Nelson
2
Introduction

Importance of Digital Information
We need to retrieve and preserve this information
The most likely fate of web pages is that they will disappear
What can be done?
How can we do it?

3
Outline

Scenario
Case 1 Listing web site contents
Tools used
Results
Case 2 Crawling / Mirroring all listed contents
Tools used
Results
Conclusion
Questions

4
Scenario

Website background
Copied portion of the CS webpage
User folders skipped
Only 285 MB copied
Structure maintained
Due to sectioning, some links are broken (404)

5
Scenario (cont.)

Typical web page file layout

6
Scenario (cont.)

Problems due to using subset of pages

Only 9.61% of the files remained accessible
7
Scenario (cont.)

Solution: everything is linked from one seed page

seed
Without seeding, only 9.61% of the files are accessible
8
Case 1 Listing all

We need everything
Are all the files in a website linked?
Most likely they are not
Even if they are, how do we know it?
Must have server side assistance
Direct access
find
mdfind
Installed tools to list contents
Google's Sitemap
mod_oai

9
Case 1 (cont.)
find: Unix-based tool. It searches through a directory tree of a filesystem.
Command used: find webserver/htdocs/ \! -type d -print
mdfind: Mac OS X utility. It consults the central metadata store to list files.
Command used: mdfind . -onlyin webserver/htdocs/
sitemap.py: Google's Python script. This script can create Sitemaps from URL lists, web server directories, or from access logs.
Command used: python sitemap.py --config=config.xml
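The find listing above can be reproduced in a script. The following is a minimal sketch, not the presenter's tooling: the function name list_files is hypothetical, and the path passed to it is assumed to be the web server document root.

```python
import os


def list_files(root):
    """Return every non-directory path under root, roughly like
    `find root ! -type d -print` (name and layout are assumptions)."""
    paths = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Regular files and symlinks to files show up in filenames.
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
        # Symlinks to directories are not of type d for find, so list them too.
        for name in dirnames:
            full = os.path.join(dirpath, name)
            if os.path.islink(full):
                paths.append(full)
    return sorted(paths)
```

Calling list_files("webserver/htdocs/") would return a sorted list that can be written to a file and compared against the output of the other tools.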
10
Case 1 (cont.)
find vs mdfind

218 files missing from the mdfind listing
mdfind does not include system files
mdfind does not include files without an extension
mdfind provides extensive selection mechanisms using metadata
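A comparison like this one comes down to a set difference between two listings. The sketch below shows the idea under stated assumptions: compare_listings is a hypothetical helper and the sample paths are invented for illustration, not taken from the actual site.

```python
def compare_listings(a, b):
    """Return (only_in_a, only_in_b) for two lists of paths."""
    a_set, b_set = set(a), set(b)
    return sorted(a_set - b_set), sorted(b_set - a_set)


# Hypothetical sample data: a dot file and a file without an extension
# appear in the find listing but not in the mdfind listing.
find_listing = ["htdocs/index.html", "htdocs/README", "htdocs/.htaccess"]
mdfind_listing = ["htdocs/index.html"]

only_find, only_mdfind = compare_listings(find_listing, mdfind_listing)
print(len(only_find), "files missing from the mdfind listing")
```

The same helper applies to any pair of listings, provided both use the same path prefix and naming convention.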

11
Case 1 (cont.)
find vs sitemap.py

The only file difference is the sitemap.xml file that is created.
The sitemap script ignores it every time.
find includes it as part of the web server contents
sitemap.py can look at the logs and find more usable URLs (dynamic files)
It included directories as web server contents (275 folders)
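To make the sitemap output concrete, here is a rough sketch of building a Sitemap document from a URL list while leaving out the sitemap.xml file itself. This is not Google's script: build_sitemap is a hypothetical name, the example.org URLs are placeholders, and the sitemaps.org 0.9 namespace is an assumption about the schema version.

```python
import xml.etree.ElementTree as ET

# Assumed namespace (sitemaps.org protocol 0.9); the original script may
# have used a different schema version.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a Sitemap XML string listing urls, skipping sitemap.xml."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        if url.endswith("/sitemap.xml"):
            continue  # the generated file is not listed as site content
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


xml_text = build_sitemap([
    "http://example.org/a.html",
    "http://example.org/sitemap.xml",
])
```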

12
Case 2 Mirror entire site

We want to copy everything
Are all the files accessible using HTTP?
No, files in the cgi-bin folder are protected
What are we getting?
Actual file (same content as stored in server)
Parsed file (response generated by server app.)

13
Case 2 (cont.)

Tools used to crawl / mirror the site
heritrix 1.10.0
Wget 1.9.1
mod_oai 0.6 over Apache 2.0

Heritrix is the Internet Archive's open-source,
extensible, web-scale, archival-quality web
crawler project.
GNU Wget is a free software package for
retrieving files using HTTP, HTTPS and FTP, the
most widely-used Internet protocols.
Based on OAI-PMH (Open Archives Initiative
Protocol for Metadata Harvesting), mod_oai is an
Apache 2.0 module able to respond to OAI-PMH
requests.
14
Case 2 Heritrix 1.10.0

Heritrix is the Internet Archive's open-source,
extensible, web-scale, archival-quality web
crawler project.
It is designed to respect exclusion directives,
and collect material at a measured, adaptive pace
unlikely to disrupt normal website activity.
Heritrix allows users to change multiple settings
to fit their needs.
Request delay removed to improve speed
404 errors ignored
Site mirrored instead of archived (ARC files)

15
Case 2 Heritrix 1.10.0
16
Case 2 Heritrix 1.10.0

Problems with heritrix
Crawls outside of specified scope
Ignores empty files
Creates extra files for directory listing
Does not have a re-crawl utility that will get
modified files

17
Case 2 GNU Wget 1.9.1

GNU Wget is a free software package for
retrieving files using HTTP, HTTPS and FTP, the
most widely-used Internet protocols.
It is a non-interactive command line tool, so it
may easily be called from scripts, cron jobs,
terminals without X11 support, etc.
Powerful tool to mirror complete websites.
It has the option to download only modified / new
files
Depends on the Last-Modified header (usually missing)

Command used: wget --mirror -p -w 0 http://darwin.seven.research.odu.edu
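The dependence on the Last-Modified header can be illustrated with a small decision function. This is a simplified sketch of the idea, not Wget's actual logic (Wget also takes file size into account): needs_download is a hypothetical name, and local_mtime is assumed to be a POSIX timestamp, or None when no local copy exists.

```python
from email.utils import parsedate_to_datetime


def needs_download(last_modified_header, local_mtime):
    """Decide whether a file should be fetched again.

    Without a local copy or without a Last-Modified header there is
    nothing to compare, so the file is downloaded again.
    """
    if local_mtime is None or not last_modified_header:
        return True
    remote_mtime = parsedate_to_datetime(last_modified_header).timestamp()
    return remote_mtime > local_mtime
```

When the server omits the header, every run falls into the first branch, which is why incremental mirroring loses its benefit on such servers.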
18
Case 2 GNU Wget 1.9.1
17:44:38 (3.82 MB/s) - `darwin.seven.research.odu.edu/1992/92-11.abs.Z' saved [401/401]

--17:44:38--  http://darwin.seven.research.odu.edu/1992/92-12.abs.Z
           => `darwin.seven.research.odu.edu/1992/92-12.abs.Z'
Reusing connection to darwin.seven.research.odu.edu:80.
HTTP request sent, awaiting response... 200 OK
Length: 464 [application/x-compress]

100%[==========>] 464           --.--K/s

17:44:38 (4.43 MB/s) - `darwin.seven.research.odu.edu/1992/92-12.abs.Z' saved [464/464]

--17:44:38--  http://darwin.seven.research.odu.edu/1992/92-12.ps.Z
           => `darwin.seven.research.odu.edu/1992/92-12.ps.Z'
Reusing connection to darwin.seven.research.odu.edu:80.
HTTP request sent, awaiting response... 200 OK
Length: 123,705 [application/x-compress]

100%[==========>] 123,705       --.--K/s

17:44:38 (7.87 MB/s) - `darwin.seven.research.odu.edu/1992/92-12.ps.Z' saved [123705/123705]

--17:44:38--  http://darwin.seven.research.odu.edu/1992/92-13.abs.Z
           => `darwin.seven.research.odu.edu/1992/92-13.abs.Z'
Reusing connection to darwin.seven.research.odu.edu:80.
HTTP request sent, awaiting response... 200 OK
Length: 428 [application/x-compress]

100%[==========>] 428           --.--K/s
19
Case 2 mod_oai 0.6

Installed as an Apache 2.0 module
Uses the OAI-PMH protocol to provide information about server contents
Used for metadata information, but can provide file contents as well
Responses to requests are given in XML format
Works with the file system
Access to datestamp information
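An OAI-PMH request is an ordinary HTTP request whose query string carries a verb and its arguments. The sketch below builds such a URL; the base URL is a made-up placeholder, oai_request is a hypothetical helper, and oai_dc is used only because it is the metadata format every OAI-PMH repository must support, not because the slides say mod_oai was queried this way.

```python
from urllib.parse import urlencode


def oai_request(base_url, verb, args=None):
    """Build an OAI-PMH request URL from a verb and optional arguments."""
    params = {"verb": verb}
    params.update(args or {})
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint; `from` restricts the listing by datestamp.
url = oai_request(
    "http://www.example.org/modoai",
    "ListIdentifiers",
    {"metadataPrefix": "oai_dc", "from": "2006-01-01"},
)
print(url)
```

The from argument is what makes the datestamp access useful: a harvester can ask only for resources changed since its last visit.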

20
Case 2 mod_oai 0.6

Current issues
Cannot use files with handlers for security reasons
Serves the actual file, which would include confidential information
Uses base64 encoding to serve file contents (33% bigger)
Ignores cgi-bin folder protection
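The size penalty of base64 follows from the encoding itself: every 3 input bytes become 4 output characters. A quick check, using a hypothetical helper name and zero-filled placeholder data of the same size as one of the mirrored files:

```python
import base64


def base64_overhead(n_bytes):
    """Return the relative size increase of base64-encoding n_bytes bytes."""
    raw = bytes(n_bytes)  # zero-filled placeholder content
    encoded = base64.b64encode(raw)
    return len(encoded) / len(raw) - 1.0


# 123,705 bytes is the size of one file in the Wget log.
print(round(base64_overhead(123705) * 100, 1), "percent larger")
```

For very small files the padding pushes the overhead slightly above one third, and the surrounding XML adds more on top.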

21
Case 2 Results
22
Case 2 Results

4615 matches (5)
heritrix vs mod_oai (41 differences)
18 dynamic files skipped by mod_oai
19 index.html added by heritrix on directory
listings
5 naming differences
mod_oai vs heritrix (188 differences)
178 files contained in cgi-bin were included by
mod_oai
5 naming differences
5 files skipped by heritrix
2 empty files
2 xxx files
1 containing uncommon characters (93_0?1.abs.Z)

23
Case 2 Results

4639 matches (17)
heritrix vs Wget (17 differences)
All due to naming issues in heritrix
Wget vs heritrix (35 differences)
17 due to naming issues
18 extra files on Wget
2 empty files that were ignored by heritrix
16 extra files created by Wget (same directory
crawled twice with different capitalization)
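Duplicates of this kind can be spotted by grouping mirrored paths that differ only in letter case. A minimal sketch, with a hypothetical function name and invented sample paths:

```python
from collections import defaultdict


def case_collisions(paths):
    """Group paths that are identical except for capitalization."""
    groups = defaultdict(list)
    for path in paths:
        groups[path.lower()].append(path)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Hypothetical mirror listing with one directory fetched under two spellings.
mirrored = ["site/Docs/a.html", "site/docs/a.html", "site/index.html"]
duplicates = case_collisions(mirrored)
```

Each returned group holds the spellings of one resource, so the extra copies can be removed before comparing against another tool's listing.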

24
Conclusion