Title: Web Harvesting and Mirroring Tools
1. Web Harvesting and Mirroring Tools
Ignacio Garcia del Campo
Adviser: Dr. Nelson
2. Introduction
- Importance of digital information
- We need to retrieve and preserve this information
- The most likely outcome for web pages is that they will disappear
- What can be done?
- How can we do it?
3. Outline
- Scenario
- Case 1: Listing web site contents
  - Tools used
  - Results
- Case 2: Crawling / mirroring all listed contents
  - Tools used
  - Results
- Conclusion
- Questions
4. Scenario
- Website background
  - Copied a portion of the CS website
  - User folders skipped
  - Only 285 MB copied
  - Structure maintained
  - Due to sectioning, some links are broken (404)
5. Scenario (cont.)
- Typical web page file layout
6. Scenario (cont.)
- Problems due to using a subset of pages
  - Only 9.61% of the files remained accessible
7. Scenario (cont.)
- Solution: everything is linked from one seed page
- Without seeding, only 9.61% of the files are accessible
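A seed page of this kind could be generated from a file list. The following is only a minimal sketch: the function name `build_seed_page` and the example file list are hypothetical, and the slides do not say how the actual seed page was produced.

```python
import html
from urllib.parse import quote


def build_seed_page(paths, base_url):
    """Return an HTML page with one link per path (paths are relative to the web root)."""
    base = base_url.rstrip("/")
    items = []
    for path in sorted(paths):
        # Percent-encode unusual characters; quote() keeps "/" separators by default.
        url = base + "/" + quote(path.lstrip("/"))
        items.append('<li><a href="{0}">{1}</a></li>'.format(
            html.escape(url, quote=True), html.escape(path)))
    return "<html><body><ul>\n" + "\n".join(items) + "\n</ul></body></html>\n"


# Hypothetical file list; a real one would come from a server-side listing.
example_paths = ["index.html", "1992/92-11.abs.Z"]
print(build_seed_page(example_paths, "http://darwin.seven.research.odu.edu"))
```

A crawler started on such a page can reach every listed file in one hop, whether or not the site's own pages link to it.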
8. Case 1: Listing all
- We need everything
- Are all the files in a website linked?
  - Most likely they are not
  - Even if they are, how do we know it?
- Must have server-side assistance
  - Direct access
    - find
    - mdfind
  - Installed tools to list contents
    - Google's sitemap
    - mod_oai
9. Case 1 (cont.)
- find: Unix-based tool. It searches through a directory tree of a filesystem.
  Command used: find webserver/htdocs/ \! -type d -print
- mdfind: Mac OS X utility. It consults the central metadata store to list files.
  Command used: mdfind . -onlyin webserver/htdocs/
- sitemap.py: Google's Python script. It can create Sitemaps from URL lists, web server directories, or access logs.
  Command used: python sitemap.py --config=config.xml
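For comparison, the find listing (everything that is not a directory) can be approximated in Python. This is a sketch and not one of the tools used in the study; `list_files` is a hypothetical helper name.

```python
import os


def list_files(root):
    """Return sorted paths, relative to root, of every entry that is not a directory."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            found.append(os.path.relpath(os.path.join(dirpath, name), root))
        for name in dirnames:
            full = os.path.join(dirpath, name)
            # A symlink to a directory is not of type "d" for find, so list it too.
            if os.path.islink(full):
                found.append(os.path.relpath(full, root))
    return sorted(found)


# Same document root as in the commands above; prints nothing if it does not exist.
for path in list_files("webserver/htdocs"):
    print(path)
```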
10. Case 1 (cont.)
find vs mdfind
- 218 files missing from the mdfind listing
  - mdfind does not include system files
  - mdfind does not include files without an extension
- mdfind provides extensive selection mechanisms using metadata
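Differences such as the one between the find and mdfind listings can be computed with set operations. The sketch below uses hypothetical helper names and a made-up example listing; it does not reproduce the 218-file result.

```python
import os


def compare_listings(listing_a, listing_b):
    """Return (only_in_a, only_in_b) as sorted lists of paths."""
    a, b = set(listing_a), set(listing_b)
    return sorted(a - b), sorted(b - a)


def without_extension(paths):
    """Return the paths whose file name has no extension (dotfiles count as none)."""
    return sorted(p for p in paths
                  if not os.path.splitext(os.path.basename(p))[1])


# Made-up listings for illustration only.
find_listing = ["index.html", "docs/README", ".htaccess"]
mdfind_listing = ["index.html"]
missing, extra = compare_listings(find_listing, mdfind_listing)
print(missing, extra)
print(without_extension(missing))
```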
11. Case 1 (cont.)
find vs sitemap.py
- The only file difference is the sitemap.xml file that is created
  - The sitemap script ignores it every time
  - find includes it as part of the web server contents
- sitemap can look at the logs and find more usable URLs (dynamic files)
- It included directories as web server contents (275 folders)
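To make the output format concrete, here is a minimal sketch that writes a Sitemap from a URL list. It is not Google's script; `build_sitemap` is a hypothetical name, the example URLs are made up, and the namespace is the one from the sitemaps.org protocol, which may differ from what the script version used here emitted.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol (an assumption for this sketch).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a Sitemap XML string with one <url><loc> entry per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


print(build_sitemap([
    "http://darwin.seven.research.odu.edu/",
    "http://darwin.seven.research.odu.edu/index.html",
]))
```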
12. Case 2: Mirror entire site
- We want to copy everything
- Are all the files accessible using HTTP?
  - No, files in the cgi-bin folder are protected
- What are we getting?
  - Actual file (same content as stored on the server)
  - Parsed file (response generated by a server application)
13. Case 2 (cont.)
- Tools used to crawl / mirror the site
  - Heritrix 1.10.0: Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
  - Wget 1.9.1: GNU Wget is a free software package for retrieving files using HTTP, HTTPS and FTP, the most widely-used Internet protocols.
  - mod_oai 0.6 over Apache 2.0: Based on OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting), mod_oai is an Apache 2.0 module able to respond to OAI-PMH requests.
14. Case 2: Heritrix 1.10.0
- Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
- It is designed to respect exclusion directives, and collect material at a measured, adaptive pace unlikely to disrupt normal website activity.
- Heritrix allows users to change multiple settings to fit their needs.
  - Request delay removed to improve speed
  - 404 errors ignored
  - Site mirrored instead of archived (ARC files)
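Exclusion directives are usually published in a robots.txt file. The sketch below shows how a crawler might consult one, using Python's standard `urllib.robotparser`. It is not Heritrix code; the robots.txt content, the `allowed` helper, and the example.org URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
"""


def allowed(robots_txt, user_agent, url):
    """Return True if the robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed(ROBOTS_TXT, "heritrix", "http://example.org/index.html"))
print(allowed(ROBOTS_TXT, "heritrix", "http://example.org/cgi-bin/search.cgi"))
```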
16. Case 2: Heritrix 1.10.0
- Problems with Heritrix
  - Crawls outside of the specified scope
  - Ignores empty files
  - Creates extra files for directory listings
  - Does not have a re-crawl utility that will get modified files
17. Case 2: GNU Wget 1.9.1
- GNU Wget is a free software package for retrieving files using HTTP, HTTPS and FTP, the most widely-used Internet protocols.
- It is a non-interactive command line tool, so it may easily be called from scripts, cron jobs, terminals without X11 support, etc.
- Powerful tool to mirror complete websites.
- It has the option to download only modified / new files
  - Depends on the Last-Modified header (usually missing)
Command used: wget --mirror -p -w 0 http://darwin.seven.research.odu.edu
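The dependence on the Last-Modified header can be illustrated with a simplified decision function. This is a sketch of the general idea, not Wget's actual logic (which also looks at file sizes); `needs_download` and the example header value are hypothetical.

```python
from email.utils import parsedate_to_datetime


def needs_download(last_modified_header, local_mtime):
    """Decide whether a file should be fetched again.

    last_modified_header: value of the HTTP Last-Modified header, or None if absent.
    local_mtime: modification time of the local copy, in seconds since the epoch.
    """
    if last_modified_header is None:
        # Without the header there is nothing to compare against, so fetch again.
        return True
    remote_mtime = parsedate_to_datetime(last_modified_header).timestamp()
    return remote_mtime > local_mtime


# Made-up header value for illustration.
header = "Wed, 21 Oct 2015 07:28:00 GMT"
print(needs_download(header, 0.0))
print(needs_download(None, 0.0))
```

When the header is missing, every file is treated as modified, so a re-crawl downloads everything again.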
18. Case 2: GNU Wget 1.9.1
17:44:38 (3.82 MB/s) - `darwin.seven.research.odu.edu/1992/92-11.abs.Z' saved [401/401]

--17:44:38--  http://darwin.seven.research.odu.edu/1992/92-12.abs.Z
           => `darwin.seven.research.odu.edu/1992/92-12.abs.Z'
Reusing connection to darwin.seven.research.odu.edu:80.
HTTP request sent, awaiting response... 200 OK
Length: 464 [application/x-compress]

100%[==========>] 464  --.--K/s

17:44:38 (4.43 MB/s) - `darwin.seven.research.odu.edu/1992/92-12.abs.Z' saved [464/464]

--17:44:38--  http://darwin.seven.research.odu.edu/1992/92-12.ps.Z
           => `darwin.seven.research.odu.edu/1992/92-12.ps.Z'
Reusing connection to darwin.seven.research.odu.edu:80.
HTTP request sent, awaiting response... 200 OK
Length: 123,705 [application/x-compress]

100%[==========>] 123,705  --.--K/s

17:44:38 (7.87 MB/s) - `darwin.seven.research.odu.edu/1992/92-12.ps.Z' saved [123705/123705]

--17:44:38--  http://darwin.seven.research.odu.edu/1992/92-13.abs.Z
           => `darwin.seven.research.odu.edu/1992/92-13.abs.Z'
Reusing connection to darwin.seven.research.odu.edu:80.
HTTP request sent, awaiting response... 200 OK
Length: 428 [application/x-compress]

100%[==========>] 428  --.--K/s
19. Case 2: mod_oai 0.6
- Installed as an Apache 2.0 module
- Uses the OAI-PMH protocol to provide information about server contents
- Used for metadata information, but can provide file contents as well
- Responses to requests are given in XML format
- Works with the file system
  - Access to datestamp information
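An OAI-PMH request is an ordinary HTTP request whose query string carries a verb and its arguments. The sketch below builds such URLs. The six verbs and the `oai_dc` metadata prefix come from the OAI-PMH specification; the base URL path `/modoai` and the helper name `oai_request` are assumptions made for illustration.

```python
from urllib.parse import urlencode

# The six verbs defined by OAI-PMH 2.0.
OAI_VERBS = {"Identify", "ListMetadataFormats", "ListSets",
             "ListIdentifiers", "ListRecords", "GetRecord"}


def oai_request(base_url, verb, arguments=None):
    """Build an OAI-PMH request URL for the given verb and argument dict."""
    if verb not in OAI_VERBS:
        raise ValueError("unknown OAI-PMH verb: " + verb)
    params = [("verb", verb)] + sorted((arguments or {}).items())
    return base_url + "?" + urlencode(params)


# Hypothetical base URL; the datestamp asks only for items changed since that day.
base = "http://darwin.seven.research.odu.edu/modoai"
print(oai_request(base, "ListIdentifiers",
                  {"metadataPrefix": "oai_dc", "from": "2006-01-01"}))
```

The `from` argument is what lets a harvester ask only for files modified since its last visit, which relies on the datestamp information mentioned above.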
20. Case 2: mod_oai 0.6
- Current issues
  - Cannot use files with handlers for security reasons
  - Serves the actual file, which would include confidential information
  - Uses base64 encoding to serve file contents (33% bigger)
  - Ignores cgi-bin folder protection
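The 33% figure follows from how base64 works: every 3 input bytes become 4 output characters, so the encoded form is about 4/3 the original size. A small sketch to check this (`base64_overhead` is a hypothetical helper name):

```python
import base64


def base64_overhead(data):
    """Return the relative size increase of base64-encoding data (0.33 means 33%)."""
    encoded = base64.b64encode(data)
    return len(encoded) / len(data) - 1.0


payload = b"\x00" * 3000  # arbitrary bytes; only the length matters here
print(len(base64.b64encode(payload)))
print(round(base64_overhead(payload) * 100, 1))
```

For very small files the padding makes the overhead slightly larger than one third.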
22. Case 2: Results
- 4615 matches (5)
- Heritrix vs mod_oai (41 differences)
  - 18 dynamic files skipped by mod_oai
  - 19 index.html added by Heritrix on directory listings
  - 5 naming differences
- mod_oai vs Heritrix (188 differences)
  - 178 files contained in cgi-bin were included by mod_oai
  - 5 naming differences
  - 5 files skipped by Heritrix
    - 2 empty files
    - 2 xxx files
    - 1 containing uncommon characters (93_0?1.abs.Z)
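One plausible reason a name such as 93_0?1.abs.Z causes trouble is that "?" starts the query string in a URL, so the name must be percent-encoded to be requested as a file. The slides do not state the actual cause, so this is only an illustrative sketch with a hypothetical host and helper name.

```python
from urllib.parse import quote, urlsplit


def file_url(base_url, file_name):
    """Return the URL for file_name under base_url, percent-encoding special characters."""
    return base_url.rstrip("/") + "/" + quote(file_name)


name = "93_0?1.abs.Z"

# Unencoded: everything after "?" is parsed as a query, not as part of the path.
raw = urlsplit("http://example.org/" + name)
print(raw.path, raw.query)

# Encoded: "?" becomes %3F and the whole name stays in the path.
print(file_url("http://example.org/", name))
```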
23. Case 2: Results
- 4639 matches (17)
- Heritrix vs Wget (17 differences)
  - All due to naming issues in Heritrix
- Wget vs Heritrix (35 differences)
  - 17 due to naming issues
  - 18 extra files on Wget
    - 2 empty files that were ignored by Heritrix
    - 16 extra files created by Wget (same directory crawled twice with different capitalization)
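Duplicates caused by capitalization can be spotted by grouping paths on their lowercased form. A minimal sketch, with a hypothetical helper name and made-up paths:

```python
from collections import defaultdict


def case_collisions(paths):
    """Return groups of paths that differ only in capitalization."""
    groups = defaultdict(set)
    for path in paths:
        groups[path.lower()].add(path)
    return {key: sorted(group) for key, group in groups.items() if len(group) > 1}


# Made-up listing: the same directory reached under two spellings.
mirrored = ["Docs/a.html", "docs/a.html", "index.html"]
print(case_collisions(mirrored))
```

On a case-insensitive file system both spellings point to the same file on the server, while a crawler that treats URLs as case-sensitive stores two copies.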
24. Conclusion
- It is almost impossible to mirror 100% of a website
- Different crawlers produce different results
- Selection depends on server contents
  - Dynamic vs static content
  - Character encoding
  - Operating system can make a difference
- mod_oai could end up being the do-it-all application
  - Dynamic files inclusion
  - Admin access to get protected files
25. Questions?