Web Harvesting and Mirroring Tools - PowerPoint PPT Presentation

1
Web Harvesting and Mirroring Tools
  • A Comparison

Ignacio Garcia del Campo
Adviser: Dr. Nelson
2
Introduction
  • Importance of Digital Information
  • We need to retrieve and preserve this
    information
  • The most likely outcome for web pages is that
    they will disappear
  • What can be done?
  • How can we do it?

3
Outline
  • Scenario
  • Case 1: Listing web site contents
  • Tools used
  • Results
  • Case 2: Crawling / mirroring all listed contents
  • Tools used
  • Results
  • Conclusion
  • Questions

4
Scenario
  • Website background
  • Copied a portion of the CS web site
  • User folders skipped
  • Only 285 MB copied
  • Structure maintained
  • Due to sectioning, some links are broken (404)

5
Scenario (cont.)
  • Typical web page file layout

6
Scenario (cont.)
  • Problems due to using subset of pages

Only 9.61% of the files remained accessible
7
Scenario (cont.)
  • Solution: everything is linked from one seed page

seed
Without seeding, only 9.61% of the files are
accessible
8
Case 1: Listing all
  • We need everything
  • Are all the files in a website linked?
  • Most likely they are not
  • Even if they are, how do we know it?
  • Must have server side assistance
  • Direct access
  • find
  • mdfind
  • Installed tools to list contents
  • Google's Sitemap
  • mod_oai

9
Case 1 (cont.)
find: Unix-based tool. It searches through a
directory tree of a filesystem.
  command used: find webserver/htdocs/ \! -type d -print
mdfind: Mac OS X utility. It consults the central
metadata store to list files.
  command used: mdfind . -onlyin webserver/htdocs/
sitemap.py: Google's Python script. It can create
Sitemaps from URL lists, web server directories,
or from access logs.
  command used: python sitemap.py --config=config.xml
10
Case 1 (cont.)
find vs mdfind
  • 218 files missing from the mdfind listing
  • mdfind does not include system files
  • mdfind does not include files without extensions
  • mdfind provides extensive selection mechanisms
    using metadata
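The 218-file gap above can be found mechanically by diffing the two sorted listings. A minimal sketch, assuming each tool's output has been saved to a file first (the sample file names here are invented stand-ins):

```shell
# Stand-ins for the two tools' output; in practice these would be
# `find ... > find.txt` and `mdfind ... > mdfind.txt`.
printf 'index.html\nstyle.css\n.htaccess\nREADME\n' | sort > find.txt
printf 'index.html\nstyle.css\n' | sort > mdfind.txt

# Lines unique to the first file: entries find saw but mdfind missed.
comm -23 find.txt mdfind.txt
```

Because both listings go through the same `sort`, `comm` can walk them in one pass; `-23` suppresses lines unique to the second file and lines common to both.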

11
Case 1 (cont.)
find vs sitemap.py
  • The only file difference is the sitemap.xml
    file created.
  • The sitemap script ignores it every time.
  • find includes it as part of the web server
    contents
  • sitemap.py can look at the logs and find more
    usable URLs (dynamic files)
  • sitemap.py included directories as web server
    contents (275 folders)
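For reference, the Sitemap the script emits is just an XML list of URL entries, each with an optional datestamp. An illustrative fragment (the lastmod value is invented):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://darwin.seven.research.odu.edu/1992/92-11.abs.Z</loc>
    <lastmod>2005-10-01</lastmod>
  </url>
</urlset>
```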

12
Case 2: Mirror entire site
  • We want to copy everything
  • Are all the files accessible using HTTP?
  • No, files in the cgi-bin folder are protected
  • What are we getting?
  • Actual file (same content as stored in server)
  • Parsed file (response generated by server app.)

13
Case 2 (cont.)
  • Tools used to crawl / mirror the site
  • heritrix 1.10.0
  • Wget 1.9.1
  • modoai 0.6 over Apache 2.0

Heritrix is the Internet Archive's open-source,
extensible, web-scale, archival-quality web
crawler project.
GNU Wget is a free software package for
retrieving files using HTTP, HTTPS and FTP, the
most widely-used Internet protocols.
Based on OAI-PMH (Open Archives Initiative
Protocol for Metadata Harvesting), modoai is an
Apache 2.0 module able to respond to OAI-PMH
requests.
14
Case 2: Heritrix 1.10.0
  • Heritrix is the Internet Archive's open-source,
    extensible, web-scale, archival-quality web
    crawler project.
  • It is designed to respect exclusion directives,
    and collect material at a measured, adaptive pace
    unlikely to disrupt normal website activity.
  • Heritrix allows users to change multiple settings
    to fit their needs.
  • Request delay removed to improve speed
  • 404 errors ignored
  • Site mirrored instead of archived (ARC files)

15
Case 2: Heritrix 1.10.0
16
Case 2: Heritrix 1.10.0
  • Problems with Heritrix
  • Crawls outside of the specified scope
  • Ignores empty files
  • Creates extra files for directory listing
  • Does not have a re-crawl utility that will get
    modified files

17
Case 2: GNU Wget 1.9.1
  • GNU Wget is a free software package for
    retrieving files using HTTP, HTTPS and FTP, the
    most widely-used Internet protocols.
  • It is a non-interactive command line tool, so it
    may easily be called from scripts, cron jobs,
    terminals without X11 support, etc.
  • Powerful tool to mirror complete websites.
  • It has the option to download only modified / new
    files
  • Depends on the Last-Modified header (usually
    missing)

command used: wget --mirror -p -w 0
http://darwin.seven.research.odu.edu
18
Case 2: GNU Wget 1.9.1 (sample output)

17:44:38 (3.82 MB/s) - `darwin.seven.research.odu.edu/1992/92-11.abs.Z' saved [401/401]

--17:44:38--  http://darwin.seven.research.odu.edu/1992/92-12.abs.Z
           => `darwin.seven.research.odu.edu/1992/92-12.abs.Z'
Reusing connection to darwin.seven.research.odu.edu:80.
HTTP request sent, awaiting response... 200 OK
Length: 464 [application/x-compress]

100%[====================>] 464            --.--K/s

17:44:38 (4.43 MB/s) - `darwin.seven.research.odu.edu/1992/92-12.abs.Z' saved [464/464]

--17:44:38--  http://darwin.seven.research.odu.edu/1992/92-12.ps.Z
           => `darwin.seven.research.odu.edu/1992/92-12.ps.Z'
Reusing connection to darwin.seven.research.odu.edu:80.
HTTP request sent, awaiting response... 200 OK
Length: 123,705 [application/x-compress]

100%[====================>] 123,705        --.--K/s

17:44:38 (7.87 MB/s) - `darwin.seven.research.odu.edu/1992/92-12.ps.Z' saved [123705/123705]

--17:44:38--  http://darwin.seven.research.odu.edu/1992/92-13.abs.Z
           => `darwin.seven.research.odu.edu/1992/92-13.abs.Z'
Reusing connection to darwin.seven.research.odu.edu:80.
HTTP request sent, awaiting response... 200 OK
Length: 428 [application/x-compress]

100%[====================>] 428            --.--K/s
19
Case 2: modoai 0.6
  • Installed as an Apache 2.0 module
  • Uses the OAI-PMH protocol to provide information
    about server contents
  • Used for metadata information, but can provide
    file contents as well
  • Responses to requests are given in XML format
  • Works directly with the filesystem
  • Access to datestamp information
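An OAI-PMH harvest against modoai is an ordinary HTTP GET with the verb and metadata prefix passed as query parameters. A sketch of building such a request (the endpoint path, prefix, and date are illustrative assumptions, not taken from the slides):

```shell
# Build a ListRecords request URL; modoai would answer with an XML
# document describing the server's files.
base='http://darwin.seven.research.odu.edu/modoai'   # hypothetical endpoint
url="${base}?verb=ListRecords&metadataPrefix=oai_dc&from=2005-01-01"
echo "$url"
# The `from` parameter uses the datestamp support mentioned above to
# restrict the harvest to files modified since a given date.
```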

20
Case 2: modoai 0.6
  • Current issues
  • Cannot serve files with handlers, for security
    reasons
  • Serves the actual file, which could include
    confidential information
  • Uses base64 encoding to serve file contents (33%
    bigger)
  • Ignores cgi-bin folder protection
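The 33% figure follows from base64 itself: every 3 input bytes become 4 output characters. A quick check (the sample string is arbitrary):

```shell
# 12 input bytes encode to 16 base64 characters: a 4/3 (~33%) expansion.
plain=$(( $(printf '%s' 'abcdefghijkl' | wc -c) ))
encoded=$(( $(printf '%s' 'abcdefghijkl' | base64 | tr -d '\n' | wc -c) ))
echo "$plain bytes -> $encoded bytes"
```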

21
Case 2: Results
22
Case 2: Results
  • 4615 matches (5)
  • heritrix vs modoai (41 differences)
  • 18 dynamic files skipped by modoai
  • 19 index.html added by heritrix on directory
    listings
  • 5 naming differences
  • modoai vs heritrix (188 differences)
  • 178 files contained in cgi-bin were included by
    modoai
  • 5 naming differences
  • 5 files skipped by heritrix
  • 2 empty files
  • 2 xxx files
  • 1 containing uncommon characters (93_0?1.abs.Z)

23
Case 2: Results
  • 4639 matches (17)
  • heritrix vs Wget (17 differences)
  • All due to naming issues in heritrix
  • Wget vs heritrix (35 differences)
  • 17 due to naming issues
  • 18 extra files on Wget
  • 2 empty files that were ignored by heritrix
  • 16 extra files created by Wget (same directory
    crawled twice with different capitalization)

24
Conclusion
  • It is almost impossible to mirror 100% of a
    website
  • Different crawlers produce different results
  • Selection depends on server contents
  • Dynamic vs static content
  • Character encoding
  • Operating System can make a difference
  • modoai could end up being the do-it-all
    application
  • Dynamic files inclusion
  • Admin access to get protected files

25
Questions?