Brass: A Queueing Manager for Warrick

1
Brass: A Queueing Manager for Warrick
  • Frank McCown, Amine Benjelloun, and Michael L.
    Nelson
  • Old Dominion University, Computer Science
    Department, Norfolk, Virginia, USA
  • IWAW 2007, Vancouver, BC, June 23, 2007

2
Agenda
  • Dangers facing websites
  • Web-repository crawling
  • Comparing web crawling with web-repository
    crawling
  • All about Brass
  • Alternate Warrick deployments

3
Black hat: http://img.webpronews.com/securitypronews/110705blackhat.jpg
Virus image: http://polarboing.com/images/topics/misc/story.computer.virus_1137794805.jpg
Hard drive: http://www.datarecoveryspecialist.com/images/head-crash-2.jpg
5
A couple weeks ago I accidentally deleted my
entire database of about 30 articles. After I
finished berating myself for being so stupid, I
realized that my hosting company would have a
backup, so I sent an email asking them to restore
the database. Their reply stated that backups
were coming soon... OUCH! So right after I signed
up with a better hosting company I had to figure
out a plan B.
6
Crawling the Crawlers
7
  • McCown, et al., Brass: A Queueing Manager for
    Warrick, IWAW 2007.
  • McCown, et al., Factors Affecting Website
    Reconstruction from the Web Infrastructure,
    ACM/IEEE JCDL 2007.
  • McCown and Nelson, Evaluation of Crawling
    Policies for a Web-Repository Crawler, HYPERTEXT
    2006.
  • McCown, et al., Lazy Preservation: Reconstructing
    Websites by Crawling the Crawlers, ACM WIDM 2006.

Available at http://warrick.cs.odu.edu/
10
Cached Image
11
Cached PDF
http://www.fda.gov/cder/about/whatwedo/testtube.pdf
Panels: canonical, MSN version, Yahoo version, Google version
12
Examples of Lost Websites Recovered with Warrick
13
Web Crawler
14
Web-Repository Crawler
15
Issues
  • Web crawling
      • Limit hit rate per host
      • Websites periodically unavailable
      • Portions of website off-limits (robots.txt,
        passwords)
      • Deep web
      • Spam
      • Duplicate content
      • Flash and JavaScript interfaces
      • Crawler traps
  • Web-repo crawling
      • Limit hit rate per repo
      • Limited hits per day (API query quotas)
      • Repos periodically unavailable
      • Flash and JavaScript interfaces
      • Can only recover what repos have stored
      • Lossy format conversions (thumbnail images,
        HTMLized PDFs, etc.)
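
The two throttling constraints listed for web-repo crawling (a hit rate per repo and a daily API query quota) can be illustrated with a small bookkeeping class. This is only a sketch under assumptions: the class name, its fields, and the quota and interval values are hypothetical and are not taken from Warrick or Brass.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RepoBudget:
    """Tracks one repository's daily quota and minimum request spacing.

    Both limits are hypothetical values supplied by the caller; the
    slides do not state the real quotas.
    """
    daily_quota: int                      # queries allowed per day
    min_interval: float                   # seconds between two queries
    used_today: int = 0
    last_request: float = float("-inf")   # time of the previous query

    def seconds_until_allowed(self, now: float) -> Optional[float]:
        """Return the wait before the next query, or None if the quota is spent."""
        if self.used_today >= self.daily_quota:
            return None
        return max(0.0, self.last_request + self.min_interval - now)

    def record_request(self, now: float) -> None:
        """Count one query issued at time `now`."""
        self.used_today += 1
        self.last_request = now

    def reset_day(self) -> None:
        """Start a new quota period."""
        self.used_today = 0
```

A crawler would keep one such budget per repository and skip any repository whose budget returns None until the next reset.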

16
Problems with Warrick
  • Requires user to download, install, and run from
    the command line
      • warrick.pl d r o log.txt c wr ia http://foo.org/
  • Google API keys are no longer available
  • Screen-scrapes Google's web user interface, which
    can cause Google to blacklist an IP address

17
Solution: Brass
  • Queueing system using ODU nodes, so API query
    limits can be spread across several machines
  • Uses Google API keys which we obtained before
    they were no longer made available
  • Easy-to-use web interface utilizing email to
    notify user when reconstructions are complete
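
The idea of spreading API query limits across several machines can be sketched as a dispatcher that hands each queued reconstruction to the node with the most quota left. This is an illustrative assumption about how such a queue might work, not the actual Brass scheduler; the class, the node names, and the per-job query estimate are all hypothetical.

```python
from collections import deque
from typing import Dict, Optional, Tuple


class Dispatcher:
    """Assigns queued reconstruction jobs to the node with the most quota left."""

    def __init__(self, node_quotas: Dict[str, int]) -> None:
        # Remaining API queries per node (hypothetical numbers).
        self.remaining = dict(node_quotas)
        self.queue = deque()  # URLs waiting to be reconstructed, oldest first

    def submit(self, url: str) -> None:
        """Add a website URL to the back of the queue."""
        self.queue.append(url)

    def dispatch(self, estimated_queries: int) -> Optional[Tuple[str, str]]:
        """Return (node, url) for the next job, or None if nothing can run."""
        if not self.queue:
            return None
        # Pick the node with the largest remaining quota.
        node = max(self.remaining, key=self.remaining.get)
        if self.remaining[node] < estimated_queries:
            return None  # no node can afford this job right now
        self.remaining[node] -= estimated_queries
        return node, self.queue.popleft()
```

For example, with two nodes holding 10 and 30 remaining queries, a job estimated at 25 queries goes to the second node, and the next smaller job then goes to the first.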

18
Warrick Brown, Captain Jim Brass
http://www.cbs.com/primetime/csi/bios/index.php?cast_membergary
23
Brass Architecture
24
Job Processing
  1. Pending: Waiting to be confirmed
  2. Queued: Waiting to be started
  3. Processing: Currently being executed
  4. Complete: Ready to be picked up
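
The four job states above can be modeled as a simple linear state machine. The sketch below assumes each job moves strictly forward through the listed states; the slides do not mention failure or cancellation states, so none are modeled, and the names `JobState` and `advance` are hypothetical.

```python
from enum import Enum


class JobState(Enum):
    PENDING = "pending"        # waiting to be confirmed
    QUEUED = "queued"          # waiting to be started
    PROCESSING = "processing"  # currently being executed
    COMPLETE = "complete"      # ready to be picked up


# Assumed forward-only transitions, in the order the slide lists them.
NEXT_STATE = {
    JobState.PENDING: JobState.QUEUED,
    JobState.QUEUED: JobState.PROCESSING,
    JobState.PROCESSING: JobState.COMPLETE,
}


def advance(state: JobState) -> JobState:
    """Return the state that follows `state`; COMPLETE has no successor."""
    if state not in NEXT_STATE:
        raise ValueError(f"{state.name} is a terminal state")
    return NEXT_STATE[state]
```

Keeping the transitions in one table makes it easy to reject out-of-order updates, such as marking a job complete before it has been confirmed.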

25
Other Warrick Deployments
  • GUI interface for client executable
      • Installation difficulties
      • Lack of Google API keys
  • Web interface along with client application which
    makes queries
      • Browser plug-in, Flash, or applet
      • Must manage Google API keys
      • Browser must be left open with continuous
        Internet access

26
Conclusions
  • Warrick interface is almost ready for the public
  • Web interface will likely greatly increase
    Warrick usage
  • Collection of usage data will allow us to better
    understand what kinds of websites the public is
    interested in recovering

27
And that's everything there is to know about
Brass!
Thanks, Dad, but I just wanted to know when you
were going to change my diaper.
Frank McCown, fmccown_at_cs.odu.edu