Title: Brass: A Queueing Manager for Warrick
1. Brass: A Queueing Manager for Warrick
- Frank McCown, Amine Benjelloun, and Michael L. Nelson
- Old Dominion University, Computer Science Department, Norfolk, Virginia, USA
- IWAW 2007, Vancouver, BC, June 23, 2007
2. Agenda
- Screen-scraping the web user interface (WUI)
- Search engine APIs
- Comparing search results
- Five-month experiment
- Significant findings and conclusions
3.
- Black hat: http://img.webpronews.com/securitypronews/110705blackhat.jpg
- Virus image: http://polarboing.com/images/topics/misc/story.computer.virus_1137794805.jpg
- Hard drive: http://www.datarecoveryspecialist.com/images/head-crash-2.jpg
5. A couple weeks ago I accidentally deleted my entire database of about 30
articles. After I finished berating myself for being so stupid, I realized
that my hosting company would have a backup, so I sent an email asking them
to restore the database. Their reply stated that backups were coming soon...
OUCH! So right after I signed up with a better hosting company I had to
figure out a plan B.
6. Crawling the Crawlers
7.
- McCown, et al., Brass: A Queueing Manager for Warrick, IWAW 2007.
- McCown, et al., Factors Affecting Website Reconstruction from the Web Infrastructure, ACM/IEEE JCDL 2007.
- McCown and Nelson, Evaluation of Crawling Policies for a Web-Repository Crawler, HYPERTEXT 2006.
- McCown, et al., Lazy Preservation: Reconstructing Websites by Crawling the Crawlers, ACM WIDM 2006.
Available at http://warrick.cs.odu.edu/
10. Cached Image
11. Cached PDF
http://www.fda.gov/cder/about/whatwedo/testtube.pdf
Panels: canonical, MSN version, Yahoo version, Google version
12. Examples of Lost Websites Recovered with Warrick
13. Web Crawler
14. Web-Repository Crawler
15. Limitations
- Web crawling
  - Limit hit rate per host
  - Websites periodically unavailable
  - Portions of website off-limits (robots.txt, passwords)
  - Deep web
  - Spam
  - Duplicate content
  - Flash and JavaScript interfaces
  - Crawler traps
- Web-repo crawling
  - Limit hit rate per repo
  - Limited hits per day (API query quotas)
  - Repos periodically unavailable
  - Flash and JavaScript interfaces
  - Can only recover what repos have stored
  - Lossy format conversions (thumbnail images, HTMLized PDFs, etc.)
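The two web-repo constraints that are easiest to make concrete are the hit rate per repo and the daily query quota. The following is a minimal Python sketch of how a client might track both for a single repository. It is not Warrick's code: the `RepoLimiter` class, its field names, and the numbers used below are hypothetical, and time is passed in explicitly so the logic can be checked without sleeping.

```python
from dataclasses import dataclass


@dataclass
class RepoLimiter:
    """Hypothetical per-repository limiter (illustrative, not from Warrick)."""
    min_interval: float                   # seconds required between two requests
    daily_quota: int                      # maximum requests allowed per day
    used_today: int = 0                   # requests already made today
    last_request: float = float("-inf")   # time of the most recent request

    def wait_needed(self, now: float) -> float:
        """Seconds the caller should still wait before the next request."""
        return max(0.0, self.last_request + self.min_interval - now)

    def try_request(self, now: float) -> bool:
        """Record a request if both limits allow it; return True if allowed."""
        if self.used_today >= self.daily_quota:
            return False                  # daily quota exhausted
        if self.wait_needed(now) > 0:
            return False                  # too soon after the previous request
        self.used_today += 1
        self.last_request = now
        return True

    def reset_day(self) -> None:
        """Start a new quota period."""
        self.used_today = 0
```

For example, with an assumed `RepoLimiter(min_interval=2.0, daily_quota=2)`, a request at time 0.0 is allowed, a request at 1.0 is refused because one more second must pass, and a third request later in the day is refused because the quota is used up.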
16. Problems with Warrick
- Requires user to download, install, and run from the command line
  - warrick.pl -d -r -o log.txt -c -wr ia http://foo.org/
- Google API keys are no longer available
- Screen-scrapes Google's web user interface, which can cause Google to blacklist an IP address
17. Solution: Brass
- Queueing system using ODU nodes, so API query limits can be spread across several machines
- Uses Google API keys which we obtained before they were no longer made available
- Easy-to-use web interface utilizing email to notify user when reconstructions are complete
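To illustrate the idea of spreading API query limits across several machines, here is a small Python sketch. The slides do not describe how Brass actually schedules work, so the greedy policy below (give each job to the node with the most quota left) is an assumption, as are the function name `assign_jobs`, the node names, and the notion that a job's query cost can be estimated in advance.

```python
def assign_jobs(jobs, node_quotas):
    """Greedily assign queued jobs to nodes without exceeding any node's quota.

    jobs: list of (job_id, estimated_queries) tuples in arrival order.
    node_quotas: dict mapping node name -> queries remaining today.
    Returns (assignments, still_queued), where assignments maps
    job_id -> node name and still_queued keeps jobs that did not fit.
    """
    remaining = dict(node_quotas)   # copy so the caller's dict is not modified
    assignments = {}
    still_queued = []
    for job_id, cost in jobs:
        # Pick the node with the most quota left (hypothetical policy).
        node = max(remaining, key=remaining.get) if remaining else None
        if node is not None and remaining[node] >= cost:
            remaining[node] -= cost
            assignments[job_id] = node
        else:
            # No node can afford this job today; it stays in the queue.
            still_queued.append((job_id, cost))
    return assignments, still_queued
```

With two hypothetical nodes holding 1000 queries each and three jobs estimated at 600 queries each, the first two jobs land on different nodes and the third stays queued until quota is available again.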
18. Warrick Brown, Captain Jim Brass
http://www.cbs.com/primetime/csi/bios/index.php?cast_member=gary
23. Brass Architecture
24. Job Processing
- Pending: Waiting to be confirmed
- Queued: Waiting to be started
- Processing: Currently being executed
- Complete: Ready to be picked up
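The four job states can be modeled as a small state machine. The Python sketch below assumes the states are visited strictly in the order listed (Pending, Queued, Processing, Complete); the slide names the states but not the transitions, so the `NEXT_STATE` table and the `advance` helper are hypothetical.

```python
from enum import Enum


class JobState(Enum):
    PENDING = "pending"        # waiting to be confirmed
    QUEUED = "queued"          # waiting to be started
    PROCESSING = "processing"  # currently being executed
    COMPLETE = "complete"      # ready to be picked up


# Assumed linear progression; failure or cancellation states are not modeled.
NEXT_STATE = {
    JobState.PENDING: JobState.QUEUED,
    JobState.QUEUED: JobState.PROCESSING,
    JobState.PROCESSING: JobState.COMPLETE,
}


def advance(state: JobState) -> JobState:
    """Return the state that follows `state`, or raise if it is final."""
    if state not in NEXT_STATE:
        raise ValueError(f"{state.name} is a final state")
    return NEXT_STATE[state]
```

A job record would start in `JobState.PENDING` and call `advance` at each step, for instance when the user confirms the request, when a node picks the job up, and when the reconstruction finishes.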
25. Other Warrick Deployments
- GUI interface for client executable
  - Installation difficulties
  - Lack of Google API keys
- Web interface along with client application which makes queries
  - Browser plug-in, Flash, or applet
  - Must manage Google API keys
  - Browser must be left open with continued Internet access
26. Conclusions
- Warrick interface is almost ready for the public
- Web interface will likely greatly increase Warrick usage
- Collection of usage data will allow us to better understand what kinds of websites the public is interested in recovering
27. And that's everything there is to know about Brass!
And a lot more than I wanted to know
Frank McCown, fmccown_at_cs.odu.edu