Title: A Reflection on Spiders, Bots and Aggregators
1. A Reflection on Spiders, Bots and Aggregators
- An Independent Study by Jeff Heaton
- Advised by Bill Darte
2. Presented by Jeff Heaton
- Email: heatonj_at_heat-on.com
- Web: http://www.jeffheaton.com
3. Upcoming Book by Presenter
Programming Spiders, Bots, and Aggregators in Java, by Jeff Heaton. Published in March 2002.
Paperback, 512 pages, 1st edition (March 2002)
Sybex, ISBN 0782140408
4. Basic Terminology
- Spider
- Robot (Bot)
- Aggregator
- Agent
- Intelligent agent
5. An Overview
6. The HTTP Protocol
- Bots must navigate web sites
- The HTTP protocol is the basic mode of transportation for web pages
- A bot must use the HTTP protocol
7. An HTTP Request
GET /grindex.asp HTTP/1.1
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/vnd.ms-powerpoint, application/vnd.ms-excel, application/msword, */*
Accept-Language: en-us
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)
Host: www.classinfo.net
Connection: Keep-Alive
Cookie: ASPSESSIONIDGGGGQHPK=BHLGFGOCHAPALILEEMNIMAFG
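As a rough illustration of how a bot might assemble a request like this one, the Python sketch below builds the request text by hand. The helper name `build_request` and the agent name `ExampleBot/1.0` are assumptions made for this example; they do not come from the slides.

```python
def build_request(method, path, host, headers=None):
    """Return the raw text of an HTTP/1.1 request (hypothetical helper)."""
    lines = [f"{method} {path} HTTP/1.1", f"Host: {host}"]
    for name, value in (headers or {}).items():
        lines.append(f"{name}: {value}")
    # Header lines end with CRLF; a blank line marks the end of the headers.
    return "\r\n".join(lines) + "\r\n\r\n"


request = build_request(
    "GET",
    "/grindex.asp",
    "www.classinfo.net",
    {"User-Agent": "ExampleBot/1.0", "Connection": "Keep-Alive"},
)
print(request)
```

A real bot would normally leave this work to an HTTP library; the point here is only to show what travels over the wire.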
8. Types of HTTP Requests
- GET: most common, used to download a single resource.
- POST: used to respond to a FORM.
- HEAD: least common, used to verify the existence of a web page.
9. An HTTP Response
HTTP/1.1 200 OK
Connection: Keep-Alive
Server: Microsoft-IIS/4.0
Content-Type: text/html
Cache-control: private
Transfer-Encoding: chunked
Via: 1.1 c760 (NetCache 4.1R4D1)
Date: Tue, 13 Mar 2001 03:55:05 GMT

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
... the rest of the HTML document ...
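A bot has to take a response like this apart before it can use it. The sketch below is one possible way to split the status line, headers, and body in Python. The function name `parse_response` and the shortened sample text are assumptions for illustration, and the sketch does not decode a chunked body.

```python
def parse_response(raw):
    """Split raw response text into (status_code, reason, headers, body)."""
    head, _, body = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    _version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so store them in lowercase.
        headers[name.strip().lower()] = value.strip()
    return int(code), reason, headers, body


sample = (
    "HTTP/1.1 200 OK\r\n"
    "Server: Microsoft-IIS/4.0\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    "<HTML>...</HTML>"
)
status, reason, headers, body = parse_response(sample)
print(status, reason, headers["content-type"])
```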
10. Retrieving a Web Page
- Most web pages are a mixture of text and graphics.
- First the web browser downloads the HTML page.
- Then every image contained in that page is downloaded.
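The second step, finding every image a page refers to, can be sketched with Python's standard HTML parser. The class name `ImageCollector`, the sample page, and the `www.example.com` address are all made up for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Collect the absolute URL of every <img> tag in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lowercase.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative links against the page's own address.
                self.image_urls.append(urljoin(self.base_url, src))


page = '<HTML><BODY><IMG SRC="logo.gif"><p>Hi</p><img src="/pics/a.jpg"></BODY></HTML>'
collector = ImageCollector("http://www.example.com/news/index.html")
collector.feed(page)
print(collector.image_urls)
```

A browser would then issue one GET request per collected URL.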
11. A Typical Web Page
12. HTTP Messages
13. Example: HTTP
14. Building a Bot
- A bot retrieves information from a web site.
- Often a bot can be used to monitor a page.
- The BBS bot.
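One simple way a bot could monitor a page is to remember a digest of the last content it saw and compare it on each visit. The sketch below assumes this approach; the slides do not say how the BBS bot works. The class name `PageWatcher`, the `bbs.example.com` address, and the stand-in download function are hypothetical.

```python
import hashlib


class PageWatcher:
    """Report whether a page's content changed since the last check."""

    def __init__(self, fetch):
        self.fetch = fetch        # callable: url -> page text
        self.last_digest = {}     # url -> digest of the last content seen

    def has_changed(self, url):
        digest = hashlib.sha256(self.fetch(url).encode("utf-8")).hexdigest()
        previous = self.last_digest.get(url)
        self.last_digest[url] = digest
        # The first check only records a baseline, so it reports no change.
        return previous is not None and previous != digest


# A stand-in for a real download, so the sketch runs without a network.
pages = {"http://bbs.example.com/": "3 messages"}
watcher = PageWatcher(lambda url: pages[url])

first = watcher.has_changed("http://bbs.example.com/")
pages["http://bbs.example.com/"] = "4 messages"
second = watcher.has_changed("http://bbs.example.com/")
print(first, second)
```

In a real bot the `fetch` callable would download the page over HTTP.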
15. A Typical BBS
16. Example: Watch BBS Bot
17. What is a Spider?
- A spider is a specialized bot that moves from web page to web page.
- A spider takes its name from the arachnid, which moves across its web much as the program moves across the Web.
18. Page Queues
- Waiting queue: the page is waiting to be downloaded.
- Running queue: the page is downloading.
- Error queue: the page resulted in an error.
- Complete queue: the page has been downloaded and should not be redownloaded.
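The four queues can be sketched as a small Python class. This is only one possible arrangement, not the presenter's implementation; the class name `PageQueues` and its method names are assumptions.

```python
class PageQueues:
    """Track each URL in exactly one of the four queues (hypothetical class)."""

    def __init__(self):
        self.waiting = []       # found but not yet downloaded (first in, first out)
        self.running = set()    # currently downloading
        self.error = set()      # download failed
        self.complete = set()   # downloaded; never queue again

    def add(self, url):
        """Queue a URL unless the spider has already seen it."""
        seen = self.running | self.error | self.complete
        if url not in seen and url not in self.waiting:
            self.waiting.append(url)

    def start_next(self):
        """Move the oldest waiting URL to the running queue and return it."""
        if not self.waiting:
            return None
        url = self.waiting.pop(0)
        self.running.add(url)
        return url

    def finish(self, url, ok):
        """Move a running URL to the complete or error queue."""
        self.running.remove(url)
        (self.complete if ok else self.error).add(url)


queues = PageQueues()
queues.add("http://www.example.com/")
queues.add("http://www.example.com/")    # duplicate, ignored
url = queues.start_next()
queues.finish(url, ok=True)
queues.add("http://www.example.com/")    # already complete, ignored
print(queues.waiting, queues.complete)
```

Checking every queue before adding a URL is what keeps a spider from downloading the same page twice.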
19. Page State Transition
20. A Typical Web
21. Spider Flowchart
22. Example: Download a Site
23. Business Issues of Spiders, Bots and Aggregators
- Two sides of the same coin
- How your company can use bots
- How bots can be used against your company
24. Uses for Bots
- Tracking shipments
- Account aggregation
- Reputation monitoring
- Indexing/searching
- Monitoring reliability
25. Sites That Extensively Use Bots
- AltaVista Babel Fish
- Yodlee
- PriceWatch
- Google
26. AltaVista Babel Fish
- Used to translate a site between supported languages
- Directly integrated with AltaVista
- Located at http://world.altavista.com
27. Yodlee
- Aggregates many on-line accounts into one
- ASP model: most users access Yodlee through an intermediary
- Located at http://www.yodlee.com
28. Pricewatch
- Used to compare prices from multiple vendors
- Has encountered some legal problems
- Located at http://www.pricewatch.com
29. Google
- Primary business is searching
- Bots crawl web pages and build the search index
30. Why Be Friendly to Bots
- Allow your site to be indexed into search engines
- Allow customers to access your data in new ways
- If you use bots yourself
31. Bot Friendly Sites
- Meta tags for bots to locate
- Friendly robots.txt files
- Terms of service agreements that allow bot usage
32. Meta Tags
<BASE HREF="http://www.wustl.edu/">
<HTML>
<HEAD>
<META HTTP-EQUIV="content-type" content="text/html;charset=iso-8859-1">
<TITLE>Washington University in St. Louis Home Page</TITLE>
<META NAME="description" content="Washington University in St. Louis Official Home Page">
<META NAME="keywords" content="Washington University, Washington University in St. Louis, Wash U, WU, WUSTL, WUStL, Wash. U., University Washington, university, universities, American Universities, St. Louis, Missouri, education, research, higher education, undergraduate, university libraries, academic, college, colleges, Midwestern universities, online applications, health care, medicine, academic, academics, campus, students, university students, college students, Washington University School of Architecture, Washington University School of Art, Washington University Arts and Sciences, Washington University School of Business, John M. Olin School of Business, Washington University School of Engineering and Applied Science, Washington University School of Law, Washington University School of Medicine, Washington University School of Social Work, George Warren Brown School of Social Work, University College, College of Arts and Sciences">
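A bot can read tags like these with a standard HTML parser. The following Python sketch collects the description and keywords. The class name `MetaCollector` is hypothetical, and the sample page is a shortened stand-in for the full listing.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <META NAME=... CONTENT=...> pairs from a page."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lowercase.
        if tag == "meta":
            attributes = dict(attrs)
            name = attributes.get("name")
            if name:
                self.meta[name.lower()] = attributes.get("content", "")


# Shortened sample; the real keyword list is much longer.
page = (
    '<HEAD><META NAME="description" '
    'content="Washington University in St. Louis Official Home Page">'
    '<META NAME="keywords" content="Washington University, Wash U, WUSTL"></HEAD>'
)
collector = MetaCollector()
collector.feed(page)
keywords = [k.strip() for k in collector.meta["keywords"].split(",")]
print(collector.meta["description"])
print(keywords)
```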
33. Robots.txt
# robots, scram
User-agent: *
Disallow: /cgi-bin
Disallow: /TRANSCRIPTS
Disallow: /development
Disallow: /third
Disallow: /beta
Disallow: /java
Disallow: /shockwave
Disallow: /JOBS
Disallow: /pr
Disallow: /Interactive
Disallow: /alt_index.html
Disallow: /webmaster_logs
Disallow: /newscenter
Disallow: /virtual
Disallow: /DIGEST
Disallow: /QUICKNEWS
Disallow: /SEARCH
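A well-behaved bot checks these rules before requesting a page. Python's standard library includes a robots.txt parser, and the sketch below feeds it a shortened copy of the rules. The agent name `ExampleBot` and the `www.example.com` addresses are placeholders, not values from the slides.

```python
from urllib.robotparser import RobotFileParser

# A shortened copy of the rules on this slide.
rules = """\
User-agent: *
Disallow: /cgi-bin
Disallow: /beta
Disallow: /SEARCH
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Ask whether the (hypothetical) bot may fetch each address.
allowed = parser.can_fetch("ExampleBot", "http://www.example.com/index.html")
blocked = parser.can_fetch("ExampleBot", "http://www.example.com/cgi-bin/search")
print(allowed, blocked)
```

Against a live site, `set_url` and `read` would download the file instead of parsing a local string.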
34. A Friendly TOS Agreement
Use of Data and Products: The information on government servers are in the public domain, unless specifically annotated otherwise, and may be used freely by the public. Before using information obtained from this server special attention should be given to the date and time of the data and products being displayed. This information shall not be modified in content and then presented as official government material.
(from the National Weather Service)
35. An Unfriendly TOS Agreement
- User will not access any software or data provided via indirect means or any method not intended or agreed upon by PCQuote. Robot programs (automated query systems) are strictly prohibited and any use of such systems will result in immediate termination of access.
- (From PCQuote.com)
36. Bot Ethics
- Do unto others as you would have them do unto you.
- Do unto others as the law would have you do unto them.
37. Detecting Bots
- User Agent Name - What user agent name are you specifying for your bot? If you are not using anonymous access, your bot will stand out easily on an access log.
- Frequency of Access - How often are you accessing the site, and is it always from the same IP address? A very large volume of accesses from the same IP address is usually a telltale sign of a bot or spider.
- Access Method - How is the bot accessing the site? Is it only pulling text files and not downloading any images? Web browsers being used by human users will almost always download all of the images too. A bot typically only goes after the text.
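Two of these signals, request volume per IP address and the absence of image downloads, are easy to check in an access log. The Python sketch below is a deliberately crude heuristic. The function name `likely_bots`, the threshold of five requests, the list of image suffixes, and the sample log are all assumptions for illustration.

```python
from collections import defaultdict

IMAGE_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png")


def likely_bots(entries, min_requests=5):
    """Return IPs that made many requests but never asked for an image.

    entries is an iterable of (ip_address, path) pairs from an access log.
    """
    total = defaultdict(int)
    images = defaultdict(int)
    for ip, path in entries:
        total[ip] += 1
        if path.lower().endswith(IMAGE_SUFFIXES):
            images[ip] += 1
    return sorted(
        ip for ip, count in total.items()
        if count >= min_requests and images[ip] == 0
    )


log = [("10.0.0.1", f"/page{i}.html") for i in range(6)]         # text only
log += [("10.0.0.2", "/index.html"), ("10.0.0.2", "/logo.gif")]  # browser-like
print(likely_bots(log))
```

A real web master would tune the threshold to the site's normal traffic.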
38. Web Site Hostility
- Usenet Postings - The web master can make Usenet postings to defame your bot and site. If your bot is an annoyance, most web masters will want to warn other web masters.
- Legal Measures - If you are violating their terms of service, they may bring legal action against you.
- Bot Exclusion File - We've already examined the robots.txt file. By using this file the web master can request your spider to leave the site alone. This is the least severe means of thwarting a spider or bot. If this method fails, a more severe alternative will likely be pursued.
- Filter based on IP - If a large volume of traffic is coming from a single IP address, that IP address could be denied access.
- Filter based on Agent Name - If a large volume of traffic is coming from a single agent name, that agent name can be denied access.
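The last two measures amount to a deny list consulted on every request. A minimal Python sketch of such a filter follows. The class name `RequestFilter`, the substring match on agent names, and the sample addresses are assumptions; a real web server would do this in its own configuration.

```python
class RequestFilter:
    """Deny requests from blocked IP addresses or blocked agent names."""

    def __init__(self, blocked_ips=(), blocked_agents=()):
        self.blocked_ips = set(blocked_ips)
        self.blocked_agents = {agent.lower() for agent in blocked_agents}

    def is_allowed(self, ip, user_agent):
        if ip in self.blocked_ips:
            return False
        # Match on a substring so "ExampleBot/1.0" is caught by "examplebot".
        agent = user_agent.lower()
        return not any(blocked in agent for blocked in self.blocked_agents)


request_filter = RequestFilter(blocked_ips=["10.0.0.1"], blocked_agents=["ExampleBot"])
print(request_filter.is_allowed("10.0.0.1", "Mozilla/4.0"))
print(request_filter.is_allowed("10.0.0.2", "ExampleBot/1.0"))
print(request_filter.is_allowed("10.0.0.2", "Mozilla/4.0"))
```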
39. The Future of Bots
- More consistent web sites
- XML
- SOAP
40. What is SOAP?
- Main Entry: 1soap
  Pronunciation: 'sOp
  Function: noun
  Etymology: Middle English sope, from Old English sApe; akin to Old High German seifa soap
  Date: before 12th century
  1 a: a cleansing and emulsifying agent made usually by action of alkali on fat or fatty acids and consisting essentially of sodium or potassium salts of such acids b: a salt of a fatty acid and a metal
  2: SOAP OPERA
- (from Merriam-Webster's 10th Collegiate Dictionary)
41. No, Really. What is SOAP?
- Main Entry: 1soap
  Pronunciation: 'sOp
  Function: protocol
  Etymology: Acronym for Simple Object Access Protocol.
  Date: before 20th century
  1: A cross-platform, XML-based protocol used to access objects that may not reside on the same local system.
  2: SOAP OPERA: When different vendors' implementations are not compatible.
42. What Does SOAP Offer?
- XML based
- Can use many different transfer protocols (e.g. HTTP, SMTP)
- Asynchronous
- Distributed Systems
43. Who is Involved in SOAP?
- Submitted to the W3C, which governs its standardization.
- Sun is incorporating SOAP into Java via JAXM (Java API for XML Messaging)
- Microsoft is incorporating SOAP into the .NET platform.
44. SOAP Components
- XML Based
- SOAP Messages
- Web Services Description Language (WSDL)
45. What is XML?
- Another format for text files
- Hierarchical, not flat.
- Like HTML, based on SGML.
- Much stricter than HTML.
- Supported by a wide variety of tools
46. What Does XML Look Like?
Node: A single element that can enclose other elements.
Attribute: A value that is stored as part of a node's beginning tag.
Beginning Tag: A tag that does not start or end with a /.
Ending Tag: A tag that starts with a /.
Begin-End Tag: A tag that ends, but does not start, with a /.
<StudentList>
  <Student id="555">
    <first>Tom</first>
    <last>Smith</last>
    <middle></middle>
  </Student>
  <Student id="556">
    <first>Regina</first>
    <last>Smith</last>
    <middle>A</middle>
  </Student>
</StudentList>
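Because XML is strict and hierarchical, a bot can read it with a standard parser instead of scraping text. The Python sketch below walks the student list using the standard library's ElementTree module; the variable names are chosen for this example only.

```python
import xml.etree.ElementTree as ET

document = """\
<StudentList>
  <Student id="555">
    <first>Tom</first><last>Smith</last><middle></middle>
  </Student>
  <Student id="556">
    <first>Regina</first><last>Smith</last><middle>A</middle>
  </Student>
</StudentList>
"""

root = ET.fromstring(document)
# For each Student node, read the id attribute and two child elements.
students = [
    (student.get("id"), student.findtext("first"), student.findtext("middle"))
    for student in root.findall("Student")
]
print(students)
```

A mismatched ending tag would make `ET.fromstring` raise a parse error, which is the strictness the previous slide refers to.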
47. What are SOAP Messages?
- Blocks of data that make up SOAP requests and responses
- Blocks of data are in XML
- SOAP can use a variety of underlying transfer protocols.
48. How are SOAP Messages Sent?
- Can be sent asynchronously
- HTTP
- SMTP
49. What Does SOAP Look Like?
<SOAP-ENV:Envelope
    xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"
    SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
  <SOAP-ENV:Body>
    <m:GetCurrentTemperature xmlns:m="Some-URI">
      <symbol>KSTL</symbol>
    </m:GetCurrentTemperature>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>
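An envelope like this can be generated, not typed by hand. The Python sketch below builds an equivalent message with ElementTree. The helper name `build_envelope` is hypothetical, and the parser chooses its own prefix for the method namespace, so the output will not match the slide character for character.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"
SOAP_ENC = "http://schemas.xmlsoap.org/soap/encoding/"
ET.register_namespace("SOAP-ENV", SOAP_ENV)


def build_envelope(method, method_ns, params):
    """Return a SOAP 1.1 envelope as text (hypothetical helper)."""
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    envelope.set(f"{{{SOAP_ENV}}}encodingStyle", SOAP_ENC)
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    # The method element lives in the service's own namespace.
    call = ET.SubElement(body, f"{{{method_ns}}}{method}")
    for name, value in params.items():
        ET.SubElement(call, name).text = value
    return ET.tostring(envelope, encoding="unicode")


message = build_envelope("GetCurrentTemperature", "Some-URI", {"symbol": "KSTL"})
print(message)
```

The resulting text would then be sent as the body of an HTTP POST or an SMTP message.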
50. What is WSDL?
- Web Services Description Language
- A SOAP roadmap
- Used to describe the kinds of SOAP messages a service expects.
51. Questions?