A Reflection on Spiders, Bots and Aggregators

Slides: 52
Provided by: JeffH66

1
A Reflection on Spiders, Bots and Aggregators
  • An Independent Study by Jeff Heaton
  • Advised by Bill Darte

2
Presented by Jeff Heaton
  • Email: heatonj_at_heat-on.com
  • Web: http://www.jeffheaton.com

3
Upcoming Book by Presenter
Programming Spiders, Bots, and Aggregators in Java
by Jeff Heaton. Published in March 2002.
Paperback, 512 pages, 1st edition (March 2002)
Sybex, ISBN 0782140408
4
Basic Terminology
  • Spider
  • Robot (Bot)
  • Aggregator
  • Agent
  • Intelligent agent

5
An Overview
6
The HTTP Protocol
  • Bots must navigate web sites
  • The HTTP protocol is the basic mode of
    transportation for web pages
  • A bot must use the HTTP protocol

7
An HTTP Request
GET /grindex.asp HTTP/1.1
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg,
  application/vnd.ms-powerpoint, application/vnd.ms-excel,
  application/msword, */*
Accept-Language: en-us
Accept-Encoding: gzip, deflate
User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)
Host: www.classinfo.net
Connection: Keep-Alive
Cookie: ASPSESSIONIDGGGGQHPK=BHLGFGOCHAPALILEEMNIMAFG
8
Types of HTTP Requests
  • GET: most common; used to download a single
    resource.
  • POST: used to submit the data from a FORM.
  • HEAD: least common; used to verify the existence
    of a web page.
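
The three request types differ mainly in the method named on the request line; POST also carries a body. The following is a minimal Python sketch that builds the raw request text. The helper name build_request, the host www.example.com, and the form field are assumptions made for illustration and do not come from the presentation.

```python
def build_request(method, path, host, headers=None, body=""):
    """Return the raw text of an HTTP/1.1 request (illustrative helper)."""
    lines = [f"{method} {path} HTTP/1.1", f"Host: {host}"]
    for name, value in (headers or {}).items():
        lines.append(f"{name}: {value}")
    if body:
        # A request with a body must say how long the body is.
        lines.append(f"Content-Length: {len(body.encode('utf-8'))}")
    # A blank line separates the headers from the (possibly empty) body.
    return "\r\n".join(lines) + "\r\n\r\n" + body


# Hypothetical host and paths, used only to show the three methods.
get_req = build_request("GET", "/index.html", "www.example.com")
head_req = build_request("HEAD", "/index.html", "www.example.com")
post_req = build_request(
    "POST", "/search", "www.example.com",
    headers={"Content-Type": "application/x-www-form-urlencoded"},
    body="q=spiders",
)
```

A real bot would normally let an HTTP library produce this text; writing it out by hand only makes the structure of each request visible.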

9
An HTTP Response
HTTP/1.1 200 OK
Connection: Keep-Alive
Server: Microsoft-IIS/4.0
Content-Type: text/html
Cache-control: private
Transfer-Encoding: chunked
Via: 1.1 c760 (NetCache 4.1R4D1)
Date: Tue, 13 Mar 2001 03:55:05 GMT

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
... the rest of the HTML document ...
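
A bot has to split a response like this into its status line, headers, and body before it can use the page. The sketch below shows one way to do that in Python on a shortened response. The function name parse_response is an assumption for illustration, and the sketch deliberately ignores chunked transfer encoding, which a real client would have to decode.

```python
def parse_response(raw):
    """Split raw HTTP response text into (status code, headers, body)."""
    # The first blank line ends the header section.
    head, _, body = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    # The status line looks like: HTTP/1.1 200 OK
    _version, code, _reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so store them in lower case.
        headers[name.strip().lower()] = value.strip()
    return int(code), headers, body


raw = (
    "HTTP/1.1 200 OK\r\n"
    "Server: Microsoft-IIS/4.0\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    "<HTML>...</HTML>"
)
status, headers, body = parse_response(raw)
```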
10
Retrieving a Web Page
  • Most web pages are a mixture of text and
    graphics.
  • First the web browser downloads the HTML page.
  • Then every image contained in that page is
    downloaded.
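
The second step, finding every image a page refers to, can be sketched with Python's standard html.parser module. The class name ImageCollector, the sample page, and the www.example.com URLs are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Collects the absolute URL of every <img> tag in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.images = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag and attribute names in lower case.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative paths against the page's own URL.
                self.images.append(urljoin(self.base_url, src))


page = '<html><body><img src="logo.gif"><p>Hi</p><IMG SRC="/pics/a.jpg"></body></html>'
collector = ImageCollector("http://www.example.com/news/index.html")
collector.feed(page)
```

After the call to feed, collector.images holds the list of URLs a browser would request next.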

11
A Typical Web Page
12
HTTP Messages
13
Example: HTTP
14
Building a Bot
  • A bot retrieves information from a web site.
  • Often a bot can be used to monitor a page.
  • The BBS bot.

15
A Typical BBS
16
Example: Watch BBS Bot
17
What is a Spider?
  • A spider is a specialized bot that moves from web
    page to web page.
  • A spider takes its name from the spider, the
    arachnid that moves about its web.

18
Page Queues
  • Waiting queue: the page is waiting to be
    downloaded.
  • Running queue: the page is downloading.
  • Error queue: the page resulted in an error.
  • Complete queue: the page has been downloaded
    and should not be downloaded again.
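
The four queues can be sketched as a small Python class that moves each URL from one queue to the next. The class and method names below are assumptions chosen for illustration, not the presenter's implementation, and the URLs are placeholders.

```python
from collections import deque


class PageQueues:
    """Tracks each URL through the waiting, running, error and complete queues."""

    def __init__(self):
        self.waiting = deque()
        self.running = set()
        self.error = set()
        self.complete = set()

    def add(self, url):
        """Queue a URL unless the spider has already seen it."""
        seen = (url in self.waiting or url in self.running
                or url in self.error or url in self.complete)
        if not seen:
            self.waiting.append(url)

    def next_page(self):
        """Move the oldest waiting URL to the running queue and return it."""
        if not self.waiting:
            return None
        url = self.waiting.popleft()
        self.running.add(url)
        return url

    def finish(self, url, ok):
        """Move a running URL to complete (ok) or to error (not ok)."""
        self.running.discard(url)
        (self.complete if ok else self.error).add(url)


queues = PageQueues()
queues.add("http://www.example.com/")
queues.add("http://www.example.com/")         # duplicate, ignored
queues.add("http://www.example.com/missing")
first = queues.next_page()
queues.finish(first, ok=True)                 # download succeeded
second = queues.next_page()
queues.finish(second, ok=False)               # download failed
queues.add(first)                             # already complete, ignored
```

Checking every queue before adding a URL is what keeps a spider from downloading the same page twice when several pages link to it.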

19
Page State Transition
20
A Typical Web
21
Spider Flowchart
22
Example: Download a Site
23
Business Issues of Spiders, Bots and Aggregators.
  • Two sides of the same coin
  • How your company can use bots
  • How bots can be used against your company

24
Uses for Bots
  • Tracking shipments
  • Account aggregation
  • Reputation monitoring
  • Indexing/searching
  • Monitoring reliability

25
Sites That Extensively Use Bots
  • AltaVista Babel Fish
  • Yodlee
  • Pricewatch
  • Google

26
AltaVista Babel Fish
  • Used to translate a site between supported
    languages
  • Directly integrated with AltaVista
  • Located at http://world.altavista.com

27
Yodlee
  • Aggregates many on-line accounts into one
  • ASP (application service provider) model: most
    users access Yodlee through an intermediary
  • Located at http://www.yodlee.com

28
Pricewatch
  • Used to compare different prices from multiple
    vendors
  • Has encountered some legal problems
  • Located at http://www.pricewatch.com

29
Google
  • Primary business is searching
  • Bots crawl web pages and build the search indexes

30
Why Be Friendly to Bots
  • Allow your site to be indexed into search engines
  • Allow customers to access your data in new ways
  • Because you may use bots yourself

31
Bot Friendly Sites
  • Meta tags for bots to locate
  • Friendly robots.txt files
  • Terms of service agreements that allow bot usage

32
Meta Tags
<BASE HREF="http://www.wustl.edu/">
<HTML>
<HEAD>
<META HTTP-EQUIV="content-type"
  content="text/html; charset=iso-8859-1">
<TITLE>Washington University in St. Louis Home Page</TITLE>
<META NAME="description"
  content="Washington University in St. Louis Official Home Page">
<META NAME="keywords"
  content="Washington University, Washington
  University in St. Louis, Wash U, WU, WUSTL,
  WUStL, Wash. U., University Washington,
  university, universities, American Universities,
  St. Louis, Missouri, education, research, higher
  education, undergraduate, university libraries,
  academic, college, colleges, Midwestern
  universities, online applications, health care,
  medicine, academic, academics, campus, students,
  university students, college students, Washington
  University School of Architecture, Washington
  University School of Art, Washington University
  Arts and Sciences, Washington University School
  of Business, John M. Olin School of Business,
  Washington University School of Engineering and
  Applied Science, Washington University School of
  Law, Washington University School of Medicine,
  Washington University School of Social Work,
  George Warren Brown School of Social Work,
  University College, College of Arts and
  Sciences">
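
A bot reads these tags by collecting each META element's name and content. The following Python sketch does this with the standard html.parser module on a shortened page. The class name MetaReader and the sample HTML are assumptions made for illustration.

```python
from html.parser import HTMLParser


class MetaReader(HTMLParser):
    """Collects <META NAME=... CONTENT=...> pairs into a dictionary."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name, content = attrs.get("name"), attrs.get("content")
        # Skip tags such as HTTP-EQUIV that carry no NAME attribute.
        if name and content:
            self.meta[name.lower()] = content


html = (
    '<HEAD><TITLE>Example</TITLE>'
    '<META HTTP-EQUIV="content-type" content="text/html">'
    '<META NAME="description" content="Official Home Page">'
    '<META NAME="keywords" content="education, research, St. Louis">'
    '</HEAD>'
)
reader = MetaReader()
reader.feed(html)
# Keywords arrive as one comma-separated string; split them for indexing.
keywords = [k.strip() for k in reader.meta["keywords"].split(",")]
```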
33
Robots.txt
# robots, scram
User-agent: *
Disallow: /cgi-bin
Disallow: /TRANSCRIPTS
Disallow: /development
Disallow: /third
Disallow: /beta
Disallow: /java
Disallow: /shockwave
Disallow: /JOBS
Disallow: /pr
Disallow: /Interactive
Disallow: /alt_index.html
Disallow: /webmaster_logs
Disallow: /newscenter
Disallow: /virtual
Disallow: /DIGEST
Disallow: /QUICKNEWS
Disallow: /SEARCH
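
A well-behaved bot checks these rules before it requests a page. Python's standard urllib.robotparser module can do the check. The sketch below parses a shortened file in the style of the slide; the bot name ExampleBot and the www.example.com URLs are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# A shortened exclusion file in the style of the one on the slide.
ROBOTS_TXT = """\
# robots, scram
User-agent: *
Disallow: /cgi-bin
Disallow: /beta
Disallow: /SEARCH
"""

rules = RobotFileParser()
# parse() takes the file's lines, so no network access is needed here.
rules.parse(ROBOTS_TXT.splitlines())

BOT_NAME = "ExampleBot"  # hypothetical user agent name
allowed_home = rules.can_fetch(BOT_NAME, "http://www.example.com/index.html")
allowed_beta = rules.can_fetch(BOT_NAME, "http://www.example.com/beta/page.html")
```

In live use a bot would instead call set_url with the site's robots.txt address and then read, which downloads the file before can_fetch is consulted.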
34
A Friendly TOS Agreement
Use of Data and Products: The information on
government servers are in the public domain,
unless specifically annotated otherwise, and may
be used freely by the public. Before using
information obtained from this server special
attention should be given to the date & time of
the data and products being displayed. This
information shall not be modified in content and
then presented as official government material.
(from the National Weather Service)
35
An Unfriendly TOS Agreement
  • User will not access any software or data
    provided via indirect means or any method not
    intended or agreed upon by PCQuote. Robot
    programs (automated query systems) are strictly
    prohibited and any use of such systems will
    result in immediate termination of access.
  • (From PCQuote.Com)

36
Bot Ethics
  • Do unto others as you would have them do unto
    you.
  • Do unto others as the law would have you do unto
    them.

37
Detecting Bots
User Agent Name - What user agent name are you
specifying for your bot? If you are not using
anonymous access, your bot will stand out easily
in an access log.
Frequency of Access - How often are you accessing
the site, and is it always from the same IP
address? A very large volume of accesses from the
same IP address is usually a telltale sign of a
bot or spider.
Access Method - How is the bot accessing the
site? Is it only pulling text files and not
downloading any images? Web browsers used by
human users will almost always download all of
the images too. A bot typically goes after only
the text.
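
Two of these signals, request volume per IP address and the absence of image downloads, can be checked mechanically against an access log. The Python sketch below does so on made-up log records. The function name suspected_bots, the threshold of five hits, and the IP addresses are all assumptions chosen for illustration.

```python
from collections import Counter

IMAGE_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png")


def suspected_bots(requests, min_hits=5):
    """Return IPs with many hits and no image downloads.

    `requests` is an iterable of (ip, path) pairs; `min_hits` is an
    arbitrary threshold chosen for this example.
    """
    hits = Counter()
    image_hits = Counter()
    for ip, path in requests:
        hits[ip] += 1
        if path.lower().endswith(IMAGE_SUFFIXES):
            image_hits[ip] += 1
    return sorted(ip for ip, n in hits.items()
                  if n >= min_hits and image_hits[ip] == 0)


# Made-up log: one text-only heavy visitor, one light visitor,
# and one heavy visitor that also fetches an image.
log = [("10.0.0.1", f"/page{i}.html") for i in range(6)]
log += [("10.0.0.2", "/index.html"), ("10.0.0.2", "/logo.gif")]
log += [("10.0.0.3", f"/p{i}.html") for i in range(5)] + [("10.0.0.3", "/a.jpg")]
flagged = suspected_bots(log)
```

Here only 10.0.0.1 is flagged. A heuristic like this can misjudge text-only browsers and shared proxies, so it indicates a likely bot rather than proving one.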
38
Web Site Hostility
  • Usenet Postings - The web master can make Usenet
    postings to defame your bot and site. If your
    bot is an annoyance, most web masters will want
    to warn other web masters.
  • Legal Measures - If you are violating their terms
    of service, they may bring legal action against
    you.
  • Bot Exclusion File - We've already examined the
    robots.txt file. By using this file, the web
    master can request that your spider leave the
    site alone. This is the least severe means of
    thwarting a spider or bot. If this method fails,
    a more severe alternative will likely be pursued.
  • Filter based on IP - If a large volume of traffic
    is coming from a single IP address, that IP
    address could be denied access.
  • Filter based on Agent Name - If a large volume of
    traffic is coming from a single agent name, that
    agent name can be denied access.

39
The Future of Bots
  • More consistent web sites
  • XML
  • SOAP

40
What is SOAP?
  • Main Entry: 1soap. Pronunciation: 'sOp. Function:
    noun. Etymology: Middle English sope, from Old
    English sApe; akin to Old High German seifa,
    soap. Date: before 12th century. 1 a: a cleansing
    and emulsifying agent made usually by action of
    alkali on fat or fatty acids and consisting
    essentially of sodium or potassium salts of such
    acids; b: a salt of a fatty acid and a metal.
    2: SOAP OPERA
  • (from Merriam-Webster's 10th Collegiate
    Dictionary)

41
No, Really. What is SOAP?
  • Main Entry: 1soap. Pronunciation: 'sOp. Function:
    protocol. Etymology: Acronym for Simple Object
    Access Protocol. Date: before 20th century. 1: A
    cross-platform, XML-based protocol used to
    access objects that may not reside on the same
    local system. 2: SOAP OPERA: When different
    vendors' implementations are not compatible.

42
What does SOAP Offer?
  • XML based
  • Can use many different transfer protocols (e.g.
    HTTP, SMTP)
  • Asynchronous
  • Distributed Systems

43
Who is involved in SOAP?
  • Submitted to, and now governed by, the W3C.
  • Sun is incorporating SOAP into Java via JAXM (Java
    API for XML Messaging)
  • Microsoft is incorporating SOAP into the .NET
    platform.

44
SOAP Components
  • XML Based
  • SOAP Messages
  • Web Services Description Language (WSDL)

45
What is XML?
  • Another format for text files
  • Hierarchical, not flat.
  • Like HTML, based on SGML.
  • Much stricter than HTML.
  • Supported by a wide variety of tools

46
What does XML look like?
Node: A single element that can enclose other
elements.
Attribute: A value that is stored as part of a
node's begin tag.
Beginning Tag: A tag that does not start or end
with a /.
Ending Tag: A tag that starts with a /.
Begin-End Tag: A tag that ends, but does not
start, with a /.
<StudentList>
  <Student id="555">
    <first>Tom</first>
    <last>Smith</last>
    <middle></middle>
  </Student>
  <Student id="556">
    <first>Regina</first>
    <last>Smith</last>
    <middle>A</middle>
  </Student>
</StudentList>
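
Because XML is strict and hierarchical, a bot can read a document like this with a standard parser instead of scanning the text by hand. The Python sketch below uses the standard xml.etree.ElementTree module on the student list. The variable names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

document = """
<StudentList>
  <Student id="555">
    <first>Tom</first><last>Smith</last><middle></middle>
  </Student>
  <Student id="556">
    <first>Regina</first><last>Smith</last><middle>A</middle>
  </Student>
</StudentList>
"""

root = ET.fromstring(document)
# Each Student node carries an id attribute and three child nodes.
students = [
    (s.get("id"), s.findtext("first"), s.findtext("middle"), s.findtext("last"))
    for s in root.findall("Student")
]

# Unlike a forgiving HTML browser, an XML parser rejects a mismatched
# ending tag outright.
try:
    ET.fromstring("<Student><first>Tom</first></Stundent>")
    rejected = False
except ET.ParseError:
    rejected = True
```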
47
What are SOAP Messages?
  • Blocks of data that make up SOAP requests and
    responses
  • Blocks of data are in XML
  • SOAP can use a variety of underlying transfer
    protocols.

48
How are SOAP Messages Sent?
  • Always sent asynchronously
  • HTTP
  • SMTP

49
What does SOAP Look Like?
<SOAP-ENV:Envelope
    xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"
    SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
  <SOAP-ENV:Body>
    <m:GetCurrentTemperature xmlns:m="Some-URI">
      <symbol>KSTL</symbol>
    </m:GetCurrentTemperature>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>
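
Since a SOAP message is ordinary XML, it can be built and read with a general XML library. The Python sketch below assembles a request like the one on the slide and then extracts the symbol again. The function name build_soap_request is an assumption for illustration, "Some-URI" is the slide's own placeholder namespace, and the encodingStyle attribute is omitted for brevity.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"
SERVICE_NS = "Some-URI"  # placeholder namespace, as on the slide


def build_soap_request(symbol):
    """Return a SOAP envelope asking for the temperature at `symbol`."""
    # Ask ElementTree to write the envelope namespace with its usual prefix.
    ET.register_namespace("SOAP-ENV", SOAP_ENV)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}GetCurrentTemperature")
    ET.SubElement(call, "symbol").text = symbol
    return ET.tostring(envelope, encoding="unicode")


message = build_soap_request("KSTL")

# The receiver parses the same text and walks down to the argument.
parsed = ET.fromstring(message)
call = parsed.find(f"{{{SOAP_ENV}}}Body/{{{SERVICE_NS}}}GetCurrentTemperature")
symbol = call.find("symbol").text
```

The prefix chosen for the service namespace may differ from the slide's "m"; only the namespace itself matters to an XML parser.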

50
What is WSDL?
  • Web Services Description Language
  • A SOAP roadmap
  • Used to describe the kinds of SOAP messages a
    service expects.

51
Questions?