Retrieving Information on the Web - PowerPoint PPT Presentation

1
Retrieving Information on the Web
  • Presented by Md. Zaheed Iftekhar
  • Course: Information Retrieval (IFT6255)
  • Professor: Jian-Yun Nie
  • DIRO, University of Montreal
  • April 9th, 2003

2
Overview
  • Web search: general description
    • Introduction of web, search engines
    • Definitions
    • Major search engines
    • Current technologies
  • The future
    • Where is the technology heading
    • Proposal for further improvement
  • Conclusion
  • References

IR On The Web
3
History of the Web
  • In 1990 the World Wide Web (WWW) was developed by
    Tim Berners-Lee at CERN to organize research
    documents available on the Internet.
  • Combined idea of documents available by FTP with
    the idea of hypertext to link documents.
  • Developed initial HTTP network protocol, URLs,
    HTML, and first web server.

The web
4
World Wide Web
  • Ted Nelson developed idea of hypertext in 1965.
  • Doug Engelbart invented the mouse and built the
    first implementation of hypertext in the late
    1960s at SRI.
  • ARPANET came online in 1969 and expanded through
    the early 1970s.
  • The basic technology was in place in the 1970s
    but it took the PC revolution and widespread
    networking to inspire the web and make it
    practical.

Web Search
5
Web Browser
  • Early browsers were developed in 1992 (Erwise,
    ViolaWWW).
  • In 1993, Marc Andreessen and Eric Bina at UIUC
    NCSA developed Mosaic.
  • Andreessen joined with James Clark (Stanford
    professor and Silicon Graphics founder) to form
    Mosaic Communications Inc. in 1994 (renamed
    Netscape to avoid conflict with UIUC).
  • Microsoft licensed Mosaic through Spyglass,
    UIUC's commercial licensee, and used it to build
    Internet Explorer in 1995.

Web Browser
6
Web Search
  • By the late 1980s many files were available by
    anonymous FTP.
  • In 1990, Alan Emtage of McGill Univ. developed
    Archie (short for "archives").
    • Assembled lists of files available on many FTP
      servers.
    • Allowed regex search of these file names.
  • In 1993, Veronica and Jughead were developed to
    search names of text files available through
    Gopher servers.

7
Web Search
  • In 1993, early web robots (spiders) were built to
    collect URLs:
    • World Wide Web Wanderer
    • ALIWEB (Archie-Like Indexing for the WEB)
    • WWW Worm (indexed URLs and titles for regex
      search)
  • In 1994, Stanford grad students David Filo and
    Jerry Yang started manually collecting popular
    web sites into a topical hierarchy called Yahoo.

8
Web Search
  • In early 1994, Brian Pinkerton developed
    WebCrawler as a class project at U Wash. (later
    part of Excite and AOL).
  • The same year, Michael "Fuzzy" Mauldin at CMU
    developed Lycos.
    • First to use a standard IR system.
    • First to index a large set of pages.
  • In late 1995, DEC developed AltaVista. Supported
    Boolean operators, phrases, and reverse pointer
    (backlink) queries.
  • In 1998, Larry Page and Sergey Brin, Ph.D.
    students at Stanford, started Google.

9
Spiders (Robots/Bots/Crawlers)
  • Start with a comprehensive set of root URLs from
    which to start the search.
  • Follow all links on these pages recursively to
    find additional pages.
  • Index each newly found page in an inverted index
    as it is encountered.
  • May allow users to directly submit pages to be
    indexed (and crawled from).
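
The inverted index mentioned above can be illustrated with a minimal Python sketch. It assumes pages are already downloaded as plain text; the names build_inverted_index and search_all and the example.org URLs are hypothetical, and the tokenizer is deliberately crude.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(pages):
    """Map each term to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term].add(url)
    return index


def search_all(index, query):
    """Return URLs containing every query term (a Boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical two-page collection, for illustration only.
pages = {
    "http://example.org/a": "Web search engines crawl the web",
    "http://example.org/b": "Spiders index pages for search",
}
index = build_inverted_index(pages)
hits = search_all(index, "web search")
```

Here hits contains only the first page, since it is the only one with both terms. A real engine would also store term positions and weights; this sketch keeps only document membership.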

10
Web search
Breadth-first Search
11
Web search
Depth-first Search
12
Search Strategy Trade-Offs
  • Breadth-first explores uniformly outward from the
    root page but requires memory of all nodes on the
    previous level (exponential in depth). Standard
    spidering method.
  • Depth-first requires memory of only depth times
    branching-factor (linear in depth) but can get
    lost pursuing a single thread.
  • Both strategies can be implemented with a single
    queue of links (URLs); only the end at which new
    links are inserted differs.
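
To make the memory trade-off concrete, here is a small Python sketch under an idealized assumption: every page has the same number of outgoing links (the branching factor) and no page is reached twice. The function names are hypothetical.

```python
def bfs_frontier_size(branching, depth):
    """Pages held in memory by breadth-first search at a given depth:
    one full level of the tree, i.e. branching ** depth."""
    return branching ** depth


def dfs_frontier_size(branching, depth):
    """Pages held in memory by depth-first search at a given depth:
    the unexplored siblings along one path, at most branching * depth."""
    return branching * depth


# With 10 links per page, five levels below the root:
bfs_pages = bfs_frontier_size(10, 5)   # 100000
dfs_pages = dfs_frontier_size(10, 5)   # 50
```

The real web has a highly variable out-degree and many shared links, so these figures only show the growth rates (exponential versus linear), not realistic sizes.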

13
Avoiding Page Duplication
  • Must detect when revisiting a page that has
    already been spidered (web is a graph not a
    tree).
  • Must efficiently index visited pages to allow
    rapid recognition test.
    • Tree indexing (e.g. trie)
    • Hashtable
  • Index page using URL as a key.
    • Must canonicalize URLs (e.g. delete ending /)
    • Does not detect duplicated or mirrored pages.
  • Index page using textual content as a key.
    • Requires first downloading page.
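
Both keying schemes can be sketched in Python with hash-based sets. This is an illustrative sketch, not a complete canonicalizer: the rules chosen here (lowercase scheme and host, drop default ports, fragments and trailing slashes) are assumptions, and the names canonicalize, content_key and VisitedPages are hypothetical.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(url):
    """Reduce equivalent spellings of a URL to one form (simplified rules)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only when it is not the default for the scheme.
    if port is not None and port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{port}"
    # Delete the ending "/" but keep a bare "/" for the site root.
    path = parts.path.rstrip("/") or "/"
    # The fragment (#...) never reaches the server, so drop it.
    return urlunsplit((scheme, host, path, parts.query, ""))


def content_key(text):
    """Hash of the page text with whitespace runs collapsed."""
    normalized = " ".join(text.split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


class VisitedPages:
    """Hash-set lookup of pages already seen, by URL and by content."""

    def __init__(self):
        self.urls = set()
        self.contents = set()

    def is_new_url(self, url):
        key = canonicalize(url)
        if key in self.urls:
            return False
        self.urls.add(key)
        return True

    def is_new_content(self, text):
        key = content_key(text)
        if key in self.contents:
            return False
        self.contents.add(key)
        return True
```

The URL test can run before any download, while the content test catches mirrors at the cost of fetching the page first. Exact hashing only detects identical text; near-duplicate detection needs other techniques.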

14
Spidering Algorithm
Initialize queue (Q) with initial set of known URLs.
Until Q empty or page or time limit exhausted:
    Pop URL, L, from front of Q.
    If L is not to an HTML page (.gif, .jpeg, .ps, .pdf, .ppt)
        continue loop.
    If already visited L, continue loop.
    Download page, P, for L.
    If cannot download P (e.g. 404 error, robot excluded)
        continue loop.
    Index P (e.g. add to inverted index or store cached copy).
    Parse P to obtain list of new links N.
    Append N to the end of Q.
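
The algorithm above can be sketched in Python as follows. To keep the example runnable without network access, downloading and link parsing are passed in as functions, and a small in-memory dictionary (TOY_WEB, a made-up stand-in for real pages) plays the role of the web. The names spider, fetch and extract_links are assumptions for illustration.

```python
from collections import deque

NON_HTML_SUFFIXES = (".gif", ".jpeg", ".ps", ".pdf", ".ppt")


def spider(seed_urls, fetch, extract_links, page_limit=100):
    """Crawl from seed_urls; return {url: page} for every indexed page.

    fetch(url) returns a page, or None when it cannot be downloaded.
    extract_links(page) returns the list of URLs found on the page.
    """
    queue = deque(seed_urls)
    visited = set()
    index = {}
    while queue and len(index) < page_limit:
        url = queue.popleft()                  # pop from the front of Q
        if url.lower().endswith(NON_HTML_SUFFIXES):
            continue                           # not an HTML page
        if url in visited:
            continue                           # already visited
        visited.add(url)
        page = fetch(url)
        if page is None:
            continue                           # e.g. 404 or robot excluded
        index[url] = page                      # "index" by caching the page
        queue.extend(extract_links(page))      # append N to the end of Q
    return index


# Made-up pages standing in for the web.
TOY_WEB = {
    "http://example.org/": {
        "text": "home",
        "links": ["http://example.org/a", "http://example.org/b",
                  "http://example.org/logo.gif"],
    },
    "http://example.org/a": {
        "text": "page a",
        "links": ["http://example.org/", "http://example.org/missing"],
    },
    "http://example.org/b": {
        "text": "page b",
        "links": ["http://example.org/a"],
    },
}

crawled = spider(["http://example.org/"], TOY_WEB.get,
                 lambda page: page["links"])
```

The crawl indexes the three existing pages in breadth-first order and skips the image, the missing page and the links back to pages already visited. A production spider would add a time limit, politeness delays and robots.txt handling, none of which are shown here.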
15
Queueing Strategy
  • How new links are added to the queue determines
    the search strategy.
  • FIFO (append to end of Q) gives breadth-first
    search.
  • LIFO (add to front of Q) gives depth-first
    search.
  • Heuristically ordering the Q gives a focused
    crawler that directs its search towards
    interesting pages.
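
The three queueing strategies can be compared with one short Python sketch. The link graph and the relevance scores below are invented for illustration, and crawl_order is a hypothetical helper; links are marked as seen when they are enqueued, which is a simplification.

```python
import heapq
from collections import deque


def crawl_order(graph, start, strategy, score=None):
    """Return the order in which pages are visited.

    strategy: "fifo" (breadth-first), "lifo" (depth-first) or
    "focused" (highest score first; score(url) returns a number).
    """
    order, seen = [], {start}
    if strategy == "focused":
        counter = 0                            # tie-breaker for equal scores
        heap = [(-score(start), counter, start)]
        while heap:
            _, _, url = heapq.heappop(heap)
            order.append(url)
            for link in graph.get(url, []):
                if link not in seen:
                    seen.add(link)
                    counter += 1
                    heapq.heappush(heap, (-score(link), counter, link))
        return order
    if strategy not in ("fifo", "lifo"):
        raise ValueError(f"unknown strategy: {strategy}")
    frontier = deque([start])
    while frontier:
        url = frontier.popleft()               # always pop from the front
        order.append(url)
        new_links = []
        for link in graph.get(url, []):
            if link not in seen:
                seen.add(link)
                new_links.append(link)
        if strategy == "fifo":
            frontier.extend(new_links)         # append to the end of Q
        else:
            # Add to the front of Q, keeping the page's own link order.
            frontier.extendleft(reversed(new_links))
    return order


# Invented link graph and relevance scores.
GRAPH = {"root": ["a", "b"], "a": ["c", "d"], "b": ["e"]}
RELEVANCE = {"root": 0, "a": 1, "b": 5, "c": 2, "d": 0, "e": 4}

bfs = crawl_order(GRAPH, "root", "fifo")
dfs = crawl_order(GRAPH, "root", "lifo")
focused = crawl_order(GRAPH, "root", "focused", score=RELEVANCE.get)
```

On this graph, FIFO visits level by level (root, a, b, c, d, e), LIFO follows one branch to the bottom first (root, a, c, d, b, e), and the focused order goes to the highest-scoring page available (root, b, e, a, c, d). How a real focused crawler scores a link before downloading it is a separate problem not covered here.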

16
Source: http://www.bruceclay.com
17
Google
  • Google is a search engine that maintains its own
    spider-based index.
  • Google also has a directory that is powered by
    the Open Directory.
  • Google supports:
    • Boolean search
    • Phrase
    • Similarity
    • Proximity

Source: lookoff.com, http://www.bruceclay.com
18
Google
  • Strengths
    • The interface is extremely simple, yet result
      quality does not noticeably suffer
    • Accuracy for common topics
  • Weaknesses
    • Lack of power features
    • Coverage of the Internet is much less than some
      competitors
    • No OR keyword support for Boolean searches

Source: lookoff.com, http://www.bruceclay.com
19
Yahoo!
  • Strengths
    • Coverage of the Internet is excellent
    • Links are generally quite up to date and free of
      spam and poor quality sites
    • Human maintainers ensure that sites are placed
      correctly within the relevant topic
    • The search interface is very fast
    • Indexed search results are shown after Yahoo's
      own topic areas
    • Accuracy for common topics
  • Weaknesses
    • The search interface is very effective for
      general searches but offers little for advanced
      searches
    • Not all relevant sites are listed in Yahoo; they
      have to be submitted and accepted.

Source: lookoff.com, http://www.bruceclay.com
20
Ask Jeeves
  • Strengths
    • A simple interface makes it very easy to form
      queries. Excellent for new users and children.
    • If your query corresponds to a pre-packaged
      answer, you can expect surprisingly good
      results. Millions of bundled answers provide
      results that are superior to standard index
      searches.
    • The site is actively maintained.
    • An integrated metacrawler provides results for
      your search from Goto, AltaVista, Mamma and
      4Anything.
    • The search code is very fast.
  • Weaknesses
    • The site reportedly accepts payment for top
      spots, sometimes placing links of dubious
      quality at the top of results.
    • No advanced search.
    • Very little power in constructing your keywords.
    • Little control over filtering results.

21
MSN
  • Strengths
    • Very active news portal with updated and
      well-presented headlines.
    • Integrated single sign-on with Hotmail, MSN, etc.
    • Configurable interface lets you customize
      content, layout and colors.
    • Very actively maintained.
    • Many interesting (although often
      commercially-oriented) services tied into the MSN
      network.
    • Localized versions for quite a few countries,
      providing more specific content and news feeds.
    • Ability to save (i.e. tag) results to quickly
      filter search results into a candidates list.
  • Weaknesses
    • Not a low-bandwidth interface. Slow modem users
      should beware.
    • Mediocre search interface
    • Less web coverage than most search engines

22
Program        Pages      Class      FAQ  FTP  Index  Meta  Misc  News  Portal
Dejanews       300M msg   Best       N    N    N      N     Y     Y     N
Raging         250M       Best       N    N    Y      N     N     N     N
Yahoo          500T       Best       N    N    N      N     N     N     Y
AllTheWeb      300M       Excellent  N    N    Y      N     N     N     N
AltaVista      250M       Excellent  N    N    Y      N     N     Y     Y
FAQS           3300 FAQs  Excellent  Y    N    N      N     Y     N     N
FTPSearch      100M file  Excellent  N    Y    N      N     N     N     N
Search.com     N/A        Excellent  N    N    N      Y     N     N     N
About          ?          Good       N    N    N      N     Y     N     Y
AskJeeves      8M Ques.   Good       Y    N    Y      N     N     N     Y
DirectHit      ?          Good       N    N    N      N     N     N     Y
Excite         ?          Good       N    N    Y      N     N     Y     Y
Go             50M?       Good       N    N    Y      N     N     N     Y
Google         100M?      Good       N    N    Y      N     N     N     N
HotBot         150M?      Good       N    N    Y      N     N     N     Y
Lycos          250M?      Good       N    Y    Y      N     N     N     Y
MetaCrawler    N/A        Good       N    N    N      Y     N     N     N
MSN            120M?      Good       N    N    Y      N     N     N     Y
NorthernLight  200M?      Good       N    N    Y      N     N     Y     N
OpenDirectory  1M?        Good       N    N    N      N     N     N     Y
WebCenter      500T?      Good       N    N    N      N     N     N     Y
DogPile        N/A        Okay       N    Y    N      Y     Y     Y     Y
GoTo           ?          Okay       N    N    Y      N     N     N     Y
InfoSpace      very few   Okay       N    N    Y      N     Y     N     N
iWon           350M?      Okay       N    N    Y      N     Y     N     N
Snap           ?          Okay       N    N    Y      N     N     N     Y
Mamma          n/a        Weak       N    N    N      Y     N     N     N
27
Conclusion
  • Intelligent agent technology could be used to
    improve the searching method.
  • Quantum search methods could also be explored.

28
Web search
Thank you all!