Title: Retrieving Information on the Web
1 Retrieving Information on the Web
- Presented by Md. Zaheed Iftekhar
- Course: Information Retrieval (IFT6255)
- Professor Jian E. Nie
- DIRO, University of Montreal
- April 9th, 2003
2 Overview
- Web search general description
- Introduction of web, search engines
- Definitions
- Major search engines
- Current technologies
- The future
- Where is the technology heading
- Proposal for further improvement
- Conclusion
- References
IR On The Web
3 History of the Web
- In 1990 the World Wide Web (WWW) was developed by
Tim Berners-Lee at CERN to organize research
documents available on the Internet.
- Combined idea of documents available by FTP with
the idea of hypertext to link documents.
- Developed initial HTTP network protocol, URLs,
HTML, and first web server.
The web
4 World Wide Web
- Ted Nelson developed idea of hypertext in 1965.
- Doug Engelbart invented the mouse and built the
first implementation of hypertext in the late
1960s at SRI.
- ARPANET was developed in the early 1970s.
- The basic technology was in place in the 1970s
but it took the PC revolution and widespread
networking to inspire the web and make it
practical.
Web Search
5 Web Browser
- Early browsers were developed in 1992 (Erwise,
ViolaWWW).
- In 1993, Marc Andreessen and Eric Bina at UIUC
NCSA developed Mosaic.
- Andreessen joined with James Clark (Stanford
Prof. and Silicon Graphics founder) to form
Mosaic Communications Inc. in 1994 (which became
Netscape to avoid conflict with UIUC).
- Microsoft licensed the original Mosaic from UIUC
and used it to build Internet Explorer in 1995.
Web Browser
6 Web Search
- By late 1980s many files were available by
anonymous FTP.
- In 1990, Alan Emtage of McGill Univ. developed
Archie (short for archives).
- Assembled lists of files available on many FTP
servers.
- Allowed regex search of these file names.
- In 1993, Veronica and Jughead were developed to
search names of text files available through
Gopher servers.
7 Web Search
- In 1993, early web robots (spiders) were built to
collect URLs:
- Wanderer
- ALIWEB (Archie-Like Index of the WEB)
- WWW Worm (indexed URLs and titles for regex
search)
- In 1994, Stanford grad students David Filo and
Jerry Yang started manually collecting popular
web sites into a topical hierarchy called Yahoo.
8 Web Search
- In early 1994, Brian Pinkerton developed
WebCrawler as a class project at U Wash. (became
part of Excite and AOL).
- The same year, Fuzzy Mauldin, a grad student at
CMU, developed Lycos.
- First to use a standard IR system.
- First to index a large set of pages.
- In late 1995, DEC developed AltaVista. Supported
boolean operators, phrases, and reverse pointer
queries.
- In 1998, Larry Page and Sergey Brin, Ph.D.
students at Stanford, started Google.
9 Spiders (Robots/Bots/Crawlers)
- Start with a comprehensive set of root URLs from
which to start the search.
- Follow all links on these pages recursively to
find additional pages.
- Index all newly found pages in an inverted index
as they are encountered.
- May allow users to directly submit pages to be
indexed (and crawled from).
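The indexing step above can be illustrated with a minimal sketch of an inverted index, which maps each term to the set of pages that contain it. The function names, the toy pages, and the simple regex tokenizer are assumptions made for illustration; a real engine would add stemming, stop-word handling, and ranking.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms (a simplification)."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(pages):
    """Map each term to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term].add(url)
    return index

def boolean_and(index, query):
    """Return the URLs that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)

# Hypothetical pages standing in for documents found by a spider.
PAGES = {
    "u1": "Web search engines",
    "u2": "Search the web with spiders",
    "u3": "Quantum search",
}
INDEX = build_inverted_index(PAGES)
print(sorted(boolean_and(INDEX, "web search")))
```

Because each term points directly at its pages, a conjunctive query only needs to intersect a few posting sets instead of scanning every page.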
10 Web search
Breadth-first Search
11 Web search
Depth-first Search
12 Search Strategy Trade-Offs
- Breadth-first explores uniformly outward from the
root page but requires memory of all nodes on the
previous level (exponential in depth). Standard
spidering method.
- Depth-first requires memory of only depth times
branching-factor (linear in depth) but gets
lost pursuing a single thread.
- Both strategies are implementable using a queue of
links (URLs).
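The memory trade-off can be checked on a small synthetic case. The sketch below is an illustration under stated assumptions, not part of the original slides: it walks a complete tree with a fixed branching factor (a stand-in for the link graph) and records the largest number of pending nodes under each strategy. The function name and parameters are hypothetical.

```python
from collections import deque

def peak_frontier(branching, depth, breadth_first):
    """Traverse a complete tree and return the largest frontier size seen.

    Each frontier entry is just the level of a pending node; the root is level 0
    and nodes at level `depth` are leaves.
    """
    frontier = deque([0])
    peak = 1
    while frontier:
        # Breadth-first takes the oldest entry, depth-first the newest.
        level = frontier.popleft() if breadth_first else frontier.pop()
        if level < depth:
            frontier.extend([level + 1] * branching)
        peak = max(peak, len(frontier))
    return peak

bfs_peak = peak_frontier(branching=3, depth=4, breadth_first=True)
dfs_peak = peak_frontier(branching=3, depth=4, breadth_first=False)
print(bfs_peak, dfs_peak)
```

With branching factor 3 and depth 4, the breadth-first frontier grows to the size of the deepest level (3^4 entries), while the depth-first frontier stays below depth times branching factor.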
13 Avoiding Page Duplication
- Must detect when revisiting a page that has
already been spidered (web is a graph not a
tree).
- Must efficiently index visited pages to allow
rapid recognition test.
- Tree indexing (e.g. trie)
- Hashtable
- Index page using URL as a key.
- Must canonicalize URLs (e.g. delete ending /)
- Does not detect duplicated or mirrored pages.
- Index page using textual content as a key.
- Requires first downloading page.
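Both keying options can be sketched with the Python standard library. The normalization rules below (lowercase scheme and host, drop default ports and fragments, strip the trailing slash) are one plausible set of assumptions rather than a complete canonicalization scheme, and the helper names are hypothetical. User credentials in the URL are dropped for simplicity.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def canonicalize(url):
    """Return a normalized form of the URL to use as a hash-table key."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    port = parts.port
    # Keep the port only when it differs from the scheme's default.
    if port is None or DEFAULT_PORTS.get(scheme) == port:
        netloc = host
    else:
        netloc = f"{host}:{port}"
    # Delete the ending slash, but keep "/" for the site root.
    path = parts.path.rstrip("/") or "/"
    # The fragment is dropped because it never changes the downloaded page.
    return urlunsplit((scheme, netloc, path, parts.query, ""))

def content_key(page_text):
    """Return a fingerprint of the page text, ignoring whitespace differences."""
    normalized = " ".join(page_text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

visited = set()
for link in ["HTTP://Example.COM:80/a/b/#top", "http://example.com/a/b"]:
    key = canonicalize(link)
    print(key, "already seen" if key in visited else "new")
    visited.add(key)
```

The URL key is cheap but treats mirrored copies as different pages; the content key catches exact copies at the cost of downloading each page first.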
14 Spidering Algorithm
Initialize queue (Q) with initial set of known URLs.
Until Q empty or page or time limit exhausted:
    Pop URL, L, from front of Q.
    If L is not to an HTML page (.gif, .jpeg, .ps, .pdf, .ppt)
        continue loop.
    If already visited L, continue loop.
    Download page, P, for L.
    If cannot download P (e.g. 404 error, robot excluded)
        continue loop.
    Index P (e.g. add to inverted index or store cached copy).
    Parse P to obtain list of new links N.
    Append N to the end of Q.
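The loop can be sketched in Python as follows. To keep the example runnable without network access, the download step is passed in as a `fetch` function and a small in-memory dictionary (`TOY_WEB`, a hypothetical stand-in for real sites) plays the role of the web. Robot exclusion and time limits are omitted, and "indexing" is reduced to storing a cached copy, so this is a sketch of the control flow rather than a full spider.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

SKIP_EXTENSIONS = (".gif", ".jpeg", ".ps", ".pdf", ".ppt")

class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed_urls, fetch, page_limit=100):
    """Breadth-first spider; fetch(url) returns HTML text or None on failure."""
    queue = deque(seed_urls)
    visited = set()
    index = {}  # url -> cached copy of the page
    while queue and len(index) < page_limit:
        url = queue.popleft()                      # pop from the front of Q
        if url.lower().endswith(SKIP_EXTENSIONS):  # not an HTML page
            continue
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)
        if page is None:                           # e.g. 404 error
            continue
        index[url] = page
        parser = LinkParser()
        parser.feed(page)
        # Resolve relative links and append them to the end of Q.
        queue.extend(urljoin(url, link) for link in parser.links)
    return index

TOY_WEB = {
    "http://example.org/": '<a href="/a.html">A</a> <a href="/b.html">B</a> <a href="/logo.gif">logo</a>',
    "http://example.org/a.html": '<a href="/">home</a> <a href="/missing.html">gone</a>',
    "http://example.org/b.html": '<a href="/a.html">A</a>',
}
crawled = crawl(["http://example.org/"], TOY_WEB.get)
print(list(crawled))
```

In the toy run the image link is skipped, the missing page is dropped after the failed download, and the links back to already visited pages are ignored.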
15 Queueing Strategy
- How new links are added to the queue determines
search strategy.
- FIFO (append to end of Q) gives breadth-first
search.
- LIFO (add to front of Q) gives depth-first
search.
- Heuristically ordering the Q gives a focused
crawler that directs its search towards
interesting pages.
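A small sketch makes the effect of the queueing discipline concrete. The link graph `LINKS`, the `score` function, and the strategy names are hypothetical choices for illustration; the heuristic ordering is implemented here with a priority queue, which is one common way to do it.

```python
import heapq
from collections import deque

def visit_order(links, start, strategy, score=None):
    """Return the order in which pages are visited under a queueing strategy.

    strategy is "fifo", "lifo", or "best" (highest score first).
    """
    seen = {start}
    order = []
    if strategy == "best":
        heap = [(-score(start), start)]  # negate so the best score pops first
        while heap:
            _, url = heapq.heappop(heap)
            order.append(url)
            for nxt in links.get(url, []):
                if nxt not in seen:
                    seen.add(nxt)
                    heapq.heappush(heap, (-score(nxt), nxt))
        return order
    queue = deque([start])
    while queue:
        url = queue.popleft()            # always pop from the front of Q
        order.append(url)
        new = [n for n in links.get(url, []) if n not in seen]
        seen.update(new)
        if strategy == "fifo":
            queue.extend(new)                 # append to end of Q
        else:
            queue.extendleft(reversed(new))   # add to front of Q
    return order

LINKS = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1"]}

print(visit_order(LINKS, "root", "fifo"))
print(visit_order(LINKS, "root", "lifo"))
# Hypothetical heuristic: pages whose name starts with "b" are "interesting".
print(visit_order(LINKS, "root", "best", score=lambda u: 1 if u.startswith("b") else 0))
```

Only the insertion rule changes between the three runs; the rest of the loop is identical, which is why a single spider can switch strategies by swapping its queue.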
16 Source: http://www.bruceclay.com
17 Google
- Google is a search engine that maintains its own
spider-based index.
- Google also has a directory that is powered by
the Open Directory.
- Google supports:
- Boolean search
- Phrase
- Similarity
- Proximity
Source: lookoff.com, http://www.bruceclay.com
18 Google
- Strengths
- The interface is tremendously simple, but the
quality of results is not significantly impeded.
- Accuracy for common topics
- Weaknesses
- Lack of power features
- Coverage of the Internet is much less than some
competitors
- No OR keyword support for boolean searches
Source: lookoff.com, http://www.bruceclay.com
19 Yahoo!
- Strengths
- Coverage of the Internet is excellent
- Links are generally quite up to date and free of
spam and poor quality sites
- Human maintainers ensure that sites are placed
correctly within the relevant topic
- The search interface is very fast
- Yahoo integrates with indexed searches after
presenting Yahoo topic areas
- Accuracy for common topics
- Weaknesses
- The search interface is very effective for
general searches but could be better for power
searches
- Not all relevant sites are listed in Yahoo: they
have to be submitted and accepted.
Source: lookoff.com, http://www.bruceclay.com
20 Ask Jeeves
- Strengths
- A simple interface makes it very easy to form
queries. Excellent for new users and children.
- If your query corresponds to a pre-packaged
answer, you can expect some surprisingly good
results. Millions of bundled answers provide
premium answers that are superior to standard
index searches.
- The site is actively maintained.
- An integrated metacrawler provides results for
your search from Goto, AltaVista, Mamma and
4Anything.
- The search code is very fast.
- Weaknesses
- The site reportedly accepts payment for top spots,
sometimes placing dubious quality links at the
top of results.
- No advanced search.
- Very little power in constructing your keywords.
- Little control over filtering results.
21 MSN
- Strengths
- Very active news portal with updated and
well-presented headlines.
- Integrated single sign-on with Hotmail, MSN, etc.
- Configurable interface lets you customize
content, layout and colors.
- Very actively maintained.
- Many interesting (although often
commercially-oriented) services tied into the MSN
network.
- Nationalized versions for quite a few countries
providing more specific content and news feeds.
- Ability to save (i.e. tag) results to quickly
filter search results into a candidates list.
- Weaknesses
- Not a low-bandwidth interface. Slow modem users
should beware.
- Mediocre search interface
- Less web coverage than most search engines
22
Program        Pages       Class      FAQ  FTP  Index  Meta  Misc  News  Portal
Dejanews       300M msg    Best       N    N    N      N     Y     Y     N
Raging         250M        Best       N    N    Y      N     N     N     N
Yahoo          500T        Best       N    N    N      N     N     N     Y
AllTheWeb      300M        Excellent  N    N    Y      N     N     N     N
AltaVista      250M        Excellent  N    N    Y      N     N     Y     Y
FAQS           3300 FAQs   Excellent  Y    N    N      N     Y     N     N
FTPSearch      100M file   Excellent  N    Y    N      N     N     N     N
Search.com     N/A         Excellent  N    N    N      Y     N     N     N
About          ?           Good       N    N    N      N     Y     N     Y
AskJeeves      8M Ques.    Good       Y    N    Y      N     N     N     Y
DirectHit      ?           Good       N    N    N      N     N     N     Y
Excite         ?           Good       N    N    Y      N     N     Y     Y
Go             50M?        Good       N    N    Y      N     N     N     Y
Google         100M?       Good       N    N    Y      N     N     N     N
HotBot         150M?       Good       N    N    Y      N     N     N     Y
Lycos          250M?       Good       N    Y    Y      N     N     N     Y
MetaCrawler    N/A         Good       N    N    N      Y     N     N     N
MSN            120M?       Good       N    N    Y      N     N     N     Y
NorthernLight  200M?       Good       N    N    Y      N     N     Y     N
OpenDirectory  1M?         Good       N    N    N      N     N     N     Y
WebCenter      500T?       Good       N    N    N      N     N     N     Y
DogPile        N/A         Okay       N    Y    N      Y     Y     Y     Y
GoTo           ?           Okay       N    N    Y      N     N     N     Y
InfoSpace      very few    Okay       N    N    Y      N     Y     N     N
iWon           350M?       Okay       N    N    Y      N     Y     N     N
Snap           ?           Okay       N    N    Y      N     N     N     Y
Mamma          n/a         Weak       N    N    N      Y     N     N     N
27 Conclusion
- Intelligent agent technology could be used to
improve the searching method.
- Quantum search methods could also be explored.
28 Web search
Thank you all!