Title: CS5286 Algorithms And Techniques for Web Search
1 CS5286 Algorithms And Techniques for Web Search
Objective: Provide a practical introduction to algorithms and techniques for information retrieval over the Internet.
2 Contact
- Lecturer
- Professor DENG, Xiaotie
- Room Y6321, Ext. 8632, Email: csdeng
- TA
- SUN Wei
- Room CYC2207, Ext. 8030, Email: sunwei_at_cs
3 Assessment
- Coursework: 50%
- 20 marks for two quizzes, each worth 10% of the final mark.
- 27 marks for a group project (2-3 people in a group).
- 3 participation points, at Discussion Forum, tutorials and classes (one point each).
- Examination: 50%
- One 1.5-hour examination.
- At least 30 examination marks are required to pass.
4 Reference Books
- Modern Information Retrieval, by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Addison Wesley, 1999.
- GUIDE TO SEARCH ENGINES, by Wes Sonnenreich and Tim Macinta, Wiley Computer Publishing, 1998.
5 Students Will Acquire The Following
- Web access
- Automated access to existing search engines
- The use of spiders/robots for web searching
- Collection of visitor information to one's own web site
- Web mining
- Ranking techniques for web sites on specific topics
- Automated abstract generation
- User profile
- Information retrieval
- Basic Models
- Major Query Operations
- Indexing and Searching
- New research topics
6 Some Helpful Web Sites
- A history of search engines
- http://www.wiley.com/legacy/compbooks/sonnenreich/webdev/history.html
- Java and the class URL (look under the package java.net)
- http://java.sun.com/j2se/1.3/docs/api/index.html
- Free search engines written in Java
- http://www.freewarejava.com/applets/search.shtml
- Robots
- http://www.robotstxt.org/wc/robots.html
7 Tentative Lecture Plan
- The Internet and Web
- Collection of Information over the Web
- Quiz 1
- Models of Information Retrieval
- Query techniques
- Quiz 2
- Start of Project
- Text Operations
- Indexing and Searching Techniques
8 Tentative Tutorial Session Plan
- The purpose: to provide hands-on learning experience
- Materials to be covered
- Review of Java and Link to the Internet
- Functionality of Spider/Robot
- Access to Major Search Engines
- A simple search engine in Java
- In addition, we will conduct the following in tutorial sessions
- Submission and discussion of project proposal and plan
- Project Presentation
9 Plan For The Group Project
- Two or three people in a group
- It is best to do a project that uses one of the following available tools for some application problem.
- Spider/Robot
- Major Search Engines
- The simple search engine in Java
- Some examples of possible projects
- Build a network map of co-authorship relations.
- Build relationship networks by Internet information retrieval.
- Design a method to test which search engine covers more webpages.
- Start your project as early as possible.
10 Pre-Requisites
- Know how to program in Java.
- Or
- Be capable of learning Java programming in one week or so.
- DROP the course if you don't.
- We will have a quick quiz on Java to determine whether the course is suitable for you.
11 Lecture 1: Introduction
12 A Simple Search Engine Architecture
[Diagram components: Web, Spider, Indexer, Database, Query Engine, Query Interface, User]
13 Major issues
- Spider and communication between computer and the Internet
- Data/document model for information retrieval
- Query protocol design
- User profile techniques
- Interactive Information Retrieval Technique Design
14 Spiders
- Automatically retrieve web pages
- Start with a URL
- Retrieve the associated web page
- Find all URLs on the web page
- Recursively retrieve not-yet-searched URLs
- Algorithmic Issues
- How to choose the next URL?
- Avoid overloaded sub-networks
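The retrieval loop on this slide can be sketched in Java as follows. This is only an illustration under stated assumptions: the web is replaced by a hypothetical in-memory map from a page name to the links found on it, so nothing is fetched over the network, and the "next URL" is chosen first-in first-out with a queue instead of recursion. The class name CrawlSketch and the parameter maxPages are made up for this example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

class CrawlSketch {
    // Breadth-first crawl. The map lookup stands in for
    // "retrieve the page and find all URLs on it".
    static List<String> crawl(String startUrl,
                              Map<String, List<String>> linksOnPage,
                              int maxPages) {
        Set<String> seen = new HashSet<>();           // URLs already queued or visited
        Queue<String> frontier = new ArrayDeque<>();  // not-yet-searched URLs
        List<String> visitOrder = new ArrayList<>();
        seen.add(startUrl);
        frontier.add(startUrl);
        while (!frontier.isEmpty() && visitOrder.size() < maxPages) {
            String url = frontier.remove();           // choose the next URL (FIFO here)
            visitOrder.add(url);
            for (String link : linksOnPage.getOrDefault(url, List.of())) {
                if (seen.add(link)) {                 // add() returns false for a duplicate
                    frontier.add(link);
                }
            }
        }
        return visitOrder;
    }

    public static void main(String[] args) {
        // Hypothetical four-page web: a links to b and c, and so on.
        Map<String, List<String>> web = Map.of(
            "a", List.of("b", "c"),
            "b", List.of("a", "d"),
            "c", List.of("d"),
            "d", List.of());
        System.out.println(crawl("a", web, 10)); // prints [a, b, c, d]
    }
}
```

The seen set is what keeps the spider from revisiting pages that link back to each other. A different policy for choosing the next URL (for example, one that spreads requests across hosts to avoid overloading a sub-network) would replace the plain queue.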
15 Indexer
- Selects terms to index for a document
- May utilise co-operation from web page authors through Meta tags to indicate specific terms to index
- <META name="keywords" content="information retrieval">
- Algorithmic issues
- How to choose terms/phrases or other entities to index so as to accurately and efficiently respond to user queries
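As a rough illustration of how an indexer might read author-supplied terms from a keywords Meta tag, here is a minimal Java sketch. It assumes the attributes appear in the order name, content, with double-quoted values and comma-separated terms; a real indexer would use an HTML parser. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaKeywordsSketch {
    // Matches <meta name="keywords" content="..."> (attributes in that order only).
    private static final Pattern KEYWORDS = Pattern.compile(
        "<meta\\s+name\\s*=\\s*\"keywords\"\\s+content\\s*=\\s*\"([^\"]*)\"",
        Pattern.CASE_INSENSITIVE);

    // Returns the comma-separated terms of the first keywords tag, lower-cased.
    static List<String> keywords(String html) {
        List<String> terms = new ArrayList<>();
        Matcher m = KEYWORDS.matcher(html);
        if (m.find()) {
            for (String term : m.group(1).split(",")) {
                String t = term.trim().toLowerCase();
                if (!t.isEmpty()) {
                    terms.add(t);
                }
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        String page = "<html><head>"
            + "<META name=\"keywords\" content=\"information retrieval, Web Search\">"
            + "</head></html>";
        System.out.println(keywords(page));
    }
}
```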
16 Database
- Tradeoff of Hardware/Speed Efficiency
- Algorithmic issues
- efficiency in space
- redundancy as trade-off for speed in query response
- Cost efficiency
- How many computers to use?
- How to distribute load efficiently?
17 Query Engine
- Return the most relevant documents for queries
- Algorithmic Issues
- document model
- relevance analysis
18 Query Interface
- Analyse user profiles
- Generate user-specific query results
- Algorithmic issues
- Design of efficient and user-friendly query
protocols
19 Interesting Problems
- Finding the needle in the haystack
- search for certain specific information on the Internet
- User-specific ranking of documents on the web
- how to collect and apply user information to provide better service
- Trust analysis of information on the web
- avoid providing false information
- Trustworthiness analysis of virtual identities over the Internet.
- http://www.firstgov.gov/Citizen/Topics/Internet_Fraud.shtml
20 Some Facts about the Internet
21 Statistics About Internet
- Internet Domain Growth
- http://www.isc.org/index.pl?/ops/ds/
- How to conduct Internet Domain Survey
- http://www.isc.org/ds/faq.html
22 Internet Growth Charts
23 Internet Provides Varieties of Information
- Text documents
- Multimedia files
- Interactive information services
- Internet group membership services
- Databases
- Frauds: Trojan horses and phishing tricks
24 Major Features of Information Retrieval on the Internet
- Large amount of information
- Rapid information update
- Dynamic hyperlink structure
- Varieties of data format, language, qualities
25 Some Difficulties for Internet Information Retrieval Systems
- Diversified user base (from laymen to computer nerds).
- Could we develop an evolving system that adapts to the user?
- Language ambiguity
- This becomes an especially important issue because of the variety of data on the Internet
- How do we collect and apply user profiling techniques to resolve it?
26 Search Engines Today
27 Evolving Search Engines
- Tools for finding information on the Web
- Problem: hidden databases, e.g. New York Times
- Directory
- A hand-constructed hierarchy of topics (e.g. Yahoo)
- Search engine
- A machine-constructed index (usually by keyword)
- Interactive Searching
- http://www.learnthenet.com/english/html/78tutorial.htm
- Specialized Searching
- Google Scholar: http://www.scholar.google.com/
- Guide to find search engines
- http://www.searchenginecolossus.com/
- New trends in search engines
- http://www.searchengineshowdown.com/
28 Coverage of Search Engine
- Number of web pages covered
- Self-claimed.
- May include pages known only through links, without the page itself being analyzed
- Page Depth
- The maximum amount of information indexed for an individual webpage.
- http://blog.searchenginewatch.com/blog/041111-084221
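29 Search Engine Sizes (Apr. 6, 2001)
[Chart: index size by search engine. Estimated total web pages: 2 billion. Shaded data for GG and Inktomi includes pages indexed but not visited.]
[Chart: searches/day (millions). Values as extracted: 100, 12, 50, 47, 50, 5; the engine for each value is not recoverable from this transcript.]
Legend: AV = Altavista, EX = Excite, FAST = FAST, GG = Google, Go = Go (Infoseek), INK = Inktomi, NL = Northern Light, WT = WebTop.com
Source: SearchEngineWatch.com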
30 Search Engine Sizes (Dec 11, 2001)
[Chart: index size by search engine]
Legend: AV = Altavista, EX = Excite, FAST = FAST, GG = Google, Go = Go (Infoseek), INK = Inktomi, NL = Northern Light, WT = WebTop.com
Source: http://searchenginewatch.com/reports/sizes.html
31 Search Engine Size Trends
Source: http://searchenginewatch.com/reports/article.php/2156481trend
32 Search Engines Disjointness
Source: Search Engine Showdown
33 Search Engines Uniqueness
Source: http://www.searchengineshowdown.com/stats/overlap.shtml
34 Time Spent Per Visitor (minutes) by Search Engine, April 1999
Legend: AV = Altavista, EX = Excite, Go/IS = Go/Infoseek, GT = GoTo, HB = Hotbot, LS = LookSmart, LY = Lycos, MSN = MSN, NS = Netscape, WC = Webcrawler, YH = Yahoo
Source: http://www.nielsen-netratings.com/
35 Time Spent Per Visitor (minutes) by Search Engine, June 2002
Legend: MSN = MSN, YH = Yahoo, GG = Google, AOL = AOL, AJ = Ask Jeeves, IS = InfoSpace, OVR = Overture (GoTo), AV = AltaVista, NS = Netscape, LS = LookSmart, LY = Lycos, DP = Dogpile
Source: http://searchenginewatch.com/reports/netratings.html
36 Total Hours Spent (millions) by Search Engine, June 2002
Legend: MSN = MSN, YH = Yahoo, GG = Google, AOL = AOL, AJ = Ask Jeeves, IS = InfoSpace, OVR = Overture (GoTo), AV = AltaVista, NS = Netscape, LS = LookSmart, LY = Lycos, DP = Dogpile
Source: http://searchenginewatch.com/reports/netratings.html
37 Audience Reach by Search Engine, July 2001
Legend: AJ = Ask Jeeves, AV = Altavista, DH = Direct Hit, DP = Dogpile, EX = Excite, GG = Google, GO = Go/Infoseek, G2N = GoTo, HB = Hotbot, iWN = iWon, LS = LookSmart, LY = Lycos, MC = Metacrawler, MM = Mamma, MSN = MSN, NL = Northern Light, NS = Netscape, WC = Webcrawler, YH = Yahoo
Audience Reach: percentage of active surfers visiting during the month. Totals exceed 100% because of overlap.
Source: http://wreportus.mediametrix.com/clientCenter.html
38 Audience Reach by Search Engine, Mar. 2002
Audience Reach: percentage of active surfers visiting during the month. Totals exceed 100% because of overlap.
Legend: MSN = MSN, YH = Yahoo, GG = Google, AOL = AOL, AJ = Ask Jeeves, LS = LookSmart, ISP = InfoSpace, NS = Netscape, OVR = Overture (GoTo)
Source: http://searchenginewatch.com/reports/mediametrix.html
39 Start With Spider
40 Spider Architecture
[Diagram labels: Web Space; HTTP Request; HTTP Response; spiders (five url_spider instances); Shared URL pool (Get a URL, Add a new URL); Database Interface; Database]
41 Communication
- How a web browser communicates with the computer
- How a browser communicates with the Internet
- How data travels through the Internet
- How a web browser communicates with a web server
42 Web Browser
- A primary tool to gather information from the Internet
- Netscape Navigator (now Firefox)
- Microsoft's Internet Explorer
43 Web Server
- It provides the connection of the computer to the Internet
- Serving Web pages to browsers
- It usually runs on TCP port 80
44 Uniform Resource Locator (URL)
- The address of a web page on the net
- The web server is waiting at this address for the browsers.
- A URL is used by a web browser
- to travel to the address and request the desired Web page from the web server.
- If the web server gives the page to the Web browser
- the browser then displays it to the user.
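To make the idea of a URL as an address more concrete, the following Java sketch splits a URL into the parts a browser needs: the host to connect to, the port, and the path of the page to request. It uses the standard java.net.URI class. The fallback to port 80 when the URL names no port is an assumption that only suits plain HTTP, and the helper names portOf and pathOf are made up for this example.

```java
import java.net.URI;

class UrlPartsSketch {
    // Port named in the URL, or 80 (the usual HTTP port) when none is given.
    static int portOf(URI uri) {
        return uri.getPort() == -1 ? 80 : uri.getPort();
    }

    // Path of the requested page; an empty path means the site root "/".
    static String pathOf(URI uri) {
        String path = uri.getPath();
        return (path == null || path.isEmpty()) ? "/" : path;
    }

    public static void main(String[] args) {
        URI uri = URI.create("http://www.robotstxt.org/wc/robots.html");
        System.out.println("scheme: " + uri.getScheme());
        System.out.println("host:   " + uri.getHost());
        System.out.println("port:   " + portOf(uri));
        System.out.println("path:   " + pathOf(uri));
    }
}
```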
45 TCP/IP for Internet Connection
- IP stands for Internet Protocol
- TCP stands for Transmission Control Protocol
- TCP is layered on top of IP
- The resulting communication system is TCP/IP.
46 The IP layer
- Inter-network layer
- Data are broken down into packets of limited size and sent to their destinations.
- An IP (IPv4) address consists of four 8-bit numbers
- example: 144.214.37.200
- Routers use IP addresses to send packets to their destinations
- Packets of the same stream of data may go through different routes.
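Because an IPv4 address is four 8-bit numbers, it fits in 32 bits. The Java sketch below, with hypothetical class and method names, packs the dotted form into a single number and unpacks it again. A long is used only because Java's int is signed.

```java
class Ipv4Sketch {
    // Packs "a.b.c.d" into one 32-bit value, 8 bits per field.
    static long toLong(String dotted) {
        String[] parts = dotted.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("expected four fields: " + dotted);
        }
        long value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("field out of range: " + part);
            }
            value = (value << 8) | octet;   // shift in the next 8 bits
        }
        return value;
    }

    // Unpacks a 32-bit value back into dotted form.
    static String toDotted(long value) {
        return ((value >> 24) & 0xFF) + "." + ((value >> 16) & 0xFF) + "."
             + ((value >> 8) & 0xFF) + "." + (value & 0xFF);
    }

    public static void main(String[] args) {
        long packed = toLong("144.214.37.200");
        System.out.println(packed + " -> " + toDotted(packed));
    }
}
```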
47 The TCP layer
- A service provider protocol
- Provides a logical connection between the sender and the receiver of data over the unreliable network
- Its data integrity support functions and mechanisms are the basis for application services such as FTP, Telnet, etc.
48 TCP/IP Port Number
- One for each specific application layer service
- Used between two host computers to identify which application program is to receive the incoming traffic.
- Ports 0-1023 are pre-assigned and are called well-known ports. If you want to assign a port number to a specific application, use a number above 1023.
49 Browser/Server Interaction
- You type a URL (or click on it)
- your browser opens up a connection with the web server at the URL
- your browser tells the web server the particular page you want
- the web server sends back a response giving information about the page
- then sends back the appropriate page
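The text a browser sends to name the page it wants, and the first line of the server's reply, can be sketched in Java as below. This assumes a plain HTTP/1.0 GET request with a Host header; it only builds and parses strings and does not open a connection. The class and method names are hypothetical.

```java
import java.net.URI;

class HttpExchangeSketch {
    // Builds the request text sent after the connection is opened.
    static String buildGetRequest(URI uri) {
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/";                       // an empty path means the site root
        }
        return "GET " + path + " HTTP/1.0\r\n"
             + "Host: " + uri.getHost() + "\r\n"
             + "\r\n";                        // blank line ends the request headers
    }

    // Extracts the numeric status code from a status line such as "HTTP/1.0 200 OK".
    static int statusCode(String statusLine) {
        String[] fields = statusLine.trim().split("\\s+", 3);
        return Integer.parseInt(fields[1]);
    }

    public static void main(String[] args) {
        URI uri = URI.create("http://www.robotstxt.org/wc/robots.html");
        System.out.print(buildGetRequest(uri));
        System.out.println(statusCode("HTTP/1.0 200 OK"));
    }
}
```

The status line and the headers that follow it are the "information about the page"; the page content itself comes after the blank line that ends the response headers.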
50 The Spider
- Does that automatically (without clicking on a link or typing a URL)
- It is an automated program that searches the web.
- Read a web page
- store/index the relevant information on the page
- follow all the links on the page (and repeat the above for each link)
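The "follow all the links" step needs the links first. Here is a minimal Java sketch that pulls href values out of anchor tags and resolves relative links against the page's own URL. It assumes double-quoted href attributes and uses a regular expression rather than a real HTML parser, so it will miss unusual markup; URI.resolve also rejects hrefs with illegal characters such as spaces. The class name is hypothetical.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkExtractorSketch {
    // Matches <a ... href="..."> with a double-quoted value.
    private static final Pattern HREF = Pattern.compile(
        "<a\\s[^>]*href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Returns every link on the page as an absolute URL, in page order.
    static List<String> links(String pageUrl, String html) {
        URI base = URI.create(pageUrl);
        List<String> found = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            found.add(base.resolve(m.group(1)).toString());
        }
        return found;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"guidelines.html\">Guidelines</a> "
                    + "<A class=\"nav\" HREF=\"/index.html\">Home</A></p>";
        for (String link : links("http://www.robotstxt.org/wc/robots.html", html)) {
            System.out.println(link);
        }
    }
}
```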
51 Caution About Using A Spider
- It may put an unexpected amount of traffic load on the network if poorly written
- Be responsible for your actions
- Use a well-tested one instead of writing your own
- Test it locally before running it over the Internet
- Follow the standard guidelines
- www.robotstxt.org/wc/guidelines.html
52 Tutorials
- Start with a review of Java
- Then how to connect to the Internet
- Use of spider
- Major functionality of search engine
- In addition, certain tasks will be assigned to give first-hand learning experience.
53 Today's Tutorial
- A typical Java program
- A typical Java program that takes a URL as input and returns the content of the web page
- Some further questions will be left as your exercise.
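One possible shape for a program that takes a URL and returns the page content is sketched below. It is not the tutorial's own program. It assumes the page is text encoded as UTF-8, and it needs network access when run with a URL argument. The class and method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;

class PageFetchSketch {
    // Reads everything from the reader into one string.
    static String readAll(Reader reader) {
        try {
            StringBuilder content = new StringBuilder();
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                content.append(buffer, 0, n);
            }
            return content.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Opens the URL and returns the page content (assumes UTF-8 text).
    static String fetch(String url) {
        try (Reader in = new BufferedReader(new InputStreamReader(
                URI.create(url).toURL().openStream(), StandardCharsets.UTF_8))) {
            return readAll(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.out.println("usage: java PageFetchSketch <url>");
            return;
        }
        System.out.println(fetch(args[0]));
    }
}
```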
54 Next Week's Tutorial
- Java network programming introduction
- HTTP introduction
- Java URL class for establishing an HTTP connection