CS5286 Algorithms And Techniques for Web Search

Provided by: scie241

1
CS5286 Algorithms And Techniques for Web Search
Objective Provide a practical introduction to
algorithms and techniques for information
retrieval over the Internet.
2
Contact

Lecturer
Professor DENG, Xiaotie
Room Y6321 Ext 8632 Email csdeng
TA
SUN Wei
Room CYC2207 Ext 8030 Email sunwei_at_cs

3
Assessment

Coursework: 50%
20 marks for two quizzes, each worth 10% of the
final mark.
27 marks for a group project (2-3 people per
group).
3 participation points: Discussion Forum,
tutorials and classes (one point each).
Examination: 50%
One 1.5-hour examination.
At least 30 examination marks are required to
pass.

4
Reference Books

Modern Information Retrieval, by Ricardo
Baeza-Yates and Berthier Ribeiro-Neto, Addison
Wesley, 1999.
GUIDE TO SEARCH ENGINES, by Wes Sonnenreich and
Tim Macinta, Wiley Computer Publishing, 1998.

5
Students Will Acquire The Following

Web access
Automated access to existing search engines
The use of spiders/robots for web searching
Collection of visitor information for one's own
web site
Web mining
Ranking techniques for web sites on specific
topics
Automated abstract generation
User profile
Information retrieval
Basic Models
Major Query Operations
Indexing and Searching
New research topics

6
Some Helpful Web Sites

A history of search engines
http://www.wiley.com/legacy/compbooks/sonnenreich/webdev/history.html
Java and the class URL (search under class net)
http://java.sun.com/j2se/1.3/docs/api/index.html
Free search engines written in Java
http://www.freewarejava.com/applets/search.shtml
Robots
http://www.robotstxt.org/wc/robots.html

7
Tentative Lecture Plan

The Internet and Web
Collection of Information over the Web
Quiz 1
Models of Information Retrieval
Query techniques
Quiz 2
Start of Project
Text Operations
Indexing and Searching Techniques

8
Tentative Tutorial Session Plan

Purpose: to provide hands-on learning
experience
Materials to be covered
Review of Java and Link to the Internet
Functionality of Spider/Robot
Access to Major Search Engines
A simple search engine in Java
In addition, we will conduct the following in
tutorial sessions
Submission and discussion of project proposal and
plan
Project Presentation

9
Plan For The Group Project

Two or three people per group
It is best to do a project that uses one of the
following available tools to solve an application
problem.
Spider/Robot
Major Search Engines
The simple search engine in Java
Some examples of possible projects
Build a network map of co-authorship relations.
Build relationship networks by Internet
information retrieval.
Design a method to test which search engine
covers more webpages.
Start your project as early as possible.

10
Pre-Requisites

Know how to program in Java.
Or
Be capable of learning Java programming in one
week or so.
DROP the course if you can't.
We will have a quick quiz on Java to determine
whether the course is suitable for you.

11
Lecture 1 Introduction
12
A Simple Search Engine Architecture
Web
Spider
Indexer
Database
Query Interface
Query Engine
User
13
Major issues

Spider and communication between computer and the
Internet
Data/document model for information retrieval
Query protocol design
User profile techniques
Interactive Information Retrieval Technique Design

14
Spiders

Automatically retrieve web pages
Start with a URL
Retrieve the associated web page
Find all URLs on the web page
Recursively retrieve URLs not yet visited
Algorithmic Issues
How to choose the next URL?
Avoid overloaded sub-networks
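
The retrieval loop above can be sketched in Java roughly as follows. This is only a minimal sketch, not the course's spider: it uses an explicit queue (breadth-first order) instead of recursion, which is one possible answer to "how to choose the next URL?". The class name SpiderSketch, the linksOf hook (which stands for "fetch the page and return the URLs found on it") and the maxPages limit are all assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

final class SpiderSketch {
    /**
     * Visits pages breadth-first, starting from startUrl.
     * linksOf is a hypothetical hook: given a URL, it returns the URLs found on that page.
     * maxPages is an assumed safety limit so the crawl always terminates.
     */
    static List<String> crawl(String startUrl,
                              Function<String, List<String>> linksOf,
                              int maxPages) {
        Set<String> seen = new HashSet<>();          // URLs already queued or visited
        Deque<String> frontier = new ArrayDeque<>(); // URLs waiting to be retrieved
        List<String> visited = new ArrayList<>();    // retrieval order

        seen.add(startUrl);
        frontier.add(startUrl);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();            // choose the next URL (oldest first)
            visited.add(url);
            for (String link : linksOf.apply(url)) {
                if (seen.add(link)) {                // true only for a not-yet-seen URL
                    frontier.add(link);
                }
            }
        }
        return visited;
    }
}
```

Swapping the queue for a priority queue would change the visiting order without touching the rest of the loop.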

15
Indexer

Selects terms to index for a document
may utilise co-operation from web page authors
through Meta tags to indicate specific terms to
index
<META name="keywords" content="information retrieval">
Algorithmic issues
How to choose terms/phrases or other entities to
index so as to accurately and efficiently respond
to user queries
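
As an illustration of how an indexer might read author-supplied terms from a keywords Meta tag, here is a small Java sketch. It is an assumption-laden example: the class name MetaKeywords is hypothetical, the regular expression only handles the simple form with name before content and double-quoted values, and a real indexer would more likely use an HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MetaKeywords {
    // Matches <META name="keywords" content="..."> in any letter case.
    // Assumption: name comes before content and both use double quotes.
    private static final Pattern KEYWORDS = Pattern.compile(
            "<meta\\s+name\\s*=\\s*\"keywords\"\\s+content\\s*=\\s*\"([^\"]*)\"",
            Pattern.CASE_INSENSITIVE);

    /** Returns the comma-separated keywords of the first keywords Meta tag, lower-cased. */
    static List<String> extract(String html) {
        List<String> terms = new ArrayList<>();
        Matcher m = KEYWORDS.matcher(html);
        if (m.find()) {
            for (String term : m.group(1).split(",")) {
                String cleaned = term.trim().toLowerCase(Locale.ROOT);
                if (!cleaned.isEmpty()) {
                    terms.add(cleaned);
                }
            }
        }
        return terms; // empty when the page has no keywords Meta tag
    }
}
```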

16
Database

Tradeoff of Hardware/Speed Efficiency
Algorithmic issues
efficiency in space
redundancy as trade-off for speed in query
response
Cost efficiency
How many computers to use?
How to distribute load efficiently?

17
Query Engine

Return the most relevant documents for queries
Algorithmic Issues
document model
relevance analysis

18
Query Interface

Analyse user profiles
generate user specific query result
Algorithmic issues
Design of efficient and user-friendly query
protocols

19
Interesting Problems

Finding the needle in the haystack
search for certain specific information on the
Internet
User-specific ranking of documents on the web
how to collect and apply user information to
provide better service
Trust analysis of information on the web
avoid providing false information
Trustworthiness analysis of virtual identities
over the Internet.
http://www.firstgov.gov/Citizen/Topics/Internet_Fraud.shtml

20
Some Facts about the Internet
21
Statistics About Internet

Internet Domain Growth
http://www.isc.org/index.pl?/ops/ds/
How the Internet Domain Survey is conducted
http://www.isc.org/ds/faq.html

22
Internet Growth Charts
23
Internet Provides Varieties of Information

Text documents
Multimedia files
Interactive information services
Internet group membership services
Databases
Frauds: Trojan horses and phishing tricks

24
Major Features of Information Retrieval on the
Internet

Large amount of information
Rapid information update
Dynamic hyperlink structure
Varieties of data format, language, qualities

25
Some Difficulties for Internet Information
Retrieval Systems

Diversified user base (from laymen to computer
nerds).
Could we develop an evolving system that adapts
to the user?
Language ambiguity
This becomes an especially important issue
because of the variety of data on the
Internet
How do we collect user profiles and apply
profiling techniques to resolve it?

26
Search Engines Today
27
Evolving Search Engines

Tools for finding information on the Web
Problem: hidden databases, e.g. New York Times
Directory
A hand-constructed hierarchy of topics (e.g.
Yahoo)
Search engine
A machine-constructed index (usually by keyword)
Interactive Searching
http://www.learnthenet.com/english/html/78tutorial.htm
Specialized Searching
Google Scholar: http://www.scholar.google.com/
Guide to find search engines
http://www.searchenginecolossus.com/
New trends in search engines
http://www.searchengineshowdown.com/

28
Coverage of Search Engine

Number of web pages covered
Self-claimed.
May include pages known only through links,
without the page itself being analyzed
Page Depth
The maximum amount of information indexed for an
individual webpage.
http://blog.searchenginewatch.com/blog/041111-084221

29
Search Engine Sizes (Apr. 6, 2001)
Estimated total web pages: 2 billion
AV = AltaVista, EX = Excite, FAST = FAST, GG = Google,
Go = Go (Infoseek), INK = Inktomi, NL = Northern Light,
WT = WebTop.com
Shaded data for GG and Inktomi includes pages
indexed but not visited
Searches/day (millions): 100, 12, 50, 47, 50, 5
Source: searchenginewatch.com
30
Search Engine Sizes (Dec. 11, 2001)
AV = AltaVista, EX = Excite, FAST = FAST, GG = Google,
Go = Go (Infoseek), INK = Inktomi, NL = Northern Light,
WT = WebTop.com
Source: http://searchenginewatch.com/reports/sizes.html
31
Search Engine Size Trends
Source: http://searchenginewatch.com/reports/article.php/2156481trend
32
Search Engines Disjointness
Source: Search Engine Showdown
33
Search Engines Uniqueness
Source: http://www.searchengineshowdown.com/stats/overlap.shtml
34
Time Spent Per Visitor (minutes) by Search
Engine, April 1999
AV = AltaVista, EX = Excite, Go/IS = Go/Infoseek,
GT = GoTo, HB = Hotbot, LS = LookSmart, LY = Lycos,
MSN = MSN, NS = Netscape, WC = Webcrawler, YH = Yahoo
Source: http://www.nielsen-netratings.com/
35
Time Spent Per Visitor (minutes) by Search
Engine, June 2002
MSN = MSN, YH = Yahoo, GG = Google, AOL = AOL,
AJ = Ask Jeeves, IS = InfoSpace, OVR = Overture (GoTo),
AV = AltaVista, NS = Netscape, LS = LookSmart,
LY = Lycos, DP = Dogpile.
Source: http://searchenginewatch.com/reports/netratings.html
36
Total Hours Spent (millions) by Search
Engine, June 2002
MSN = MSN, YH = Yahoo, GG = Google, AOL = AOL,
AJ = Ask Jeeves, IS = InfoSpace, OVR = Overture (GoTo),
AV = AltaVista, NS = Netscape, LS = LookSmart,
LY = Lycos, DP = Dogpile.
Source: http://searchenginewatch.com/reports/netratings.html
37
Audience Reach by Search Engine, July 2001
AJ = Ask Jeeves, AV = AltaVista, DH = Direct Hit,
DP = Dogpile, EX = Excite, GG = Google, GO = Go/Infoseek,
G2N = GoTo, HB = Hotbot, iWN = iWon, LS = LookSmart,
LY = Lycos, MC = Metacrawler, MM = Mamma, MSN = MSN,
NL = Northern Light, NS = Netscape, WC = Webcrawler,
YH = Yahoo
Audience reach: % of active surfers
visiting during the month. Totals exceed 100%
because of overlap
Source: http://wreportus.mediametrix.com/clientCenter.html
38
Audience Reach by Search Engine, Mar. 2002
Audience reach: % of active surfers
visiting during the month. Totals exceed 100%
because of overlap
MSN = MSN, YH = Yahoo, GG = Google, AOL = AOL,
AJ = Ask Jeeves, LS = LookSmart, ISP = InfoSpace,
NS = Netscape, OVR = Overture (GoTo).
Source: http://searchenginewatch.com/reports/mediametrix.html
39
Start With Spider
40
Spider Architecture
[Diagram labels: Web Space; Shared URL pool;
spiders (five url_spider instances); Get a URL;
Add a new URL; HTTP Request; HTTP Response;
Database Interface; Database]
41
Communication

How a web browser communicates with the computer
How a browser communicates with the Internet
How data travels through the Internet
How a web browser communicates with a web server

42
Web Browser

A primary tool to gather information from the
Internet
Netscape Navigator (now Firefox)
Microsoft's Internet Explorer

43
Web Server

Connects the host computer to the
Internet
Serves web pages to browsers
Usually listens on TCP port 80

44
Uniform Resource Locator(URL)

The address of a web page on the net
The web server waits at this address for
browsers.
A URL is used by a web browser
to travel to the address and request the desired
web page from the web server.
If the web server returns the page to the web
browser,
the browser then displays it to the user.
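
To make the parts of a URL concrete, the following Java sketch splits an address into scheme, host, port and path using the standard java.net.URI class. The class name UrlParts and the output format are assumptions for illustration; the only rule encoded is that an http URL without an explicit port is treated as port 80.

```java
import java.net.URI;

final class UrlParts {
    /** Describes the components a browser needs in order to contact the web server. */
    static String describe(String address) {
        URI uri = URI.create(address);
        int port = uri.getPort();                 // -1 when the URL names no port
        if (port == -1 && "http".equalsIgnoreCase(uri.getScheme())) {
            port = 80;                            // usual port for a web server
        }
        String path = uri.getPath();
        if (path == null || path.isEmpty()) {
            path = "/";                           // no path means the server's root page
        }
        return "scheme=" + uri.getScheme()
                + " host=" + uri.getHost()
                + " port=" + port
                + " path=" + path;
    }
}
```

For example, describe("http://www.robotstxt.org/wc/robots.html") reports host www.robotstxt.org, port 80 and path /wc/robots.html.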

45
TCP/IP for Internet Connection

IP stands for Internet Protocol
TCP stands for Transmission Control Protocol
TCP is layered on top of IP
The resulting communication system is TCP/IP.

46
The IP layer

Inter-network layer
Data are broken down into packets of bounded size
and sent to their destinations.
An IP (version 4) address consists of four 8-bit
numbers, for example 144.214.37.200
Routers use IP addresses to send packets to their
destinations
Packets of the same stream of data may go through
different routes.
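
Since an address is four 8-bit numbers, it fits in 32 bits. The Java sketch below packs the dotted form into a single number; the class name Ipv4Address and the choice of a long return type (to avoid negative values) are assumptions for illustration.

```java
final class Ipv4Address {
    /** Packs a dotted address such as 144.214.37.200 into one 32-bit value. */
    static long toLong(String dotted) {
        String[] parts = dotted.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("expected four numbers: " + dotted);
        }
        long value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);   // throws NumberFormatException on non-digits
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("not an 8-bit number: " + part);
            }
            value = (value << 8) | octet;         // shift earlier octets left, append this one
        }
        return value;
    }
}
```

With the slide's example, toLong("144.214.37.200") gives 144·2^24 + 214·2^16 + 37·2^8 + 200 = 2429953480.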

47
The TCP layer

A service provider protocol
Provides a logical connection between the sender
and the receiver of data over the unreliable
network
Its data-integrity functions and mechanisms are
the basis for application services such as FTP
and Telnet.

48
TCP/IP Port Number

One for each specific application layer service
Used between two host computers to identify which
application program is to receive the incoming
traffic.
0-1023 are pre-assigned and are called well-known
ports. If you want to assign a port number to a
specific application, use a number above 1023.

49
Browser/Server Interaction

You type a URL (or click on a link)
Your browser opens a connection with the web
server at the URL
Your browser tells the web server which
page you want
The web server sends back a response giving
information about the page,
then sends back the appropriate page
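
In HTTP terms, "telling the server which page you want" is a GET request, and the "information about the page" starts with a status line followed by headers. The Java sketch below builds such a request and reads the status code from a status line, without opening any connection. The class and method names are hypothetical, and the Connection: close header is an assumed simplification.

```java
final class HttpExchangeSketch {
    /** Builds the text a browser sends to ask for one page (HTTP/1.1 GET). */
    static String buildGetRequest(String host, String path) {
        return "GET " + path + " HTTP/1.1\r\n"
                + "Host: " + host + "\r\n"
                + "Connection: close\r\n"   // assumption: one request per connection
                + "\r\n";                   // blank line ends the request headers
    }

    /** Extracts the numeric status code from a status line such as "HTTP/1.1 200 OK". */
    static int parseStatusCode(String statusLine) {
        String[] parts = statusLine.split(" ", 3);
        if (parts.length < 2 || !parts[0].startsWith("HTTP/")) {
            throw new IllegalArgumentException("not an HTTP status line: " + statusLine);
        }
        return Integer.parseInt(parts[1]);
    }
}
```

A status code of 200 means the page follows in the response body; a code such as 404 means the server could not find it.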

50
The Spider

Does this automatically (without clicking on a
link or typing a URL)
It is an automated program that searches the web.
Read a web page
Store/index the relevant information on the page
Follow all the links on the page (and repeat the
above for each link)
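
"Follow all the links on the page" first requires finding them. A rough Java sketch of that step is given below, under stated assumptions: the class name LinkExtractor is hypothetical, links are found with a simple regular expression on href attributes rather than a real HTML parser, fragments after # are dropped, and only http/https targets are kept.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LinkExtractor {
    // Captures the value of href="..." or href='...' up to the closing quote or a '#'.
    private static final Pattern HREF = Pattern.compile(
            "href\\s*=\\s*[\"']([^\"'#]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the absolute http/https URLs linked from the given page. */
    static List<String> links(String baseUrl, String html) {
        URI base = URI.create(baseUrl);
        List<String> result = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try {
                URI target = base.resolve(m.group(1).trim()); // relative links become absolute
                String scheme = target.getScheme();
                if ("http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme)) {
                    result.add(target.toString());
                }
            } catch (IllegalArgumentException malformed) {
                // Skip links that are not valid URIs instead of stopping.
            }
        }
        return result;
    }
}
```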

51
Caution About Using A Spider

A poorly written spider may put an unexpected
amount of traffic load on the network
Be responsible for your actions
Use a well-tested one instead of writing your own
Test it locally before running it over the
Internet
Follow the standard guidelines
www.robotstxt.org/wc/guidelines.html

52
Tutorials

Start with a review of Java
Then how to connect to the internet
Use of spider
Major functionality of search engine
In addition, certain tasks will be assigned so
that you gain first-hand learning experience.

53
Today's Tutorial

A typical Java program
A typical Java program that takes a URL as input
and returns the content of the web page
Some further questions will be left as
exercises.
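
A program of this second kind might look roughly like the sketch below. It is not the tutorial's actual program: the class name FetchPage is hypothetical, the page is assumed to be UTF-8 text, and error handling is left to the caller.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;

final class FetchPage {
    /** Opens the URL and returns everything the server (or file) sends back as text. */
    static String fetch(String address) throws IOException {
        URL url = URI.create(address).toURL();
        StringBuilder content = new StringBuilder();
        // Assumption: the content is UTF-8 text; real pages may declare another charset.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                content.append(line).append('\n');
            }
        }
        return content.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java FetchPage <url>");
            return;
        }
        System.out.print(fetch(args[0]));
    }
}
```

Run it with a single URL argument; it also accepts file: URLs, which is convenient for testing locally before going over the Internet.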

54
Next Week's Tutorial