CS5286 Algorithms And Techniques for Web Search

Slides: 55
Provided by: scie241
1
CS5286 Algorithms And Techniques for Web Search
Objective: Provide a practical introduction to
algorithms and techniques for information
retrieval over the Internet.
2
Contact
  • Lecturer
  • Professor DENG, Xiaotie
  • Room Y6321 Ext 8632 Email csdeng
  • TA
  • SUN Wei
  • Room CYC2207 Ext 8030 Email sunwei_at_cs

3
Assessment
  • Coursework: 50%
  • 20 marks for two quizzes, each worth 10% of the
    final mark.
  • 27 marks for a group project (2-3 people per
    group).
  • 3 participation points: Discussion Forum,
    tutorials, and classes (one point each).
  • Examination: 50%
  • One 1.5-hour examination.
  • At least 30 examination marks are required to
    pass.

4
Reference Books
  • Modern Information Retrieval, by Ricardo
    Baeza-Yates and Berthier Ribeiro-Neto, Addison
    Wesley, 1999.
  • GUIDE TO SEARCH ENGINES, by Wes Sonnenreich and
    Tim Macinta, Wiley Computer Publishing, 1998.

5
Students Will Acquire The Following
  • Web access
  • Automated access to existing search engines
  • The use of spiders/robots for web searching
  • Collection of visitor information for one's own
    web site
  • Web mining
  • Ranking techniques for web sites on specific
    topics
  • Automated abstract generation
  • User profile
  • Information retrieval
  • Basic Models
  • Major Query Operations
  • Indexing and Searching
  • New research topics

6
Some Helpful Web Sites
  • A history of search engines
  • http://www.wiley.com/legacy/compbooks/sonnenreich/webdev/history.html
  • Java and the class URL (look in package java.net)
  • http://java.sun.com/j2se/1.3/docs/api/index.html
  • Free search engines written in Java
  • http://www.freewarejava.com/applets/search.shtml
  • Robots
  • http://www.robotstxt.org/wc/robots.html

7
Tentative Lecture Plan
  • The Internet and Web
  • Collection of Information over the Web
  • Quiz 1
  • Models of Information Retrieval
  • Query techniques
  • Quiz 2
  • Start of Project
  • Text Operations
  • Indexing and Searching Techniques

8
Tentative Tutorial Session Plan
  • Purpose: to provide hands-on learning
    experience
  • Materials to be covered
  • Review of Java and Link to the Internet
  • Functionality of Spider/Robot
  • Access to Major Search Engines
  • A simple search engine in Java
  • In addition, we will conduct the following in
    tutorial sessions
  • Submission and discussion of project proposal and
    plan
  • Project Presentation

9
Plan For The Group Project
  • Two or three people per group
  • It is best to do a project that applies one of
    the following available tools to some
    application problem:
  • Spider/Robot
  • Major Search Engines
  • The simple search engine in Java
  • Some examples of possible projects:
  • Build a network map of co-authorship relations.
  • Build relationship networks by Internet
    information retrieval.
  • Design a method to test which search engine
    covers more webpages.
  • Start your project as early as possible.

10
Pre-Requisites
  • Know how to program in Java.
  • Or
  • Be capable of learning Java programming in one
    week or so.
  • DROP the course if you don't.
  • We will have a quick quiz on Java to determine
    whether the course is suitable for you.

11
Lecture 1 Introduction
12
A Simple Search Engine Architecture
[Diagram components: Web, Spider, Indexer, Database, Query Interface, Query Engine, User]
13
Major issues
  • Spider and communication between computer and the
    Internet
  • Data/document model for information retrieval
  • Query protocol design
  • User profile techniques
  • Interactive Information Retrieval Technique Design

14
Spiders
  • Automatically Retrieve web pages
  • Start with a URL
  • Retrieve the associated web page
  • Find all URLs on the web page
  • Recursively retrieve URLs that have not yet been
    visited
  • Algorithmic Issues
  • How to choose the next URL?
  • Avoid overloaded sub-networks
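
The retrieval loop above can be sketched in Java as follows. This is only a minimal illustration, not the course's spider: the class name `SpiderSketch`, the `linksOf` function (standing in for "retrieve the page and find all URLs on it"), and the `maxPages` limit are all assumptions made for the example. It picks the next URL in first-in, first-out order, which is just one possible answer to "how to choose the next URL?".

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

class SpiderSketch {
    /** Breadth-first crawl; returns the URLs in the order they were retrieved. */
    static List<String> crawl(String startUrl,
                              Function<String, List<String>> linksOf, // hypothetical page fetcher
                              int maxPages) {
        Deque<String> frontier = new ArrayDeque<>(); // URLs waiting to be retrieved
        Set<String> seen = new HashSet<>();          // URLs already queued or retrieved
        List<String> visited = new ArrayList<>();
        frontier.add(startUrl);
        seen.add(startUrl);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();            // FIFO choice of the next URL
            visited.add(url);
            for (String link : linksOf.apply(url)) {
                if (seen.add(link)) {                // true only for not-yet-seen URLs
                    frontier.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // A tiny in-memory "web" so the sketch runs without network access.
        Map<String, List<String>> web = new HashMap<>();
        web.put("a", Arrays.asList("b", "c"));
        web.put("b", Arrays.asList("a", "d"));
        web.put("c", Arrays.asList("d"));
        web.put("d", Collections.emptyList());
        System.out.println(crawl("a", u -> web.getOrDefault(u, Collections.emptyList()), 10));
    }
}
```

Replacing the queue with a priority queue would allow other choices of the next URL, and the `maxPages` cap is one crude way to limit the load a crawl generates.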

15
Indexer
  • Selects terms to index for a document
  • may utilise co-operation from web page authors
    through Meta tags to indicate specific terms to
    index
  • <META name="keywords" content="information
    retrieval">
  • Algorithmic issues
  • How to choose terms/phrases or other entities to
    index so as to accurately and efficiently respond
    to user queries
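
As a rough illustration of using the Meta tag, the sketch below pulls index terms out of a keywords tag with a regular expression. The class name `MetaKeywords` is hypothetical, and the pattern assumes the attributes appear as `name` then `content` in double quotes with comma-separated terms. Real pages vary widely, so a proper HTML parser would be the safer choice in practice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaKeywords {
    // Assumed tag shape: <META name="keywords" content="term one, term two">
    private static final Pattern META = Pattern.compile(
        "<meta\\s+name\\s*=\\s*\"keywords\"\\s+content\\s*=\\s*\"([^\"]*)\"",
        Pattern.CASE_INSENSITIVE);

    /** Returns the lower-cased, comma-separated terms of the first keywords tag, if any. */
    static List<String> keywords(String html) {
        List<String> terms = new ArrayList<>();
        Matcher m = META.matcher(html);
        if (m.find()) {
            for (String raw : m.group(1).split(",")) {
                String term = raw.trim().toLowerCase(Locale.ROOT);
                if (!term.isEmpty()) {
                    terms.add(term);
                }
            }
        }
        return terms;
    }
}
```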

16
Database
  • Tradeoff of Hardware/Speed Efficiency
  • Algorithmic issues
  • efficiency in space
  • redundancy as trade-off for speed in query
    response
  • Cost efficiency
  • How many computers to use?
  • How to distribute load efficiently?

17
Query Engine
  • Return the most relevant documents for queries
  • Algorithmic Issues
  • document model
  • relevance analysis

18
Query Interface
  • Analyse user profiles
  • generate user specific query result
  • Algorithmic issues
  • Design of efficient and user-friendly query
    protocols

19
Interesting Problems
  • Finding the needle in the haystack
  • search for certain specific information on the
    Internet
  • User-specific ranking of documents on the web
  • how to collect and apply user information to
    provide better service
  • Trust analysis of information on the web
  • avoid providing false information
  • Trustworthiness analysis of virtual identities
    over the Internet.
  • http://www.firstgov.gov/Citizen/Topics/Internet_Fraud.shtml

20
Some Facts about the Internet
21
Statistics About Internet
  • Internet Domain Growth
  • http://www.isc.org/index.pl?/ops/ds/
  • How the Internet Domain Survey is conducted
  • http://www.isc.org/ds/faq.html

22
Internet Growth Charts
23
Internet Provides Varieties of Information
  • Text documents
  • Multimedia files
  • Interactive information services
  • Internet group membership services
  • Databases
  • Frauds: Trojan horses and phishing tricks

24
Major Features of Information Retrieval on the
Internet
  • Large amount of information
  • Rapid information update
  • Dynamic hyperlink structure
  • Wide variety of data formats, languages, and
    quality

25
Some Difficulties for Internet Information
Retrieval Systems
  • Diversified user base (from laymen to computer
    nerds).
  • Could we develop an evolving system that adapts
    to the user?
  • Language Ambiguity
  • This becomes an especially important issue
    because of varieties of different data on the
    Internet
  • How do we collect and apply user profiling
    techniques to resolve it?

26
Search Engines Today
27
Evolving Search Engines
  • Tools for finding information on the Web
  • Problem: hidden databases, e.g. New York Times
  • Directory
  • A hand-constructed hierarchy of topics (e.g.
    Yahoo)
  • Search engine
  • A machine-constructed index (usually by keyword)
  • Interactive Searching
  • http://www.learnthenet.com/english/html/78tutorial.htm
  • Specialized Searching
  • Google Scholar: http://www.scholar.google.com/
  • Guide to finding search engines
  • http://www.searchenginecolossus.com/
  • New trends in search engines
  • http://www.searchengineshowdown.com/

28
Coverage of Search Engine
  • Number of web pages covered
  • Self-claimed.
  • May include pages known only through links,
    without the page itself being analyzed
  • Page Depth
  • The maximum amount of information indexed for an
    individual webpage.
  • http://blog.searchenginewatch.com/blog/041111-084221

29
Search Engine Sizes (Apr. 6, 2001)
Estimated total web pages: 2 billion
[Chart legend: AV = AltaVista, EX = Excite, FAST = FAST, GG = Google,
Go = Go (Infoseek), INK = Inktomi, NL = Northern Light, WT = WebTop.com]
Shaded data for GG and INK includes pages indexed but not visited.
Searches/day (millions): 100, 12, 50, 47, 50, 5
Source: searchenginewatch.com
30
Search Engine Sizes (Dec 11, 2001)
[Chart legend: AV = AltaVista, EX = Excite, FAST = FAST, GG = Google,
Go = Go (Infoseek), INK = Inktomi, NL = Northern Light, WT = WebTop.com]
Source: http://searchenginewatch.com/reports/sizes.html
31
Search Engine Size Trends
Source: http://searchenginewatch.com/reports/article.php/2156481trend
32
Search Engines Disjointness
Source: Search Engine Showdown
33
Search Engines Uniqueness
Source: http://www.searchengineshowdown.com/stats/overlap.shtml
34
Time Spent Per Visitor (minutes) by Search
Engine, April 1999
[Chart legend: AV = AltaVista, EX = Excite, Go/IS = Go/Infoseek, GT = GoTo,
HB = Hotbot, LS = LookSmart, LY = Lycos, MSN = MSN, NS = Netscape,
WC = Webcrawler, YH = Yahoo]
Source: http://www.nielsen-netratings.com/
35
Time Spent Per Visitor (minutes) by Search
Engine, June 2002
[Chart legend: MSN = MSN, YH = Yahoo, GG = Google, AOL = AOL, AJ = Ask
Jeeves, IS = InfoSpace, OVR = Overture (GoTo), AV = AltaVista,
NS = Netscape, LS = LookSmart, LY = Lycos, DP = Dogpile]
Source: http://searchenginewatch.com/reports/netratings.html
36
Total Hours Spent (millions) by Search
Engine, June 2002
[Chart legend: MSN = MSN, YH = Yahoo, GG = Google, AOL = AOL, AJ = Ask
Jeeves, IS = InfoSpace, OVR = Overture (GoTo), AV = AltaVista,
NS = Netscape, LS = LookSmart, LY = Lycos, DP = Dogpile]
Source: http://searchenginewatch.com/reports/netratings.html
37
Audience Reach by Search Engine, July 2001
[Chart legend: AJ = Ask Jeeves, AV = AltaVista, DH = Direct Hit,
DP = Dogpile, EX = Excite, GG = Google, GO = Go/Infoseek, G2N = GoTo,
HB = Hotbot, iWN = iWon, LS = LookSmart, LY = Lycos, MC = Metacrawler,
MM = Mamma, MSN = MSN, NL = Northern Light, NS = Netscape,
WC = Webcrawler, YH = Yahoo]
Audience Reach = % of active surfers visiting during the month.
Totals exceed 100% because of overlap.
Source: http://wreportus.mediametrix.com/clientCenter.html
38
Audience Reach by Search Engine, Mar. 2002
Audience Reach = % of active surfers visiting during the month.
Totals exceed 100% because of overlap.
[Chart legend: MSN = MSN, YH = Yahoo, GG = Google, AOL = AOL, AJ = Ask
Jeeves, LS = LookSmart, ISP = InfoSpace, NS = Netscape,
OVR = Overture (GoTo)]
Source: http://searchenginewatch.com/reports/mediametrix.html
39
Start With Spider
40
Spider Architecture
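[Diagram: several url_spider instances (the spiders) share one URL pool.
Each spider gets a URL from the pool, sends an HTTP request to the web
space, receives the HTTP response, adds new URLs to the pool, and stores
results in the database through a database interface.]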
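
One way such a shared-pool design might look in Java is sketched below. It is an assumption-laden illustration rather than the architecture's actual code: `SpiderPool`, `fetchLinks` (standing in for the HTTP request/response step), and the `pending` counter used to detect when all spiders are finished are names invented for this example, and the database step is omitted.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

class SpiderPool {
    /** Crawls with several spider threads sharing one URL pool; returns all URLs found. */
    static Set<String> crawl(String startUrl,
                             Function<String, List<String>> fetchLinks, // hypothetical fetcher
                             int spiders) throws InterruptedException {
        BlockingQueue<String> urlPool = new LinkedBlockingQueue<>(); // shared URL pool
        Set<String> seen = ConcurrentHashMap.newKeySet();            // URLs ever added to the pool
        AtomicInteger pending = new AtomicInteger(1);                // URLs queued or being fetched
        seen.add(startUrl);
        urlPool.add(startUrl);

        Runnable spider = () -> {
            try {
                while (pending.get() > 0) {
                    String url = urlPool.poll(10, TimeUnit.MILLISECONDS); // "get a URL"
                    if (url == null) {
                        continue; // pool is momentarily empty while another spider is busy
                    }
                    try {
                        for (String link : fetchLinks.apply(url)) {
                            if (seen.add(link)) {          // "add a new URL"
                                pending.incrementAndGet();
                                urlPool.add(link);
                            }
                        }
                    } finally {
                        pending.decrementAndGet();         // this URL is done
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };

        ExecutorService workers = Executors.newFixedThreadPool(spiders);
        for (int i = 0; i < spiders; i++) {
            workers.execute(spider);
        }
        workers.shutdown();
        workers.awaitTermination(1, TimeUnit.MINUTES);
        return new TreeSet<>(seen); // sorted copy for a stable result
    }
}
```

The counter is incremented before a URL enters the pool and decremented only after its links have been processed, so it reaches zero exactly when no work is left.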
41
Communication
  • How a web browser communicates with the computer
  • How a browser communicates with the Internet
  • How data travels through the Internet
  • How a web browser communicates with a web server

42
Web Browser
  • A primary tool to gather information from the
    Internet
  • Netscape Navigator (now Firefox)
  • Microsoft's Internet Explorer

43
Web Server
  • It provides the connection of the computer to the
    Internet
  • It serves web pages to browsers
  • It usually listens on TCP port 80

44
Uniform Resource Locator (URL)
  • The address of a web page on the net
  • The web server waits at this address for
    browsers.
  • A URL is used by a web browser
  • to travel to the address and request the desired
    web page from the web server.
  • If the web server gives the page to the web
    browser
  • the browser then displays it to the user.
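
To make the parts of a URL concrete, the sketch below splits one into the pieces a browser needs: the scheme, the web server's host name, the port, and the page's path. It uses the standard `java.net.URI` class; the class name `UrlParts`, the `describe` helper, and its space-separated output format are assumptions for this example. Port 80 is filled in only for `http` URLs that name no port.

```java
import java.net.URI;
import java.net.URISyntaxException;

class UrlParts {
    /** Returns "scheme host port path" for the given address. */
    static String describe(String address) throws URISyntaxException {
        URI uri = new URI(address);
        int port = uri.getPort();                 // -1 when the URL names no port
        if (port == -1 && "http".equalsIgnoreCase(uri.getScheme())) {
            port = 80;                            // default port for web servers
        }
        String path = uri.getPath();
        if (path == null || path.isEmpty()) {
            path = "/";                           // no path means the server's root page
        }
        return uri.getScheme() + " " + uri.getHost() + " " + port + " " + path;
    }
}
```

For example, `describe("http://www.robotstxt.org/wc/robots.html")` separates the server `www.robotstxt.org` from the page `/wc/robots.html`.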

45
TCP/IP for Internet Connection
  • IP stands for Internet Protocol
  • TCP stands for Transmission Control Protocol
  • TCP is layered on top of IP
  • The resulting communication system is TCP/IP.

46
The IP layer
  • Inter-network layer
  • Data are broken down into packets of limited
    size and sent to their destinations.
  • An IP (IPv4) address consists of four 8-bit
    numbers
  • Example: 144.214.37.200
  • Routers use IP addresses to send packets to
    their destinations
  • Packets of the same stream of data may travel
    through different routes.
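
Since an IPv4 address is four 8-bit numbers, it fits in 32 bits. The sketch below converts between the dotted form and a single number; the class and method names (`Ipv4`, `toNumber`, `toDotted`) are hypothetical, and a `long` is used only because Java has no unsigned 32-bit integer type.

```java
class Ipv4 {
    /** Packs a dotted address such as "144.214.37.200" into one 32-bit value. */
    static long toNumber(String dotted) {
        String[] parts = dotted.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("expected four numbers: " + dotted);
        }
        long value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("not an 8-bit number: " + part);
            }
            value = (value << 8) | octet;   // shift in the next 8 bits
        }
        return value;
    }

    /** Unpacks a 32-bit value back into dotted form. */
    static String toDotted(long value) {
        return ((value >> 24) & 0xFF) + "." + ((value >> 16) & 0xFF) + "."
             + ((value >> 8) & 0xFF) + "." + (value & 0xFF);
    }
}
```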

47
The TCP layer
  • A service provider protocol
  • Provides a logical connection between the sender
    and the receiver of data over the unreliable
    network
  • Its data integrity functions and mechanisms are
    the basis for application services such as FTP,
    Telnet, etc.

48
TCP/IP Port Number
  • One for each specific application layer service
  • Used between two host computers to identify which
    application program is to receive the incoming
    traffic.
  • Ports 0-1023 are pre-assigned and are called
    well-known ports. If you want to assign a port
    number to your own application, use a number
    above 1023.

49
Browser/Server Interaction
  • You type a URL (or click on a link)
  • Your browser opens a connection with the web
    server at the URL
  • Your browser tells the web server which
    particular page you want
  • The web server sends back a response giving
    information about the page
  • It then sends back the appropriate page
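
The exchange above can be imitated by hand with a plain TCP socket. The following is a simplified sketch under stated assumptions: `HttpSketch` and its methods are hypothetical names, the request uses HTTP/1.0 so that the server closes the connection after the page, and there is no handling of redirects, HTTPS, or character sets.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

class HttpSketch {
    /** Builds the text the browser sends to name the page it wants. */
    static String buildRequest(String host, String path) {
        return "GET " + path + " HTTP/1.0\r\n"
             + "Host: " + host + "\r\n"
             + "\r\n";                              // blank line ends the request
    }

    /** Extracts the status code from a status line such as "HTTP/1.0 200 OK". */
    static int statusCode(String statusLine) {
        return Integer.parseInt(statusLine.split(" ")[1]);
    }

    /** Opens a connection, sends the request, and returns headers plus page as text. */
    static String fetch(String host, int port, String path) throws IOException {
        try (Socket socket = new Socket(host, port)) {
            OutputStream out = socket.getOutputStream();
            out.write(buildRequest(host, path).getBytes(StandardCharsets.US_ASCII));
            out.flush();
            BufferedReader in = new BufferedReader(
                new InputStreamReader(socket.getInputStream(), StandardCharsets.ISO_8859_1));
            StringBuilder response = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {  // status line, headers, then the page
                response.append(line).append('\n');
            }
            return response.toString();
        }
    }
}
```

The first line of the returned text is the status line, the lines up to the first blank line are the information about the page, and the page itself follows.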

50
The Spider
  • Does that automatically (without clicking on a
    link or typing a URL)
  • It is an automated program that searches the web.
  • Reads a web page
  • Stores/indexes the relevant information on the
    page
  • Follows all the links on the page (and repeats
    the above for each link)

51
Caution About Using A Spider
  • It may put an unexpected traffic load on web
    servers if poorly written
  • Be responsible for your actions
  • Use a well-tested one instead of writing your own
  • Test it locally before running it over the
    Internet
  • Follow the standard guidelines
  • www.robotstxt.org/wc/guidelines.html
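
Part of being a responsible spider is honoring a site's robots.txt file. The sketch below is a deliberately simplified reading of that file: it only collects the `Disallow` path prefixes listed under `User-agent: *` and ignores other directives, wildcards, and grouped user-agent lines. `RobotsRules` and its method names are invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class RobotsRules {
    /** Collects the Disallow prefixes that apply to all robots ("User-agent: *"). */
    static List<String> disallowedForAll(String robotsTxt) {
        List<String> prefixes = new ArrayList<>();
        boolean appliesToAll = false;
        for (String raw : robotsTxt.split("\\r?\\n")) {
            String line = raw.trim();
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash).trim();   // drop trailing comments
            }
            String lower = line.toLowerCase(Locale.ROOT);
            if (lower.startsWith("user-agent:")) {
                appliesToAll = line.substring("user-agent:".length()).trim().equals("*");
            } else if (appliesToAll && lower.startsWith("disallow:")) {
                String prefix = line.substring("disallow:".length()).trim();
                if (!prefix.isEmpty()) {                 // an empty Disallow forbids nothing
                    prefixes.add(prefix);
                }
            }
        }
        return prefixes;
    }

    /** True if the path does not start with any disallowed prefix. */
    static boolean allowed(String path, List<String> disallowed) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```

A spider would call `allowed` before each request and also pause between requests to the same server, so that it does not add to the traffic load.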

52
Tutorials
  • Start with a review of Java
  • Then how to connect to the internet
  • Use of spider
  • Major functionality of search engine
  • In addition, certain tasks will be assigned to
    give you first-hand experience.

53
Today's Tutorial
  • A typical Java program
  • A typical Java program that takes a URL as input
    and returns the content of the web page
  • Some further questions will be left as your
    exercise.
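
A program of the second kind might look like the sketch below. It is not the tutorial's own code: the class name `PageFetcher` is hypothetical, the page is assumed to be UTF-8 text, and error handling is left to the caller.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;

class PageFetcher {
    /** Reads everything the URL points to and returns it as one string. */
    static String fetch(URL url) throws IOException {
        StringBuilder page = new StringBuilder();
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                page.append(line).append('\n');
            }
        }
        return page.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PageFetcher <url>");
            return;
        }
        System.out.print(fetch(URI.create(args[0]).toURL()));
    }
}
```

Because `URL` also understands `file:` addresses, the program can be tried on a local file before it is pointed at a real web server.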

54
Next Week's Tutorial
  • Java network programming introduction
  • HTTP introduction
  • Java URL class for establishing an HTTP connection