1 Information Retrieval
March 11, 2005
2 Course Information
- Instructor: Dragomir R. Radev (radev@si.umich.edu)
- Office: 3080, West Hall Connector
- Phone: (734) 615-5225
- Office hours: M 11-12, Th 12-1, or via email
- Course page: http://tangra.si.umich.edu/radev/650
- Class meets on Fridays, 2:10-4:55 PM in 409 West Hall
3 Measuring the Web
4 Bharat and Broder 1998
- Based on crawls of HotBot, AltaVista, Excite, and InfoSeek
- 10,000 queries in mid and late 1997
- Estimate is 200M pages
- Only 1.4% of pages are indexed by all of them
5 Example (from Bharat & Broder)
A similar approach by Lawrence and Giles yields
320M pages (Lawrence and Giles 1998).
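Both estimates rest on a capture-recapture argument: if engine A and engine B index the web independently, the overlap fraction reveals the total size, N ~= |A| * |B| / |A intersect B|. A minimal sketch of the arithmetic; the index sizes below are made-up placeholders, not figures from either paper:

    use strict;
    use warnings;

    # Capture-recapture estimate: assuming independent coverage,
    # web size N ~= |A| * |B| / |A intersect B|.
    my $size_a  = 100_000_000;   # pages indexed by engine A (placeholder)
    my $size_b  =  80_000_000;   # pages indexed by engine B (placeholder)
    my $overlap =  40_000_000;   # pages indexed by both (placeholder)

    my $estimate = $size_a * $size_b / $overlap;
    printf "Estimated web size: %.0f pages\n", $estimate;   # prints 200000000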
6 Crawling the web
7 Basic principles
- The HTTP/HTML protocols
- Following hyperlinks
- Some problems:
  - Link extraction
  - Link normalization (a sketch of extraction and normalization follows this list)
  - Robot exclusion
  - Loops
  - Spider traps
  - Server overload
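A compact sketch of link extraction, normalization, and loop avoidance, using the CPAN modules HTML::LinkExtor and URI; the page content and base URL below are placeholders:

    use strict;
    use warnings;
    use HTML::LinkExtor;
    use URI;

    my %seen;       # normalized URLs already queued (avoids loops)
    my @frontier;   # URLs still to be crawled

    # Extract anchor links from a fetched page; passing $base makes
    # relative links absolute.
    sub extract_links {
        my ($html, $base) = @_;
        my @urls;
        my $parser = HTML::LinkExtor->new(
            sub {
                my ($tag, %attr) = @_;
                push @urls, $attr{href} if $tag eq 'a' && $attr{href};
            },
            $base,
        );
        $parser->parse($html);
        $parser->eof;
        return @urls;
    }

    # Normalize so trivially different spellings of the same URL
    # (case in the host, default port, #fragment) collapse together.
    sub normalize {
        my ($url) = @_;
        my $uri = URI->new($url)->canonical;
        $uri->fragment(undef);
        return $uri->as_string;
    }

    # Usage with a placeholder page: queue only unseen, normalized links.
    my $html = '<a href="/courses/650.html#syllabus">650</a>';
    for my $link (extract_links($html, 'http://www.umich.edu/')) {
        my $norm = normalize($link);
        push @frontier, $norm unless $seen{$norm}++;
    }
    print "$_\n" for @frontier;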
8 Example
- U-M's root robots.txt file:
- http://www.umich.edu/robots.txt

    User-agent: *
    Disallow: /websvcs/projects/
    Disallow: /%7Ewebsvcs/projects/
    Disallow: /homepage/
    Disallow: /%7Ehomepage/
    Disallow: /smartgl/
    Disallow: /%7Esmartgl/
    Disallow: /gateway/
    Disallow: /%7Egateway/
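A crawler can honor such a file with the CPAN module WWW::RobotRules; a minimal sketch, where the bot name is a placeholder:

    use strict;
    use warnings;
    use LWP::Simple qw(get);
    use WWW::RobotRules;

    my $rules = WWW::RobotRules->new('ExampleBot/1.0');  # our User-Agent (placeholder)

    # Fetch and parse the exclusion file for this host.
    my $robots_url = 'http://www.umich.edu/robots.txt';
    my $robots_txt = get($robots_url);
    $rules->parse($robots_url, $robots_txt) if defined $robots_txt;

    # Check candidate URLs before crawling them.
    for my $url ('http://www.umich.edu/index.html',
                 'http://www.umich.edu/homepage/secret.html') {
        print $rules->allowed($url) ? "FETCH $url\n" : "SKIP  $url\n";
    }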
9 Example crawler
- E.g., poacher
- http://search.cpan.org/neilb/Robot-0.011/examples/poacher
- /data0/projects/perltree-index
10

    ParseCommandLine();
    Initialise();
    $robot->run($siteRoot);

Initialise() - initialise global variables, contents, tables, etc. This function sets up various global variables such as the version number for WebAssay, the program name identifier, usage statement, etc.

    sub Initialise
    {
        $robot = new WWW::Robot(
            'NAME'      => $BOTNAME,
            'VERSION'   => $VERSION,
            'EMAIL'     => $EMAIL,
            'TRAVERSAL' => $TRAVERSAL,
            'VERBOSE'   => $VERBOSE,
        );
        $robot->addHook('follow-url-test',     \&follow_url_test);
        $robot->addHook('invoke-on-contents',  \&process_contents);
        $robot->addHook('invoke-on-get-error', \&process_get_error);
    }

follow_url_test() - tell the robot module whether it should follow a link

    sub follow_url_test { ... }

process_get_error() - hook function invoked whenever a GET fails

    sub process_get_error { ... }

process_contents() - process the contents of a URL we've retrieved

    sub process_contents
    {
        run_command($COMMAND, $filename) if defined $COMMAND;
        ...
    }
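The hook bodies above are elided on the slide. As an illustration only, not poacher's actual code, a follow-url-test hook might confine the crawl to HTTP URLs under the site root; the ($robot, $hook, $url) argument list is assumed from the WWW::Robot hook convention:

    # Illustrative sketch, not poacher's actual body: follow only
    # HTTP links that stay under the site root.
    sub follow_url_test
    {
        my ($robot, $hook, $url) = @_;   # assumed hook arguments
        return 0 unless $url->scheme eq 'http';
        return index($url->as_string, $siteRoot) == 0 ? 1 : 0;
    }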
11 (No transcript)
12 Focused crawling
- Topical locality
  - Pages that are linked are similar in content (and vice versa; Davison 00, Menczer 02, 04, Radev et al. 04)
- The radius-1 hypothesis
  - Given that page i is relevant to a query and that page i points to page j, then page j is also likely to be relevant (at least, more so than a random web page)
- Focused crawling
  - Keeping a priority queue of the most relevant pages (a sketch follows below)
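A minimal sketch of that best-first frontier; fetch_links and score_page are hypothetical stand-ins for the real fetcher and for whatever relevance measure the crawler uses (e.g., cosine similarity between the page text and a topic profile):

    use strict;
    use warnings;

    # Best-first frontier: a priority queue of [score, url] pairs,
    # so the crawler always expands the most relevant page next.
    my @frontier;

    sub enqueue {
        my ($score, $url) = @_;
        push @frontier, [$score, $url];
    }

    sub extract_best {
        return unless @frontier;
        @frontier = sort { $b->[0] <=> $a->[0] } @frontier;  # highest score first
        my $best = shift @frontier;
        return @$best;
    }

    # Placeholder helpers so the sketch runs; a real crawler would
    # fetch each page and score it against the topic.
    sub fetch_links { my ($url) = @_; return () }
    sub score_page  { my ($url) = @_; return rand() }

    enqueue(1.0, 'http://www.umich.edu/');
    while (my ($score, $url) = extract_best()) {
        print "crawling $url (score $score)\n";
        # Radius-1 hypothesis: links from a relevant page are
        # themselves likely relevant, so queue them by score.
        enqueue(score_page($_), $_) for fetch_links($url);
    }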