1 Information Retrieval
March 11, 2005
2 Course Information
- Instructor: Dragomir R. Radev (radev@si.umich.edu)
- Office: 3080, West Hall Connector
- Phone: (734) 615-5225
- Office hours: M 11-12, Th 12-1, or via email
- Course page: http://tangra.si.umich.edu/radev/650
- Class meets on Fridays, 2:10-4:55 PM in 409 West Hall
3 Measuring the Web
4 Bharat and Broder 1998
- Based on crawls of HotBot, AltaVista, Excite, and InfoSeek
- 10,000 queries in mid and late 1997
- Estimate is 200M pages
- Only 1.4% of pages are indexed by all of them
5 Example (from Bharat & Broder)
A similar approach by Lawrence and Giles yields
320M pages (Lawrence and Giles 1998).
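Both estimates rest on a capture-recapture argument: if engine A and engine B index the web independently, the overlap fraction reveals the total size, N ~= |A| * |B| / |A intersect B|. A minimal sketch of the arithmetic; the index sizes below are made-up placeholders, not figures from either paper:

    use strict;
    use warnings;

    # Capture-recapture estimate: assuming independent coverage,
    # web size N ~= |A| * |B| / |A intersect B|.
    my $size_a  = 100_000_000;   # pages indexed by engine A (placeholder)
    my $size_b  =  80_000_000;   # pages indexed by engine B (placeholder)
    my $overlap =  40_000_000;   # pages indexed by both (placeholder)

    my $estimate = $size_a * $size_b / $overlap;
    printf "Estimated web size: %.0f pages\n", $estimate;   # prints 200000000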
6 Crawling the web
7 Basic principles
- The HTTP/HTML protocols
- Following hyperlinks
- Some problems:
  - Link extraction
  - Link normalization (a sketch of extraction and normalization follows this list)
  - Robot exclusion
  - Loops
  - Spider traps
  - Server overload
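A compact sketch of link extraction, normalization, and loop avoidance, using the CPAN modules HTML::LinkExtor and URI; the page content and base URL below are placeholders:

    use strict;
    use warnings;
    use HTML::LinkExtor;
    use URI;

    my %seen;       # normalized URLs already queued (avoids loops)
    my @frontier;   # URLs still to be crawled

    # Extract anchor links from a fetched page; passing $base makes
    # relative links absolute.
    sub extract_links {
        my ($html, $base) = @_;
        my @urls;
        my $parser = HTML::LinkExtor->new(
            sub {
                my ($tag, %attr) = @_;
                push @urls, $attr{href} if $tag eq 'a' && $attr{href};
            },
            $base,
        );
        $parser->parse($html);
        $parser->eof;
        return @urls;
    }

    # Normalize so trivially different spellings of the same URL
    # (case in the host, default port, #fragment) collapse together.
    sub normalize {
        my ($url) = @_;
        my $uri = URI->new($url)->canonical;
        $uri->fragment(undef);
        return $uri->as_string;
    }

    # Usage with a placeholder page: queue only unseen, normalized links.
    my $html = '<a href="/courses/650.html#syllabus">650</a>';
    for my $link (extract_links($html, 'http://www.umich.edu/')) {
        my $norm = normalize($link);
        push @frontier, $norm unless $seen{$norm}++;
    }
    print "$_\n" for @frontier;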
8 Example
- U-M's root robots.txt file:
- http://www.umich.edu/robots.txt

    User-agent: *
    Disallow: /websvcs/projects/
    Disallow: /%7Ewebsvcs/projects/
    Disallow: /homepage/
    Disallow: /%7Ehomepage/
    Disallow: /smartgl/
    Disallow: /%7Esmartgl/
    Disallow: /gateway/
    Disallow: /%7Egateway/
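A crawler can honor such a file with the CPAN module WWW::RobotRules; a minimal sketch, where the bot name is a placeholder:

    use strict;
    use warnings;
    use LWP::Simple qw(get);
    use WWW::RobotRules;

    my $rules = WWW::RobotRules->new('ExampleBot/1.0');  # our User-Agent (placeholder)

    # Fetch and parse the exclusion file for this host.
    my $robots_url = 'http://www.umich.edu/robots.txt';
    my $robots_txt = get($robots_url);
    $rules->parse($robots_url, $robots_txt) if defined $robots_txt;

    # Check candidate URLs before crawling them.
    for my $url ('http://www.umich.edu/index.html',
                 'http://www.umich.edu/homepage/secret.html') {
        print $rules->allowed($url) ? "FETCH $url\n" : "SKIP  $url\n";
    }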
9 Example crawler
- E.g., poacher
- http://search.cpan.org/neilb/Robot-0.011/examples/poacher
- /data0/projects/perltree-index
10

    ParseCommandLine();
    Initialise();
    $robot->run($siteRoot);

Initialise() - initialise global variables, contents, tables, etc. This function sets up various global variables such as the version number for WebAssay, the program name identifier, usage statement, etc.

    sub Initialise
    {
        $robot = new WWW::Robot(
            'NAME'      => $BOTNAME,
            'VERSION'   => $VERSION,
            'EMAIL'     => $EMAIL,
            'TRAVERSAL' => $TRAVERSAL,
            'VERBOSE'   => $VERBOSE,
        );
        $robot->addHook('follow-url-test',     \&follow_url_test);
        $robot->addHook('invoke-on-contents',  \&process_contents);
        $robot->addHook('invoke-on-get-error', \&process_get_error);
    }

follow_url_test() - tell the robot module whether it should follow a link

    sub follow_url_test { ... }

process_get_error() - hook function invoked whenever a GET fails

    sub process_get_error { ... }

process_contents() - process the contents of a URL we've retrieved

    sub process_contents
    {
        run_command($COMMAND, $filename) if defined $COMMAND;
        ...
    }
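The hook bodies above are elided on the slide. As an illustration only, not poacher's actual code, a follow-url-test hook might confine the crawl to HTTP URLs under the site root; the ($robot, $hook, $url) argument list is assumed from the WWW::Robot hook convention:

    # Illustrative sketch, not poacher's actual body: follow only
    # HTTP links that stay under the site root.
    sub follow_url_test
    {
        my ($robot, $hook, $url) = @_;   # assumed hook arguments
        return 0 unless $url->scheme eq 'http';
        return index($url->as_string, $siteRoot) == 0 ? 1 : 0;
    }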
11 (No transcript)
12 Focused crawling
- Topical locality
  - Pages that are linked are similar in content (and vice versa; Davison 00, Menczer 02, 04, Radev et al. 04)
- The radius-1 hypothesis
  - Given that page i is relevant to a query and that page i points to page j, then page j is also likely to be relevant (at least, more so than a random web page)
- Focused crawling
  - Keeping a priority queue of the most relevant pages (a sketch follows below)
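A minimal sketch of that best-first frontier; fetch_links and score_page are hypothetical stand-ins for the real fetcher and for whatever relevance measure the crawler uses (e.g., cosine similarity between the page text and a topic profile):

    use strict;
    use warnings;

    # Best-first frontier: a priority queue of [score, url] pairs,
    # so the crawler always expands the most relevant page next.
    my @frontier;

    sub enqueue {
        my ($score, $url) = @_;
        push @frontier, [$score, $url];
    }

    sub extract_best {
        return unless @frontier;
        @frontier = sort { $b->[0] <=> $a->[0] } @frontier;  # highest score first
        my $best = shift @frontier;
        return @$best;
    }

    # Placeholder helpers so the sketch runs; a real crawler would
    # fetch each page and score it against the topic.
    sub fetch_links { my ($url) = @_; return () }
    sub score_page  { my ($url) = @_; return rand() }

    enqueue(1.0, 'http://www.umich.edu/');
    while (my ($score, $url) = extract_best()) {
        print "crawling $url (score $score)\n";
        # Radius-1 hypothesis: links from a relevant page are
        # themselves likely relevant, so queue them by score.
        enqueue(score_page($_), $_) for fetch_links($url);
    }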