Title: Information Retrieval (9)
1. Information Retrieval (9)
- Prof. Dragomir R. Radev
- radev_at_umich.edu
2. IR Winter 2010
14. Webometrics: the Bow-tie model
3. Brief history of the Web
- FTP/Gopher
- WWW (1989)
- Archie (1990)
- Mosaic (1993)
- Webcrawler (1994)
- Lycos (1994)
- Yahoo! (1994)
- Google (1998)
4. Size
- The Web is the largest repository of data, and it grows exponentially.
- 320 million Web pages (Lawrence and Giles, 1998)
- 800 million Web pages, 15 TB (Lawrence and Giles, 1999)
- 20 billion Web pages indexed now
- Amount of data: roughly 200 TB (Lyman et al., 2003)
5. Zipfian properties
- In-degree
- Out-degree
- Visits to a page (a slope-check sketch follows this list)
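These quantities are heavy-tailed, Zipf-like distributions. As a rough illustration (not from the lecture), the Python sketch below fits the slope of log(frequency) vs. log(rank) by least squares; the in-degree counts are invented for the example.

import math

def zipf_slope(counts):
    # Sort counts by rank and fit a least-squares line to
    # log(count) versus log(rank); a slope near -1 is the classic Zipf case.
    freqs = sorted(counts, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Hypothetical in-degree counts for ten pages, roughly proportional to 1/rank.
in_degrees = [900, 450, 300, 225, 180, 150, 129, 113, 100, 90]
print(zipf_slope(in_degrees))  # close to -1 for Zipf-like data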
6. Bow-tie model of the Web
(Figure: bow-tie structure of the Web, with component sizes SCC 56M, IN 44M, OUT 44M, TENDRILS 44M, DISC 17M; about 24% of pages are reachable from a given page)
Broder et al., WWW 2000; Dill et al., VLDB 2001
7. Measuring the size of the web
- Using extrapolation methods
- Random queries and their coverage by different search engines
- Overlap between search engines
- HTTP requests to random IP addresses
8Bharat and Broder 1998
- Based on crawls of HotBot, Altavista, Excite, and
InfoSeek - 10,000 queries in mid and late 1997
- Estimate is 200M pages
- Only 1.4 are indexed by all of them
9. Example (from Bharat & Broder)
A similar approach by Lawrence and Giles yields 320M pages (Lawrence and Giles 1998). The arithmetic behind the overlap estimate is sketched below.
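The overlap-based extrapolation boils down to a capture-recapture estimate. The sketch below uses invented index sizes (not Bharat and Broder's actual figures) to show the arithmetic: if two engines index pages roughly independently, the overlap fraction lets you scale up to the whole indexable Web.

def estimate_web_size(size_a, size_b, overlap):
    # Capture-recapture: if A and B index pages independently, then
    # |A intersect B| / |B| is roughly |A| / |Web|,
    # hence |Web| is roughly |A| * |B| / |A intersect B|.
    return size_a * size_b / overlap

# Invented numbers for illustration: engine A indexes 110M pages,
# engine B indexes 140M, and query sampling suggests a 77M-page overlap.
print(estimate_web_size(110e6, 140e6, 77e6))  # about 200 million pages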
10. What makes Web IR different?
- Much bigger
- No fixed document collection
- Users
- Non-human users
- Varied user base
- Miscellaneous user needs
- Dynamic content
- Evolving content
- Spam
- Infinite size: the size is whatever can be indexed!
11. IR Winter 2010
15. Crawling the Web: Hypertext retrieval, Web-based IR, Document closures, Focused crawling
12. Web crawling
- The HTTP/HTML protocols
- Following hyperlinks
- Some problems (addressed in the sketch after this list)
- Link extraction
- Link normalization
- Robot exclusion
- Loops
- Spider traps
- Server overload
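A minimal sketch of these steps using only the Python standard library (the function and class names are my own, not from the lecture): it extracts links from anchor tags, normalizes them with urljoin and urldefrag, checks robots.txt before fetching, and keeps a seen set to avoid loops.

import urllib.parse
import urllib.request
import urllib.robotparser
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    # Link extraction: collect href attributes from <a> tags.
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, max_pages=50, agent="ExampleBot/0.1"):
    robots = {}                      # one robots.txt parser per host
    seen, frontier = {seed}, [seed]  # the 'seen' set prevents loops
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.pop(0)
        parts = urllib.parse.urlsplit(url)
        if parts.netloc not in robots:
            rp = urllib.robotparser.RobotFileParser(
                f"{parts.scheme}://{parts.netloc}/robots.txt")
            try:
                rp.read()            # robot exclusion: fetch and parse robots.txt
            except OSError:
                pass                 # unreachable robots.txt: parser stays conservative
            robots[parts.netloc] = rp
        if not robots[parts.netloc].can_fetch(agent, url):
            continue
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                html = resp.read().decode("utf-8", errors="replace")
        except OSError:
            continue                 # failed GET: skip the page
        pages[url] = html
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            # Link normalization: resolve relative URLs, drop fragments.
            absolute, _ = urllib.parse.urldefrag(urllib.parse.urljoin(url, href))
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return pages

A production crawler would additionally throttle per-server request rates and bound URL depth, which is how the spider-trap and server-overload problems above are usually handled.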
13. Example
- U-M's root robots.txt file (a parsing sketch follows this listing)
- http://www.umich.edu/robots.txt
- User-agent: *
- Disallow: /websvcs/projects/
- Disallow: /%7Ewebsvcs/projects/
- Disallow: /homepage/
- Disallow: /%7Ehomepage/
- Disallow: /smartgl/
- Disallow: /%7Esmartgl/
- Disallow: /gateway/
- Disallow: /%7Egateway/
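A small check of rules like these with Python's standard urllib.robotparser; the user-agent string and the URLs tested are made up for the example, and the rules are parsed directly rather than fetched from the live site.

from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /websvcs/projects/
Disallow: /%7Ewebsvcs/projects/
Disallow: /homepage/
""".splitlines()

rp = RobotFileParser()
rp.parse(rules)   # parse the rule lines instead of fetching the file

# Paths under a disallowed prefix are blocked for every user agent;
# everything else is allowed.
print(rp.can_fetch("ExampleBot/0.1", "http://www.umich.edu/websvcs/projects/x.html"))  # False
print(rp.can_fetch("ExampleBot/0.1", "http://www.umich.edu/index.html"))               # True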
14. Example crawler
- E.g., poacher
- http://search.cpan.org/~neilb/Robot-0.011/examples/poacher
- Included in clairlib
15. Poacher (excerpt)

ParseCommandLine();
Initialise();
$robot->run($siteRoot);

# Initialise() - initialise global variables, contents, tables, etc.
# This function sets up various global variables such as the version number
# for WebAssay, the program name identifier, usage statement, etc.
sub Initialise
{
    $robot = new WWW::Robot(
        'NAME'      => $BOTNAME,
        'VERSION'   => $VERSION,
        'EMAIL'     => $EMAIL,
        'TRAVERSAL' => $TRAVERSAL,
        'VERBOSE'   => $VERBOSE,
    );
    $robot->addHook('follow-url-test',     \&follow_url_test);
    $robot->addHook('invoke-on-contents',  \&process_contents);
    $robot->addHook('invoke-on-get-error', \&process_get_error);
}

# follow_url_test() - tell the robot module whether it should follow a link
sub follow_url_test { ... }

# process_get_error() - hook function invoked whenever a GET fails
sub process_get_error { ... }

# process_contents() - process the contents of a URL we've retrieved
sub process_contents
{
    ...
    run_command($COMMAND, $filename) if defined $COMMAND;
}
16. Focused crawling
- Topical locality
- Pages that are linked are similar in content (and vice versa; Davison 00, Menczer 02, 04, Radev et al. 04)
- The radius-1 hypothesis
- Given that page i is relevant to a query and that page i points to page j, then page j is also likely to be relevant (at least, more so than a random web page)
- Focused crawling
- Keeping a priority queue of the most relevant pages (a sketch follows this list)
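A sketch of that priority-queue frontier; the relevance, fetch, and extract_links callables are placeholders I am assuming, not part of the lecture.

import heapq

def focused_crawl(seeds, relevance, fetch, extract_links, budget=1000):
    # Best-first crawling: keep the frontier in a priority queue ordered by
    # estimated relevance, so the most promising URLs are fetched first.
    frontier = [(-relevance(url), url) for url in seeds]  # negate for a max-heap
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < budget:
        neg_score, url = heapq.heappop(frontier)
        page = fetch(url)
        if page is None:
            continue
        visited.append((url, -neg_score))
        # Radius-1 hypothesis: links on a relevant page are likely to lead
        # to relevant pages, so enqueue them with their estimated relevance.
        for link in extract_links(page):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-relevance(link), link))
    return visited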
17. Challenges in indexing the web
- Page importance varies a lot
- Anchor text
- User modeling
- Detecting duplicates
- Dealing with spam (content-based and link-based)
18. Duplicate detection
- Shingles
- TO BE OR
- BE OR NOT
- OR NOT TO
- NOT TO BE
- Then use the Jaccard coefficient (size of intersection / size of union) to determine similarity (see the sketch below)
- Hashing
- Shingling (separate lecture)
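A short sketch of word shingling and the Jaccard coefficient; with k = 3 the shingles of "to be or not to be" are exactly the four listed above, while the second, longer sentence is my own example.

def shingles(text, k=3):
    # k-word shingles, e.g. "to be or not to be" with k=3 gives
    # {"to be or", "be or not", "or not to", "not to be"}.
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    # Jaccard coefficient: size of intersection over size of union.
    return len(a & b) / len(a | b) if a | b else 0.0

print(jaccard(shingles("to be or not to be"),
              shingles("to be or not to be that is the question")))  # 0.5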
19. Document closures for QA
(Figure: document closure example with nodes labeled "spain", "Madrid", "capital" and edges labeled L, P, P)
20. Document closures for IR
(Figure: document closure example with nodes labeled "University of Michigan", "Physics Department", "Michigan", "Physics" and edges labeled L, P, P)
21. The link-content hypothesis
- Topical locality: a page is similar (?) to the page that points to it (?)
- Davison (TF-IDF, 100K pages; a similarity sketch follows this list)
- 0.31 same domain
- 0.23 linked pages
- 0.19 sibling
- 0.02 random
- Menczer (373K pages, non-linear least squares fit)
- Chakrabarti (focused crawling): prob. of losing the topic
- ?1 = 1.8, ?2 = 0.6
Van Rijsbergen 1979; Chakrabarti et al., WWW 1999; Davison, SIGIR 2000; Menczer 2001
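Davison's measurements used TF-IDF weighted vectors; the sketch below illustrates the same idea with plain term-frequency vectors and made-up page snippets, just to show why a page and a page it links to tend to score higher than a random pair.

import math
from collections import Counter

def cosine(a, b):
    # Cosine similarity between two bag-of-words term-frequency vectors.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Invented page snippets: a page, a page it links to, and an unrelated page.
page = Counter("physics department university of michigan research".split())
linked = Counter("michigan physics graduate research admissions".split())
random_page = Counter("recipe chocolate cake flour sugar".split())

print(cosine(page, linked))       # noticeably higher (about 0.55) ...
print(cosine(page, random_page))  # ... than for the unrelated page (0.0)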