Information Retrieval (9)

1
Information Retrieval (9)
  • Prof. Dragomir R. Radev
  • radev_at_umich.edu

2
IR Winter 2010
14. Webometrics
The Bow-tie model
3
Brief history of the Web
  • FTP/Gopher
  • WWW (1989)
  • Archie (1990)
  • Mosaic (1993)
  • Webcrawler (1994)
  • Lycos (1994)
  • Yahoo! (1994)
  • Google (1998)

4
Size
  • The Web is the largest repository of data and it
    grows exponentially.
  • 320 million Web pages (Lawrence and Giles 1998)
  • 800 million Web pages, 15 TB (Lawrence and Giles
    1999)
  • 20 billion Web pages indexed now
  • Amount of data
  • roughly 200 TB (Lyman et al. 2003)

5
Zipfian properties
  • In-degree
  • Out-degree
  • Visits to a page

6
Bow-tie model of the Web
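[Figure: bow-tie diagram of the Web, with region sizes in millions of pages]
  • SCC: 56M
  • IN: 44M
  • OUT: 44M
  • TEND (tendrils): 44M
  • DISC (disconnected): 17M
  • 24% of pages reachable from a given page
(Broder et al. WWW 2000, Dill et al. VLDB 2001)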
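
The bow-tie regions can be recovered from a link graph with two reachability searches, one forward and one backward, starting from a page known to lie in the core. The sketch below is only an illustration: the toy edge list, the function names, and the choice of seed are assumptions, and everything outside SCC, IN, and OUT is lumped into a single "OTHER" bucket (tendrils, tubes, and disconnected pages).

```python
from collections import deque

def reachable(graph, start):
    """Return the set of nodes reachable from start by following edges in graph."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def bow_tie(edges, seed):
    """Split nodes into SCC, IN, OUT and OTHER relative to the SCC containing seed."""
    forward, backward, nodes = {}, {}, set()
    for src, dst in edges:
        forward.setdefault(src, []).append(dst)
        backward.setdefault(dst, []).append(src)
        nodes.update((src, dst))
    fwd = reachable(forward, seed)    # pages the seed can reach
    bwd = reachable(backward, seed)   # pages that can reach the seed
    scc = fwd & bwd
    return {
        "SCC": scc,
        "IN": bwd - scc,
        "OUT": fwd - scc,
        "OTHER": nodes - fwd - bwd,   # tendrils, tubes, disconnected pages
    }

# Hypothetical toy graph: a -> b <-> c -> d, plus a separate pair e -> f
edges = [("a", "b"), ("b", "c"), ("c", "b"), ("c", "d"), ("e", "f")]
parts = bow_tie(edges, seed="b")
```

On this toy graph, b and c form the core, a feeds into it, d is fed by it, and e and f fall outside the bow tie.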
7
Measuring the size of the web
  • Using extrapolation methods
  • Random queries and their coverage by different
    search engines
  • Overlap between search engines
  • HTTP requests to random IP addresses
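
The overlap method can be illustrated with a capture-recapture style estimate: if engine B covers a fraction f of pages sampled from engine A, and the two indexes are roughly independent, the total size is about |B| / f. The sketch below assumes that independence; the function name and the toy indexes are invented for illustration and are not taken from the cited studies.

```python
def estimate_total_size(sample_from_a, index_b):
    """Estimate the size of the whole collection from the overlap of two indexes.

    sample_from_a: pages sampled from engine A
    index_b: set of pages indexed by engine B
    Assumes the two engines index (roughly) independent subsets.
    """
    hits = sum(1 for page in sample_from_a if page in index_b)
    if hits == 0:
        raise ValueError("no overlap observed; cannot form an estimate")
    coverage_of_b = hits / len(sample_from_a)   # estimates |B| / N
    return len(index_b) / coverage_of_b

# Hypothetical universe of 100 pages numbered 0..99
index_a = {i for i in range(100) if i % 2 == 0}   # engine A: 50 pages
index_b = {i for i in range(100) if i % 5 == 0}   # engine B: 20 pages
estimate = estimate_total_size(sorted(index_a), index_b)
```

Here B covers one fifth of A's pages, so the estimate comes out at about 100 pages, which matches the toy universe. Real engines are not independent samplers, so actual estimates are biased.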

8
Bharat and Broder 1998
  • Based on crawls of HotBot, Altavista, Excite, and
    InfoSeek
  • 10,000 queries in mid and late 1997
  • Estimate is 200M pages
  • Only 1.4% are indexed by all of them

9
Example (from Bharat and Broder)
A similar approach by Lawrence and Giles yields
320M pages (Lawrence and Giles 1998).
10
What makes Web IR different?
  • Much bigger
  • No fixed document collection
  • Users
  • Non-human users
  • Varied user base
  • Miscellaneous user needs
  • Dynamic content
  • Evolving content
  • Spam
  • Infinite-sized: size is whatever can be indexed!

11
IR Winter 2010
15. Crawling the Web
Hypertext retrieval
Web-based IR
Document closures
Focused crawling
12
Web crawling
  • The HTTP/HTML protocols
  • Following hyperlinks
  • Some problems
  • Link extraction
  • Link normalization
  • Robot exclusion
  • Loops
  • Spider traps
  • Server overload
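
Link extraction and link normalization can be sketched with the Python standard library. This is a minimal illustration, not a full crawler: the class and function names, the example.com URLs, and the particular normalization rules (resolve relative links, drop fragments, lowercase scheme and host, skip non-HTTP links) are assumptions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlsplit, urlunsplit

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every <a> tag."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def normalize(base_url, href):
    """Resolve href against base_url, drop the fragment, lowercase scheme and host."""
    absolute, _fragment = urldefrag(urljoin(base_url, href))
    parts = urlsplit(absolute)
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, parts.query, ""))

def extract_links(base_url, html):
    """Return the distinct normalized HTTP(S) links found in html, in page order."""
    parser = LinkExtractor()
    parser.feed(html)
    seen, links = set(), []
    for href in parser.hrefs:
        url = normalize(base_url, href)
        if url.startswith(("http://", "https://")) and url not in seen:
            seen.add(url)
            links.append(url)
    return links

# Hypothetical page: two spellings of the same URL plus a non-HTTP link
page = ('<a href="/a/../b.html#top">B</a> '
        '<a href="HTTP://Example.COM/b.html">B again</a> '
        '<a href="mailto:x@example.com">mail</a>')
links = extract_links("http://example.com/dir/index.html", page)
```

Both anchors collapse to the same normalized URL, which is how a crawler avoids fetching one page twice and helps it stay out of simple loops.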

13
Example
  • U-M's root robots.txt file
  • http://www.umich.edu/robots.txt
  • User-agent: *
  • Disallow: /websvcs/projects/
  • Disallow: /%7Ewebsvcs/projects/
  • Disallow: /homepage/
  • Disallow: /%7Ehomepage/
  • Disallow: /smartgl/
  • Disallow: /%7Esmartgl/
  • Disallow: /gateway/
  • Disallow: /%7Egateway/
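
A crawler can check rules like these before fetching a page. The sketch below uses Python's standard urllib.robotparser on a subset of the rules above, parsed from a string so that nothing is fetched over the network. The "User-agent: *" line, the "ExampleBot" name, and the test URLs are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Subset of the rules from the slide; "User-agent: *" is assumed
rules = """\
User-agent: *
Disallow: /websvcs/projects/
Disallow: /homepage/
Disallow: /smartgl/
Disallow: /gateway/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)   # parse the lines directly instead of downloading them

# "ExampleBot" is a hypothetical crawler name
blocked = parser.can_fetch("ExampleBot", "http://www.umich.edu/homepage/index.html")
allowed = parser.can_fetch("ExampleBot", "http://www.umich.edu/index.html")
```

With these rules, the first URL falls under a Disallow prefix and the second does not.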

14
Example crawler
  • E.g., poacher
  • http://search.cpan.org/~neilb/Robot-0.011/examples/poacher
  • Included in clairlib

15
ParseCommandLine();
Initialise();
$robot->run($siteRoot);

# Initialise() - initialise global variables, contents, tables, etc.
# This function sets up various global variables such as the version
# number for WebAssay, the program name identifier, usage statement, etc.
sub Initialise
{
    $robot = new WWW::Robot(
        'NAME'      => $BOTNAME,
        'VERSION'   => $VERSION,
        'EMAIL'     => $EMAIL,
        'TRAVERSAL' => $TRAVERSAL,
        'VERBOSE'   => $VERBOSE,
    );
    $robot->addHook('follow-url-test',     \&follow_url_test);
    $robot->addHook('invoke-on-contents',  \&process_contents);
    $robot->addHook('invoke-on-get-error', \&process_get_error);
}

# follow_url_test() - tell the robot module whether it should follow a link
sub follow_url_test { }       # body not shown on the slide

# process_get_error() - hook function invoked whenever a GET fails
sub process_get_error { }     # body not shown on the slide

# process_contents() - process the contents of a URL we've retrieved
sub process_contents
{
    run_command($COMMAND, $filename) if defined $COMMAND;
}
16
Focused crawling
  • Topical locality
  • Pages that are linked are similar in content (and
    vice versa) (Davison 00, Menczer 02, 04, Radev et
    al. 04)
  • The radius-1 hypothesis
  • given that page i is relevant to a query and that
    page i points to page j, then page j is also
    likely to be relevant (at least, more so than a
    random web page)
  • Focused crawling
  • Keeping a priority queue of the most relevant
    pages
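
The priority-queue idea can be sketched on a toy in-memory "web". Under the radius-1 hypothesis, a link inherits its priority from the relevance of the page it was found on. Everything in this sketch is assumed for illustration: the term-overlap relevance score, the page names, and the page texts.

```python
import heapq

def relevance(text, topic_terms):
    """Toy relevance score: fraction of topic terms that occur in the text."""
    words = set(text.lower().split())
    return sum(term in words for term in topic_terms) / len(topic_terms)

def focused_crawl(web, seed, topic_terms, max_pages):
    """Visit pages in order of the relevance of the page that linked to them."""
    frontier = [(-1.0, seed)]          # max-heap via negated priority
    visited, order = set(), []
    while frontier and len(order) < max_pages:
        _neg_priority, url = heapq.heappop(frontier)
        if url in visited or url not in web:
            continue
        visited.add(url)
        order.append(url)
        text, links = web[url]
        score = relevance(text, topic_terms)
        for link in links:
            if link not in visited:
                # radius-1 hypothesis: a link is as promising as its source page
                heapq.heappush(frontier, (-score, link))
    return order

# Hypothetical toy web: page name -> (page text, outgoing links)
web = {
    "seed":    ("web crawling and search", ["sports", "crawler"]),
    "sports":  ("football scores today", ["tennis"]),
    "crawler": ("focused crawling of the web", ["ir"]),
    "tennis":  ("tennis results", []),
    "ir":      ("web search and crawling research", []),
}
order = focused_crawl(web, "seed", ["web", "crawling"], max_pages=4)
```

The link found on the off-topic "sports" page gets the lowest priority, so "tennis" is never fetched within the page budget.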

17
Challenges in indexing the web
  • Page importance varies a lot
  • Anchor text
  • User modeling
  • Detecting duplicates
  • Dealing with spam (content-based and link-based)

18
Duplicate detection
  • Shingles
  • TO BE OR
  • BE OR NOT
  • OR NOT TO
  • NOT TO BE
  • Then use the Jaccard coefficient (size of
    intersection/size of union) to determine
    similarity
  • Hashing
  • Shingling (separate lecture)
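
The shingle example above can be worked through in a few lines. This is a minimal sketch assuming word-level shingles of length 3 and plain set operations; the function names and the second document are invented for illustration, and no hashing step is shown.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (as tuples) in text."""
    words = text.upper().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard coefficient: size of intersection divided by size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

doc1 = shingles("to be or not to be")
doc2 = shingles("to be or not to be that is the question")
similarity = jaccard(doc1, doc2)
```

The first document yields the four shingles listed above. The second contains those four plus four more, so the Jaccard coefficient is 4/8 = 0.5.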

19
Document closures for QA
[Figure: document closure diagram with nodes labeled "spain", "capital", and "Madrid"; legend entries L and P]
20
Document closures for IR
[Figure: document closure diagram with nodes labeled "University of Michigan", "Physics Department", "Michigan", and "Physics"; legend entries L and P]
21
The link-content hypothesis
  • Topical locality: a page is similar (σ) to the page
    that points to it (δ).
  • Davison (TFIDF, 100K pages)
  • 0.31 same domain
  • 0.23 linked pages
  • 0.19 sibling
  • 0.02 random
  • Menczer (373K pages, non-linear least squares
    fit)
  • Chakrabarti (focused crawling) - prob. of losing
    the topic

α1 = 1.8, α2 = 0.6
(Van Rijsbergen 1979, Chakrabarti et al. WWW 1999,
Davison SIGIR 2000, Menczer 2001)