CSM06 Information Retrieval

1
CSM06 Information Retrieval
  • Lecture 4: Web IR part 1
  • Dr Andrew Salway a.salway_at_surrey.ac.uk

2
Lecture 4: OVERVIEW
  • Previously we looked at IR techniques that
    indexed a document based on the words that occur
    in the document
  • Some of these techniques are applied in web
    search engines (but the vector space model, VSM,
    may not be appropriate). However, web IR can
    also exploit a distinctive feature of
    information on the web: hypertext link structure
  • Use of anchor text for indexing web pages
  • The PageRank algorithm based on link structure
    analysis
  • Other techniques for ranking web pages

3
Challenges for IR on the Web
  • High volume of information
  • Heterogeneous information (multimedia and
    multilingual)
  • Diverse users - hence diverse information needs,
    and many inexperienced users
  • Average query length 2-5 words
  • Poorly structured and low quality information

4
Scale
  • Projection of worldwide Internet population in
    2005: 1.07 billion users
    (www.clickz.com/stats/web_worldwide/)
  • Early in 2005 Google claimed to index over 8
    billion web pages; Yahoo recently claimed 19
    billion; Google now claims to index 3 times more
    than its nearest competitor
  • http://select.nytimes.com/gst/abstract.html?res=F30610F93E540C748EDDA00894DD404482
  • Given the low overlap in search engine results
    for a given query, it is likely that the total
    number of webpages is much greater than that
    indexed by any single web search engine

5
Requirements of Web Search Engine Users?
  • Fast response time
  • Some relevant results on the first page; maybe
    less concern with getting all relevant results
  • Good coverage of web, at least of important
    sites
  • Up-to-date links
  • Simple and intuitive to use: making queries and
    understanding results
  • NB. Some of these requirements contrast with
    those of expert researchers using specialist
    information retrieval systems

6
User Goals (Information Needs)
  • Queries are used to express a user's goal (or
    information need), but note that the same query
    might be used for quite different goals
  • (Rose and Levinson 2004)

7
User Goals: Rose and Levinson's classification
(2004)
  • Navigational: wanting a specific known website
  • Informational: my goal is to learn something by
    reading or viewing web pages, e.g. closed and
    open-ended questions, advice
  • Resource: my goal is to obtain a resource (not
    information) available on web pages, e.g.
    download music, interact with online shopping
    service
  • NOTE: prior to the web, most IR was concerned
    only with Informational queries

8
User Goals: Rose and Levinson's classification
(2004)
  • The more a search engine understands about a
    user's goal, the better the results it can
    provide
  • User goals may be deduced not only from the
    query, but also from:
  • The results returned by the search engine
  • Results clicked on by the user
  • Further searches / actions by the user

9
Opportunity
  • Web search engines can exploit the fact that
    information on the web is in the form of
    hypertext

10
Hypertext
  • The web is, in some senses at least,
    hypertextual, i.e. it can be viewed as networks
    of nodes (e.g. pages) and links (between pages)

11
Hypertext
  • Links suggest relatedness of topic / perhaps
    also a recommendation
  • Topological information about the hypertext graph
    gained by link structure analysis can be
    exploited for ranking

12
Use of Anchor Text (Brin and Page 1998)
  • Words in the anchor text can be used to index the
    webpage being linked to: the text in an anchor
    may give a good description of the page it points
    to, e.g.
  • <a href="www.bio.com/beckhambio.html">A Biography of David Beckham</a>
  • The words in the anchor text might be a better
    indicator of what the webpage is about than the
    words in the webpage
  • Anchor text is also good for resources like
    images that cannot be analysed as keywords
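
The idea of indexing a page by the anchor text of links that point to it can be sketched in a few lines of Python. This is only an illustration, not how any particular search engine does it: the class name AnchorTextParser and the function index_by_anchor_text are made up for this example, and the only input is the Beckham link from the slide.

```python
from collections import defaultdict
from html.parser import HTMLParser


class AnchorTextParser(HTMLParser):
    """Collects (href, anchor text) pairs from one HTML page (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.current_href = None   # href of the <a> tag we are inside, if any
        self.current_text = []     # text fragments seen inside that tag
        self.anchors = []          # finished (href, anchor text) pairs

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.current_href = dict(attrs).get("href")
            self.current_text = []

    def handle_data(self, data):
        if self.current_href is not None:
            self.current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self.current_href is not None:
            text = " ".join("".join(self.current_text).split())
            self.anchors.append((self.current_href, text))
            self.current_href = None


def index_by_anchor_text(html_pages):
    """Map each anchor-text word to the set of pages that links point TO."""
    index = defaultdict(set)
    for html in html_pages:
        parser = AnchorTextParser()
        parser.feed(html)
        for href, text in parser.anchors:
            for word in text.lower().split():
                index[word].add(href)
    return index


page = '<p><a href="www.bio.com/beckhambio.html">A Biography of David Beckham</a></p>'
index = index_by_anchor_text([page])
print(sorted(index["beckham"]))
```

Note that the target page is indexed under "biography", "david" and "beckham" without its own content ever being read, which is why the same approach also works for images and other resources with no text of their own.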

13
PageRank (Brin and Page 1998)
  • Google makes use of both link structure and
    anchor text
  • The citation (link) graph of the web is an
    important resource that has largely gone unused
    in existing web search engines
  • PageRank is an objective measure of a web
    page's citation importance that corresponds well
    with people's subjective idea of importance

14
Calculating PageRank
  • PR(A) = (1-d) + d (PR(T1)/C(T1) + ... +
    PR(Tn)/C(Tn))
  • PR(A): PageRank of webpage A
  • C(A): the number of links out of webpage A
  • T1...Tn: the webpages that point to webpage A
  • d: a damping factor set between 0 and 1
  • In reality, the calculation of PageRank is
    iterative
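
A minimal sketch of the iterative calculation, applying PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) repeatedly to every page until the values settle. Assumptions made for this example only: the three-page link graph is invented, every page starts with a PageRank of 1, the number of iterations is fixed at 50, and d defaults to 0.85 (the value Brin and Page report usually using). Pages with no outgoing links get no special treatment here.

```python
def pagerank(links, d=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    # C(T): number of distinct links out of each page
    out_count = {page: len(set(links.get(page, []))) for page in pages}

    # For each page A, the pages T1...Tn that point to it
    incoming = {page: [] for page in pages}
    for source, targets in links.items():
        for target in set(targets):
            incoming[target].append(source)

    pr = {page: 1.0 for page in pages}  # assumed starting value
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            total = sum(pr[t] / out_count[t] for t in incoming[page])
            new_pr[page] = (1 - d) + d * total
        pr = new_pr
    return pr


# Hypothetical graph: A links to B and C, B links to C, C links to A
example_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
for page, rank in sorted(pagerank(example_links).items()):
    print(page, round(rank, 3))
```

In this graph C is pointed to by both A and B, so it ends up with a higher PageRank than B, which is pointed to only by A and receives just half of A's value.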

15
Web-adjacency Analysis (a similar idea to
PageRank)
  • Kleinberg and colleagues proposed a method for
    identifying authoritative web-pages
  • Identify set of relevant pages (as normal)
  • Identify those with a large in-degree, i.e. lots
    of pages point to them (cf. impact)
  • Ensure that the authorities selected are referred
    to by a number of the same hubs, i.e. those with
    a large out-degree

16
Web-adjacency Analysis
  • Hubs and authorities exhibit what could be
    called a mutually reinforcing relationship
    (Kleinberg 1998)
  • Computing authority and hub values for web-pages
    is an iterative process over a graph, where each
    node is a web-page
  • Two weights are given to each node, relating to
    in-degree (authority) and out-degree (hub); the
    total in-degree weights and total out-degree
    weights are kept constant
  • Weights are modified each iteration depending on
    weights of connected nodes
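
The mutually reinforcing update of authority and hub weights can be sketched as follows. This is a simplified illustration in the spirit of Kleinberg (1998), not a reproduction of his full method: it skips the step of first selecting a set of relevant pages, the link graph is invented, the iteration count is an assumption, and the totals are kept constant by rescaling so that the squared weights sum to 1.

```python
import math


def normalise(weights):
    """Rescale weights so that their squares sum to 1 (keeps totals constant)."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return dict(weights)
    return {page: w / norm for page, w in weights.items()}


def hubs_and_authorities(links, iterations=20):
    """links maps each page to the list of pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}
    for _ in range(iterations):
        # Authority weight: sum of hub weights of the pages pointing to it
        auth = {page: 0.0 for page in pages}
        for source, targets in links.items():
            for target in set(targets):
                auth[target] += hub[source]
        # Hub weight: sum of authority weights of the pages it points to
        hub = {page: sum(auth[t] for t in set(links.get(page, [])))
               for page in pages}
        auth = normalise(auth)
        hub = normalise(hub)
    return auth, hub


# Hypothetical graph: h1 and h2 link to both a1 and a2, h3 links only to a2
example_links = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a2"]}
auth, hub = hubs_and_authorities(example_links)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

Here a2 becomes the strongest authority because all three hubs point to it, and h1 and h2 become stronger hubs than h3 because they point to both authorities.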

17
Some other Factors used to rank Web Pages (Hock
2001)
  • Popularity of the Page: measured either by how
    many other web-pages link to it, or by how many
    people have clicked on it when they had the same
    query
  • Frequency of search terms: need to consider
    length of the document, and web-page authors'
    attempts to affect ranking by deliberate
    repetition
  • Number of query terms matched: but remember many
    queries are only one or two words

18
Other Factors (continued)
  • Rarity of terms: rank pages containing rare
    search terms more highly (cf. TF-IDF)
  • Weighting by Field: give high ranking to pages
    including search terms in important fields, e.g.
    Title
  • Proximity of Terms: rank pages more highly if
    search terms occur near one another
  • Order of Query Terms: give priority to pages
    containing the search term entered first
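
To make some of these factors concrete, here is a toy scoring function that combines term frequency (normalised by document length), rarity of terms, weighting by field (Title) and proximity of terms. Everything specific in it is an assumption for illustration: the function names, the weights title_weight and proximity_weight, the way the factors are added together, and the example pages and document frequencies. Popularity and order of query terms are left out to keep it short.

```python
import math


def min_gap(words, first, second):
    """Smallest distance in words between any occurrence of the two terms."""
    pos_first = [i for i, w in enumerate(words) if w == first]
    pos_second = [i for i, w in enumerate(words) if w == second]
    if not pos_first or not pos_second:
        return None
    return min(abs(i - j) for i in pos_first for j in pos_second)


def score(page, query, doc_freq, n_docs, title_weight=2.0, proximity_weight=1.0):
    """page is a dict with 'title' and 'body' strings; weights are hypothetical."""
    terms = query.lower().split()
    body = page["body"].lower().split()
    title = page["title"].lower().split()
    total = 0.0
    for term in terms:
        # Rarity: terms found in few documents count for more (cf. TF-IDF)
        rarity = math.log(1 + n_docs / (1 + doc_freq.get(term, 0)))
        # Frequency, divided by document length
        frequency = body.count(term) / max(len(body), 1)
        total += rarity * frequency
        # Weighting by field: bonus if the term is in the title
        if term in title:
            total += title_weight * rarity
    # Proximity: bonus when consecutive query terms occur near one another
    for first, second in zip(terms, terms[1:]):
        gap = min_gap(body, first, second)
        if gap:
            total += proximity_weight / gap
    return total


doc_freq = {"david": 50, "beckham": 5}  # invented document frequencies
page_a = {"title": "David Beckham biography",
          "body": "a biography of david beckham the footballer"}
page_b = {"title": "Football news",
          "body": "david scored today and later a beckham biography was mentioned"}
print(score(page_a, "david beckham", doc_freq, 1000) >
      score(page_b, "david beckham", doc_freq, 1000))
```

Dividing frequency by document length is one simple way to limit the effect of long documents, but it does not by itself protect against deliberate repetition of a term by web-page authors.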

19
Set Reading for Lecture 4
  • Brin and Page (1998), The Anatomy of a
    Large-Scale Hypertextual Web Search Engine.
    SECTIONS 1 and 2. Explains Google's use of
    anchor text and PageRank.
  • www-db.stanford.edu/backrub/google.html
  • Hock (2001), The extreme searcher's guide to web
    search engines, pages 25-31. Gives an overview
    of some factors used by web search engines to
    rank webpages. AVAILABLE in Main Library
    collection and in Library Article Collection.

20
Exercise
  • Explore the idea of PageRank using an online
    PageRank calculator, e.g.
  • www.markhorrell.com/seo/pagerank.shtml
  • OR
  • www.webworkshop.net/pagerank_calculator.php3

21
Further Reading
  • Rose and Levinson (2004), Understanding User
    Goals in Web Search, 13th International WWW
    Conference, 2004.
    www.sims.berkeley.edu/courses/is141/f05/readings/rose_www04.pdf
  • Page, Brin, Motwani and Winograd (1999), The
    PageRank Citation Ranking: Bringing Order to the
    Web. http://dbpubs.stanford.edu:8090/pub/1999-66
  • Belew (2000), Finding Out About, pages 195-199
    for an overview of Kleinberg's work on
    web-adjacency analysis and authorities and hubs.
  • Kleinberg (1998), Authoritative Sources in a
    Hyperlinked Environment, Journal of the ACM.
    http://citeseer.nj.nec.com/87928.html
  • Kobayashi and Takeda (2000), Information
    Retrieval on the Web, ACM Computing Surveys
    32(2), pp. 144-173. AVAILABLE IN LIBRARY /
    ARTICLE COLLECTION. This comprehensive article
    reviews a lot of the ideas covered so far in this
    module and discusses them in the context of Web
    IR. NOTE: it is already a little out of date in
    places because of the rapid changes of the Web.

22
Lecture 4: LEARNING OUTCOMES
  • After this lecture you should be able to:
  • Explain how the challenges of web IR are
    different from those facing the developers of
    traditional IR systems
  • Explain how web search engines can exploit the
    hypertext structure of the web to index and rank
    web pages, e.g. using Anchor Text, and PageRank
  • Explain how PageRank is calculated
  • Discuss and critique a range of factors used by
    web search engines to rank web pages

23
Reading ahead for LECTURE 5
  • If you want to read about next week's lecture
    topics, see:
  • Dean and Henzinger (1999), Finding Related Pages
    in the World Wide Web. Pages 1-10.
  • http://citeseer.ist.psu.edu/dean99finding.html
  • Agichtein, Lawrence and Gravano (2001), Learning
    Search Engine Specific Query Transformations for
    Question Answering, Procs. 10th International
    WWW Conference. Section 1 and Section 3
  • www.cs.columbia.edu/eugene/papers/www10.pdf
  • Oppenheim, Morris and McKnight (2000), The
    Evaluation of WWW Search Engines, Journal of
    Documentation, 56(2). Pages 194-205. In Library
    Article Collection.