Title: CSM06 Information Retrieval
1CSM06 Information Retrieval
- Lecture 4 Web IR part 1
- Dr Andrew Salway a.salway_at_surrey.ac.uk
2Lecture 4 OVERVIEW
- Previously we looked at IR techniques that
indexed a document based on the words that occur
in the document
- Some of these techniques are applied in web
search engines (but VSM may not be appropriate).
However, web IR can also exploit a distinctive
feature of information on the web: hypertext
link structure
- Use of anchor text for indexing web pages
- The PageRank algorithm, based on link structure
analysis
- Other techniques for ranking web pages
3Challenges for IR on the Web
- High volume of information
- Heterogeneous information (multimedia and
multilingual)
- Diverse users, hence diverse information needs,
and many inexperienced users
- Average query length: 2-5 words
- Poorly structured and low-quality information
4Scale
- Projection of worldwide Internet population in
2005: 1.07 billion users, www.clickz.com/stats/web_worldwide/
- Early in 2005 Google claimed to index over 8
billion web pages; Yahoo recently claimed 19
billion; Google now claims to index 3 times more
than its nearest competitor
- http://select.nytimes.com/gst/abstract.html?res=F30610F93E540C748EDDA00894DD404482
- Given the low overlap in search engine results
for a given query, it is likely that the total
number of webpages is much greater than that
indexed by any single web search engine
5Requirements of Web Search Engine Users?
- Fast response time
- Some relevant results in the first page; maybe less
concern with getting all relevant results
- Good coverage of web, at least of important
sites
- Up-to-date links
- Simple and intuitive to use: making queries and
understanding results
- NB. Some of these requirements contrast with
those of expert researchers using specialist
information retrieval systems
6User Goals (Information Needs)
- Queries are used to express a user's goal (or
information need), but note that the same query
might be used for quite different goals
- (Rose and Levinson 2004)
7User Goals: Rose and Levinson's classification
(2004)
- Navigational: wanting a specific known website
- Informational: my goal is to learn something by
reading or viewing web pages, e.g. closed and
open-ended questions, advice
- Resource: my goal is to obtain a resource (not
information) available on web pages, e.g.
download music, interact with online shopping
service
- NOTE: prior to the web most IR was concerned only
with Informational queries
8User Goals: Rose and Levinson's classification
(2004)
- The more a search engine understands about a
user's goal, the better the results it can
provide
- User goals may be deduced not only from the
query, but also from:
- The results returned by the search engine
- Results clicked on by the user
- Further searches / actions by the user
9Opportunity
- Web search engines can exploit the fact that
information on the web is in the form of
hypertext
10Hypertext
- The web is, in some senses at least,
hypertextual, i.e. it can be viewed as networks
of nodes (e.g. pages) and links (between pages)
11Hypertext
- Links suggest relatedness of topic, and perhaps
also a recommendation
- Topological information about the hypertext graph,
gained by link structure analysis, can be
exploited for ranking
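The view of the web as nodes and links can be sketched with a plain adjacency list. This is only an illustration: the page names and the helper functions `in_degrees` and `out_degrees` are hypothetical, not taken from any search engine.

```python
# Hypothetical toy web: each page maps to the list of pages it links to.
links = {
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
    "c.html": ["a.html"],
    "d.html": ["c.html"],
}


def in_degrees(links):
    """Count how many links point to each page."""
    counts = {page: 0 for page in links}
    for source, targets in links.items():
        for target in targets:
            counts[target] = counts.get(target, 0) + 1
    return counts


def out_degrees(links):
    """Count how many links leave each page."""
    return {page: len(targets) for page, targets in links.items()}


if __name__ == "__main__":
    print("in-degree:", in_degrees(links))
    print("out-degree:", out_degrees(links))
```

Simple topological measures such as in-degree are the starting point for the link-based ranking methods covered later in the lecture.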
12Use of Anchor Text (Brin and Page 1998)
- Words in the anchor text can be used to index the
webpage being linked to: the text in an anchor
may give a good description of the page it points
to, e.g.
- <a href="www.bio.com/beckhambio.html">A Biography of David Beckham</a></p>
- The words in the anchor text might be a better
indicator of what the webpage is about than the
words in the webpage
- Anchor text is also good for resources like
images that cannot be analysed as keywords
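A minimal sketch of anchor-text indexing, using the Beckham link from the slide as input. It uses Python's standard `html.parser` module. The class `AnchorCollector` and the function `index_anchor_text` are hypothetical names chosen for illustration, and this is not a description of how Google implements it.

```python
from collections import defaultdict
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from one HTML document."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an <a href=...> element.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.pairs.append((self._href, "".join(self._text).strip()))
            self._href = None


def index_anchor_text(pages):
    """Build an inverted index: word -> set of URLs that the word points to."""
    index = defaultdict(set)
    for html in pages:
        collector = AnchorCollector()
        collector.feed(html)
        collector.close()
        for href, text in collector.pairs:
            for word in text.lower().split():
                # The word is credited to the TARGET page, not the linking page.
                index[word].add(href)
    return index


pages = ['<p><a href="www.bio.com/beckhambio.html">A Biography of David Beckham</a></p>']
index = index_anchor_text(pages)

if __name__ == "__main__":
    print(sorted(index["beckham"]))
```

The key design point is that the words are indexed against the page being linked to, so a page can be found by words that never appear in its own text.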
13PageRank (Brin and Page 1998)
- Google makes use of both link structure and
anchor text
- The citation (link) graph of the web is an
important resource that has largely gone unused
in existing web search engines
- PageRank is an objective measure of a web
page's citation importance that corresponds well
with people's subjective idea of importance
14Calculating PageRank
- PR(A) = (1-d) + d (PR(T1)/C(T1) + ... +
PR(Tn)/C(Tn))
- PR(A): PageRank of webpage A
- C(A): the number of links out of webpage A
- T1...Tn: the webpages that point to webpage A
- d: a damping factor set between 0 and 1
- In reality, the calculation of PageRank is
iterative: each page's PageRank depends on the
PageRank of the pages that link to it
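The iterative calculation can be sketched as follows, applying PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) repeatedly to every page. The function name `pagerank`, the starting value of 1.0, the fixed number of iterations and the toy graph are assumptions made for illustration. d = 0.85 is the value Brin and Page (1998) say is usually used.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iteratively compute PR(A) = (1-d) + d * sum(PR(T)/C(T)) for every page.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}

    # For each page A, list the pages T1...Tn that point to it.
    inbound = {page: [] for page in pages}
    for source, targets in links.items():
        for target in targets:
            inbound[target].append(source)

    pr = {page: 1.0 for page in pages}  # assumed starting value
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Each linking page T passes on its PageRank divided by C(T).
            total = sum(pr[t] / len(links[t]) for t in inbound[page])
            new_pr[page] = (1 - d) + d * total
        pr = new_pr
    return pr


if __name__ == "__main__":
    toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    for page, value in sorted(pagerank(toy_web).items()):
        print(page, round(value, 3))
```

In the toy graph, C ends up ranked above B because it receives a link from B in addition to its share of A's PageRank. A real implementation would stop when the values change by less than some tolerance rather than after a fixed number of iterations.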
15Web-adjacency Analysis (a similar idea to
PageRank)
- Kleinberg and colleagues proposed a method for
identifying authoritative web-pages
- Identify set of relevant pages (as normal)
- Identify those with a large in-degree, i.e. lots
of pages point to them (cf. impact)
- Ensure that the authorities selected are referred
to by a number of the same hubs, i.e. those with
a large out-degree
16Web-adjacency Analysis
- Hubs and authorities exhibit what could be
called a mutually reinforcing relationship
(Kleinberg 1998)
- Computing authority and hub values for web-pages
is an iterative process over a graph, where each
node is a web-page
- Two weights are given to each node, relating to
in-degree and out-degree; total in-degree weights
and total out-degree weights are kept constant
- Weights are modified each iteration depending on
the weights of connected nodes
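A minimal sketch of the iterative hub and authority computation described above. The function names `hits` and `normalise`, the fixed iteration count and the toy graph are assumptions for illustration. Keeping the total weights constant is done here by rescaling so that the squared weights sum to 1, which is one common choice.

```python
import math


def normalise(weights):
    """Rescale weights so that their squares sum to 1."""
    norm = math.sqrt(sum(w * w for w in weights.values()))
    if norm == 0:
        return dict(weights)
    return {page: w / norm for page, w in weights.items()}


def hits(links, iterations=50):
    """Return (hub, authority) weights for a graph given as page -> linked pages."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}

    for _ in range(iterations):
        # Authority weight: sum of the hub weights of pages pointing to the page.
        new_auth = {page: 0.0 for page in pages}
        for source, targets in links.items():
            for target in targets:
                new_auth[target] += hub[source]

        # Hub weight: sum of the authority weights of the pages it points to.
        new_hub = {
            page: sum(new_auth[t] for t in links.get(page, []))
            for page in pages
        }

        auth = normalise(new_auth)
        hub = normalise(new_hub)
    return hub, auth


if __name__ == "__main__":
    toy_web = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a1"]}
    hub, auth = hits(toy_web)
    print("authorities:", {p: round(w, 3) for p, w in sorted(auth.items())})
    print("hubs:", {p: round(w, 3) for p, w in sorted(hub.items())})
```

In the toy graph, a1 becomes the strongest authority because all three hubs point to it, and h1 and h2 become stronger hubs than h3 because they point to both authorities. This is the mutually reinforcing relationship in miniature.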
17Some other Factors used to rank Web Pages (Hock
2001)
- Popularity of the Page: measured either by how
many other web-pages link to it, or by how many
people have clicked on it when they had the same
query
- Frequency of search terms: need to consider
length of the document, and web-page authors'
attempts to affect ranking by deliberate
repetition
- Number of query terms matched: but remember many
queries are only one or two words
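Two of the factors above, term frequency adjusted for document length and the number of query terms matched, can be sketched as below. The function names and the `cap` parameter are hypothetical. Capping each term's contribution is just one simple way to blunt deliberate repetition, not a method described by Hock (2001).

```python
def term_frequency_score(query_terms, doc_tokens, cap=0.1):
    """Sum of length-normalised term frequencies, each capped at `cap`.

    Dividing by document length stops long documents winning automatically;
    the cap (an assumed heuristic) limits the gain from repeating a term.
    """
    if not doc_tokens:
        return 0.0
    score = 0.0
    for term in query_terms:
        relative_frequency = doc_tokens.count(term) / len(doc_tokens)
        score += min(relative_frequency, cap)
    return score


def terms_matched(query_terms, doc_tokens):
    """Number of distinct query terms that occur in the document."""
    present = set(doc_tokens)
    return sum(1 for term in set(query_terms) if term in present)


if __name__ == "__main__":
    spam_page = ("cheap flights " + "cheap " * 8).split()
    print(term_frequency_score(["cheap", "flights"], spam_page))
    print(terms_matched(["cheap", "hotels"], spam_page))
```

With very short queries, `terms_matched` can only take a handful of values, which is why it is of limited use on its own for ranking.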
18Other Factors (continued)
- Rarity of terms: rank pages containing rare
search terms more highly (cf. TFIDF)
- Weighting by Field: give high ranking to pages
including search terms in important fields, e.g.
Title
- Proximity of Terms: rank pages more highly if
search terms occur near one another
- Order of Query Terms: give priority to pages
containing the search term entered first
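The four factors above can be combined in a toy scoring function. Everything here is an illustrative assumption: the weights `title_weight` and `order_decay`, the way the factors are combined, and the three tiny documents. Real search engines do not publish their formulas.

```python
import math


def idf(term, docs):
    """Rarity: terms found in fewer documents get a larger weight."""
    df = sum(1 for d in docs if term in d["title"] or term in d["body"])
    return math.log(len(docs) / df) if df else 0.0


def proximity_bonus(query_terms, tokens):
    """Proximity: 1 / smallest gap between the first two query terms."""
    if len(query_terms) < 2:
        return 0.0
    first = [i for i, t in enumerate(tokens) if t == query_terms[0]]
    second = [i for i, t in enumerate(tokens) if t == query_terms[1]]
    if not first or not second:
        return 0.0
    gap = min(abs(i - j) for i in first for j in second)
    return 1.0 / max(gap, 1)


def score(query_terms, doc, docs, title_weight=3.0, order_decay=0.8):
    """Combine rarity, field weighting, query term order and proximity."""
    total = 0.0
    for rank, term in enumerate(query_terms):
        # Field weighting: an occurrence in the title counts title_weight times.
        weighted_count = doc["body"].count(term) + title_weight * doc["title"].count(term)
        # Order of query terms: earlier terms are discounted less.
        total += (order_decay ** rank) * idf(term, docs) * weighted_count
    return total + proximity_bonus(query_terms, doc["body"])


docs = [
    {"title": ["pagerank", "explained"], "body": ["pagerank", "uses", "link", "structure"]},
    {"title": ["web", "search"], "body": ["link", "structure", "and", "pagerank", "notes"]},
    {"title": ["cooking"], "body": ["link", "to", "recipes"]},
]

if __name__ == "__main__":
    query = ["pagerank", "link"]
    for doc in docs:
        print(doc["title"], round(score(query, doc, docs), 3))
```

In this toy collection "link" occurs in every document, so its rarity weight is zero and the ranking is driven by "pagerank", by its appearance in a title, and by how close the two terms are.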
19Set Reading for Lecture 4
- Brin and Page (1998), The Anatomy of a
Large-Scale Hypertextual Web Search Engine.
SECTIONS 1 and 2. Explains Google's use of
anchor text and PageRank.
- www-db.stanford.edu/backrub/google.html
- Hock (2001), The extreme searcher's guide to web
search engines, pages 25-31. Gives an overview
of some factors used by web search engines to
rank webpages. AVAILABLE in Main Library
collection and in Library Article Collection.
20Exercise
- Explore the idea of PageRank using an online
PageRank calculator, e.g.
- www.markhorrell.com/seo/pagerank.shtml
- OR
- www.webworkshop.net/pagerank_calculator.php3
21Further Reading
- Rose and Levinson (2004), Understanding User
Goals in Web Search, 13th International WWW
Conference, 2004. www.sims.berkeley.edu/courses/is141/f05/readings/rose_www04.pdf
- Page, Brin, Motwani and Winograd (1999), The
PageRank Citation Ranking: Bringing Order to the
Web. http://dbpubs.stanford.edu:8090/pub/1999-66
- Belew (2000), Finding Out About, pages 195-199
for an overview of Kleinberg's work on
web-adjacency analysis and authorities and hubs.
- Kleinberg (1998), Authoritative Sources in a
Hyperlinked Environment, Journal of the ACM.
http://citeseer.nj.nec.com/87928.html
- Kobayashi and Takeda (2000), Information
Retrieval on the Web, ACM Computing Surveys
32(2), pp. 144-173. AVAILABLE IN LIBRARY /
ARTICLE COLLECTION. This comprehensive article
reviews a lot of the ideas covered so far in this
module and discusses them in the context of Web
IR. NOTE, it is already a little out of date in
places because of the rapid changes of the Web.
22Lecture 4 LEARNING OUTCOMES
- After this lecture you should be able to:
- Explain how the challenges of web IR are
different from those facing the developers of
traditional IR systems
- Explain how web search engines can exploit the
hypertext structure of the web to index and rank
web pages, e.g. using Anchor Text and PageRank
- Explain how PageRank is calculated
- Discuss and critique a range of factors used by
web search engines to rank web pages
23Reading ahead for LECTURE 5
- If you want to read about next week's lecture
topics, see:
- Dean and Henzinger (1999), Finding Related Pages
in the World Wide Web. Pages 1-10.
- http://citeseer.ist.psu.edu/dean99finding.html
- Agichtein, Lawrence and Gravano (2001), Learning
Search Engine Specific Query Transformations for
Question Answering, Procs. 10th International
WWW Conference. Section 1 and Section 3
- www.cs.columbia.edu/eugene/papers/www10.pdf
- Oppenheim, Morris and McKnight (2000), The
Evaluation of WWW Search Engines, Journal of
Documentation, 56(2). Pages 194-205. In Library
Article Collection.