Title: CSM06 Information Retrieval
1. CSM06 Information Retrieval
- LECTURE 4, Tuesday 19th October
- Dr Andrew Salway
- a.salway@surrey.ac.uk
2. LECTURE 4
- Part I: Recap of Lecture 3 (PageRank and other ranking factors); review of Set Reading
- Part II: Web IR continued
- Finding related pages by analysing link structure only
- Evaluating web search engines
- Some current R&D issues for web search engines
- Part III: COURSEWORK
3. Finding Related Pages in the World Wide Web (Dean and Henzinger 1999)
- Using a webpage (URL) as a query may be an easier way for a user to express their information need
- The user is saying "I want more pages like this one"; maybe easier than thinking of good query words?
- e.g. the URL www.nytimes.com (New York Times newspaper) returns URLs for other newspapers and news organisations
- Aim is for high precision with fast execution using minimal information
- Two algorithms find pages related to the query page using only connectivity information (nothing about webpage content or usage):
- Companion Algorithm
- Cocitation Algorithm
4. What does "related" mean?
- A related web page is one that addresses the same topic as the original page, but is not necessarily semantically identical
5. Companion Algorithm
- Based on Kleinberg's HITS algorithm: mutually reinforcing authorities and hubs
- 1. Build a vicinity graph for u
- 2. Contract duplicates and near-duplicates
- 3. Compute edge weights (i.e. weights on links)
- 4. Compute hub and authority scores for each node (URL) in the graph → return the highest ranked authorities as the results set
6. Companion Algorithm
- 1. Build a vicinity graph for u
- The graph is made up of the following nodes, and the edges between them:
- u
- Up to B parents of u, and for each parent up to BF of its children; if u has > B parents then choose randomly; if a parent has > BF children, then choose the children closest to u
- Up to F children of u, and for each child up to FB of its parents
- NB. A stop list of URLs with very high indegree is used (a sketch of this construction follows)
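A minimal Python sketch of the vicinity-graph construction, under stated assumptions: get_parents and get_children are hypothetical connectivity-server lookups (not an API from the paper), the B, BF, F, FB defaults are illustrative rather than Dean and Henzinger's settings, and "children closest to u" is given one plausible reading.

```python
import random

def build_vicinity_graph(u, get_parents, get_children,
                         B=50, BF=8, F=8, FB=10, stop_list=frozenset()):
    """Collect the node set of a vicinity graph around query URL u.
    get_parents/get_children are assumed lookups against a connectivity
    server; the B, BF, F, FB defaults are illustrative only."""
    nodes = {u}

    parents = [p for p in get_parents(u) if p not in stop_list]
    if len(parents) > B:
        parents = random.sample(parents, B)        # > B parents: choose randomly
    for p in parents:
        nodes.add(p)
        children = get_children(p)
        if len(children) > BF:
            # > BF children: keep those nearest the link to u in p's link list
            # (one plausible reading of "children closest to u")
            i = children.index(u) if u in children else len(children) // 2
            order = sorted(range(len(children)), key=lambda j: abs(j - i))
            children = [children[j] for j in order[:BF]]
        nodes.update(c for c in children if c not in stop_list)

    for c in get_children(u)[:F]:                  # up to F children of u
        nodes.add(c)
        grandparents = [g for g in get_parents(c) if g not in stop_list]
        nodes.update(grandparents[:FB])            # up to FB parents per child
    return nodes
```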
7. Companion Algorithm
- 2. Contract duplicates and near-duplicates: if two nodes each have > 10 links and > 95% of their links are in common, then merge them into one node whose links are the union of the two
- 3. Compute edge weights (i.e. weights on links)
- Edges between nodes on the same host are weighted 0
- Scaling to reduce the influence of any single host:
- If there are k edges from documents on a first host to a single document on a second host, then each edge has authority weight 1/k
- If there are l edges from a single document on a first host to a set of documents on a second host, then each edge has hub weight 1/l (both rules are sketched in code below)
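The weighting rules translate directly into code. A minimal sketch, assuming each edge is a (source, destination) pair of absolute URLs so that the host can be read off with urlparse:

```python
from collections import defaultdict
from urllib.parse import urlparse

def host(url):
    return urlparse(url).netloc       # assumes absolute URLs with a scheme

def compute_edge_weights(edges):
    """edges: iterable of (src, dst) URL pairs from the vicinity graph.
    Returns per-edge authority and hub weights as defined on this slide."""
    edges = set(edges)
    host_to_doc = defaultdict(int)    # k: edges from one host to one document
    doc_to_host = defaultdict(int)    # l: edges from one document to one host
    for s, d in edges:
        host_to_doc[(host(s), d)] += 1
        doc_to_host[(s, host(d))] += 1

    auth_w, hub_w = {}, {}
    for s, d in edges:
        if host(s) == host(d):        # edges within a host are weighted 0
            auth_w[(s, d)] = hub_w[(s, d)] = 0.0
        else:
            auth_w[(s, d)] = 1.0 / host_to_doc[(host(s), d)]
            hub_w[(s, d)] = 1.0 / doc_to_host[(s, host(d))]
    return auth_w, hub_w
```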
8. Companion Algorithm
- 4. Compute hub and authority scores for each node (URL) in the graph → return the highest ranked authorities as the results set
- A document that points to many others is a good hub, and a document that many documents point to is a good authority
9. Companion Algorithm
- 4. continued
- H: hub vector, with one element for the Hub value of each node
- A: authority vector, with one element for the Authority value of each node
- Initially all values are set to 1
10. Companion Algorithm
- 4. continued
- Until H and A converge:
- For all nodes n in the graph N:
- A[n] = Σ over nodes n' of H[n'] × authority_weight(n', n)
- For all nodes n in the graph N:
- H[n] = Σ over nodes n' of A[n'] × hub_weight(n, n')
- (a runnable sketch of this iteration follows)
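Putting slides 9 and 10 together, a minimal sketch of step 4, reusing the auth_w and hub_w weights from the earlier sketch. The normalisation inside the loop is an added safeguard for numerical stability, not something the slide specifies.

```python
def hits_scores(nodes, edges, auth_w, hub_w, max_iter=100, tol=1e-8):
    """Iterate the update rules above until H and A converge, then return
    the nodes ranked by authority score (highest first)."""
    nodes = list(nodes)
    H = {n: 1.0 for n in nodes}       # initially all values set to 1
    A = {n: 1.0 for n in nodes}
    for _ in range(max_iter):
        newA = {n: 0.0 for n in nodes}
        for s, d in edges:            # A[n] += H[n'] * authority_weight(n', n)
            newA[d] += H[s] * auth_w[(s, d)]
        newH = {n: 0.0 for n in nodes}
        for s, d in edges:            # H[n] += A[n'] * hub_weight(n, n')
            newH[s] += newA[d] * hub_w[(s, d)]
        # normalise to unit length (an addition, not on the slide)
        na = sum(v * v for v in newA.values()) ** 0.5 or 1.0
        nh = sum(v * v for v in newH.values()) ** 0.5 or 1.0
        newA = {n: v / na for n, v in newA.items()}
        newH = {n: v / nh for n, v in newH.items()}
        converged = all(abs(newA[n] - A[n]) < tol for n in nodes) and \
                    all(abs(newH[n] - H[n]) < tol for n in nodes)
        A, H = newA, newH
        if converged:
            break
    return sorted(nodes, key=A.get, reverse=True)  # best authorities first
```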
11. Cocitation Algorithm
- Finds pages that are frequently cocited with the query web page u, i.e. it finds other pages that are pointed to by many of the pages that also point to u
- Two nodes are co-cited if they have a common parent; the number of common parents is their degree of co-citation
12. Cocitation Algorithm
- Select up to B parents of u
- For each parent, add up to BF of its children to S, the set of u's siblings
- Return the nodes in S with the highest degrees of cocitation with u
- NB. If < 15 nodes in S are cocited with u at least twice, then restart using u's URL with one path element removed, e.g. aaa.com/X/Y/Z → aaa.com/X/Y (a sketch follows)
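A minimal sketch of the whole algorithm, again assuming hypothetical get_parents/get_children connectivity lookups and illustrative defaults for B and BF; the restart rule from the NB is noted in a comment but not implemented.

```python
import random
from collections import Counter

def cocitation_related(u, get_parents, get_children, B=50, BF=8):
    """Rank pages by degree of co-citation with query URL u.
    B and BF defaults are illustrative, not the paper's values."""
    parents = get_parents(u)
    if len(parents) > B:
        parents = random.sample(parents, B)    # up to B parents of u

    degree = Counter()                         # sibling -> common parents with u
    for p in parents:
        for sibling in get_children(p)[:BF]:   # up to BF children per parent
            if sibling != u:
                degree[sibling] += 1

    # keep siblings cocited with u at least twice; if fewer than 15 remain,
    # the slide restarts with one path element removed from u (omitted here)
    return [(s, c) for s, c in degree.most_common() if c >= 2]
```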
13. Evaluation
- 59 input URLs chosen by 18 volunteers (mainly computing professionals)
- The volunteers are shown the results for each URL they chose and have to judge each result: 1 for valuable, 0 for not valuable
- Various calculations of precision follow, e.g. precision at 10 for the "intersection group" (those query URLs for which all 3 algorithms returned results); precision at 10 is illustrated below
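Precision at 10 is simply the fraction of the top ten results judged valuable. A small sketch over a made-up judgement list:

```python
def precision_at_k(judgements, k=10):
    """judgements: the volunteers' 0/1 'valuable' labels, in rank order."""
    top = judgements[:k]
    return sum(top) / len(top) if top else 0.0

# e.g. 7 of the top 10 results judged valuable
print(precision_at_k([1, 1, 0, 1, 1, 1, 0, 1, 0, 1]))   # 0.7
```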
14. Evaluation
- The authors suggest that their algorithms perform better than an algorithm (Netscape's) that incorporates content and usage information as well as connectivity information. This is surprising. IS IT??
- Perhaps it is because they had more connectivity information??
15. Evaluation of Web Search Engines
- Precision may be applicable for evaluating a web search engine, but it may be the precision of the first page of results that is most important
- Recall, as traditionally defined, may not be applicable, because it is difficult or impossible to identify all the relevant web pages for a given query
16. Four strategies for evaluation of web search engines
- Use precision and recall in the traditional way for a very tightly defined topic; only applicable if all relevant web pages are known in advance
- Use relative recall: estimate the total number of relevant documents by doing a number of searches and adding up the relevant documents returned (see the sketch after this list)
- Statistically sample the web in order to estimate the number of relevant pages
- Avoid recall altogether
- SEE Oppenheim, Morris and McKnight (2000), p. 194
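To illustrate the relative-recall strategy: the pooled set of relevant documents found across all the searches stands in for the unknowable full set of relevant pages. A sketch with made-up URLs:

```python
def relative_recall(engine_relevant, all_relevant_sets):
    """engine_relevant: relevant URLs returned by one engine;
    all_relevant_sets: one such set per engine searched."""
    pool = set().union(*all_relevant_sets)   # pooled estimate of all relevant docs
    return len(engine_relevant) / len(pool) if pool else 0.0

# e.g. engine A found 4 relevant pages; the pool across A and B holds 6
a = {"u1", "u2", "u3", "u4"}
b = {"u3", "u4", "u5", "u6"}
print(relative_recall(a, [a, b]))            # 4/6 ≈ 0.67
```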
17. Alternative Evaluation Criteria
- Number of web pages covered, and coverage: is covering more pages better? It may be more important that certain domains are included in the coverage
- Freshness / broken links: web-page content is frequently updated, so the index also needs to be updated; broken links frustrate users. This should be relatively straightforward to quantify
18. Alternative Evaluation Criteria (continued)
- Search Syntax: more experienced users may like the option of advanced searches, e.g. phrases, Boolean operators, and field searching
- Human Factors and Interface Issues: evaluation from a user's perspective is a more subjective criterion, but an important one; it can be argued that an intuitive interface for formulating queries and interpreting results helps a user to get better results from the system
- Quality of Abstracts: related to interface issues are the abstracts of web pages that a web search engine displays; if they are good, they help a user to quickly identify the more promising pages
19. Some current R&D issues for web search engines
- Constant efforts to improve user interfaces, both to help users express their information needs and to help them understand more about the results, i.e. information visualisation. More about this in Lecture 6
- Google Labs gives some insights into potentially up-and-coming features
- http://labs.google.com
20. Some things from Google Labs
- Desktop search
- A search application that provides full-text search over your email, computer files, chats, and the web pages you've viewed
- Since you can easily search information on your computer, you don't need to worry about organizing your files, email, or bookmarks
21. Some things from Google Labs
- SMS search
- Google SMS (Short Message Service) enables you
to easily get precise answers to specialized
queries from your mobile phone or device. Send
your query as a text message and get phone book
listings, dictionary definitions, product prices
and more. Just text. No links. No web pages.
22. Some things from Google Labs
- Personalised Search
- Once you've entered a description of your general interests, your search results are modified accordingly
23. (No transcript)
24. Some things from Google Labs
- Google Sets
- Enter a few items and a longer list (of the same
kinds of things) is returned
25. (No transcript)
26. Set Reading
- Dean and Henzinger (1999), "Finding Related Pages in the World Wide Web", pages 1-10.
- http://citeseer.ist.psu.edu/dean99finding.html
- Oppenheim, Morris and McKnight (2000), "The Evaluation of WWW Search Engines", Journal of Documentation, 56(2), pp. 190-211. SET READING is pages 194-205. In Library Article Collection.
27. Exercise: Google's "Similar Pages"
- It is suggested that Google's "Similar Pages" feature is based in part on the work of Dean and Henzinger. By making a variety of queries to Google and choosing "Similar Pages", see what you can find out about how this works.
28. Exercise: web search engine evaluation
- Compare two or three web search engines by making the same queries to each. How do they compare in terms of:
- Coverage?
- Quality of highest ranked results?
- Ease of querying and understanding results?
- Ranking factors that they appear to be using?
29. Further Reading
- Rest of the papers given for Set Reading