Title: CSM06 Information Retrieval
1. CSM06 Information Retrieval
- LECTURE 4, Tuesday 19th October
- Dr Andrew Salway
- a.salway@surrey.ac.uk
2. LECTURE 4
- Part I: Recap of Lecture 3 (PageRank and other ranking factors); review of Set Reading
- Part II: Web IR continued
- Finding related pages by analysing link structure only
- Evaluating web search engines
- Some current R&D issues for web search engines
- Part III: COURSEWORK
3. Finding Related Pages in the World Wide Web (Dean and Henzinger 1999)
- Using a webpage (URL) as a query may be an easier way for a user to express their information need
- The user is saying "I want more pages like this one"; maybe easier than thinking of good query words?
- e.g. the URL www.nytimes.com (New York Times newspaper) returns URLs for other newspapers and news organisations
- Aim is for high precision with fast execution using minimal information
- Two algorithms find pages related to the query page using only connectivity information (nothing about webpage content or usage):
- Companion Algorithm
- Cocitation Algorithm
4. What does "related" mean?
- A related web page is one that addresses the same topic as the original page, but is not necessarily semantically identical
5. Companion Algorithm
- Based on Kleinberg's HITS algorithm: mutually reinforcing authorities and hubs
- 1. Build a vicinity graph for u
- 2. Contract duplicates and near-duplicates
- 3. Compute edge weights (i.e. weights on links)
- 4. Compute hub and authority scores for each node (URL) in the graph → return the highest ranked authorities as the results set
6. Companion Algorithm
- 1. Build a vicinity graph for u
- The graph is made up of the following nodes, and the edges between them:
- u
- Up to B parents of u, and for each parent up to BF of its children; if u has > B parents then choose randomly; if a parent has > BF children, then choose the children closest to u
- Up to F children of u, and for each child up to FB of its parents
- NB. A stop list of URLs with very high indegree is used (a sketch of this construction follows)
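A minimal Python sketch of the vicinity-graph construction, under stated assumptions: get_parents and get_children are hypothetical connectivity-server lookups (not an API from the paper), the B, BF, F, FB defaults are illustrative rather than Dean and Henzinger's settings, and "children closest to u" is given one plausible reading.

```python
import random

def build_vicinity_graph(u, get_parents, get_children,
                         B=50, BF=8, F=8, FB=10, stop_list=frozenset()):
    """Collect the node set of a vicinity graph around query URL u.
    get_parents/get_children are assumed lookups against a connectivity
    server; the B, BF, F, FB defaults are illustrative only."""
    nodes = {u}

    parents = [p for p in get_parents(u) if p not in stop_list]
    if len(parents) > B:
        parents = random.sample(parents, B)        # > B parents: choose randomly
    for p in parents:
        nodes.add(p)
        children = get_children(p)
        if len(children) > BF:
            # > BF children: keep those nearest the link to u in p's link list
            # (one plausible reading of "children closest to u")
            i = children.index(u) if u in children else len(children) // 2
            order = sorted(range(len(children)), key=lambda j: abs(j - i))
            children = [children[j] for j in order[:BF]]
        nodes.update(c for c in children if c not in stop_list)

    for c in get_children(u)[:F]:                  # up to F children of u
        nodes.add(c)
        grandparents = [g for g in get_parents(c) if g not in stop_list]
        nodes.update(grandparents[:FB])            # up to FB parents per child
    return nodes
```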
7. Companion Algorithm
- 2. Contract duplicates and near-duplicates: if two nodes each have > 10 links and > 95% of their links are in common, then merge them into one node whose links are the union of the two
- 3. Compute edge weights (i.e. weights on links)
- Edges between nodes on the same host are weighted 0
- Scaling to reduce the influence of any single host:
- If there are k edges from documents on a first host to a single document on a second host, then each edge has authority weight 1/k
- If there are l edges from a single document on a first host to a set of documents on a second host, then each edge has hub weight 1/l (both rules are sketched in code below)
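The weighting rules translate directly into code. A minimal sketch, assuming each edge is a (source, destination) pair of absolute URLs so that the host can be read off with urlparse:

```python
from collections import defaultdict
from urllib.parse import urlparse

def host(url):
    return urlparse(url).netloc       # assumes absolute URLs with a scheme

def compute_edge_weights(edges):
    """edges: iterable of (src, dst) URL pairs from the vicinity graph.
    Returns per-edge authority and hub weights as defined on this slide."""
    edges = set(edges)
    host_to_doc = defaultdict(int)    # k: edges from one host to one document
    doc_to_host = defaultdict(int)    # l: edges from one document to one host
    for s, d in edges:
        host_to_doc[(host(s), d)] += 1
        doc_to_host[(s, host(d))] += 1

    auth_w, hub_w = {}, {}
    for s, d in edges:
        if host(s) == host(d):        # edges within a host are weighted 0
            auth_w[(s, d)] = hub_w[(s, d)] = 0.0
        else:
            auth_w[(s, d)] = 1.0 / host_to_doc[(host(s), d)]
            hub_w[(s, d)] = 1.0 / doc_to_host[(s, host(d))]
    return auth_w, hub_w
```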
8. Companion Algorithm
- 4. Compute hub and authority scores for each node (URL) in the graph → return the highest ranked authorities as the results set
- A document that points to many others is a good hub, and a document that many documents point to is a good authority
9. Companion Algorithm
- 4. continued
- H: hub vector, with one element for the Hub value of each node
- A: authority vector, with one element for the Authority value of each node
- Initially all values are set to 1
10. Companion Algorithm
- 4. continued
- Until H and A converge:
- For all nodes n in the graph N:
- A[n] = Σ over nodes n' of H[n'] × authority_weight(n', n)
- For all nodes n in the graph N:
- H[n] = Σ over nodes n' of A[n'] × hub_weight(n, n')
- (a runnable sketch of this iteration follows)
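Putting slides 9 and 10 together, a minimal sketch of step 4, reusing the auth_w and hub_w weights from the earlier sketch. The normalisation inside the loop is an added safeguard for numerical stability, not something the slide specifies.

```python
def hits_scores(nodes, edges, auth_w, hub_w, max_iter=100, tol=1e-8):
    """Iterate the update rules above until H and A converge, then return
    the nodes ranked by authority score (highest first)."""
    nodes = list(nodes)
    H = {n: 1.0 for n in nodes}       # initially all values set to 1
    A = {n: 1.0 for n in nodes}
    for _ in range(max_iter):
        newA = {n: 0.0 for n in nodes}
        for s, d in edges:            # A[n] += H[n'] * authority_weight(n', n)
            newA[d] += H[s] * auth_w[(s, d)]
        newH = {n: 0.0 for n in nodes}
        for s, d in edges:            # H[n] += A[n'] * hub_weight(n, n')
            newH[s] += newA[d] * hub_w[(s, d)]
        # normalise to unit length (an addition, not on the slide)
        na = sum(v * v for v in newA.values()) ** 0.5 or 1.0
        nh = sum(v * v for v in newH.values()) ** 0.5 or 1.0
        newA = {n: v / na for n, v in newA.items()}
        newH = {n: v / nh for n, v in newH.items()}
        converged = all(abs(newA[n] - A[n]) < tol for n in nodes) and \
                    all(abs(newH[n] - H[n]) < tol for n in nodes)
        A, H = newA, newH
        if converged:
            break
    return sorted(nodes, key=A.get, reverse=True)  # best authorities first
```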
11. Cocitation Algorithm
- Finds pages that are frequently cocited with the query web page u, i.e. it finds other pages that are pointed to by many of the pages that also point to u
- Two nodes are co-cited if they have a common parent; the number of common parents is their degree of co-citation
12. Cocitation Algorithm
- Select up to B parents of u
- For each parent, add up to BF of its children to S, the set of u's siblings
- Return the nodes in S with the highest degrees of cocitation with u
- NB. If < 15 nodes in S are cocited with u at least twice, then restart using u's URL with one path element removed, e.g. aaa.com/X/Y/Z → aaa.com/X/Y (a sketch follows)
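A minimal sketch of the whole algorithm, again assuming hypothetical get_parents/get_children connectivity lookups and illustrative defaults for B and BF; the restart rule from the NB is noted in a comment but not implemented.

```python
import random
from collections import Counter

def cocitation_related(u, get_parents, get_children, B=50, BF=8):
    """Rank pages by degree of co-citation with query URL u.
    B and BF defaults are illustrative, not the paper's values."""
    parents = get_parents(u)
    if len(parents) > B:
        parents = random.sample(parents, B)    # up to B parents of u

    degree = Counter()                         # sibling -> common parents with u
    for p in parents:
        for sibling in get_children(p)[:BF]:   # up to BF children per parent
            if sibling != u:
                degree[sibling] += 1

    # keep siblings cocited with u at least twice; if fewer than 15 remain,
    # the slide restarts with one path element removed from u (omitted here)
    return [(s, c) for s, c in degree.most_common() if c >= 2]
```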
13. Evaluation
- 59 input URLs chosen by 18 volunteers (mainly computing professionals)
- The volunteers are shown the results for each URL they chose and have to judge each result: 1 for valuable, 0 for not valuable
- Various calculations of precision follow, e.g. precision at 10 for the "intersection group" (those query URLs for which all 3 algorithms returned results); precision at 10 is illustrated below
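Precision at 10 is simply the fraction of the top ten results judged valuable. A small sketch over a made-up judgement list:

```python
def precision_at_k(judgements, k=10):
    """judgements: the volunteers' 0/1 'valuable' labels, in rank order."""
    top = judgements[:k]
    return sum(top) / len(top) if top else 0.0

# e.g. 7 of the top 10 results judged valuable
print(precision_at_k([1, 1, 0, 1, 1, 1, 0, 1, 0, 1]))   # 0.7
```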
14. Evaluation
- The authors suggest that their algorithms perform better than an algorithm (Netscape's) that incorporates content and usage information as well as connectivity information. This is surprising. IS IT??
- Perhaps it is because they had more connectivity information??
15. Evaluation of Web Search Engines
- Precision may be applicable for evaluating a web search engine, but it may be the precision of the first page of results that is most important
- Recall, as traditionally defined, may not be applicable, because it is difficult or impossible to identify all the relevant web pages for a given query
16. Four strategies for evaluation of web search engines
- Use precision and recall in the traditional way for a very tightly defined topic; only applicable if all relevant web pages are known in advance
- Use relative recall: estimate the total number of relevant documents by doing a number of searches and adding up the relevant documents returned (see the sketch after this list)
- Statistically sample the web in order to estimate the number of relevant pages
- Avoid recall altogether
- SEE Oppenheim, Morris and McKnight (2000), p. 194
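To illustrate the relative-recall strategy: the pooled set of relevant documents found across all the searches stands in for the unknowable full set of relevant pages. A sketch with made-up URLs:

```python
def relative_recall(engine_relevant, all_relevant_sets):
    """engine_relevant: relevant URLs returned by one engine;
    all_relevant_sets: one such set per engine searched."""
    pool = set().union(*all_relevant_sets)   # pooled estimate of all relevant docs
    return len(engine_relevant) / len(pool) if pool else 0.0

# e.g. engine A found 4 relevant pages; the pool across A and B holds 6
a = {"u1", "u2", "u3", "u4"}
b = {"u3", "u4", "u5", "u6"}
print(relative_recall(a, [a, b]))            # 4/6 ≈ 0.67
```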
17. Alternative Evaluation Criteria
- Number of web pages covered, and coverage: is covering more pages better? It may be more important that certain domains are included in the coverage
- Freshness / broken links: web-page content is frequently updated, so the index also needs to be updated; broken links frustrate users. This should be relatively straightforward to quantify
18. Alternative Evaluation Criteria (continued)
- Search Syntax: more experienced users may like the option of advanced searches, e.g. phrases, Boolean operators, and field searching
- Human Factors and Interface Issues: evaluation from a user's perspective is a more subjective criterion, but an important one; it can be argued that an intuitive interface for formulating queries and interpreting results helps a user to get better results from the system
- Quality of Abstracts: related to interface issues are the abstracts of web pages that a web search engine displays; if they are good, they help a user to quickly identify the more promising pages
19. Some current R&D issues for web search engines
- Constant efforts to improve user interfaces, both to help users express their information needs and to help them understand more about the results, i.e. information visualisation. More about this in Lecture 6
- Google Labs gives some insights into potentially up-and-coming features
- http://labs.google.com
20. Some things from Google Labs
- Desktop search
- A search application that provides full-text search over your email, computer files, chats, and the web pages you've viewed
- Since you can easily search information on your computer, you don't need to worry about organizing your files, email, or bookmarks
21. Some things from Google Labs
- SMS search
- Google SMS (Short Message Service) enables you
to easily get precise answers to specialized
queries from your mobile phone or device. Send
your query as a text message and get phone book
listings, dictionary definitions, product prices
and more. Just text. No links. No web pages.
22. Some things from Google Labs
- Personalised Search
- Once you've entered a description of your general interests, your search results are modified accordingly
23. (No transcript)
24. Some things from Google Labs
- Google Sets
- Enter a few items and a longer list (of the same
kinds of things) is returned
25. (No transcript)
26. Set Reading
- Dean and Henzinger (1999), "Finding Related Pages in the World Wide Web", pages 1-10.
- http://citeseer.ist.psu.edu/dean99finding.html
- Oppenheim, Morris and McKnight (2000), "The Evaluation of WWW Search Engines", Journal of Documentation, 56(2), pp. 190-211. SET READING is pages 194-205. In Library Article Collection.
27. Exercise: Google's "Similar Pages"
- It is suggested that Google's "Similar Pages" feature is based in part on the work of Dean and Henzinger. By making a variety of queries to Google and choosing "Similar Pages", see what you can find out about how this works.
28. Exercise: web search engine evaluation
- Compare two or three web search engines by making the same queries to each. How do they compare in terms of:
- Coverage?
- Quality of highest ranked results?
- Ease of querying and understanding results?
- Ranking factors that they appear to be using?
29. Further Reading
- Rest of the papers given for Set Reading