Transcript and Presenter's Notes

Title: CSM06 Information Retrieval


1
CSM06 Information Retrieval
  • LECTURE 4: Tuesday 19th October
  • Dr Andrew Salway
  • a.salway@surrey.ac.uk

2
LECTURE 4
  • Part I: Recap of Lecture 3 (PageRank and other
    ranking factors); review of Set Reading
  • Part II: Web IR continued
  • Finding related pages by analysing link structure
    only
  • Evaluating web search engines
  • Some current R&D issues for web search engines
  • Part III: COURSEWORK

3
Finding Related Pages in the World Wide Web
(Dean and Henzinger 1999)
  • Using a webpage (URL) as a query may be an easier
    way for a user to express their information need
  • The user is saying "I want more pages like this
    one"; maybe this is easier than thinking of good
    query words?
  • e.g. the URL www.nytimes.com (New York Times
    newspaper) returns URLs for other newspapers and
    news organisations
  • Aim is for high precision with fast execution,
    using minimal information
  • Two algorithms to find pages related to the
    query page, using only connectivity information
    (nothing about webpage content or usage):
  • Companion Algorithm
  • Cocitation Algorithm

4
What does "related" mean?
  • A related web page is one that addresses the
    same topic as the original page, but is not
    necessarily semantically identical

5
Companion Algorithm
  • Based on Kleinberg's HITS algorithm: mutually
    reinforcing authorities and hubs
  • 1. Build a vicinity graph for u
  • 2. Contract duplicates and near-duplicates
  • 3. Compute edge weights (i.e. weights on links)
  • 4. Compute hub and authority scores for each node
    (URL) in the graph; return the highest ranked
    authorities as the results set

6
Companion Algorithm
  • 1. Build a vicinity graph for u (sketched in code
    below)
  • The graph is made up of the following nodes, and
    the edges between them:
  • u
  • Up to B parents of u, and for each parent up to
    BF of its children; if u has > B parents then
    choose randomly; if a parent has > BF children,
    then choose the children closest to u
  • Up to F children of u, and for each child up to
    FB of its parents
  • NB. Use a stop list of URLs with very high
    indegree
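A minimal Python sketch of the vicinity-graph construction above (not from the paper or the slides): the helpers get_parents and get_children stand in for some connectivity index, the default values for B, BF, F and FB are made up, and "children closest to u" is simplified to taking the first BF children.

```python
import random

def build_vicinity_graph(u, get_parents, get_children,
                         B=2000, BF=8, F=30, FB=8, stop_list=frozenset()):
    """Collect the nodes of the vicinity graph around query URL u.

    get_parents(url) / get_children(url) are assumed connectivity
    lookups; B, BF, F, FB are sampling limits as in the lecture.
    Returns a set of URLs (the graph nodes).
    """
    nodes = {u}

    # Up to B parents of u (chosen randomly if there are more than B)
    parents = [p for p in get_parents(u) if p not in stop_list]
    if len(parents) > B:
        parents = random.sample(parents, B)
    nodes.update(parents)

    # For each parent, up to BF of its children ("closest to u" is
    # simplified here to taking the first BF)
    for p in parents:
        children = [c for c in get_children(p) if c not in stop_list]
        nodes.update(children[:BF])

    # Up to F children of u, and for each child up to FB of its parents
    children_of_u = [c for c in get_children(u) if c not in stop_list][:F]
    nodes.update(children_of_u)
    for c in children_of_u:
        nodes.update([p for p in get_parents(c) if p not in stop_list][:FB])

    return nodes
```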

7
Companion Algorithm
  • 2. Contract duplicates and near-duplicates: if
    two nodes each have > 10 links and > 95% of their
    links are in common, then merge them into one node
    whose links are the union of the two
  • 3. Compute edge weights on links (sketched in code
    below)
  • Edges between nodes on the same host are weighted
    0
  • Scaling to reduce the influence of any single
    host:
  • If there are k edges from documents on a first
    host to a single document on a second host, then
    each edge has authority weight 1/k
  • If there are l edges from a single document on a
    first host to a set of documents on a second
    host, we give each edge a hub weight of 1/l
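A rough Python sketch of this host-based edge weighting. The edge representation (a list of (src, dst) URL pairs) and the use of the URL's network location as the "host" are assumptions for illustration, not the paper's actual data structures.

```python
from collections import defaultdict
from urllib.parse import urlparse

def compute_edge_weights(edges):
    """Assign authority and hub weights to directed edges (src, dst).

    Same-host edges get weight 0; k edges from one host into a single
    document get authority weight 1/k each; l edges from one document
    into a single host get hub weight 1/l each.
    """
    def host(url):
        return urlparse(url).netloc

    # Count edges per (source host, destination document) and
    # per (source document, destination host)
    k_count = defaultdict(int)   # (src_host, dst) -> k
    l_count = defaultdict(int)   # (src, dst_host) -> l
    for src, dst in edges:
        if host(src) == host(dst):
            continue
        k_count[(host(src), dst)] += 1
        l_count[(src, host(dst))] += 1

    auth_w, hub_w = {}, {}
    for src, dst in edges:
        if host(src) == host(dst):
            auth_w[(src, dst)] = hub_w[(src, dst)] = 0.0
        else:
            auth_w[(src, dst)] = 1.0 / k_count[(host(src), dst)]
            hub_w[(src, dst)] = 1.0 / l_count[(src, host(dst))]
    return auth_w, hub_w
```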

8
Companion Algorithm
  • 4. Compute hub and authority scores for each node
    (URL) in the graph; return the highest ranked
    authorities as the results set
  • A document that points to many others is a good
    hub, and a document that many documents point to
    is a good authority

9
Companion Algorithm
  • 4. (continued)
  • H: hub vector, with one element for the Hub value
    of each node
  • A: authority vector, with one element for the
    Authority value of each node
  • Initially all values are set to 1

10
Companion Algorithm
  • 4. (continued)
  • Until H and A converge:
  • For all nodes n in the graph N:
    A[n] = Σn' H[n'] × authority_weight(n', n)
    (sum over nodes n' with an edge from n' to n)
  • For all nodes n in the graph N:
    H[n] = Σn' A[n'] × hub_weight(n, n')
    (sum over nodes n' with an edge from n to n')
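The update loop can be sketched in Python as follows, assuming the nodes, edges and edge-weight dictionaries from the earlier sketches. The L2 normalisation and the convergence tolerance are additions for numerical stability and are not shown on the slide.

```python
def hub_authority_scores(nodes, edges, auth_w, hub_w, max_iter=50, tol=1e-6):
    """Step 4: iterate the weighted hub/authority updates until H and A converge.

    nodes: iterable of URLs; edges: list of (src, dst) pairs in the
    vicinity graph; auth_w / hub_w: edge-weight dicts keyed by (src, dst).
    """
    nodes = list(nodes)
    H = {n: 1.0 for n in nodes}   # all Hub values start at 1
    A = {n: 1.0 for n in nodes}   # all Authority values start at 1

    def normalised(vec):
        total = sum(v * v for v in vec.values()) ** 0.5 or 1.0
        return {n: v / total for n, v in vec.items()}

    for _ in range(max_iter):
        # A[n] = sum of H[n'] * authority_weight(n', n) over edges (n', n)
        new_A = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_A[dst] += H[src] * auth_w[(src, dst)]
        # H[n] = sum of A[n'] * hub_weight(n, n') over edges (n, n')
        new_H = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_H[src] += new_A[dst] * hub_w[(src, dst)]

        new_A, new_H = normalised(new_A), normalised(new_H)
        converged = all(abs(new_A[n] - A[n]) < tol and
                        abs(new_H[n] - H[n]) < tol for n in nodes)
        A, H = new_A, new_H
        if converged:
            break

    # Highest ranked authorities form the results set
    return sorted(nodes, key=lambda n: A[n], reverse=True), A, H
```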

11
Cocitation Algorithm
  • Finds pages that are frequently cocited with the
    query web page u, i.e. other pages that are
    pointed to by many of the pages that also point to
    u
  • Two nodes are co-cited if they have a common
    parent; the number of common parents is their
    degree of co-citation

12
Cocitation Algorithm
  • Select up to B parents of u
  • For each parent, add up to BF of its children to
    the set of u's siblings, S
  • Return the nodes in S with the highest degrees of
    cocitation with u (see the sketch below)
  • NB. If there are < 15 nodes in S that are cocited
    with u at least twice, then restart using u's URL
    with one path element removed, e.g. aaa.com/X/Y/Z →
    aaa.com/X/Y
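A minimal Python sketch of the cocitation procedure, again assuming get_parents / get_children connectivity lookups and illustrative values for B and BF. The fallback of removing a path element when fewer than 15 nodes are cocited with u at least twice is omitted for brevity.

```python
from collections import Counter

def cocitation_related(u, get_parents, get_children, B=2000, BF=8, top_k=10):
    """Return up to top_k sibling URLs ranked by degree of cocitation with u."""
    parents = list(get_parents(u))[:B]           # up to B parents of u
    cocitation = Counter()
    for p in parents:
        for child in list(get_children(p))[:BF]: # up to BF children per parent
            if child != u:
                cocitation[child] += 1           # one more common parent with u
    # Nodes with the highest degree of cocitation with u come first
    return [url for url, _ in cocitation.most_common(top_k)]
```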

13
Evaluation
  • 59 input URLs chosen by 18 volunteers (mainly
    computing professionals)
  • The volunteers were shown the results for each URL
    they chose and had to judge each result: 1 for
    valuable and 0 for not valuable
  • Various calculations of precision, e.g.
    precision at 10 for the "intersection group"
    (those query URLs for which all 3 algorithms
    returned results); see the sketch below
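Precision at 10 is simply the fraction of the first 10 returned pages that were judged valuable. A small sketch, with hypothetical judgement values:

```python
def precision_at_k(judgements, k=10):
    """Precision at k: fraction of the top-k results judged valuable.

    judgements is an ordered list of 0/1 judgements for one query URL
    (1 = valuable, 0 = not valuable), as collected from the volunteers.
    """
    top = judgements[:k]
    return sum(top) / len(top) if top else 0.0

# Hypothetical example: 7 of the first 10 results judged valuable
print(precision_at_k([1, 1, 0, 1, 1, 1, 0, 1, 1, 0]))  # 0.7
```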

14
Evaluation
  • The authors suggest that their algorithms perform
    better than an algorithm (Netscape's) that
    incorporates content and usage information, as
    well as connectivity information. This is
    surprising. IS IT??
  • Perhaps it is because they had more connectivity
    information?

15
Evaluation of Web Search Engines
  • Precision may be applicable for evaluating a web
    search engine, but it may be the precision of the
    first page of results that matters most
  • Recall, as traditionally defined, may not be
    applicable because it is difficult or impossible
    to identify all the relevant web-pages for a
    given query

16
Four strategies for evaluation of web search
engines
  • Use precision and recall in the traditional way
    for a very tightly defined topic; only applicable
    if all relevant web pages are known in advance
  • Use relative recall: estimate the total number of
    relevant documents by doing a number of searches
    and adding up the relevant documents returned
    (see the sketch after this list)
  • Statistically sample the web in order to estimate
    the number of relevant pages
  • Avoid recall altogether
  • SEE Oppenheim, Morris and McKnight (2000), p. 194
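A sketch of the relative-recall idea (the second strategy above): pool the relevant documents found across several searches to estimate the full relevant set, then measure each engine against that pool. The function name and example values are illustrative, not Oppenheim, Morris and McKnight's exact formulation.

```python
def relative_recall(engine_relevant, pooled_relevant):
    """Relative recall of one engine on one query.

    pooled_relevant: set of all relevant documents found by pooling
    several searches (an estimate of the true relevant set);
    engine_relevant: set of relevant documents this engine returned.
    """
    return len(engine_relevant & pooled_relevant) / len(pooled_relevant)

pool = {"a", "b", "c", "d", "e"}        # relevant docs found across all searches
engine_a = {"a", "b", "e"}              # relevant docs returned by one engine
print(relative_recall(engine_a, pool))  # 0.6
```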

17
Alternative Evaluation Criteria
  • Number of web-pages covered, and coverage: is
    covering more pages better? It may be more
    important that certain domains are included in the
    coverage
  • Freshness / broken links: web-page content is
    frequently updated, so the index also needs to be
    updated; broken links frustrate users. Should be
    relatively straightforward to quantify.

18
Alternative Evaluation Criteria (continued)
  • Search Syntax: more experienced users may like
    the option of advanced searches, e.g. phrases,
    Boolean operators, and field searching.
  • Human Factors and Interface Issues: evaluation
    from a user's perspective is a more subjective
    criterion, but it is an important one; it
    can be argued that an intuitive interface for
    formulating queries and interpreting results
    helps a user to get better results from the
    system.
  • Quality of Abstracts: related to interface issues
    are the abstracts of web-pages that a web
    search engine displays; if these are good they help
    a user to quickly identify the more promising pages

19
Some current R&D issues for web search engines
  • Constant efforts to improve user interfaces, both
    to help users express their information needs and
    to help them understand more about the results,
    i.e. information visualisation. More about this
    in Lecture 6
  • Google Labs gives some insights into potentially up
    and coming features
  • http://labs.google.com

20
Some things from Google Labs
  • Desktop search
  • A search application that provides full text
    search over your email, computer files, chats,
    and the web pages you've viewed
  • Since you can easily search information on your
    computer, you don't need to worry about
    organizing your files, email, or bookmarks

21
Some things from Google Labs
  • SMS search
  • Google SMS (Short Message Service) enables you
    to easily get precise answers to specialized
    queries from your mobile phone or device. Send
    your query as a text message and get phone book
    listings, dictionary definitions, product prices
    and more. Just text. No links. No web pages.

22
Some things from Google Labs
  • Personalised Search
  • Once you've entered a description of your general
    interests, your search results are modified
    accordingly

23
(No Transcript)
24
Some things from Google Labs
  • Google Sets
  • Enter a few items and a longer list (of the same
    kinds of things) is returned

25
(No Transcript)
26
Set Reading
  • Dean and Henzinger (1999), Finding Related Pages
    in the World Wide Web. Pages 1-10.
  • http://citeseer.ist.psu.edu/dean99finding.html
  • Oppenheim, Morris and McKnight (2000), The
    Evaluation of WWW Search Engines, Journal of
    Documentation, 56(2), pp. 190-211. SET READING
    is Pages 194-205. In Library Article Collection.

27
Exercise: Google's "Similar Pages"
  • It is suggested that Google's "Similar Pages"
    feature is based in part on the work of Dean and
    Henzinger. By making a variety of queries to
    Google and choosing "Similar Pages", see what you
    can find out about how this works.

28
Exercise: web search engine evaluation
  • Compare two or three web search engines by making
    the same queries to each. How do they compare in
    terms of:
  • Coverage?
  • Quality of highest ranked results?
  • Ease of querying and understanding results?
  • Ranking factors that they appear to be using?

29
Further Reading
  • Rest of the papers given for Set Reading