CS 430: Information Discovery
Transcript and Presenter's Notes

Title: CS 430: Information Discovery


1
CS 430: Information Discovery
Lecture 20: Web Search 2
2
Course Administration
Outstanding queries on Assignment 2 have been answered. A wording change has been made to Assignment 3: the output need not be in Web format.
3
Effective Information Retrieval
1. Comprehensive metadata with Boolean retrieval (e.g., a monograph catalog). Can be excellent for well-understood categories of material, but requires expensive metadata, which is rarely available.
2. Full text indexing with ranked retrieval (e.g., news articles). Excellent for relatively homogeneous material, but requires available full text.
Neither of these methods is very effective when applied directly to the Web.
4
New concepts in Web Searching
  • Goal of search is redefined.
  • Concept of relevance is changed.
  • Browsing is tightly connected to searching.
  • Contextual information is used as an integral
    part of the search.

5
Indexing Goals: Precision
  • Short queries applied to very large numbers of items lead to large numbers of hits.
  • The goal is that the first 10-100 hits presented should satisfy the user's information need -- this requires ranking the hits in an order that fits the user's requirements.
  • Recall is not an important criterion.
  • Completeness of the index is not an important factor.
  • Comprehensive crawling is unnecessary.

6
Concept of Relevance
  • Document measures:
    • Relevance, as conventionally defined, is binary (relevant or not relevant). It is usually estimated by the similarity between the terms in the query and each document.
    • Importance measures documents by their likelihood of being useful to a variety of users. It is usually estimated by some measure of popularity.
  • Web search engines rank documents by a combination of relevance and importance. The goal is to present the user with the most important of the relevant documents.

7
Ranking Options
1. Paid advertisers
2. Manually created classification
3. Vector space ranking with corrections for document length
4. Extra weighting for specific fields, e.g., title, anchors, etc.
5. Popularity, e.g., PageRank
The balance between 3, 4, and 5 is not made public.
8
Browsing and Searching
  • Searching is followed by browsing.
  • Browsing the hit list:
    • helpful summary records (snippets)
    • removal of duplicates
    • grouping of results from a single site
  • Browsing the Web pages themselves:
    • direct links from the snippets to the pages
    • cached copies with highlights
    • translation in the same format

9
Browsing and Searching
Query: Cornell sports

    LII: Law about... Sports...
    sports law: an overview. Sports Law encompasses a multitude areas of law brought together in unique ways. Issues ... vocation. Amateur Sports. ...
    www.law.cornell.edu/topics/sports.html

Query: NCAA Tarkanian

    LII: Law about... Sports...
    purposes. See NCAA v. Tarkanian, 109 US 454 (1988). State action status may also be a factor in mandatory drug testing rules. On ...
    www.law.cornell.edu/topics/sports.html
10
Contextual information
  • Information about a document:
    • Content (terms, formatting, etc.)
    • Metadata (externally created, following rules)
    • Context (citations and links, reviews, annotations, etc.)
  • Context has many uses:
    • Selecting documents to index
    • Retrieval clues (e.g., href text)
    • Ranking

11
Effective Information Retrieval (cont)
3. Full text indexing with contextual information and ranked retrieval (e.g., Google, Teoma). Excellent for mixed textual information with rich structure.
4. Contextual information with non-textual materials and ranked retrieval (e.g., Google image retrieval). Promising, but still experimental.
12
Scalability
The growth of the web
13
Scalability
Web search services are centralized systems. Over the past 9 years, Moore's Law has enabled the services to keep pace with the growth of the Web and the number of users, while adding extra functionality. Will this continue? Possible areas for concern are staff costs, telecommunications costs, and disk access rates.
14
Cost Example (Google)
  • 85 people
    • 50 technical, 14 with a Ph.D. in Computer Science
  • Equipment
    • 2,500 Linux machines
    • 80 terabytes of spinning disks
    • 30 new machines installed daily
  • Reported by Larry Page, Google, March 2000
  • At that time, Google was handling 5.5 million searches per day, increasing at 20% per month.
  • By fall 2002, Google had grown to over 400 people.

15
Scalability: Staff
  • Programming: Have very well-trained staff. Isolate complex code. Most coding is single image.
  • System maintenance: Organize for minimal staff (e.g., automated log analysis; do not fix broken computers).
  • Customer service: Automate everything possible, but complaints, large collections, etc. require staff.

16
Scalability: Performance
  • Very large numbers of commodity computers
  • Algorithms and data structures scale linearly
  • Storage
    • Scales with the size of the Web
    • Compression/decompression
  • System
    • Crawling, indexing, and sorting run simultaneously
  • Searching
    • Bounded by disk I/O

17
Bibliometrics
Techniques that use citation analysis to measure the similarity of journal articles or their importance.

Bibliographic coupling: two papers that cite many of the same papers.
Co-citation: two papers that are cited by many of the same papers.
Impact factor (of a journal): the frequency with which the average article in a journal has been cited in a particular year or period.
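
The two pairwise measures can be computed directly from a citation matrix. Below is a minimal illustrative sketch; the four-paper matrix is invented for the example, not taken from the lecture.

```python
# cites[i][j] == 1 means paper i cites paper j (toy data for illustration).
cites = [
    [0, 1, 1, 0],   # paper 0 cites papers 1 and 2
    [0, 0, 1, 1],   # paper 1 cites papers 2 and 3
    [0, 0, 0, 1],   # paper 2 cites paper 3
    [0, 0, 0, 0],   # paper 3 cites nothing
]
n = len(cites)

def bibliographic_coupling(a, b):
    """Number of papers cited by both a and b."""
    return sum(cites[a][k] and cites[b][k] for k in range(n))

def co_citation(a, b):
    """Number of papers that cite both a and b."""
    return sum(cites[k][a] and cites[k][b] for k in range(n))

print(bibliographic_coupling(0, 1))  # 1 -- both cite paper 2
print(co_citation(2, 3))             # 1 -- paper 1 cites both
```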
18
Citation Graph
[Diagram: a citation graph. An arrow from one paper to another means the first paper cites the second; the reverse relation is "is cited by".]
Note that journal citations always refer to earlier work.
19
Graphical Analysis of Hyperlinks on the Web
[Diagram: a directed graph of six Web pages, numbered 1-6. One page links to many other pages (a hub); many pages link to another page (an authority).]
20
PageRank Algorithm
Used to estimate the importance of documents.
Concept: The rank of a web page is higher if many pages link to it. Links from highly ranked pages are given greater weight than links from less highly ranked pages.
21
Intuitive Model (Basic Concept)
A user:
1. Starts at a random page on the web.
2. Selects a random hyperlink from the current page and jumps to the corresponding page.
3. Repeats Step 2 a very large number of times.
Pages are ranked according to the relative frequency with which they are visited.
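
As an illustration, here is a minimal simulation of this basic model, using the six-page link structure implied by the matrices on the following slides (the link table below is a reconstruction from those numbers, not given explicitly in the lecture).

```python
import random

links = {            # page -> pages it links to
    1: [2, 3, 4, 5],
    2: [3, 4],
    3: [2],
    4: [3],
    5: [1, 4, 6],
    6: [4],
}

def simulate(steps=100_000, seed=0):
    random.seed(seed)
    visits = {page: 0 for page in links}
    page = random.choice(list(links))          # 1. start at a random page
    for _ in range(steps):
        page = random.choice(links[page])      # 2. follow a random hyperlink
        visits[page] += 1                      # 3. repeat many times
    return visits

print(simulate())
```

With this particular graph, almost all visits end up on pages 2, 3 and 4; once the surfer enters that group there is no link back out, which is the problem the damping step on slide 27 addresses.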
22
Matrix Representation
An entry of 1 in column Pj and row Pi means that page Pj (the citing page) has a link to page Pi (the cited page). The column totals give the number of links out of each page; the row totals give the number of links into each page.

                 Citing page (from)
            P1   P2   P3   P4   P5   P6   Number
       P1    .    .    .    .    1    .      1
       P2    1    .    1    .    .    .      2
Cited  P3    1    1    .    1    .    .      3
page   P4    1    1    .    .    1    1      4
(to)   P5    1    .    .    .    .    .      1
       P6    .    .    .    .    1    .      1

   Number    4    2    1    1    3    1
23
Basic Algorithm: Normalize by Number of Links from Page

Each column of the link matrix is divided by its column total (the number of links out of the citing page), giving the normalized link matrix B:

                 Citing page
            P1    P2    P3    P4    P5    P6
       P1    .     .     .     .   0.33    .
       P2   0.25   .    1      .     .     .
Cited  P3   0.25  0.5    .    1      .     .
page   P4   0.25  0.5    .     .   0.33   1
       P5   0.25   .     .     .     .     .
       P6    .     .     .     .   0.33    .

   Number    4     2     1     1     3     1
24
Basic Algorithm: Weighting of Pages

Initially all pages have weight 1:

    w1 = (1, 1, 1, 1, 1, 1)

Recalculate the weights:

    w2 = B w1 = (0.33, 1.25, 1.75, 2.08, 0.25, 0.33)
25
Basic Algorithm: Iterate

Iterate wk = B wk-1:

    w1 = (1,    1,    1,    1,    1,    1   )
    w2 = (0.33, 1.25, 1.75, 2.08, 0.25, 0.33)
    w3 = (0.08, 1.83, 2.79, 1.12, 0.08, 0.08)
    w4 = (0.03, 2.80, 2.06, 1.05, 0.02, 0.03)
    ...

which converges to

    w  = (0.00, 2.39, 2.39, 1.19, 0.00, 0.00)
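
A minimal sketch of this iteration in Python, using the same reconstructed six-page link table as before; the printed weights agree with the table above up to rounding.

```python
links = {1: [2, 3, 4, 5], 2: [3, 4], 3: [2], 4: [3], 5: [1, 4, 6], 6: [4]}
n = len(links)

# Normalized link matrix B: B[i-1][j-1] = 1/outdegree(j) if page j links to page i.
B = [[0.0] * n for _ in range(n)]
for j, targets in links.items():
    for i in targets:
        B[i - 1][j - 1] = 1.0 / len(targets)

w = [1.0] * n                        # w1: every page starts with weight 1
for k in range(2, 51):               # iterate wk = B wk-1
    w = [sum(B[i][j] * w[j] for j in range(n)) for i in range(n)]
    if k <= 4:
        print(f"w{k} =", [round(x, 2) for x in w])   # w2, w3, w4 (cf. the table above)
print("w  =", [round(x, 2) for x in w])              # converges toward the limit w shown above
```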
26
Graphical Analysis of Hyperlinks on the Web
[Diagram: the same six-page graph, pages 1-6. There is no link out of the group of pages 2, 3 and 4, so the basic iteration drains all of the weight into that group.]
27
Google PageRank with Damping
A user:
1. Starts at a random page on the web.
2a. With probability p, selects any random page and jumps to it.
2b. With probability 1-p, selects a random hyperlink from the current page and jumps to the corresponding page.
3. Repeats Steps 2a and 2b a very large number of times.
Pages are ranked according to the relative frequency with which they are visited.
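
A sketch of the damped surfer, modifying the earlier simulation; the six-page link table is the same reconstruction, and p = 0.15 is just an illustrative value.

```python
import random

links = {1: [2, 3, 4, 5], 2: [3, 4], 3: [2], 4: [3], 5: [1, 4, 6], 6: [4]}

def simulate_damped(p=0.15, steps=100_000, seed=0):
    random.seed(seed)
    pages = list(links)
    visits = {page: 0 for page in pages}
    page = random.choice(pages)                 # 1. start at a random page
    for _ in range(steps):
        if random.random() < p:
            page = random.choice(pages)         # 2a. jump to any random page
        else:
            page = random.choice(links[page])   # 2b. follow a random hyperlink
        visits[page] += 1                       # 3. repeat many times
    return visits

print(simulate_damped())
```

Unlike the undamped surfer, every page now receives a non-zero share of visits, because the random jumps let the surfer escape the closed group of pages 2, 3 and 4.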
28
The PageRank Iteration
The basic method iterates using the normalized link matrix B:

    wk = B wk-1

This w is the principal eigenvector of B.

Google iterates using a damping factor. The method iterates using a matrix B', where

    B' = dN + (1 - d)B

N is the matrix with every element equal to 1/n, and d is a constant found by experiment.
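
The same damping expressed in matrix form, as a sketch on the reconstructed six-page example; d = 0.15 here is an arbitrary illustrative value, not one from the lecture.

```python
links = {1: [2, 3, 4, 5], 2: [3, 4], 3: [2], 4: [3], 5: [1, 4, 6], 6: [4]}
n = len(links)

B = [[0.0] * n for _ in range(n)]               # normalized link matrix B
for j, targets in links.items():
    for i in targets:
        B[i - 1][j - 1] = 1.0 / len(targets)

d = 0.15                                        # damping constant (example value)
# B' = dN + (1 - d)B, where every element of N is 1/n
B_damped = [[d / n + (1 - d) * B[i][j] for j in range(n)] for i in range(n)]

w = [1.0] * n
for _ in range(100):                            # iterate wk = B' wk-1
    w = [sum(B_damped[i][j] * w[j] for j in range(n)) for i in range(n)]
print([round(x, 2) for x in w])                 # every page now has a non-zero weight
```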
29
Google PageRank
  • The Google PageRank algorithm is usually written with the following notation:
    • page A has pages T1, ..., Tn pointing to it
    • d = damping factor
    • C(A) = number of links out of A
  • The rank of page A is then

        PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

  • Iterate until the ranks converge.
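
A sketch of this per-page form of the iteration on the same reconstructed example; note that in this published form the constant d weights the link-following term, so a commonly used value is around 0.85.

```python
links = {1: [2, 3, 4, 5], 2: [3, 4], 3: [2], 4: [3], 5: [1, 4, 6], 6: [4]}

# Invert the link table: in_links[A] = the pages Ti that point to A.
in_links = {page: [] for page in links}
for source, targets in links.items():
    for target in targets:
        in_links[target].append(source)

d = 0.85                                 # damping factor
pr = {page: 1.0 for page in links}       # initial ranks
for _ in range(100):                     # iterate until the ranks settle
    pr = {
        a: (1 - d) + d * sum(pr[t] / len(links[t]) for t in in_links[a])
        for a in links
    }
print({page: round(score, 2) for page, score in pr.items()})
```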

30
Information Retrieval Using PageRank
Simple Method: Consider all hits (i.e., all document vectors that share at least one term with the query vector) as equal. Display the hits ranked by PageRank.

The disadvantage of this method is that it gives no attention to how closely a document matches the query.
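
A minimal sketch of the simple method with toy data; the inverted index and the PageRank scores below are invented purely for illustration.

```python
pagerank = {"doc1": 0.42, "doc2": 0.07, "doc3": 0.91}   # precomputed importance scores
index = {                     # inverted index: term -> documents containing it
    "web":    {"doc1", "doc3"},
    "search": {"doc2", "doc3"},
}

def simple_search(query):
    # Every document sharing at least one term with the query counts as a hit...
    hits = set().union(*(index.get(term, set()) for term in query.lower().split()))
    # ...and the hits are displayed in PageRank order, ignoring how well they match.
    return sorted(hits, key=lambda doc: pagerank[doc], reverse=True)

print(simple_search("web search"))    # ['doc3', 'doc1', 'doc2']
```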
31
Reference Pattern Ranking using Dynamic Document
Sets
PageRank calculates document ranks for the entire (fixed) set of documents. The calculations are made periodically (e.g., monthly), and the document ranks are the same for all queries.

Concept of dynamic document sets: reference patterns among documents that are related to a specific query convey more information than patterns calculated across entire document collections. With dynamic document sets, reference patterns are calculated for a set of documents that are selected based on each individual query.
32
Reference Pattern Ranking using Dynamic Document
Sets
Teoma Dynamic Ranking Algorithm (used in Ask Jeeves):

1. Search using conventional term weighting. Rank the hits using the similarity between the query and the documents.
2. Select the highest-ranking hits (e.g., the top 5,000 hits).
3. Carry out PageRank or a similar algorithm on this set of hits. This creates a set of document ranks that are specific to this query.
4. Display the results ranked in the order of the reference patterns calculated.
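
An illustrative sketch of these four steps with toy in-memory structures; it assumes only the published description above, not the actual Teoma implementation.

```python
def dynamic_rank(similarity, links, top_k=5000, d=0.85, iterations=50):
    """similarity: doc -> term-weighted similarity to the query
       links:      doc -> docs it links to (over the whole collection)"""
    # Steps 1-2: rank by conventional term weighting and keep the top hits.
    hits = sorted(similarity, key=similarity.get, reverse=True)[:top_k]
    hit_set = set(hits)

    # Step 3: run a PageRank-style iteration on links restricted to the hit set.
    out_links = {h: [t for t in links.get(h, []) if t in hit_set] for h in hits}
    in_links = {h: [] for h in hits}
    for source, targets in out_links.items():
        for target in targets:
            in_links[target].append(source)

    rank = {h: 1.0 for h in hits}
    for _ in range(iterations):
        rank = {
            h: (1 - d) + d * sum(rank[s] / len(out_links[s]) for s in in_links[h])
            for h in hits
        }
    # Step 4: display the hits ordered by the query-specific rank.
    return sorted(hits, key=lambda h: rank[h], reverse=True)
```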
33
Combining Term Weighting with Reference Pattern
Ranking
Combined Method:

1. Find all documents that share a term with the query vector.
2. The similarity, using conventional term weighting, between the query and document j is sj.
3. The rank of document j using PageRank or another reference pattern ranking is pj.
4. Calculate a combined rank cj = λ sj + (1 - λ) pj, where λ is a constant.
5. Display the hits ranked by cj.

This method is used in several commercial systems, but the details have not been published.
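
A minimal sketch of steps 4 and 5, with λ and the two sets of scores chosen purely for illustration.

```python
def combined_rank(similarity, pattern_rank, lam=0.7):
    """cj = lam*sj + (1 - lam)*pj for each document j."""
    return {
        doc: lam * similarity[doc] + (1 - lam) * pattern_rank[doc]
        for doc in similarity
    }

s = {"doc1": 0.9, "doc2": 0.4, "doc3": 0.6}      # term-weighted similarity sj
p = {"doc1": 0.1, "doc2": 0.8, "doc3": 0.5}      # reference-pattern rank pj
c = combined_rank(s, p)
for doc in sorted(c, key=c.get, reverse=True):   # display the hits ranked by cj
    print(doc, round(c[doc], 2))                 # doc1 0.66, doc3 0.57, doc2 0.52
```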
34
Cornell Note
Jon Kleinberg of Cornell Computer Science has carried out extensive research in this area, covering both theory and the practical development of new algorithms. In particular, he has studied hubs (documents that refer to many others) and authorities (documents that are referenced by many others).
35
Google API
36
Selective searching
37
Google News