CS 430 INFO 430 Information Retrieval - PowerPoint PPT Presentation

Provided by: wya2
1
CS 430 / INFO 430 Information Retrieval
Lecture 9: Latent Semantic Indexing
2
Course Administration
Midterm Examination: Wednesday, September 26, from 7:30 to 9:00, in Phillips Hall 203.
See the examination page on the web site for more information and a sample examination from a previous year.
Open book. Laptop computers may be used (a) to store lecture slides and papers read in the discussion classes, (b) as a calculator.
No electronic device may be used for any form of communication.

3
Latent Semantic Indexing
Objective: Replace indexes that use sets of index terms by indexes that use concepts.
Approach: Map the term vector space into a lower dimensional space, using singular value decomposition.
Each dimension in the new space corresponds to a latent concept in the original data.
4
Deficiencies with Conventional Automatic Indexing
Synonymy: Various words and phrases refer to the same concept (lowers recall).
Polysemy: Individual words have more than one meaning (lowers precision).
Independence: No significance is given to two terms that frequently appear together.
Latent semantic indexing addresses the first of these (synonymy) and the third (dependence).
5
Example
Query: "IDF in computer-based information look-up"
Index terms for a document: access, document, retrieval, indexing
How can we recognize that information look-up is related to retrieval and indexing?
Conversely, if information has many different contexts in the set of documents, how can we discover that it is an unhelpful term for retrieval?
6
Technical Memo Example: Titles
c1 Human machine interface for Lab ABC computer applications
c2 A survey of user opinion of computer system response time
c3 The EPS user interface management system
c4 System and human system engineering testing of EPS
c5 Relation of user-perceived response time to error measurement
m1 The generation of random, binary, unordered trees
m2 The intersection graph of paths in trees
m3 Graph minors IV: Widths of trees and well-quasi-ordering
m4 Graph minors: A survey
7
Technical Memo Example: Terms and Documents
Terms       Documents
            c1  c2  c3  c4  c5  m1  m2  m3  m4
human        1   0   0   1   0   0   0   0   0
interface    1   0   1   0   0   0   0   0   0
computer     1   1   0   0   0   0   0   0   0
user         0   1   1   0   1   0   0   0   0
system       0   1   1   2   0   0   0   0   0
response     0   1   0   0   1   0   0   0   0
time         0   1   0   0   1   0   0   0   0
EPS          0   0   1   1   0   0   0   0   0
survey       0   1   0   0   0   0   0   0   1
trees        0   0   0   0   0   1   1   1   0
graph        0   0   0   0   0   0   1   1   1
minors       0   0   0   0   0   0   0   1   1
8
Technical Memo Example: Query
Query: Find documents relevant to "human computer interaction"
Simple Term Matching:
Matches c1, c2, and c4
Misses c3 and c5
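The simple term matching result can be checked with a short sketch. The counts come from the terms and documents table; the names DOCS, COUNTS and simple_term_match are illustrative assumptions, not part of the lecture, and terms are lowercased here for convenience.

```python
# Term counts per document, copied from the example table.
# Column order: c1 c2 c3 c4 c5 m1 m2 m3 m4. Keys are lowercased (EPS -> eps).
DOCS = ["c1", "c2", "c3", "c4", "c5", "m1", "m2", "m3", "m4"]
COUNTS = {
    "human":     [1, 0, 0, 1, 0, 0, 0, 0, 0],
    "interface": [1, 0, 1, 0, 0, 0, 0, 0, 0],
    "computer":  [1, 1, 0, 0, 0, 0, 0, 0, 0],
    "user":      [0, 1, 1, 0, 1, 0, 0, 0, 0],
    "system":    [0, 1, 1, 2, 0, 0, 0, 0, 0],
    "response":  [0, 1, 0, 0, 1, 0, 0, 0, 0],
    "time":      [0, 1, 0, 0, 1, 0, 0, 0, 0],
    "eps":       [0, 0, 1, 1, 0, 0, 0, 0, 0],
    "survey":    [0, 1, 0, 0, 0, 0, 0, 0, 1],
    "trees":     [0, 0, 0, 0, 0, 1, 1, 1, 0],
    "graph":     [0, 0, 0, 0, 0, 0, 1, 1, 1],
    "minors":    [0, 0, 0, 0, 0, 0, 0, 1, 1],
}


def simple_term_match(query):
    """Return documents that contain at least one indexed query term."""
    # Query words that are not index terms ("interaction") are ignored.
    terms = [w for w in query.lower().split() if w in COUNTS]
    return [doc for j, doc in enumerate(DOCS)
            if any(COUNTS[t][j] > 0 for t in terms)]


matched = simple_term_match("human computer interaction")
print(matched)  # ['c1', 'c2', 'c4']
```

Documents c3 and c5 share no indexed term with the query, which is why term matching misses them even though their titles concern the same topic.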
9
Models of Semantic Similarity
Proximity models: Put similar items together in some space or structure.
Clustering (hierarchical, partition, overlapping). Documents are considered close to the extent that they contain the same terms. Most then arrange the documents into a hierarchy based on distances between documents. Covered later in course.
Factor analysis based on matrix of similarities between documents (single mode).
Two-mode proximity methods. Start with rectangular matrix and construct explicit representations of both row and column objects.
10
Selection of Two-mode Factor Analysis
Additional criterion: Computationally efficient, O(N^2 k^3)
N is the number of terms plus documents
k is the number of dimensions
11
The term vector space
The space has as many dimensions as there are terms in the word list.
[Figure: document vectors d1 and d2 in a space with term axes t1, t2 and t3; the angle between them is marked.]
12
Figure 1: Latent concept vector space
[Legend: term, document, query; dashed line marks cosine > 0.9]
13
Mathematical concepts
Define X as the term-document matrix, with t rows (number of index terms) and d columns (number of documents).
Singular Value Decomposition: For any matrix X, with t rows and d columns, there exist matrices T0, S0 and D0', such that
X = T0 S0 D0'
T0 and D0 are the matrices of left and right singular vectors.
T0 and D0 have orthonormal columns.
S0 is the diagonal matrix of singular values.
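As a minimal sketch, the decomposition can be computed for the technical memo matrix with NumPy (an assumption; the lecture does not name a tool). The variable D0t stands for D0'. Note that np.linalg.svd with full_matrices=False returns min(t, d) singular values, some of which may be zero, rather than exactly rank-many.

```python
import numpy as np

# Term-document matrix X from the technical memo example:
# rows are the 12 terms, columns are c1..c5, m1..m4.
X = np.array([
    [1, 0, 0, 1, 0, 0, 0, 0, 0],  # human
    [1, 0, 1, 0, 0, 0, 0, 0, 0],  # interface
    [1, 1, 0, 0, 0, 0, 0, 0, 0],  # computer
    [0, 1, 1, 0, 1, 0, 0, 0, 0],  # user
    [0, 1, 1, 2, 0, 0, 0, 0, 0],  # system
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # response
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # time
    [0, 0, 1, 1, 0, 0, 0, 0, 0],  # EPS
    [0, 1, 0, 0, 0, 0, 0, 0, 1],  # survey
    [0, 0, 0, 0, 0, 1, 1, 1, 0],  # trees
    [0, 0, 0, 0, 0, 0, 1, 1, 1],  # graph
    [0, 0, 0, 0, 0, 0, 0, 1, 1],  # minors
], dtype=float)

# Thin SVD: X = T0 @ S0 @ D0t, with singular values in decreasing order.
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
S0 = np.diag(s0)

print(T0.shape, S0.shape, D0t.shape)    # (12, 9) (9, 9) (9, 9)
print(np.allclose(T0 @ S0 @ D0t, X))    # the product reconstructs X
print(np.allclose(T0.T @ T0, np.eye(9)))  # columns of T0 are orthonormal
```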
14
Dimensions of matrices
X (t x d) = T0 (t x m) S0 (m x m) D0' (m x d)
m is the rank of X, m ≤ min(t, d)
15
Reduced Rank
S0 can be chosen so that the diagonal elements are positive and decreasing in magnitude.
Keep the first k and set the others to zero.
Delete the zero rows and columns of S0 and the corresponding columns of T0 and D0. This gives
X ≈ X̂ = TSD'
Interpretation: If the value of k is selected well, the expectation is that X̂ retains the semantic information from X, but eliminates noise from synonymy and recognizes dependence.
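The truncation step can be sketched as follows. The 4 x 3 matrix is a toy example invented for illustration, and reduced_rank is a hypothetical helper name. The last lines check a standard property of this construction: the Frobenius-norm error of the rank-k approximation equals the root-sum-square of the discarded singular values.

```python
import numpy as np


def reduced_rank(X, k):
    """Return T (t x k), S (k x k), D (d x k) for the k largest singular values."""
    T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
    return T0[:, :k], np.diag(s0[:k]), D0t[:k, :].T


# Toy term-document matrix (4 terms x 3 documents), not from the lecture.
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 2, 1],
              [0, 1, 1]], dtype=float)

k = 2
T, S, D = reduced_rank(X, k)
X_hat = T @ S @ D.T                      # rank-k approximation of X

s_all = np.linalg.svd(X, compute_uv=False)
err = np.linalg.norm(X - X_hat)          # Frobenius norm of the residual
print(err, np.sqrt(np.sum(s_all[k:] ** 2)))  # the two values agree
```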


16
Selection of singular values
X̂ (t x d) = T (t x k) S (k x k) D' (k x d)
k is the number of singular values chosen to represent the concepts in the set of documents.
Usually, k << m.
17
Comparing a Term and a Document

An individual cell of X̂ approximates the number of occurrences of term i in document j.
X̂ = TSD' = TS^(1/2) (DS^(1/2))'
where S^(1/2) is a diagonal matrix whose values are the square roots of the corresponding elements of S.
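A small numerical check of the term-document comparison, assuming NumPy and a toy matrix invented for illustration: cell (i, j) of the reduced-rank matrix T S D' is the dot product of row i of T S^(1/2) with row j of D S^(1/2).

```python
import numpy as np

# Toy term-document matrix (4 terms x 3 documents), not from the lecture.
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 2, 1],
              [0, 1, 1]], dtype=float)

k = 2
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
T, s, D = T0[:, :k], s0[:k], D0t[:k, :].T

S_half = np.diag(np.sqrt(s))     # square roots of the kept singular values
term_coords = T @ S_half         # one row per term
doc_coords = D @ S_half          # one row per document

X_hat = T @ np.diag(s) @ D.T
i, j = 2, 1                      # term 2, document 1 (zero-based)
cell = term_coords[i] @ doc_coords[j]
print(np.isclose(cell, X_hat[i, j]))
```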
18
Calculating Similarities in the Concept Space
Objective: Calculate similarities between terms, documents, and queries, using the matrices T, S, and D.
19
Mathematical Revision
A is a p x q matrix
B is an r x q matrix
ai is the vector represented by row i of A
bj is the vector represented by row j of B
The inner product ai.bj is element i, j of AB'
[Figure: A (p x q) multiplied by B' (q x r); the ith row of A is combined with the column of B' that holds the jth row of B.]
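The identity can be confirmed on small hand-written matrices. This is only an illustration; the values of A and B are arbitrary.

```python
import numpy as np

A = np.array([[1., 2., 0.],
              [0., 1., 3.]])        # p = 2, q = 3
B = np.array([[2., 0., 1.],
              [1., 1., 1.],
              [0., 4., 0.],
              [3., 0., 2.]])        # r = 4, q = 3

P = A @ B.T                          # p x r matrix of all row-by-row inner products

i, j = 1, 2
print(P.shape)                       # (2, 4)
print(P[i, j], np.dot(A[i], B[j]))   # 4.0 4.0
```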
20
Comparing Two Terms

The dot product of two rows of X̂ reflects the extent to which two terms have a similar pattern of occurrences.
X̂X̂' = TSD'(TSD')'
     = TSD'DS'T'
     = TSS'T'      (since D is orthonormal)
     = TS(TS)'
To calculate the i, j cell, take the dot product between the i and j rows of TS.
Since S is diagonal, TS differs from T only by stretching the coordinate system.
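A sketch of the term-term comparison, assuming NumPy and a toy matrix invented for illustration. It checks that the dot products of rows of TS reproduce the products of rows of the reduced-rank matrix, and that multiplying by the diagonal S only rescales each concept axis.

```python
import numpy as np

# Toy term-document matrix (4 terms x 3 documents), not from the lecture.
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 2, 1],
              [0, 1, 1]], dtype=float)

k = 2
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
T, s, D = T0[:, :k], s0[:k], D0t[:k, :].T
S = np.diag(s)
X_hat = T @ S @ D.T

TS = T @ S                        # one row of concept coordinates per term
term_sims = TS @ TS.T             # cell (i, j): dot product of rows i and j of TS

print(np.allclose(term_sims, X_hat @ X_hat.T))
print(np.allclose(TS, T * s))     # column c of T is stretched by s[c]
```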
21
Comparing Two Documents

The dot product of two columns of X̂ reflects the extent to which two documents have a similar pattern of terms.
X̂'X̂ = (TSD')'TSD' = DS(DS)'
To calculate the i, j cell, take the dot product between the i and j rows of DS.
Since S is diagonal, DS differs from D only by stretching the coordinate system.
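The document-document case can be sketched the same way, assuming NumPy and a toy matrix invented for illustration. Here D is stored as a d x k array, so each document corresponds to one row of DS.

```python
import numpy as np

# Toy term-document matrix (4 terms x 3 documents), not from the lecture.
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 2, 1],
              [0, 1, 1]], dtype=float)

k = 2
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
T, S, D = T0[:, :k], np.diag(s0[:k]), D0t[:k, :].T   # D is d x k
X_hat = T @ S @ D.T

DS = D @ S                        # one row of concept coordinates per document
doc_sims = DS @ DS.T              # cell (i, j): dot product of rows i and j of DS

print(doc_sims.shape)             # (3, 3)
print(np.allclose(doc_sims, X_hat.T @ X_hat))
```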
22
Comparing a Query and a Document
A query can be expressed as a vector in the term-document vector space, xq.
xqi = 1 if term i is in the query and 0 otherwise. (Ignore query terms that are not in the term vector space.)
Let pqj be the inner product of the query xq with document dj in the term-document vector space.
pqj is the jth element in the product xq'X̂.

23
Comparing a Query and a Document
[pq1 ... pqj ... pqd] = [xq1 xq2 ... xqt] X̂
Document dj is column j of X̂; pqj is the inner product of query q with document dj.
pq' = xq'X̂ = xq'TSD' = xq'T(DS)'
similarity(q, dj) = pqj / (|xq| |dj|)
The cosine of the angle is the inner product divided by the lengths of the vectors.
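A sketch of the query-document calculation, assuming NumPy, a toy matrix and a toy query vector (both invented for illustration). It computes the inner products as xq'T(DS)' and then divides by the vector lengths, taking dj to be column j of the reduced-rank matrix.

```python
import numpy as np

# Toy term-document matrix (4 terms x 3 documents), not from the lecture.
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 2, 1],
              [0, 1, 1]], dtype=float)
xq = np.array([1.0, 1.0, 0.0, 0.0])   # query contains terms 0 and 1

k = 2
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
T, S, D = T0[:, :k], np.diag(s0[:k]), D0t[:k, :].T
X_hat = T @ S @ D.T

pq = xq @ T @ (D @ S).T                      # inner products with every document
doc_lengths = np.linalg.norm(X_hat, axis=0)  # length of each column dj
similarity = pq / (np.linalg.norm(xq) * doc_lengths)

print(np.allclose(pq, xq @ X_hat))
print(similarity)                            # one cosine per document
```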
24
Comparing a Query and a Document
In the reading, the authors treat the query as a pseudo-document in the concept space, dq:
dq = xq'TS^-1
Note that S^-1 stretches the vector.
To compare a query against document j, they extend the method used to compare document i with document j. Take the jth element of the product of dqS and (DS)'. This is the jth element of the product xq'T(DS)', which is the same expression as before.
Note that with their notation dq is a row vector.
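The equivalence of the two formulations can be checked numerically. This is a sketch assuming NumPy, with a toy matrix and query invented for illustration; it relies on the kept singular values being non-zero so that S can be inverted.

```python
import numpy as np

# Toy term-document matrix (4 terms x 3 documents), not from the lecture.
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 2, 1],
              [0, 1, 1]], dtype=float)
xq = np.array([1.0, 1.0, 0.0, 0.0])   # toy query vector

k = 2
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
T, S, D = T0[:, :k], np.diag(s0[:k]), D0t[:k, :].T

dq = xq @ T @ np.linalg.inv(S)        # pseudo-document: row vector with k elements

via_pseudo_doc = (dq @ S) @ (D @ S).T # product of dqS and (DS)'
direct = xq @ T @ (D @ S).T           # xq'T(DS)'

print(dq.shape)                       # (2,)
print(np.allclose(via_pseudo_doc, direct))
```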
25
Technical Memo Example: Query
Query: "human system interactions on trees"
Terms       xq
human       1
interface   0
computer    0
user        0
system      1
response    0
time        0
EPS         0
survey      0
trees       1
graph       0
minors      0
In term-document space, a query is represented by xq, a column vector with t elements.
In concept space, a query is represented by dq, a row vector with k elements.
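Both representations of this query can be built with a short sketch, assuming NumPy. The names TERMS and query_vector are illustrative, terms are lowercased for matching, and k = 2 is an assumption chosen only for the example.

```python
import numpy as np

TERMS = ["human", "interface", "computer", "user", "system", "response",
         "time", "eps", "survey", "trees", "graph", "minors"]

# Term-document matrix from the technical memo example (columns c1..c5, m1..m4).
X = np.array([
    [1, 0, 0, 1, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0, 0, 0, 0],
    [0, 1, 1, 2, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 1, 1],
], dtype=float)


def query_vector(query):
    """Binary vector xq: 1 where an index term occurs in the query, else 0."""
    words = set(query.lower().split())
    return np.array([1.0 if t in words else 0.0 for t in TERMS])


xq = query_vector("human system interactions on trees")  # t = 12 elements

k = 2                                                    # assumed for illustration
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
T, S = T0[:, :k], np.diag(s0[:k])
dq = xq @ T @ np.linalg.inv(S)                           # k elements

print(xq.astype(int).tolist())  # [1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0]
print(dq.shape)                 # (2,)
```

The words "interactions" and "on" are not index terms, so they contribute nothing to xq.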
 
26
Experimental Results
Deerwester, et al. tried latent semantic indexing on two test collections, MED and CISI, where queries and relevance judgments were available.
Documents were full text of title and abstract.
Stop list of 439 words (SMART); no stemming, etc.
Comparison with: (a) simple term matching, (b) SMART, (c) Voorhees method.
27
Experimental Results: 100 Factors
28
Experimental Results: Number of Factors