Title: Iterative residual rescaling: An analysis and generalization of LSI
1. Iterative residual rescaling: An analysis and generalization of LSI
- Rie Kubota Ando and Lillian Lee. Iterative residual rescaling: An analysis and generalization of LSI. In Proceedings of the 24th Annual International ACM SIGIR Conference (SIGIR 2001), 2001.
- Presenter: ???
2. Introduction
- The disadvantage of VSM: documents that share no terms are mapped to orthogonal vectors, even when they are clearly related.
- LSI attempts to overcome this shortcoming by projecting the term-document matrix onto a lower-dimensional subspace.
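A minimal sketch of this projection idea, using a toy term-document matrix and a rank-2 truncated SVD (the matrix entries are invented for illustration):

```python
import numpy as np

# Toy term-document matrix (rows = terms, columns = documents).
# Docs 0 and 2 share no terms, yet both overlap with doc 1.
D = np.array([
    [1.0, 1.0, 0.0],   # term "car"
    [0.0, 1.0, 1.0],   # term "auto"
    [0.0, 0.0, 1.0],   # term "engine"
])

# LSI: rank-k truncated SVD of D.
U, s, Vt = np.linalg.svd(D, full_matrices=False)
k = 2
D_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# In VSM, docs 0 and 2 are orthogonal (cosine 0); after the rank-2
# projection their cosine is no longer zero.
print(cos(D[:, 0], D[:, 2]))      # 0.0
print(cos(D_k[:, 0], D_k[:, 2]))  # nonzero
```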
3. Introduction of IRR
[Figure: the weighted term-by-document matrix A is decomposed by SVD as A = U Σ V^T (eigenvalues/eigenvectors); IRR applies a rescaling step between successive eigenvector extractions.]
4. Frobenius norm and matrix 2-norm
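For reference, the two norms of an m-by-n matrix A with singular values σ₁ ≥ σ₂ ≥ … can be written as:

```latex
\|A\|_F = \Big(\sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}^2\Big)^{1/2}
        = \Big(\sum_i \sigma_i^2\Big)^{1/2},
\qquad
\|A\|_2 = \max_{\|x\|_2 = 1} \|Ax\|_2 = \sigma_1 .
```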
5. Analyzing LSI
- Topic-based similarities:
- C: an n-document collection
- D: the m-by-n term-document matrix
- k underlying topics (k < n)
- A relevance score rel(d, t) for each document d and each topic t; the relevance scores are normalized for each document
- The true topic-based similarity between documents d_i and d_j is the sum over topics of rel(d_i, t) rel(d_j, t)
- Collecting these values gives an n-by-n similarity matrix S
[Figure: the topic-by-document relevance matrix, multiplied with itself, yields the doc-by-doc similarity matrix S.]
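A small sketch of how S arises from the relevance scores (the rel values below are invented, and per-document normalization of the relevance vectors is an assumption about the setup):

```python
import numpy as np

# Hypothetical relevance scores rel(d_i, t): rows = k topics, cols = n docs.
R = np.array([
    [1.0, 0.8, 0.0, 0.1],   # topic 1
    [0.0, 0.2, 1.0, 0.9],   # topic 2
])

# Normalize each document's relevance vector (assumed normalization).
R = R / np.linalg.norm(R, axis=0, keepdims=True)

# True topic-based similarity: S[i, j] = sum_t rel(d_i, t) * rel(d_j, t).
S = R.T @ R   # n-by-n
print(np.round(S, 2))
```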
6. The optimum subspace
- Given a subspace X of the term vector space, let b_1, ..., b_h form an orthonormal basis of X.
7. The optimum subspace
- We have the m-by-n term-document matrix D.
- D_X, the projection of D onto X, is obtained by projecting each document vector onto X.
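As a sketch, with B holding an orthonormal basis of a hypothetical 2-dimensional subspace X (random data, basis obtained via QR):

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.standard_normal((5, 4))        # m-by-n term-document matrix

# Orthonormal basis B (m-by-h) of a subspace X, here built by
# orthonormalizing two random vectors with QR.
B, _ = np.linalg.qr(rng.standard_normal((5, 2)))

# Projection of D onto X: each document vector d becomes B B^T d.
D_X = B @ (B.T @ D)

# Projecting twice changes nothing (B B^T is a projector).
assert np.allclose(B @ (B.T @ D_X), D_X)
```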
8. The optimum subspace
- Deviation matrix: the entry-wise difference between the similarities measured in the subspace and the true topic-based similarities in S.
- Goal: find a subspace such that the entries of the deviation matrix are small.
- The optimum subspace is the one minimizing the norm of the deviation matrix; the minimum value is the optimum error.
- If the optimum error is high, then we cannot expect even the optimum subspace to fully reveal the topic dominances.
9. The singular value decomposition and LSI
- SVD: D = U Σ V^T.
- The left singular vectors can be obtained greedily, by the following observation: let p_j be the projection of document vector d_j onto the span of the left singular vectors found so far, and let r_j = d_j - p_j be the residual vector; the next left singular vector is the direction that best fits the residual vectors.
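This greedy, residual-based view of the SVD can be checked numerically (random data, with NumPy's SVD as the reference):

```python
import numpy as np

rng = np.random.default_rng(1)
D = rng.standard_normal((6, 5))
U, s, Vt = np.linalg.svd(D, full_matrices=False)

# The first left singular vector maximizes the sum of squared projections
# of the document vectors; sigma_1^2 is that maximum value.
u1 = U[:, 0]
assert np.isclose(np.sum((u1 @ D) ** 2), s[0] ** 2)

# Residual vectors: r_j = d_j - (u1 . d_j) u1.  The second left singular
# vector plays the same role for the residual matrix R.
R = D - np.outer(u1, u1 @ D)
Ur, sr, _ = np.linalg.svd(R, full_matrices=False)
assert np.isclose(sr[0], s[1])   # top singular value of R is sigma_2
```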
10. Analysis of LSI
11. Non-uniformity and LSI
- A crucial quantity in our analysis is the dominance of a given topic t: how much of the collection's weight that topic accounts for.
12. Non-uniformity and LSI
- Topic mingling: if the topic mingling is high, i.e., each document has high similarity with several different topics, then the topics will be fairly difficult to distinguish.
13. Non-uniformity and LSI
- Let σ_i be the i-th largest singular value of D; the analysis bounds each σ_i in terms of the topic dominances and the topic mingling.
14. Non-uniformity and LSI
- Define the non-uniformity of the topic-document distribution as a ratio of topic dominances: the more the largest topic dominates the collection, the higher this ratio will tend to be.
15. Non-uniformity and LSI
- Original error: the error measured in the full VSM space, with no projection applied at all.
- Its square root is the root original error (the input error).
16. Non-uniformity and LSI
- Let X_h be the h-dimensional LSI subspace spanned by the first h left singular vectors of D.
- When the topic-document distribution is relatively uniform, the error of the LSI subspace must be close to the optimum error.
17. Notation for related values
- One symbol denotes the topic mingling.
- An approximation relation is written between related quantities; the approximation becomes closer as the original error (or the optimum error) becomes smaller.
18. Ando's IRR algorithm
19. Introduction of IRR
20. Ando's IRR algorithm
- Find the unit vector x that best approximates the residual matrix R, i.e., the x maximizing the norm of x^T R.
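A rough Python sketch of this iteration, under the assumption that each step rescales every residual column by its length to the power q before extracting the next basis vector and then deflates the unscaled residuals (the function name `irr` and the toy data are mine, not the paper's):

```python
import numpy as np

def irr(D, h, q=2.0):
    """Sketch of Iterative Residual Rescaling.

    At each step, every residual column is rescaled by its length to the
    power q (amplifying documents the basis so far represents poorly),
    the direction best fitting the rescaled residuals becomes the next
    basis vector, and the unscaled residuals are deflated by it.
    """
    R = D.astype(float).copy()           # residual matrix, starts as D
    basis = []
    for _ in range(h):
        norms = np.linalg.norm(R, axis=0)
        scaled = R * norms ** q          # rescale column j by ||r_j||^q
        # First left singular vector of the rescaled residual matrix:
        # the unit x maximizing ||x^T scaled||.
        U, _, _ = np.linalg.svd(scaled, full_matrices=False)
        b = U[:, 0]
        basis.append(b)
        R = R - np.outer(b, b @ R)       # remove the new direction
    return np.stack(basis, axis=1)       # m-by-h orthonormal basis

# With q = 0 the rescaling is a no-op, so IRR reduces to plain LSI:
rng = np.random.default_rng(2)
D = rng.standard_normal((6, 5))
B = irr(D, h=2, q=0.0)
U, _, _ = np.linalg.svd(D, full_matrices=False)
print(np.allclose(np.abs(B.T @ U[:, :2]), np.eye(2), atol=1e-6))  # True
```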
21. Ando's IRR algorithm
22. Auto-scale method
- Automatic scaling factor determination.
[Figure: topic-by-document relevance pattern in the case where each document is approximately single-topic.]
23. Auto-scale method
- To implement auto-scale, we set the scaling factor q to a linear function of f(D), a quantity computed from the collection.
24. Dimension selection
- Stopping criterion: the residual ratio (effective for both LSI and IRR).
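One way to read this criterion in code, under the assumption that the residual ratio is the fraction of the collection's squared Frobenius norm not yet captured by the subspace (the exact definition and the 0.05 threshold below are illustrative, not from the paper):

```python
import numpy as np

def residual_ratio(D, B):
    """Fraction of the collection's squared Frobenius norm remaining in
    the residual after projecting out the basis vectors in B (m-by-h)."""
    R = D - B @ (B.T @ D)               # residual of D w.r.t. span(B)
    return np.linalg.norm(R) ** 2 / np.linalg.norm(D) ** 2

# Stop enlarging the subspace once the ratio falls below a threshold.
rng = np.random.default_rng(3)
D = rng.standard_normal((6, 5))
U, s, _ = np.linalg.svd(D, full_matrices=False)
for h in range(1, 6):
    r = residual_ratio(D, U[:, :h])
    if r < 0.05:
        break
```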
25. Evaluation metrics
- Kappa average precision.
- Pair-wise average precision: the measured similarity for any two intra-topic documents (sharing at least one topic) should be higher than for any two cross-topic documents, which have no topics in common. Let pair_j denote the document pair with the j-th largest measured cosine.
- Non-intra-topic probability.
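A sketch of pair-wise average precision under this reading: rank all document pairs by measured cosine, then average the precision attained at each intra-topic pair (the helper name and the toy similarity matrix are hypothetical):

```python
import numpy as np
from itertools import combinations

def pairwise_average_precision(sim, doc_topics):
    """Rank all document pairs by measured cosine; a pair is intra-topic
    if the two documents share at least one topic.  Return the average of
    the precision at each intra-topic pair in the ranking."""
    pairs = list(combinations(range(len(doc_topics)), 2))
    pairs.sort(key=lambda p: -sim[p[0], p[1]])   # largest cosine first
    hits, precisions = 0, []
    for j, (a, b) in enumerate(pairs, start=1):
        if doc_topics[a] & doc_topics[b]:        # share at least one topic
            hits += 1
            precisions.append(hits / j)
    return sum(precisions) / len(precisions)

# Toy example: docs 0,1 about topic "A", docs 2,3 about topic "B".
topics = [{"A"}, {"A"}, {"B"}, {"B"}]
sim = np.array([[1.0, 0.9, 0.1, 0.2],
                [0.9, 1.0, 0.3, 0.1],
                [0.1, 0.3, 1.0, 0.8],
                [0.2, 0.1, 0.8, 1.0]])
# Both intra-topic pairs outrank every cross-topic pair here.
print(pairwise_average_precision(sim, topics))  # 1.0
```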
26. Evaluation metrics
- Clustering: let C be a cluster-topic contingency table, where C_ij is the number of documents in cluster i that are relevant to topic j; a quality measure S(C) is defined from this table.
27. Experimental setting
- (1) Chose two TREC topics (more than two can be chosen).
- (2) Specified seven distribution types: (25,25), (30,20), (35,15), (40,10), (43,7), (45,5), (46,4). Each document was relevant to exactly one of the pre-selected topics.
- (3) Extracted single-word stemmed terms using TALENT and removed stop-words.
- (4) Created the term-document matrix and length-normalized the document vectors.
- (5) Implemented AUTO-SCALE, set
28. Controlled-distribution results
- The chosen scaling factor increases on average as the non-uniformity goes up.
29. Controlled-distribution results
[Figure: example clusterings with the lowest and the highest S(C).]
30. Controlled-distribution results
31. Conclusion
- Provided a new theoretical analysis of LSI, showing a precise relationship between LSI's performance and the uniformity of the underlying topic-document distribution.
- Extended Ando's IRR algorithm.
- IRR provides very good performance in comparison to LSI.
32. IRR on summarization
[Figure: each document is split into sentences, turning the term-by-document matrix into a term-by-sentence matrix; IRR is applied in place of the plain SVD U Σ V^T.]
- Use all the documents together as a query to compute the similarity.