Title: Iterative residual rescaling: An analysis and generalization of LSI
1. Iterative residual rescaling: An analysis and generalization of LSI
- Rie Kubota Ando and Lillian Lee. Iterative residual rescaling: An analysis and generalization of LSI. In Proceedings of the 24th Annual International ACM SIGIR Conference (SIGIR 2001), 2001.
- Presenter: ???
2. Introduction
- The disadvantage of VSM: documents that share no terms are mapped to orthogonal vectors, even when they are clearly related.
- LSI attempts to overcome this shortcoming by projecting the term-document matrix onto a lower-dimensional subspace.
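A minimal sketch of this projection idea, using a toy term-document matrix and a rank-2 truncated SVD (the matrix entries are invented for illustration):

```python
import numpy as np

# Toy term-document matrix (rows = terms, columns = documents).
# Docs 0 and 2 share no terms, yet both overlap with doc 1.
D = np.array([
    [1.0, 1.0, 0.0],   # term "car"
    [0.0, 1.0, 1.0],   # term "auto"
    [0.0, 0.0, 1.0],   # term "engine"
])

# LSI: rank-k truncated SVD of D.
U, s, Vt = np.linalg.svd(D, full_matrices=False)
k = 2
D_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# In VSM, docs 0 and 2 are orthogonal (cosine 0); after the rank-2
# projection their cosine is no longer zero.
print(cos(D[:, 0], D[:, 2]))      # 0.0
print(cos(D_k[:, 0], D_k[:, 2]))  # nonzero
```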
3. Introduction of IRR
[Figure: the weighted term-by-document matrix A is decomposed by SVD as A = U Σ V^T (eigenvalues/eigenvectors); IRR applies a rescaling step between successive eigenvector extractions.]
4. Frobenius norm and matrix 2-norm
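For reference, the two norms of an m-by-n matrix A with singular values σ₁ ≥ σ₂ ≥ … can be written as:

```latex
\|A\|_F = \Big(\sum_{i=1}^{m}\sum_{j=1}^{n} a_{ij}^2\Big)^{1/2}
        = \Big(\sum_i \sigma_i^2\Big)^{1/2},
\qquad
\|A\|_2 = \max_{\|x\|_2 = 1} \|Ax\|_2 = \sigma_1 .
```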
5. Analyzing LSI
- Topic-based similarities:
- C: an n-document collection
- D: the m-by-n term-document matrix
- k underlying topics (k < n)
- A relevance score rel(d, t) for each document d and each topic t; the relevance scores are normalized for each document
- The true topic-based similarity between documents d_i and d_j is the sum over topics of rel(d_i, t) rel(d_j, t)
- Collecting these values gives an n-by-n similarity matrix S
[Figure: the topic-by-document relevance matrix, multiplied with itself, yields the doc-by-doc similarity matrix S.]
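A small sketch of how S arises from the relevance scores (the rel values below are invented, and per-document normalization of the relevance vectors is an assumption about the setup):

```python
import numpy as np

# Hypothetical relevance scores rel(d_i, t): rows = k topics, cols = n docs.
R = np.array([
    [1.0, 0.8, 0.0, 0.1],   # topic 1
    [0.0, 0.2, 1.0, 0.9],   # topic 2
])

# Normalize each document's relevance vector (assumed normalization).
R = R / np.linalg.norm(R, axis=0, keepdims=True)

# True topic-based similarity: S[i, j] = sum_t rel(d_i, t) * rel(d_j, t).
S = R.T @ R   # n-by-n
print(np.round(S, 2))
```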
6. The optimum subspace
- Given a subspace X of the term vector space, let b_1, ..., b_h form an orthonormal basis of X.
7. The optimum subspace
- We have the m-by-n term-document matrix D.
- D_X, the projection of D onto X, is obtained by projecting each document vector onto X.
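As a sketch, with B holding an orthonormal basis of a hypothetical 2-dimensional subspace X (random data, basis obtained via QR):

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.standard_normal((5, 4))        # m-by-n term-document matrix

# Orthonormal basis B (m-by-h) of a subspace X, here built by
# orthonormalizing two random vectors with QR.
B, _ = np.linalg.qr(rng.standard_normal((5, 2)))

# Projection of D onto X: each document vector d becomes B B^T d.
D_X = B @ (B.T @ D)

# Projecting twice changes nothing (B B^T is a projector).
assert np.allclose(B @ (B.T @ D_X), D_X)
```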
8. The optimum subspace
- Deviation matrix: the entry-wise difference between the similarities measured in the subspace and the true topic-based similarities in S.
- Goal: find a subspace such that the entries of the deviation matrix are small.
- The optimum subspace is the one minimizing the norm of the deviation matrix; the minimum value is the optimum error.
- If the optimum error is high, then we cannot expect even the optimum subspace to fully reveal the topic dominances.
9. The singular value decomposition and LSI
- SVD: D = U Σ V^T.
- The left singular vectors can be obtained greedily, by the following observation: let p_j be the projection of document vector d_j onto the span of the left singular vectors found so far, and let r_j = d_j - p_j be the residual vector; the next left singular vector is the direction that best fits the residual vectors.
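This greedy, residual-based view of the SVD can be checked numerically (random data, with NumPy's SVD as the reference):

```python
import numpy as np

rng = np.random.default_rng(1)
D = rng.standard_normal((6, 5))
U, s, Vt = np.linalg.svd(D, full_matrices=False)

# The first left singular vector maximizes the sum of squared projections
# of the document vectors; sigma_1^2 is that maximum value.
u1 = U[:, 0]
assert np.isclose(np.sum((u1 @ D) ** 2), s[0] ** 2)

# Residual vectors: r_j = d_j - (u1 . d_j) u1.  The second left singular
# vector plays the same role for the residual matrix R.
R = D - np.outer(u1, u1 @ D)
Ur, sr, _ = np.linalg.svd(R, full_matrices=False)
assert np.isclose(sr[0], s[1])   # top singular value of R is sigma_2
```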
10. Analysis of LSI
11. Non-uniformity and LSI
- A crucial quantity in our analysis is the dominance of a given topic t: how much of the collection's weight that topic accounts for.
12. Non-uniformity and LSI
- Topic mingling: if the topic mingling is high, i.e., each document has high similarity with several different topics, then the topics will be fairly difficult to distinguish.
13. Non-uniformity and LSI
- Let σ_i be the i-th largest singular value of D; the analysis bounds each σ_i in terms of the topic dominances and the topic mingling.
14. Non-uniformity and LSI
- Define the non-uniformity of the topic-document distribution as a ratio of topic dominances: the more the largest topic dominates the collection, the higher this ratio will tend to be.
15. Non-uniformity and LSI
- Original error: the error measured in the full VSM space, with no projection applied at all.
- Its square root is the root original error (the input error).
16. Non-uniformity and LSI
- Let X_h be the h-dimensional LSI subspace spanned by the first h left singular vectors of D.
- When the topic-document distribution is relatively uniform, the error of the LSI subspace must be close to the optimum error.
17. Notation for related values
- One symbol denotes the topic mingling.
- An approximation relation is written between related quantities; the approximation becomes closer as the original error (or the optimum error) becomes smaller.
18. Ando's IRR algorithm
19. Introduction of IRR
20. Ando's IRR algorithm
- Find the unit vector x that best approximates the residual matrix R, i.e., the x maximizing the norm of x^T R.
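A rough Python sketch of this iteration, under the assumption that each step rescales every residual column by its length to the power q before extracting the next basis vector and then deflates the unscaled residuals (the function name `irr` and the toy data are mine, not the paper's):

```python
import numpy as np

def irr(D, h, q=2.0):
    """Sketch of Iterative Residual Rescaling.

    At each step, every residual column is rescaled by its length to the
    power q (amplifying documents the basis so far represents poorly),
    the direction best fitting the rescaled residuals becomes the next
    basis vector, and the unscaled residuals are deflated by it.
    """
    R = D.astype(float).copy()           # residual matrix, starts as D
    basis = []
    for _ in range(h):
        norms = np.linalg.norm(R, axis=0)
        scaled = R * norms ** q          # rescale column j by ||r_j||^q
        # First left singular vector of the rescaled residual matrix:
        # the unit x maximizing ||x^T scaled||.
        U, _, _ = np.linalg.svd(scaled, full_matrices=False)
        b = U[:, 0]
        basis.append(b)
        R = R - np.outer(b, b @ R)       # remove the new direction
    return np.stack(basis, axis=1)       # m-by-h orthonormal basis

# With q = 0 the rescaling is a no-op, so IRR reduces to plain LSI:
rng = np.random.default_rng(2)
D = rng.standard_normal((6, 5))
B = irr(D, h=2, q=0.0)
U, _, _ = np.linalg.svd(D, full_matrices=False)
print(np.allclose(np.abs(B.T @ U[:, :2]), np.eye(2), atol=1e-6))  # True
```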
21. Ando's IRR algorithm
22. Auto-scale method
- Automatic scaling factor determination.
[Figure: topic-by-document relevance pattern in the case where each document is approximately single-topic.]
23. Auto-scale method
- To implement auto-scale, we set the scaling factor q to a linear function of f(D), a quantity computed from the collection.
24. Dimension selection
- Stopping criterion: the residual ratio (effective for both LSI and IRR).
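One way to read this criterion in code, under the assumption that the residual ratio is the fraction of the collection's squared Frobenius norm not yet captured by the subspace (the exact definition and the 0.05 threshold below are illustrative, not from the paper):

```python
import numpy as np

def residual_ratio(D, B):
    """Fraction of the collection's squared Frobenius norm remaining in
    the residual after projecting out the basis vectors in B (m-by-h)."""
    R = D - B @ (B.T @ D)               # residual of D w.r.t. span(B)
    return np.linalg.norm(R) ** 2 / np.linalg.norm(D) ** 2

# Stop enlarging the subspace once the ratio falls below a threshold.
rng = np.random.default_rng(3)
D = rng.standard_normal((6, 5))
U, s, _ = np.linalg.svd(D, full_matrices=False)
for h in range(1, 6):
    r = residual_ratio(D, U[:, :h])
    if r < 0.05:
        break
```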
25. Evaluation metrics
- Kappa average precision.
- Pair-wise average precision: the measured similarity for any two intra-topic documents (sharing at least one topic) should be higher than for any two cross-topic documents, which have no topics in common. Let pair_j denote the document pair with the j-th largest measured cosine.
- Non-intra-topic probability.
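A sketch of pair-wise average precision under this reading: rank all document pairs by measured cosine, then average the precision attained at each intra-topic pair (the helper name and the toy similarity matrix are hypothetical):

```python
import numpy as np
from itertools import combinations

def pairwise_average_precision(sim, doc_topics):
    """Rank all document pairs by measured cosine; a pair is intra-topic
    if the two documents share at least one topic.  Return the average of
    the precision at each intra-topic pair in the ranking."""
    pairs = list(combinations(range(len(doc_topics)), 2))
    pairs.sort(key=lambda p: -sim[p[0], p[1]])   # largest cosine first
    hits, precisions = 0, []
    for j, (a, b) in enumerate(pairs, start=1):
        if doc_topics[a] & doc_topics[b]:        # share at least one topic
            hits += 1
            precisions.append(hits / j)
    return sum(precisions) / len(precisions)

# Toy example: docs 0,1 about topic "A", docs 2,3 about topic "B".
topics = [{"A"}, {"A"}, {"B"}, {"B"}]
sim = np.array([[1.0, 0.9, 0.1, 0.2],
                [0.9, 1.0, 0.3, 0.1],
                [0.1, 0.3, 1.0, 0.8],
                [0.2, 0.1, 0.8, 1.0]])
# Both intra-topic pairs outrank every cross-topic pair here.
print(pairwise_average_precision(sim, topics))  # 1.0
```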
26. Evaluation metrics
- Clustering: let C be a cluster-topic contingency table, where C_ij is the number of documents in cluster i that are relevant to topic j; a quality measure S(C) is defined from this table.
27. Experimental setting
- (1) Chose two TREC topics (more than two can be chosen).
- (2) Specified seven distribution types: (25,25), (30,20), (35,15), (40,10), (43,7), (45,5), (46,4). Each document was relevant to exactly one of the pre-selected topics.
- (3) Extracted single-word stemmed terms using TALENT and removed stop-words.
- (4) Created the term-document matrix and length-normalized the document vectors.
- (5) Implemented AUTO-SCALE, set
28. Controlled-distribution results
- The chosen scaling factor increases on average as the non-uniformity goes up.
29. Controlled-distribution results
[Figure: example clusterings with the lowest and the highest S(C).]
30. Controlled-distribution results
31. Conclusion
- Provided a new theoretical analysis of LSI, showing a precise relationship between LSI's performance and the uniformity of the underlying topic-document distribution.
- Extended Ando's IRR algorithm.
- IRR provides very good performance in comparison to LSI.
32. IRR on summarization
[Figure: each document is split into sentences, turning the term-by-document matrix into a term-by-sentence matrix; IRR is applied in place of the plain SVD U Σ V^T.]
- Use all the documents together as a query to compute the similarity.