Title: Spotting Topics With Singular Value Decomposition
1 Spotting Topics With Singular Value Decomposition (Charles Nicholas and Randall Dahlberg)
Presented by Ashish P. Ram and Khaled Abbassi
2 CONTENTS
- Introduction
- SVD/Related Math/Related Work
- Applying SVD to term-document matrix
- Interpreting Singular Vectors/Spotting Topics
- Conclusion
3 Introduction
- Goal: analyze a large collection of documents to see what topics are discussed.
- Characteristics of the collection
  - monolingual or multilingual
  - very large in size (on the order of gigabytes)
  - documents cover different topics, by different authors
- Assumptions
  - there exists an underlying relation amongst certain words
  - this relation can help us give semantics to queries and to the content of documents
4 Collection Used
Documents within the test collection: two days of Associated Press newswire traffic, consisting of 225 news stories (in English) from January 1 and January 2 of 1989. The documents were stored in SGML format.
Terms within the documents: terms are represented as n-grams of length 5. 96,841 n-grams occur in the test collection of documents (including stop words); n-grams that do not occur in the English language were ignored.
5 Sparse Matrix Representation
A term-document matrix is formed from the terms and documents. The terms form the rows and the documents form the columns; a cell records the occurrence of a particular term in a document. The matrix is sparse, as not all terms occur in all documents. Algorithms are fine-tuned to deal with large sparse matrices.
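As a sketch of how such a matrix can be built, the following Python snippet uses a tiny toy collection (the document strings and counts are illustrative stand-ins, not data from the paper) with character 5-grams as terms:

```python
from collections import Counter

def five_grams(text):
    """Character n-grams of length 5, the term unit used in the paper."""
    s = text.lower()
    return [s[i:i+5] for i in range(len(s) - 4)]

# Toy document collection (hypothetical stand-in for the AP newswire stories).
docs = ["oil prices rose sharply", "oil exports fell", "stock prices fell"]

# Vocabulary: every 5-gram that occurs anywhere in the collection.
vocab = sorted({g for d in docs for g in five_grams(d)})
row = {g: i for i, g in enumerate(vocab)}

# Sparse representation: for each document, store only its nonzero term counts.
term_doc = [Counter(five_grams(d)) for d in docs]

# Cell (t, d) of the dense matrix = occurrences of term t in document d.
dense = [[term_doc[j].get(g, 0) for j in range(len(docs))] for g in vocab]
```

Most cells are zero, since most 5-grams occur in only one or two documents; this is the sparsity the slide refers to.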
6 SVD/Related Math/Related Work
Singular Value Decomposition is a computation-intensive matrix-analysis technique that has a wide range of applications. It is used in the solution of unconstrained linear least-squares problems, matrix rank estimation, canonical correlation analysis, seismic reflection tomography, information retrieval, real-time signal processing, etc. All these applications require solutions in the shortest time possible.
7 Related Math
A singular matrix is one whose determinant evaluates to zero. E.g.:

3x + 4y + 2z = 9
x + 9y + 4z = 22
4x + 13y + 6z = 31

Here, the third equation is the sum of the first two. Hence, there are really only two independent equations for three unknowns, and the coefficient matrix is called singular. Is there a solution? Yes: the third unknown can be expressed in terms of the other two, giving many possible values for the third unknown. E.g., if z = 2x + 4y (suppose), a different value of z results for each combination of x and y.
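The singularity of that system's coefficient matrix can be checked numerically; a minimal sketch in Python with NumPy:

```python
import numpy as np

# Coefficient matrix of the system 3x+4y+2z=9, x+9y+4z=22, 4x+13y+6z=31.
# The third row is the sum of the first two, so the matrix is singular.
A = np.array([[3.0,  4.0, 2.0],
              [1.0,  9.0, 4.0],
              [4.0, 13.0, 6.0]])

print(np.linalg.det(A))          # ~0: the determinant vanishes
print(np.linalg.matrix_rank(A))  # 2: only two independent equations
```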
8 Back to the problem at hand: we are not aware of the underlying structure and relations amongst the terms. Hence, a wide range of relations and structures are possible amongst them. The term-document matrix represents many of these structures. It is up to us to discover the relevant relations and place the similar ones closer to one another. This is a typical problem involving singular matrices.
9 Where Does SVD Fit In?
- SVD provides a solution to this, and in doing so:
  - it captures all the information in the original array, without loss
  - it reduces the size of the matrix to operate on (it deals with the non-sparse parts)
  - it places similar elements closer to each other
  - it allows the reconstruction of the original matrix, with some loss of precision
10 SVD Technical Details
An example has been worked out in MATLAB. The columns represent documents and the rows represent terms. We take 5 documents and 4 terms:

example =
     3     9    11     2     5
     5     3     4     3     1
     2     7     5     5    11
    17    42    41    22    44

Performing an SVD on example:
[U,S,V] = svd(example)
11
U =
    0.1796   -0.6743   -0.4964    0.5164
    0.0762   -0.4200    0.8667    0.2582
    0.1765    0.6055    0.0472    0.7746
    0.9648    0.0479    0.0154   -0.2582
S =
   81.3105         0         0         0         0
         0    7.0254         0         0         0
         0         0    4.1521         0         0
         0         0         0    0.0000         0
12
V =
    0.2174   -0.2985    0.7707   -0.5126    0.0833
    0.5362   -0.1532   -0.2147   -0.1347   -0.7904
    0.5254   -0.5842   -0.2715    0.2365    0.5031
    0.2791    0.2098    0.5254    0.7659   -0.1245
    0.5579    0.7087   -0.1011   -0.2767    0.3157
VT =
    0.2174    0.5362    0.5254    0.2791    0.5579
   -0.2985   -0.1532   -0.5842    0.2098    0.7087
    0.7707   -0.2147   -0.2715    0.5254   -0.1011
   -0.5126   -0.1347    0.2365    0.7659   -0.2767
    0.0833   -0.7904    0.5031   -0.1245    0.3157
13 SVD decomposes the original matrix into linearly independent components. U and V are orthonormal matrices (orthonormal means A'A = I). S is a diagonal matrix whose diagonal elements are the singular values of the example matrix, arranged in descending order. These singular values are the nonnegative square roots of the eigenvalues of example * example'. SVD allows us to work with only a small part of the original matrix, without loss of the underlying relations. Hence, instead of considering the whole matrix of size 4 x 5, we can consider a smaller part of it.
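These properties can be checked numerically; a sketch in Python with NumPy, using the same 4 x 5 example matrix (NumPy returns V already transposed and the singular values as a vector, and may flip the signs of paired singular-vector columns relative to MATLAB):

```python
import numpy as np

# The 4x5 term-document matrix from the MATLAB example (rows = terms, cols = docs).
example = np.array([[ 3,  9, 11,  2,  5],
                    [ 5,  3,  4,  3,  1],
                    [ 2,  7,  5,  5, 11],
                    [17, 42, 41, 22, 44]], dtype=float)

U, s, Vt = np.linalg.svd(example)

# U and V are orthonormal: U'U = I and V'V = I.
assert np.allclose(U.T @ U, np.eye(4))
assert np.allclose(Vt @ Vt.T, np.eye(5))

# Singular values are nonnegative, in descending order, and equal the square
# roots of the eigenvalues of example @ example'.
eigvals = np.sort(np.linalg.eigvalsh(example @ example.T))[::-1]
assert np.allclose(s, np.sqrt(np.clip(eigvals, 0, None)))

print(np.round(s, 4))  # ~ [81.3105  7.0254  4.1521  0.]
```

The last singular value is (numerically) zero because the fourth row of example is a linear combination of the first three, so the matrix has rank 3.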
14 This is very helpful in the case of large sparse matrices, as it makes computation easier and faster to deal with smaller matrices. The reduction factor k can be described as 1 < k < m, where m is the rank of the original matrix, in this case the example matrix. A point to be noted is that the rank of the decomposed matrices remains unchanged, i.e. it is still m. Thus, re-computing the SVD with k = 2 gives us:
[U,S,V] = svds(example,2) (as in MATLAB)
15
U =
    0.1796   -0.6743
    0.0762   -0.4200
    0.1765    0.6055
    0.9648    0.0479
S =
   81.3105         0
         0    7.0254
V =
    0.2174   -0.2985
    0.5362   -0.1532
    0.5254   -0.5842
    0.2791    0.2098
    0.5579    0.7087
16 As mentioned earlier, there is little loss of the information contained, as the original matrix, example, can be regenerated to a near approximation by multiplying U, S, and VT:

U * S * V' (in MATLAB) gives

    4.5884    8.5575   10.4405    3.0828    4.7916
    2.2266    3.7726    4.9769    1.1094    1.3639
    1.8491    7.0420    5.0531    4.8971   11.0198
   16.9508   42.0137   41.0173   21.9665   44.0065

The view is that this new matrix is the correct term-document matrix, as it takes the underlying relations into account. The previous matrix, example, is considered imperfect when compared to the new one.
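A sketch of the same truncation and reconstruction in Python with NumPy (a full SVD followed by keeping the k largest triplets plays the role of MATLAB's svds(example,2); column signs may differ from the slides, but the product does not):

```python
import numpy as np

example = np.array([[ 3,  9, 11,  2,  5],
                    [ 5,  3,  4,  3,  1],
                    [ 2,  7,  5,  5, 11],
                    [17, 42, 41, 22, 44]], dtype=float)

U, s, Vt = np.linalg.svd(example)

# Keep only the k = 2 largest singular triplets, like svds(example, 2).
k = 2
approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print(np.round(approx, 4))  # close to example, e.g. entry (0,0) ~ 4.5884
```

The Frobenius norm of the error example - approx equals the largest discarded singular value (~4.15 here), which quantifies the "loss of precision" the earlier slide mentions.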
17 Some Characteristics of U, V
1. The matrix U is left singular, i.e. it is constructed such that the most important vectors are to the left of the less important vectors.
2. It can be seen that the entries in the first column of U are nonnegative.
3. Similarly, the matrix V is the right singular vector matrix.
The elements of both matrices can be used to spot topics, when used in certain operations (discussed later).
18 Applying SVD to the Term-Document Matrix
We have a term-document matrix A of size 96841 x 225. Tests were carried out with k values of 5, 10, and 15. MATLAB and SVDPACK software were used.
19 Interpreting the Singular Vectors/Spotting Topics
The columns of U are called the term vectors. The rows of V' (V transpose) are called the document vectors. The elements of the matrices are mapped to a new geometric space. Columns in U are arranged so that the most important vectors are to the left of the less important ones. Each column in U is basically a linear combination of terms that tend to occur together in a consistent fashion throughout the collection, in the sense that when one term occurs, the others also do, in a certain proportion.
20 Each column in U is considered as an axis in the new space. The rows of V (i.e. the columns of V') are right singular and represent documents when projected into the space. The entries in row i of the matrix V x S are document i's coordinates in the new space.
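A sketch of this projection in Python with NumPy, using the 4 x 5 example matrix from earlier (document j's coordinates are read off as column j of S_k * Vt_k, which holds the same numbers as row j of V_k * S_k; the cosine-similarity helper is an illustrative addition, not from the paper):

```python
import numpy as np

example = np.array([[ 3,  9, 11,  2,  5],
                    [ 5,  3,  4,  3,  1],
                    [ 2,  7,  5,  5, 11],
                    [17, 42, 41, 22, 44]], dtype=float)

U, s, Vt = np.linalg.svd(example)
k = 2

# Coordinates of the 5 documents in the k-dimensional space: column j of
# S_k @ Vt_k gives document j's position along the k topic axes.
coords = np.diag(s[:k]) @ Vt[:k, :]

# Documents whose coordinate vectors point in similar directions discuss
# similar content; cosine similarity in the reduced space measures this.
def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(np.round(coords.T, 4))  # one row of coordinates per document
print(round(cos(coords[:, 0], coords[:, 1]), 3))
```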
21 Conclusion
A very good interpretation of the SVD results is provided. Judging from the experiments' success, the authors have been able to apply SVD to spot topics. No comparison with other systems is provided, as there are none to compare to. No test results with standard collections (e.g. TREC) are provided.