Title: CSE490i Advanced Internet Systems
1. 9/7 Agenda
- Project 1 discussion
- Correlation Analysis
- PCA (LSI)
"The first rule of the social fabric, that in times of crisis you protect the vulnerable, was trampled." --David Brooks (NYT 9/4), conservative political commentator
3. Improving Vector Space Ranking
- We will consider two classes of techniques:
  - Correlation analysis, which looks at correlations between keywords (and thus effectively computes a thesaurus based on word occurrence in the documents).
  - Principal Components Analysis (also called Latent Semantic Indexing), which subsumes correlation analysis and does dimensionality reduction.
4. Correlation/Co-occurrence Analysis
- Co-occurrence analysis
  - Terms that are related to terms in the original query may be added to the query.
  - Two terms are related if they have high co-occurrence in documents.
- Let n be the number of documents; let n1 and n2 be the number of documents containing terms t1 and t2 respectively; and let m be the number of documents containing both t1 and t2.
  - If t1 and t2 are independent, we expect m/n to be about (n1/n) x (n2/n).
  - If t1 and t2 are correlated, m/n >> (n1/n) x (n2/n).
  - The ratio (m/n) / ((n1/n) x (n2/n)) measures the degree of correlation (<< 1 if inversely correlated). (See the sketch below.)
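A minimal sketch (ours, not from the slides) of this co-occurrence measure on a toy binary term-document incidence matrix; the variable names are illustrative only:

```python
import numpy as np

# rows = terms, columns = documents; 1 if the term occurs in the document
incidence = np.array([
    [1, 1, 0, 1, 1, 1, 0],   # t1
    [0, 0, 1, 0, 1, 1, 1],   # t2
])
n = incidence.shape[1]                         # number of documents
n1, n2 = incidence.sum(axis=1)                 # docs containing t1, docs containing t2
m = int((incidence[0] & incidence[1]).sum())   # docs containing both t1 and t2

# ~1 if independent, >> 1 if correlated, << 1 if inversely correlated
correlation = (m / n) / ((n1 / n) * (n2 / n))
print(n1, n2, m, correlation)
```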
5. Association Clusters
- Let Mij be the term-document matrix
  - for the full corpus (global), or
  - for the docs in the set of initial results (local).
  - (Also, sometimes stems are used instead of terms.)
- Correlation matrix C = M·M^T (term-doc x doc-term = term-term)
  - Un-normalized association matrix: Suv = Cuv
  - Normalized association matrix: Suv = Cuv / (Cuu + Cvv - Cuv)
- Nth-association cluster for a term tu is the set of terms tv such that Suv are the n largest values among Su1, Su2, ..., Suk.
6. Example
Term-document matrix M (terms K1-K3 x documents d1-d7):

        d1 d2 d3 d4 d5 d6 d7
   K1    2  1  0  2  1  1  0
   K2    0  0  1  0  2  2  5
   K3    1  0  3  0  4  0  0

Correlation matrix C = M·M^T:

   11  4  6
    4 34 11
    6 11 26

Normalized correlation matrix:

   1.0   0.097 0.193
   0.097 1.0   0.224
   0.193 0.224 1.0

1st association cluster for K2 is {K3}. (A numpy sketch reproducing these numbers follows.)
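A minimal numpy sketch (ours) that reproduces the matrices above from the term-document matrix:

```python
import numpy as np

M = np.array([[2, 1, 0, 2, 1, 1, 0],    # K1
              [0, 0, 1, 0, 2, 2, 5],    # K2
              [1, 0, 3, 0, 4, 0, 0]])   # K3

C = M @ M.T                              # term-term correlation matrix
d = np.diag(C)
S = C / (d[:, None] + d[None, :] - C)    # normalized association matrix
print(C)
print(S.round(3))

# 1st association cluster for K2: the term with the largest off-diagonal entry in row K2
row = S[1].copy(); row[1] = -np.inf
print("cluster for K2:", ["K1", "K2", "K3"][int(row.argmax())])
```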
7. Scalar Clusters
- Even if terms u and v have low correlation, they may be transitively correlated (e.g. a term w has high correlation with both u and v).
- Consider the normalized association matrix S. The association vector Au of term u is (Su1, Su2, ..., Suk).
- To measure the neighborhood-induced correlation between terms, take the cosine between the association vectors of terms u and v.
- Nth-scalar cluster for a term tu is the set of terms tv such that Suv (in this new cosine matrix) are the n largest values among Su1, Su2, ..., Suk.
8. Example
Starting from the normalized correlation matrix above, the association vectors are

  AK1 = (1.0, 0.09756097, 0.19354838)
  AK2 = (0.09756097, 1.0, 0.2244898)
  AK3 = (0.19354838, 0.2244898, 1.0)

Trace of the pairwise cosine computation:

  cos(AK1, AK1) = 1.0          cos(AK1, AK2) = 0.22647195   cos(AK1, AK3) = 0.38323623
  cos(AK2, AK1) = 0.22647195   cos(AK2, AK2) = 1.0          cos(AK2, AK3) = 0.43570948
  cos(AK3, AK1) = 0.38323623   cos(AK3, AK2) = 0.43570948   cos(AK3, AK3) = 1.0

Scalar (neighborhood) cluster matrix:

  1.0   0.226 0.383
  0.226 1.0   0.435
  0.383 0.435 1.0

1st scalar cluster for K2 is still {K3}. (A numpy sketch of this step follows.)
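The same computation as a short numpy sketch (ours), continuing from the normalized association matrix S of the previous sketch:

```python
import numpy as np

S = np.array([[1.0,        0.09756097, 0.19354838],
              [0.09756097, 1.0,        0.2244898 ],
              [0.19354838, 0.2244898,  1.0       ]])

# cosine between every pair of association vectors (the rows of S)
norms = np.linalg.norm(S, axis=1)
scalar = (S @ S.T) / np.outer(norms, norms)
print(scalar.round(3))   # matches the scalar (neighborhood) cluster matrix above
```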
9. Metric Clusters
- Let r(ti, tj) be the minimum distance (in terms of number of separating words) between ti and tj in any single document (infinity if they never occur together in a document).
  - (A variant uses the average distance instead of the minimum.)
- Define the cluster matrix by Suv = 1 / r(ti, tj).
- Nth-metric cluster for a term tu is the set of terms tv such that Suv are the n largest values among Su1, Su2, ..., Suk.
- r(ti, tj) is also useful for proximity queries and phrase queries. (A small sketch of computing r follows.)
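A minimal sketch (ours) of computing r(ti, tj) within one tokenized document; here distance is taken as the difference of token positions, a simplifying assumption. Suv would then aggregate 1/r over the documents:

```python
def min_word_distance(doc_tokens, ti, tj):
    """Minimum distance (in token positions) between ti and tj in a single document."""
    pos_i = [p for p, w in enumerate(doc_tokens) if w == ti]
    pos_j = [p for p, w in enumerate(doc_tokens) if w == tj]
    if not pos_i or not pos_j:
        return float("inf")        # the terms never co-occur in this document
    return min(abs(pi - pj) for pi in pos_i for pj in pos_j)

doc = "the database index speeds up every database query".split()
r = min_word_distance(doc, "database", "index")
print(r, 1.0 / r)                  # distance and its contribution 1/r to Suv
```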
10. Similarity Thesaurus
- The similarity thesaurus is based on term-to-term relationships rather than on a matrix of co-occurrence.
- It is obtained by considering the terms as concepts in a concept space.
- Each term is indexed by the documents in which it appears.
- Terms assume the original role of documents, while documents are interpreted as indexing elements.
11. Motivation
(Figure: terms Ki, Kj, Kv, Ka, Kb and the query Q shown as vectors in the concept space.)
12. Similarity Thesaurus
- The relationship between two terms ku and kv is computed as a correlation factor cu,v.
- The global similarity thesaurus is built through the computation of the correlation factor cu,v for each pair of indexing terms ku, kv in the collection.
  - Expensive
  - Possible to do incremental updates
- Similar to the scalar-clusters idea, but with tf/itf weighting defining the term vectors.
13. Similarity Thesaurus
- Terminology:
  - t: number of terms in the collection
  - N: number of documents in the collection
  - fi,j: frequency of occurrence of the term ki in the document dj
  - tj: vocabulary (number of distinct terms) of document dj
  - itfj: inverse term frequency for document dj, itfj = log(t / tj)
- To each term ki is associated a vector of its tf/itf weights over the documents. (A small itf sketch follows.)
- Idea: it is no surprise if the Oxford dictionary mentions the word! A document with a huge vocabulary is a weak indicator for any particular term.
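A minimal sketch (ours) of the itf part of the weighting, under the assumption stated above that itfj = log(t / size of dj's vocabulary); the full term-weight formula from the original slide is not reproduced here:

```python
import math

docs = {
    "d1": ["database", "sql", "index", "sql"],
    "d2": ["regression", "likelihood", "linear", "regression", "database"],
}
t = len({w for words in docs.values() for w in words})    # number of terms in the collection

# inverse term frequency of each document: log(t / vocabulary size of the document)
itf = {dj: math.log(t / len(set(words))) for dj, words in docs.items()}
print(t, itf)
```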
14. Beyond Correlation Analysis: PCA/LSI
- Suppose I start with documents described in terms of just two keywords, u and v, but then add a bunch of new keywords (of the form 2u-3v, 4u-v, etc.) and give the new doc-term matrix to you. Will you be able to tell that the documents are really 2-dimensional (in that there are only two independent keywords)?
- Suppose, in the above, I also add a bit of noise to each of the new terms (i.e. 2u-3v+noise, 4u-v+noise, etc.). Can you now discover that the documents are really 2-D?
- Suppose further that I remove the original keywords, u and v, from the doc-term matrix and give you only the new linearly dependent keywords. Can you now tell that the documents are 2-dimensional?
  - Notice that in this last case, the true dimensions of the data are not even present in the representation! You have to re-discover the true dimensions as linear combinations of the given dimensions.
15. Data Generation Models
- The fact that keywords in the documents are not actually independent, and that they have synonymy and polysemy among them, often manifests itself as if some malicious oracle mixed up the data as above.
- We need dimensionality reduction techniques.
  - If the keyword dependence is only linear (as above), a general polynomial-complexity technique called Principal Components Analysis is able to do this dimensionality reduction.
    - PCA applied to documents is called Latent Semantic Indexing.
  - If the dependence is nonlinear, you need non-linear dimensionality reduction techniques (such as neural networks), which are much costlier.
16. Visual Example
- Classify fish by
  - length
  - height
17. Move Origin
- Move the origin to the centroid of the data.
- But are these the best axes?
18.
- Better if one axis accounts for most of the data variation.
- What should we call the red axis? Size (a factor).
19. Reduce Dimensions
- What if we only consider size?
- We retain 1.75/2.00 x 100 = 87.5% of the original variation. Thus, by discarding the yellow axis we lose only 12.5% of the original information. (A small numeric sketch follows.)
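A minimal PCA sketch (ours) of the fish example with made-up length/height data, showing how the fraction of variance captured by the leading axis plays the role of the 87.5% figure above:

```python
import numpy as np

# made-up (length, height) measurements for a few fish
X = np.array([[10.0, 4.1], [12.0, 4.9], [14.0, 5.8], [16.0, 6.4], [18.0, 7.5]])

Xc = X - X.mean(axis=0)                 # move the origin to the centroid
cov = np.cov(Xc, rowvar=False)          # 2x2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order

size_axis = eigvecs[:, -1]              # principal axis ("size")
retained = eigvals[-1] / eigvals.sum()  # fraction of variation the size axis retains
print(size_axis, round(retained * 100, 1), "%")
```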
20. If you can do it for fish, why not for docs?
- We have documents as vectors in the space of terms.
- We want to
  - transform the axes so that the new axes are
    - orthonormal (independent axes), and
    - can be ordered in terms of the amount of variation in the documents they capture;
  - pick the top k dimensions (axes) in this ordering and use these new k dimensions to do the vector-space similarity ranking.
- Why?
  - Can reduce noise
  - Can eliminate dependent variables
  - Can capture synonymy and polysemy
- How?
  - SVD (Singular Value Decomposition)
21. What happens when you multiply a vector by a matrix?
- In general, when you multiply a vector by a matrix, the vector gets scaled as well as rotated
  - ...except when the vector happens to be in the direction of one of the eigen vectors of the matrix,
  - ...in which case it only gets scaled (stretched).
- A (symmetric square) matrix has all real eigen values, and the values give an indication of the amount of stretching that is done for vectors in that direction.
- The eigen vectors of the matrix define a new ortho-normal space.
  - You can model the multiplication of a general vector by the matrix in terms of
    - first decomposing the general vector into its projections in the eigen vector directions
      - ...which means just taking the dot product of the vector with the (unit) eigen vectors,
    - then multiplying the projections by the corresponding eigen values to get the new vector.
  - This explains why the power method converges to the principal eigen vector...
    - ...since if a vector has a non-zero projection in the principal eigen vector direction, then repeated multiplication will keep stretching the vector in that direction, so that eventually all other directions vanish by comparison. (A small sketch of the power method follows.)
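A minimal power-method sketch (ours) for a small symmetric matrix, illustrating the convergence argument above:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])        # symmetric, so all eigen values are real

v = np.array([1.0, 0.0])          # any vector with a non-zero projection on the principal eigen vector
for _ in range(50):
    v = A @ v                     # repeated multiplication stretches the principal direction most
    v = v / np.linalg.norm(v)     # renormalize so only the direction matters

principal_eigenvalue = v @ A @ v  # Rayleigh quotient
print(v, principal_eigenvalue)    # matches the largest pair from np.linalg.eigh(A)
```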
22. SVD, Rank and Dimensionality
- Suppose we did SVD on a doc-term matrix d-t, took the top-k singular values, and reconstructed the matrix d-tk. We know:
  - d-tk has rank k (since we zeroed out all the other singular values when we reconstructed d-tk).
  - There is no rank-k matrix M such that ||d-t - M|| < ||d-t - d-tk||.
  - In other words, d-tk is the best rank-k (dimension-k) approximation to d-t! This is the guarantee given by SVD. (A small numeric check follows.)
- Rank of a matrix M is defined as the size of the largest square sub-matrix of M which has a non-zero determinant.
  - The rank of a matrix M is also equal to the number of non-zero singular values it has.
  - Rank of M is related to the true dimensionality of M. If you add a bunch of rows to M that are linear combinations of the existing rows of M, the rank of the new matrix will still be the same as the rank of M.
- Distance between two equi-sized matrices M and M', ||M - M'||, is defined as the sum of the squares of the differences between the corresponding entries (sum over u,v of (muv - m'uv)^2).
  - It will be equal to zero when M = M'.
Optional
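A short numpy sketch (ours) checking the best rank-k approximation claim on a random matrix, using the squared-entry distance defined above:

```python
import numpy as np

rng = np.random.default_rng(0)
dt = rng.random((9, 6))                       # a toy doc-term matrix
U, s, Vt = np.linalg.svd(dt, full_matrices=False)

k = 2
dt_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # zero out all but the top-k singular values

dist = np.sum((dt - dt_k) ** 2)               # sum of squared entry differences
print(np.linalg.matrix_rank(dt_k), dist)      # rank is k; no rank-k matrix gets closer
print(np.sum(s[k:] ** 2))                     # equals the sum of the squared discarded singular values
```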
23. Terms and Docs as Vectors in Factor Space
- In addition to doc-doc similarity, we can compute term-term distance.
- (Figure: a document vector and a term vector in factor space.)
- If terms are independent, the T-T similarity matrix would be diagonal.
  - If it is not diagonal, we can use the correlations to add related terms to the query.
  - But we can also ask the question: are there independent dimensions which define the space where terms & docs are vectors?
24. Overview of Latent Semantic Indexing
Singular Value Decomposition: convert the doc-term matrix d-t (d x t) into three matrices,
  - D-F (d x f): doc-factor matrix (eigen vectors of d-t·d-t^T),
  - F-F (f x f): factor-factor matrix, diagonal, holding the +ve square roots of the eigen values of d-t·d-t^T (or of d-t^T·d-t; both are the same),
  - (T-F)^T (f x t): transpose of the term-factor matrix (eigen vectors of d-t^T·d-t),
where D-F·F-F·(T-F)^T gives the original matrix back.

Reduce dimensionality: throw out the low-order rows and columns, keeping only the top k factors:
  d-tk (d x t) = D-Fk (d x k) · F-Fk (k x k) · (T-Fk)^T (k x t)

Recreate matrix: multiply to produce the approximate term-document matrix. d-tk is the rank-k matrix that is closest to d-t. (A small numpy sketch of this decomposition follows.)
25. (Figure: SVD of the term-document matrix for a small collection with terms t1 = database, t2 = SQL, t3 = index, t4 = regression, t5 = likelihood, t6 = linear.)
- D-F: eigen vectors of M·M^T (principal document directions)
- F-F: the 6 singular values (positive square roots of the eigen values of M·M^T or M^T·M)
- T-F: eigen vectors of M^T·M (principal term directions)
26. For the database/regression example (terms t1 = database, t2 = SQL, t3 = index, t4 = regression, t5 = likelihood, t6 = linear): suppose D1 is a new doc containing "database" 50 times and D2 contains "SQL" 50 times.
27. LSI Ranking
- Given a query:
  - Either add the query also as a document in the D-T matrix and do the SVD, OR
  - Convert the query vector (separately) to the LSI space:
    - D-Fq = q·T-F·(F-F)^-1; this is the weighted query document in LSI space (a worked query appears on the Querying slide later).
  - Reduce dimensionality as needed.
  - Do the vector-space similarity in the LSI space.
28. Using LSI
- Can be used on the entire corpus:
  - First compute the SVD of the entire corpus.
  - Store the first k columns of the D-F·F-F matrix, i.e. (D-F·F-F)k.
  - Keep the T-F matrix handy.
  - When a new query q comes, take the k columns of q·T-F.
  - Compute the vector similarity between (q·T-F)k and all rows of (D-F·F-F)k, rank the documents, and return. (A small end-to-end sketch appears below.)
- Can be used as a way of clustering the results returned by normal vector space ranking:
  - Take the top 50 or 100 of the documents returned by some ranking (e.g. vector ranking).
  - Do LSI on these documents.
  - Take the first k columns of the resulting D-F·F-F matrix.
  - Each row in this matrix is the representation of the original documents in the reduced space.
  - Cluster the documents in this reduced space. (We will talk about clustering later.)
  - MANJARA did this.
  - We will need fast SVD computation algorithms for this. The MANJARA folks developed approximate algorithms for SVD.
Added based on class discussion
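A minimal end-to-end sketch (ours) of the corpus-wide use described above: SVD once, keep (D-F·F-F)k, fold a query in with q·(T-F)k, and rank by cosine similarity. The matrices and query are toy values:

```python
import numpy as np

corpus_dt = np.array([[3.0, 0.0, 1.0, 0.0],   # toy doc-term matrix, one row per document
                      [2.0, 1.0, 0.0, 0.0],
                      [0.0, 0.0, 2.0, 3.0],
                      [0.0, 1.0, 1.0, 2.0]])

DF, sigma, TFt = np.linalg.svd(corpus_dt, full_matrices=False)
k = 2
doc_reps = (DF @ np.diag(sigma))[:, :k]       # first k columns of D-F·F-F, one row per doc
TF_k = TFt.T[:, :k]                           # first k columns of T-F

q = np.array([1.0, 0.0, 1.0, 0.0])            # query as a term vector
q_rep = q @ TF_k                              # k columns of q·T-F

cos = (doc_reps @ q_rep) / (np.linalg.norm(doc_reps, axis=1) * np.linalg.norm(q_rep))
print(np.argsort(-cos))                       # document indices ranked by LSI similarity
```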
29. SVD Computation Complexity
- For an m x n matrix, SVD computation has
  - O(k·m^2·n + k'·n^3) complexity,
  - with k = 4 and k' = 22 for the best algorithms.
- Approximate algorithms that exploit the sparsity of M are available (and being developed).
30. Bunch of Facts about SVD
- Relation between SVD and eigen value decomposition:
  - Eigen value decomposition is defined only for square matrices.
    - Only square symmetric matrices have real-valued eigen values.
  - SVD is defined for all matrices.
- Given a matrix M, we consider the eigen decomposition of the correlation matrices M·M^T and M^T·M. The SVD of M is [eigen vectors of M·M^T] x [positive square roots of the eigen values of M·M^T] x [eigen vectors of M^T·M]^T.
  - Both M·M^T and M^T·M are symmetric (they are correlation matrices).
  - They both will have the same eigen values.
  - Unless M is symmetric, M·M^T and M^T·M are different, so in general their eigen vectors will be different (although their eigen values are the same).
- Since SVD is defined in terms of the eigen values and vectors of the correlation matrices of a matrix, the eigen values will always be real valued (even if the matrix M is not symmetric). (A short numeric check follows.)
- In general, the SVD decomposition of a matrix M equals its eigen decomposition only if M is both square and symmetric.
Added based on the discussion in the class
Optional
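A short numpy sketch (ours) checking these facts on a random non-symmetric matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.random((4, 3))                  # not square, not symmetric

U, s, Vt = np.linalg.svd(M, full_matrices=False)
w_left = np.linalg.eigvalsh(M @ M.T)    # eigen values of M·M^T (one extra zero)
w_right = np.linalg.eigvalsh(M.T @ M)   # eigen values of M^T·M

print(np.sort(s ** 2))                  # squared singular values...
print(np.sort(w_right))                 # ...equal the eigen values of M^T·M
print(np.sort(w_left)[-3:])             # ...and the non-zero eigen values of M·M^T
```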
31. Ignore beyond this slide (hidden)
32. Yet Another Example
U (9x7):

   0.3996 -0.1037  0.5606 -0.3717 -0.3919 -0.3482  0.1029
   0.4180 -0.0641  0.4878  0.1566  0.5771  0.1981 -0.1094
   0.3464 -0.4422 -0.3997 -0.5142  0.2787  0.0102 -0.2857
   0.1888  0.4615  0.0049 -0.0279 -0.2087  0.4193 -0.6629
   0.3602  0.3776 -0.0914  0.1596 -0.2045 -0.3701 -0.1023
   0.4075  0.3622 -0.3657 -0.2684 -0.0174  0.2711  0.5676
   0.2750  0.1667 -0.1303  0.4376  0.3844 -0.3066  0.1230
   0.2259 -0.3096 -0.3579  0.3127 -0.2406 -0.3122 -0.2611
   0.2958 -0.4232  0.0277  0.4305 -0.3800  0.5114  0.2010

S (7x7) = diag(3.9901, 2.2813, 1.6705, 1.3522, 1.1818, 0.6623, 0.6487)

V (8x7) (its transpose V^T, 7x8, appears in the product U·S·V^T):

   0.2917 -0.2674  0.3883 -0.5393  0.3926 -0.2112 -0.4505
   0.3399  0.4811  0.0649 -0.3760 -0.6959 -0.0421 -0.1462
   0.1889 -0.0351 -0.4582 -0.5788  0.2211  0.4247  0.4346
  -0.0000 -0.0000 -0.0000 -0.0000  0.0000 -0.0000  0.0000
   0.6838 -0.1913 -0.1609  0.2535  0.0050 -0.5229  0.3636
   0.4134  0.5716 -0.0566  0.3383  0.4493  0.3198 -0.2839
   0.2176 -0.5151 -0.4369  0.1694 -0.2893  0.3161 -0.5330
   0.2791 -0.2591  0.6442  0.1593 -0.1648  0.5455  0.2998

This happens to be a rank-7 matrix, so only 7 dimensions are required.
Singular values = square roots of the eigen values of A·A^T.
33. Formally, this will be the rank-k (k = 2) matrix that is closest to M in the matrix-norm sense.
D-F (9x7), F-F (7x7) and T-F (8x7) are the U, S and V matrices of the previous slide, renamed.

DF2 (9x2), the first two columns of D-F:

   0.3996 -0.1037
   0.4180 -0.0641
   0.3464 -0.4422
   0.1888  0.4615
   0.3602  0.3776
   0.4075  0.3622
   0.2750  0.1667
   0.2259 -0.3096
   0.2958 -0.4232

FF2 (2x2):

   3.9901 0
   0      2.2813

TF2 (8x2), the first two columns of T-F:

   0.2917 -0.2674
   0.3399  0.4811
   0.1889 -0.0351
  -0.0000 -0.0000
   0.6838 -0.1913
   0.4134  0.5716
   0.2176 -0.5151
   0.2791 -0.2591

DF2·FF2·TF2^T will be a 9x8 matrix that approximates the original matrix.
34. What should be the value of k?
- K = 2: df2·ff2·tf2^T (5 components ignored)
- K = 4: df4·ff4·tf4^T (3 components ignored)
- K = 6: df6·ff6·tf6^T (one component ignored)
- K = 7: df7·ff7·tf7^T = df·ff·tf^T (full reconstruction)
(Figure: the reconstructed matrices for each value of k.)
35. Coordinate Transformation Inherent in LSI
Doc rep: T-D = T-F·F-F·(D-F)^T
- Mapping of keywords into LSI space is given by T-F·F-F.
- Mapping of a doc d = (w1, ..., wk) into LSI space is given by d·T-F·(F-F)^-1.
  - The base keywords of the doc are first mapped to LSI keywords and then differentially weighted by (F-F)^-1.
For k = 2, the mapping of the terms is:

  term                LSx          LSy
  controllability     1.5944439   -0.2365708
  observability       1.6678618   -0.14623132
  realization         1.3821706   -1.0087909
  feedback            0.7533309    1.05282
  controller          1.4372339    0.86141896
  observer            1.6259657    0.82628685
  transfer function   1.0972775    0.38029274
  polynomial          0.90136355  -0.7062905
  matrices            1.1802715   -0.96544623

(Figure: the terms and the document ch3 plotted in the (LSIx, LSIy) plane, with "controller" and "controllability" labeled.)
36. Querying
To query for "feedback controller", the query vector would be
  q = (0 0 0 1 1 0 0 0 0)'   (' indicates transpose),
since "feedback" and "controller" are the 4th and 5th terms in the index, and no other terms are selected.

Let q be the query vector. Then the document-space vector corresponding to q is given by
  q'·TF(2)·inv(FF(2)) = Dq,
i.e. the centroid of the terms in the query (with scaling by inv(F-F)). For the "feedback controller" query vector, the result is
  Dq = (0.1376, 0.3678).

To find the best document match, we compare the Dq vector against all the document vectors in the 2-dimensional V2 space. The document vector that is nearest in direction to Dq is the best match. The cosine values for the eight document vectors and the query vector are
  -0.3747  0.9671  0.1735  -0.9413  0.0851  0.9642  -0.7265  -0.3805
(A small sketch of this query-folding step follows.)
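A minimal sketch (ours) of the query folding and ranking step; the TF2/FF2 matrices and document vectors here are random placeholders, not the actual values from the example:

```python
import numpy as np

# placeholder 2-factor matrices (the real ones would come from the corpus SVD)
TF2 = np.random.default_rng(2).random((9, 2))       # term-factor, 9 terms x 2 factors
FF2 = np.diag([3.99, 2.28])                         # factor-factor (diagonal)
doc_vecs = np.random.default_rng(3).random((8, 2))  # 8 documents in the 2-D factor space

q = np.zeros(9)
q[[3, 4]] = 1.0                                 # "feedback" and "controller": 4th and 5th terms

Dq = q @ TF2 @ np.linalg.inv(FF2)               # q'·TF(2)·inv(FF(2))
cos = (doc_vecs @ Dq) / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(Dq))
print(Dq, np.argmax(cos))                       # query in doc space and best-matching doc index
```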
37. F-F is a diagonal matrix, so its inverse is diagonal too. Diagonal matrices are symmetric.
38. Variations in the Examples
- DB-Regression example:
  - Started with the D-T matrix.
  - Used the term axes as T-F and the doc rep as D-F·F-F.
  - Q is converted into q·T-F.
- Chapter/Medline etc. examples:
  - Started with the T-D matrix.
  - Used the term axes as T-F·F-F and the doc rep as D-F.
  - Q is converted into q·T-F·(F-F)^-1.
- We will stick to this convention.
39. Medline data from Berry's paper
40. Within .40 threshold
- K is the number of singular values used.