Title: Yet another Example
Slide 1: Yet another Example
U (9x7):
  0.3996 -0.1037  0.5606 -0.3717 -0.3919 -0.3482  0.1029
  0.4180 -0.0641  0.4878  0.1566  0.5771  0.1981 -0.1094
  0.3464 -0.4422 -0.3997 -0.5142  0.2787  0.0102 -0.2857
  0.1888  0.4615  0.0049 -0.0279 -0.2087  0.4193 -0.6629
  0.3602  0.3776 -0.0914  0.1596 -0.2045 -0.3701 -0.1023
  0.4075  0.3622 -0.3657 -0.2684 -0.0174  0.2711  0.5676
  0.2750  0.1667 -0.1303  0.4376  0.3844 -0.3066  0.1230
  0.2259 -0.3096 -0.3579  0.3127 -0.2406 -0.3122 -0.2611
  0.2958 -0.4232  0.0277  0.4305 -0.3800  0.5114  0.2010

S (7x7) = diag(3.9901, 2.2813, 1.6705, 1.3522, 1.1818, 0.6623, 0.6487)

V (8x7; the slide labels its transpose, Vᵀ, which is 7x8):
  0.2917 -0.2674  0.3883 -0.5393  0.3926 -0.2112 -0.4505
  0.3399  0.4811  0.0649 -0.3760 -0.6959 -0.0421 -0.1462
  0.1889 -0.0351 -0.4582 -0.5788  0.2211  0.4247  0.4346
 -0.0000 -0.0000 -0.0000 -0.0000  0.0000 -0.0000  0.0000
  0.6838 -0.1913 -0.1609  0.2535  0.0050 -0.5229  0.3636
  0.4134  0.5716 -0.0566  0.3383  0.4493  0.3198 -0.2839
  0.2176 -0.5151 -0.4369  0.1694 -0.2893  0.3161 -0.5330
  0.2791 -0.2591  0.6442  0.1593 -0.1648  0.5455  0.2998

M = U·S·Vᵀ. This happens to be a rank-7 matrix, so only 7 dimensions are required.
Singular values = square roots of the eigenvalues of A·Aᵀ.
Slide 2: Formally, this will be the rank-k (k = 2) matrix that is closest to M in the matrix-norm sense.
(U, S, and V as on the previous slide.)
U2 (9x2):
  0.3996 -0.1037
  0.4180 -0.0641
  0.3464 -0.4422
  0.1888  0.4615
  0.3602  0.3776
  0.4075  0.3622
  0.2750  0.1667
  0.2259 -0.3096
  0.2958 -0.4232

S2 (2x2) = diag(3.9901, 2.2813)

V2 (8x2):
  0.2917 -0.2674
  0.3399  0.4811
  0.1889 -0.0351
 -0.0000 -0.0000
  0.6838 -0.1913
  0.4134  0.5716
  0.2176 -0.5151
  0.2791 -0.2591

U2·S2·V2ᵀ will be a 9x8 matrix that approximates the original matrix.
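The rank-k truncation can be sketched in a few lines of numpy. This is a minimal illustration, not the slides' data: the 9x8 matrix here is random, standing in for the term-document matrix M. It also checks the Eckart-Young fact that the Frobenius-norm error of the rank-k truncation equals the root-sum-square of the discarded singular values.

```python
import numpy as np

# A random 9x8 matrix standing in for the slides' term-document matrix M.
rng = np.random.default_rng(0)
M = rng.random((9, 8))

U, s, Vt = np.linalg.svd(M, full_matrices=False)

k = 2
M_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # U2 S2 V2^T, still 9x8

# Eckart-Young: the rank-k truncation error (Frobenius norm) equals
# sqrt(sum of the squared discarded singular values).
err = np.linalg.norm(M - M_k, "fro")
print(err, np.sqrt(np.sum(s[k:] ** 2)))
```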
Slide 3: What should be the value of k?
- U2·S2·V2ᵀ (k = 2): 5 components ignored
- U4·S4·V4ᵀ (k = 4): 3 components ignored
- U6·S6·V6ᵀ (k = 6): one component ignored
- U7·S7·V7ᵀ = U·S·Vᵀ (full rank)
Slide 4: Coordinate transformation inherent in LSI
T-D = T-F · F-F · (D-F)ᵀ, where D-F is the doc rep.
The mapping of keywords into LSI space is given by T-F · F-F.
The mapping of a doc d = (w1 ... wk) into LSI space is given by dᵀ · T-F · (F-F)⁻¹.
The base keywords of the doc are first mapped to LSI keywords and then differentially weighted by S⁻¹.
For k = 2, the mapping is:

  term               LSx          LSy
  controllability    1.5944439   -0.2365708
  observability      1.6678618   -0.14623132
  realization        1.3821706   -1.0087909
  feedback           0.7533309    1.05282
  controller         1.4372339    0.86141896
  observer           1.6259657    0.82628685
  transfer function  1.0972775    0.38029274
  polynomial         0.90136355  -0.7062905
  matrices           1.1802715   -0.96544623

(Plot: the terms and the document ch3 in the 2-D LSI space, axes LSIx and LSIy.)
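The (LSx, LSy) table is just T-F scaled by F-F, i.e. U2·S2 from the rank-2 SVD earlier in the deck. A minimal numpy check, using the U2 and S2 values as given on the slides:

```python
import numpy as np

# U2 and S2 copied from the rank-2 SVD on the earlier slides.
U2 = np.array([
    [0.3996, -0.1037], [0.4180, -0.0641], [0.3464, -0.4422],
    [0.1888,  0.4615], [0.3602,  0.3776], [0.4075,  0.3622],
    [0.2750,  0.1667], [0.2259, -0.3096], [0.2958, -0.4232],
])
S2 = np.diag([3.9901, 2.2813])

coords = U2 @ S2          # one (LSx, LSy) row per term
print(coords[0])          # controllability: ~ [1.5944, -0.2366]
```

Row 0 reproduces the table's entry for controllability, and row 3 the entry for feedback, to four decimal places.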
Slide 5: F-F is a diagonal matrix, so its inverse is diagonal too. Diagonal matrices are symmetric.
Slide 6: Querying
To query for "feedback controller", the query vector would be q = (0 0 0 1 1 0 0 0 0)' (' indicates transpose), since feedback and controller are the 4th and 5th terms in the index, and no other terms are selected. Let q be the query vector. Then the document-space vector corresponding to q is given by
  Dq = q' · T-F(2) · inv(F-F(2))
For the "feedback controller" query vector, the result is Dq = (0.1376, 0.3678). To find the best document match, we compare the Dq vector against all the document vectors in the 2-dimensional V2 space. The document vector that is nearest in direction to Dq is the best match. The cosine values for the eight document vectors and the query vector are:
  -0.3747  0.9671  0.1735  -0.9413  0.0851  0.9642  -0.7265  -0.3805
(Dq is the centroid of the terms in the query, with scaling.)
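The query fold-in can be reproduced from the rank-2 factors given on the slides. One caveat: doc 4's V2 row rounds to (0, 0) at four decimal places, so its cosine (-0.9413 on the slide) cannot be recovered from the rounded values and is left at 0 here.

```python
import numpy as np

# U2, S2, V2 copied from the rank-2 SVD on the earlier slides.
U2 = np.array([
    [0.3996, -0.1037], [0.4180, -0.0641], [0.3464, -0.4422],
    [0.1888,  0.4615], [0.3602,  0.3776], [0.4075,  0.3622],
    [0.2750,  0.1667], [0.2259, -0.3096], [0.2958, -0.4232],
])
S2 = np.diag([3.9901, 2.2813])
V2 = np.array([
    [0.2917, -0.2674], [0.3399,  0.4811], [0.1889, -0.0351],
    [0.0000,  0.0000], [0.6838, -0.1913], [0.4134,  0.5716],
    [0.2176, -0.5151], [0.2791, -0.2591],
])

q = np.zeros(9)
q[[3, 4]] = 1.0                      # feedback, controller (4th and 5th terms)
Dq = q @ U2 @ np.linalg.inv(S2)      # ~ (0.1376, 0.3678)

# Cosine of Dq against each document row of V2 (doc 4 rounds to the zero
# vector, so its cosine is left at 0 rather than dividing by zero).
norms = np.linalg.norm(V2, axis=1)
cos = np.zeros(len(V2))
nz = norms > 0
cos[nz] = (V2[nz] @ Dq) / (norms[nz] * np.linalg.norm(Dq))
print(np.round(cos, 4))              # doc 2 (0.9671) is the best match
```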
Slide 7: Variations in the examples?
- DB-Regression example:
  - Started with the D-T matrix
  - Used the term axes as T-F and the doc rep as D-F · F-F
  - Q is converted into q · T-F
- Chapter/Medline etc. examples:
  - Started with the T-D matrix
  - Used the term axes as T-F · F-F and the doc rep as D-F
  - Q is converted to q · T-F · (F-F)⁻¹
We will stick to this convention.
Slide 8: Medline data from Berry's paper
Slide 9: Within a .40 threshold (K is the number of singular values used)
Slide 10: Query Expansion
Add terms that are closely related to the query terms, to improve precision and recall. Two variants:
- Local: only analyze the closeness among the set of documents that are returned
- Global: consider all the documents in the corpus a priori
How to decide closely related terms? THESAURI!
- Hand-coded thesauri (Roget and his brothers)
- Automatically generated thesauri
  - Correlation based (association, nearness)
  - Similarity based (terms as vectors in doc space)
Slide 11: Correlation/Co-occurrence analysis
- Co-occurrence analysis: terms that are related to terms in the original query may be added to the query.
- Two terms are related if they have high co-occurrence in documents.
- Let n be the number of documents; n1 and n2 be the number of documents containing terms t1 and t2, respectively; and m be the number of documents having both t1 and t2.
- If t1 and t2 are independent, m/n ~ (n1/n) · (n2/n).
- If t1 and t2 are correlated, m/n >> (n1/n) · (n2/n); if inversely correlated, m/n << (n1/n) · (n2/n).
- The amount by which m/n deviates from (n1/n) · (n2/n) measures the degree of correlation.
Slide 12: Association Clusters
- Let M = (Mij) be the term-document matrix
  - for the full corpus (global), or
  - for the docs in the set of initial results (local)
  - (also, sometimes stems are used instead of terms)
- Correlation matrix C = M · Mᵀ (term-doc × doc-term = term-term)
- C itself is the un-normalized association matrix; the normalized association matrix has entries Suv = Cuv / (Cuu + Cvv - Cuv)
- The nth association cluster for a term tu is the set of terms tv such that the Suv are the n largest values among Su1, Su2, ..., Suk
Slide 13: Example

Term-document matrix:
      d1 d2 d3 d4 d5 d6 d7
  K1   2  1  0  2  1  1  0
  K2   0  0  1  0  2  2  5
  K3   1  0  3  0  4  0  0

Correlation matrix C = M · Mᵀ:
  11  4  6
   4 34 11
   6 11 26

Normalized correlation matrix:
  1.0   0.097 0.193
  0.097 1.0   0.224
  0.193 0.224 1.0

The 1st association cluster for K2 is {K3}.
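The numbers on this slide can be checked directly. A minimal numpy sketch of the association-cluster computation, using the slide's term-document matrix:

```python
import numpy as np

# The slide's term-document matrix (rows K1..K3, columns d1..d7).
M = np.array([
    [2, 1, 0, 2, 1, 1, 0],
    [0, 0, 1, 0, 2, 2, 5],
    [1, 0, 3, 0, 4, 0, 0],
])

C = M @ M.T                               # correlation matrix (term-term)
d = np.diag(C)
S = C / (d[:, None] + d[None, :] - C)     # normalized association matrix

print(C)
print(np.round(S, 3))
```

Row K2's largest off-diagonal entry is S23 = 11/49 ~ 0.224, so the 1st association cluster for K2 is K3, as the slide states.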
Slide 14: Scalar clusters
Even if terms u and v have low correlation, they may be transitively correlated (e.g., a term w has high correlation with both u and v). Consider the normalized association matrix S. The association vector Au of term u is (Su1, Su2, ..., Suk). To measure the neighborhood-induced correlation between terms u and v, take the cosine between their association vectors.
The nth scalar cluster for a term tu is the set of terms tv such that the Suv are the n largest values among Su1, Su2, ..., Suk.
Slide 15: Example

Starting from the normalized correlation matrix above, the pairwise cosines between the association vectors (from a Lisp REPL trace; the measure is symmetric) are:
  cos(A1, A1) = 1.0   cos(A1, A2) = 0.22647195   cos(A1, A3) = 0.38323623
  cos(A2, A2) = 1.0   cos(A2, A3) = 0.43570948
  cos(A3, A3) = 1.0

Scalar (neighborhood) cluster matrix:
  1.0   0.226 0.383
  0.226 1.0   0.435
  0.383 0.435 1.0

The 1st scalar cluster for K2 is still {K3}.
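The scalar-cluster matrix is just the cosine between rows of the normalized association matrix S from the previous example; a numpy sketch:

```python
import numpy as np

# Normalized association matrix from the previous slide's example.
S = np.array([
    [1.0,     0.09756, 0.19355],
    [0.09756, 1.0,     0.22449],
    [0.19355, 0.22449, 1.0],
])

# Cosine between association vectors (the rows of S).
norms = np.linalg.norm(S, axis=1)
scalar = (S @ S.T) / np.outer(norms, norms)
print(np.round(scalar, 3))
```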
Slide 16: Metric Clusters
- Let r(ti, tj) be the minimum distance (in terms of the number of separating words) between ti and tj in any single document (infinity if they never occur together in a document); a variant averages the distance over documents instead of taking the minimum.
- Define the cluster matrix by Suv = 1 / r(ti, tj).
The nth metric cluster for a term tu is the set of terms tv such that the Suv are the n largest values among Su1, Su2, ..., Suk.
r(ti, tj) is also useful for proximity queries and phrase queries.
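A minimal sketch of r(ti, tj) within one document. The document and terms here are made up for illustration, and distance is taken as the difference of token positions (so adjacent words are at distance 1):

```python
def min_distance(doc_tokens, t1, t2):
    """Smallest token-position distance between t1 and t2 in one document."""
    p1 = [i for i, w in enumerate(doc_tokens) if w == t1]
    p2 = [i for i, w in enumerate(doc_tokens) if w == t2]
    if not p1 or not p2:
        return float("inf")          # terms never co-occur in this document
    return min(abs(i - j) for i in p1 for j in p2)

# Made-up document; "feedback" occurs twice, the closer occurrence wins.
doc = "the feedback controller stabilizes the observer feedback loop".split()
r = min_distance(doc, "feedback", "observer")
s = 1.0 / r                          # metric-cluster entry S_uv = 1 / r(ti, tj)
print(r, s)
```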
Slide 17: Similarity Thesaurus
- The similarity thesaurus is based on term-to-term relationships rather than on a matrix of co-occurrence.
- It is obtained by considering that the terms are concepts in a concept space.
- Each term is indexed by the documents in which it appears.
- Terms assume the original role of documents, while documents are interpreted as indexing elements.
Slide 18: Motivation
(Diagram: a query Q and terms Ki, Kv, Kj, Ka, Kb plotted in the concept space.)
Slide 19: Similarity Thesaurus
- Terminology:
  - t: number of terms in the collection
  - N: number of documents in the collection
  - fi,j: frequency of occurrence of the term ki in the document dj
  - tj: vocabulary (number of distinct terms) of document dj
  - itfj: inverse term frequency for document dj, itfj = log(t / tj)
- To ki is associated a vector ki = (wi,1, wi,2, ..., wi,N), where the weights wi,j combine the term frequency fi,j with itfj (the weighting formula appeared only as an image on the slide).
Idea: it is no surprise if the Oxford dictionary mentions the word!
Slide 20: Similarity Thesaurus
- The relationship between two terms ku and kv is computed as a correlation factor cu,v, given by the dot product of their term vectors.
- The global similarity thesaurus is built through the computation of the correlation factor cu,v for each pair of indexing terms ku, kv in the collection.
  - Expensive, but possible to do incremental updates.
This is similar to the scalar-clusters idea, but with the tf/itf weighting defining the term vector.
Slide 21: Frontier
Slide 22: Computing an Example
- Let (Mij) be given by the matrix (shown only as an image on the slide).
- Compute the matrices (K), (S), and (D)ᵀ.
Slide 23: (No transcript)
Slide 24: If we retain only the 'size' variable, we would retain 1.75/2.00 × 100 = 87.5% of the original variation. Thus, if we discard the second axis, we would lose 12.5% of the original information.
Slide 25: (No transcript)
Slide 26: Insight through Principal Components Analysis, the KL Transform, Neural Networks, and Dimensionality Reduction
Slide 27: Indexing and Retrieval Issues
Slide 28: Efficient Retrieval (1)
- Document-term matrix:

         t1   t2   ...  tj   ...  tm    nf
    d1   w11  w12  ...  w1j  ...  w1m   1/|d1|
    d2   w21  w22  ...  w2j  ...  w2m   1/|d2|
    ...
    di   wi1  wi2  ...  wij  ...  wim   1/|di|
    ...
    dn   wn1  wn2  ...  wnj  ...  wnm   1/|dn|

- wij is the weight of term tj in document di.
- Most wij's will be zero.
Slide 29: Naïve retrieval
- Consider query q = (q1, q2, ..., qj, ..., qm), with nf = 1/|q|.
- How to evaluate q (i.e., compute the similarity between q and every document)?
- Method 1: Compare q with every document directly.
  - Document data structure:
    di: ((t1, wi1), (t2, wi2), ..., (tj, wij), ..., (tm, wim), 1/|di|)
    - Only terms with positive weights are kept.
    - Terms are in alphabetic order.
  - Query data structure:
    q: ((t1, q1), (t2, q2), ..., (tj, qj), ..., (tm, qm), 1/|q|)
Slide 30: Naïve retrieval
- Method 1: Compare q with documents directly (cont.)
- Algorithm:
  - initialize all sim(q, di) = 0
  - for each document di (i = 1, ..., n)
    - for each term tj (j = 1, ..., m)
      - if tj appears in both q and di
        - sim(q, di) += qj × wij
    - sim(q, di) = sim(q, di) × (1/|q|) × (1/|di|)
  - sort documents in descending order of similarity and display the top k to the user
Slide 31: Inverted Files
- Observation: Method 1 is not efficient, as most non-zero entries in the document-term matrix need to be accessed.
- Method 2: Use an Inverted File Index.
- Several data structures:
  - For each term tj, create a list (inverted file list) that contains all document ids that have tj:
    I(tj) = {(d1, w1j), (d2, w2j), ..., (di, wij), ..., (dn, wnj)}
    - di is the document id number of the ith document.
    - Only entries with non-zero weights should be kept.
Slide 32: Inverted files
- Method 2: Use an Inverted File Index (continued)
- Several data structures:
  - Normalization factors of documents are pre-computed and stored in an array: nf[i] stores 1/|di|.
  - Create a hash table for all terms in the collection:
    ...  tj -> pointer to I(tj)  ...
- Inverted file lists are typically stored on disk.
- The number of distinct terms is usually very large.
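Method 2 can be sketched as follows; the toy collection is made up for illustration. The inverted file is built once, and at query time only the posting lists of the query's terms are walked:

```python
import math
from collections import defaultdict

# Made-up toy collection: each document maps term -> positive weight.
docs = {
    "d1": {"t1": 2.0, "t3": 1.0},
    "d2": {"t2": 1.0, "t3": 3.0},
    "d3": {"t1": 1.0, "t2": 1.0},
}

# Build the inverted file I(tj) = [(di, wij), ...] and nf[di] = 1/|di|.
inv = defaultdict(list)
nf = {}
for d, terms in docs.items():
    for t, w in terms.items():
        inv[t].append((d, w))
    nf[d] = 1.0 / math.sqrt(sum(w * w for w in terms.values()))

def retrieve(q):
    scores = defaultdict(float)
    for t, qw in q.items():
        for d, w in inv.get(t, []):      # walk only this term's posting list
            scores[d] += qw * w
    nq = 1.0 / math.sqrt(sum(w * w for w in q.values()))
    return {d: s * nq * nf[d] for d, s in scores.items()}

res = retrieve({"t1": 1.0, "t3": 2.0})
print(res)
```

This produces the same cosine scores as the direct comparison of Method 1, but documents sharing no term with the query are never touched.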
Slide 33: Querying
To query for "database index", the query vector would be q = (1 0 1 0 0 0)', since database and index are the 1st and 3rd terms in the index and no other terms are selected. Let q be the query vector. Then the document-space vector corresponding to q is given by
  Dq = q' · U2 · inv(S2)
To find the best document match, we compare the Dq vector against all the document vectors in the 2-dimensional doc space. The document vector that is nearest in direction to Dq is the best match. The cosine values for the eight document vectors and the query vector are:
  -0.3747  0.9671  0.1735  -0.9413  0.0851  0.9642  -0.7265  -0.3805
(Dq is the centroid of the terms in the query, with scaling.)