1
9/9
  • Today (Thursday): correlation analysis; LSI
  • (Wed → Friday, 10:30–11:45, BY 210)
  • Dictionaries and indexing
  • Homework 1 is due 9/16 (but will be collected
    9/23)
  • Homework 2 socket will open this week
  • Project part A will be released
  • Rao will not be around next week; however, we
    can use the class time for reviews and such if
    needed.

Remember: Don't print hidden slides!!
2
So many ways things can go wrong
  • Reasons that ideal effectiveness hard to achieve
  • Document representation loses information.
  • Users' inability to describe queries precisely.
  • Similarity function used may not be good enough.
  • Importance/weight of a term in representing a
    document and query may be inaccurate.
  • Same term may have multiple meanings and
    different terms may have similar meanings.

Remedies: query expansion, relevance feedback,
LSI, co-occurrence analysis
3
Improving Vector Space Ranking
  • We will consider three techniques
  • Relevance feedback, which tries to improve the
    query quality (already done)
  • Correlation analysis, which looks at correlations
    between keywords (and thus effectively computes a
    thesaurus based on the word occurrence in the
    documents) to do query elaboration
  • Principal Components Analysis (also called Latent
    Semantic Indexing) which subsumes correlation
    analysis and does dimensionality reduction.

4
Some improvements
  • Query expansion techniques (for 1)
  • relevance feedback
  • co-occurrence analysis (local and global
    thesauri)
  • Improving the quality of terms (2), (3) and
    (5).
  • Latent Semantic Indexing
  • Phrase-detection

5
Relevance Feedback
  • Main Idea
  • Modify existing query based on relevance
    judgements
  • Extract terms from relevant documents and add
    them to the query
  • and/or re-weight the terms already in the query
  • Two main approaches
  • Automatic (pseudo-relevance feedback)
  • Users select relevant documents
  • Users/system select terms from an
    automatically-generated list

6
Relevance Feedback
  • Usually do both
  • expand query with new terms
  • re-weight terms in query
  • There are many variations
  • usually positive weights for terms from relevant
    docs
  • sometimes negative weights for terms from
    non-relevant docs
  • Remove terms ONLY in non-relevant documents

7
Relevance Feedback for Vector Model
In the ideal case where we know the relevant
documents a priori, the optimal query is

  q_opt = (1/|Cr|) Σ_{dj ∈ Cr} dj - (1/(N - |Cr|)) Σ_{dj ∉ Cr} dj

Cr = set of documents that are truly relevant to Q;
N = total number of documents
8
Rocchio Method
  Q1 = α·Q0 + (β/|Dr|) Σ_{dj ∈ Dr} dj - (γ/|Dn|) Σ_{dj ∈ Dn} dj

Q0 is the initial query; Q1 is the query after one
iteration; Dr is the set of relevant docs; Dn is
the set of irrelevant docs. Typically α = 1,
β = 0.75, γ = 0.25.
Other variations are possible, but performance is
similar.
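As an aside, here is a minimal sketch of the Rocchio update in Python (not the lecture's code; vectors are plain numpy arrays over the vocabulary, and clipping negative weights to zero is an assumed practical choice):

import numpy as np

def rocchio_update(q0, relevant_docs, irrelevant_docs,
                   alpha=1.0, beta=0.75, gamma=0.25):
    """Return Q1 = alpha*Q0 + beta*centroid(Dr) - gamma*centroid(Dn)."""
    q1 = alpha * q0
    if len(relevant_docs) > 0:
        q1 = q1 + beta * np.mean(relevant_docs, axis=0)
    if len(irrelevant_docs) > 0:
        q1 = q1 - gamma * np.mean(irrelevant_docs, axis=0)
    # Optionally clip negative term weights to zero (a common practical choice).
    return np.maximum(q1, 0.0)

# The illustration on the next slide, with alpha = beta = 1/2 and gamma = 0:
q0 = np.array([0.7, 0.3])     # "retrieval of information"
d1 = np.array([0.2, 0.8])     # "information science"
print(rocchio_update(q0, [d1], [], alpha=0.5, beta=0.5, gamma=0.0))  # -> [0.45 0.55]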
9
Rocchio/Vector Illustration
Q0 = "retrieval of information" = (0.7, 0.3)
D1 = "information science"      = (0.2, 0.8)
D2 = "retrieval systems"        = (0.9, 0.1)

Q'  = ½Q0 + ½D1 = (0.45, 0.55)
Q'' = ½Q0 + ½D2 = (0.80, 0.20)
10
Example Rocchio Calculation
[Figure: a worked Rocchio calculation showing the relevant docs, the
non-relevant doc, the original query, the constants, the Rocchio
calculation, and the resulting feedback query.]
11
Rocchio Method
  • Rocchio automatically
  • re-weights terms
  • adds in new terms (from relevant docs)
  • have to be careful when using negative terms
  • Rocchio is not a machine learning algorithm
  • Most methods perform similarly
  • results heavily dependent on test collection
  • Machine learning methods are proving to work
    better than standard IR approaches like Rocchio

12
Rocchio is just one approach for relevance
feedback
  • Relevance feedback, in the most general terms,
    involves learning.
  • Given a set of known relevant and known
    irrelevant documents, learn a relevance metric
    so as to predict whether a new doc is relevant
    or not
  • Essentially a classification problem!
  • Can use any classification learning technique
    (e.g. naïve Bayes, neural nets, support vector
    machines, etc.)
  • Viewed this way, Rocchio is just a simple
    classification method
  • That summarizes positive examples by the positive
    centroid, negative examples by the negative
    centroid, and assumes the most compact
    description of the relevance metric is vector
    difference between the two centroids.

13
Using Relevance Feedback
  • Known to improve results
  • in TREC-like conditions (no user involved)
  • What about with a user in the loop?
  • How might you measure this?
  • Precision/Recall figures for the unseen
    documents need to be computed

14
Correlation/Co-occurrence analysis
  • Co-occurrence analysis
  • Terms that are related to terms in the original
    query may be added to the query.
  • Two terms are related if they have high
    co-occurrence in documents.
  • Let n be the number of documents,
  • n1 and n2 be the number of documents containing
    terms t1 and t2 respectively,
  • m be the number of documents having both
    t1 and t2
  • If t1 and t2 are independent: m/n ≈ (n1/n)(n2/n),
    i.e., m ≈ n1·n2/n
  • If t1 and t2 are correlated: m >> n1·n2/n

Measure the degree of correlation, e.g. by comparing m
with n1·n2/n: the ratio m·n/(n1·n2) is >> 1 if the terms
are correlated and << 1 if they are inversely correlated
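A small illustrative sketch of this measure (the normalization used here, observed co-occurrence over what independence would predict, is an assumption; the slide's exact measure may differ):

def correlation(n, n1, n2, m):
    """n: total docs, n1/n2: docs containing t1/t2, m: docs containing both."""
    expected = n1 * n2 / n      # co-occurrence expected if t1 and t2 were independent
    return m / expected         # >> 1: correlated, << 1: inversely correlated

print(correlation(n=10000, n1=200, n2=300, m=150))   # 150 / 6 = 25.0 -> strongly correlated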
15
Terms and Docs as mutually dependent vectors
In addition to doc-doc similarity, we can
compute term-term distance.
[Figure: the doc-term matrix, with one document (row) vector and one
term (column) vector highlighted.]
If terms are independent, the T-T similarity
matrix would be diagonal. If it is not
diagonal, we can use the correlations to
add related terms to the query. But we can
also ask the question: are there
independent dimensions which define
the space where terms & docs are
vectors?
16
Association Clusters
  • Let Mij be the term-document matrix
  • For the full corpus (Global)
  • For the docs in the set of initial results
    (local)
  • (also sometimes, stems are used instead of terms)
  • Correlation matrix C = M·M^T (term-doc × doc-term
    = term-term)

Un-normalized association matrix: C_uv
Normalized association matrix:
  S_uv = C_uv / (C_uu + C_vv - C_uv)
Nth association cluster for a term tu is the set
of terms tv such that S_uv are the n largest
values among S_u1, S_u2, …, S_uk
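A short numpy sketch of this computation, using the term-document matrix of the example on the next slide (the normalization shown is the one implied by the numbers there):

import numpy as np

M = np.array([[2, 1, 0, 2, 1, 1, 0],    # K1
              [0, 0, 1, 0, 2, 2, 5],    # K2
              [1, 0, 3, 0, 4, 0, 0]])   # K3

C = M @ M.T                              # un-normalized association matrix
d = np.diag(C)
S = C / (d[:, None] + d[None, :] - C)    # S_uv = C_uv / (C_uu + C_vv - C_uv)
print(np.round(S, 3))

# 1st association cluster of K2: the largest off-diagonal entry in row K2
row = np.where(np.eye(3, dtype=bool), -np.inf, S)[1]
print("K%d" % (row.argmax() + 1))        # -> K3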
17
Example
Term-document matrix M:
      d1 d2 d3 d4 d5 d6 d7
  K1   2  1  0  2  1  1  0
  K2   0  0  1  0  2  2  5
  K3   1  0  3  0  4  0  0

Correlation matrix C = M·M^T:
  11  4  6
   4 34 11
   6 11 26

Normalized correlation matrix S:
  1.0   0.097 0.193
  0.097 1.0   0.224
  0.193 0.224 1.0

1st association cluster for K2 is {K3}
18
Scalar clusters
200 docs with "bush" and "iraq"; 200 docs with "iraq"
and "saddam". Is "bush" close to "saddam"?
Even if terms u and v have low correlation,
they may be transitively correlated (e.g. a
term w has high correlation with both u and v).
Consider the normalized association matrix S. The
association vector of term u, Au, is
(S_u1, S_u2, …, S_uk). To measure neighborhood-induced
correlation between terms, take the cosine
between the association vectors of terms u and v.
Nth scalar cluster for a term tu is the set of
terms tv such that the cosine values S'_uv are the
n largest values among S'_u1, S'_u2, …, S'_uk
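A sketch of the scalar-cluster computation, reusing the (rounded) normalized association matrix of the earlier example; because the inputs are rounded, the cosines only approximately match the slide that follows:

import numpy as np

S = np.array([[1.0,   0.097, 0.193],
              [0.097, 1.0,   0.224],
              [0.193, 0.224, 1.0  ]])      # normalized association matrix (rounded)

norms = np.linalg.norm(S, axis=1)
scalar = (S @ S.T) / np.outer(norms, norms)   # scalar[u, v] = cos(A_u, A_v)
print(np.round(scalar, 3))                    # off-diagonals ~ 0.225, 0.382, 0.435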
19
Using Query Logs instead of Document Corpus
  • Correlation analysis can also be done based on
    query logs (i.e., the log of queries posed by the
    users).
  • Instead of doc-term matrix, you will start with
    query-term matrix
  • Query log correlation analysis has close
    parallels to the idea of collaborative filtering
    (the idea by which recommendation systems work)

20
Example
Starting from the normalized correlation matrix of the previous
example, the association vectors are its rows, e.g.
A_K1 = (1.0, 0.098, 0.194).
Pairwise cosines between the association vectors:
  cos(A_K1, A_K2) = 0.2265
  cos(A_K1, A_K3) = 0.3832
  cos(A_K2, A_K3) = 0.4357

Scalar (neighborhood) cluster matrix:
  1.0   0.226 0.383
  0.226 1.0   0.435
  0.383 0.435 1.0

1st scalar cluster for K2 is still {K3}
21
On the database/statistics example
t1 = database, t2 = SQL, t3 = index, t4 = regression,
t5 = likelihood, t6 = linear

Un-normalized association matrix (co-occurrence counts):
  3679 2391 1308  238  302  273
  2391 1807  953    0  123   63
  1308  953  536   32   87   27
   238    0   32 3277 1584 1573
   302  123   87 1584  972  887
   273   63   27 1573  887 1423

Association Clusters (normalized association matrix):
  1.0000 0.7725 0.4499 0.0354 0.0694 0.0565
  0.7725 1.0000 0.6856 0      0.0463 0.0199
  0.4499 0.6856 1.0000 0.0085 0.0612 0.0140
  0.0354 0      0.0085 1.0000 0.5944 0.5030
  0.0694 0.0463 0.0612 0.5944 1.0000 0.5882
  0.0565 0.0199 0.0140 0.5030 0.5882 1.0000

Scalar Clusters:
  1.0000 0.9604 0.8240 0.0847 0.1459 0.1136
  0.9604 1.0000 0.9245 0.0388 0.1063 0.0660
  0.8240 0.9245 1.0000 0.0465 0.1174 0.0655
  0.0847 0.0388 0.0465 1.0000 0.8972 0.8459
  0.1459 0.1063 0.1174 0.8972 1.0000 0.8946
  0.1136 0.0660 0.0655 0.8459 0.8946 1.0000

Notice that "index" became much closer to "database".
22
Metric Clusters
  • Let r(tu, tv) be the minimum distance (in terms of
    the number of separating words) between tu and tv
    in any single document (infinity if they never
    occur together in a document); an average distance
    can also be used instead of the minimum
  • Define the cluster matrix S_uv = 1/r(tu, tv)

Nth metric cluster for a term tu is the set of
terms tv such that S_uv are the n largest values
among S_u1, S_u2, …, S_uk
r(tu, tv) is also useful for proximity queries and
phrase queries
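A rough sketch of how the metric-cluster entries could be computed from token positions (illustrative only; the slide does not prescribe an implementation):

import math

def metric_similarity(docs, tu, tv):
    """S_uv = 1/r(tu, tv), where r is the smallest positional gap between
    tu and tv in any single document (returns 0 if they never co-occur)."""
    r = math.inf
    for tokens in docs:
        pos_u = [i for i, w in enumerate(tokens) if w == tu]
        pos_v = [i for i, w in enumerate(tokens) if w == tv]
        for i in pos_u:
            for j in pos_v:
                r = min(r, abs(i - j))
    return 0.0 if math.isinf(r) else 1.0 / r

docs = [["database", "index", "speeds", "up", "query", "processing"],
        ["linear", "regression", "and", "likelihood", "methods"]]
print(metric_similarity(docs, "database", "index"))   # 1.0 (adjacent words)
print(metric_similarity(docs, "database", "linear"))  # 0.0 (never co-occur)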
23
Similarity Thesaurus
  • The similarity thesaurus is based on term-to-term
    relationships rather than on a matrix of
    co-occurrence.
  • It is obtained by considering the terms as
    concepts in a concept space.
  • Each term is indexed by the documents in which it
    appears.
  • Terms assume the original role of documents, while
    documents are interpreted as indexing elements

24
Motivation
[Figure: terms Ka, Kb, Ki, Kj, Kv and the query Q as vectors in the
concept space.]
25
Similarity Thesaurus
  • The relationship between two terms ku and kv is
    computed as a correlation factor c_{u,v}, given by
    the dot product of the two term vectors:
    c_{u,v} = k_u · k_v
  • The global similarity thesaurus is built through
    the computation of the correlation factor c_{u,v}
    for each pair of indexing terms ku, kv in the
    collection
  • Expensive
  • Possible to do incremental updates

Similar to the scalar clusters idea, but with
tf/itf weighting defining the term vectors.
26
Similarity Thesaurus
  • Terminology
  • t: number of terms in the collection
  • N: number of documents in the collection
  • f_{i,j}: frequency of occurrence of the term ki in
    the document dj
  • t_j: number of distinct terms (vocabulary) of
    document dj
  • itf_j: inverse term frequency for document dj,
    itf_j = log(t / t_j)
  • To ki is associated a vector
    k_i = (w_{i,1}, w_{i,2}, …, w_{i,N})
  • where w_{i,j} is a tf·itf weight, roughly
    w_{i,j} ∝ (0.5 + 0.5·f_{i,j}/max_l f_{i,l}) · itf_j,
    normalized so that |k_i| = 1

Idea: it is no surprise if the Oxford
dictionary mentions the word!
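A sketch of a similarity thesaurus under the assumed weighting above (w_ij proportional to (0.5 + 0.5·f_ij/max_f)·itf_j, unit-normalized); the slide's exact formula is not shown in the transcript, so treat these weights as illustrative:

import numpy as np

F = np.array([[2., 1., 0., 2.],    # f_ij: frequency of term ki in document dj (toy data)
              [0., 0., 1., 0.],
              [1., 0., 3., 0.]])

t = F.shape[0]                                    # number of terms in the collection
t_j = (F > 0).sum(axis=0)                         # distinct terms per document
itf = np.log(t / t_j)                             # inverse term frequency per document

max_f = F.max(axis=1, keepdims=True)              # max frequency of each term
W = np.where(F > 0, (0.5 + 0.5 * F / max_f) * itf, 0.0)
W = W / np.linalg.norm(W, axis=1, keepdims=True)  # unit-length term vectors k_i

C = W @ W.T                                       # correlation factors c_uv = k_u . k_v
print(np.round(C, 3))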
27
Beyond Correlation analysis: PCA/LSI
  • Suppose I start with documents described in terms
    of just two key words, u and v, but then
  • Add a bunch of new keywords (of the form 2u-3v,
    4u-v, etc.), and give the new doc-term matrix to
    you. Will you be able to tell that the documents
    are really 2-dimensional (in that there are only
    two independent keywords)?
  • Suppose, in the above, I also add a bit of noise
    to each of the new terms (i.e. 2u-3v+noise,
    4u-v+noise, etc.). Can you now discover that the
    documents are really 2-D?
  • Suppose further, I remove the original keywords,
    u and v, from the doc-term matrix, and give you
    only the new linearly dependent keywords. Can you
    now tell that the documents are 2-dimensional?
  • Notice that in this last case, the true
    dimensions of the data are not even present in
    the representation! You have to re-discover the
    true dimensions as linear combinations of the
    given dimensions.
  • Which means the current terms themselves are
    vectors in the original space..

28
PCA/LSI continued
  • The fact that keywords in the documents are not
    actually independent, and that they have synonymy
    and polysemy among them, often manifests itself
    as if some malicious oracle mixed up the data as
    above.
  • Need Dimensionality Reduction Techniques
  • If the keyword dependence is only linear (as
    above), a general polynomial complexity technique
    called Principal Components Analysis is able to
    do this dimensionality reduction
  • PCA applied to documents is called Latent
    Semantic Indexing
  • If the dependence is nonlinear, you need
    non-linear dimensionality reduction techniques
    (such as neural networks) much costlier.

29
Visual Example
  • Data on Fish
  • Length
  • Height

30
Move Origin
  • To center of centroid
  • But are these the best axes?

31
  • Better if one axis accounts for most data
    variation
  • What should we call the red axis? Size (factor)

32
Reduce Dimensions
  • What if we only consider "size"?

We retain 1.75/2.00 × 100 = 87.5% of the
original variation. Thus, by discarding the
yellow axis we lose only 12.5% of the original
information.
33
LSI as a special case of LDA
  • Dimensionality reduction (or feature selection)
    is typically done in the context of specific
    classification tasks
  • We want to pick dimensions (or features) that
    maximally differentiate across classes, while
    having minimal variance within any given class
  • When doing dimensionality reduction w.r.t a
    classification task, we need to focus on
    dimensions that
  • Increase variance across classes
  • and reduce variance within each class
  • Doing this is called LDA (linear discriminant
    analysis)
  • LSI, as given, is insensitive to any particular
    classification task and only focuses on data
    variance
  • LSI is a special case of LDA where each point
    defines its own class
  • This makes sense, since relevant vs.
    irrelevant documents are query dependent

In the example above, the red line corresponds to
the dimension with the most data variance; however,
the green line corresponds to the axis that does
a better job of capturing the class variance
(assuming that the two different blobs correspond
to the different classes).
34
9/11
In remembrance of all the lives and liberties
lost to the wars by and on terror
Guernica, Picasso
35
9/11: Project Part A assigned. Will send mail about
tomorrow's make-up class.
36
Project A - Overview
  • Create a search engine
  • Vector Space Model
  • Tf-Idf weights
  • w_ij = tf(i,j) × idf(i)
  • Cosine Similarity
  • Sim(q, dj) = Σ_i (w_ij · w_iq) / (|dj| · |q|)
  • Hint: to improve performance, consider
    pre-computing |dj| for all documents rather than
    doing it each time you compute the similarity of
    a document
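A minimal in-memory sketch of this ranking formula (hypothetical toy corpus; the actual project uses Lucene as described on the next slides):

import math
from collections import Counter

docs = {"d1": "fall semester grades posted", "d2": "decal parking permits"}
N = len(docs)

tf = {d: Counter(text.split()) for d, text in docs.items()}
df = Counter(term for counts in tf.values() for term in counts)
idf = {term: math.log(N / df[term]) for term in df}

weights = {d: {t: c * idf[t] for t, c in counts.items()} for d, counts in tf.items()}
norms = {d: math.sqrt(sum(w * w for w in ws.values())) for d, ws in weights.items()}  # |dj| precomputed once

def rank(query):
    q = {t: idf.get(t, 0.0) for t in query.split()}
    qnorm = math.sqrt(sum(w * w for w in q.values())) or 1.0
    scores = {}
    for d, ws in weights.items():
        dot = sum(w * ws.get(t, 0.0) for t, w in q.items())
        scores[d] = dot / (norms[d] * qnorm) if norms[d] else 0.0
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(rank("decal parking"))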

37
Project A Lucene API
  • Lucene - inverted index functionality
  • Note: use the version provided on the course
    website
  • Useful Methods
  • IndexReader Class
  • terms(): returns the lexicon of terms in the
    index
  • numDocs(): number of docs in the index
  • docFreq(Term t): number of docs containing term t
  • termDocs(Term t): returns docs containing term t
  • termPositions(Term t): returns docs containing
    term t along with the positions [optional]
38
Project A Deliverables
  • A write up explaining your algorithm and brief
    evaluation of its performance.
  • Hardcopy showing the top 10 documents ranked by
    Vector Space model for the 10 example queries
    below.
  • Hardcopy of your code with comments.
  • Example Queries
  • Fall semester Grades
  • SRC Newsletter
  • Decal Parking
  • Hayden Library Transcripts
  • Scholarship Admissions
  • Demo

39
If you can do it for fish, why not for docs?
  • We have documents as vectors in the space of
    terms
  • We want to
  • Transform the axes so that the new axes are
  • Orthonormal (independent axes)
  • Notice that the new fish axes are uncorrelated..
  • Can be ordered in terms of the amount of
    variation in the documents they capture
  • Pick top K dimensions (axes) in this ordering
    and use these new K dimensions to do the
    vector-space similarity ranking
  • Why?
  • Can reduce noise
  • Can eliminate dependent variables
  • Can capture synonymy and polysemy
  • How?
  • SVD (Singular Value Decomposition)

40
SVD is the Solution, but what is the problem?
  • What we want to do: given M of rank R, find a
    matrix M' of rank R' < R such that ||M - M'|| is
    the smallest
  • Using multi-variable calculus, it can be shown
    that the solution is related to eigen
    decomposition
  • More specifically, Singular Value Decomposition
    of a matrix
  • SVD of the matrix d-t is three matrices d-f, f-f,
    t-f such that
  • d-t = d-f · f-f · (t-f)^T
  • d-f holds the eigenvectors of (d-t)·(d-t)^T
  • t-f holds the eigenvectors of (d-t)^T·(d-t)
  • f-f is a diagonal matrix whose diagonal values are
    the +ve square roots of the eigenvalues of
    (d-t)·(d-t)^T (equivalently, of (d-t)^T·(d-t))
  • Rank of a matrix M is defined as the size of the
    largest square sub-matrix of M which has a
    non-zero determinant.
  • The rank of a matrix M is also equal to the
    number of non-zero singular values it has
  • Rank of M is related to the true dimensionality
    of M. If you add a bunch of rows to M that are
    linear combinations of the existing rows of M,
    the rank of the new matrix will still be the same
    as the rank of M.
  • Distance between two equi-sized matrices M and
    M': ||M - M'|| is defined as the sum of the squares
    of the differences between the corresponding
    entries: Σ_{u,v} (m_uv - m'_uv)²
  • Will be equal to zero when M = M'
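A numpy sketch illustrating the rank-k guarantee stated above (random stand-in matrix, not the lecture's example):

import numpy as np

rng = np.random.default_rng(1)
M = rng.random((9, 8))                              # stand-in doc-term matrix

U, s, Vt = np.linalg.svd(M, full_matrices=False)    # M = U @ diag(s) @ Vt
print("rank:", int(np.sum(s > 1e-10)))

k = 2
Mk = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]          # best rank-k approximation
err = np.sum((M - Mk) ** 2)                         # ||M - Mk||^2, sum of squared differences
print(err, "==", np.sum(s[k:] ** 2))                # equals the sum of the squared dropped singular values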

41
Bunch of Facts about SVD
  • Relation between SVD and eigenvalue
    decomposition
  • Eigenvalue decomposition is defined only for
    square matrices
  • Only square symmetric matrices have real-valued
    eigenvalues
  • PCA (principal component analysis) is normally
    done on correlation matrices, which are square
    symmetric (think of the d-d or t-t matrices).
  • SVD is defined for all matrices
  • Given a matrix d-t, we consider the eigen
    decomposition of the correlation matrices
    d-d (= (d-t)·(d-t)^T) and t-t (= (d-t)^T·(d-t)).
    The SVD is
  • (1) the eigenvectors of d-d, (2) the positive
    square roots of the eigenvalues of d-d (or t-t),
    and (3) the eigenvectors of t-t
  • Both d-d and t-t are symmetric (they are
    correlation matrices)
  • They both will have the same eigenvalues
  • Unless M is symmetric, M·M^T and M^T·M are different
  • So, in general, their eigenvectors will be
    different (although their eigenvalues are the same)
  • Since SVD is defined in terms of the eigenvalues
    and eigenvectors of the correlation matrices of a
    matrix, the eigenvalues will always be real
    valued (even if the matrix M is not symmetric).
  • In general, the SVD decomposition of a matrix M
    equals its eigen decomposition only if M is both
    square and symmetric

42
Rank and Dimensionality
  • What we want to do: given M of rank R, find a
    matrix M' of rank R' < R such that ||M - M'|| is
    the smallest
  • If you do a bit of calculus of variations, you
    will find that the solution is related to eigen
    decomposition
  • More specifically, Singular Value Decomposition
    of a matrix
  • Suppose we did SVD on a doc-term matrix d-t, and
    took the top-k eigenvalues and reconstructed the
    matrix d-t_k. We know
  • d-t_k has rank k (since we zeroed out all the
    other eigenvalues when we reconstructed d-t_k)
  • There is no rank-k matrix M' such that
    ||d-t - M'|| < ||d-t - d-t_k||
  • In other words, d-t_k is the best rank-k
    (dimension-k) approximation to d-t!
  • This is the guarantee given by SVD!
  • Rank of a matrix M is defined as the size of the
    largest square sub-matrix of M which has a
    non-zero determinant.
  • The rank of a matrix M is also equal to the
    number of non-zero singular values it has
  • Rank of M is related to the true dimensionality
    of M. If you add a bunch of rows to M that are
    linear combinations of the existing rows of M,
    the rank of the new matrix will still be the same
    as the rank of M.
  • Distance between two equi-sized matrices M and
    M': ||M - M'|| is defined as the sum of the squares
    of the differences between the corresponding
    entries: Σ_{u,v} (m_uv - m'_uv)²
  • Will be equal to zero when M = M'

Note that because the LSI dimensions are
uncorrelated, finding the best k LSI dimensions
is the same as sorting the dimensions in terms of
their individual variance (i.e., corresponding
singular values), and picking the top k.
43
Rank and Dimensionality 2
  • Suppose we did SVD on a doc-term matrix d-t, and
    took the top-k eigenvalues and reconstructed the
    matrix d-t_k. We know
  • d-t_k has rank k (since we zeroed out all the
    other eigenvalues when we reconstructed d-t_k)
  • There is no rank-k matrix M' such that
    ||d-t - M'|| < ||d-t - d-t_k||
  • In other words, d-t_k is the best rank-k
    (dimension-k) approximation to d-t!
  • This is the guarantee given by SVD!
  • Rank of a matrix M is defined as the size of the
    largest square sub-matrix of M which has a
    non-zero determinant.
  • The rank of a matrix M is also equal to the
    number of non-zero singular values it has
  • Rank of M is related to the true dimensionality
    of M. If you add a bunch of rows to M that are
    linear combinations of the existing rows of M,
    the rank of the new matrix will still be the same
    as the rank of M.
  • Distance between two equi-sized matrices M and
    M': ||M - M'|| is defined as the sum of the squares
    of the differences between the corresponding
    entries: Σ_{u,v} (m_uv - m'_uv)²
  • Will be equal to zero when M = M'

Note that because the LSI dimensions are
uncorrelated, finding the best k LSI dimensions
is the same as sorting the dimensions in terms of
their individual variance (i.e., corresponding
singular values), and picking the top k.
44
What happens if you multiply a vector by a matrix?
  • In general, when you multiply a vector by a
    matrix, the vector gets scaled as well as
    rotated
  • ..except when the vector happens to be in the
    direction of one of the eigen vectors of the
    matrix
  • .. in which case it only gets scaled (stretched)
  • A (symmetric square) matrix has all real eigen
    values, and the values give an indication of the
    amount of stretching that is done for vectors in
    that direction
  • The eigen vectors of the matrix define a new
    ortho-normal space
  • You can model the multiplication of a general
    vector by the matrix in terms of
  • First decompose the general vector into its
    projections in the eigen vector directions
  • ..which means just take the dot product of the
    vector with the (unit) eigen vector
  • Then multiply the projections by the
    corresponding eigenvalues to get the new vector.
  • This explains why power method converges to
    principal eigen vector..
  • ..since if a vector has a non-zero projection in
    the principal eigen vector direction, then
    repeated multiplication will keep stretching the
    vector in that direction, so that eventually all
    other directions vanish by comparison..

Optional
45
Terms and Docs as vectors in factor space
In addition to doc-doc similarity, we can
compute term-term distance.
[Figure: the doc-term matrix, with one document (row) vector and one
term (column) vector highlighted.]
If terms are independent, the T-T similarity
matrix would be diagonal. If it is not
diagonal, we can use the correlations to
add related terms to the query. But we can
also ask the question: are there
independent dimensions which define
the space where terms & docs are
vectors?
46
Overview of Latent Semantic Indexing
[Figure: the SVD of the doc-term matrix and its rank-k truncation.]

  d-t (d×t)   =  d-f (d×f)   · f-f (f×f)   · (t-f)^T (f×t)
  d-t_k (d×t) ≈  d-f_k (d×k) · f-f_k (k×k) · (t-f_k)^T (k×t)

factor-factor (f-f): +ve square roots of the eigenvalues of
(d-t)·(d-t)^T or (d-t)^T·(d-t) (both the same)
Doc-factor (d-f): eigenvectors of (d-t)·(d-t)^T
(term-factor)^T, i.e. (t-f)^T: eigenvectors of (d-t)^T·(d-t)

Singular Value Decomposition: convert the doc-term matrix
into the 3 matrices D-F, F-F, T-F, where D-F·F-F·(T-F)^T gives
the original matrix back.
Reduce Dimensionality: throw out low-order rows
and columns.
Recreate Matrix: multiply to produce the approximate
term-document matrix. d-t_k is the rank-k matrix
that is closest to d-t.
47
t1 = database, t2 = SQL, t3 = index, t4 = regression,
t5 = likelihood, t6 = linear

[Figure: the D-F, F-F and T-F matrices for this example.]
F-F: the 6 singular values (positive square roots of the
eigenvalues of d-d or t-t)
D-F: eigenvectors of d-d (= (d-t)·(d-t)^T), the principal
document directions
T-F: eigenvectors of t-t (= (d-t)^T·(d-t)), the principal
term directions
48
For the database/regression example
t1 = database, t2 = SQL, t3 = index, t4 = regression,
t5 = likelihood, t6 = linear
Suppose D1 is a new doc containing "database" 50
times and D2 contains "SQL" 50 times.
49
Visualizing the Loss
Reconstruction (rounded) with 2 LSI dimensions
(rank 2, variance loss 7.5%); rows = the 10 documents,
columns = the terms t1…t6:
  26 18 10  0  1  1
  25 17  9  1  2  1
  15 11  6 -1  0  0
   7  5  3  0  0  0
  44 31 17  0  2  1
   2  0  0 19 10 11
   1 -1  0 24 12 14
   2  0  0 16  8  9
   2 -1  0 39 20 22
   4  2  1 20 11 12

Reconstruction (rounded) with 4 LSI dimensions
(rank 4, variance loss 1.4%):
  24 20 10 -1  2  3
  32 10  7  1  0  1
  12 15  7 -1  1  0
   6  6  3  0  1  0
  43 32 17  0  3  0
   2  0  0 17 10 15
   0  0  1 32 13  0
   3 -1  0 20  8  1
   0  1  0 37 21 26
   7 -1 -1 15 10 22

(The original matrix has rank 6.)
50
LSI Ranking
  • Given a query
  • Either add the query as one more document in the
    D-T matrix and redo the SVD, OR
  • Convert the query vector (separately) to the LSI
    space
  • DF_q · FF = q · TF, i.e., DF_q = q · TF · (FF)^-1
  • this is the weighted query document in LSI space
  • Reduce dimensionality as needed
  • Do the vector-space similarity in the LSI space
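A sketch of query folding and ranking in the LSI space (toy random matrices; assumes the q·TF·(FF)^-1 convention above):

import numpy as np

def fold_query(q, TF, FF):
    """q: raw query vector over the t terms -> its k-dimensional LSI representation."""
    return q @ TF @ np.linalg.inv(FF)

def lsi_rank(q, DF, FF, TF):
    qf = fold_query(q, TF, FF)
    sims = (DF @ qf) / (np.linalg.norm(DF, axis=1) * np.linalg.norm(qf) + 1e-12)
    return np.argsort(-sims)          # document indices, best match first

# toy example: 4 docs, 5 terms, k = 2 factors
rng = np.random.default_rng(2)
D_T = rng.random((4, 5))
U, s, Vt = np.linalg.svd(D_T, full_matrices=False)
DF, FF, TF = U[:, :2], np.diag(s[:2]), Vt[:2, :].T

q = np.array([1.0, 0.0, 1.0, 0.0, 0.0])   # query containing terms 1 and 3
print(lsi_rank(q, DF, FF, TF))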

51
Using LSI
  • Can be used on the entire corpus
  • First compute the SVD of the entire corpus
  • Store the first k columns of the df·ff matrix:
    (df·ff)_k
  • Keep the tf matrix handy
  • When a new query q comes, take the k columns of
    q·tf
  • Compute the vector similarity between (q·tf)_k and
    all rows of (df·ff)_k, rank the documents, and
    return
  • Can be used as a way of clustering the results
    returned by normal vector space ranking
  • Take the top 50 or 100 of the documents returned
    by some ranking (e.g. vector ranking)
  • Do LSI on these documents
  • Take the first k columns of the resulting df·ff
    matrix
  • Each row in this matrix is the representation of
    the original documents in the reduced space.
  • Cluster the documents in this reduced space (we
    will talk about clustering later)
  • MANJARA did this
  • We will need fast SVD computation algorithms for
    this; the MANJARA folks developed approximate
    algorithms for SVD
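A sketch of the second use: SVD on the top results only, then a crude k-means in the reduced space (random stand-in data; MANJARA's actual approximate algorithms are not shown here):

import numpy as np

rng = np.random.default_rng(3)
top_docs = rng.random((100, 500))                 # top-100 docs x vocabulary (toy data)

U, s, Vt = np.linalg.svd(top_docs, full_matrices=False)
k = 10
doc_reps = U[:, :k] * s[:k]                       # rows = docs in the reduced LSI space

# crude k-means in the reduced space (stand-in for a real clustering algorithm)
centroids = doc_reps[rng.choice(len(doc_reps), 5, replace=False)]
for _ in range(20):
    dist = np.linalg.norm(doc_reps[:, None, :] - centroids[None, :, :], axis=2)
    labels = dist.argmin(axis=1)
    centroids = np.array([doc_reps[labels == c].mean(axis=0) if np.any(labels == c)
                          else centroids[c] for c in range(5)])
print(np.bincount(labels))                        # cluster sizes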

52
SVD Computation complexity
  • For an m×n matrix, SVD computation has
  • O(k·m²·n + k'·n³) complexity
  • k = 4 and k' = 22 for the best algorithms
  • Approximate algorithms that exploit the sparsity
    of M are available (and being developed)

53
Summary: What LSI can do
  • LSI analysis effectively does
  • Dimensionality reduction
  • Noise reduction
  • Exploitation of redundant data
  • Correlation analysis and Query expansion (with
    related words)
  • Any one of the individual effects can be achieved
    with simpler techniques (see scalar clustering
    etc). But LSI does all of them together.

54
LSI (dimensionality reduction) vs. Feature
Selection
  • Before reducing dimensions, LSI first finds a new
    basis (coordinate axes) and then selects a subset
    of them
  • Good because the original axes may be too
    correlated to find top-k subspaces containing
    most variance
  • Bad because the new dimensions may not have any
    significance to the user
  • What are the two dimensions of the database
    example?
  • Something like 0.44·database + 0.33·sql + …
  • An alternative is to select a subset of the
    original features themselves
  • Advantage is that the selected features are
    readily understandable by the users (to the
    extent they understood the original features).
  • Disadvantage is that as we saw in the Fish
    example, all the original dimensions may have
    about the same variance, while a (linear)
    combination of them might capture much more
    variation.
  • Another disadvantage is that since original
    features, unlike LSI features, may be correlated,
    finding the best subset of k features is not the
    same as sorting individual features in terms of
    the variance they capture and taking the top-K
    (as we could do with LSI)

55
LSI as a special case of LDA
  • Dimensionality reduction (or feature selection)
    is typically done in the context of specific
    classification tasks
  • We want to pick dimensions (or features) that
    maximally differentiate across classes, while
    having minimal variance within any given class
  • When doing dimensionality reduction w.r.t a
    classification task, we need to focus on
    dimensions that
  • Increase variance across classes
  • and reduce variance within each class
  • Doing this is called LDA (linear discriminant
    analysis)
  • LSI, as given, is insensitive to any particular
    classification task and only focuses on data
    variance
  • LSI is a special case of LDA where each point
    defines its own class
  • Interestingly, LDA is also related to
    eigenvalues.

In the example above, the red line corresponds to
the dimension with the most data variance; however,
the green line corresponds to the axis that does
a better job of capturing the class variance
(assuming that the two different blobs correspond
to the different classes).
56
LSI vs. Nonlinear dimensionality reduction
  • LSI only captures linear correlations
  • It cannot capture non-linear dependencies between
    original dimensions
  • E.g. if the data points are all falling on a
    simple manifold (e.g. a circle in the example
    below),
  • Then, the features are non-linearly correlated
    (here x² + y² = c)
  • LSI analysis can't reduce dimensionality here
  • One idea is to use techniques such as neural nets
    or manifold learning techniques
  • Another, simpler, idea is to first blow up
    the dimensionality of the data by introducing
    new axes that are nonlinear combinations of
    existing ones (e.g. x², y², sqrt(x·y), etc.)
  • We can now capture linear correlations across
    these nonlinear dimensions by doing LSI in this
    enlarged space, and map the k important
    dimensions found back to the original space.
  • So, in order to reduce dimensions, we first
    increase them (talk about crazy!)
  • A way of doing this implicitly is the kernel trick.

Advanced Optional
57
Ignore beyond this slide (hidden)
58
Yet another Example

U (d-f, 9×7):
  0.3996 -0.1037  0.5606 -0.3717 -0.3919 -0.3482  0.1029
  0.4180 -0.0641  0.4878  0.1566  0.5771  0.1981 -0.1094
  0.3464 -0.4422 -0.3997 -0.5142  0.2787  0.0102 -0.2857
  0.1888  0.4615  0.0049 -0.0279 -0.2087  0.4193 -0.6629
  0.3602  0.3776 -0.0914  0.1596 -0.2045 -0.3701 -0.1023
  0.4075  0.3622 -0.3657 -0.2684 -0.0174  0.2711  0.5676
  0.2750  0.1667 -0.1303  0.4376  0.3844 -0.3066  0.1230
  0.2259 -0.3096 -0.3579  0.3127 -0.2406 -0.3122 -0.2611
  0.2958 -0.4232  0.0277  0.4305 -0.3800  0.5114  0.2010

S (f-f, 7×7), diagonal entries:
  3.9901  2.2813  1.6705  1.3522  1.1818  0.6623  0.6487

V (t-f, 8×7; the slide shows its 7×8 transpose):
  0.2917 -0.2674  0.3883 -0.5393  0.3926 -0.2112 -0.4505
  0.3399  0.4811  0.0649 -0.3760 -0.6959 -0.0421 -0.1462
  0.1889 -0.0351 -0.4582 -0.5788  0.2211  0.4247  0.4346
 -0.0000 -0.0000 -0.0000 -0.0000  0.0000 -0.0000  0.0000
  0.6838 -0.1913 -0.1609  0.2535  0.0050 -0.5229  0.3636
  0.4134  0.5716 -0.0566  0.3383  0.4493  0.3198 -0.2839
  0.2176 -0.5151 -0.4369  0.1694 -0.2893  0.3161 -0.5330
  0.2791 -0.2591  0.6442  0.1593 -0.1648  0.5455  0.2998

This happens to be a rank-7 matrix, so only 7
dimensions are required.
Singular values = square roots of the eigenvalues of A·A^T.
59
Formally, this will be the rank-k (k = 2) matrix that
is closest to M in the matrix-norm sense.

DF (9×7), FF (7×7) and TF (8×7) are the same U, S and V
matrices shown on the previous slide.

Keeping only the first two factors:

DF2 (9×2):
  0.3996 -0.1037
  0.4180 -0.0641
  0.3464 -0.4422
  0.1888  0.4615
  0.3602  0.3776
  0.4075  0.3622
  0.2750  0.1667
  0.2259 -0.3096
  0.2958 -0.4232

FF2 (2×2):
  3.9901  0
  0       2.2813

TF2 (8×2):
  0.2917 -0.2674
  0.3399  0.4811
  0.1889 -0.0351
 -0.0000 -0.0000
  0.6838 -0.1913
  0.4134  0.5716
  0.2176 -0.5151
  0.2791 -0.2591

DF2 · FF2 · (TF2)^T will be a 9×8 matrix that
approximates the original matrix.
60
What should be the value of k?

  df · ff · tf^T            (exact; equivalently df7 · ff7 · (tf7)^T, since the matrix has rank 7)
  k = 6: df6 · ff6 · (tf6)^T  (one component ignored)
  k = 4: df4 · ff4 · (tf4)^T  (3 components ignored)
  k = 2: df2 · ff2 · (tf2)^T  (5 components ignored)
61
Coordinate transformation inherent in LSI
Doc rep: T-D = T-F · F-F · (D-F)^T
Mapping of keywords into LSI space is given by
T-F · F-F
Mapping of a doc d = (w1, …, wk) into LSI space is
given by d · T-F · (F-F)^-1
The base keywords of the doc are first mapped to
LSI keywords and then differentially weighted by
(F-F)^-1.

For k = 2, the mapping of the terms is:

  term               LSIx        LSIy
  controllability    1.5944439  -0.2365708
  observability      1.6678618  -0.14623132
  realization        1.3821706  -1.0087909
  feedback           0.7533309   1.05282
  controller         1.4372339   0.86141896
  observer           1.6259657   0.82628685
  transfer function  1.0972775   0.38029274
  polynomial         0.90136355 -0.7062905
  matrices           1.1802715  -0.96544623

[Figure: the terms and the docs (e.g. ch3) plotted in the 2-D LSI
space (LSIx, LSIy).]
62
Querying
To query for "feedback controller", the query
vector would be
  q = (0 0 0 1 1 0 0 0 0)'   (' indicates transpose)
since "feedback" and "controller" are the 4th and
5th terms in the index, and no other terms are
selected. Let q be the query vector. Then the
document-space vector corresponding to q is given by
  q' · TF(2) · inv(FF(2)) = Dq
For the "feedback controller" query vector, the
result is
  Dq = (0.1376, 0.3678)
(this is the centroid of the query's terms in LSI
space, with scaling by inv(F-F)).
To find the best document match, we compare the Dq
vector against all the document vectors in the
2-dimensional V2 (i.e., D-F) space. The document
vector that is nearest in direction to Dq is the
best match. The cosine values for the eight
document vectors and the query vector are
  -0.3747  0.9671  0.1735  -0.9413  0.0851  0.9642  -0.7265  -0.3805
63
F-F is a diagonal matrix, so its inverse is
diagonal too. Diagonal matrices are symmetric.
64
Variations in the examples?
  • DB-Regression example
  • Started with the D-T matrix
  • Used the term axes as T-F and the doc rep as
    D-F·F-F
  • Q is converted into q·T-F
  • Chapter/Medline etc. examples
  • Started with the T-D matrix
  • Used the term axes as T-F·F-F and the doc rep as D-F
  • Q is converted to q·T-F·(F-F)^-1

We will stick to this convention
65
Medline data from Berry's paper
66
Within .40 threshold
K is the number of singular values used