Title: Matrix Decomposition Methods in Information Retrieval
1. Matrix Decomposition Methods in Information Retrieval
- Thomas Hofmann
- Department of Computer Science
- Brown University
- www.cs.brown.edu/people/th
- (Chief Scientist, RecomMind Inc.)
In collaboration with Jan Puzicha (UC Berkeley / RecomMind) and David Cohen (CMU / Burning Glass)
2. Overview
- Introduction: A Brief History of Mechanical IR
- Latent Semantic Analysis
- Probabilistic Latent Semantic Analysis
- Learning (from) Hyperlink Graphs
- Collaborative Filtering
- Future Work and Conclusion
3. Introduction: A Brief History of Mechanical IR
4. Memex: As We May Think
- Vannevar Bush (1945)
- The idea of an easily accessible, individually configurable storehouse of knowledge; the beginning of the literature on mechanized information retrieval
- "Consider a future device for individual use, which is a sort of mechanized private file and library. It needs a name, and to coin one at random, memex will do. A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility. It is an enlarged intimate supplement to his memory."
- "The world has arrived at an age of cheap complex devices of great reliability and something is bound to come of it."
5. Memex: As We May Think
- Vannevar Bush (1945)
- The civilizational challenge:
- "The difficulty seems to be, not so much that we publish unduly in view of the extent and variety of present-day interests, but rather that publication has been extended far beyond our present ability to make real use of the record. The summation of human experience is being expanded at a prodigious rate, and the means we use for threading through the consequent maze to the momentarily important item is the same as was used in the days of square-rigged ships."
V. Bush, "As We May Think," Atlantic Monthly, 176 (1945), pp. 101-108.
6. The Thesaurus Approach
- Hans Peter Luhn (1957, 1961)
- Words of similar or related meaning are grouped into notional families
- Encoding of documents in terms of notional elements
- Matching by measuring the degree of notional similarity
- A common language for annotating documents; key word in context (KWIC) indexing
- "The faculty of interpretation is beyond the talent of machines."
- Statistical cues extracted by machines to assist the human indexer; a vocabulary method for detecting similarities
H.P. Luhn, A Statistical Approach to Mechanical Literature Searching, New York: IBM Research Center, 1957.
H.P. Luhn, The Automatic Derivation of Information Retrieval Encodements from Machine-Readable Text, Information Retrieval and Machine Translation, 3(2), pp. 1021-1028, 1961.
7. To Punch or Not to Punch
- T. Joyce & R.M. Needham (1958)
- Lattices / hierarchies of search terms
- "As in other systems, the documents are represented by holes in punched cards which represent the various terms, and in addition, when a hole is punched in any term card, all the terms at higher levels of the lattice are also punched."
- The post-coordinate revolution: card sorting at search time!
- "Investigations to lessen the physical work are continuing."
T. Joyce & R.M. Needham, The Thesaurus Approach to Information Retrieval, American Documentation, 9, pp. 192-197, 1958.
8. Term Associations
- Lauren B. Doyle (1962)
- Unusual co-occurrences of pairs of words reveal associations of words in text
- Statistical testing: chi-square and Pearson correlation coefficient to determine pairwise correlations
- Term association maps for interactive retrieval
- Today: semantic maps
L.B. Doyle, Indexing and Abstracting by Association, Unisys Corporation, 1962.
9. Probabilistic Relevance Model
- M.E. Maron & J.L. Kuhns (1960)
- S.E. Robertson & K. Sparck Jones (1976)
- Various models, e.g., the binary independence model
- Problem: how to estimate these conditional probabilities?
10. Vector Space Model
- Gerard Salton (1960s/70s)
- Instead of indexing documents by selected index terms, preserve (almost) all terms in automatic indexing
- Represent documents by a high-dimensional vector
- Each term can be associated with a weight
- Geometrical interpretation
G. Salton, The SMART Retrieval System: Experiments in Automatic Document Processing, 1971.
11Term-Document Matrix
W terms in vocabulary
D documents in database
intelligence
Texas Instruments said it has developed the first
32-bit computer chip designed specifically for
artificial intelligence applications ...
term-document matrix
intelligence
artificial
interest
artifact
t
d
...
1
0
0
...
...
2
12. Documents in Inner Space
- Similarity between document and query: cosine of the angle between query and document vectors
- Retrieval method:
- Rank documents according to similarity with the query
- Term weighting schemes, for example TF-IDF
- Used in the SMART system and many successor systems; highly popular
(A minimal sketch of this retrieval step follows below.)
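To make the retrieval method above concrete, here is a minimal Python/NumPy sketch; the toy documents, the particular TF-IDF variant (sublinear tf, simple idf), and all names are illustrative assumptions, not the SMART system's exact weighting.

```python
import numpy as np

# Toy corpus; documents and vocabulary are illustrative assumptions.
docs = [
    "artificial intelligence chip designed for artificial intelligence applications",
    "interest rates and artificial sweeteners",
    "intelligence agency report on chip manufacturing",
]
query = "artificial intelligence chip"

# Build vocabulary and raw term-document count matrix A (W terms x D docs).
vocab = sorted({w for d in docs for w in d.split()})
t2i = {t: i for i, t in enumerate(vocab)}
A = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        A[t2i[w], j] += 1

# One common TF-IDF variant: sublinear tf and idf (choices vary across systems).
df = (A > 0).sum(axis=1)
idf = np.log(len(docs) / df)
X = np.log1p(A) * idf[:, None]

# Query vector with the same weighting, then cosine ranking.
q = np.zeros(len(vocab))
for w in query.split():
    if w in t2i:
        q[t2i[w]] += 1
q = np.log1p(q) * idf

def cosine(u, v):
    n = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / n) if n > 0 else 0.0

scores = [cosine(q, X[:, j]) for j in range(len(docs))]
print(sorted(enumerate(scores), key=lambda x: -x[1]))  # documents ranked by similarity
```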
13. Advantages of the Vector Space Model
- No subjective selection of index terms
- Partial matching of queries and documents (dealing with the case where no document contains all search terms)
- Ranking according to similarity score (dealing with large result sets)
- Term weighting schemes (improve retrieval performance)
- Various extensions:
- Document clustering
- Relevance feedback (modifying the query vector)
- Geometric foundation
14. Latent Semantic Analysis
15. Limitations of the Vector Space Model
- Dimensionality
- The vector space representation is high-dimensional (several 10-100K dimensions)
- Learning and estimation have to deal with the curse of dimensionality
- Sparseness
- Document vectors are typically very sparse
- Cosine similarity can be noisy and inaccurate
- Semantics
- The inner product can only match occurrences of exactly the same terms
- The vector representation does not capture semantic relations between words
- Independence
- Bag-of-words representation
- Unable to capture phrases and semantic/syntactic regularities
16. The Lost Meaning of Words
- Ambiguity and association in natural language
- Polysemy: words often have a multitude of meanings and different types of usage (more urgent for very heterogeneous collections)
- The vector space model is unable to discriminate between different meanings of the same word
- Synonymy: different terms may have an identical or a similar meaning (weaker: words indicating the same topic)
- No associations between words are made in the vector space representation
17. Polysemy and Context
- Document similarity on the single-word level suffers from polysemy and ignores context
18. Latent Semantic Analysis
- General idea:
- Map documents (and terms) to a low-dimensional representation
- Design the mapping such that the low-dimensional space reflects semantic associations (latent semantic space)
- Compute document similarity based on the inner product in the latent semantic space
- Goals:
- Similar terms map to similar locations in the low-dimensional space
- Noise reduction by dimension reduction
19. LSA: Matrix Decomposition by SVD
- Dimension reduction by singular value decomposition of the term-document matrix
- Entries: word frequencies (possibly transformed)
- Document length normalization
- Sublinear transformation (e.g., log)
- Global term weight
- [Figure: original term-document matrix ≈ reconstructed term-document matrix, built from term/document vectors and thresholded singular values; the L2-optimal low-rank approximation]
(A small numerical sketch follows below.)
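A small numerical sketch of the SVD-based dimension reduction described above; the random stand-in matrix and the choice k = 5 are illustrative assumptions.

```python
import numpy as np

# A (W x D) stands for the possibly transformed term-document matrix from above;
# here a small random nonnegative matrix is used as an illustrative stand-in.
rng = np.random.default_rng(0)
A = rng.poisson(0.3, size=(50, 20)).astype(float)

k = 5  # number of latent dimensions (chosen ad hoc, as noted in the LSA discussion)

# Full SVD, then threshold: keep only the k largest singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Rank-k reconstruction: the L2 (Frobenius)-optimal approximation of A.
A_k = U_k @ np.diag(s_k) @ Vt_k

# Low-dimensional representations: terms via U_k, documents via diag(s_k) @ Vt_k.
term_latent = U_k                    # shape (W, k)
doc_latent = np.diag(s_k) @ Vt_k     # shape (k, D)

# Document-document similarities computed in the latent space.
sims = doc_latent.T @ doc_latent
print(np.linalg.norm(A - A_k), sims.shape)
```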
20. Background: SVD
- Singular Value Decomposition, definition:
- U and V have orthonormal columns
- Sigma is diagonal with (ordered) singular values
- Properties:
- Existence & uniqueness
- Thresholding small singular values yields an optimal low-rank approximation (in the sense of the Frobenius norm)
(The definition and the low-rank property are spelled out below.)
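The definition and approximation property referred to above, written out; this is standard linear algebra, with the symbol A for the W x D term-document matrix assumed for this sketch.

```latex
% Singular value decomposition of the term-document matrix A:
A = U \Sigma V^{\top}, \qquad
U^{\top} U = I, \quad V^{\top} V = I, \qquad
\Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_r), \quad
\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0 .

% Eckart--Young: truncating to the k largest singular values gives the best
% rank-k approximation in the Frobenius norm,
A_k = U_k \Sigma_k V_k^{\top}
    = \arg\min_{\operatorname{rank}(B) \le k} \| A - B \|_F .
```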
21. SVD and PCA
- If (!) the rows of the term-document matrix were shifted so that their mean is zero, then the SVD would essentially perform a projection onto the principal axes (i.e., PCA)
- Yet this centering would destroy the sparseness of the term-document matrix (and consequently might hurt the performance of SVD methods)
22. Canonical Analysis
- Hirschfeld 1935, Hotelling 1936, Fisher 1940
- Correlation analysis for contingency tables
23. Canonical Correspondence Analysis
- Correspondence analysis (as a method of scaling)
- Guttman 1941, Torgerson 1958, Benzécri 1969, Hill 1974
- Whittaker 1967: gradient analysis
- Solutions: unit vectors and scores of canonical analysis
- SVD of a rescaled count matrix (not exactly what is done in LSA); see the formula below
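The rescaling formula did not survive extraction; the following is a plausible reconstruction based on the standard correspondence-analysis rescaling by row and column marginals, not taken verbatim from the slide.

```latex
% Correspondence analysis applies the SVD to counts rescaled by the marginals,
\tilde{n}_{ij} \;=\; \frac{n_{ij}}{\sqrt{n_{i\cdot}\, n_{\cdot j}}},
\qquad n_{i\cdot} = \sum_{j} n_{ij}, \quad n_{\cdot j} = \sum_{i} n_{ij},
% whereas LSA applies the SVD to (transformed) raw counts,
% hence "not exactly what is done in LSA".
```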
24. Semantic Inner Product / Kernel
- Similarity = inner product in the lower-dimensional space
- For a given decomposition, additional documents or queries can be mapped into the semantic space ("folding-in")
- Hence, a new document/query q obtains a lower-dimensional representation (see the sketch below)
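A hedged sketch of folding-in with the truncated SVD A_k = U_k Σ_k V_kᵀ (standard LSA notation); conventions differ on whether the singular values are reapplied when computing similarities, so the similarity line is one common choice rather than the slide's exact formula.

```latex
% Latent representation of a new document/query vector q (given in term space):
\hat{q} \;=\; \Sigma_k^{-1} U_k^{\top} q ,
% and the semantic inner product between q and a document d in latent space:
\operatorname{sim}(q, d) \;\propto\; \hat{q}^{\top} \hat{d} .
```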
25. Term Associations from LSA
- [Figure: two terms ("Term 1", "Term 2") mapping close to a shared concept direction in latent space; taken from a slide by S. Dumais]
26. LSA Discussion
- Pros:
- The low-dimensional document representation is able to capture synonyms
- Noise removal and robustness by dimension reduction
- Experimental advantages over the naïve vector space model
- Cons:
- Formally: the L2 norm is inappropriate as a distance function for count vectors (the reconstruction may contain negative entries)
- Conceptually:
- The problem of polysemy is not addressed; principle of linear superposition, no active disambiguation
- The context of terms is not taken into account
- Directions in latent space are hard to interpret
- No probabilistic model of term occurrences
- Ad hoc selection of the number of dimensions, ...
27. Features of IR Methods

Features                         VSM    LSA
Quantitative relevance score     yes    yes
Partial query matching           yes    yes
Document similarity              yes    yes
Word correlations, synonyms      no     yes
Low-dimensional representation   no     yes
Notional families, concepts      no     not really
Dealing with polysemy            no     no
Probabilistic model              no     no
Sparse representation            yes    no
28. Probabilistic Latent Semantic Analysis
29. Documents as Information Sources
- W words in vocabulary, D documents in database
- Real document: empirical probability distribution (relative word frequencies)
- Ideal document: a (memoryless) information source, from which other documents could also be generated
30. Information Source Models in IR
- Bayes' rule: probability of relevance of a document w.r.t. a query, combined with the prior probability of relevance
- Language model: probability that the query q is generated from document d
- Translation model: probability that each query term is generated ("translated") from the document
(Hedged reconstructions of both models follow below.)
J. Ponte & W.B. Croft, A Language Model Approach to Information Retrieval, SIGIR 1998.
A. Berger & J. Lafferty, Information Retrieval as Statistical Translation, SIGIR 1999.
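Hedged reconstructions of the two models named above, in their standard forms; the exact smoothing and estimation details from the slide are not recoverable.

```latex
% Query-likelihood language model (Ponte & Croft): rank documents by the
% probability that the query q = q_1 ... q_m is generated from document d,
P(q \mid d) \;=\; \prod_{i=1}^{m} P(q_i \mid d) .

% Statistical-translation variant (Berger & Lafferty): a query term may be
% "translated" from any document word w,
P(q \mid d) \;=\; \prod_{i=1}^{m} \sum_{w} P(q_i \mid w)\, P(w \mid d) .
```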
31. Probabilistic Latent Semantic Analysis
- How can we learn document-specific language models? Sparseness problem, even for unigrams.
- Probabilistic dimension reduction techniques to overcome the data sparseness problem
- Factor analysis for count data: factors ≈ concepts
T. Hofmann, Probabilistic Latent Semantic Analysis, UAI 1999.
32-35. PLSA Graphical Model
- [Figure: plate diagram with latent topic z and word w inside a plate of size c(d), repeated for the N documents; the arrow into z carries P(z|d), the arrow into w carries P(w|z)]
- P(z|d) is shared by all words in a document
- P(w|z) is shared by all documents in the collection
(The corresponding generative model is written out below.)
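The generative (aspect) model corresponding to the plate diagram, in the standard PLSA form of the cited paper:

```latex
% A word occurrence (d, w) is generated by first picking a latent topic z:
P(d, w) \;=\; P(d) \sum_{z} P(w \mid z)\, P(z \mid d),
\qquad
P(w \mid d) \;=\; \sum_{z} P(w \mid z)\, P(z \mid d).

% Log-likelihood of the collection, with n(d, w) the term-document counts:
\mathcal{L} \;=\; \sum_{d} \sum_{w} n(d, w) \,\log P(w \mid d).
```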
36. Probabilistic Latent Semantic Space
- Documents are represented as points in a low-dimensional sub-simplex (dimensionality reduction for probability distributions)
- [Figure: the sub-simplex spanned by the factors, embedded in the full word simplex]
- KL-divergence projection, not an orthogonal projection
37. Positive Matrix Decomposition
- Mixture decomposition in matrix notation (see the sketch below)
- Constraints:
- Non-negativity of all matrices
- Normalization according to the L1 norm
- (no orthogonality)
D.D. Lee & H.S. Seung, Learning the parts of objects by non-negative matrix factorization, Nature, 1999.
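A sketch of the mixture decomposition in matrix notation; the symbols U, Σ, V are chosen here to mirror the SVD comparison on the next slide and are an assumption of this reconstruction.

```latex
% With U_{dz} = P(d|z), V_{wz} = P(w|z), and \Sigma = diag(P(z)), the joint
% probabilities factor analogously to an SVD:
P \;=\; U \Sigma V^{\top}, \qquad
P_{dw} \;=\; P(d, w) \;=\; \sum_{z} P(z)\, P(d \mid z)\, P(w \mid z),

% but with nonnegativity and L1 normalization instead of orthogonality:
U_{dz} \ge 0,\; V_{wz} \ge 0,\; \Sigma_{zz} \ge 0, \qquad
\sum_{d} U_{dz} = 1,\quad \sum_{w} V_{wz} = 1,\quad \sum_{z} \Sigma_{zz} = 1 .
```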
38. Positive Matrix Decomposition vs. SVD
- Mixture decomposition in matrix notation, compared to the SVD:
- Probabilistic approach vs. linear-algebra decomposition
- Conditional independence assumption replaces the outer product
- Class-conditional distributions replace left/right eigenvectors
- Maximum likelihood instead of the minimum-L2-norm criterion
39. Expectation Maximization Algorithm
- Maximizing the log-likelihood by (tempered) EM iterations
- E-step: posterior probabilities of the latent variables, i.e., the probability that a term occurrence w within d is explained by topic z
- M-step: maximization of the expected complete-data log-likelihood
(The update equations are reconstructed below.)
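The standard (tempered) EM updates for this model, reconstructed here since the slide's formulas did not survive extraction; β is the inverse temperature of tempered EM (β = 1 gives plain EM).

```latex
% E-step: posterior probability that the occurrence of w in d is explained by z,
P(z \mid d, w) \;=\;
  \frac{\bigl[P(z \mid d)\, P(w \mid z)\bigr]^{\beta}}
       {\sum_{z'} \bigl[P(z' \mid d)\, P(w \mid z')\bigr]^{\beta}} .

% M-step: re-estimate the multinomials from expected counts,
P(w \mid z) \;\propto\; \sum_{d} n(d, w)\, P(z \mid d, w), \qquad
P(z \mid d) \;\propto\; \sum_{w} n(d, w)\, P(z \mid d, w).
```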
40. Example: Science Magazine Papers
- Dataset with approx. 12K papers from Science Magazine
- Selected concepts from a model with K = 200
41. Example: TDT1 News Stories
- TDT1 document collection with approx. 16,000 news stories (Reuters, CNN, years 1994/95)
- Results based on a decomposition with 128 concepts
- The 2 main factors each for the terms "flight" and "love" (most probable words, by P(w|z)):
- love: home family like just kids mother life happy friends cnn
- love: film movie music new best hollywood love actor entertainment star
- flight: plane airport crash flight safety aircraft air passenger board airline
- flight: space shuttle mission astronauts launch station crew nasa satellite earth
42. Folding-in a Document/Query
- TDT1 collection, approx. 16,000 news stories
- PLSA model with 128 dimensions
- Query keywords: aid food medical people UN war
- 4 most probable factors for the query (selected factors with their most probable keywords):
- un bosnian serbs bosnia serb sarajevo nato peacekeep. nations peace bihac war
- iraq iraqui sanctions kuwait un council gulf saddam baghdad hussein resolution border
- refugees aid rwanda relief people camps zaire camp food rwandan un goma
- building city people rescue buildings workers kobe victims area earthquake disaster missing
- Track posteriors for every keyword (see the folding-in sketch below)
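A minimal Python sketch of the folding-in procedure: the trained topic-word distributions P(w|z) are kept fixed and EM is run only over the query's mixing proportions P(z|q). Function and variable names, and the toy usage example, are illustrative assumptions.

```python
import numpy as np

def fold_in(query_counts, p_w_given_z, n_iter=50, beta=1.0):
    """EM folding-in: fit P(z|q) for a new query/document while keeping
    the trained topic-word distributions P(w|z) fixed.

    query_counts : (W,) term counts of the new query/document
    p_w_given_z  : (W, K) trained topic-word probabilities
    """
    W, K = p_w_given_z.shape
    p_z_given_q = np.full(K, 1.0 / K)                        # uniform initialization
    for _ in range(n_iter):
        # E-step: posterior P(z | q, w) for every word (tempered by beta)
        joint = (p_w_given_z * p_z_given_q) ** beta          # (W, K)
        post = joint / (joint.sum(axis=1, keepdims=True) + 1e-12)
        # M-step, restricted to the query's mixing proportions
        p_z_given_q = (query_counts[:, None] * post).sum(axis=0)
        p_z_given_q /= p_z_given_q.sum()
    return p_z_given_q

# Illustrative usage with a random stand-in for a trained model (assumption):
rng = np.random.default_rng(0)
P_wz = rng.dirichlet(np.ones(200), size=4).T                 # (W=200, K=4)
q = np.zeros(200); q[[3, 17, 42]] = 1                        # toy query term counts
print(fold_in(q, P_wz))
```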
43-46. Folding-in a Document/Query (EM iterations)
- [Figure: posterior probabilities of the four selected factors (Bosnia, Iraq, Rwanda refugees, Kobe earthquake) for each query keyword, shown after folding-in iterations 1, 2, 5, and at convergence]
47. Experiments: Precision-Recall
- 4 test collections (each with approx. 1,000-3,500 docs)
48. Experimental Results vs. TF-IDF
- Average precision-recall
49. Experimental Results vs. TF-IDF
- Relative gain in average precision-recall
50. From Probabilistic Models to Kernels: The Fisher Kernel
- Use the idea of a Fisher kernel
- Main idea: derive a kernel or similarity function from a generative model
- How do ML estimates of the parameters change around a point in sample space?
- Derive Fisher scores from the model
- Kernel/similarity function (see below)
T. Jaakkola & D. Haussler, Exploiting Generative Models for Discriminative Training, NIPS 1999.
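The generic Fisher-kernel construction of Jaakkola & Haussler, which the derivation on the next slides specializes to PLSA:

```latex
% Fisher score: sensitivity of the log-likelihood to the model parameters,
u_x \;=\; \nabla_{\theta} \log P(x \mid \theta),

% Fisher kernel: inner product of scores under the Fisher information metric,
K(x, x') \;=\; u_x^{\top}\, I(\theta)^{-1}\, u_{x'},
\qquad
I(\theta) \;=\; \mathbb{E}_{x \sim P(\cdot \mid \theta)}\!\left[ u_x u_x^{\top} \right].
```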
51. Semantic Kernel from PLSA: Outline
- Outline of the technical derivation:
- Parameterize the multinomials by variance-stabilizing parameters (square-root parameterization)
- Assume information orthogonality of the parameters of different multinomials (approximation)
- In each block, an isometric embedding with constant Fisher information is obtained (the inversion problem for the information matrix is circumvented)
- ... and the result:
52. Semantic Kernel from PLSA: Result
- K1 essentially reduces to the Vector Space Model (!)
53. Text Categorization: SVM with PLSA
- Standard text collection Reuters-21578 (5 main categories), with a standard kernel and the PLSA (Fisher) kernel
- Substantial improvement if additional unlabeled documents are available
54. Latent Class Analysis: Example
- Document collection with approx. 1,400 abstracts on clustering (INSPEC 1991-1997); preprocessing: stemming, stop word list
- 4 main factors (K = 128) for the term SEGMENT (most probable words):
- image segmentation: imag SEGMENT textur color tissu brain slice cluster mri volum
- motion segmentation: video sequenc motion frame scene SEGMENT shot imag cluster visual
- line matching: constraint line match locat imag geometr impos SEGMENT fundament recogn
- speech recognition: speaker speech recogni signal train HMM sourc speakerindep. SEGMENT sound
55. Document Similarity Example (1)
- Relative similarity (VSM): 1.4; relative similarity (PLSA): 0.7
- Abstract: "Unknown-multiple signal source clustering problem using ergodic HMM and applied to speaker classification. The authors consider signals originated from a sequence of sources. More specifically, the problems of segmenting such signals and relating the segments to their sources are addressed. This issue has wide applications in many fields. The report describes a resolution method that is based on an ergodic hidden Markov model (HMM), in which each HMM state corresponds to a signal source."
- Factor probabilities (image, speech, video, line): 0.0002, 0.6689, 0.0455, 0.0000
56. Document Similarity Example (2)
- Relative similarity (VSM): 1.0; relative similarity (PLSA): 0.5
- Blatt, M., Wiseman, S., Domany, E.: Clustering data through an analogy to the Potts model. "A new approach for clustering is proposed. This method is based on an analogy to a physical model: the ferromagnetic Potts model at thermal equilibrium is used as an analog computer for this hard optimization problem. We do not assume any structure of the underlying distribution of the data. Phase space of the Potts model is divided into three regions: ferromagnetic, super-paramagnetic and paramagnetic phases. The region of interest is that corresponding to the super-paramagnetic one, where domains of aligned spins appear. The range of temperatures where these structures are stable is indicated by ..."
- McCalpin, J.P., Nishenko, S.P.: Holocene paleoseismicity, temporal clustering, and probabilities of future large (M>7) earthquakes on the Wasatch fault zone, Utah. "The chronology of M>7 paleoearthquakes on the central five segments of the Wasatch fault zone (WFZ) contains 16 earthquakes in the past 5500 years with an average repeat time of 350 years. Four of the central five segments ruptured between 620±30 and 1230±60 calendar years B.P. The remaining segment (Brigham City segment) has not ruptured in the past 2120±100 years. Comparison of the WFZ space-time diagram of paleoearthquakes with synthetic paleoseismic histories indicates that the observed temporal clusters and gaps have about an equal probability (depending on model assumptions) of reflecting random coincidence as opposed to intersegment contagion. Regional seismicity suggests ..."
57. Features of IR Methods

Features                         LSA          PLSA
Quantitative relevance score     yes          yes
Partial query matching           yes          yes
Document similarity              yes          yes
Word correlations, synonyms      yes          yes
Low-dimensional representation   yes          yes
Notional families, concepts      not really   yes
Dealing with polysemy            no           yes
Probabilistic model              no           yes
Sparse representation            no           yes
58. Learning (from) Hyperlink Graphs
59. The Importance of Hyperlinks in IR
- Hyperlinks provide latent human annotation
- Hyperlinks represent an implicit endorsement of the page being pointed to
- Social structures are reflected in the Web graph (cyber/virtual/Web communities)
- Link structure allows assessment of page authority:
- goes beyond content-based analysis
- potentially discriminates between high- and low-quality sites
60. HITS (Hyperlink-Induced Topic Search)
- Jon Kleinberg and the Smart group (IBM)
- HITS:
- Retrieve a subset of Web pages based on a query-based search (result set, expanded to a context graph)
- Extract the hyperlink graph of the pages in the subset
- Rescoring method with hub and authority weights, using the adjacency matrix of the Web subgraph
- Solution: left/right eigenvectors (SVD); see the power-iteration sketch below
- [Figure: authority score of page q, hub score of page p]
J. Kleinberg, Authoritative Sources in a Hyperlinked Environment, 1998.
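A compact Python sketch of the HITS rescoring step as power iteration on the adjacency matrix; the tiny example graph is an illustrative assumption.

```python
import numpy as np

def hits(adjacency, n_iter=100):
    """Hub/authority scores for a (small) Web subgraph, as in Kleinberg's HITS.

    adjacency : (n, n) matrix with A[p, q] = 1 if page p links to page q.
    Iterating a <- A^T h and h <- A a (with normalization) converges to the
    principal right/left singular vectors of A.
    """
    A = np.asarray(adjacency, dtype=float)
    n = A.shape[0]
    hubs = np.ones(n)
    auths = np.ones(n)
    for _ in range(n_iter):
        auths = A.T @ hubs                       # authority: pointed to by good hubs
        auths /= np.linalg.norm(auths) or 1.0
        hubs = A @ auths                         # hub: points to good authorities
        hubs /= np.linalg.norm(hubs) or 1.0
    return hubs, auths

# Tiny illustrative graph (assumption): pages 0 and 1 both link to page 2.
A = np.array([[0, 0, 1],
              [0, 0, 1],
              [0, 0, 0]])
print(hits(A))
```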
61. Learning a Semantic Model of the Web
- Making sense of the text:
- Probabilistic latent semantic analysis
- Automatically identifies concepts and topics
- Making sense of the link structure:
- Probabilistic graph model, i.e., a predictive model for additional links/nodes based on existing ones
- Centered around the notion of Web communities
- Probabilistic version of HITS
- Enables prediction of the existence of hyperlinks and estimation of the entropy of the Web graph
62. Finding Web Communities
- Web community: densely connected bipartite subgraph
- [Figure: source nodes linking to target nodes; source and target sets may be identical]
63. Decomposing the Web Graph
- Links (probabilistically) belong to exactly one community; nodes may belong to multiple communities.
64. Linking Hyperlinks and Content
- PLSA and PHITS (probabilistic HITS) can be combined into one joint decomposition model (see the sketch below)
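A hedged sketch of the joint objective, following the structure of the cited Cohn & Hofmann paper; the weight α and the exact notation are assumptions of this reconstruction. Content and links share the document-specific mixing proportions P(z|d).

```latex
% n(d,w): term counts; A(d,d'): indicator of a hyperlink from d to d';
% alpha in [0,1] trades off content evidence against link evidence.
\mathcal{L} \;=\; \sum_{d} \Bigl[
    \alpha \sum_{w} n(d, w) \log \sum_{z} P(w \mid z)\, P(z \mid d)
  \;+\; (1 - \alpha) \sum_{d'} A(d, d') \log \sum_{z} P(d' \mid z)\, P(z \mid d)
  \Bigr].
```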
65. Ulysses Webs: Space, War, and Genius (no heroes wanted)
- Decomposition of a base set generated from AltaVista with the query "Ulysses"
- Combined decomposition based on links and text; three factors with their most probable terms and URLs:
- Factor 1: grant 0.019197, s 0.017092, ulysses 0.013781, online 0.006809, war 0.006619, school 0.005966, poetry 0.005762, president 0.005259, civil 0.005065; www.lib.siu.edu/projects/usgrant/ 0.019358, www.whitehouse.gov/WH/glimpse/presidents/ug18.html 0.017598, saints.css.edu/mkelsey/gppg.html 0.015838
- Factor 2: page 0.020032, ulysses 0.013361, new 0.010455, web 0.009060, site 0.009009, joyce 0.008430, net 0.007799, teachers 0.007236, information 0.007170; http://www.purchase.edu/Joyce/Ulysses.htm 0.008469, http://www.bibliomania.com/Fiction/joyce/ulysses/index.html 0.007274, http://teachers.net/chatroom/ 0.005082
- Factor 3: ulysses 0.022082, space 0.015334, page 0.013885, home 0.011904, nasa 0.008915, science 0.007417, solar 0.007143, esa 0.006757, mission 0.006090; ulysses.jpl.nasa.gov/ 0.028583, helio.estec.esa.nl/ulysses 0.026384, www.sp.ph.ic.ak.uk/Ulysses 0.026384
D. Cohn & T. Hofmann, The Missing Link, NIPS 2001.
66. Collaborative Filtering
67. Personalized Information Filtering
- [Figure: users/customers connected to items by judgements/selections such as "likes" or "has seen"]
68. Predicting Preferences and Actions
- User profile: Dr. Strangelove, Three Colors: Blue, Fargo, Pretty Woman, ...
- Predict: which movie next? what rating?
69. Collaborative and Content-Based Filtering
- Collaborative/social filtering:
- Properties of persons, or similarities between persons, are used to improve predictions
- Makes use of user profile data
- Formally: the starting point is a sparse matrix of user ratings
- Content-based filtering:
- Properties of objects, or similarities between objects, are used to improve predictions
70. PLSA for Predicting User Ratings
- Multi-valued (or real-valued) rating v
- [Figure: graphical model with user u, item y, latent state z, and vote v]
- The preference v is independent of the person u, given the latent state z (community-based variant)
- Each user is represented by a specific probability distribution over latent states
- Analogy to IR: users ≈ documents, items ≈ terms
(A plausible reading of the model is written out below.)
71. PLSA vs. Memory-Based Approaches
- Standard approach, memory-based:
- Given the active user, compute the correlation with all user profiles in the database (e.g., Pearson)
- Transform the correlations into relative weights and perform a weighted prediction over the neighbors (see the sketch below)
- PLSA:
- Explicitly decomposes preferences: interests are inherently multi-dimensional; no global similarity function is used (!)
- Probabilistic model
- Data mining: interest groups
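A minimal Python sketch of the memory-based baseline described above (Pearson-weighted deviation-from-mean prediction); the toy rating matrix and all function names are illustrative assumptions.

```python
import numpy as np

def pearson(a, b, mask):
    """Pearson correlation between two users over their co-rated items."""
    if mask.sum() < 2:
        return 0.0
    x, y = a[mask], b[mask]
    xc, yc = x - x.mean(), y - y.mean()
    denom = np.sqrt((xc ** 2).sum() * (yc ** 2).sum())
    return float((xc * yc).sum() / denom) if denom > 0 else 0.0

def predict(ratings, rated, active, item):
    """Memory-based prediction for (active user, item): the active user's mean
    rating plus a correlation-weighted average of the neighbors' deviations.

    ratings : (U, I) rating matrix; rated : (U, I) boolean mask of observed votes.
    """
    base = ratings[active, rated[active]].mean()
    num = den = 0.0
    for u in range(ratings.shape[0]):
        if u == active or not rated[u, item]:
            continue
        w = pearson(ratings[active], ratings[u], rated[active] & rated[u])
        num += w * (ratings[u, item] - ratings[u, rated[u]].mean())
        den += abs(w)
    return base + num / den if den > 0 else base

# Tiny illustrative example (assumption): 3 users x 4 items, 0 = unrated.
R = np.array([[5, 3, 0, 1], [4, 0, 4, 1], [1, 1, 5, 5]], dtype=float)
M = R > 0
print(predict(R, M, active=0, item=2))
```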
72. EachMovie Data Set (I)
- EachMovie: >40K users, >1.6K movies, >2M votes
- Experimental evaluation: comparison with a memory-based method (competitive), leave-one-out protocol
- Prediction accuracy
73. EachMovie Data Set (II)
74. EachMovie Data Set (III)
- Ranking score: exponential fall-off of weights with position in the recommendation list (a common form is given below)
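A hedged reconstruction of such a ranking score: presumably the half-life utility of Breese et al. (1998), in which the vote for the item at rank j of user a's list is discounted exponentially with its position.

```latex
% d is a neutral ("don't care") vote, alpha the half-life: the rank at which
% the weight has dropped to one half; scores are usually reported relative to
% the maximum achievable R_a.
R_a \;=\; \sum_{j} \frac{\max\bigl(v_{a,j} - d,\; 0\bigr)}{2^{(j-1)/(\alpha-1)}} .
```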
75. Interest Groups, EachMovie
76. Dis-Interest Groups, EachMovie
77. Open Problems & Conclusions
78. Scalability of Matrix Decomposition
- RecomMind Inc., retrieval engine:
- >1M documents
- >50K vocabulary
- >1K concepts
- Internet Archive (www.archive.org):
- Large-scale Web experiments, >10M sites
79. Conclusion: Matrix Decomposition
- Enables semantic document indexing: concepts, notional families
- Increased robustness in information retrieval
- Text/data mining: finding regularities and patterns
- Improved categorization by providing more suitable document representations
- The probabilistic nature of the models allows the use of formal inference
- Very versatile: term-document matrix, adjacency matrix, rating matrix, etc.
80. Open Problems
- Conceptual:
- Bayesian model learning and model combination
- Distributed learning of latent class models
- Relational Bayesian networks (Koller et al.)
- Principled ways to exploit sparseness in algorithm design
- Beyond bag-of-words models (string kernels, bigram language models)
- Applications:
- Combining content filtering with collaborative filtering
- Personalized information retrieval
- Interactive retrieval using extracted structure
- Multimedia retrieval
- New application domains