Title: Matrix Decomposition Methods in Information Retrieval
1. Matrix Decomposition Methods in Information Retrieval
- Thomas Hofmann
- Department of Computer Science
- Brown University
- www.cs.brown.edu/people/th
- (Chief Scientist, RecomMind Inc.)
In collaboration with Jan Puzicha (UC Berkeley / RecomMind) and David Cohen (CMU / Burning Glass)
2. Overview
- Introduction: A Brief History of Mechanical IR
- Latent Semantic Analysis
- Probabilistic Latent Semantic Analysis
- Learning (from) Hyperlink Graphs
- Collaborative Filtering
- Future Work and Conclusion
3. Introduction: A Brief History of Mechanical IR
4. Memex: As We May Think
- Vannevar Bush (1945)
- The idea of an easily accessible, individually configurable storehouse of knowledge; the beginning of the literature on mechanized information retrieval
- "Consider a future device for individual use, which is a sort of mechanized private file and library. It needs a name, and to coin one at random, memex will do. A memex is a device in which an individual stores all his books, records, and communications, and which is mechanized so that it may be consulted with exceeding speed and flexibility. It is an enlarged intimate supplement to his memory."
- "The world has arrived at an age of cheap complex devices of great reliability and something is bound to come of it."
5. Memex: As We May Think
- Vannevar Bush (1945)
- The civilizational challenge:
- "The difficulty seems to be, not so much that we publish unduly in view of the extent and variety of present-day interests, but rather that publication has been extended far beyond our present ability to make real use of the record. The summation of human experience is being expanded at a prodigious rate, and the means we use for threading through the consequent maze to the momentarily important item is the same as was used in the days of square-rigged ships."
V. Bush, "As We May Think," Atlantic Monthly, 176 (1945), pp. 101-108.
6. The Thesaurus Approach
- Hans Peter Luhn (1957, 1961)
- Words of similar or related meaning are grouped into notional families
- Encoding of documents in terms of notional elements
- Matching by measuring the degree of notional similarity
- A common language for annotating documents; key word in context (KWIC) indexing
- "The faculty of interpretation is beyond the talent of machines."
- Statistical cues extracted by machines to assist the human indexer; a vocabulary method for detecting similarities
H.P. Luhn, A Statistical Approach to Mechanical Literature Searching, New York: IBM Research Center, 1957.
H.P. Luhn, The Automatic Derivation of Information Retrieval Encodements from Machine-Readable Text, Information Retrieval and Machine Translation, 3(2), pp. 1021-1028, 1961.
7. To Punch or Not to Punch
- T. Joyce & R.M. Needham (1958)
- Lattices / hierarchies of search terms
- "As in other systems, the documents are represented by holes in punched cards which represent the various terms, and in addition, when a hole is punched in any term card, all the terms at higher levels of the lattice are also punched."
- The post-coordinate revolution: card sorting at search time!
- "Investigations to lessen the physical work are continuing."
T. Joyce & R.M. Needham, The Thesaurus Approach to Information Retrieval, American Documentation, 9, pp. 192-197, 1958.
8. Term Associations
- Lauren B. Doyle (1962)
- Unusual co-occurrences of pairs of words reveal associations of words in text
- Statistical testing: chi-square and Pearson correlation coefficient to determine pairwise correlations
- Term association maps for interactive retrieval
- Today: semantic maps
L.B. Doyle, Indexing and Abstracting by Association, Unisys Corporation, 1962.
9. Probabilistic Relevance Model
- M.E. Maron & J.L. Kuhns (1960)
- S.E. Robertson & K. Sparck Jones (1976)
- Various models, e.g., the binary independence model
- Problem: how to estimate these conditional probabilities?
10. Vector Space Model
- Gerard Salton (1960s/70s)
- Instead of indexing documents by selected index terms, preserve (almost) all terms in automatic indexing
- Represent documents by a high-dimensional vector
- Each term can be associated with a weight
- Geometrical interpretation
G. Salton, The SMART Retrieval System: Experiments in Automatic Document Processing, 1971.
11Term-Document Matrix
W terms in vocabulary
D documents in database
intelligence
Texas Instruments said it has developed the first
32-bit computer chip designed specifically for
artificial intelligence applications ...
term-document matrix
intelligence
artificial
interest
artifact
t
d
...
1
0
0
...
...
2
12. Documents in Inner Space
- Similarity between document and query: cosine of the angle between query and document vectors
- Retrieval method:
- Rank documents according to similarity with the query
- Term weighting schemes, for example TF-IDF
- Used in the SMART system and many successor systems; highly popular
(A minimal sketch of this retrieval step follows below.)
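To make the retrieval method above concrete, here is a minimal Python/NumPy sketch; the toy documents, the particular TF-IDF variant (sublinear tf, simple idf), and all names are illustrative assumptions, not the SMART system's exact weighting.

```python
import numpy as np

# Toy corpus; documents and vocabulary are illustrative assumptions.
docs = [
    "artificial intelligence chip designed for artificial intelligence applications",
    "interest rates and artificial sweeteners",
    "intelligence agency report on chip manufacturing",
]
query = "artificial intelligence chip"

# Build vocabulary and raw term-document count matrix A (W terms x D docs).
vocab = sorted({w for d in docs for w in d.split()})
t2i = {t: i for i, t in enumerate(vocab)}
A = np.zeros((len(vocab), len(docs)))
for j, d in enumerate(docs):
    for w in d.split():
        A[t2i[w], j] += 1

# One common TF-IDF variant: sublinear tf and idf (choices vary across systems).
df = (A > 0).sum(axis=1)
idf = np.log(len(docs) / df)
X = np.log1p(A) * idf[:, None]

# Query vector with the same weighting, then cosine ranking.
q = np.zeros(len(vocab))
for w in query.split():
    if w in t2i:
        q[t2i[w]] += 1
q = np.log1p(q) * idf

def cosine(u, v):
    n = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / n) if n > 0 else 0.0

scores = [cosine(q, X[:, j]) for j in range(len(docs))]
print(sorted(enumerate(scores), key=lambda x: -x[1]))  # documents ranked by similarity
```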
13. Advantages of the Vector Space Model
- No subjective selection of index terms
- Partial matching of queries and documents (dealing with the case where no document contains all search terms)
- Ranking according to similarity score (dealing with large result sets)
- Term weighting schemes (improve retrieval performance)
- Various extensions:
- Document clustering
- Relevance feedback (modifying the query vector)
- Geometric foundation
14. Latent Semantic Analysis
15. Limitations of the Vector Space Model
- Dimensionality
- The vector space representation is high-dimensional (several 10-100K dimensions)
- Learning and estimation have to deal with the curse of dimensionality
- Sparseness
- Document vectors are typically very sparse
- Cosine similarity can be noisy and inaccurate
- Semantics
- The inner product can only match occurrences of exactly the same terms
- The vector representation does not capture semantic relations between words
- Independence
- Bag-of-words representation
- Unable to capture phrases and semantic/syntactic regularities
16. The Lost Meaning of Words
- Ambiguity and association in natural language
- Polysemy: words often have a multitude of meanings and different types of usage (more urgent for very heterogeneous collections)
- The vector space model is unable to discriminate between different meanings of the same word
- Synonymy: different terms may have an identical or a similar meaning (weaker: words indicating the same topic)
- No associations between words are made in the vector space representation
17. Polysemy and Context
- Document similarity on the single-word level suffers from polysemy and ignores context
18. Latent Semantic Analysis
- General idea:
- Map documents (and terms) to a low-dimensional representation
- Design the mapping such that the low-dimensional space reflects semantic associations (latent semantic space)
- Compute document similarity based on the inner product in the latent semantic space
- Goals:
- Similar terms map to similar locations in the low-dimensional space
- Noise reduction by dimension reduction
19. LSA: Matrix Decomposition by SVD
- Dimension reduction by singular value decomposition of the term-document matrix
- Entries: word frequencies (possibly transformed)
- Document length normalization
- Sublinear transformation (e.g., log)
- Global term weight
- [Figure: original term-document matrix ≈ reconstructed term-document matrix, built from term/document vectors and thresholded singular values; the L2-optimal low-rank approximation]
(A small numerical sketch follows below.)
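A small numerical sketch of the SVD-based dimension reduction described above; the random stand-in matrix and the choice k = 5 are illustrative assumptions.

```python
import numpy as np

# A (W x D) stands for the possibly transformed term-document matrix from above;
# here a small random nonnegative matrix is used as an illustrative stand-in.
rng = np.random.default_rng(0)
A = rng.poisson(0.3, size=(50, 20)).astype(float)

k = 5  # number of latent dimensions (chosen ad hoc, as noted in the LSA discussion)

# Full SVD, then threshold: keep only the k largest singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

# Rank-k reconstruction: the L2 (Frobenius)-optimal approximation of A.
A_k = U_k @ np.diag(s_k) @ Vt_k

# Low-dimensional representations: terms via U_k, documents via diag(s_k) @ Vt_k.
term_latent = U_k                    # shape (W, k)
doc_latent = np.diag(s_k) @ Vt_k     # shape (k, D)

# Document-document similarities computed in the latent space.
sims = doc_latent.T @ doc_latent
print(np.linalg.norm(A - A_k), sims.shape)
```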
20. Background: SVD
- Singular Value Decomposition, definition:
- U and V have orthonormal columns
- Sigma is diagonal with (ordered) singular values
- Properties:
- Existence & uniqueness
- Thresholding small singular values yields an optimal low-rank approximation (in the sense of the Frobenius norm)
(The definition and the low-rank property are spelled out below.)
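The definition and approximation property referred to above, written out; this is standard linear algebra, with the symbol A for the W x D term-document matrix assumed for this sketch.

```latex
% Singular value decomposition of the term-document matrix A:
A = U \Sigma V^{\top}, \qquad
U^{\top} U = I, \quad V^{\top} V = I, \qquad
\Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_r), \quad
\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0 .

% Eckart--Young: truncating to the k largest singular values gives the best
% rank-k approximation in the Frobenius norm,
A_k = U_k \Sigma_k V_k^{\top}
    = \arg\min_{\operatorname{rank}(B) \le k} \| A - B \|_F .
```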
21. SVD and PCA
- If (!) the rows of the term-document matrix were shifted so that their mean is zero, then the SVD would essentially perform a projection onto the principal axes (i.e., PCA)
- Yet this centering would destroy the sparseness of the term-document matrix (and consequently might hurt the performance of SVD methods)
22. Canonical Analysis
- Hirschfeld 1935, Hotelling 1936, Fisher 1940
- Correlation analysis for contingency tables
23. Canonical Correspondence Analysis
- Correspondence analysis (as a method of scaling)
- Guttman 1941, Torgerson 1958, Benzécri 1969, Hill 1974
- Whittaker 1967: gradient analysis
- Solutions: unit vectors and scores of canonical analysis
- SVD of a rescaled count matrix (not exactly what is done in LSA); see the formula below
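The rescaling formula did not survive extraction; the following is a plausible reconstruction based on the standard correspondence-analysis rescaling by row and column marginals, not taken verbatim from the slide.

```latex
% Correspondence analysis applies the SVD to counts rescaled by the marginals,
\tilde{n}_{ij} \;=\; \frac{n_{ij}}{\sqrt{n_{i\cdot}\, n_{\cdot j}}},
\qquad n_{i\cdot} = \sum_{j} n_{ij}, \quad n_{\cdot j} = \sum_{i} n_{ij},
% whereas LSA applies the SVD to (transformed) raw counts,
% hence "not exactly what is done in LSA".
```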
24. Semantic Inner Product / Kernel
- Similarity = inner product in the lower-dimensional space
- For a given decomposition, additional documents or queries can be mapped into the semantic space ("folding-in")
- Hence, a new document/query q obtains a lower-dimensional representation (see the sketch below)
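A hedged sketch of folding-in with the truncated SVD A_k = U_k Σ_k V_kᵀ (standard LSA notation); conventions differ on whether the singular values are reapplied when computing similarities, so the similarity line is one common choice rather than the slide's exact formula.

```latex
% Latent representation of a new document/query vector q (given in term space):
\hat{q} \;=\; \Sigma_k^{-1} U_k^{\top} q ,
% and the semantic inner product between q and a document d in latent space:
\operatorname{sim}(q, d) \;\propto\; \hat{q}^{\top} \hat{d} .
```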
25. Term Associations from LSA
- [Figure: two terms ("Term 1", "Term 2") mapping close to a shared concept direction in latent space; taken from a slide by S. Dumais]
26. LSA Discussion
- Pros:
- The low-dimensional document representation is able to capture synonyms
- Noise removal and robustness by dimension reduction
- Experimental advantages over the naïve vector space model
- Cons:
- Formally: the L2 norm is inappropriate as a distance function for count vectors (the reconstruction may contain negative entries)
- Conceptually:
- The problem of polysemy is not addressed; principle of linear superposition, no active disambiguation
- The context of terms is not taken into account
- Directions in latent space are hard to interpret
- No probabilistic model of term occurrences
- Ad hoc selection of the number of dimensions, ...
27. Features of IR Methods

Features                         VSM    LSA
Quantitative relevance score     yes    yes
Partial query matching           yes    yes
Document similarity              yes    yes
Word correlations, synonyms      no     yes
Low-dimensional representation   no     yes
Notional families, concepts      no     not really
Dealing with polysemy            no     no
Probabilistic model              no     no
Sparse representation            yes    no
28. Probabilistic Latent Semantic Analysis
29. Documents as Information Sources
- W words in vocabulary, D documents in database
- Real document: empirical probability distribution (relative word frequencies)
- Ideal document: a (memoryless) information source, from which other documents could also be generated
30. Information Source Models in IR
- Bayes' rule: probability of relevance of a document w.r.t. a query, combined with the prior probability of relevance
- Language model: probability that the query q is generated from document d
- Translation model: probability that each query term is generated ("translated") from the document
(Hedged reconstructions of both models follow below.)
J. Ponte & W.B. Croft, A Language Model Approach to Information Retrieval, SIGIR 1998.
A. Berger & J. Lafferty, Information Retrieval as Statistical Translation, SIGIR 1999.
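Hedged reconstructions of the two models named above, in their standard forms; the exact smoothing and estimation details from the slide are not recoverable.

```latex
% Query-likelihood language model (Ponte & Croft): rank documents by the
% probability that the query q = q_1 ... q_m is generated from document d,
P(q \mid d) \;=\; \prod_{i=1}^{m} P(q_i \mid d) .

% Statistical-translation variant (Berger & Lafferty): a query term may be
% "translated" from any document word w,
P(q \mid d) \;=\; \prod_{i=1}^{m} \sum_{w} P(q_i \mid w)\, P(w \mid d) .
```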
31. Probabilistic Latent Semantic Analysis
- How can we learn document-specific language models? Sparseness problem, even for unigrams.
- Probabilistic dimension reduction techniques to overcome the data sparseness problem
- Factor analysis for count data: factors ≈ concepts
T. Hofmann, Probabilistic Latent Semantic Analysis, UAI 1999.
32-35. PLSA Graphical Model
- [Figure: plate diagram with latent topic z and word w inside a plate of size c(d), repeated for the N documents; the arrow into z carries P(z|d), the arrow into w carries P(w|z)]
- P(z|d) is shared by all words in a document
- P(w|z) is shared by all documents in the collection
(The corresponding generative model is written out below.)
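The generative (aspect) model corresponding to the plate diagram, in the standard PLSA form of the cited paper:

```latex
% A word occurrence (d, w) is generated by first picking a latent topic z:
P(d, w) \;=\; P(d) \sum_{z} P(w \mid z)\, P(z \mid d),
\qquad
P(w \mid d) \;=\; \sum_{z} P(w \mid z)\, P(z \mid d).

% Log-likelihood of the collection, with n(d, w) the term-document counts:
\mathcal{L} \;=\; \sum_{d} \sum_{w} n(d, w) \,\log P(w \mid d).
```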
36. Probabilistic Latent Semantic Space
- Documents are represented as points in a low-dimensional sub-simplex (dimensionality reduction for probability distributions)
- [Figure: the sub-simplex spanned by the factors, embedded in the full word simplex]
- KL-divergence projection, not an orthogonal projection
37. Positive Matrix Decomposition
- Mixture decomposition in matrix notation (see the sketch below)
- Constraints:
- Non-negativity of all matrices
- Normalization according to the L1 norm
- (no orthogonality)
D.D. Lee & H.S. Seung, Learning the parts of objects by non-negative matrix factorization, Nature, 1999.
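A sketch of the mixture decomposition in matrix notation; the symbols U, Σ, V are chosen here to mirror the SVD comparison on the next slide and are an assumption of this reconstruction.

```latex
% With U_{dz} = P(d|z), V_{wz} = P(w|z), and \Sigma = diag(P(z)), the joint
% probabilities factor analogously to an SVD:
P \;=\; U \Sigma V^{\top}, \qquad
P_{dw} \;=\; P(d, w) \;=\; \sum_{z} P(z)\, P(d \mid z)\, P(w \mid z),

% but with nonnegativity and L1 normalization instead of orthogonality:
U_{dz} \ge 0,\; V_{wz} \ge 0,\; \Sigma_{zz} \ge 0, \qquad
\sum_{d} U_{dz} = 1,\quad \sum_{w} V_{wz} = 1,\quad \sum_{z} \Sigma_{zz} = 1 .
```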
38. Positive Matrix Decomposition vs. SVD
- Mixture decomposition in matrix notation, compared to the SVD:
- Probabilistic approach vs. linear-algebra decomposition
- Conditional independence assumption replaces the outer product
- Class-conditional distributions replace left/right eigenvectors
- Maximum likelihood instead of the minimum-L2-norm criterion
39. Expectation Maximization Algorithm
- Maximizing the log-likelihood by (tempered) EM iterations
- E-step: posterior probabilities of the latent variables, i.e., the probability that a term occurrence w within d is explained by topic z
- M-step: maximization of the expected complete-data log-likelihood
(The update equations are reconstructed below.)
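The standard (tempered) EM updates for this model, reconstructed here since the slide's formulas did not survive extraction; β is the inverse temperature of tempered EM (β = 1 gives plain EM).

```latex
% E-step: posterior probability that the occurrence of w in d is explained by z,
P(z \mid d, w) \;=\;
  \frac{\bigl[P(z \mid d)\, P(w \mid z)\bigr]^{\beta}}
       {\sum_{z'} \bigl[P(z' \mid d)\, P(w \mid z')\bigr]^{\beta}} .

% M-step: re-estimate the multinomials from expected counts,
P(w \mid z) \;\propto\; \sum_{d} n(d, w)\, P(z \mid d, w), \qquad
P(z \mid d) \;\propto\; \sum_{w} n(d, w)\, P(z \mid d, w).
```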
40. Example: Science Magazine Papers
- Dataset with approx. 12K papers from Science Magazine
- Selected concepts from a model with K = 200
41. Example: TDT1 News Stories
- TDT1 document collection with approx. 16,000 news stories (Reuters, CNN, years 1994/95)
- Results based on a decomposition with 128 concepts
- The 2 main factors each for the terms "flight" and "love" (most probable words, by P(w|z)):
- love: home family like just kids mother life happy friends cnn
- love: film movie music new best hollywood love actor entertainment star
- flight: plane airport crash flight safety aircraft air passenger board airline
- flight: space shuttle mission astronauts launch station crew nasa satellite earth
42. Folding-in a Document/Query
- TDT1 collection, approx. 16,000 news stories
- PLSA model with 128 dimensions
- Query keywords: aid food medical people UN war
- 4 most probable factors for the query (selected factors with their most probable keywords):
- un bosnian serbs bosnia serb sarajevo nato peacekeep. nations peace bihac war
- iraq iraqui sanctions kuwait un council gulf saddam baghdad hussein resolution border
- refugees aid rwanda relief people camps zaire camp food rwandan un goma
- building city people rescue buildings workers kobe victims area earthquake disaster missing
- Track posteriors for every keyword (see the folding-in sketch below)
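A minimal Python sketch of the folding-in procedure: the trained topic-word distributions P(w|z) are kept fixed and EM is run only over the query's mixing proportions P(z|q). Function and variable names, and the toy usage example, are illustrative assumptions.

```python
import numpy as np

def fold_in(query_counts, p_w_given_z, n_iter=50, beta=1.0):
    """EM folding-in: fit P(z|q) for a new query/document while keeping
    the trained topic-word distributions P(w|z) fixed.

    query_counts : (W,) term counts of the new query/document
    p_w_given_z  : (W, K) trained topic-word probabilities
    """
    W, K = p_w_given_z.shape
    p_z_given_q = np.full(K, 1.0 / K)                        # uniform initialization
    for _ in range(n_iter):
        # E-step: posterior P(z | q, w) for every word (tempered by beta)
        joint = (p_w_given_z * p_z_given_q) ** beta          # (W, K)
        post = joint / (joint.sum(axis=1, keepdims=True) + 1e-12)
        # M-step, restricted to the query's mixing proportions
        p_z_given_q = (query_counts[:, None] * post).sum(axis=0)
        p_z_given_q /= p_z_given_q.sum()
    return p_z_given_q

# Illustrative usage with a random stand-in for a trained model (assumption):
rng = np.random.default_rng(0)
P_wz = rng.dirichlet(np.ones(200), size=4).T                 # (W=200, K=4)
q = np.zeros(200); q[[3, 17, 42]] = 1                        # toy query term counts
print(fold_in(q, P_wz))
```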
43-46. Folding-in a Document/Query (EM iterations)
- [Figure: posterior probabilities of the four selected factors (Bosnia, Iraq, Rwanda refugees, Kobe earthquake) for each query keyword, shown after folding-in iterations 1, 2, 5, and at convergence]
47. Experiments: Precision-Recall
- 4 test collections (each with approx. 1,000-3,500 docs)
48. Experimental Results vs. TF-IDF
- Average precision-recall
49. Experimental Results vs. TF-IDF
- Relative gain in average precision-recall
50. From Probabilistic Models to Kernels: The Fisher Kernel
- Use the idea of a Fisher kernel
- Main idea: derive a kernel or similarity function from a generative model
- How do ML estimates of the parameters change around a point in sample space?
- Derive Fisher scores from the model
- Kernel/similarity function (see below)
T. Jaakkola & D. Haussler, Exploiting Generative Models for Discriminative Training, NIPS 1999.
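The generic Fisher-kernel construction of Jaakkola & Haussler, which the derivation on the next slides specializes to PLSA:

```latex
% Fisher score: sensitivity of the log-likelihood to the model parameters,
u_x \;=\; \nabla_{\theta} \log P(x \mid \theta),

% Fisher kernel: inner product of scores under the Fisher information metric,
K(x, x') \;=\; u_x^{\top}\, I(\theta)^{-1}\, u_{x'},
\qquad
I(\theta) \;=\; \mathbb{E}_{x \sim P(\cdot \mid \theta)}\!\left[ u_x u_x^{\top} \right].
```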
51. Semantic Kernel from PLSA: Outline
- Outline of the technical derivation:
- Parameterize the multinomials by variance-stabilizing parameters (square-root parameterization)
- Assume information orthogonality of the parameters of different multinomials (approximation)
- In each block, an isometric embedding with constant Fisher information is obtained (the inversion problem for the information matrix is circumvented)
- ... and the result:
52. Semantic Kernel from PLSA: Result
- K1 essentially reduces to the Vector Space Model (!)
53. Text Categorization: SVM with PLSA
- Standard text collection Reuters-21578 (5 main categories), with a standard kernel and the PLSA (Fisher) kernel
- Substantial improvement if additional unlabeled documents are available
54. Latent Class Analysis: Example
- Document collection with approx. 1,400 abstracts on clustering (INSPEC 1991-1997); preprocessing: stemming, stop word list
- 4 main factors (K = 128) for the term SEGMENT (most probable words):
- image segmentation: imag SEGMENT textur color tissu brain slice cluster mri volum
- motion segmentation: video sequenc motion frame scene SEGMENT shot imag cluster visual
- line matching: constraint line match locat imag geometr impos SEGMENT fundament recogn
- speech recognition: speaker speech recogni signal train HMM sourc speakerindep. SEGMENT sound
55. Document Similarity Example (1)
- Relative similarity (VSM): 1.4; relative similarity (PLSA): 0.7
- Abstract: "Unknown-multiple signal source clustering problem using ergodic HMM and applied to speaker classification. The authors consider signals originated from a sequence of sources. More specifically, the problems of segmenting such signals and relating the segments to their sources are addressed. This issue has wide applications in many fields. The report describes a resolution method that is based on an ergodic hidden Markov model (HMM), in which each HMM state corresponds to a signal source."
- Factor probabilities (image, speech, video, line): 0.0002, 0.6689, 0.0455, 0.0000
56. Document Similarity Example (2)
- Relative similarity (VSM): 1.0; relative similarity (PLSA): 0.5
- Blatt, M., Wiseman, S., Domany, E.: Clustering data through an analogy to the Potts model. "A new approach for clustering is proposed. This method is based on an analogy to a physical model: the ferromagnetic Potts model at thermal equilibrium is used as an analog computer for this hard optimization problem. We do not assume any structure of the underlying distribution of the data. Phase space of the Potts model is divided into three regions: ferromagnetic, super-paramagnetic and paramagnetic phases. The region of interest is that corresponding to the super-paramagnetic one, where domains of aligned spins appear. The range of temperatures where these structures are stable is indicated by ..."
- McCalpin, J.P., Nishenko, S.P.: Holocene paleoseismicity, temporal clustering, and probabilities of future large (M>7) earthquakes on the Wasatch fault zone, Utah. "The chronology of M>7 paleoearthquakes on the central five segments of the Wasatch fault zone (WFZ) contains 16 earthquakes in the past 5500 years with an average repeat time of 350 years. Four of the central five segments ruptured between 620±30 and 1230±60 calendar years B.P. The remaining segment (Brigham City segment) has not ruptured in the past 2120±100 years. Comparison of the WFZ space-time diagram of paleoearthquakes with synthetic paleoseismic histories indicates that the observed temporal clusters and gaps have about an equal probability (depending on model assumptions) of reflecting random coincidence as opposed to intersegment contagion. Regional seismicity suggests ..."
57. Features of IR Methods

Features                         LSA          PLSA
Quantitative relevance score     yes          yes
Partial query matching           yes          yes
Document similarity              yes          yes
Word correlations, synonyms      yes          yes
Low-dimensional representation   yes          yes
Notional families, concepts      not really   yes
Dealing with polysemy            no           yes
Probabilistic model              no           yes
Sparse representation            no           yes
58. Learning (from) Hyperlink Graphs
59. The Importance of Hyperlinks in IR
- Hyperlinks provide latent human annotation
- Hyperlinks represent an implicit endorsement of the page being pointed to
- Social structures are reflected in the Web graph (cyber/virtual/Web communities)
- Link structure allows assessment of page authority:
- goes beyond content-based analysis
- potentially discriminates between high- and low-quality sites
60. HITS (Hyperlink-Induced Topic Search)
- Jon Kleinberg and the Smart group (IBM)
- HITS:
- Retrieve a subset of Web pages based on a query-based search (result set, expanded to a context graph)
- Extract the hyperlink graph of the pages in the subset
- Rescoring method with hub and authority weights, using the adjacency matrix of the Web subgraph
- Solution: left/right eigenvectors (SVD); see the power-iteration sketch below
- [Figure: authority score of page q, hub score of page p]
J. Kleinberg, Authoritative Sources in a Hyperlinked Environment, 1998.
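A compact Python sketch of the HITS rescoring step as power iteration on the adjacency matrix; the tiny example graph is an illustrative assumption.

```python
import numpy as np

def hits(adjacency, n_iter=100):
    """Hub/authority scores for a (small) Web subgraph, as in Kleinberg's HITS.

    adjacency : (n, n) matrix with A[p, q] = 1 if page p links to page q.
    Iterating a <- A^T h and h <- A a (with normalization) converges to the
    principal right/left singular vectors of A.
    """
    A = np.asarray(adjacency, dtype=float)
    n = A.shape[0]
    hubs = np.ones(n)
    auths = np.ones(n)
    for _ in range(n_iter):
        auths = A.T @ hubs                       # authority: pointed to by good hubs
        auths /= np.linalg.norm(auths) or 1.0
        hubs = A @ auths                         # hub: points to good authorities
        hubs /= np.linalg.norm(hubs) or 1.0
    return hubs, auths

# Tiny illustrative graph (assumption): pages 0 and 1 both link to page 2.
A = np.array([[0, 0, 1],
              [0, 0, 1],
              [0, 0, 0]])
print(hits(A))
```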
61. Learning a Semantic Model of the Web
- Making sense of the text:
- Probabilistic latent semantic analysis
- Automatically identifies concepts and topics
- Making sense of the link structure:
- Probabilistic graph model, i.e., a predictive model for additional links/nodes based on existing ones
- Centered around the notion of Web communities
- Probabilistic version of HITS
- Enables prediction of the existence of hyperlinks and estimation of the entropy of the Web graph
62. Finding Web Communities
- Web community: densely connected bipartite subgraph
- [Figure: source nodes linking to target nodes; source and target sets may be identical]
63. Decomposing the Web Graph
- Links (probabilistically) belong to exactly one community; nodes may belong to multiple communities.
64. Linking Hyperlinks and Content
- PLSA and PHITS (probabilistic HITS) can be combined into one joint decomposition model (see the sketch below)
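A hedged sketch of the joint objective, following the structure of the cited Cohn & Hofmann paper; the weight α and the exact notation are assumptions of this reconstruction. Content and links share the document-specific mixing proportions P(z|d).

```latex
% n(d,w): term counts; A(d,d'): indicator of a hyperlink from d to d';
% alpha in [0,1] trades off content evidence against link evidence.
\mathcal{L} \;=\; \sum_{d} \Bigl[
    \alpha \sum_{w} n(d, w) \log \sum_{z} P(w \mid z)\, P(z \mid d)
  \;+\; (1 - \alpha) \sum_{d'} A(d, d') \log \sum_{z} P(d' \mid z)\, P(z \mid d)
  \Bigr].
```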
65. Ulysses Webs: Space, War, and Genius (no heroes wanted)
- Decomposition of a base set generated from AltaVista with the query "Ulysses"
- Combined decomposition based on links and text; three factors with their most probable terms and URLs:
- Factor 1: grant 0.019197, s 0.017092, ulysses 0.013781, online 0.006809, war 0.006619, school 0.005966, poetry 0.005762, president 0.005259, civil 0.005065; www.lib.siu.edu/projects/usgrant/ 0.019358, www.whitehouse.gov/WH/glimpse/presidents/ug18.html 0.017598, saints.css.edu/mkelsey/gppg.html 0.015838
- Factor 2: page 0.020032, ulysses 0.013361, new 0.010455, web 0.009060, site 0.009009, joyce 0.008430, net 0.007799, teachers 0.007236, information 0.007170; http://www.purchase.edu/Joyce/Ulysses.htm 0.008469, http://www.bibliomania.com/Fiction/joyce/ulysses/index.html 0.007274, http://teachers.net/chatroom/ 0.005082
- Factor 3: ulysses 0.022082, space 0.015334, page 0.013885, home 0.011904, nasa 0.008915, science 0.007417, solar 0.007143, esa 0.006757, mission 0.006090; ulysses.jpl.nasa.gov/ 0.028583, helio.estec.esa.nl/ulysses 0.026384, www.sp.ph.ic.ak.uk/Ulysses 0.026384
D. Cohn & T. Hofmann, The Missing Link, NIPS 2001.
66. Collaborative Filtering
67. Personalized Information Filtering
- [Figure: users/customers connected to items by judgements/selections such as "likes" or "has seen"]
68. Predicting Preferences and Actions
- User profile: Dr. Strangelove, Three Colors: Blue, Fargo, Pretty Woman, ...
- Predict: which movie next? what rating?
69. Collaborative and Content-Based Filtering
- Collaborative/social filtering:
- Properties of persons, or similarities between persons, are used to improve predictions
- Makes use of user profile data
- Formally: the starting point is a sparse matrix of user ratings
- Content-based filtering:
- Properties of objects, or similarities between objects, are used to improve predictions
70. PLSA for Predicting User Ratings
- Multi-valued (or real-valued) rating v
- [Figure: graphical model with user u, item y, latent state z, and vote v]
- The preference v is independent of the person u, given the latent state z (community-based variant)
- Each user is represented by a specific probability distribution over latent states
- Analogy to IR: users ≈ documents, items ≈ terms
(A plausible reading of the model is written out below.)
71. PLSA vs. Memory-Based Approaches
- Standard approach, memory-based:
- Given the active user, compute the correlation with all user profiles in the database (e.g., Pearson)
- Transform the correlations into relative weights and perform a weighted prediction over the neighbors (see the sketch below)
- PLSA:
- Explicitly decomposes preferences: interests are inherently multi-dimensional; no global similarity function is used (!)
- Probabilistic model
- Data mining: interest groups
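A minimal Python sketch of the memory-based baseline described above (Pearson-weighted deviation-from-mean prediction); the toy rating matrix and all function names are illustrative assumptions.

```python
import numpy as np

def pearson(a, b, mask):
    """Pearson correlation between two users over their co-rated items."""
    if mask.sum() < 2:
        return 0.0
    x, y = a[mask], b[mask]
    xc, yc = x - x.mean(), y - y.mean()
    denom = np.sqrt((xc ** 2).sum() * (yc ** 2).sum())
    return float((xc * yc).sum() / denom) if denom > 0 else 0.0

def predict(ratings, rated, active, item):
    """Memory-based prediction for (active user, item): the active user's mean
    rating plus a correlation-weighted average of the neighbors' deviations.

    ratings : (U, I) rating matrix; rated : (U, I) boolean mask of observed votes.
    """
    base = ratings[active, rated[active]].mean()
    num = den = 0.0
    for u in range(ratings.shape[0]):
        if u == active or not rated[u, item]:
            continue
        w = pearson(ratings[active], ratings[u], rated[active] & rated[u])
        num += w * (ratings[u, item] - ratings[u, rated[u]].mean())
        den += abs(w)
    return base + num / den if den > 0 else base

# Tiny illustrative example (assumption): 3 users x 4 items, 0 = unrated.
R = np.array([[5, 3, 0, 1], [4, 0, 4, 1], [1, 1, 5, 5]], dtype=float)
M = R > 0
print(predict(R, M, active=0, item=2))
```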
72. EachMovie Data Set (I)
- EachMovie: >40K users, >1.6K movies, >2M votes
- Experimental evaluation: comparison with a memory-based method (competitive), leave-one-out protocol
- Prediction accuracy
73. EachMovie Data Set (II)
74. EachMovie Data Set (III)
- Ranking score: exponential fall-off of weights with position in the recommendation list (a common form is given below)
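A hedged reconstruction of such a ranking score: presumably the half-life utility of Breese et al. (1998), in which the vote for the item at rank j of user a's list is discounted exponentially with its position.

```latex
% d is a neutral ("don't care") vote, alpha the half-life: the rank at which
% the weight has dropped to one half; scores are usually reported relative to
% the maximum achievable R_a.
R_a \;=\; \sum_{j} \frac{\max\bigl(v_{a,j} - d,\; 0\bigr)}{2^{(j-1)/(\alpha-1)}} .
```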
75. Interest Groups, EachMovie
76. Dis-Interest Groups, EachMovie
77. Open Problems & Conclusions
78. Scalability of Matrix Decomposition
- RecomMind Inc., retrieval engine:
- >1M documents
- >50K vocabulary
- >1K concepts
- Internet Archive (www.archive.org):
- Large-scale Web experiments, >10M sites
79. Conclusion: Matrix Decomposition
- Enables semantic document indexing: concepts, notional families
- Increased robustness in information retrieval
- Text/data mining: finding regularities and patterns
- Improved categorization by providing more suitable document representations
- The probabilistic nature of the models allows the use of formal inference
- Very versatile: term-document matrix, adjacency matrix, rating matrix, etc.
80. Open Problems
- Conceptual:
- Bayesian model learning and model combination
- Distributed learning of latent class models
- Relational Bayesian networks (Koller et al.)
- Principled ways to exploit sparseness in algorithm design
- Beyond bag-of-words models (string kernels, bigram language models)
- Applications:
- Combining content filtering with collaborative filtering
- Personalized information retrieval
- Interactive retrieval using extracted structure
- Multimedia retrieval
- New application domains