Title: Probabilistic Latent Semantic Analysis
1Probabilistic Latent Semantic Analysis
- Thomas Hofmann
- Presented by Quang Lam Nguyen
- Based on Mummoorthy Murugesan, Cs 6901
- Background
- Model Fitting
- Basic I: Maximum Likelihood Estimation
- Basic II: EM Algorithm
- Basic III: Overfitting
- Experimental Results
- Conclusion
3Background (1/2)
Probabilistic Latent Semantic Analysis and Latent Semantic Analysis
- Latent: present but not evident, hidden
- Semantic: meaning
- Hidden meaning of terms and their occurrences in documents
4Background (2/2)
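- Lexical space with N dimensions: "Du hast nicht alle Tassen im Schrank" (German idiom, literally "You don't have all your cups in the cupboard")
- Semantic (latent) space with K < N dimensions: "Du bist verrückt" ("You are crazy")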
5The Setting
- Set of N documents
- $D = \{d_1, \dots, d_N\}$
- Set of M words
- $W = \{w_1, \dots, w_M\}$
- Set of K latent classes
- $Z = \{z_1, \dots, z_K\}$
6Latent Semantic Indexing (1/2)
- Term-document matrix A of size N × M represents the frequency counts
- Singular Value Decomposition (SVD)
- $A_{n \times m} = U_{n \times n} \, E_{n \times m} \, V^{T}_{m \times m}$
- Keep only the k largest singular values from E
- $\hat{A}_{n \times m} = U_{n \times k} \, E_{k \times k} \, V^{T}_{k \times m}$
- $\hat{A} \approx A$
- Each term is represented by k factors, i.e. a vector in k-dimensional space
- Terms with common meaning mapped to same
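The truncation step above can be sketched with NumPy as follows. This is a minimal illustration, not the presenter's code: the toy matrix is made up, and the function name `truncated_svd` is hypothetical. It assumes rows are documents and columns are terms.

```python
import numpy as np

def truncated_svd(A, k):
    """Return the rank-k approximation of A together with its factors."""
    # full_matrices=False gives the compact SVD: A = U @ diag(s) @ Vt
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]   # keep the k largest singular values
    A_k = U_k @ np.diag(s_k) @ Vt_k
    return A_k, U_k, s_k, Vt_k

# Made-up count matrix (4 documents x 5 terms), for illustration only.
A = np.array([[2, 1, 0, 0, 0],
              [1, 2, 1, 0, 0],
              [0, 0, 1, 2, 1],
              [0, 0, 0, 1, 2]], dtype=float)

A_k, U_k, s_k, Vt_k = truncated_svd(A, k=2)
print("approximation error (Frobenius norm):", np.linalg.norm(A - A_k))
```

Under the row/column assumption stated above, each column of `Vt_k` is the k-dimensional representation of one term.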
7Latent Semantic Indexing (2/2)
- LSI puts documents together even if they don't have common words
- Disadvantages
- Statistical foundation is missing
- PLSA addresses this concern!
8Probabilistic Latent Semantic Analysis
- Overview
- Aspect Model
- Model fitting with EM and TEM
- Basic I: Maximum Likelihood Estimation
- Basic II: EM Algorithm
- Basic III: Overfitting
9PLSA Overview
- Automated Document Indexing and Information Retrieval
- Identification of Latent Classes using an
Expectation Maximization (EM) Algorithm
- Shown to solve
- Polysemy and Synonymy
- Has a better statistical foundation than LSA
10PLSA Aspect Model (1/3)
- Aspect Model
- A document is a mixture of K underlying (latent) aspects
- Each aspect is represented by a distribution over words, P(w|z)
11Aspect Model (2/3)
- Latent variable model for general co-occurrence data
- Associate each observation (w,d) with a class variable $z \in Z = \{z_1, \dots, z_K\}$
- Generative model for predicting words
- Select a document d with probability P(d)
- Pick a latent class z with probability P(z|d)
- Generate a word w with probability P(w|z)
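The three generative steps can be sketched as a small sampler. The parameter arrays and the function name `sample_pairs` are hypothetical and the numbers are made up; the point is only to show the order of the draws and that z is never observed.

```python
import numpy as np

def sample_pairs(p_d, p_z_d, p_w_z, n_samples, seed=0):
    """Draw (document, word) index pairs from the aspect model."""
    rng = np.random.default_rng(seed)
    pairs = []
    for _ in range(n_samples):
        d = rng.choice(len(p_d), p=p_d)               # select a document with P(d)
        z = rng.choice(p_z_d.shape[1], p=p_z_d[d])    # pick a latent class with P(z|d)
        w = rng.choice(p_w_z.shape[1], p=p_w_z[z])    # generate a word with P(w|z)
        pairs.append((int(d), int(w)))                # z is latent, so it is not recorded
    return pairs

# Made-up parameters: 2 documents, 2 latent classes, 4 words.
p_d = np.array([0.5, 0.5])
p_z_d = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
p_w_z = np.array([[0.5, 0.5, 0.0, 0.0],
                  [0.0, 0.0, 0.5, 0.5]])

pairs = sample_pairs(p_d, p_z_d, p_w_z, n_samples=20)
```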
12Aspect Model (3/3)
- To get the joint probability model
- d and w are assumed to be conditionally independent given z
- Now we have to compute P(z), P(z|d), P(w|z). We are given just the documents (d) and the words (w).
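The joint probability implied by the generative steps (select d, pick z, generate w) can be written out as below. This is a reconstruction from those steps, since the slide's own formula is not preserved; the second line is the equivalent symmetric form, which follows from Bayes' rule, P(d) P(z|d) = P(z) P(d|z).

```latex
\begin{align}
P(d, w) &= P(d)\, P(w \mid d), \qquad P(w \mid d) = \sum_{z \in Z} P(w \mid z)\, P(z \mid d) \\
P(d, w) &= \sum_{z \in Z} P(z)\, P(d \mid z)\, P(w \mid z)
\end{align}
```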
13Basic I: Maximum Likelihood Estimation
- A probability model is based on real data
- -> it has to be fit -> model fitting
- Tune the free parameters of the model to provide an optimal fit to real-world data
- Choose the parameter values that make the observed data more likely than any other values would
- Prerequisite: all quantities the likelihood depends on are observed!
14Basic II: EM Algorithm (1/2)
- Maximum Likelihood Estimation
- BUT it cannot be applied directly
- because the likelihood depends on unobserved (latent) quantities
- Iterative
- 1. Expectation Step
- 2. Maximization Step
15Basic II: EM Algorithm (2/2)
- E-Step (Expectation)
- Hidden variables are estimated: the expectation of the likelihood function is calculated with the current parameter values
- M-Step (Maximization)
- Determine the actual parameters: find the parameters that maximize the expected likelihood from the E-step (Maximum Likelihood Estimation)
16Model Fitting (1/3)
- We have the equation for the log-likelihood function from the aspect model, and we need to maximize it.
- Expectation Maximization (EM) is used for this
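For reference, the log-likelihood of the aspect model is usually written as below. The symbol n(d, w), the number of times word w occurs in document d, does not appear in the slides and is introduced here as an assumption.

```latex
\mathcal{L} = \sum_{d \in D} \sum_{w \in W} n(d, w) \, \log P(d, w),
\qquad
P(d, w) = P(d) \sum_{z \in Z} P(w \mid z)\, P(z \mid d)
```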
17E-Step Model Fitting (2/3)
- P(z|d,w) is the probability that a word w occurring in a document d is explained by aspect z
- (based on some calculations)
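The calculation behind this posterior is an application of Bayes' rule. A sketch in the parameterization of the generative model, using P(z|d) and P(w|z); the slide's own formula is not preserved, so the exact form shown there is an assumption:

```latex
P(z \mid d, w) = \frac{P(w \mid z)\, P(z \mid d)}{\sum_{z' \in Z} P(w \mid z')\, P(z' \mid d)}
```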
18M-Step Model Fitting (3/3)
- All these equations use P(z|d,w) calculated in the E-step
- Converges to a local maximum of the likelihood
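The alternation of E-step and M-step can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the method as implemented in the original experiments: it uses the P(z|d) / P(w|z) parameterization, a dense (N, M, K) array that is only practical for toy data, and a hypothetical function name `plsa_em`. The M-step updates are the standard normalized expected counts.

```python
import numpy as np

def plsa_em(counts, n_topics, n_iters=50, seed=0):
    """Fit P(w|z) and P(z|d) to a document-word count matrix with plain EM.

    counts[d, w] holds n(d, w); returns (p_w_z, p_z_d, log_likelihoods).
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    eps = 1e-12  # guards against division by zero and log(0)

    # Random initialisation, normalised so that every distribution sums to 1.
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)

    log_likelihoods = []
    for _ in range(n_iters):
        # E-step: posterior[d, w, z] = P(z|d,w) by Bayes' rule.
        joint = p_z_d[:, None, :] * p_w_z.T[None, :, :]      # shape (N, M, K)
        p_w_d = joint.sum(axis=2)                            # P(w|d), shape (N, M)
        posterior = joint / (p_w_d[:, :, None] + eps)

        # Log-likelihood of the current parameters (the constant P(d) term is omitted).
        log_likelihoods.append(float((counts * np.log(p_w_d + eps)).sum()))

        # M-step: re-estimate parameters from expected counts n(d,w) * P(z|d,w).
        weighted = counts[:, :, None] * posterior            # shape (N, M, K)
        p_w_z = weighted.sum(axis=0).T                       # shape (K, M)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + eps
        p_z_d = weighted.sum(axis=1)                         # shape (N, K)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + eps
    return p_w_z, p_z_d, log_likelihoods
```

Because EM only reaches a local maximum, different values of `seed` can give different solutions; running several restarts and keeping the best log-likelihood is a common precaution.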
19Basic III: Overfitting
- Trade-off between predictive performance on the training data and on unseen new data
- Actual aim: predict the correct output for UNSEEN data, too -> generalization
- Problem: the model may adjust too much to very specific random features of the training data -> overfitting
- -> Tempered EM
20TEM (Tempered EM)
- Introduce control parameter β
- β starts from the value of 1, and decreases
- Similar to Simulated Annealing
- β acts as an inverse temperature variable
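One way to picture the effect of the control parameter (written β here) is a tempered E-step in which the likelihood terms are raised to the power β before normalizing. The sketch below assumes the exponent is applied to the product P(w|z) P(z|d); the original TEM formulation is stated in the symmetric parameterization, so this variant and the function name `tempered_posterior` are assumptions made for illustration.

```python
import numpy as np

def tempered_posterior(p_w_z, p_z_d, beta):
    """Return P_beta(z|d,w), proportional to (P(w|z) * P(z|d)) ** beta.

    p_w_z has shape (K, M), p_z_d has shape (N, K); the result has shape (N, M, K).
    """
    eps = 1e-12  # guards against division by zero
    joint = (p_z_d[:, None, :] * p_w_z.T[None, :, :]) ** beta
    return joint / (joint.sum(axis=2, keepdims=True) + eps)

# Made-up parameters: 1 document, 2 latent classes, 2 words.
p_w_z = np.array([[0.7, 0.3],
                  [0.2, 0.8]])
p_z_d = np.array([[0.6, 0.4]])

sharp = tempered_posterior(p_w_z, p_z_d, beta=1.0)   # ordinary E-step
flat = tempered_posterior(p_w_z, p_z_d, beta=0.5)    # flatter posterior
```

With β = 1 this reduces to the ordinary E-step; with β < 1 the posterior over aspects is flatter, which dampens the fit to the training data.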
21Choosing β
- It defines
- Underfit vs. overfit
- Simple solution using held-out data (part of the training data)
- Train on the training data with β starting from 1
- Test the model with held-out data
- If improvement, continue with the same β
- If no improvement, β ← ηβ where η < 1
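The selection procedure can be sketched as a control loop. The two callables `run_em_iteration` and `heldout_score` are hypothetical hooks supplied by the caller, and the stopping rule (stop once lowering the parameter brings no further improvement, or once it would fall below `min_beta`) is an assumption, since the slide does not state when to stop.

```python
def anneal_beta(run_em_iteration, heldout_score, eta=0.9, min_beta=0.1,
                max_iters=200, tol=1e-4):
    """Choose beta using held-out data.

    run_em_iteration(beta): performs one tempered EM iteration on the training data.
    heldout_score(): returns a score on held-out data, higher is better
                     (for example the held-out log-likelihood).
    Returns (best_beta, history), where history lists (beta, score) per iteration.
    """
    beta = 1.0                      # start from beta = 1
    best_score = float("-inf")
    best_beta = beta
    just_lowered = False
    history = []
    for _ in range(max_iters):
        run_em_iteration(beta)
        score = heldout_score()
        history.append((beta, score))
        if score > best_score + tol:
            # Improvement: continue with the same beta.
            best_score, best_beta = score, beta
            just_lowered = False
        elif just_lowered or beta * eta < min_beta:
            # Lowering beta did not help (or beta is already small): stop.
            break
        else:
            # No improvement: beta <- eta * beta with eta < 1.
            beta *= eta
            just_lowered = True
    return best_beta, history
```

A caller would typically wrap a tempered EM update and a held-out log-likelihood computation in the two hooks.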
22Experimental Results
- Perplexity Comparison
- Polysemy
- Information Retrieval
23Perplexity Comparison (1/2)
- What is perplexity?
- Indicator for the quality of probability models
- A good model is less surprised by test examples
- Higher probability assigned to the test data gives lower perplexity, and thus better predictions
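Perplexity is commonly computed as the exponential of the negative average per-word log-likelihood on test data. The sketch below assumes a matrix of test counts n(d, w) and a matrix of predicted probabilities P(w|d); both array names and the function name `perplexity` are hypothetical.

```python
import numpy as np

def perplexity(counts, p_w_d):
    """exp(-sum n(d,w) log P(w|d) / sum n(d,w)) over the test counts."""
    eps = 1e-12  # guards against log(0)
    log_lik = (counts * np.log(p_w_d + eps)).sum()
    return float(np.exp(-log_lik / counts.sum()))

# Made-up test counts: 2 documents, 4 words.
test_counts = np.array([[3, 1, 0, 0],
                        [0, 0, 2, 2]], dtype=float)

uniform = np.full((2, 4), 0.25)                 # knows nothing about the documents
informed = np.array([[0.7, 0.2, 0.05, 0.05],    # puts mass where the counts are
                     [0.05, 0.05, 0.45, 0.45]])

ppl_uniform = perplexity(test_counts, uniform)
ppl_informed = perplexity(test_counts, informed)
```

A uniform model over M words has perplexity M, so values well below the vocabulary size indicate that the model predicts the test words better than chance.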
24Perplexity Comparison (2/2)
25Polysemy
- Occurrences of the word "segment" in two different contexts (image, sound) are identified
26Information Retrieval
- For natural-language queries, simple term matching does not work effectively
- Ambiguous terms
- The same queries vary due to personal styles
- Latent semantic indexing
- Creates this latent semantic space (hidden
27Comparing PLSA and LSA
- LSA and PLSA perform dimensionality reduction
- In LSA, by keeping only K singular values
- In PLSA, by having K aspects
- Comparison to SVD
- U matrix related to P(d|z) (doc to aspect)
- V matrix related to P(w|z) (aspect to term)
- E matrix related to P(z) (aspect strength)
- The main difference is the way the approximation is done
- PLSA generates a model (aspect model) and maximizes its predictive power
- Selecting the proper value of K is heuristic in LSA
- Model selection in statistics can determine the optimal K in PLSA
- PLSI consistently outperforms LSI in the experiments
- Precision gain is 100% compared to the baseline method in some cases
- PLSA has statistical theory to support it, and is thus better than LSA.