Title: Probabilistic Latent Semantic Analysis
1. Probabilistic Latent Semantic Analysis
- Thomas Hofmann
- Presented by Quang Lam Nguyen
- Based on slides by Mummoorthy Murugesan, CS 6901
2. Outline
- Background
- LSA
- PLSA
- Model Fitting
- Basics I: Maximum Likelihood Estimation
- Basics II: EM Algorithm
- Basics III: Overfitting
- Experimental Results
- Conclusion
3. Background (1/2)
Probabilistic Latent Semantic Analysis and Latent Semantic Analysis
- Latent: present but not evident, hidden
- Semantic: meaning
- The hidden meaning of terms and their occurrences in documents
4. Background (2/2)
[Figure: terms plotted in an N-dimensional lexical space versus a K << N dimensional semantic (latent) space. The German examples illustrate polysemy (Kater: hangover vs. tomcat; Bank: bank vs. bench) and synonymy (Auto / Wagen; "Du hast nicht alle Tassen im Schrank" / "Du bist verrückt", both meaning "you are crazy"); in the latent space, related terms such as Sport, Muskelkater, Park, and Einzahlung are grouped by meaning.]
5. The Setting
- Set of N documents
- D = {d_1, ..., d_N}
- Set of M words
- W = {w_1, ..., w_M}
- Set of K latent classes
- Z = {z_1, ..., z_K}
6. Latent Semantic Indexing (1/2)
- Term-document matrix A of size N x M to represent the frequency counts
- Singular Value Decomposition (SVD), as sketched in the code below
- A (N x M) = U (N x N) E (N x M) V^T (M x M)
- Keep only the k largest singular values in E
- A' (N x M) = U (N x k) E (k x k) V^T (k x M)
- A' ≈ A
- Each term is represented by k factors, i.e. a vector in k-dimensional space
- Terms with common meaning are mapped to the same direction
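A minimal sketch of this SVD-based truncation using NumPy on a toy count matrix (the matrix values and the choice k = 2 are illustrative assumptions, not data from the slides):

```python
# LSA via truncated SVD on a small toy term-document count matrix.
import numpy as np

# Toy frequency matrix A (documents x terms); values are made up.
A = np.array([
    [2, 1, 0, 0, 1],
    [1, 2, 0, 1, 0],
    [0, 0, 3, 1, 2],
    [0, 1, 2, 2, 1],
], dtype=float)

k = 2  # number of retained singular values / latent dimensions

# Full SVD: A = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-k approximation: keep only the k largest singular values.
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

print("original matrix:\n", A)
print("rank-%d approximation:\n" % k, A_k.round(2))
```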
7. Latent Semantic Indexing (2/2)
- LSI puts documents together even if they don't have common words
- Disadvantages
- The statistical foundation is missing
- PLSA addresses this concern!
8. Probabilistic Latent Semantic Analysis
- Overview
- Aspect Model
- Model fitting with EM and TEM
- Basics I: Maximum Likelihood Estimation
- Basics II: EM Algorithm
- Basics III: Overfitting
9. PLSA Overview
- Automated document indexing and information retrieval
- Identification of latent classes using an Expectation Maximization (EM) algorithm
- Shown to solve
- Polysemy and synonymy
- Has a better statistical foundation than LSA
10. PLSA: Aspect Model (1/3)
- Aspect Model
- A document is a mixture of K underlying (latent) aspects
- Each aspect is represented by a distribution over words P(w|z)
11. Aspect Model (2/3)
- Latent variable model for general co-occurrence data
- Associate each observation (w,d) with a class variable z ∈ Z = {z_1, ..., z_K}
- Generative model for predicting words (see the sketch below)
- Select a document d with probability P(d)
- Pick a latent class z with probability P(z|d)
- Generate a word w with probability P(w|z)
[Graphical model: d -> z -> w, with probabilities P(d), P(z|d), P(w|z)]
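A minimal sketch of this generative process in Python; the document names, vocabulary, and all probability tables below are made-up illustrations, not values from the paper:

```python
# Sample (d, w) pairs from a toy aspect model: select d, pick z | d, generate w | z.
import numpy as np

rng = np.random.default_rng(0)

docs = ["d1", "d2"]
words = ["bank", "deposit", "car", "engine"]

P_d = np.array([0.5, 0.5])                     # P(d)
P_z_given_d = np.array([[0.9, 0.1],            # P(z|d): rows = documents, cols = aspects
                        [0.2, 0.8]])
P_w_given_z = np.array([[0.6, 0.4, 0.0, 0.0],  # P(w|z): rows = aspects, cols = words
                        [0.0, 0.0, 0.5, 0.5]])

def generate_observation():
    """Sample one (d, w) pair following the generative story on the slide."""
    d = rng.choice(len(docs), p=P_d)
    z = rng.choice(P_z_given_d.shape[1], p=P_z_given_d[d])
    w = rng.choice(len(words), p=P_w_given_z[z])
    return docs[d], words[w]

print([generate_observation() for _ in range(5)])
```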
12. Aspect Model (3/3)
- The joint probability model follows from the generative process (see the formulation below)
- d and w are assumed to be conditionally independent given z
- Now we have to compute P(z), P(z|d), and P(w|z). We are given only the documents (d) and the words (w).
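For reference, the joint probability of the aspect model just described can be written in the asymmetric parameterization used on these slides:

```latex
P(d,w) \;=\; P(d)\, P(w \mid d), \qquad P(w \mid d) \;=\; \sum_{z \in Z} P(w \mid z)\, P(z \mid d)
```

and equivalently in the symmetric parameterization:

```latex
P(d,w) \;=\; \sum_{z \in Z} P(z)\, P(d \mid z)\, P(w \mid z)
```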
13. Basics I: Maximum Likelihood Estimation
- A probability model is based on real data
- -> it has to be fit to that data -> model fitting
- Tune the free parameters of the model to provide an optimal fit to real-world data
- Choose parameter values that make the observed data more likely than other values would
- Prerequisite: the correct parameters are known!
14. Basics II: EM Algorithm (1/2)
- Maximum Likelihood Estimation
- BUT the correct parameters are not known
- FOR they depend on unknown (latent) properties!
- Iterative procedure
- 1. Expectation step
- 2. Maximization step
15. Basics II: EM Algorithm (2/2)
- E-step (Expectation)
- The hidden variables are estimated: the expectation of the likelihood function is calculated with the current parameter values
- M-step (Maximization)
- The actual parameters are determined:
- Find the parameters that maximize the expected likelihood function (Maximum Likelihood Estimation)
16. Model Fitting (1/3)
- We have the log-likelihood function of the aspect model (see below), and we need to maximize it.
- Expectation Maximization (EM) is used for this purpose.
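Written out from the aspect model above, the log-likelihood to be maximized is (with n(d,w) denoting the count of word w in document d):

```latex
\mathcal{L} \;=\; \sum_{d \in D} \sum_{w \in W} n(d,w)\, \log P(d,w)
\;=\; \sum_{d \in D} \sum_{w \in W} n(d,w)\, \log \Big[ P(d) \sum_{z \in Z} P(w \mid z)\, P(z \mid d) \Big]
```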
17. E-Step: Model Fitting (2/3)
- It is the probability that a word w occurring in a document d is explained by aspect z
- (obtained by applying Bayes' rule to the aspect model, as written out below)
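In the asymmetric parameterization used on these slides, this posterior follows from Bayes' rule:

```latex
P(z \mid d, w) \;=\; \frac{P(z \mid d)\, P(w \mid z)}{\sum_{z' \in Z} P(z' \mid d)\, P(w \mid z')}
```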
18. M-Step: Model Fitting (3/3)
- All the re-estimation equations (written out below) use P(z|d,w) calculated in the E-step
- The procedure converges to a local maximum of the likelihood function
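In the same asymmetric parameterization, the standard M-step re-estimation equations are (with n(d,w) the count of word w in document d and n(d) the total count of words in d):

```latex
P(w \mid z) \;=\; \frac{\sum_{d} n(d,w)\, P(z \mid d, w)}{\sum_{d} \sum_{w'} n(d,w')\, P(z \mid d, w')}, \qquad
P(z \mid d) \;=\; \frac{\sum_{w} n(d,w)\, P(z \mid d, w)}{n(d)}
```

Putting the E-step and M-step together, a minimal illustrative PLSA fit in Python on a random toy count matrix (the matrix size, the choice K = 3, and the fixed iteration count are assumptions of this sketch, not values from the paper):

```python
# Illustrative PLSA fit via plain EM on a random toy count matrix.
import numpy as np

rng = np.random.default_rng(0)

n_dw = rng.integers(0, 5, size=(8, 12)).astype(float)  # toy counts n(d,w), docs x words
N, M = n_dw.shape
K = 3  # number of latent aspects

# Random initialization of P(z|d) and P(w|z), each row normalized to 1.
P_z_d = rng.random((N, K)); P_z_d /= P_z_d.sum(axis=1, keepdims=True)
P_w_z = rng.random((K, M)); P_w_z /= P_w_z.sum(axis=1, keepdims=True)

for it in range(100):
    # E-step: posterior P(z|d,w) proportional to P(z|d) * P(w|z), shape (N, M, K).
    joint = P_z_d[:, None, :] * P_w_z.T[None, :, :]
    P_z_dw = joint / joint.sum(axis=2, keepdims=True)

    # M-step: re-estimate P(w|z) and P(z|d) from counts weighted by the posterior.
    weighted = n_dw[:, :, None] * P_z_dw              # n(d,w) * P(z|d,w)
    P_w_z = weighted.sum(axis=0).T                    # sum over d, shape (K, M)
    P_w_z /= P_w_z.sum(axis=1, keepdims=True)
    P_z_d = weighted.sum(axis=1)                      # sum over w, shape (N, K)
    P_z_d /= P_z_d.sum(axis=1, keepdims=True)

# Log-likelihood of the fitted model (up to the P(d) term, which is constant here).
P_w_d = P_z_d @ P_w_z                                 # P(w|d), shape (N, M)
loglik = (n_dw * np.log(P_w_d + 1e-12)).sum()
print("log-likelihood:", round(loglik, 2))
```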
19. Basics III: Overfitting
- Trade-off between predictive performance on the training data and on unseen new data
- Actual aim: predict the correct output for UNSEEN data, too -> generalization
- Problem: the model may adjust too much to very specific random features of the training data -> overfitting
- Remedy -> Tempered EM
20. TEM (Tempered EM)
- Introduce a control parameter β
- β starts at the value 1 and is decreased
- Similar to simulated annealing
- β acts as a temperature variable
21. Choosing β
- It controls the trade-off between underfitting and overfitting
- Simple solution: use held-out data (a part of the training data), as sketched below
- Train on the training data with β starting at 1
- Test the model on the held-out data
- If performance improves, continue with the same β
- If there is no improvement, set β := ηβ where η < 1
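A minimal sketch of this β-annealing schedule in Python, assuming hypothetical helpers em_step() (one EM iteration whose E-step is tempered by the current β) and heldout_perplexity(); only the schedule from the slide is spelled out, the helpers are placeholders:

```python
# Beta-annealing control loop for tempered EM (helpers are hypothetical).
def tempered_em(model, train_data, heldout_data, eta=0.9, min_beta=0.6, max_iters=200):
    beta = 1.0
    best = float("inf")
    for _ in range(max_iters):
        model.em_step(train_data, beta=beta)            # tempered EM iteration at current beta
        perp = model.heldout_perplexity(heldout_data)   # evaluate on held-out data
        if perp < best:
            best = perp                                 # improvement: keep the same beta
        else:
            beta *= eta                                 # no improvement: beta := eta * beta, eta < 1
            if beta < min_beta:                         # stop once beta has cooled far enough
                break
    return model
```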
22. Experimental Results
- Perplexity Comparison
- Polysemy
- Information Retrieval
23. Perplexity Comparison (1/2)
- What is perplexity?
- An indicator of the quality of probability models
- The model is less surprised by a test example
- Assigning high probability to the test data gives lower perplexity, and thus better predictions (see the sketch below)
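A minimal sketch of how perplexity could be computed for a fitted aspect model, assuming the P_z_d and P_w_z tables from the EM sketch above and a held-out count matrix n_dw_test; all names here are assumptions of this sketch:

```python
# Perplexity of a fitted PLSA model on held-out counts.
import numpy as np

def perplexity(n_dw_test, P_z_d, P_w_z):
    """exp(-average log-likelihood per observed word occurrence); lower is better."""
    P_w_d = P_z_d @ P_w_z                               # P(w|d) = sum_z P(z|d) P(w|z)
    log_lik = (n_dw_test * np.log(P_w_d + 1e-12)).sum()
    total_counts = n_dw_test.sum()
    return np.exp(-log_lik / total_counts)
```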
24. Perplexity Comparison (2/2)
25. Polysemy
- Occurrences of the term "segment" in two different contexts are identified with different aspects (image vs. sound)
26. Information Retrieval
- For natural language queries, simple term matching does not work effectively
- Ambiguous terms
- Queries for the same information need vary due to personal style
- Latent semantic indexing
- Creates a latent semantic space (hidden meaning)
27. Comparing PLSA and LSA
- LSA and PLSA both perform dimensionality reduction
- In LSA, by keeping only the K largest singular values
- In PLSA, by having K aspects
- Comparison to SVD
- The U matrix corresponds to P(d|z) (document to aspect)
- The V matrix corresponds to P(w|z) (aspect to term)
- The E matrix corresponds to P(z) (aspect strength)
- The main difference is the way the approximation is done
- PLSA generates a model (the aspect model) and maximizes its predictive power
- Selecting the proper value of K is heuristic in LSA
- Statistical model selection can determine the optimal K in PLSA
28. Conclusion
- PLSI consistently outperforms LSI in the experiments
- The precision gain is around 100% compared to the baseline method in some cases
- PLSA has a statistical theory to support it, and is thus better founded than LSA