Probabilistic Latent Semantic Analysis - PowerPoint PPT Presentation

1
Probabilistic Latent Semantic Analysis
  • Thomas Hofmann
  • Presented by
  • Mummoorthy Murugesan
  • CS 690I, 03/27/2007

2
Outline
  • Latent Semantic Analysis
  • A gentle review
  • Why we need PLSA
  • Indexing
  • Information Retrieval
  • Construction of PLSI
  • Aspect Model
  • EM
  • Tempered EM
  • Experiments on the effectiveness of PLSI

3
The Setting
  • Set of N documents
  • D = {d_1, ..., d_N}
  • Set of M words
  • W = {w_1, ..., w_M}
  • Set of K latent classes
  • Z = {z_1, ..., z_K}
  • An N × M matrix to represent the frequency
    counts
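As a concrete illustration, the N × M frequency-count matrix can be built from a toy corpus like this (the documents and vocabulary here are invented for the example, not from the original experiments):

```python
import numpy as np

# Hypothetical toy corpus: N documents over a vocabulary of M words.
docs = [
    "java coffee cup",
    "java code compiler",
    "cricket bat game",
]

# Build the vocabulary W and the N x M term-frequency matrix n(d, w).
vocab = sorted({w for d in docs for w in d.split()})
index = {w: j for j, w in enumerate(vocab)}
counts = np.zeros((len(docs), len(vocab)), dtype=int)
for i, d in enumerate(docs):
    for w in d.split():
        counts[i, index[w]] += 1

print(counts.shape)  # (3, 8)
```

Each row is a document d, each column a word w, and entry (d, w) is the count n(d, w) used throughout the rest of the deck.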

4
Latent Semantic Indexing (1/4)
  • Latent: present but not evident; hidden
  • Semantic: relating to meaning
  • Captures the hidden meaning of terms and their
    occurrences in documents

5
Latent Semantic Indexing (2/4)
  • For natural language queries, simple term
    matching does not work effectively
  • Terms are ambiguous
  • The same query can be phrased differently due to
    personal style
  • Latent semantic indexing
  • Creates a latent semantic space (hidden
    meaning)

6
Latent Semantic Indexing (3/4)
  • Singular Value Decomposition (SVD)
  • A (n×m) = U (n×n) E (n×m) V^T (m×m)
  • Keep only the k largest singular values in E
  • A (n×m) ≈ U (n×k) E (k×k) V^T (k×m)
  • Convert terms and documents to points in
    k-dimensional space
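The truncated SVD above can be sketched with NumPy (the random 6×10 count matrix and k = 2 are placeholders for a real term-document matrix; E is represented by the diagonal of singular values):

```python
import numpy as np

# Placeholder term-document matrix A (n x m); in practice this would be
# the frequency-count matrix from the corpus.
rng = np.random.default_rng(0)
A = rng.integers(0, 4, size=(6, 10)).astype(float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)  # A = U diag(s) Vt
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]       # rank-k approximation

# Documents as points in the k-dimensional latent space.
doc_coords = U[:, :k] * s[:k]
print(doc_coords.shape)  # (6, 2)
```

Queries and terms are mapped into the same k-dimensional space, where similarity is measured between the projected points.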

7
Latent Semantic Indexing (4/4)
  • LSI puts documents together even if they don't
    have common words, provided
  • the docs share frequently co-occurring terms
  • Disadvantage
  • Statistical foundation is missing
  • PLSA addresses this concern!

8
Probabilistic Latent Semantic Analysis
  • Automated document indexing and information
    retrieval
  • Identification of latent classes using an
    Expectation Maximization (EM) algorithm
  • Shown to handle
  • Polysemy
  • Java could mean the coffee and also the
    programming language
  • Cricket is a game and also an insect
  • Synonymy
  • computer, pc, desktop could all mean the
    same thing
  • Has a better statistical foundation than LSA

9
PLSA
  • Aspect Model
  • Tempered EM
  • Experiment Results

10
PLSA Aspect Model
  • Aspect Model
  • Document is a mixture of underlying (latent) K
    aspects
  • Each aspect is represented by a distribution of
    words P(w|z)
  • Model fitting with Tempered EM

11
Aspect Model
  • Latent Variable model for general co-occurrence
    data
  • Associate each observation (w,d) with a class
    variable z ∈ Z = {z_1, ..., z_K}
  • Generative Model
  • Select a doc with probability P(d)
  • Pick a latent class z with probability P(z|d)
  • Generate a word w with probability P(w|z)

  (Graphical model: d → z → w, with arcs labeled
   P(d), P(z|d), P(w|z))
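The three-step generative process can be sketched as follows (all sizes and parameter tables here are hypothetical, drawn from Dirichlet distributions just to obtain valid probabilities):

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, K = 3, 5, 2  # hypothetical numbers of docs, words, latent classes

# Hypothetical model parameters (each distribution sums to 1).
P_d = np.full(N, 1.0 / N)                      # P(d)
P_z_given_d = rng.dirichlet(np.ones(K), N)     # P(z|d), N x K
P_w_given_z = rng.dirichlet(np.ones(M), K)     # P(w|z), K x M

def sample_pair():
    """One observation (d, w): d ~ P(d), z ~ P(z|d), w ~ P(w|z)."""
    d = rng.choice(N, p=P_d)
    z = rng.choice(K, p=P_z_given_d[d])
    w = rng.choice(M, p=P_w_given_z[z])
    return d, w

d, w = sample_pair()
```

Note that z is discarded after generation; only the (d, w) pair is observed, which is why z is called latent.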
12
Aspect Model
  • To get the joint probability model
  • P(d,w) = P(d) Σ_z P(z|d) P(w|z)
  • d and w are assumed independent, conditioned on
    the latent class z
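Under this parameterization, the joint P(d,w) = P(d) Σ_z P(z|d) P(w|z) can be computed for all document-word pairs with a single matrix product (a sketch with hypothetical parameters):

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, K = 3, 5, 2  # hypothetical sizes

P_d = np.full(N, 1.0 / N)                    # P(d)
P_z_given_d = rng.dirichlet(np.ones(K), N)   # P(z|d), N x K
P_w_given_z = rng.dirichlet(np.ones(M), K)   # P(w|z), K x M

# P(d, w) = P(d) * sum_z P(z|d) P(w|z), for all pairs at once.
joint = P_d[:, None] * (P_z_given_d @ P_w_given_z)   # N x M
print(np.isclose(joint.sum(), 1.0))  # True: a distribution over (d, w)
```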

13
  • Using Bayes' rule, the symmetric form is
  • P(d,w) = Σ_z P(z) P(d|z) P(w|z)

14
Advantages of this model over document clustering
  • Documents are not tied to a single cluster
    (i.e., aspect)
  • For each document d, P(z|d) defines a specific
    mixture of aspects
  • This offers more flexibility and produces
    effective modeling
  • Now we have to compute P(z), P(d|z), P(w|z),
    given just documents (d) and words (w).

15
Model fitting with Tempered EM
  • We have the log-likelihood function from the
    aspect model, and we need to maximize it
  • Expectation Maximization (EM) is used for this
    purpose
  • To avoid overfitting, tempered EM is proposed

16
EM Steps
  • E-Step
  • Expectation step where expectation of the
    likelihood function is calculated with the
    current parameter values
  • M-Step
  • Update the parameters with the calculated
    posterior probabilities
  • Find the parameters that maximize the likelihood
    function

17
E Step
  • Posterior probability that a word w occurring in
    a document d is explained by aspect z
  • P(z|d,w) = P(z) P(d|z) P(w|z) /
    Σ_z' P(z') P(d|z') P(w|z')  (by Bayes' rule)

18
M Step
  • P(w|z) ∝ Σ_d n(d,w) P(z|d,w)
  • P(d|z) ∝ Σ_w n(d,w) P(z|d,w)
  • P(z) ∝ Σ_d Σ_w n(d,w) P(z|d,w)
  • All these updates use P(z|d,w) calculated in the
    E Step
  • Converges to a local maximum of the likelihood
    function
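The E and M steps can be sketched in NumPy. This is an illustrative implementation of the aspect-model EM, not the paper's original code; the function name and the dense N × M × K posterior array are choices made here for clarity:

```python
import numpy as np

def plsa_em(counts, K, iters=50, seed=0):
    """EM for the aspect model. counts: N x M matrix of n(d, w)."""
    rng = np.random.default_rng(seed)
    N, M = counts.shape
    # Random valid initial parameters.
    P_z = np.full(K, 1.0 / K)
    P_d_given_z = rng.dirichlet(np.ones(N), K).T   # N x K, columns sum to 1
    P_w_given_z = rng.dirichlet(np.ones(M), K)     # K x M, rows sum to 1
    for _ in range(iters):
        # E-step: P(z|d,w) ∝ P(z) P(d|z) P(w|z), shape N x M x K.
        post = (P_z[None, None, :]
                * P_d_given_z[:, None, :]
                * P_w_given_z.T[None, :, :])
        post /= post.sum(axis=2, keepdims=True)
        # M-step: re-estimate from expected counts n(d,w) P(z|d,w).
        exp_counts = counts[:, :, None] * post     # N x M x K
        totals = exp_counts.sum(axis=(0, 1))       # per-aspect mass
        P_w_given_z = (exp_counts.sum(axis=0) / totals).T
        P_d_given_z = exp_counts.sum(axis=1) / totals
        P_z = totals / counts.sum()
    return P_z, P_d_given_z, P_w_given_z
```

Calling `plsa_em(counts, K=2)` on a small count matrix returns normalized estimates of P(z), P(d|z), and P(w|z).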

19
Overfitting
  • Trade-off between predictive performance on the
    training data and on unseen new data
  • Must prevent the model from overfitting the
    training data
  • Proposal: change the E-Step
  • Reduce the effect of fitting as we do more steps

20
TEM (Tempered EM)
  • Introduce a control parameter β
  • β starts at 1 and is gradually decreased
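The tempered E-step raises the likelihood terms to the power β, so that β = 1 recovers standard EM and β < 1 dampens the re-estimation. A minimal sketch (the helper name and array shapes are assumptions of this example):

```python
import numpy as np

def tempered_posterior(P_z, P_d_given_z, P_w_given_z, beta):
    """Tempered E-step: P_beta(z|d,w) ∝ P(z) [P(d|z) P(w|z)]^beta.

    P_z: (K,), P_d_given_z: (N, K), P_w_given_z: (K, M).
    Returns the N x M x K posterior; beta=1 is the standard E-step.
    """
    post = (P_z[None, None, :]
            * (P_d_given_z[:, None, :]
               * P_w_given_z.T[None, :, :]) ** beta)
    return post / post.sum(axis=2, keepdims=True)
```

As β shrinks toward 0, the posterior flattens toward P(z), which is exactly the "reduced effect of re-estimation" described on the next slide.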

21
Simulated Annealing
  • Alternate heating and cooling of materials to
    make them reach a minimum internal energy state
    and reduce defects
  • Tempered EM is similar to simulated annealing:
    β acts as a temperature variable
  • As the value of β decreases, the re-estimations
    have less effect on the expectation
    calculations

22
Choosing ß
  • How to choose a proper β?
  • It controls the
  • underfitting vs. overfitting trade-off
  • Simple solution using held-out data (part of the
    training data)
  • Train with the current β, starting from β = 1
  • Test the model on the held-out data
  • If performance improves, continue with the same β
  • If not, set β ← ηβ where η < 1

23
Perplexity Comparison(1/4)
  • Perplexity: log-averaged inverse probability on
    unseen data
  • Higher probability gives lower perplexity, i.e.,
    better predictions
  • Measured on the MED data
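The perplexity measure can be sketched as follows; here `P_w_given_d` would come from the fitted model, e.g. P(w|d) = Σ_z P(z|d) P(w|z), and the function name is an assumption of this example:

```python
import numpy as np

def perplexity(counts, P_w_given_d):
    """exp(-(sum_{d,w} n(d,w) log P(w|d)) / sum_{d,w} n(d,w)).

    counts: N x M held-out counts n(d,w);
    P_w_given_d: N x M model probabilities P(w|d).
    """
    mask = counts > 0  # skip zero counts (0 * log p contributes nothing)
    log_lik = (counts[mask] * np.log(P_w_given_d[mask])).sum()
    return np.exp(-log_lik / counts.sum())
```

Sanity check: a model that is uniform over an M-word vocabulary has perplexity exactly M, and anything that predicts the held-out words better scores lower.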

24
Topic Decomposition (2/4)
  • Abstracts of 1568 documents
  • Clustered into 128 latent classes
  • Shows word stems for the two aspects that
    contain the same word power, ranked by P(w|z)
  • Power_1: astronomy
  • Power_2: electrical engineering

25
Polysemy (3/4)
  • Occurrences of segment in two different contexts
    are identified (image, sound)

26
Information Retrieval (4/4)
  • MED: 1033 docs
  • CRAN: 1400 docs
  • CACM: 3204 docs
  • CISI: 1460 docs
  • Reporting only the best results, with K varying
    over 32, 48, 64, 80, 128
  • The combined PLSI model averages across the
    models at different K values

27
Information Retrieval (4/4)
  • Cosine similarity is the baseline
  • In LSI, the query vector q is projected into the
    reduced space
  • In PLSI, P(z|d) and P(z|q) are compared; in the
    EM iterations, only P(z|q) is adapted
    (folding-in)
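Folding a query in can be sketched as EM with P(w|z) frozen, updating only P(z|q) from the query's own word counts (an illustrative sketch; the function name and iteration count are assumptions):

```python
import numpy as np

def fold_in_query(q_counts, P_w_given_z, iters=20):
    """Estimate P(z|q) for a query with P(w|z) held fixed.

    q_counts: length-M word counts of the query;
    P_w_given_z: K x M word distributions from the trained model.
    """
    K = P_w_given_z.shape[0]
    P_z_given_q = np.full(K, 1.0 / K)  # uniform start
    for _ in range(iters):
        # E-step: P(z|q,w) ∝ P(z|q) P(w|z), for each word w.
        post = P_z_given_q[:, None] * P_w_given_z      # K x M
        post /= post.sum(axis=0, keepdims=True)
        # M-step: update only P(z|q) from the query's counts.
        P_z_given_q = ((post * q_counts[None, :]).sum(axis=1)
                       / q_counts.sum())
    return P_z_given_q
```

The resulting P(z|q) can then be compared against each document's P(z|d), e.g. by cosine similarity in aspect space.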

28
Precision-Recall results(4/4)
29
Comparing PLSA and LSA
  • LSA and PLSA perform dimensionality reduction
  • In LSA, by keeping only K singular values
  • In PLSA, by having K aspects
  • Comparison to SVD
  • U matrix related to P(d|z) (doc to aspect)
  • V matrix related to P(w|z) (aspect to term)
  • E matrix related to P(z) (aspect strength)
  • The main difference is the way the approximation
    is done
  • PLSA generates a model (aspect model) and
    maximizes its predictive power
  • Selecting the proper value of K is heuristic in
    LSA
  • Model selection in statistics can determine
    optimal K in PLSA

30
Conclusion
  • PLSI consistently outperforms LSI in the
    experiments
  • Precision gains of up to 100% over the baseline
    method in some cases
  • PLSA has a statistical foundation to support it,
    and is thus preferable to LSA