Probabilistic Latent Semantic Analysis - PowerPoint PPT Presentation

1
Probabilistic Latent Semantic Analysis
  • Thomas Hofmann
  • Presented by
  • Mummoorthy Murugesan
  • CS 690I, 03/27/2007

2
Outline
  • Latent Semantic Analysis
  • A gentle review
  • Why we need PLSA
  • Indexing
  • Information Retrieval
  • Construction of PLSI
  • Aspect Model
  • EM
  • Tempered EM
  • Experiments on the effectiveness of PLSI

3
The Setting
  • Set of N documents
  • D = {d_1, ..., d_N}
  • Set of M words
  • W = {w_1, ..., w_M}
  • Set of K latent classes
  • Z = {z_1, ..., z_K}
  • An N × M matrix to represent the frequency
    counts
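As a concrete illustration, the N × M frequency-count matrix can be built from a toy corpus like this (the documents and vocabulary here are invented for the example, not from the original experiments):

```python
import numpy as np

# Hypothetical toy corpus: N documents over a vocabulary of M words.
docs = [
    "java coffee cup",
    "java code compiler",
    "cricket bat game",
]

# Build the vocabulary W and the N x M term-frequency matrix n(d, w).
vocab = sorted({w for d in docs for w in d.split()})
index = {w: j for j, w in enumerate(vocab)}
counts = np.zeros((len(docs), len(vocab)), dtype=int)
for i, d in enumerate(docs):
    for w in d.split():
        counts[i, index[w]] += 1

print(counts.shape)  # (3, 8)
```

Each row is a document d, each column a word w, and entry (d, w) is the count n(d, w) used throughout the rest of the deck.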

4
Latent Semantic Indexing (1/4)
  • Latent: present but not evident; hidden
  • Semantic: relating to meaning
  • Captures the hidden meaning of terms and their
    occurrences in documents

5
Latent Semantic Indexing (2/4)
  • For natural language queries, simple term
    matching does not work effectively
  • Terms are ambiguous
  • The same query can be phrased differently due to
    personal style
  • Latent semantic indexing
  • Creates a latent semantic space (hidden
    meaning)

6
Latent Semantic Indexing (3/4)
  • Singular Value Decomposition (SVD)
  • A (n×m) = U (n×n) E (n×m) V^T (m×m)
  • Keep only the k largest singular values in E
  • A (n×m) ≈ U (n×k) E (k×k) V^T (k×m)
  • Convert terms and documents to points in
    k-dimensional space
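The truncated SVD above can be sketched with NumPy (the random 6×10 count matrix and k = 2 are placeholders for a real term-document matrix; E is represented by the diagonal of singular values):

```python
import numpy as np

# Placeholder term-document matrix A (n x m); in practice this would be
# the frequency-count matrix from the corpus.
rng = np.random.default_rng(0)
A = rng.integers(0, 4, size=(6, 10)).astype(float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)  # A = U diag(s) Vt
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]       # rank-k approximation

# Documents as points in the k-dimensional latent space.
doc_coords = U[:, :k] * s[:k]
print(doc_coords.shape)  # (6, 2)
```

Queries and terms are mapped into the same k-dimensional space, where similarity is measured between the projected points.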

7
Latent Semantic Indexing (4/4)
  • LSI puts documents together even if they don't
    have common words, provided
  • the docs share frequently co-occurring terms
  • Disadvantage
  • Statistical foundation is missing
  • PLSA addresses this concern!

8
Probabilistic Latent Semantic Analysis
  • Automated document indexing and information
    retrieval
  • Identification of latent classes using an
    Expectation Maximization (EM) algorithm
  • Shown to handle
  • Polysemy
  • Java could mean the coffee and also the
    programming language
  • Cricket is a game and also an insect
  • Synonymy
  • computer, pc, desktop could all mean the
    same thing
  • Has a better statistical foundation than LSA

9
PLSA
  • Aspect Model
  • Tempered EM
  • Experiment Results

10
PLSA Aspect Model
  • Aspect Model
  • Document is a mixture of underlying (latent) K
    aspects
  • Each aspect is represented by a distribution of
    words P(w|z)
  • Model fitting with Tempered EM

11
Aspect Model
  • Latent Variable model for general co-occurrence
    data
  • Associate each observation (w,d) with a class
    variable z ∈ Z = {z_1, ..., z_K}
  • Generative Model
  • Select a doc with probability P(d)
  • Pick a latent class z with probability P(z|d)
  • Generate a word w with probability P(w|z)

  (Graphical model: d → z → w, with arcs labeled
   P(d), P(z|d), P(w|z))
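The three-step generative process can be sketched as follows (all sizes and parameter tables here are hypothetical, drawn from Dirichlet distributions just to obtain valid probabilities):

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, K = 3, 5, 2  # hypothetical numbers of docs, words, latent classes

# Hypothetical model parameters (each distribution sums to 1).
P_d = np.full(N, 1.0 / N)                      # P(d)
P_z_given_d = rng.dirichlet(np.ones(K), N)     # P(z|d), N x K
P_w_given_z = rng.dirichlet(np.ones(M), K)     # P(w|z), K x M

def sample_pair():
    """One observation (d, w): d ~ P(d), z ~ P(z|d), w ~ P(w|z)."""
    d = rng.choice(N, p=P_d)
    z = rng.choice(K, p=P_z_given_d[d])
    w = rng.choice(M, p=P_w_given_z[z])
    return d, w

d, w = sample_pair()
```

Note that z is discarded after generation; only the (d, w) pair is observed, which is why z is called latent.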
12
Aspect Model
  • To get the joint probability model
  • P(d,w) = P(d) Σ_z P(z|d) P(w|z)
  • d and w are assumed independent, conditioned on
    the latent class z
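Under this parameterization, the joint P(d,w) = P(d) Σ_z P(z|d) P(w|z) can be computed for all document-word pairs with a single matrix product (a sketch with hypothetical parameters):

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, K = 3, 5, 2  # hypothetical sizes

P_d = np.full(N, 1.0 / N)                    # P(d)
P_z_given_d = rng.dirichlet(np.ones(K), N)   # P(z|d), N x K
P_w_given_z = rng.dirichlet(np.ones(M), K)   # P(w|z), K x M

# P(d, w) = P(d) * sum_z P(z|d) P(w|z), for all pairs at once.
joint = P_d[:, None] * (P_z_given_d @ P_w_given_z)   # N x M
print(np.isclose(joint.sum(), 1.0))  # True: a distribution over (d, w)
```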

13
  • Using Bayes' rule, the symmetric form is
  • P(d,w) = Σ_z P(z) P(d|z) P(w|z)

14
Advantages of this model over document clustering
  • Documents are not tied to a single cluster
    (i.e., aspect)
  • For each document d, P(z|d) defines a specific
    mixture of aspects
  • This offers more flexibility and produces
    effective modeling
  • Now we have to compute P(z), P(d|z), P(w|z),
    given just documents (d) and words (w).

15
Model fitting with Tempered EM
  • We have the log-likelihood function from the
    aspect model, and we need to maximize it
  • Expectation Maximization (EM) is used for this
    purpose
  • To avoid overfitting, tempered EM is proposed

16
EM Steps
  • E-Step
  • Expectation step where expectation of the
    likelihood function is calculated with the
    current parameter values
  • M-Step
  • Update the parameters with the calculated
    posterior probabilities
  • Find the parameters that maximize the likelihood
    function

17
E Step
  • Posterior probability that a word w occurring in
    a document d is explained by aspect z
  • P(z|d,w) = P(z) P(d|z) P(w|z) /
    Σ_z' P(z') P(d|z') P(w|z')  (by Bayes' rule)

18
M Step
  • P(w|z) ∝ Σ_d n(d,w) P(z|d,w)
  • P(d|z) ∝ Σ_w n(d,w) P(z|d,w)
  • P(z) ∝ Σ_d Σ_w n(d,w) P(z|d,w)
  • All these updates use P(z|d,w) calculated in the
    E Step
  • Converges to a local maximum of the likelihood
    function
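The E and M steps can be sketched in NumPy. This is an illustrative implementation of the aspect-model EM, not the paper's original code; the function name and the dense N × M × K posterior array are choices made here for clarity:

```python
import numpy as np

def plsa_em(counts, K, iters=50, seed=0):
    """EM for the aspect model. counts: N x M matrix of n(d, w)."""
    rng = np.random.default_rng(seed)
    N, M = counts.shape
    # Random valid initial parameters.
    P_z = np.full(K, 1.0 / K)
    P_d_given_z = rng.dirichlet(np.ones(N), K).T   # N x K, columns sum to 1
    P_w_given_z = rng.dirichlet(np.ones(M), K)     # K x M, rows sum to 1
    for _ in range(iters):
        # E-step: P(z|d,w) ∝ P(z) P(d|z) P(w|z), shape N x M x K.
        post = (P_z[None, None, :]
                * P_d_given_z[:, None, :]
                * P_w_given_z.T[None, :, :])
        post /= post.sum(axis=2, keepdims=True)
        # M-step: re-estimate from expected counts n(d,w) P(z|d,w).
        exp_counts = counts[:, :, None] * post     # N x M x K
        totals = exp_counts.sum(axis=(0, 1))       # per-aspect mass
        P_w_given_z = (exp_counts.sum(axis=0) / totals).T
        P_d_given_z = exp_counts.sum(axis=1) / totals
        P_z = totals / counts.sum()
    return P_z, P_d_given_z, P_w_given_z
```

Calling `plsa_em(counts, K=2)` on a small count matrix returns normalized estimates of P(z), P(d|z), and P(w|z).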

19
Overfitting
  • Trade-off between predictive performance on the
    training data and on unseen new data
  • Must prevent the model from overfitting the
    training data
  • Proposal: change the E-Step
  • Reduce the effect of fitting as we do more steps

20
TEM (Tempered EM)
  • Introduce a control parameter β
  • β starts at 1 and is gradually decreased
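The tempered E-step raises the likelihood terms to the power β, so that β = 1 recovers standard EM and β < 1 dampens the re-estimation. A minimal sketch (the helper name and array shapes are assumptions of this example):

```python
import numpy as np

def tempered_posterior(P_z, P_d_given_z, P_w_given_z, beta):
    """Tempered E-step: P_beta(z|d,w) ∝ P(z) [P(d|z) P(w|z)]^beta.

    P_z: (K,), P_d_given_z: (N, K), P_w_given_z: (K, M).
    Returns the N x M x K posterior; beta=1 is the standard E-step.
    """
    post = (P_z[None, None, :]
            * (P_d_given_z[:, None, :]
               * P_w_given_z.T[None, :, :]) ** beta)
    return post / post.sum(axis=2, keepdims=True)
```

As β shrinks toward 0, the posterior flattens toward P(z), which is exactly the "reduced effect of re-estimation" described on the next slide.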

21
Simulated Annealing
  • Alternate heating and cooling of materials to
    make them reach a minimum internal energy state
    and reduce defects
  • Tempered EM is similar to simulated annealing:
    β acts as a temperature variable
  • As the value of β decreases, the re-estimations
    have less effect on the expectation
    calculations

22
Choosing ß
  • How to choose a proper β?
  • It controls the
  • underfitting vs. overfitting trade-off
  • Simple solution using held-out data (part of the
    training data)
  • Train with the current β, starting from β = 1
  • Test the model on the held-out data
  • If performance improves, continue with the same β
  • If not, set β ← ηβ where η < 1

23
Perplexity Comparison(1/4)
  • Perplexity: log-averaged inverse probability on
    unseen data
  • Higher probability gives lower perplexity, i.e.,
    better predictions
  • Measured on the MED data
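The perplexity measure can be sketched as follows; here `P_w_given_d` would come from the fitted model, e.g. P(w|d) = Σ_z P(z|d) P(w|z), and the function name is an assumption of this example:

```python
import numpy as np

def perplexity(counts, P_w_given_d):
    """exp(-(sum_{d,w} n(d,w) log P(w|d)) / sum_{d,w} n(d,w)).

    counts: N x M held-out counts n(d,w);
    P_w_given_d: N x M model probabilities P(w|d).
    """
    mask = counts > 0  # skip zero counts (0 * log p contributes nothing)
    log_lik = (counts[mask] * np.log(P_w_given_d[mask])).sum()
    return np.exp(-log_lik / counts.sum())
```

Sanity check: a model that is uniform over an M-word vocabulary has perplexity exactly M, and anything that predicts the held-out words better scores lower.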

24
Topic Decomposition (2/4)
  • Abstracts of 1568 documents
  • Clustered into 128 latent classes
  • Shows word stems for the two aspects that
    contain the same word power, ranked by P(w|z)
  • Power_1: astronomy
  • Power_2: electrical engineering

25
Polysemy (3/4)
  • Occurrences of segment in two different contexts
    are identified (image, sound)

26
Information Retrieval (4/4)
  • MED: 1033 docs
  • CRAN: 1400 docs
  • CACM: 3204 docs
  • CISI: 1460 docs
  • Reporting only the best results, with K varying
    over 32, 48, 64, 80, 128
  • The combined PLSI model averages across the
    models at different K values

27
Information Retrieval (4/4)
  • Cosine similarity is the baseline
  • In LSI, the query vector q is projected into the
    reduced space
  • In PLSI, P(z|d) and P(z|q) are compared; in the
    EM iterations, only P(z|q) is adapted
    (folding-in)
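Folding a query in can be sketched as EM with P(w|z) frozen, updating only P(z|q) from the query's own word counts (an illustrative sketch; the function name and iteration count are assumptions):

```python
import numpy as np

def fold_in_query(q_counts, P_w_given_z, iters=20):
    """Estimate P(z|q) for a query with P(w|z) held fixed.

    q_counts: length-M word counts of the query;
    P_w_given_z: K x M word distributions from the trained model.
    """
    K = P_w_given_z.shape[0]
    P_z_given_q = np.full(K, 1.0 / K)  # uniform start
    for _ in range(iters):
        # E-step: P(z|q,w) ∝ P(z|q) P(w|z), for each word w.
        post = P_z_given_q[:, None] * P_w_given_z      # K x M
        post /= post.sum(axis=0, keepdims=True)
        # M-step: update only P(z|q) from the query's counts.
        P_z_given_q = ((post * q_counts[None, :]).sum(axis=1)
                       / q_counts.sum())
    return P_z_given_q
```

The resulting P(z|q) can then be compared against each document's P(z|d), e.g. by cosine similarity in aspect space.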

28
Precision-Recall results(4/4)
29
Comparing PLSA and LSA
  • LSA and PLSA perform dimensionality reduction
  • In LSA, by keeping only K singular values
  • In PLSA, by having K aspects
  • Comparison to SVD
  • U matrix related to P(d|z) (doc to aspect)
  • V matrix related to P(w|z) (aspect to term)
  • E matrix related to P(z) (aspect strength)
  • The main difference is the way the approximation
    is done
  • PLSA generates a model (aspect model) and
    maximizes its predictive power
  • Selecting the proper value of K is heuristic in
    LSA
  • Model selection in statistics can determine
    optimal K in PLSA

30
Conclusion
  • PLSI consistently outperforms LSI in the
    experiments
  • Precision gains of up to 100% over the baseline
    method in some cases
  • PLSA has a statistical foundation to support it,
    and is thus preferable to LSA