Title: Bayesian Learning for Latent Semantic Analysis
1. Bayesian Learning for Latent Semantic Analysis
- Jen-Tzung Chien, Meng-Sun Wu and Chia-Sheng Wu
Presenter: Hsuan-Sheng Chiu
2. Reference
- C.-S. Wu, Bayesian Latent Semantic Analysis for Text Categorization and Information Retrieval, 2005
- Q. Huo and C.-H. Lee, On-line adaptive learning of the continuous density hidden Markov model based on approximate recursive Bayes estimate, 1997
3. Outline
- Introduction
- PLSA
- ML (Maximum Likelihood)
- MAP (Maximum A Posteriori)
- QB (Quasi-Bayes)
- Experiments
- Conclusions
4. Introduction
- LSA vs. PLSA
- Linear algebra and probability
- Semantic space and latent topics
- Batch learning vs. Incremental learning
5. PLSA
- PLSA is a general machine learning technique, which adopts the aspect model to represent the co-occurrence data.
- Topics (hidden variables)
- Corpus (document-word pairs)
6. PLSA
- Assume that d_i and w_j are conditionally independent given the associated topic z_k
- Joint probability (reconstructed below)
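The slide's own equation did not survive extraction; a sketch of the standard PLSA aspect-model factorization it refers to (asymmetric formulation):

    P(d_i, w_j) = P(d_i) \sum_{k=1}^{K} P(w_j \mid z_k) \, P(z_k \mid d_i)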
7. ML PLSA
- Log likelihood of Y
- ML estimation
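A hedged reconstruction of the log likelihood named above (standard PLSA form, with n(d_i, w_j) denoting the count of word w_j in document d_i):

    \log P(Y) = \sum_{i} \sum_{j} n(d_i, w_j) \log P(d_i, w_j)

ML PLSA maximizes this with respect to the multinomial parameters P(w_j | z_k) and P(z_k | d_i).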
8. ML PLSA
9. ML PLSA
- Complete data
- Incomplete data
- EM (Expectation-Maximization) Algorithm
- E-step
- M-step
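A sketch of the E-step posterior used in these updates (standard PLSA EM; not copied from the missing slide equations):

    P(z_k \mid d_i, w_j) = \frac{P(w_j \mid z_k) \, P(z_k \mid d_i)}{\sum_{l=1}^{K} P(w_j \mid z_l) \, P(z_l \mid d_i)}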
10. ML PLSA
11. ML PLSA
12. ML PLSA
- M-step
- Lagrange multiplier
13. ML PLSA
- Differentiation
- New parameter estimation
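A minimal runnable sketch of the ML PLSA EM recursion summarized on slides 9-13; the variable names (n for the document-term count matrix, p_w_z, p_z_d) are mine, not from the paper.

    import numpy as np

    def plsa_em(n, K, iters=50, seed=0):
        # n: (D, W) document-term count matrix, K: number of latent topics z_k
        rng = np.random.default_rng(seed)
        D, W = n.shape
        p_w_z = rng.random((K, W))                        # P(w_j | z_k)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_z_d = rng.random((D, K))                        # P(z_k | d_i)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True)
        for _ in range(iters):
            # E-step: P(z | d, w) proportional to P(w | z) P(z | d)
            post = p_z_d[:, :, None] * p_w_z[None, :, :]  # shape (D, K, W)
            post /= post.sum(axis=1, keepdims=True) + 1e-12
            # M-step: reweight posteriors by observed counts, then renormalize
            nz = n[:, None, :] * post                     # n(d, w) * P(z | d, w)
            p_w_z = nz.sum(axis=0)
            p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
            p_z_d = nz.sum(axis=2)
            p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
        return p_w_z, p_z_d

Usage: p_w_z, p_z_d = plsa_em(counts, K=8) returns the two probability tables whose re-estimation formulas slide 13 derives.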
14. MAP PLSA
- Estimation by maximizing the posterior probability
- Definition of the prior distribution
- Dirichlet density
- Prior density (with Kronecker delta)
- Assume the priors over P(w_j | z_k) and P(z_k | d_i) are independent
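A hedged sketch of the Dirichlet prior being defined here (standard conjugate form; the hyperparameter symbols alpha and beta are my labels, not necessarily the paper's):

    g(\Theta) \propto \prod_{k} \prod_{j} P(w_j \mid z_k)^{\alpha_{jk} - 1} \, \prod_{i} \prod_{k} P(z_k \mid d_i)^{\beta_{ik} - 1}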
15. MAP PLSA
- Consider prior density
- Maximum a Posteriori
16. MAP PLSA
- E-step
- expectation
- Auxiliary function
17. MAP PLSA
- M-step
- Lagrange multiplier
18. MAP PLSA
19. MAP PLSA
- Differentiation
- New parameter estimation
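A hedged reconstruction of what the MAP re-estimate looks like for P(w_j | z_k), using the standard MAP PLSA form with the Dirichlet hyperparameters alpha_{jk} sketched above (the P(z_k | d_i) update is analogous with beta_{ik}):

    \hat{P}(w_j \mid z_k) = \frac{\sum_i n(d_i, w_j) P(z_k \mid d_i, w_j) + \alpha_{jk} - 1}{\sum_{j'} \left[ \sum_i n(d_i, w_{j'}) P(z_k \mid d_i, w_{j'}) + \alpha_{j'k} - 1 \right]}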
20QB PLSA
- It needs to update continuously for an online
information system. - Estimation by maximize the posteriori
probability - Posterior density is approximated by the closest
tractable prior density with hyperparameters - As compared to MAP PLSA, the key difference using
QB PLSA is due to the updating of
hyperparameters. -
21. QB PLSA
- Conjugate prior
- In Bayesian probability theory, a conjugate prior is a prior distribution for which the posterior distribution belongs to the same family.
- A closed-form solution
- A reproducible prior/posterior pair for incremental learning
22. QB PLSA
23. QB PLSA
- After careful arrangement, the exponential of the posterior expectation function can be expressed in closed form
- A reproducible prior/posterior pair is generated to build the updating mechanism for the hyperparameters (sketched below)
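A hedged sketch of the resulting hyperparameter update (standard quasi-Bayes form; superscripts index the adaptation epoch, and n^{(t)}(d_i, w_j) are the counts in the newly arrived block of data):

    \alpha_{jk}^{(t)} = \alpha_{jk}^{(t-1)} + \sum_i n^{(t)}(d_i, w_j) \, P(z_k \mid d_i, w_j)

with the beta hyperparameters updated analogously, so the posterior after each block again takes the form of the Dirichlet prior.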
24. Initial Hyperparameters
- An open issue in Bayesian learning
- If the initial prior knowledge is too strong, or after a large amount of adaptation data has already been processed incrementally, new adaptation data usually have only a small impact on the parameter updates in incremental training.
25. Experiments
- MED Corpus
- 1033 medical abstracts with 30 queries
- 7014 unique terms
- 433 abstracts for ML training
- 600 abstracts for MAP or QB training
- Query subset for testing
- K = 8
- Reuters-21578
- 4270 documents for training
- 2925 for QB learning
- 2790 documents for testing
- 13353 unique words
- 10 categories
26. Experiments
27. Experiments
28. Experiments
29. Conclusions
- This paper presented an adaptive text modeling and classification approach for PLSA-based information systems.
- Future work
- Extension of PLSA to bigram or trigram models will be explored.
- Application to spoken document classification and retrieval
30. Discriminative Maximum Entropy Language Model for Speech Recognition
- Chuang-Hua Chueh, To-Chang Chien and Jen-Tzung Chien
Presenter: Hsuan-Sheng Chiu
31. Reference
- R. Rosenfeld, S. F. Chen and X. Zhu, Whole-sentence exponential language models: a vehicle for linguistic-statistical integration, 2001
- W.-H. Tsai, An Initial Study on Language Model Estimation and Adaptation Techniques for Mandarin Large Vocabulary Continuous Speech Recognition, 2005
32. Outline
- Introduction
- Whole-sentence exponential model
- Discriminative ME language model
- Experiment
- Conclusions
33. Introduction
- Language model
- Statistical n-gram model
- Latent semantic language model
- Structured language model
- Based on the maximum entropy principle, we can integrate different features to establish an optimal probability distribution.
34. Whole-Sentence Exponential Model
- Traditional method
- Exponential form
- Usage
- When used for speech recognition, the model is
not suitable for the first pass of the
recognizer, and should be used to re-score N-best
lists.
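A reconstruction of the whole-sentence exponential form referred to here, following Rosenfeld, Chen and Zhu (2001):

    P(W) = \frac{1}{Z} \, P_0(W) \exp\left( \sum_i \lambda_i f_i(W) \right)

where P_0(W) is a baseline model (e.g. an n-gram), f_i(W) are sentence-level feature functions, lambda_i their weights, and Z the normalization constant.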
35. Whole-Sentence ME Language Model
- Expectation of feature function
- Empirical
- Actual
- Constraint
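A sketch of the constraint named above (standard ME formulation): for each feature, the model expectation must match the empirical expectation,

    \sum_{W} P(W) f_i(W) = \sum_{W} \tilde{P}(W) f_i(W), \qquad \text{i.e.} \quad E_{P}[f_i] = E_{\tilde{P}}[f_i]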
36. Whole-Sentence ME Language Model
- To solve the constrained optimization problem
37. GIS algorithm
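A minimal sketch of Generalized Iterative Scaling over a small, explicitly enumerable sentence set (the slides give no code; for whole-sentence models the exact sums below are normally replaced by sampling). Function and variable names are mine.

    import numpy as np

    def gis(features, emp_probs, iters=200):
        # features : (S, F) feature matrix for S enumerable sentences
        # emp_probs: (S,) empirical distribution over those sentences
        S, F = features.shape
        C = features.sum(axis=1).max()        # GIS constant (strict GIS adds a
                                              # slack feature so every row sums to C)
        emp_expect = emp_probs @ features     # empirical feature expectations
        lam = np.zeros(F)
        for _ in range(iters):
            logp = features @ lam
            p = np.exp(logp - logp.max())
            p /= p.sum()                      # current model distribution P(W)
            model_expect = p @ features       # model feature expectations
            lam += np.log((emp_expect + 1e-12) / (model_expect + 1e-12)) / C
        return lam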
38. Discriminative ME Language Model
- In general, ME estimation can be considered maximum likelihood estimation of a log-linear distribution.
- Proposes a discriminative language model based on the whole-sentence ME model (DME)
39. Discriminative ME Language Model
- Acoustic features for ME estimation
- Sentence-level log-likelihood ratio of competing and target sentences (illustrated below)
- Feature weight parameter
- Namely, the feature parameter is activated (set to one) for those speech signals observed in the training database
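A hedged illustration of what such an acoustic feature can look like (my notation, not taken from the slides): for an utterance X with target transcription W and a competing hypothesis W',

    f_a(X, W) = \log \frac{p(X \mid W)}{p(X \mid W')}

with the associated parameter activated only for the speech signals observed in the training database, as the slide states.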
40. Discriminative ME Language Model
- New estimation
- Upgrade to discriminative linguistic parameters
41. Discriminative ME Language Model
42. Experiment
- Corpus: TCC300
- 32 mixtures
- 12 Mel-frequency cepstral coefficients
- 1 log-energy, plus first derivatives
- 4200 sentences for training, 450 for testing
- Corpus: Academia Sinica CKIP balanced corpus
- Five million words
- Vocabulary: 32,909 words
43. Experiment
44. Conclusions
- A new ME language model integrating linguistic and acoustic features for speech recognition
- The derived ME language model has inherent discriminative power.
- The DME model involves a constrained optimization procedure and is powerful for knowledge integration.
45. Relation between DME and MMI
- MMI criterion
- Modified MMI criterion
- Express ME model as ML model
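For reference, a hedged statement of the standard MMI criterion being compared against (my notation; the slides' own derivation is in the equations that did not survive extraction): for training utterances X_r with reference transcriptions W_r,

    F_{\mathrm{MMI}} = \sum_r \log \frac{p(X_r \mid W_r) \, P(W_r)}{\sum_{W} p(X_r \mid W) \, P(W)}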
46. Relation between DME and MMI
47. Relation between DME and MMI
48. Relation between DME and MMI