Title: Bayesian Learning for Latent Semantic Analysis
1. Bayesian Learning for Latent Semantic Analysis
- Jen-Tzung Chien, Meng-Sun Wu and Chia-Sheng Wu
Presenter: Hsuan-Sheng Chiu
2. Reference
- C.-S. Wu, Bayesian Latent Semantic Analysis for Text Categorization and Information Retrieval, 2005
- Q. Huo and C.-H. Lee, On-line adaptive learning of the continuous density hidden Markov model based on approximate recursive Bayes estimate, 1997
3. Outline
- Introduction
- PLSA
- ML (Maximum Likelihood)
- MAP (Maximum A Posteriori)
- QB (Quasi-Bayes)
- Experiments
- Conclusions
4. Introduction
- LSA vs. PLSA
- Linear algebra and probability
- Semantic space and latent topics
- Batch learning vs. Incremental learning
5. PLSA
- PLSA is a general machine learning technique, which adopts the aspect model to represent the co-occurrence data.
- Topics (hidden variables)
- Corpus (document-word pairs)
6. PLSA
- Assume that d_i and w_j are conditionally independent given the associated topic z_k
- Joint probability (reconstructed below)
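The slide's own equation did not survive extraction; a sketch of the standard PLSA aspect-model factorization it refers to (asymmetric formulation):

    P(d_i, w_j) = P(d_i) \sum_{k=1}^{K} P(w_j \mid z_k) \, P(z_k \mid d_i)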
7. ML PLSA
- Log likelihood of Y
- ML estimation
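A hedged reconstruction of the log likelihood named above (standard PLSA form, with n(d_i, w_j) denoting the count of word w_j in document d_i):

    \log P(Y) = \sum_{i} \sum_{j} n(d_i, w_j) \log P(d_i, w_j)

ML PLSA maximizes this with respect to the multinomial parameters P(w_j | z_k) and P(z_k | d_i).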
8. ML PLSA
9. ML PLSA
- Complete data
- Incomplete data
- EM (Expectation-Maximization) Algorithm
- E-step
- M-step
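A sketch of the E-step posterior used in these updates (standard PLSA EM; not copied from the missing slide equations):

    P(z_k \mid d_i, w_j) = \frac{P(w_j \mid z_k) \, P(z_k \mid d_i)}{\sum_{l=1}^{K} P(w_j \mid z_l) \, P(z_l \mid d_i)}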
10. ML PLSA
11. ML PLSA
12. ML PLSA
- M-step
- Lagrange multiplier
13. ML PLSA
- Differentiation
- New parameter estimation
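A minimal runnable sketch of the ML PLSA EM recursion summarized on slides 9-13; the variable names (n for the document-term count matrix, p_w_z, p_z_d) are mine, not from the paper.

    import numpy as np

    def plsa_em(n, K, iters=50, seed=0):
        # n: (D, W) document-term count matrix, K: number of latent topics z_k
        rng = np.random.default_rng(seed)
        D, W = n.shape
        p_w_z = rng.random((K, W))                        # P(w_j | z_k)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_z_d = rng.random((D, K))                        # P(z_k | d_i)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True)
        for _ in range(iters):
            # E-step: P(z | d, w) proportional to P(w | z) P(z | d)
            post = p_z_d[:, :, None] * p_w_z[None, :, :]  # shape (D, K, W)
            post /= post.sum(axis=1, keepdims=True) + 1e-12
            # M-step: reweight posteriors by observed counts, then renormalize
            nz = n[:, None, :] * post                     # n(d, w) * P(z | d, w)
            p_w_z = nz.sum(axis=0)
            p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
            p_z_d = nz.sum(axis=2)
            p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
        return p_w_z, p_z_d

Usage: p_w_z, p_z_d = plsa_em(counts, K=8) returns the two probability tables whose re-estimation formulas slide 13 derives.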
14. MAP PLSA
- Estimation by maximizing the posterior probability
- Definition of the prior distribution
- Dirichlet density
- Prior density (with Kronecker delta)
- Assume the priors over P(w_j | z_k) and P(z_k | d_i) are independent
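A hedged sketch of the Dirichlet prior being defined here (standard conjugate form; the hyperparameter symbols alpha and beta are my labels, not necessarily the paper's):

    g(\Theta) \propto \prod_{k} \prod_{j} P(w_j \mid z_k)^{\alpha_{jk} - 1} \, \prod_{i} \prod_{k} P(z_k \mid d_i)^{\beta_{ik} - 1}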
15. MAP PLSA
- Consider prior density
- Maximum a Posteriori
16. MAP PLSA
- E-step
- expectation
- Auxiliary function
17. MAP PLSA
- M-step
- Lagrange multiplier
18. MAP PLSA
19. MAP PLSA
- Differentiation
- New parameter estimation
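A hedged reconstruction of what the MAP re-estimate looks like for P(w_j | z_k), using the standard MAP PLSA form with the Dirichlet hyperparameters alpha_{jk} sketched above (the P(z_k | d_i) update is analogous with beta_{ik}):

    \hat{P}(w_j \mid z_k) = \frac{\sum_i n(d_i, w_j) P(z_k \mid d_i, w_j) + \alpha_{jk} - 1}{\sum_{j'} \left[ \sum_i n(d_i, w_{j'}) P(z_k \mid d_i, w_{j'}) + \alpha_{j'k} - 1 \right]}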
20QB PLSA
- It needs to update continuously for an online
information system. - Estimation by maximize the posteriori
probability - Posterior density is approximated by the closest
tractable prior density with hyperparameters - As compared to MAP PLSA, the key difference using
QB PLSA is due to the updating of
hyperparameters. -
21. QB PLSA
- Conjugate prior
- In Bayesian probability theory, a conjugate prior is a prior distribution for which the posterior distribution belongs to the same family.
- A closed-form solution
- A reproducible prior/posterior pair for incremental learning
22. QB PLSA
23. QB PLSA
- After careful arrangement, the exponential of the posterior expectation function can be expressed in closed form
- A reproducible prior/posterior pair is generated to build the updating mechanism for the hyperparameters (sketched below)
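A hedged sketch of the resulting hyperparameter update (standard quasi-Bayes form; superscripts index the adaptation epoch, and n^{(t)}(d_i, w_j) are the counts in the newly arrived block of data):

    \alpha_{jk}^{(t)} = \alpha_{jk}^{(t-1)} + \sum_i n^{(t)}(d_i, w_j) \, P(z_k \mid d_i, w_j)

with the beta hyperparameters updated analogously, so the posterior after each block again takes the form of the Dirichlet prior.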
24. Initial Hyperparameters
- An open issue in Bayesian learning
- If the initial prior knowledge is too strong, or after a large amount of adaptation data has already been processed incrementally, new adaptation data usually have only a small impact on the parameter updates in incremental training.
25. Experiments
- MED Corpus
- 1033 medical abstracts with 30 queries
- 7014 unique terms
- 433 abstracts for ML training
- 600 abstracts for MAP or QB training
- Query subset for testing
- K = 8
- Reuters-21578
- 4270 documents for training
- 2925 for QB learning
- 2790 documents for testing
- 13353 unique words
- 10 categories
26. Experiments
27. Experiments
28. Experiments
29. Conclusions
- This paper presented an adaptive text modeling and classification approach for PLSA-based information systems.
- Future work
- Extension of PLSA to bigram or trigram models will be explored.
- Application to spoken document classification and retrieval
30. Discriminative Maximum Entropy Language Model for Speech Recognition
- Chuang-Hua Chueh, To-Chang Chien and Jen-Tzung Chien
Presenter: Hsuan-Sheng Chiu
31. Reference
- R. Rosenfeld, S. F. Chen and X. Zhu, Whole-sentence exponential language models: a vehicle for linguistic-statistical integration, 2001
- W.-H. Tsai, An Initial Study on Language Model Estimation and Adaptation Techniques for Mandarin Large Vocabulary Continuous Speech Recognition, 2005
32. Outline
- Introduction
- Whole-sentence exponential model
- Discriminative ME language model
- Experiment
- Conclusions
33. Introduction
- Language model
- Statistical n-gram model
- Latent semantic language model
- Structured language model
- Based on the maximum entropy principle, we can integrate different features to establish an optimal probability distribution.
34. Whole-Sentence Exponential Model
- Traditional method
- Exponential form
- Usage
- When used for speech recognition, the model is
not suitable for the first pass of the
recognizer, and should be used to re-score N-best
lists.
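A reconstruction of the whole-sentence exponential form referred to here, following Rosenfeld, Chen and Zhu (2001):

    P(W) = \frac{1}{Z} \, P_0(W) \exp\left( \sum_i \lambda_i f_i(W) \right)

where P_0(W) is a baseline model (e.g. an n-gram), f_i(W) are sentence-level feature functions, lambda_i their weights, and Z the normalization constant.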
35. Whole-Sentence ME Language Model
- Expectation of feature function
- Empirical
- Actual
- Constraint
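A sketch of the constraint named above (standard ME formulation): for each feature, the model expectation must match the empirical expectation,

    \sum_{W} P(W) f_i(W) = \sum_{W} \tilde{P}(W) f_i(W), \qquad \text{i.e.} \quad E_{P}[f_i] = E_{\tilde{P}}[f_i]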
36. Whole-Sentence ME Language Model
- To solve the constrained optimization problem
37. GIS algorithm
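A minimal sketch of Generalized Iterative Scaling over a small, explicitly enumerable sentence set (the slides give no code; for whole-sentence models the exact sums below are normally replaced by sampling). Function and variable names are mine.

    import numpy as np

    def gis(features, emp_probs, iters=200):
        # features : (S, F) feature matrix for S enumerable sentences
        # emp_probs: (S,) empirical distribution over those sentences
        S, F = features.shape
        C = features.sum(axis=1).max()        # GIS constant (strict GIS adds a
                                              # slack feature so every row sums to C)
        emp_expect = emp_probs @ features     # empirical feature expectations
        lam = np.zeros(F)
        for _ in range(iters):
            logp = features @ lam
            p = np.exp(logp - logp.max())
            p /= p.sum()                      # current model distribution P(W)
            model_expect = p @ features       # model feature expectations
            lam += np.log((emp_expect + 1e-12) / (model_expect + 1e-12)) / C
        return lam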
38. Discriminative ME Language Model
- In general, ME estimation can be considered maximum likelihood estimation of a log-linear distribution.
- Proposes a discriminative language model based on the whole-sentence ME model (DME)
39. Discriminative ME Language Model
- Acoustic features for ME estimation
- Sentence-level log-likelihood ratio of competing and target sentences (illustrated below)
- Feature weight parameter
- Namely, the feature parameter is activated (set to one) for those speech signals observed in the training database
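A hedged illustration of what such an acoustic feature can look like (my notation, not taken from the slides): for an utterance X with target transcription W and a competing hypothesis W',

    f_a(X, W) = \log \frac{p(X \mid W)}{p(X \mid W')}

with the associated parameter activated only for the speech signals observed in the training database, as the slide states.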
40. Discriminative ME Language Model
- New estimation
- Upgrade to discriminative linguistic parameters
41. Discriminative ME Language Model
42. Experiment
- Corpus: TCC300
- 32 mixtures
- 12 Mel-frequency cepstral coefficients
- 1 log-energy, plus first derivatives
- 4200 sentences for training, 450 for testing
- Corpus: Academia Sinica CKIP balanced corpus
- Five million words
- Vocabulary: 32,909 words
43. Experiment
44. Conclusions
- A new ME language model integrating linguistic and acoustic features for speech recognition
- The derived ME language model has inherent discriminative power.
- The DME model involves a constrained optimization procedure and is powerful for knowledge integration.
45. Relation between DME and MMI
- MMI criterion
- Modified MMI criterion
- Express ME model as ML model
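For reference, a hedged statement of the standard MMI criterion being compared against (my notation; the slides' own derivation is in the equations that did not survive extraction): for training utterances X_r with reference transcriptions W_r,

    F_{\mathrm{MMI}} = \sum_r \log \frac{p(X_r \mid W_r) \, P(W_r)}{\sum_{W} p(X_r \mid W) \, P(W)}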
46. Relation between DME and MMI
47. Relation between DME and MMI
48. Relation between DME and MMI