1
Latent Dirichlet Allocation
  • Presenter: Hsuan-Sheng Chiu

2
Reference
  • D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," Journal of Machine Learning Research, vol. 3, pp. 993-1022, 2003.

3
Outline
  • Introduction
  • Notation and terminology
  • Latent Dirichlet allocation
  • Relationship with other latent variable models
  • Inference and parameter estimation
  • Discussion

4
Introduction
  • We consider the problem of modeling text corpora and other collections of discrete data
  • The goal is to find short descriptions of the members of a collection
  • This is a significant problem in information retrieval (IR); earlier approaches include:
  • tf-idf scheme (Salton and McGill, 1983)
  • Latent Semantic Indexing (LSI, Deerwester et al.,
    1990)
  • Probabilistic LSI (pLSI, aspect model, Hofmann,
    1999)

5
Introduction (cont.)
  • Problems with pLSI:
  • Incomplete: it provides no probabilistic model at the level of documents
  • The number of parameters in the model grows linearly with the size of the corpus
  • It is not clear how to assign probability to a document outside of the training data
  • Exchangeability: the bag-of-words assumption

6
Notation and terminology
  • A word is the basic unit of discrete data, from a vocabulary indexed by {1, ..., V}. The vth word is represented by a V-vector w such that w^v = 1 and w^u = 0 for u ≠ v
  • A document is a sequence of N words denoted by w = (w1, w2, ..., wN)
  • A corpus is a collection of M documents denoted by D = {w1, w2, ..., wM}

7
Latent Dirichlet allocation
  • Latent Dirichlet allocation (LDA) is a generative
    probabilistic model of a corpus.
  • Generative process for each document w in a corpus D (see the sketch after this list):
  • 1. Choose N ~ Poisson(ξ)
  • 2. Choose θ ~ Dir(α)
  • 3. For each of the N words wn:
  • Choose a topic zn ~ Multinomial(θ)
  • Choose a word wn from p(wn | zn, β), a multinomial probability conditioned on the topic zn
  • β is a k × V matrix with βij = p(w^j = 1 | z^i = 1)
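A minimal runnable sketch of this generative process in Python; the parameter values (k, V, xi, alpha, beta) are illustrative toys standing in for the slide's ξ, α, and β, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy parameters (illustrative values only)
k, V = 3, 10                                # number of topics, vocabulary size
xi = 8.0                                    # Poisson parameter for document length
alpha = np.full(k, 0.5)                     # symmetric Dirichlet parameter
beta = rng.dirichlet(np.ones(V), size=k)    # k x V topic-word probability matrix

def generate_document():
    """Generate one document (a list of word indices) following the LDA process."""
    N = rng.poisson(xi)                     # 1. N ~ Poisson(xi)
    theta = rng.dirichlet(alpha)            # 2. theta ~ Dir(alpha)
    words = []
    for _ in range(N):                      # 3. for each of the N words
        z = rng.choice(k, p=theta)          #    z_n ~ Multinomial(theta)
        w = rng.choice(V, p=beta[z])        #    w_n ~ p(w_n | z_n, beta)
        words.append(w)
    return words

print(generate_document())
```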

8
Latent Dirichlet allocation (cont.)
  • Graphical representation of the document generation process

[Figure: graphical model of document generation — θ ~ Dir(α) gives topic proportions over topics z1, ..., zk; each topic zn is drawn from θ; each word wn is drawn from the topic-word distribution β(zn); document length N ~ Poisson(ξ)]
9
Latent Dirichlet allocation (cont.)
  • Several simplifying assumptions:
  • 1. The dimensionality k of the Dirichlet distribution is known and fixed
  • 2. The word probabilities β are a fixed quantity to be estimated
  • 3. The document length N is independent of all the other data-generating variables θ and z
  • A k-dimensional Dirichlet random variable θ can take values in the (k-1)-simplex

http://www.answers.com/topic/dirichlet-distribution
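For reference, the Dirichlet density on the (k-1)-simplex used here is (as in Blei et al., 2003):

```latex
p(\theta \mid \alpha)
  = \frac{\Gamma\!\left(\sum_{i=1}^{k}\alpha_i\right)}{\prod_{i=1}^{k}\Gamma(\alpha_i)}\,
    \theta_1^{\alpha_1-1}\cdots\theta_k^{\alpha_k-1}
```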
10
Latent Dirichlet allocation (cont.)
  • Simplex

The above figures show the graphs for the n-simplexes with n = 2 to 7 (from MathWorld, http://mathworld.wolfram.com/Simplex.html).
11
Latent Dirichlet allocation (cont.)
  • The joint distribution of a topic mixture θ, a set of N topics z, and a set of N words w
  • Marginal distribution of a document
  • Probability of a corpus
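The three formulas referred to on this slide, as given in the Blei et al. paper, are:

```latex
% Joint distribution of theta, z, and w given alpha and beta:
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)
  = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)

% Marginal distribution of a document:
p(\mathbf{w} \mid \alpha, \beta)
  = \int p(\theta \mid \alpha)
    \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta

% Probability of a corpus:
p(D \mid \alpha, \beta)
  = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha)
    \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d
```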

12
Latent Dirichlet allocation (cont.)
  • There are three levels to the LDA representation:
  • α and β are corpus-level parameters
  • θd are document-level variables
  • zdn and wdn are word-level variables

Such models are often referred to as hierarchical models, conditionally independent hierarchical models, or parametric empirical Bayes models.
13
Latent Dirichlet allocation (cont.)
  • LDA and exchangeability
  • A finite set of random variables {z1, ..., zN} is said to be exchangeable if the joint distribution is invariant to permutation (π is a permutation): p(z1, ..., zN) = p(zπ(1), ..., zπ(N))
  • An infinite sequence of random variables is infinitely exchangeable if every finite subsequence is exchangeable
  • De Finetti's representation theorem states that the joint distribution of an infinitely exchangeable sequence of random variables is as if a random parameter were drawn from some distribution and then the random variables in question were independent and identically distributed, conditioned on that parameter
  • http://en.wikipedia.org/wiki/De_Finetti's_theorem
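In symbols, de Finetti's theorem says the joint distribution of an exchangeable sequence can be written as a mixture over a random parameter θ:

```latex
p(z_1, \ldots, z_N) = \int p(\theta) \left( \prod_{n=1}^{N} p(z_n \mid \theta) \right) d\theta
```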

14
Latent Dirichlet allocation (cont.)
  • In LDA, we assume that words are generated by
    topics (by fixed conditional distributions) and
    that those topics are infinitely exchangeable
    within a document
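Applying de Finetti's theorem to the topics, the paper writes the within-document distribution of words and topics as:

```latex
p(\mathbf{w}, \mathbf{z})
  = \int p(\theta) \left( \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n) \right) d\theta
```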

15
Latent Dirichlet allocation (cont.)
  • A continuous mixture of unigrams
  • By marginalizing over the hidden topic variable z, we can understand LDA as a two-level model
  • Generative process for a document w:
  • 1. Choose θ ~ Dir(α)
  • 2. For each of the N words wn:
  • Choose a word wn from p(wn | θ, β)
  • Marginal distribution of a document (shown below)
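The marginal distribution referred to here, with the word-level distribution obtained by summing out z, is:

```latex
p(w_n \mid \theta, \beta) = \sum_{z} p(w_n \mid z, \beta)\, p(z \mid \theta)

p(\mathbf{w} \mid \alpha, \beta)
  = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} p(w_n \mid \theta, \beta) \right) d\theta
```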

16
Latent Dirichlet allocation (cont.)
  • The distribution on the (V-1)-simplex is attained with only k + kV parameters.

17
Relationship with other latent variable models
  • Unigram model
  • Mixture of unigrams
  • Each document is generated by first choosing a topic z and then generating N words independently from the conditional multinomial
  • k-1 parameters
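The document probabilities for these two baseline models are:

```latex
% Unigram model:
p(\mathbf{w}) = \prod_{n=1}^{N} p(w_n)

% Mixture of unigrams:
p(\mathbf{w}) = \sum_{z} p(z) \prod_{n=1}^{N} p(w_n \mid z)
```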

18
Relationship with other latent variable models
(cont.)
  • Probabilistic latent semantic indexing (pLSI)
  • Attempts to relax the simplifying assumption made in the mixture of unigrams model
  • In a sense, it does capture the possibility that a document may contain multiple topics
  • kV + kM parameters, i.e., linear growth in M
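The pLSI model for a document label d and a word wn is:

```latex
p(d, w_n) = p(d) \sum_{z} p(w_n \mid z)\, p(z \mid d)
```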

19
Relationship with other latent variable models
(cont.)
  • Problems with pLSI:
  • There is no natural way to use it to assign probability to a previously unseen document
  • The linear growth in parameters suggests that the model is prone to overfitting, and empirically, overfitting is indeed a serious problem
  • LDA overcomes both of these problems by treating the topic mixture weights as a k-parameter hidden random variable
  • The k + kV parameters in a k-topic LDA model do not grow with the size of the training corpus.

20
Relationship with other latent variable models
(cont.)
  • A geometric interpretation: three topics and three words

21
Relationship with other latent variable models
(cont.)
  • The unigram model finds a single point on the word simplex and posits that all words in the corpus come from the corresponding distribution.
  • The mixture of unigrams model posits that for each document, one of the k points on the word simplex is chosen randomly and all the words of the document are drawn from the corresponding distribution.
  • The pLSI model posits that each word of a training document comes from a randomly chosen topic. The topics are themselves drawn from a document-specific distribution over topics.
  • LDA posits that each word of both observed and unseen documents is generated by a randomly chosen topic, which is drawn from a distribution with a randomly chosen parameter.

22
Inference and parameter estimation
  • The key inferential problem is that of computing the posterior distribution of the hidden variables given a document

Unfortunately, this distribution is intractable to compute in general; the normalizing quantity p(w | α, β) is intractable due to the coupling between θ and β in the summation over latent topics.
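Concretely, the posterior in question and the intractable normalizer are (in the paper's notation):

```latex
p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta)
  = \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}{p(\mathbf{w} \mid \alpha, \beta)}

p(\mathbf{w} \mid \alpha, \beta)
  = \frac{\Gamma\!\left(\sum_i \alpha_i\right)}{\prod_i \Gamma(\alpha_i)}
    \int \left( \prod_{i=1}^{k} \theta_i^{\alpha_i - 1} \right)
    \left( \prod_{n=1}^{N} \sum_{i=1}^{k} \prod_{j=1}^{V} (\theta_i \beta_{ij})^{w_n^j} \right) d\theta
```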
23
Inference and parameter estimation (cont.)
  • The basic idea of convexity-based variational inference is to make use of Jensen's inequality to obtain an adjustable lower bound on the log likelihood.
  • Essentially, one considers a family of lower bounds, indexed by a set of variational parameters.
  • A simple way to obtain a tractable family of lower bounds is to consider simple modifications of the original graphical model in which some of the edges and nodes are removed.

24
Inference and parameter estimation (cont.)
  • Drop some edges and the w nodes

25
Inference and parameter estimation (cont.)
  • Variational distribution
  • Lower bound on the log likelihood
  • KL divergence between the variational posterior and the true posterior
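The corresponding formulas from the paper, with variational parameters γ (Dirichlet) and φ (multinomial), are the fully factorized variational distribution and the decomposition of the log likelihood into the lower bound L plus a KL term:

```latex
q(\theta, \mathbf{z} \mid \gamma, \phi) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n)

L(\gamma, \phi; \alpha, \beta)
  = \mathrm{E}_q[\log p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)]
    - \mathrm{E}_q[\log q(\theta, \mathbf{z})]

\log p(\mathbf{w} \mid \alpha, \beta)
  = L(\gamma, \phi; \alpha, \beta)
    + \mathrm{D}\!\left( q(\theta, \mathbf{z} \mid \gamma, \phi) \,\|\, p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) \right)
```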

26
Inference and parameter estimation (cont.)
  • Finding a tight lower bound on the log likelihood
  • Maximizing the lower bound with respect to γ and φ is equivalent to minimizing the KL divergence between the variational posterior probability and the true posterior probability
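Equivalently, the optimal variational parameters are found by:

```latex
(\gamma^{*}, \phi^{*})
  = \arg\min_{(\gamma, \phi)}
    \mathrm{D}\!\left( q(\theta, \mathbf{z} \mid \gamma, \phi) \,\|\, p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) \right)
```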

27
Inference and parameter estimation (cont.)
  • Expand the lower bound
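Using the factorizations of p and q, the bound expands into the five expectation terms given in the paper:

```latex
L(\gamma, \phi; \alpha, \beta)
  = \mathrm{E}_q[\log p(\theta \mid \alpha)]
    + \mathrm{E}_q[\log p(\mathbf{z} \mid \theta)]
    + \mathrm{E}_q[\log p(\mathbf{w} \mid \mathbf{z}, \beta)]
    - \mathrm{E}_q[\log q(\theta)]
    - \mathrm{E}_q[\log q(\mathbf{z})]
```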

28
Inference and parameter estimation (cont.)
  • Then
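Written out term by term (as in Appendix A.3 of the paper), each expectation becomes a function of the digamma function Ψ:

```latex
L(\gamma, \phi; \alpha, \beta)
 = \log \Gamma\!\left(\textstyle\sum_j \alpha_j\right) - \sum_i \log \Gamma(\alpha_i)
   + \sum_i (\alpha_i - 1)\left( \Psi(\gamma_i) - \Psi\!\left(\textstyle\sum_j \gamma_j\right) \right)
 + \sum_{n=1}^{N} \sum_{i=1}^{k} \phi_{ni} \left( \Psi(\gamma_i) - \Psi\!\left(\textstyle\sum_j \gamma_j\right) \right)
 + \sum_{n=1}^{N} \sum_{i=1}^{k} \sum_{j=1}^{V} \phi_{ni}\, w_n^j \log \beta_{ij}
 - \log \Gamma\!\left(\textstyle\sum_j \gamma_j\right) + \sum_i \log \Gamma(\gamma_i)
   - \sum_i (\gamma_i - 1)\left( \Psi(\gamma_i) - \Psi\!\left(\textstyle\sum_j \gamma_j\right) \right)
 - \sum_{n=1}^{N} \sum_{i=1}^{k} \phi_{ni} \log \phi_{ni}
```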

29
Inference and parameter estimation (cont.)
  • We can get the variational parameters by adding Lagrange multipliers and setting the derivatives to zero
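The resulting coordinate updates from the paper are:

```latex
\phi_{ni} \propto \beta_{i w_n}
  \exp\!\left( \Psi(\gamma_i) - \Psi\!\left(\textstyle\sum_{j=1}^{k} \gamma_j\right) \right)

\gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni}
```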

30
Inference and parameter estimation (cont.)
  • Parameter estimation
  • Maximize the log likelihood of the data
  • Variational inference provides us with a tractable lower bound on the log likelihood, a bound which we can maximize with respect to α and β
  • Variational EM procedure (a sketch follows below):
  • 1. (E-step) For each document, find the optimizing values of the variational parameters γ and φ
  • 2. (M-step) Maximize the resulting lower bound on the log likelihood with respect to the model parameters α and β
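A minimal runnable sketch of this variational EM loop in Python, under simplifying assumptions: the E-step runs a fixed number of update iterations instead of the paper's convergence test, and the M-step updates β only, omitting the Newton-Raphson update for α described in the paper:

```python
import numpy as np
from scipy.special import digamma

def e_step(doc, alpha, beta, n_iter=50):
    """Variational inference for one document (a list of word indices).
    Returns the variational parameters (gamma, phi)."""
    k, N = len(alpha), len(doc)
    phi = np.full((N, k), 1.0 / k)             # phi_ni initialized uniformly
    gamma = alpha + N / k                      # gamma_i = alpha_i + N/k
    for _ in range(n_iter):
        # phi_ni proportional to beta_{i,w_n} * exp(Psi(gamma_i) - Psi(sum_j gamma_j))
        expelog = np.exp(digamma(gamma) - digamma(gamma.sum()))
        phi = beta[:, doc].T * expelog
        phi /= phi.sum(axis=1, keepdims=True)
        # gamma_i = alpha_i + sum_n phi_ni
        gamma = alpha + phi.sum(axis=0)
    return gamma, phi

def variational_em(docs, k, V, alpha, n_em_iter=20):
    """Simplified variational EM: M-step re-estimates beta only."""
    rng = np.random.default_rng(0)
    beta = rng.dirichlet(np.ones(V), size=k)   # random k x V initialization
    for _ in range(n_em_iter):
        beta_new = np.zeros((k, V)) + 1e-10    # small constant avoids zeros
        for doc in docs:
            _, phi = e_step(doc, alpha, beta)  # E-step per document
            for n, w in enumerate(doc):
                beta_new[:, w] += phi[n]       # accumulate expected word counts per topic
        beta = beta_new / beta_new.sum(axis=1, keepdims=True)
    return beta

# Toy usage: 2 documents over a vocabulary of 5 words, 2 topics.
docs = [[0, 1, 1, 2], [3, 4, 4, 2]]
beta = variational_em(docs, k=2, V=5, alpha=np.full(2, 0.1))
print(beta.round(3))
```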

31
Inference and parameter estimation (cont.)
  • Smoothed LDA model
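In the smoothed model described in the paper, each row of β is itself given an exchangeable Dirichlet prior rather than being treated as a fixed parameter:

```latex
\beta_i \sim \mathrm{Dirichlet}(\eta), \qquad i = 1, \ldots, k
```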

32
Discussion
  • LDA is a flexible generative probabilistic model for collections of discrete data.
  • Exact inference is intractable for LDA, but any of a large suite of approximate inference algorithms for inference and parameter estimation can be used within the LDA framework.
  • LDA is a simple model and is readily extended to continuous data or other non-multinomial data.