

1
Latent Dirichlet Allocation
  • David M. Blei, Andrew Y. Ng & Michael I. Jordan
  • presented by Tilaye Alemu & Anand Ramkissoon

2
Motivation for LDA
  • In lay terms
  • document modelling
  • text classification
  • collaborative filtering
  • ...
  • ...in the context of Information Retrieval
  • The principal focus in this paper is on document
    classification within a corpus

3
Structure of this talk
  • Part 1
  • Theory
  • Background
  • (some) other approaches
  • Part 2
  • Experimental results
  • some details of usage
  • wider applications

4
LDA conceptual features
  • Generative
  • Probabilistic
  • Collections of discrete data
  • 3-level hierarchical Bayesian model
  • mixture models
  • efficient approximate inference techniques
  • variational methods
  • EM algorithm for empirical Bayes parameter
    estimation

5
How to classify text documents
  • Word (term) frequency
  • tf-idf (one variant sketched after this list)
  • term-by-document matrix
  • discriminative sets of words
  • fixed-length lists of numbers
  • little statistical structure
  • Dimensionality reduction techniques
  • Latent Semantic Indexing
  • Singular value decomposition
  • not generative
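One common tf-idf weighting, as a sketch (the slides do not fix a specific variant):

    \mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log \frac{M}{\mathrm{df}(t)}

where tf(t, d) counts occurrences of term t in document d, df(t) is the number of documents containing t, and M is the number of documents.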

6
How to classify text documents, cont'd
  • probabilistic LSI (PLSI)
  • each word generated by one topic
  • each document generated by a mixture of topics
  • a document is represented as a list of mixing
    proportions for topics
  • No generative model for these mixing proportions
  • Number of parameters grows linearly with the size of
    the corpus
  • Overfitting
  • No principled way to classify documents outside the
    training set

7
A major simplifying assumption
  • A document is a bag of words
  • A corpus is a bag of documents
  • order is unimportant
  • exchangeability
  • de Finetti representation theorem
  • any collection of exchangeable random variables
    has a representation as a (generally infinite)
    mixture distribution

8
A note about exchangeability
  • Does not mean that random variables are iid
  • iid when conditioned on an underlying latent
    parameter of a probability distribution
  • Conditionally, the joint distribution is simple and
    factored
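By de Finetti's theorem, the joint distribution of an exchangeable word sequence factors once we condition on the latent parameter θ:

    p(w_1, \dots, w_N) = \int p(\theta) \left( \prod_{n=1}^{N} p(w_n \mid \theta) \right) d\theta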

9
Notation
  • word: the basic unit of discrete data, an item from
    a vocabulary indexed by {1, ..., V}
  • each word is represented as a unit-basis vector of
    length V
  • document: a sequence of N words, w = (w_1, ..., w_N)
  • corpus: a collection of M documents, D = {w_1, ..., w_M}
  • Each document is considered a random mixture over
    latent topics
  • Each topic is considered a distribution over words

10
LDA assumes a generative process for each document in
the corpus
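Following the paper, for each document w in a corpus D:
  • Choose N ~ Poisson(ξ)
  • Choose θ ~ Dir(α)
  • For each of the N words w_n:
    • choose a topic z_n ~ Multinomial(θ)
    • choose a word w_n from p(w_n | z_n, β), a
      multinomial probability conditioned on the topic z_n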
11
Probability density for the Dirichlet random variable
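For a k-dimensional Dirichlet random variable θ with parameter α, the density on the (k-1)-simplex is:

    p(\theta \mid \alpha) = \frac{\Gamma\!\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \, \theta_1^{\alpha_1 - 1} \cdots \theta_k^{\alpha_k - 1}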
12
Joint distribution of a Topic mixture
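Given the parameters α and β, the joint distribution of a topic mixture θ, a set of topics z, and a set of words w is:

    p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta)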
13
Marginal distribution of a document
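Integrating over θ and summing over z gives the marginal distribution of a document:

    p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta) \right) d\theta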
14
Probability of a corpus
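The probability of the corpus is the product of the marginal probabilities of its documents:

    p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d) \, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d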
15
Marginalize over z
  • The word distribution
  • The generative process
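Summing over z gives the word distribution as a mixture over topics:

    p(w \mid \theta, \beta) = \sum_{z} p(w \mid z, \beta) \, p(z \mid \theta)

so the generative process reduces to: choose θ ~ Dir(α), then choose each word w_n from p(w_n | θ, β).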

16
A Unigram Model
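In the unigram model, every word of every document is drawn independently from a single multinomial:

    p(\mathbf{w}) = \prod_{n=1}^{N} p(w_n)

The mixture-of-unigrams variant adds one discrete topic z per document:

    p(\mathbf{w}) = \sum_{z} p(z) \prod_{n=1}^{N} p(w_n \mid z)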
17
probabilistic Latent Semantic Indexing
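pLSI models each word of a document d as drawn from a topic mixture whose mixing weights p(z | d) are tied to that specific training document:

    p(d, w_n) = p(d) \sum_{z} p(w_n \mid z) \, p(z \mid d)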
18
Inference from LDA
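The key inferential problem is computing the posterior over the latent variables given a document:

    p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) = \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}{p(\mathbf{w} \mid \alpha, \beta)}

The normalizing constant p(w | α, β) cannot be computed exactly in general, owing to the coupling between θ and β.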
19
Variational Inference
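Variational inference replaces the intractable posterior with a tractable family q and maximizes a lower bound on the log likelihood, obtained via Jensen's inequality:

    \log p(\mathbf{w} \mid \alpha, \beta) \ge E_q[\log p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)] - E_q[\log q(\theta, \mathbf{z})]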
20
A family of distributions on latent variables
  • The Dirichlet parameter γ and the multinomial
    parameters (φ_1, ..., φ_N) are the free variational
    parameters
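The variational family drops the problematic coupling and fully factorizes:

    q(\theta, \mathbf{z} \mid \gamma, \phi) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n)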

21
The update equations
  • Minimize the Kullback-Leibler divergence between the
    variational distribution and the true posterior,
    which yields the updates below
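The resulting coordinate-ascent updates (Ψ is the digamma function; w_n denotes the n-th word's vocabulary index):

    \phi_{ni} \propto \beta_{i w_n} \exp\!\left( \Psi(\gamma_i) - \Psi\!\left( \sum_{j=1}^{k} \gamma_j \right) \right)

    \gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni}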

22
Variational Inference Algorithm
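A minimal Python sketch of the coordinate-ascent loop for a single document; the function name, argument layout, initialisation, and convergence test are illustrative assumptions, not the paper's code:

import numpy as np
from scipy.special import psi  # digamma function

def variational_inference(doc, alpha, beta, max_iter=100, tol=1e-6):
    # Assumed inputs: doc is a list of vocabulary indices,
    # alpha a length-k array, beta a k x V topic-word matrix.
    N, k = len(doc), len(alpha)
    phi = np.full((N, k), 1.0 / k)   # phi_ni initialised to 1/k
    gamma = alpha + N / k            # gamma_i initialised to alpha_i + N/k
    for _ in range(max_iter):
        old_gamma = gamma.copy()
        # phi_ni proportional to beta_{i,w_n} * exp(Psi(gamma_i));
        # the -Psi(sum_j gamma_j) term is constant in i and
        # cancels under normalisation
        log_phi = np.log(beta[:, doc].T) + psi(gamma)
        phi = np.exp(log_phi - log_phi.max(axis=1, keepdims=True))
        phi /= phi.sum(axis=1, keepdims=True)
        # gamma_i = alpha_i + sum_n phi_ni
        gamma = alpha + phi.sum(axis=0)
        if np.abs(gamma - old_gamma).sum() < tol:
            break                    # converged
    return gamma, phi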