Title: Generative Topic Models for Community Analysis
1. Generative Topic Models for Community Analysis
2. Objectives
- Provide an overview of topic models and their learning techniques
  - Mixture models, PLSA, LDA
  - EM, variational EM, Gibbs sampling
- Convince you that topic models are an attractive framework for community analysis
  - 5 definitive papers
3. Outline
- Part I: Introduction to Topic Models
  - Naive Bayes model
  - Mixture Models
  - Expectation Maximization
  - PLSA
  - LDA
  - Variational EM
  - Gibbs Sampling
- Part II: Topic Models for Community Analysis
  - Citation modeling with PLSA
  - Citation modeling with LDA
  - Author-Topic Model
  - Author-Topic-Recipient Model
  - Modeling influence of citations
  - Mixed membership Stochastic Block Model
4. Introduction to Topic Models
- Naïve Bayes model
- For each document d = 1,…,M
  - Generate Cd ~ Mult(π)
  - For each position n = 1,…,Nd
    - Generate wn ~ Mult(β | Cd)
[Plate diagram: class label C generates words W1 … WN for each of the M documents; parameters π (class prior) and β (per-class word distributions)]
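To make the generative story concrete, here is a minimal Python/NumPy sketch that samples a toy corpus from this naive Bayes model; the class count, vocabulary size, and parameter values are illustrative assumptions, not values from the slides.

  import numpy as np

  rng = np.random.default_rng(0)

  # Toy parameters (illustrative assumptions): 2 classes, 5-word vocabulary.
  pi = np.array([0.6, 0.4])                      # class prior pi
  beta = np.array([[0.5, 0.2, 0.1, 0.1, 0.1],    # word distribution for class 0
                   [0.1, 0.1, 0.1, 0.2, 0.5]])   # word distribution for class 1

  M, N_d = 3, 8                                  # documents and words per document

  corpus = []
  for d in range(M):
      c_d = rng.choice(len(pi), p=pi)                            # C_d ~ Mult(pi)
      words = rng.choice(beta.shape[1], size=N_d, p=beta[c_d])   # w_n ~ Mult(beta | C_d)
      corpus.append((c_d, words))

  for c_d, words in corpus:
      print(f"class={c_d}  words={words.tolist()}")

Each document receives a single class Cd, and every word in it is drawn from that class's word distribution, which is exactly the structure the plate diagram compresses.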
5. Introduction to Topic Models
- Naïve Bayes Model: compact representation
[Plate diagrams: the unrolled model (C with words W1 … WN per document) and the equivalent compact plate form (C → W inside a plate of size N, nested in a plate of size M), with parameters π and β]
6. Introduction to Topic Models
- Multinomial naïve Bayes: learning
- Maximize the log-likelihood of the observed variables w.r.t. the parameters
  - Convex function: global optimum
- Solution: closed-form maximum-likelihood estimates (reconstructed below)
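The solution equation itself did not survive extraction; the standard relative-frequency maximum-likelihood estimates for multinomial naive Bayes, which is presumably what the slide showed, are (with n_{d,w} the count of word w in document d):

  \ell(\pi,\beta) = \sum_{d=1}^{M}\Big(\log \pi_{C_d} + \sum_{n=1}^{N_d}\log \beta_{C_d,\,w_{dn}}\Big)

  \hat{\pi}_c = \frac{\#\{d : C_d = c\}}{M},
  \qquad
  \hat{\beta}_{c,w} = \frac{\sum_{d:\,C_d=c} n_{d,w}}{\sum_{d:\,C_d=c} N_d}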
7. Introduction to Topic Models
- Mixture model: unsupervised naïve Bayes model
- Joint probability of words and classes
- But classes are not visible
[Plate diagram: latent class Z generates word W inside a plate of size N, nested in a plate of size M, with parameters π and β]
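To see why the hidden class changes the optimization problem, the observed-data likelihood marginalizes the latent class out (standard form, reconstructed in the slides' notation):

  p(w_d \mid \pi, \beta) = \sum_{z=1}^{K} \pi_z \prod_{n=1}^{N_d} \beta_{z,\,w_{dn}},
  \qquad
  \ell(\pi,\beta) = \sum_{d=1}^{M} \log \sum_{z} \pi_z \prod_{n=1}^{N_d} \beta_{z,\,w_{dn}}

The log of a sum over the latent class is what makes this objective non-convex, which motivates the EM treatment on the following slides.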
8. Introduction to Topic Models
- Mixture model learning
- Not a convex function
- No global optimum solution
- Solution: Expectation Maximization
  - Iterative algorithm
  - Finds a local optimum
  - Guaranteed to maximize a lower bound on the log-likelihood of the observed data
9. Introduction to Topic Models
- Quick summary of EM
- Log is a concave function
- Lower-bound is convex!
- Optimize this lower-bound w.r.t. each variable instead
[Figure: Jensen's inequality for the logarithm; the chord between log(x1) and log(x2) lies below the curve, so log(0.5·x1 + 0.5·x2) ≥ 0.5·log(x1) + 0.5·log(x2); H(·) marks the entropy term of the bound]
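Written out, the bound that the figure illustrates is the usual EM lower bound, with H(q) the entropy of an auxiliary distribution q over the latent class (a reconstruction of the omitted equation):

  \log p(w \mid \theta) = \log \sum_{z} q(z)\,\frac{p(w, z \mid \theta)}{q(z)}
  \;\ge\; \sum_{z} q(z)\,\log p(w, z \mid \theta) + H(q)

Equality holds when q(z) = p(z | w, θ); EM alternates between tightening the bound in q (E-step) and maximizing it in the parameters (M-step).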
10. Introduction to Topic Models
- Mixture model: EM solution
- E-step
- M-step (equations reconstructed below)
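The E-step and M-step equations are missing from the extracted text; for the mixture of multinomials above they take the standard form (γ_{dz} is the responsibility of class z for document d, n_{d,w} the word counts):

  \text{E-step:}\quad \gamma_{dz} \propto \pi_z \prod_{n=1}^{N_d} \beta_{z,\,w_{dn}}
  \qquad
  \text{M-step:}\quad \pi_z = \frac{1}{M}\sum_{d}\gamma_{dz},
  \quad
  \beta_{z,w} \propto \sum_{d}\gamma_{dz}\, n_{d,w}

A compact NumPy sketch of these two updates follows; the initialization scheme and smoothing constant are assumptions for numerical stability, not from the slides.

  import numpy as np

  def em_multinomial_mixture(counts, K, iters=50, seed=0):
      """EM for a mixture of multinomials; counts is an (M, V) matrix of word counts."""
      rng = np.random.default_rng(seed)
      M, V = counts.shape
      pi = np.full(K, 1.0 / K)                         # mixing weights
      beta = rng.dirichlet(np.ones(V), size=K)         # (K, V) word distributions
      for _ in range(iters):
          # E-step: responsibilities gamma[d, z], computed in log space for stability
          log_resp = np.log(pi) + counts @ np.log(beta).T      # (M, K)
          log_resp -= log_resp.max(axis=1, keepdims=True)
          gamma = np.exp(log_resp)
          gamma /= gamma.sum(axis=1, keepdims=True)
          # M-step: re-estimate pi and beta from the expected counts
          pi = gamma.sum(axis=0) / M
          beta = gamma.T @ counts + 1e-12                      # small constant avoids zeros
          beta /= beta.sum(axis=1, keepdims=True)
      return pi, beta, gamma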
11. Introduction to Topic Models
12. Introduction to Topic Models
- Probabilistic Latent Semantic Analysis model
- Select document d ~ Mult(·)
- For each position n = 1,…,Nd
  - Generate zn ~ Mult(θd)
  - Generate wn ~ Mult(βzn)
[Plate diagram: d → z → w inside a plate of size N, nested in a plate of size M; θd is the per-document topic distribution, β the per-topic word distributions]
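For reference, the PLSA likelihood that the EM procedure on the next slide maximizes is (Hofmann's standard formulation, reconstructed here because the slide's equation image is missing; n(d,w) counts word w in document d):

  P(d, w) = P(d)\sum_{z} P(z \mid d)\, P(w \mid z),
  \qquad
  \mathcal{L} = \sum_{d}\sum_{w} n(d, w)\,\log P(d, w)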
13. Introduction to Topic Models
- Probabilistic Latent Semantic Analysis model
- Learning using EM
- Not a complete generative model
  - Has a distribution over the training set of documents: no new document can be generated!
- Nevertheless, more realistic than the mixture model
  - Documents can discuss multiple topics!
14. Introduction to Topic Models
- PLSA topics (TDT-1 corpus)
15. Introduction to Topic Models
16. Introduction to Topic Models
- Latent Dirichlet Allocation
- For each document d = 1,…,M
  - Generate θd ~ Dir(α)
  - For each position n = 1,…,Nd
    - Generate zn ~ Mult(θd)
    - Generate wn ~ Mult(βzn)
[Plate diagram: α → θd → zn → wn, with wn also depending on β; plates of size N (positions) and M (documents)]
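The corresponding per-document joint distribution, stated for completeness in the same notation (standard LDA; the slide itself shows only the plate diagram):

  p(\theta_d, z_d, w_d \mid \alpha, \beta)
  = p(\theta_d \mid \alpha)\prod_{n=1}^{N_d} p(z_n \mid \theta_d)\, p(w_n \mid \beta_{z_n})

Marginalizing θd and z gives a probability for any document, which is what makes LDA a complete generative model, unlike PLSA.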
17. Introduction to Topic Models
- Latent Dirichlet Allocation
  - Overcomes the issues with PLSA
  - Can generate any random document
- Parameter learning
  - Variational EM
    - Numerical approximation using lower bounds
    - Results in biased solutions
    - Convergence has numerical guarantees
  - Gibbs Sampling
    - Stochastic simulation
    - Unbiased solutions
    - Stochastic convergence
18. Introduction to Topic Models
- Variational EM for LDA
  - Approximate the posterior by a simpler distribution (see the sketch below)
  - A convex function in each parameter!
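The "simpler distribution" is the fully factorized (mean-field) family; as a sketch of what the omitted equations most likely showed:

  q(\theta, z \mid \gamma, \phi) = q(\theta \mid \gamma)\prod_{n=1}^{N} q(z_n \mid \phi_n),
  \qquad
  \log p(w \mid \alpha, \beta) \ge \mathbb{E}_q[\log p(\theta, z, w \mid \alpha, \beta)] - \mathbb{E}_q[\log q(\theta, z)]

The bound is maximized coordinate-wise in the variational parameters γ and φ, which is the sense in which it is convex in each parameter.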
19. Introduction to Topic Models
- Gibbs sampling
  - Applicable when the joint distribution is hard to evaluate but the conditional distributions are known
  - The sequence of samples comprises a Markov chain
  - The stationary distribution of the chain is the joint distribution (a sampler sketch for LDA follows below)
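To ground this for LDA, below is a minimal collapsed Gibbs sampler sketch in Python/NumPy. It resamples each topic assignment from its conditional given all the others using the standard count-ratio update; the toy corpus format, hyperparameter defaults, and iteration count are assumptions for illustration, not from the slides.

  import numpy as np

  def lda_gibbs(docs, V, K, alpha=0.1, beta=0.01, iters=200, seed=0):
      """Collapsed Gibbs sampling for LDA; docs is a list of lists of word ids in [0, V)."""
      rng = np.random.default_rng(seed)
      n_dk = np.zeros((len(docs), K))                  # topic counts per document
      n_kw = np.zeros((K, V))                          # word counts per topic
      n_k = np.zeros(K)                                # total words per topic
      z = [rng.integers(K, size=len(doc)) for doc in docs]   # random initial assignments

      for d, doc in enumerate(docs):                   # fill the count tables
          for i, w in enumerate(doc):
              k = z[d][i]
              n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

      for _ in range(iters):
          for d, doc in enumerate(docs):
              for i, w in enumerate(doc):
                  k = z[d][i]                          # remove the current assignment
                  n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                  # conditional p(z_i = k | z_-i, w), up to normalization
                  p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
                  k = rng.choice(K, p=p / p.sum())     # resample the topic
                  z[d][i] = k
                  n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

      phi = (n_kw + beta) / (n_k[:, None] + V * beta)                         # topic-word estimates
      theta = (n_dk + alpha) / (n_dk.sum(axis=1, keepdims=True) + K * alpha)  # doc-topic estimates
      return theta, phi

Example use: theta, phi = lda_gibbs([[0, 1, 2, 1], [3, 4, 3, 0]], V=5, K=2) recovers per-document topic proportions and per-topic word distributions from the final sample.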
20. Introduction to Topic Models
21. Introduction to Topic Models
22. Introduction to Topic Models
- Perplexity comparison of various models
[Plot: test-set perplexity of the Unigram, Mixture model, PLSA, and LDA models; lower is better]
23. Introduction to Topic Models
- Summary
- Generative models for exchangeable data
- Unsupervised models
- Automatically discover topics
- Well-developed approximate techniques available for inference and learning
24. Outline
- Part I: Introduction to Topic Models
  - Naive Bayes model
  - Mixture Models
  - Expectation Maximization
  - PLSA
  - LDA
  - Variational EM
  - Gibbs Sampling
- Part II: Topic Models for Community Analysis
  - Citation modeling with PLSA
  - Citation modeling with LDA
  - Author-Topic Model
  - Author-Topic-Recipient Model
  - Modeling influence of citations
  - Mixed membership Stochastic Block Model
25. Hyperlink modeling using PLSA
26. Hyperlink modeling using PLSA (Cohn and Hofmann, NIPS 2001)
- Select document d ~ Mult(·)
- For each position n = 1,…,Nd
  - Generate zn ~ Mult(θd)
  - Generate wn ~ Mult(βzn)
- For each citation j = 1,…,Ld
  - Generate zj ~ Mult(θd)
  - Generate cj ~ Mult(γzj)
[Plate diagram: document d with topic distribution θd; topics z generate words w (plate of size N) and citations c (plate of size L) per document, with word distributions β and citation distributions γ]
27. Hyperlink modeling using PLSA (Cohn and Hofmann, NIPS 2001)
- PLSA likelihood
- New likelihood: words and citations (reconstructed below)
- Learning using EM
[Plate diagram: same model as the previous slide; topic z generates both words w and citations c from the shared θd]
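The "new likelihood" equation is missing from the extracted text; in the notation of the previous slide it combines a word term and a citation term per document (a reconstruction of the Cohn and Hofmann objective):

  \mathcal{L} = \sum_{d}\Big[\sum_{n=1}^{N_d}\log\sum_{z} P(z \mid d)\, P(w_n \mid z)
  \;+\; \sum_{j=1}^{L_d}\log\sum_{z} P(z \mid d)\, P(c_j \mid z)\Big]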
28. Hyperlink modeling using PLSA (Cohn and Hofmann, NIPS 2001)
- Heuristic: weight the two likelihood terms by α and (1 − α) (see below)
- 0 ≤ α ≤ 1 determines the relative importance of content and hyperlinks
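Spelled out, the heuristic weights the two parts of the likelihood above (reconstructed; α is the trade-off parameter the slide refers to):

  \mathcal{L}_\alpha = \sum_{d}\Big[\alpha\sum_{n}\log\sum_{z} P(z \mid d)\,P(w_n \mid z)
  \;+\; (1-\alpha)\sum_{j}\log\sum_{z} P(z \mid d)\,P(c_j \mid z)\Big]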
29. Hyperlink modeling using PLSA (Cohn and Hofmann, NIPS 2001)
- Experiments: text classification
- Datasets
  - WebKB
    - 6000 CS department web pages with hyperlinks
    - 6 classes: faculty, course, student, staff, etc.
  - Cora
    - 2000 machine learning abstracts with citations
    - 7 classes: sub-areas of machine learning
- Methodology
  - Learn the model on the complete data and obtain θd for each document
  - Test documents are classified with the label of the nearest neighbor in the training set
  - Distance measured as cosine similarity in the θ space
  - Measure the performance as a function of α
30. Hyperlink modeling using PLSA (Cohn and Hofmann, NIPS 2001)
- Classification performance
[Plots: classification accuracy as α varies between the content-only and hyperlink-only extremes]
31. Hyperlink modeling using LDA
32. Hyperlink modeling using LDA (Erosheva, Fienberg, Lafferty, PNAS 2004)
- For each document d = 1,…,M
  - Generate θd ~ Dir(α)
  - For each position n = 1,…,Nd
    - Generate zn ~ Mult(θd)
    - Generate wn ~ Mult(βzn)
  - For each citation j = 1,…,Ld
    - Generate zj ~ Mult(θd)
    - Generate cj ~ Mult(γzj)
[Plate diagram: α → θd; topics z generate words w (plate of size N) and citations c (plate of size L), with word distributions β and citation distributions γ]
- Learning using variational EM
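For completeness, the per-document joint implied by this process (a reconstruction in the same notation; words and citations share the topic proportions θd):

  p(\theta_d, z, w, z', c \mid \alpha, \beta, \gamma)
  = p(\theta_d \mid \alpha)\prod_{n=1}^{N_d} p(z_n \mid \theta_d)\,p(w_n \mid \beta_{z_n})
    \prod_{j=1}^{L_d} p(z'_j \mid \theta_d)\,p(c_j \mid \gamma_{z'_j})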
33. Hyperlink modeling using LDA (Erosheva, Fienberg, Lafferty, PNAS 2004)
34. Author-Topic Model for Scientific Literature
35. Author-Topic Model for Scientific Literature (Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI 2004)
- For each author a = 1,…,A
  - Generate θa ~ Dir(α)
- For each topic k = 1,…,K
  - Generate φk ~ Dir(β)
- For each document d = 1,…,M
  - For each position n = 1,…,Nd
    - Generate author x ~ Unif(ad)
    - Generate zn ~ Mult(θx)
    - Generate wn ~ Mult(φzn)
[Plate diagram: author set ad → x → z → w; per-author topic distributions θ (plate of size A), per-topic word distributions φ (plate of size K), hyperparameters α and β]
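Under this process, the probability of a word in a document marginalizes over the document's authors and the topics (a reconstruction; ad denotes the author set of document d):

  p(w \mid a_d) = \frac{1}{|a_d|}\sum_{x \in a_d}\sum_{k=1}^{K} \theta_{x,k}\,\phi_{k,w}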
36. Author-Topic Model for Scientific Literature (Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI 2004)
[Plate diagram: Author-Topic model, as on the previous slide]
37. Author-Topic Model for Scientific Literature (Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI 2004)
38. Author-Topic Model for Scientific Literature (Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI 2004)
- Topic-Author visualization
39. Author-Topic Model for Scientific Literature (Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI 2004)
- Application 1: Author similarity
40. Author-Topic Model for Scientific Literature (Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI 2004)
- Application 2: Author entropy
41. Author-Topic-Recipient model for email data (McCallum, Corrada-Emmanuel, Wang, IJCAI 2005)
42. Author-Topic-Recipient model for email data (McCallum, Corrada-Emmanuel, Wang, IJCAI 2005)
- Learning using Gibbs sampling
43. Author-Topic-Recipient model for email data (McCallum, Corrada-Emmanuel, Wang, IJCAI 2005)
- Datasets
  - Enron email data
    - 23,488 messages between 147 users
  - McCallum's personal email
    - 23,488(?) messages with 128 authors
44. Author-Topic-Recipient model for email data (McCallum, Corrada-Emmanuel, Wang, IJCAI 2005)
- Topic visualization: Enron set
45. Author-Topic-Recipient model for email data (McCallum, Corrada-Emmanuel, Wang, IJCAI 2005)
- Topic visualization: McCallum's data
46. Author-Topic-Recipient model for email data (McCallum, Corrada-Emmanuel, Wang, IJCAI 2005)
47. Modeling Citation Influences
48. Modeling Citation Influences (Dietz, Bickel, Scheffer, ICML 2007)
49. Modeling Citation Influences (Dietz, Bickel, Scheffer, ICML 2007)
50. Modeling Citation Influences (Dietz, Bickel, Scheffer, ICML 2007)
- Citation influence graph for LDA paper
51. Modeling Citation Influences (Dietz, Bickel, Scheffer, ICML 2007)
- Words in LDA paper assigned to citations
52. Modeling Citation Influences (Dietz, Bickel, Scheffer, ICML 2007)
- Performance evaluation
- Data
  - 22 seed papers and 132 cited papers
  - Users labeled citations on a scale of 1-4
- Models considered
  - Citation influence model
  - Copycat model
  - LDA-JS-divergence (symmetric divergence in topic space)
  - LDA-post
  - PageRank
  - TF-IDF
- Evaluation measure
  - Area under the ROC curve (AUC)
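The slide's definition of AUC did not survive extraction; presumably it was the standard one, the probability that a randomly chosen relevant (highly rated) citation is ranked above a randomly chosen irrelevant one:

  \mathrm{AUC} = \Pr\big(s(c^{+}) > s(c^{-})\big)

where s(·) denotes the model's influence score for a citation (notation introduced here, not from the slides).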
53. Modeling Citation Influences (Dietz, Bickel, Scheffer, ICML 2007)
54. Mixed Membership Stochastic Block Models (Work in Progress)
- A complete generative model for text and citations
- Can model the topicality of citations
  - Topic-specific PageRank
- Can also predict citations between unseen documents
55. Summary
- Topic modeling is an interesting new framework for community analysis
  - Sound theoretical basis
  - Completely unsupervised
  - Simultaneous modeling of multiple fields
  - Discovers soft communities and clusters in terms of topic membership
  - Can also be used for predictive purposes