Title: Hierarchical Mixture Models
1. Hierarchical Mixture Models
- A Probabilistic Analysis
- Mark Sandler
- Google Inc.
2. Mixture models: quick overview
- Classical problem: many documents on various topics; how do we automatically classify them?
- Mixture models allow us to formalize the problem
- Each topic defines a probability distribution over the entire vocabulary, e.g. Math (0.1, 0.00, 0.03, ...), Physics (0.01, 0.3, 0.01, ...)
- Each document has a quantitative relevance to one or more topics
- A document is created by repeatedly sampling from its mixture of topics (see the sketch below)
- Goal: given the documents, reconstruct the underlying topics and each document's relevance to each topic
(Figure: a document depicted as a sequence of sampled terms: Term 1, Term 2, Term 3, ...)
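A minimal sketch of this generative process, assuming hypothetical topic names and term frequencies (not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical vocabulary and topic-term distributions (each row sums to 1).
vocab = ["algebra", "equation", "generator", "charge", "electron"]
topics = np.array([
    [0.40, 0.50, 0.08, 0.01, 0.01],   # "math"
    [0.01, 0.39, 0.20, 0.20, 0.20],   # "physics"
])

def sample_document(relevance, length=20):
    """Sample a document as `length` i.i.d. draws from the mixed term distribution."""
    mixed = relevance @ topics                      # document's term distribution
    word_ids = rng.choice(len(vocab), size=length, p=mixed)
    return [vocab[i] for i in word_ids]

# A document that is half math, half physics.
print(sample_document(np.array([0.5, 0.5])))
```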
3. Topical Hierarchy
- Motivation: there are lots of topics in real data!
- Topics are not independent: a document about math is usually related to science as well
- A hierarchy allows us to encode these dependencies and to do lazy evaluation
- A report on the Tour de France can initially be classified into sports, without worrying about where it falls within sports
- Where do we get the hierarchy from?
4. Our results
- 1. A two-stage generative model: the topical hierarchy is constructed by an adversarial game, and we treat the hierarchy as a giant mixture model from which documents are created
- 2. Given a part of the hierarchy, we prove that classification accuracy is maintained for documents from the entire hierarchy
- 3. An algorithm which learns the hierarchy from unlabeled data
- 4. Experimental results
(Figure: a Tour de France document routed to an unlabeled cycling topic.)
5. Generative model for the topical hierarchy
- Each topic is a probability distribution over terms, as before
- There is a base topic which includes all the documents
- Each new topic is generated from its parent by adversarial mutation of some (possibly all) term frequencies
(Figure: example hierarchy with a base topic at the root, children Science and Sports, and leaves physics, math, baseball, hockey.)
6. Generative model for the hierarchy
- A multistep, adversary-driven random process
- The adversary first chooses the base topic distribution B (B(i) is the frequency of term i in the language)
- For a parent topic T_p the adversary
- decides on the number of children
- for each child l chooses a vector of probability distributions D_l(1), ..., D_l(i), ...
- The frequency of term i in child l is T_l(i) = T_p(i) + e_l(i), where e_l(i) is sampled from D_l(i) (see the sketch below)
- The distributions D_l(i) can depend on the already constructed part of the hierarchy
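A minimal sketch of one mutation step under these assumptions; the uniform zero-mean noise, the clipping to avoid negative frequencies, and the renormalization are illustrative choices, not the paper's construction:

```python
import numpy as np

rng = np.random.default_rng(1)

def mutate_topic(parent, scale=0.05):
    """Child topic: T_child(i) = parent(i) + e(i), with zero-mean perturbations e(i)."""
    e = rng.uniform(-scale, scale, size=parent.shape)  # E[e(i)] = 0 for each term
    child = np.clip(parent + e, 0.0, None)             # no negative frequencies
    return child / child.sum()                         # keep it a probability distribution

base = np.array([0.25, 0.25, 0.20, 0.15, 0.15])        # hypothetical base topic
science = mutate_topic(base)
physics = mutate_topic(science)
print(np.round(physics, 3))
```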
7. The distributions satisfy a few conditions
- Each frequency change has zero expectation: E[e(i)] = 0
- The new topic is different from the parent
- No negative frequencies are allowed: T_p(i) + e(i) >= 0
- The spread (slope) of each distribution is large, i.e. the change in frequencies is not concentrated on just a few terms (a small check is sketched below)
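A small illustrative check of these conditions on sampled perturbations; the tolerance and the "no term carries more than half of the total change" spread test are placeholder choices, not the paper's constants:

```python
import numpy as np

def check_mutation(parent, samples, atol=1e-2):
    """samples: matrix whose rows are sampled perturbation vectors e."""
    zero_mean = np.allclose(samples.mean(axis=0), 0.0, atol=atol)    # E[e(i)] = 0
    non_negative = np.all(parent + samples >= 0)                     # no negative frequencies
    total_change = np.abs(samples).sum(axis=1)
    distinct = np.all(total_change > 0)                              # child differs from parent
    spread_ok = np.all(np.abs(samples).max(axis=1) <= 0.5 * total_change)  # change not concentrated
    return zero_mean and non_negative and distinct and spread_ok
```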
8. Related work: reconstructing hierarchies
- Chinese restaurant process followed by LDA (Blei et al., NIPS 2004)
- Learn the hierarchy from labeled data (Toutanova et al., CIKM 2001)
- Cluster-Abstraction model, an EM-based local search (Hofmann, IJCAI 1999)
- Bottom-up approach: hierarchical agglomerative clustering (lots of work)
9. Our results
- 1. A two-stage generative model: the topical hierarchy is constructed by an adversarial game, and we treat the hierarchy as a giant mixture model from which documents are created
- 2. Given a part of the hierarchy, we prove that classification accuracy is maintained for documents from the entire hierarchy
- 3. An algorithm which learns the hierarchy from unlabeled data
- 4. Experimental results
10. Classification along a path in the tree
- Suppose we know the path Base -> Science -> Physics
- Consider a document on the physics of a hockey puck
- If the document is relevant to a topic that is not in the known part of the hierarchy, that relevance contributes to the closest node on the path
(Figure: the example hierarchy again: Base topic with children Science and Sports, and leaves physics, math, baseball, hockey.)
11. Algorithm when a path in the hierarchy is known
- There is a hierarchy, and we know a path in it
- Treat the path as an instance of a mixture model
- We can treat this as a single classification problem and solve it (see the sketch below)
- How do we classify? Why does it work?
- Problem: documents are generated from distributions which are not part of the known mixture
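A minimal sketch of treating the known path as a single mixture instance; the topic vectors are hypothetical, and the least-squares fit is a stand-in for the generalized pseudoinverse discussed on the next slides:

```python
import numpy as np

# Hypothetical topics along the path Base -> Science -> Physics,
# one column per topic, rows indexed by a 5-term vocabulary.
W = np.array([
    [0.50, 0.10, 0.10],
    [0.20, 0.50, 0.10],
    [0.10, 0.20, 0.20],
    [0.10, 0.10, 0.30],
    [0.10, 0.10, 0.30],
])

def classify_on_path(doc_term_freq):
    """Estimate the document's mixing weights over the path topics."""
    p, *_ = np.linalg.lstsq(W, doc_term_freq, rcond=None)
    p = np.clip(p, 0, None)
    return p / p.sum()

doc = np.array([0.15, 0.35, 0.20, 0.15, 0.15])   # empirical term frequencies
print(np.round(classify_on_path(doc), 3))
```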
12. Why does it work? Part 1: back to plain mixtures
- Suppose we know the matrix of topics W
- Each document is a sample from a mixture of the topics: d ~ Wp, i.e. E[d] = Wp
- We need to compute the underlying mixing coefficients p
- The classical approach, Naïve Bayes, gives unclear guarantees
- Pseudoinverses guarantee that we find the underlying mixing coefficients with small error and high probability
13. Generalized pseudoinverse
- Generalized pseudoinverses (Kleinberg, S., STOC'04)
- Let V be such that VW = I
- Then E[Vd] = VWp = p
- The error ||Vd - p|| is bounded with high probability
- The required length of a document is a function of B
- Take-home message: there exists a matrix V such that, if the topics are linearly independent, we can guarantee the accuracy of classification (see the numeric sketch below)
- See the above papers for more details
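A numeric sketch of this idea, using numpy's Moore-Penrose pseudoinverse as a stand-in for the generalized pseudoinverse of the paper, with hypothetical topics and mixing coefficients:

```python
import numpy as np

rng = np.random.default_rng(2)

W = np.array([                        # topic matrix, one column per topic
    [0.50, 0.10, 0.10],
    [0.20, 0.50, 0.10],
    [0.10, 0.20, 0.20],
    [0.10, 0.10, 0.30],
    [0.10, 0.10, 0.30],
])
V = np.linalg.pinv(W)                 # V @ W = I because the topics are linearly independent

p_true = np.array([0.2, 0.3, 0.5])    # hidden mixing coefficients
doc_len = 2000
probs = W @ p_true
counts = rng.multinomial(doc_len, probs / probs.sum())
d = counts / doc_len                  # empirical term frequencies of one document

p_hat = V @ d                         # E[V d] = V W p = p
print(np.round(p_hat, 3))             # close to p_true; the error shrinks as doc_len grows
```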
14. Part II: still, why does it work?
- The mixture model is built from the topics along the known path
- The documents, however, are generated by topics in different (unknown) parts of the hierarchy
- Suppose document d is produced using topic T (its underlying distribution is T)
- If T_p is a parent of T, then E[T(i)] = T_p(i), because the expectation of e(i) is 0 (spelled out below)
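Spelling this out along a whole path (notation assumed here: T_0 is the closest known ancestor on the path, T_k the actual topic k levels below it, and e_1(i), ..., e_k(i) the zero-mean frequency changes along the way):

```latex
\[
T_k(i) \;=\; T_0(i) + \sum_{j=1}^{k} e_j(i),
\qquad
\mathbb{E}\bigl[T_k(i)\bigr] \;=\; T_0(i) + \sum_{j=1}^{k} \mathbb{E}\bigl[e_j(i)\bigr] \;=\; T_0(i).
\]
```

So, in expectation, a document generated by a topic below the known path looks like a document generated by its closest ancestor on that path.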
15. Reconstructing the hierarchy from unlabeled data
- Construct the base topic - the ambient distribution across all documents
- This gives the root of the hierarchy tree
- For a topic T:
- build a co-occurrence matrix from the documents which belong to the topic
- choose the column which is furthest away from T (in L1 norm) as a candidate child topic
- classify the documents that belong to the new topic
- iterate on topic T until no more documents are split out of the parent
- Iterate the procedure on each child topic (a sketch follows below)
16. Overview of the rest of the talk
- 1. A two-stage model
- The topic hierarchy is constructed using an adversarial game
- The topics form a mixture model and documents are created using that model
- In this model we show that, if we know a part of the hierarchy, we can guarantee classification accuracy along that path
- We design an algorithm which learns the hierarchy from unlabeled data
- Experimental results
17. Experiments: abstracts from arXiv
- 250K abstracts from different areas of physics, with some computer science and math
- 15 categories total
- We run our hierarchy reconstruction algorithm to produce individual clusters
- 76 clusters, with overall recall / precision of 70 / 70
18. Experiments: arXiv
19. 20 Newsgroups
- Contains 20 newsgroups on several related topics (computers, electronics, politics, religion, etc.)
- Relatively small dataset (20K documents)
- We use our algorithm to build the top-level clusters
- The clusters coincide with the natural split of the topics
20. Experiments: Newsgroups
21. Conclusions
- A theoretical framework for analyzing topical hierarchies
- A natural generative model for constructing a hierarchy of topics
- The algorithms provided can operate without reconstructing the entire hierarchy
- An algorithm to reconstruct the topics
- Questions? (Ask now, or come see poster 13)
22. Pseudoinverse, independence coefficient, and such
- From Kleinberg and S. (STOC'04) and S. (KDD'05)
- Simple observation: E[d] = Wp, but d is sparse whereas Wp is not
- It can be shown that, for any k x N matrix V whose maximal element is bounded, the error can be bounded in terms of k, the maximal element of V, and the length of the document (the number of non-zero entries in d)
- But the bound is independent of the total size of the dictionary!
23. Thanks!
24. An example
- Dictionary: algebra, equation, generator, charge, electron
- Topics: Math = (0.4, 0.5, 0.098, 0.01, 0.01), Physics = (0.01, 0.4, 0.2, 0.2, 0.2)
- A typical math document has relevance vector (1, 0)
- A typical physics document has relevance vector (0, 1)
- A document related to both math and physics: (0.5, 0.5) (a worked calculation follows below)
(Figure: example documents shown as sequences of sampled dictionary terms.)
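A minimal worked version of the mixed-document case, using the topic vectors from this slide as given:

```python
import numpy as np

vocab = ["algebra", "equation", "generator", "charge", "electron"]
math_topic    = np.array([0.4, 0.5, 0.098, 0.01, 0.01])
physics_topic = np.array([0.01, 0.4, 0.2, 0.2, 0.2])

# A document with relevance vector (0.5, 0.5) draws its words
# from the average of the two topic distributions.
mixed = 0.5 * math_topic + 0.5 * physics_topic
for term, freq in zip(vocab, mixed):
    print(f"{term}: {freq:.3f}")   # algebra 0.205, equation 0.450, generator 0.149, ...
```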