Title: Non-Bayesian Networks
1. Non-Bayesian Networks
- Ron Bekkerman, University of Massachusetts
Joint work with Mehran Sahami (Google)
2. Unsupervised learning
- Tell me something about the data I have
- Compact representation
- Example: clustering
- Generally, ill-defined
- Supervised methods are clearly preferable
- But sometimes inapplicable
3. Semi-(un)supervised learning
- We have a few labeled examples
- And many unlabeled
- The real-world case
- A closely related problem: transfer learning
- We know something about something else
[Figure: data points, a few labeled red or blue, the rest unlabeled]
4. Generative models
- Very popular for unsupervised learning
- Example: Latent Dirichlet Allocation (LDA)
- Blei, Ng & Jordan, 2003
- The number of nodes grows with the size of the data
5. Many generative models
- Are huge
- Inference is very difficult
- Model learning is impossible
- And biased
- Too many assumptions on the data
- That may be wrong after all
6. Pros & cons of generative models
- Pros
- Visualization
- Markov property
- Factorization
- Cons
- Size
- Free parameters
- Arbitrary choice of distribution families
7-10. Proposal
- Symmetric interactions between variables
- Instead of asymmetric causal relationships
- Undirected edges
- Keeping track of the model size
- One random variable per concept
- Object-oriented approach
- No arbitrarily chosen distribution families
- A minimum of free parameters
11. Undirected graphical models
- A.k.a. Markov Random Fields (MRFs)
- Markov property holds
- Joint distribution is a product of potential functions over cliques
- Hammersley-Clifford theorem
- Potentials are arbitrary functions
- Z is a normalization factor
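For reference, here is the Hammersley-Clifford factorization described above, written out in what I take to be the standard MRF notation ($\psi_C$ for the clique potentials, $Z$ for the normalizer; the symbols are assumed, not taken from the slides):

```latex
P(x) \;=\; \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C),
\qquad
Z \;=\; \sum_{x} \prod_{C \in \mathcal{C}} \psi_C(x_C)
```

Here $\mathcal{C}$ is the set of cliques of the graph and $x_C$ is the restriction of the assignment $x$ to clique $C$.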
12. Non-Bayesian inference
- No training phase
- Potentials are fixed for each clique
- Z is a constant
- The maximum likelihood procedure is then
- Maximization of a non-probabilistic objective function
- Defined over the cliques of the graphical model
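A one-line justification of this reduction, in the same assumed notation (writing $f_C = \log \psi_C$): since the potentials are fixed, $Z$ does not depend on the assignment, so

```latex
\arg\max_x \log P(x)
\;=\; \arg\max_x \Big( \sum_{C \in \mathcal{C}} \log \psi_C(x_C) \;-\; \log Z \Big)
\;=\; \arg\max_x \sum_{C \in \mathcal{C}} f_C(x_C).
```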
13. Objective function
- Each clique-level term would measure
- Similarity between the variables' values
- Distance between them
- A kernel-type function
- The information the variables provide on each other
- Etc.
- Maximum likelihood
- Find the best assignment to the variables
- Best means similar / close / informative, etc.
14. Combinatorial Markov Random Fields
15. Combinatorial random variable
- A discrete random variable defined over a combinatorial set
- Given a set X of n values
- The combinatorial RV is defined over a set of values constructed combinatorially from X
- Example: lotto 6/49
- Given a set X of 49 balls, draw 6 balls
- The RV is defined over all the subsets of size 6
- C(49, 6) = 13,983,816 values
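A minimal sketch of the size of this combinatorial value set (a toy illustration; the names are mine):

```python
from math import comb
from itertools import combinations, islice

# Number of values of the combinatorial RV in the lotto 6/49 example:
# all 6-element subsets of a 49-element set.
print(comb(49, 6))  # 13983816

# The value set itself can be enumerated lazily; here are the first
# three 6-ball draws in lexicographic order.
balls = range(1, 50)
print(list(islice(combinations(balls, 6), 3)))
```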
16. Example: hard clustering
- X is your data (n data points)
- A value of the combinatorial RV is a (hard) clustering of X
- The RV is defined over all possible k-clusterings of X
- (k is the number of clusters)
17. Another example: ranking
- X is your data
- A value of the combinatorial RV is a partial order on the values of X
- The RV is defined over all possible orderings
- n! values
18. Combinatorial MRF (Comraf)
- An MRF with combinatorial random variables
- Which are not necessarily all the nodes of the MRF
- Goal
- Find the best (most likely) assignment to the combinatorial random variables
- Given the Comraf model: graph G and objective F
- Challenge
- Usually, the joint distribution cannot be explicitly specified
- No existing inference methods are directly applicable
19. Properties of Comraf models
- Compact: one node per concept
- Such as clusterings of documents, rankings of movies, subsets of images, etc.
- Data-driven: no assumptions on data distributions
- Only empirical distributions are represented
- Such as the distribution of words over documents
- Generic: applicable to many tasks
- In unsupervised and semi-supervised learning
20. Comrafs for clustering
21. Comrafs for clustering
- Allows multi-way clustering
- Example
- Clustering of documents, words, authors, and titles
- The set of possible clusterings forms a lattice
- Where each point is a clustering
22. Clustering objective function
- Recall that each value is a clustering
- Each clustering is itself a random variable
- Over its clusters
- Our objective
- Sum of pairwise mutual information between interacting clusterings (see the sketch below)
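A minimal sketch of this objective for two interacting clusterings, assuming the empirical joint distribution is summarized by a cluster-level co-occurrence count matrix (function and variable names are my own illustration, not the authors' code):

```python
import numpy as np

def mutual_information(counts):
    """Mutual information between two clusterings, estimated from a
    co-occurrence matrix: counts[i, j] is the total weight of data
    falling in cluster i of one clustering and cluster j of the other."""
    p = counts / counts.sum()            # empirical joint distribution
    px = p.sum(axis=1, keepdims=True)    # marginal of the first clustering
    py = p.sum(axis=0, keepdims=True)    # marginal of the second clustering
    nz = p > 0                           # skip empty cells to avoid log(0)
    return float((p[nz] * np.log(p[nz] / (px @ py)[nz])).sum())

def comraf_clustering_objective(count_matrices):
    """Sum of pairwise MI over the edges of the Comraf graph;
    one count matrix per interacting pair of clusterings."""
    return sum(mutual_information(c) for c in count_matrices)
```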
23. Clustering: quasi-random walk
- For each variable
- Start with some clustering
- Say, (0,0,0)
- All data points are in one cluster (cluster 0)
24-25. Clustering: quasi-random walk (continued)
- Walk on the lattice
- While maximizing the objective
- Example lattice point: (1,1,0)
- Moves: split of a cluster, merge of two clusters
26-27. Iterative Conditional Mode
- Fix the values of all but one variable
- Perform a quasi-random walk
- To a local maximum
- Fix this value, move to the next variable
28. Algorithm summary
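A minimal Python sketch of the procedure summarized here: a greedy split/merge walk on the clustering lattice, wrapped in an ICM loop over the combinatorial variables. The data structures (label lists, a `score` callable over a full assignment) are my own illustration of the idea, not the authors' implementation:

```python
import random

def lattice_neighbors(labels, k_max):
    """Split and merge moves on the clustering lattice.
    labels[i] is the cluster id of data point i."""
    clusters = sorted(set(labels))
    # Merge: relabel every point of cluster b as cluster a.
    for a in clusters:
        for b in clusters:
            if a < b:
                yield [a if c == b else c for c in labels]
    # Split: move a random half of a cluster's points to a fresh id.
    if len(clusters) < k_max:
        new_id = max(clusters) + 1
        for a in clusters:
            members = [i for i, c in enumerate(labels) if c == a]
            if len(members) > 1:
                moved = set(random.sample(members, len(members) // 2))
                yield [new_id if i in moved else c
                       for i, c in enumerate(labels)]

def quasi_random_walk(labels, local_score, k_max):
    """Greedy walk: keep taking the best improving split/merge move
    until no move improves the objective (a local maximum)."""
    best, best_score = labels, local_score(labels)
    improved = True
    while improved:
        improved = False
        for cand in lattice_neighbors(best, k_max):
            s = local_score(cand)
            if s > best_score:
                best, best_score, improved = cand, s, True
    return best

def icm(assignment, score, k_max, sweeps=3):
    """Iterative Conditional Mode: optimize one combinatorial variable
    at a time, holding the values of all the others fixed."""
    for _ in range(sweeps):
        for name in assignment:
            def local_score(labels, name=name):
                return score({**assignment, name: labels})
            assignment[name] = quasi_random_walk(assignment[name],
                                                 local_score, k_max)
    return assignment
```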
29. Examples of Comraf models
- Simplest case
- Objective: mutual information between the clustering and the observed variable
- This is the Information Bottleneck
- Tishby, Pereira & Bialek, 1999
- More complex case
- Objective: mutual information between the two clusterings (written out below)
- This is Information-theoretic Co-clustering
- Dhillon, Mallela & Modha, 2003
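In standard (assumed) notation, writing $\tilde{X}$ and $\tilde{Y}$ for the cluster variables induced by the clusterings, the two objectives above can be read as the hard-partition variant of the Information Bottleneck and as information-theoretic co-clustering, respectively:

```latex
\max_{\tilde{X}} \; I(\tilde{X}; Y)
\qquad\text{vs.}\qquad
\max_{\tilde{X},\,\tilde{Y}} \; I(\tilde{X}; \tilde{Y})
```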
30. Evaluation methodology
- Clustering evaluation
- Is generally unintuitive
- Is an entire research field
- We use the accuracy measure
- Following Slonim et al. and Dhillon et al.
- Ground truth vs. our results
- Accuracy = (sum over clusters c of the size of the dominant ground-truth class in cluster c) / (total number of data points)
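A minimal sketch of this accuracy measure (names are mine):

```python
from collections import Counter

def clustering_accuracy(cluster_ids, true_labels):
    """For each cluster, count the items belonging to its dominant
    ground-truth class; sum over clusters and divide by the total."""
    correct = 0
    for c in set(cluster_ids):
        in_c = [t for cid, t in zip(cluster_ids, true_labels) if cid == c]
        correct += Counter(in_c).most_common(1)[0][1]
    return correct / len(true_labels)

# Example: two clusters over six documents with three ground-truth classes.
print(clustering_accuracy([0, 0, 0, 1, 1, 1],
                          ["a", "a", "b", "b", "c", "c"]))  # 4/6 ≈ 0.67
```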
31. Datasets
- Three CALO email datasets
- acheyer: 664 messages, 38 folders
- mgervasio: 777 messages, 15 folders
- mgondek: 297 messages, 14 folders
- Two Enron email datasets
- kitchen-l: 4015 messages, 47 folders
- sanders-r: 1188 messages, 30 folders
- The 20 Newsgroups dataset: 19,997 messages
32. Clustering results
33. Comrafs for semi-supervised and transfer learning
34. Observed nodes in Comrafs
- A node is observed if its value is fixed
- In the case of clustering
- We are given a clustering of the set X
- Observed nodes are shaded in the Comraf graph
- The Comraf model remains essentially the same
- The observed node can be labeled data of D
- As well as of something else
35. Semi-supervised clustering
- Intrinsic model
- The observed node is some labeled data of D
- Which forms a natural partitioning
- Represented as an observed node
- Objective: the same sum of pairwise MI, including terms involving the observed node
- The algorithmic setup is the same
- Clearly, without optimizing the observed node
36. Constrained optimization scheme
- A well-established approach to semi-supervised clustering
- Wagstaff & Cardie, 2000
- Must-link (ml) and cannot-link (cl) constraints
- Comraf graph
- Objective
37. Results on email datasets
- Randomly choose 10%, 20%, and 30% of the data to be labeled
- Plot the accuracy on the unlabeled portion
38. Semi-supervised clustering on 20NG
- 69.5±0.7: unsupervised clustering
- We consider 10% of the data as labeled
- 74.8±0.6: constrained Comraf scheme
- 78.9±0.8: intrinsic Comraf scheme
- 3 of 5 runs built well-balanced clusterings
- Where each category is dominant in one cluster
- Such clusterings can be compared with classification
- 80.0±0.6 on the 3 well-balanced clusterings
- 77.2±0.2: SVM on the same data
39. Resistance to noise
- The intrinsic scheme is resistant to noise
- In contrast to the constrained scheme
- Randomly change 10%, 20%, and 30% of the labels
40. Transfer learning
- A variety of possibilities
- Given two datasets (close in content)
- acheyer and mgervasio
- Consider one labeled and the other not
- And vice versa
- We clustered words given the class labels
- And presented the clustering as observed for the other dataset
- 3% improvement on mgervasio
41. Other Comraf applications and open problems
42. Comraf for image clustering
- Multimedia Information Retrieval setup
- Clustering of images, their local features, captions, and words in these captions
- An ordinary Comraf clustering case
- Challenge
- The probability of a feature appearing in an image is difficult to estimate
- Unlike the word / document case
43. Comraf for topic detection
- Given a set of n documents, where k of them are on a certain topic
- While the rest are noise
- Goal: filter out the (n - k) noisy documents
- Two combinatorial RVs
- Over all subsets of documents / words
- Objective
- Minimizing the joint entropy (see the sketch below)
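A minimal sketch of the joint-entropy computation for a candidate pair of subsets, assuming the data is summarized by a document-word count matrix (the matrix and names are made up for illustration):

```python
import numpy as np

def joint_entropy(counts):
    """Joint entropy (in nats) of the empirical distribution given by a
    non-negative co-occurrence count matrix (e.g., documents x words)."""
    p = counts / counts.sum()
    p = p[p > 0]                      # ignore empty cells
    return float(-(p * np.log(p)).sum())

# Candidate subsets of documents / words select a sub-matrix of the counts;
# the topic-detection objective prefers subsets with low joint entropy.
counts = np.array([[5.0, 1.0, 0.0],
                   [4.0, 2.0, 1.0],
                   [0.0, 0.0, 7.0]])
doc_subset, word_subset = [0, 1], [0, 1]
print(joint_entropy(counts[np.ix_(doc_subset, word_subset)]))
```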
44. Comraf for multi-way ranking
- Given a ranking of movies, rank actors and directors
- Each node is distributed over all possible rankings
- n! values
- Challenge: the objective
- It should embrace the notion of order
45. Model selection
- Comrafs are usually small
- Just a few nodes
- We are certain in choosing the nodes, uncertain in choosing the interactions
- Finding a good set of interactions is feasible
- E.g., there are only 38 ways to build a connected graph of 4 nodes (see the check below)
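As a quick sanity check of the "38" figure, here is a short brute-force enumeration (my own sketch) of the connected labeled graphs on 4 nodes:

```python
from itertools import combinations

def is_connected(nodes, edges):
    """Flood-fill from the first node; connected iff every node is reached."""
    seen, stack = {nodes[0]}, [nodes[0]]
    while stack:
        u = stack.pop()
        for a, b in edges:
            for v in ((b,) if a == u else (a,) if b == u else ()):
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
    return len(seen) == len(nodes)

nodes = [0, 1, 2, 3]
possible_edges = list(combinations(nodes, 2))   # 6 candidate edges
count = sum(is_connected(nodes, subset)
            for r in range(len(possible_edges) + 1)
            for subset in combinations(possible_edges, r))
print(count)  # 38 connected labeled graphs on 4 nodes
```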
46. Hierarchy in Comrafs
- Hierarchy naturally emerges in a probabilistic context
- As well as in an object-oriented one
- Proposal: telescopic models
[Figure: telescopic Comraf graph in which high-level nodes A, B, C summarize groups of lower-level nodes D, E, F, G, K, L, so the model is represented via P(A, B, C)]
47. Conclusions
48. Comraf recipe
- Break your problem into concepts
- Represent each concept as a combinatorial RV
- Decide which RV interacts with which
- Build the Comraf graph
- Choose the objective function
- E.g., for each interacting pair, provide the joint distribution of the underlying data, P(X, Y)
- Choose the particular algorithmic setup
- How to traverse the lattice of possible solutions (a usage sketch follows)
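A toy end-to-end illustration of this recipe, reusing the hypothetical helpers sketched earlier (`mutual_information` and `icm`); the matrix, names, and cluster counts are made up for illustration only:

```python
import numpy as np

# Steps 1-4: two concepts (document clustering, word clustering), one interaction.
edges = [("docs", "words")]

# Step 5: empirical joint P(documents, words) as a toy word-count matrix.
counts = np.array([[4.0, 0.0, 1.0],
                   [0.0, 3.0, 2.0]])

def score(assignment):
    """Comraf objective: sum of pairwise MI over the graph edges,
    computed on cluster-level aggregations of the counts."""
    total = 0.0
    for u, v in edges:
        rows, cols = assignment[u], assignment[v]
        joint = np.zeros((max(rows) + 1, max(cols) + 1))
        for i, r in enumerate(rows):
            for j, c in enumerate(cols):
                joint[r, c] += counts[i, j]
        total += mutual_information(joint)
    return total

# Step 6: traverse the lattice with the ICM + split/merge walk sketched above.
assignment = {"docs": [0, 0], "words": [0, 0, 0]}
print(icm(assignment, score, k_max=2))
```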
49. Conclusion
- Comraf is a useful model
- At least for clustering
- But other applications are also being developed
- The model is generic
- The semi-supervised case is straightforward
- Inference is feasible
- A relatively complex algorithm is sub-cubic
- Model selection is possible
50. I thank Mehran Sahami, and
- Ran El-Yaniv and Andrew McCallum
- for some early discussions
- Victor Lavrenko
- for putting this work into its proper context
- Leslie Kaelbling, Polina Golland and Uri Lerner
- for terminology clarifications
- Erik Learned-Miller, Yoram Singer, Carmel Domshlak, and David Mease
- for ongoing discussions
- Hema Raghavan and Charles Sutton
- for comments on the paper draft
- My wife Anna for her constant support