Title: Non-Bayesian Networks
1. Non-Bayesian Networks
- Ron Bekkerman, University of Massachusetts
Joint work with Mehran Sahami (Google)
2. Unsupervised learning
- Tell me something about the data I have
- Compact representation
- Example: clustering
- Generally, ill-defined
- Supervised methods are clearly preferable
- But sometimes inapplicable
3. Semi-(un)supervised learning
- We have a few labeled examples
- And many unlabeled
- The real-world case
- A closely related problem: transfer learning
- We know something about something else
[Figure: data points, a few labeled red or blue, the rest unlabeled]
4. Generative models
- Very popular for unsupervised learning
- Example: Latent Dirichlet Allocation (LDA)
- Blei, Ng & Jordan, 2003
- The number of nodes grows with the size of the data
5. Many generative models
- Are huge
- Inference is very difficult
- Model learning is impossible
- And biased
- Too many assumptions on the data
- That may be wrong after all
6. Pros & cons of generative models
- Pros
- Visualization
- Markov property
- Factorization
- Cons
- Size
- Free parameters
- Arbitrary choice of distribution families
7-10. Proposal
- Symmetric interactions between variables
- Instead of asymmetric causal relationships
- Undirected edges
- Keeping track of the model size
- One random variable per concept
- Object-oriented approach
- No arbitrarily chosen distribution families
- A minimum of free parameters
11. Undirected graphical models
- A.k.a. Markov Random Fields (MRFs)
- Markov property holds
- Joint distribution is a product of potential functions over cliques
- Hammersley-Clifford theorem
- Potentials are arbitrary functions
- Z is a normalization factor
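For reference, here is the Hammersley-Clifford factorization described above, written out in what I take to be the standard MRF notation ($\psi_C$ for the clique potentials, $Z$ for the normalizer; the symbols are assumed, not taken from the slides):

```latex
P(x) \;=\; \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(x_C),
\qquad
Z \;=\; \sum_{x} \prod_{C \in \mathcal{C}} \psi_C(x_C)
```

Here $\mathcal{C}$ is the set of cliques of the graph and $x_C$ is the restriction of the assignment $x$ to clique $C$.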
12. Non-Bayesian inference
- No training phase
- Potentials are fixed for each clique
- Z is a constant
- The maximum likelihood procedure is then
- Maximization of a non-probabilistic objective function
- Defined over the cliques of the graphical model
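A one-line justification of this reduction, in the same assumed notation (writing $f_C = \log \psi_C$): since the potentials are fixed, $Z$ does not depend on the assignment, so

```latex
\arg\max_x \log P(x)
\;=\; \arg\max_x \Big( \sum_{C \in \mathcal{C}} \log \psi_C(x_C) \;-\; \log Z \Big)
\;=\; \arg\max_x \sum_{C \in \mathcal{C}} f_C(x_C).
```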
13. Objective function
- Each clique-level term would measure
- Similarity between the variables' values
- Distance between them
- A kernel-type function
- The information the variables provide on each other
- Etc.
- Maximum likelihood
- Find the best assignment to the variables
- Best means similar / close / informative, etc.
14. Combinatorial Markov Random Fields
15. Combinatorial random variable
- A discrete random variable defined over a combinatorial set
- Given a set X of n values
- The combinatorial RV is defined over a set of values constructed combinatorially from X
- Example: lotto 6/49
- Given a set X of 49 balls, draw 6 balls
- The RV is defined over all the subsets of size 6
- C(49, 6) = 13,983,816 values
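A minimal sketch of the size of this combinatorial value set (a toy illustration; the names are mine):

```python
from math import comb
from itertools import combinations, islice

# Number of values of the combinatorial RV in the lotto 6/49 example:
# all 6-element subsets of a 49-element set.
print(comb(49, 6))  # 13983816

# The value set itself can be enumerated lazily; here are the first
# three 6-ball draws in lexicographic order.
balls = range(1, 50)
print(list(islice(combinations(balls, 6), 3)))
```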
16. Example: hard clustering
- X is your data (n data points)
- A value of the combinatorial RV is a (hard) clustering of X
- The RV is defined over all possible k-clusterings of X
- (k is the number of clusters)
17. Another example: ranking
- X is your data
- A value of the combinatorial RV is a partial order on the values of X
- The RV is defined over all possible orderings
- n! values
18. Combinatorial MRF (Comraf)
- An MRF with combinatorial random variables
- Which are not necessarily all the nodes of the MRF
- Goal
- Find the best (most likely) assignment to the combinatorial random variables
- Given the Comraf model: graph G and objective F
- Challenge
- Usually, the joint distribution cannot be explicitly specified
- No existing inference methods are directly applicable
19. Properties of Comraf models
- Compact: one node per concept
- Such as clusterings of documents, rankings of movies, subsets of images, etc.
- Data-driven: no assumptions on data distributions
- Only empirical distributions are represented
- Such as the distribution of words over documents
- Generic: applicable to many tasks
- In unsupervised and semi-supervised learning
20. Comrafs for clustering
21. Comrafs for clustering
- Allows multi-way clustering
- Example
- Clustering of documents, words, authors, and titles
- The set of possible clusterings forms a lattice
- Where each point is a clustering
22. Clustering objective function
- Recall that each value is a clustering
- Each clustering is itself a random variable
- Over its clusters
- Our objective
- Sum of pairwise mutual information between interacting clusterings (see the sketch below)
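A minimal sketch of this objective for two interacting clusterings, assuming the empirical joint distribution is summarized by a cluster-level co-occurrence count matrix (function and variable names are my own illustration, not the authors' code):

```python
import numpy as np

def mutual_information(counts):
    """Mutual information between two clusterings, estimated from a
    co-occurrence matrix: counts[i, j] is the total weight of data
    falling in cluster i of one clustering and cluster j of the other."""
    p = counts / counts.sum()            # empirical joint distribution
    px = p.sum(axis=1, keepdims=True)    # marginal of the first clustering
    py = p.sum(axis=0, keepdims=True)    # marginal of the second clustering
    nz = p > 0                           # skip empty cells to avoid log(0)
    return float((p[nz] * np.log(p[nz] / (px @ py)[nz])).sum())

def comraf_clustering_objective(count_matrices):
    """Sum of pairwise MI over the edges of the Comraf graph;
    one count matrix per interacting pair of clusterings."""
    return sum(mutual_information(c) for c in count_matrices)
```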
23. Clustering: quasi-random walk
- For each variable
- Start with some clustering
- Say, (0,0,0)
- All data points are in one cluster (cluster 0)
24-25. Clustering: quasi-random walk (continued)
- Walk on the lattice
- While maximizing the objective
- Example lattice point: (1,1,0)
- Moves: split of a cluster, merge of two clusters
26-27. Iterative Conditional Mode
- Fix the values of all but one variable
- Perform a quasi-random walk
- To a local maximum
- Fix this value, move to the next variable
28. Algorithm summary
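A minimal Python sketch of the procedure summarized here: a greedy split/merge walk on the clustering lattice, wrapped in an ICM loop over the combinatorial variables. The data structures (label lists, a `score` callable over a full assignment) are my own illustration of the idea, not the authors' implementation:

```python
import random

def lattice_neighbors(labels, k_max):
    """Split and merge moves on the clustering lattice.
    labels[i] is the cluster id of data point i."""
    clusters = sorted(set(labels))
    # Merge: relabel every point of cluster b as cluster a.
    for a in clusters:
        for b in clusters:
            if a < b:
                yield [a if c == b else c for c in labels]
    # Split: move a random half of a cluster's points to a fresh id.
    if len(clusters) < k_max:
        new_id = max(clusters) + 1
        for a in clusters:
            members = [i for i, c in enumerate(labels) if c == a]
            if len(members) > 1:
                moved = set(random.sample(members, len(members) // 2))
                yield [new_id if i in moved else c
                       for i, c in enumerate(labels)]

def quasi_random_walk(labels, local_score, k_max):
    """Greedy walk: keep taking the best improving split/merge move
    until no move improves the objective (a local maximum)."""
    best, best_score = labels, local_score(labels)
    improved = True
    while improved:
        improved = False
        for cand in lattice_neighbors(best, k_max):
            s = local_score(cand)
            if s > best_score:
                best, best_score, improved = cand, s, True
    return best

def icm(assignment, score, k_max, sweeps=3):
    """Iterative Conditional Mode: optimize one combinatorial variable
    at a time, holding the values of all the others fixed."""
    for _ in range(sweeps):
        for name in assignment:
            def local_score(labels, name=name):
                return score({**assignment, name: labels})
            assignment[name] = quasi_random_walk(assignment[name],
                                                 local_score, k_max)
    return assignment
```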
29. Examples of Comraf models
- Simplest case
- Objective: mutual information between the clustering and the observed variable
- This is the Information Bottleneck
- Tishby, Pereira & Bialek, 1999
- More complex case
- Objective: mutual information between the two clusterings (written out below)
- This is Information-theoretic Co-clustering
- Dhillon, Mallela & Modha, 2003
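In standard (assumed) notation, writing $\tilde{X}$ and $\tilde{Y}$ for the cluster variables induced by the clusterings, the two objectives above can be read as the hard-partition variant of the Information Bottleneck and as information-theoretic co-clustering, respectively:

```latex
\max_{\tilde{X}} \; I(\tilde{X}; Y)
\qquad\text{vs.}\qquad
\max_{\tilde{X},\,\tilde{Y}} \; I(\tilde{X}; \tilde{Y})
```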
30. Evaluation methodology
- Clustering evaluation
- Is generally unintuitive
- Is an entire research field
- We use the accuracy measure
- Following Slonim et al. and Dhillon et al.
- Ground truth vs. our results
- Accuracy = (sum over clusters c of the size of the dominant ground-truth class in cluster c) / (total number of data points)
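A minimal sketch of this accuracy measure (names are mine):

```python
from collections import Counter

def clustering_accuracy(cluster_ids, true_labels):
    """For each cluster, count the items belonging to its dominant
    ground-truth class; sum over clusters and divide by the total."""
    correct = 0
    for c in set(cluster_ids):
        in_c = [t for cid, t in zip(cluster_ids, true_labels) if cid == c]
        correct += Counter(in_c).most_common(1)[0][1]
    return correct / len(true_labels)

# Example: two clusters over six documents with three ground-truth classes.
print(clustering_accuracy([0, 0, 0, 1, 1, 1],
                          ["a", "a", "b", "b", "c", "c"]))  # 4/6 ≈ 0.67
```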
31. Datasets
- Three CALO email datasets
- acheyer: 664 messages, 38 folders
- mgervasio: 777 messages, 15 folders
- mgondek: 297 messages, 14 folders
- Two Enron email datasets
- kitchen-l: 4015 messages, 47 folders
- sanders-r: 1188 messages, 30 folders
- The 20 Newsgroups dataset: 19,997 messages
32. Clustering results
33. Comrafs for semi-supervised and transfer learning
34. Observed nodes in Comrafs
- A node is observed if its value is fixed
- In the case of clustering
- We are given a clustering of the set X
- Observed nodes are shaded in the Comraf graph
- The Comraf model remains essentially the same
- The observed node can be labeled data of D
- As well as of something else
35. Semi-supervised clustering
- Intrinsic model
- The observed node is some labeled data of D
- Which forms a natural partitioning
- Represented as an observed node
- Objective: the same sum of pairwise MI, including terms involving the observed node
- The algorithmic setup is the same
- Clearly, without optimizing the observed node
36. Constrained optimization scheme
- A well-established approach to semi-supervised clustering
- Wagstaff & Cardie, 2000
- Must-link (ml) and cannot-link (cl) constraints
- Comraf graph
- Objective
37. Results on email datasets
- Randomly choose 10%, 20%, and 30% of the data to be labeled
- Plot the accuracy on the unlabeled portion
38. Semi-supervised clustering on 20NG
- 69.5±0.7: unsupervised clustering
- We consider 10% of the data as labeled
- 74.8±0.6: constrained Comraf scheme
- 78.9±0.8: intrinsic Comraf scheme
- 3 of 5 runs built well-balanced clusterings
- Where each category is dominant in one cluster
- Such clusterings can be compared with classification
- 80.0±0.6 on the 3 well-balanced clusterings
- 77.2±0.2: SVM on the same data
39. Resistance to noise
- The intrinsic scheme is resistant to noise
- In contrast to the constrained scheme
- Randomly change 10%, 20%, and 30% of the labels
40. Transfer learning
- A variety of possibilities
- Given two datasets (close in content)
- acheyer and mgervasio
- Consider one labeled and the other not
- And vice versa
- We clustered words given the class labels
- And presented the clustering as observed for the other dataset
- 3% improvement on mgervasio
41. Other Comraf applications and open problems
42. Comraf for image clustering
- Multimedia Information Retrieval setup
- Clustering of images, their local features, captions, and words in these captions
- An ordinary Comraf clustering case
- Challenge
- The probability of a feature appearing in an image is difficult to estimate
- Unlike the word / document case
43. Comraf for topic detection
- Given a set of n documents, where k of them are on a certain topic
- While the rest are noise
- Goal: filter out the (n - k) noisy documents
- Two combinatorial RVs
- Over all subsets of documents / words
- Objective
- Minimizing the joint entropy (see the sketch below)
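A minimal sketch of the joint-entropy computation for a candidate pair of subsets, assuming the data is summarized by a document-word count matrix (the matrix and names are made up for illustration):

```python
import numpy as np

def joint_entropy(counts):
    """Joint entropy (in nats) of the empirical distribution given by a
    non-negative co-occurrence count matrix (e.g., documents x words)."""
    p = counts / counts.sum()
    p = p[p > 0]                      # ignore empty cells
    return float(-(p * np.log(p)).sum())

# Candidate subsets of documents / words select a sub-matrix of the counts;
# the topic-detection objective prefers subsets with low joint entropy.
counts = np.array([[5.0, 1.0, 0.0],
                   [4.0, 2.0, 1.0],
                   [0.0, 0.0, 7.0]])
doc_subset, word_subset = [0, 1], [0, 1]
print(joint_entropy(counts[np.ix_(doc_subset, word_subset)]))
```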
44. Comraf for multi-way ranking
- Given a ranking of movies, rank actors and directors
- Each node is distributed over all possible rankings
- n! values
- Challenge: the objective
- It should embrace the notion of order
45. Model selection
- Comrafs are usually small
- Just a few nodes
- We are certain in choosing the nodes, uncertain in choosing the interactions
- Finding a good set of interactions is feasible
- E.g., there are only 38 ways to build a connected graph of 4 nodes (see the check below)
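As a quick sanity check of the "38" figure, here is a short brute-force enumeration (my own sketch) of the connected labeled graphs on 4 nodes:

```python
from itertools import combinations

def is_connected(nodes, edges):
    """Flood-fill from the first node; connected iff every node is reached."""
    seen, stack = {nodes[0]}, [nodes[0]]
    while stack:
        u = stack.pop()
        for a, b in edges:
            for v in ((b,) if a == u else (a,) if b == u else ()):
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
    return len(seen) == len(nodes)

nodes = [0, 1, 2, 3]
possible_edges = list(combinations(nodes, 2))   # 6 candidate edges
count = sum(is_connected(nodes, subset)
            for r in range(len(possible_edges) + 1)
            for subset in combinations(possible_edges, r))
print(count)  # 38 connected labeled graphs on 4 nodes
```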
46. Hierarchy in Comrafs
- Hierarchy naturally emerges in a probabilistic context
- As well as in an object-oriented one
- Proposal: telescopic models
[Figure: telescopic Comraf graph in which high-level nodes A, B, C summarize groups of lower-level nodes D, E, F, G, K, L, so the model is represented via P(A, B, C)]
47. Conclusions
48. Comraf recipe
- Break your problem into concepts
- Represent each concept as a combinatorial RV
- Decide which RV interacts with which
- Build the Comraf graph
- Choose the objective function
- E.g., for each interacting pair, provide the joint distribution of the underlying data, P(X, Y)
- Choose the particular algorithmic setup
- How to traverse the lattice of possible solutions (a usage sketch follows)
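A toy end-to-end illustration of this recipe, reusing the hypothetical helpers sketched earlier (`mutual_information` and `icm`); the matrix, names, and cluster counts are made up for illustration only:

```python
import numpy as np

# Steps 1-4: two concepts (document clustering, word clustering), one interaction.
edges = [("docs", "words")]

# Step 5: empirical joint P(documents, words) as a toy word-count matrix.
counts = np.array([[4.0, 0.0, 1.0],
                   [0.0, 3.0, 2.0]])

def score(assignment):
    """Comraf objective: sum of pairwise MI over the graph edges,
    computed on cluster-level aggregations of the counts."""
    total = 0.0
    for u, v in edges:
        rows, cols = assignment[u], assignment[v]
        joint = np.zeros((max(rows) + 1, max(cols) + 1))
        for i, r in enumerate(rows):
            for j, c in enumerate(cols):
                joint[r, c] += counts[i, j]
        total += mutual_information(joint)
    return total

# Step 6: traverse the lattice with the ICM + split/merge walk sketched above.
assignment = {"docs": [0, 0], "words": [0, 0, 0]}
print(icm(assignment, score, k_max=2))
```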
49. Conclusion
- Comraf is a useful model
- At least for clustering
- But other applications are also being developed
- The model is generic
- The semi-supervised case is straightforward
- Inference is feasible
- A relatively complex algorithm is sub-cubic
- Model selection is possible
50. I thank Mehran Sahami, and
- Ran El-Yaniv and Andrew McCallum
- for some early discussions
- Victor Lavrenko
- for putting this work into its proper context
- Leslie Kaelbling, Polina Golland and Uri Lerner
- for terminology clarifications
- Erik Learned-Miller, Yoram Singer, Carmel Domshlak, and David Mease
- for ongoing discussions
- Hema Raghavan and Charles Sutton
- for comments on the paper draft
- My wife Anna for her constant support