Non-Bayesian Networks
Transcript and Presenter's Notes

1
Non-Bayesian Networks
  • Ron Bekkerman,
  • University of Massachusetts

Joint work with Mehran Sahami (Google)
2
Unsupervised learning
  • Tell me something about the data I have
  • Compact representation
  • Example: clustering
  • Generally, ill-defined
  • Supervised methods are clearly preferable
  • But sometimes inapplicable

3
Semi-(un)supervised learning
  • We have a few labeled examples
  • And many unlabeled
  • The real-world case
  • A closely related problem: transfer learning
  • We know something about something else

[Figure: a few data points labeled red, a few labeled blue, the rest unlabeled]
4
Generative models
  • Very popular for unsupervised learning
  • Example: Latent Dirichlet Allocation
  • Blei, Ng & Jordan, 2003
  • Number of nodes grows with the size of the data

5
Many generative models
  • Are huge
  • Inference is very difficult
  • Model learning is impossible
  • And biased
  • Too many assumptions on the data
  • That may be wrong after all

6
Pros & cons of generative models
  • Pros
  • Visualization
  • Markov property
  • Factorization
  • Cons
  • Size
  • Free parameters
  • Arbitrary choice of distribution families

7
Proposal
  • Symmetric interactions between variables
  • Instead of asymmetric causal relationships
  • Undirected edges

8
Proposal
  • Symmetric interactions between variables
  • Instead of asymmetric causal relationships
  • Undirected edges
  • Keeping track of the model size
  • One random variable per concept
  • Object-oriented approach

9
Proposal
  • Symmetric interactions between variables
  • Instead of asymmetric causal relationships
  • Undirected edges
  • Keeping track of the model size
  • One random variable per concept
  • Object-oriented approach
  • No arbitrarily chosen distribution families

10
Proposal
  • Symmetric interactions between variables
  • Instead of asymmetric causal relationships
  • Undirected edges
  • Keeping track of the model size
  • One random variable per concept
  • Object-oriented approach
  • No arbitrarily chosen distribution families
  • Minimum free parameters

11
Undirected graphical models
  • A.k.a. Markov Random Fields (MRFs)
  • Markov property holds
  • Joint distribution is a product of potential
    functions over cliques:
    $P(x) = \frac{1}{Z} \prod_{c} \psi_c(x_c)$
  • Hammersley-Clifford theorem
  • Potentials $\psi_c$ are arbitrary non-negative functions
  • $Z$ is a normalization factor

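For readers who want to see the factorization concretely, here is a minimal Python sketch (not from the talk; the variables and potentials are purely illustrative) of a joint distribution defined as a normalized product of clique potentials.

```python
# Minimal sketch (not from the talk): an MRF joint distribution as a
# normalized product of clique potentials, P(x) = (1/Z) * prod_c psi_c(x_c).
from itertools import product

values = [0, 1]                      # each variable is binary, for illustration
cliques = [("a", "b"), ("b", "c")]   # two pairwise cliques over variables a, b, c

def psi(clique, assignment):
    """Illustrative potential: reward agreement within a clique."""
    return 2.0 if len({assignment[v] for v in clique}) == 1 else 1.0

def unnormalized(assignment):
    p = 1.0
    for c in cliques:
        p *= psi(c, assignment)
    return p

# Normalization factor Z sums the unnormalized score over all assignments.
variables = ["a", "b", "c"]
Z = sum(unnormalized(dict(zip(variables, vals)))
        for vals in product(values, repeat=len(variables)))

def joint(assignment):
    return unnormalized(assignment) / Z

print(joint({"a": 0, "b": 0, "c": 0}))  # highest probability: all variables agree
```
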
12
Non-Bayesian inference
  • No training phase
  • Potentials are fixed for each clique
  • $Z$ is a constant
  • The maximum likelihood procedure is then a
    maximization of a non-probabilistic objective
    function
  • Defined over cliques of a graphical model

13
Objective function
  • The potential $f(x_i, x_j)$ would measure:
  • Similarity between $x_i$ and $x_j$
  • Distance between $x_i$ and $x_j$
  • A kernel-type score
  • Information $x_i$ and $x_j$ provide on each other
  • Etc.
  • Maximum likelihood:
  • Find the best assignment $x_i^*$ and $x_j^*$
  • Best means similar / close / informative, etc.

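As a concrete illustration of such an objective, a small sketch follows; the pairwise score and the candidate assignments are hypothetical, standing in for whatever similarity, distance, kernel, or information measure is actually chosen.

```python
# Schematic sketch: a non-probabilistic objective defined over the edges
# (cliques) of a graphical model; the "best" assignment maximizes F.
# The score function and the candidate assignments are illustrative only.

edges = [("x1", "x2"), ("x2", "x3")]

def score(a, b):
    # Any symmetric score works here: similarity, negative distance,
    # a kernel value, or an estimate of mutual information.
    return -abs(a - b)

def F(assignment):
    return sum(score(assignment[u], assignment[v]) for u, v in edges)

candidates = [
    {"x1": 0.0, "x2": 0.5, "x3": 1.0},
    {"x1": 0.4, "x2": 0.5, "x3": 0.6},
]
best = max(candidates, key=F)
print(best)   # the assignment whose neighboring values are most similar
```
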
14
Combinatorial Markov Random Fields
15
Combinatorial random variable
  • Discrete random variable defined over a
    combinatorial set
  • Given a set X of n values
  • $X^c$ is defined over an exponentially large set of values
  • Example: lotto 6/49
  • Given a set X of 49 balls, draw 6 balls
  • $X^c$ is defined over all the subsets of size 6
  • $\binom{49}{6} \approx 1.4 \times 10^7$ values

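To make the sizes of these combinatorial sets concrete, a quick sketch (the numbers are illustrative) counting the values in the lotto example and in the clustering and ranking examples of the next two slides:

```python
# Quick check of the sizes of the combinatorial sets mentioned on these slides.
from math import comb, factorial

print(comb(49, 6))     # 13,983,816 possible draws of 6 balls out of 49
print(3 ** 10)         # 59,049 hard assignments of 10 points to k=3 labeled clusters
print(factorial(10))   # 3,628,800 orderings of 10 items (the ranking example)
```
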
16
Example: hard clustering
  • X is your data (n data points)
  • $\tilde{X}$ is a (hard) clustering of X
  • $X^c$ is a random variable over all possible
    clusterings: $k^n$ values (k is the number of clusters)


17
Another example: ranking
  • X is your data
  • $X^c$ is a partial order on the values of X
  • $X^c$ is a random variable over all possible
    orderings
  • n! values


18
Combinatorial MRF (Comraf)
  • An MRF with combinatorial random variables
  • Which are not necessarily all nodes of the MRF
  • Goal:
  • Find the best (most likely) assignment to the
    combinatorial random variables
  • Comraf model: graph G and objective F
  • Challenge:
  • Usually, $P(X^c)$ cannot be explicitly specified
  • No existing inference methods are applicable

19
Properties of Comraf models
  • Compact: one node per concept
  • Such as clusterings of documents, rankings of
    movies, subsets of images, etc.
  • Data-driven: no assumptions on data distributions
  • Only empirical distributions are represented
  • Such as the distribution of words over documents
  • Generic: applicable to many tasks
  • In unsupervised and semi-supervised learning

20
Comrafs for clustering
21
Comrafs for clustering
  • Allows multi-way clustering
  • Example:
  • Clustering of documents, words, authors and titles
  • The set of possible clusterings forms a lattice
  • Where each point is a clustering

22
Clustering objective function
  • Recall that each value of $X^c$ is a clustering
  • Each clustering is a random variable
  • Over its clusters
  • Our objective:
  • The sum of pairwise Mutual Information over the
    edges of the Comraf graph

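A hedged sketch of how one such pairwise mutual-information term could be computed: the cluster-level joint distribution is obtained by aggregating an empirical joint P(X, Y) (for example, word/document co-occurrence) over the two cluster assignments; the objective then sums these terms over the edges of the Comraf graph. The function names and toy data are illustrative, not taken from the talk.

```python
# Hedged sketch: mutual information I(Xc; Yc) between two clusterings,
# computed from an empirical joint P(X, Y) aggregated over the clusters.
import numpy as np

def mutual_information(P):
    """I(A;B) in nats for a joint probability table P[a, b]."""
    Pa = P.sum(axis=1, keepdims=True)
    Pb = P.sum(axis=0, keepdims=True)
    nz = P > 0
    return float((P[nz] * np.log(P[nz] / (Pa @ Pb)[nz])).sum())

def aggregate(P_xy, x_clusters, y_clusters, kx, ky):
    """Sum the cells of P(x, y) into cluster-level cells P(Xc, Yc)."""
    P_c = np.zeros((kx, ky))
    for x in range(P_xy.shape[0]):
        for y in range(P_xy.shape[1]):
            P_c[x_clusters[x], y_clusters[y]] += P_xy[x, y]
    return P_c

# Toy empirical joint over 4 x-values and 4 y-values (sums to 1 overall).
P_xy = np.full((4, 4), 1 / 16)
x_clusters = [0, 0, 1, 1]      # illustrative cluster assignments
y_clusters = [0, 1, 0, 1]
print(mutual_information(aggregate(P_xy, x_clusters, y_clusters, 2, 2)))
```
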
23
Clustering: quasi-random walk
  • For each variable:
  • Start with some clustering
  • Say, (0,0,0)
  • All data points are in cluster 0

24
Clustering: quasi-random walk
  • For each variable:
  • Start with some solution
  • Say, (0,0,0)
  • All data points are in cluster 0
  • Walk on the lattice
  • While maximizing the objective

[Lattice figure: the move to (1,1,0) is a split of a cluster]
25
Clustering: quasi-random walk
  • For each variable:
  • Start with some solution
  • Say, (0,0,0)
  • All data points are in cluster 0
  • Walk on the lattice
  • While maximizing the objective

[Lattice figure: from (1,1,0), a merge of two clusters]
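A schematic sketch of a single step of such a walk, assuming simple split and merge proposals and a placeholder objective; the move definitions are illustrative rather than the exact moves used in the talk.

```python
# Hedged sketch of one quasi-random-walk step on the lattice of clusterings:
# propose a split or a merge and keep it only if the objective improves.
import random

def split_move(clustering):
    """Move half of a randomly chosen cluster into a brand-new cluster."""
    c = dict(clustering)
    target = random.choice(list(set(c.values())))
    members = [p for p, cl in c.items() if cl == target]
    new_label = max(c.values()) + 1
    for p in members[: len(members) // 2]:
        c[p] = new_label
    return c

def merge_move(clustering):
    """Merge two randomly chosen clusters into one."""
    c = dict(clustering)
    labels = list(set(c.values()))
    if len(labels) < 2:
        return c
    a, b = random.sample(labels, 2)
    return {p: (a if cl == b else cl) for p, cl in c.items()}

def walk_step(clustering, objective):
    proposal = random.choice([split_move, merge_move])(clustering)
    return proposal if objective(proposal) > objective(clustering) else clustering
```
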
26
Iterative Conditional Mode
  • Fix values of all but one variable
  • Perform quasi-random walk
  • To a local maximum

27
Iterative Conditional Mode
  • Fix values of all but one variable
  • Perform quasi-random walk
  • To a local maximum
  • Fix this value, move to next variable

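A minimal sketch of this iteration, assuming a local_optimize routine (for instance, repeated calls to the walk_step sketch above) that drives one variable to a local maximum while the others stay fixed:

```python
# Hedged sketch of Iterative Conditional Mode over several combinatorial
# variables: optimize one variable while the rest are held fixed, then move
# on to the next variable, and sweep until nothing changes.
def icm(assignments, objective, local_optimize, sweeps=10):
    """assignments: dict of variable -> current value (e.g. a clustering)."""
    for _ in range(sweeps):
        changed = False
        for var in assignments:
            fixed = {v: x for v, x in assignments.items() if v != var}
            new_value = local_optimize(var, assignments[var], fixed, objective)
            if new_value != assignments[var]:
                assignments[var] = new_value
                changed = True
        if not changed:          # no variable moved: a local maximum
            break
    return assignments
```
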
28
Algorithm summary
29
Examples of Comraf models
  • Simplest case: a clustering $\tilde{X}$ interacting with an
    observed variable Y
  • Objective: $I(\tilde{X}; Y)$
  • This is Information Bottleneck
  • Tishby, Pereira & Bialek, 1999
  • More complex case: two interacting clusterings $\tilde{X}$ and $\tilde{Y}$
  • Objective: $I(\tilde{X}; \tilde{Y})$
  • This is Information-theoretic Co-clustering
  • Dhillon, Mallela & Modha, 2003

30
Evaluation methodology
  • Clustering evaluation:
  • Is generally unintuitive
  • Is an entire research field
  • We use the accuracy measure
  • Following Slonim et al. and Dhillon et al.
  • Ground truth: the original category labels
  • Our results: the produced clusterings

Accuracy: for each cluster c, take the size of the dominant class in c;
sum over clusters and divide by the number of data points
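A small sketch of this accuracy measure, assuming parallel lists of cluster assignments and ground-truth labels:

```python
# Sketch of the accuracy measure described above: each cluster is credited
# with the size of its dominant ground-truth class, normalized by n.
from collections import Counter

def clustering_accuracy(clusters, labels):
    """clusters, labels: parallel lists of cluster ids and true class ids."""
    n = len(labels)
    by_cluster = {}
    for c, y in zip(clusters, labels):
        by_cluster.setdefault(c, []).append(y)
    dominant = sum(Counter(ys).most_common(1)[0][1] for ys in by_cluster.values())
    return dominant / n

print(clustering_accuracy([0, 0, 1, 1, 1], ["a", "a", "b", "b", "a"]))  # 0.8
```
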
31
Datasets
  • Three CALO email datasets:
  • acheyer: 664 messages, 38 folders
  • mgervasio: 777 messages, 15 folders
  • mgondek: 297 messages, 14 folders
  • Two Enron email datasets:
  • kitchen-l: 4015 messages, 47 folders
  • sanders-r: 1188 messages, 30 folders
  • The 20 Newsgroups dataset: 19,997 messages

32
Clustering results
33
Comrafs for semi-supervised and transfer learning
34
Observed nodes in Comrafs
  • A node is observed if its value is fixed
  • In the case of clustering:
  • We are given a clustering of the set X
  • Observed nodes are shaded in the Comraf graph
  • The Comraf model remains essentially the same
  • The observed clustering can be labeled data of D
  • As well as of something else

35
Semi-supervised clustering
  • Intrinsic model:
  • The observed node is some labeled data of D
  • Which forms a natural partitioning
  • Represented as an observed node
  • Objective: the same sum of pairwise Mutual Information,
    now including terms with the observed node
  • The algorithmic setup is the same
  • Clearly, without optimizing the observed node

36
Constrained optimization scheme
  • Well-established approach to
    semi-supervised clustering
  • Wagstaff & Cardie, 2000
  • Must-link (ml) and cannot-link (cl) constraints
  • Comraf graph
  • Objective

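A schematic sketch of one common way to fold such constraints into a clustering objective, via a penalty for violated pairs; this is an assumed formulation for illustration, not necessarily the exact objective used in the talk.

```python
# Schematic sketch (an assumed formulation, not the talk's exact objective):
# penalize the clustering objective for every violated must-link (ml) or
# cannot-link (cl) constraint derived from the labeled examples.
def constraint_penalty(clustering, must_link, cannot_link, weight=1.0):
    violated = 0
    for a, b in must_link:
        if clustering[a] != clustering[b]:
            violated += 1
    for a, b in cannot_link:
        if clustering[a] == clustering[b]:
            violated += 1
    return weight * violated

def constrained_objective(clustering, base_objective, must_link, cannot_link):
    return base_objective(clustering) - constraint_penalty(
        clustering, must_link, cannot_link)
```
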
37
Results on email datasets
  • Randomly choose 10%, 20% and 30% of the data to be
    labeled
  • Plot the accuracy on the unlabeled portion

38
Semi-supervised clustering on 20NG
  • 69.5±0.7: unsupervised clustering
  • We consider 10% of the data as labeled
  • 74.8±0.6: constrained Comraf scheme
  • 78.9±0.8: intrinsic Comraf scheme
  • 3 of 5 runs built well-balanced clusterings
  • Where each category is dominant in one cluster
  • Clustering can then be compared with classification
  • 80.0±0.6 on the 3 well-balanced clusterings
  • 77.2±0.2 for an SVM on the same data

39
Resistance to noise
  • The intrinsic scheme is resistant to noise
  • In contrast to the constrained scheme
  • Randomly change 10%, 20% and 30% of the labels

40
Transfer learning
  • A variety of possibilities
  • Given two datasets (close in content):
  • acheyer and mgervasio
  • Consider one labeled and the other not
  • And vice versa
  • We clustered words given the class labels
  • And presented this clustering as observed
    for the other dataset
  • 3% improvement on mgervasio

41
Other Comraf applications and open problems
42
Comraf for image clustering
  • A multimedia Information Retrieval setup
  • Clustering of images, their
    local features, captions and
    words in these captions
  • An ordinary Comraf clustering case
  • Challenge:
  • The probability of a feature appearing in an image is
    difficult to estimate
  • Unlike the word / document case

43
Comraf for topic detection
  • Given a set of n documents, where k of them are
    on a certain topic
  • While the rest are noise
  • Goal: filter out the (n-k) noisy documents
  • Two combinatorial RVs:
  • Over all subsets of documents / words
  • Objective:
  • Minimizing the joint entropy

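For reference, a small sketch of the joint-entropy quantity being minimized, computed from a joint probability table; the table itself is a toy example, not data from the talk.

```python
# Sketch of the joint-entropy objective mentioned above:
# H(X, Y) = -sum_{x,y} P(x, y) * log P(x, y), to be minimized.
import numpy as np

def joint_entropy(P):
    """P: joint probability table over (document subset, word subset)."""
    nz = P > 0
    return float(-(P[nz] * np.log(P[nz])).sum())

# A peaked joint distribution has lower entropy than a uniform one.
print(joint_entropy(np.array([[0.7, 0.1], [0.1, 0.1]])))   # ~0.94 nats
print(joint_entropy(np.full((2, 2), 0.25)))                 # ~1.39 nats
```
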
44
Comraf for multi-way ranking
  • Given a ranking of movies, rank actors and
    directors
  • Each node is distributed over all possible
    rankings
  • n! values
  • Challenge: the objective
  • Should embrace the notion of order

45
Model selection
  • Comrafs are usually small
  • Just a few nodes
  • We are certain about choosing the nodes, but uncertain
    about choosing the interactions
  • Finding a good set of interactions is feasible
  • E.g. there are only 38 ways to build a
    connected graph on 4 nodes

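A quick sketch that verifies the count by brute force, enumerating all 2^6 = 64 graphs on 4 labeled nodes and testing connectivity:

```python
# Enumerate all graphs on 4 labeled nodes (subsets of the 6 possible edges)
# and count the connected ones; the total is 38.
from itertools import combinations

nodes = range(4)
all_edges = list(combinations(nodes, 2))          # 6 possible edges

def connected(edge_subset):
    adj = {n: set() for n in nodes}
    for a, b in edge_subset:
        adj[a].add(b)
        adj[b].add(a)
    seen, stack = {0}, [0]
    while stack:
        for m in adj[stack.pop()]:
            if m not in seen:
                seen.add(m)
                stack.append(m)
    return len(seen) == len(list(nodes))

count = sum(connected([e for i, e in enumerate(all_edges) if mask >> i & 1])
            for mask in range(1 << len(all_edges)))
print(count)   # 38 connected graphs on 4 nodes
```
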
46
Hierarchy in Comrafs
  • Hierarchy naturally emerges in a probabilistic
    context
  • As well as in an object-oriented one
  • Proposal: telescopic models
  • $P(\cdot) = P(A, B, C)$

[Figure: telescopic Comraf graph with nodes A, B, C, D, E, F, G, K, L]
47
Conclusions
48
Comraf recipe
  • Break your problem into concepts
  • Represent each concept as a combinatorial RV
  • Decide which RVs interact with which
  • Build the Comraf graph
  • Choose the objective function
  • E.g., for each interacting pair ($X^c$, $Y^c$),
    provide the joint distribution of the underlying data,
    P(X,Y)
  • Choose the particular algorithmic setup
  • How to traverse the lattice of possible solutions

49
Conclusion
  • Comraf is a useful model
  • At least for clustering
  • But other applications are also being developed
  • The model is generic
  • Semi-supervised case is straightforward
  • Inference is feasible
  • Relatively complex algorithm is sub-cubic
  • Model selection is possible

50
I thank Mehran Sahami and
  • Ran El-Yaniv and Andrew McCallum
  • for some early discussions
  • Victor Lavrenko
  • for putting this work into its proper context
  • Leslie Kaelbling, Polina Golland and Uri Lerner
  • for terminology clarifications
  • Erik Learned-Miller, Yoram Singer, Carmel
    Domshlak and David Mease
  • for ongoing discussions
  • Hema Raghavan and Charles Sutton
  • for comments on the paper draft
  • My wife Anna for her constant support