Title: Local one-class optimization
1. Local one-class optimization
- Gal Chechik, Stanford
- Joint work with Koby Crammer, The Hebrew University of Jerusalem
2. The one-class problem
- Find a subset of similar/typical samples.
- Formally: find a ball of a given radius (under some metric) that covers as many data points as possible (related to the set-covering problem); see the sketch below.
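In symbols, a minimal sketch of this combinatorial formulation (the notation is mine, not taken from the slides): pick a center w so that a ball of the given radius R around it captures as many samples as possible.

```latex
% Sketch of the ball-cover objective (d is the chosen metric, R the given radius):
\max_{w}\; \Bigl|\{\, x_i \;:\; d(x_i, w) \le R \,\}\Bigr|
```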
3. Motivation I
- Unsupervised setting: sometimes we wish to model a small part of the data and ignore the rest. This happens when many data points are irrelevant.
- Examples:
  - Finding sets of co-expressed genes in a genome-wide experiment: identify the relevant genes out of thousands of irrelevant ones.
  - Finding a set of documents on the same topic in a heterogeneous corpus.
4. Motivation II
- Supervised setting: learning given positive samples only.
- Examples:
  - Protein interactions
  - Intrusion detection applications
- We care about a low false-positive rate.
5. Current approaches
- The problem is often treated as outlier and novelty detection, where most samples are relevant.
- Current approaches use:
  - a convex cost function (Schölkopf 95, Tax and Duin 99, Ben-Hur et al. 2001),
  - a parameter that affects the size or weight of the ball.
- Bias towards the center of mass: when searching for a small ball, the center of the optimal ball lies at the global center of mass, w = argmin_w Σ_x (x - w)², missing the interesting structures (see the numerical illustration below).
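A minimal numerical illustration of this bias, using toy data and a plain squared loss (not the OSU-SVM implementation used later):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: two small Gaussian clusters plus uniform background noise.
cluster_a = rng.normal(loc=[-3.0, 0.0], scale=0.2, size=(30, 2))
cluster_b = rng.normal(loc=[+3.0, 0.0], scale=0.2, size=(30, 2))
noise = rng.uniform(low=-5.0, high=5.0, size=(100, 2))
X = np.vstack([cluster_a, cluster_b, noise])

# Minimizing the convex cost sum_x ||x - w||^2 over all samples
# puts w at the global center of mass ...
w_convex = X.mean(axis=0)
print("center of mass:", w_convex)                # near (0, 0)

# ... which is far from either of the two interesting clusters.
print("cluster A mean:", cluster_a.mean(axis=0))  # near (-3, 0)
print("cluster B mean:", cluster_b.mean(axis=0))  # near (+3, 0)
```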
6. Current approaches
- Example with synthetic data: 2 Gaussians plus a uniform background.
- [Figure: solutions of the convex one-class method (OSU-SVM) vs. the local one-class method on the synthetic data]
7. How do we do it
- A cost function designed for small sets
- A probabilistic approach that allows soft assignment to the set
- Regularized optimization
8. 1. A cost function for small sets
- The case where only a few samples are relevant.
- Use a cost function that is flat for samples not in the set (see the sketch after this list), with two parameters:
  - a divergence measure D_BF
  - a flat cost K
- Indifferent to the position of irrelevant samples.
- Solutions converge to the center of mass when the ball is large.
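A compact way to read the two parameters (my paraphrase of the slide, written in the hard-assignment limit): a sample inside the set pays its divergence from the centroid w, while a sample left out pays the flat cost K regardless of where it lies.

```latex
% Flat cost in the hard-assignment limit (the soft version appears on the next slide):
\mathrm{cost}(x) \;=\; \min\bigl( D_{BF}(w \,\|\, v_x),\; K \bigr)
```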
9. 2. A probabilistic formulation
- We are given m samples in a d-dimensional space or simplex, indexed by x.
- p(x) is the prior distribution over samples.
- c ∈ {TRUE, FALSE} is a random variable that characterizes assignment to the interesting set (the ball).
- p(c|x) reflects our belief that the sample x is interesting.
- The per-sample cost will be D = p(c|x) D_BF(w‖v_x) + (1 - p(c|x)) K, where D_BF is a divergence measure, to be discussed later.
10. 3. Regularized optimization
- The goal: minimize the mean cost plus a regularization term,
    min_{p(c|x), w}  β ⟨D_{BF,K}(w_C ‖ v_x)⟩_{p(c,x)} + I(C;X)
- The first term measures the mean distortion:
    ⟨D_{BF,K}⟩ = Σ_x p(x) [ p(c|x) D_BF(w‖v_x) + (1 - p(c|x)) K ]
- The second term regularizes the compression of the data (it removes information about X):
    I(C;X) = H(X) - H(X|C)
- It pushes for putting many points in the set.
- This target function is not convex (a numeric sketch of the full objective follows below).
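A small numeric sketch of this objective; the variable names and the squared-L2 stand-in for D_BF are mine, not from the slides:

```python
import numpy as np

def one_class_ib_objective(p_c_given_x, v, w, p_x, K, beta,
                           divergence=lambda w, v: np.sum((w - v) ** 2, axis=-1)):
    """Evaluate the (non-convex) one-class objective beta * <D_{BF,K}> + I(C;X).

    p_c_given_x : (m,) probability that each sample belongs to the ball.
    v           : (m, d) sample representations v_x.
    w           : (d,) centroid of the ball.
    p_x         : (m,) prior over samples.
    divergence  : D_BF(w || v_x); squared L2 is used here as a stand-in.
    """
    # Mean distortion: in-ball samples pay D_BF(w || v_x), the rest pay the flat cost K.
    d = divergence(w, v)
    mean_distortion = np.sum(p_x * (p_c_given_x * d + (1.0 - p_c_given_x) * K))

    # Mutual information I(C;X) between the assignment variable and the sample index.
    p_c = np.sum(p_x * p_c_given_x)                      # marginal p(c = in-ball)
    eps = 1e-12
    kl_per_x = (p_c_given_x * np.log((p_c_given_x + eps) / (p_c + eps))
                + (1 - p_c_given_x) * np.log((1 - p_c_given_x + eps) / (1 - p_c + eps)))
    mi = np.sum(p_x * kl_per_x)

    return beta * mean_distortion + mi
```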
11. To solve the problem
- It turns out that for a family of divergence functions, called Bregman divergences, we can analytically describe properties of the optimal solution.
- The proof follows the analysis of the Information Bottleneck method (Tishby, Pereira, Bialek, 1999).
12. Relation to the Information Bottleneck
- IB aims to compress one variable X into T while preserving information about Y; the two goals are combined into a single tradeoff optimization:
    min I(T;X) - β I(T;Y)
- A mathematically equivalent formulation:
    min β ⟨D_KL(w_t ‖ v_x)⟩ + I(T;X)
- where ⟨D_KL⟩ measures the mean distortion between the cluster centroids w_t = p(y|t) and the samples v_x = p(y|x).
13. Bregman divergences
- A Bregman divergence is defined by a convex function F (in our case F(v) = Σ_i f(v_i)); see the sketch after this list.
- Common examples:
  - L2 norm: f(x) = ½ x²
  - Itakura-Saito: f(x) = -log(x)
  - D_KL: f(x) = x log(x)
  - Unnormalized relative entropy: f(x) = x log x - x
- Lemma (convexity of the Bregman ball): the set of points v such that D_BF(v‖w) < R is convex.
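A reference sketch of these divergences, using the standard separable definition D_F(v‖w) = F(v) - F(w) - ⟨∇F(w), v - w⟩, which the slide relies on but does not spell out:

```python
import numpy as np

def bregman_divergence(f, f_prime, v, w):
    """Separable Bregman divergence D_F(v || w) = F(v) - F(w) - <F'(w), v - w>,
    with F(v) = sum_i f(v_i)."""
    return np.sum(f(v) - f(w) - f_prime(w) * (v - w))

# The generating functions f listed above, with their derivatives.
examples = {
    "L2":              (lambda x: 0.5 * x**2,        lambda x: x),          # squared Euclidean
    "Itakura-Saito":   (lambda x: -np.log(x),        lambda x: -1.0 / x),
    "KL":              (lambda x: x * np.log(x),     lambda x: np.log(x) + 1),
    "unnormalized RE": (lambda x: x * np.log(x) - x, lambda x: np.log(x)),
}

v = np.array([0.2, 0.3, 0.5])
w = np.array([0.25, 0.25, 0.5])
for name, (f, f_prime) in examples.items():
    print(name, bregman_divergence(f, f_prime, v, w))
```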
14. Relation to the Information Bottleneck
- The extended IB problem:
    min_{p(t|x), w}  β ⟨D_BF(w_T ‖ v_x)⟩_{p(T,x)} + I(T;X)
- The one-class problem:
    min_{p(c|x), w}  β ⟨D_{BF,K}(w_C ‖ v_x)⟩_{p(c,x)} + I(C;X)
15. Properties of the solution
- One-class solutions obey three fixed-point equations (a reconstruction sketch follows below).
- When β → ∞, the best assignment for x is the one that minimizes its cost.
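The equations themselves are not reproduced in this transcript. By analogy with the standard IB self-consistent equations, they would take roughly the following form, writing "in" for c = TRUE; this is a hedged reconstruction, not the talk's own formulas.

```latex
% Reconstruction by analogy with IB; not verbatim from the talk.
\begin{aligned}
p(\mathrm{in}\mid x) &\propto p(\mathrm{in})\, e^{-\beta D_{BF}(w \,\|\, v_x)}, &
p(\mathrm{out}\mid x) &\propto p(\mathrm{out})\, e^{-\beta K}, \\
p(c) &= \sum_x p(x)\, p(c\mid x), &
w &= \sum_x p(x\mid \mathrm{in})\, v_x .
\end{aligned}
```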
16. The effect of K
- K controls the nature of the solution.
- K is the cost of leaving a point out of the ball.
- Large K → large radius and many points in the set.
- For the L2 norm, K is formally related to the prior of a single Gaussian fit to the subset.
- A full description of the data may require solving for the complete spectrum of K values.
17. Algorithm: One-Class IB
- Adapting the sequential-IB algorithm.
- One-Class IB:
  - Input: set of m points v_x, divergence D_BF, cost K
  - Output: centroid w, assignment p(c|x)
- Optimization method (see the sketch below):
  - Iterate sample-by-sample, trying to modify the status of a single sample.
  - One-step look-ahead: re-fit the model and decide whether to change the assignment of the sample.
  - This uses a simple formula thanks to the nice properties of Bregman divergences.
  - Search in the dual space of samples, rather than the space of parameters w.
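A hard-assignment caricature of this sequential procedure (greedy single-sample flips with a squared-L2 divergence); an illustrative sketch, not the authors' exact implementation:

```python
import numpy as np

def one_class_ib_hard(v, K, n_sweeps=20, seed=0):
    """Greedy sequential optimization of the hard flat-cost objective
    sum_{x in S} ||w - v_x||^2 + K * |not S|, with w the mean of S."""
    rng = np.random.default_rng(seed)
    m = len(v)
    in_set = rng.random(m) < 0.5                  # random initial assignment

    def cost(mask):
        if not mask.any():
            return K * m
        w = v[mask].mean(axis=0)
        return np.sum((v[mask] - w) ** 2) + K * np.sum(~mask)

    best = cost(in_set)
    for _ in range(n_sweeps):
        changed = False
        for i in rng.permutation(m):              # sample-by-sample sweep
            flipped = in_set.copy()
            flipped[i] = ~flipped[i]              # one-step look-ahead: try flipping x_i
            c = cost(flipped)
            if c < best:                          # keep the flip only if it lowers the cost
                in_set, best = flipped, c
                changed = True
        if not changed:
            break                                 # local optimum reached
    w = v[in_set].mean(axis=0) if in_set.any() else v.mean(axis=0)
    return w, in_set
```

Such a greedy search only reaches a local optimum, which is consistent with the experiments below restarting the optimization several times and keeping the lowest-loss solution.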
18. Experiments 1: information retrieval
- Five most frequent categories of Reuters-21578.
- Each document is represented as a multinomial distribution over 2000 terms.
- Experimental setup, for each category:
  - train with half of the positive documents,
  - test with all the remaining documents.
- Compared one-class IB with One-Class Convex, which uses a convex loss function (Crammer & Singer, 2003) and is controlled by a single parameter µ that determines the weight of the class.
19. Experiments 1: information retrieval
- Compare precision-recall performance for a range of K/µ values.
- [Figure: precision vs. recall curves]
20. Experiments 1: information retrieval
- Centroids of the clusters, and their distances from the center of mass.
21. Experiments 2: gene expression
- A typical application: searching for small but interesting sets of genes.
- Genes are represented by their expression profiles across tissues from different patients (Alizadeh 2000, B-cell lymphoma tissues). The data set includes mortality data, which can be used as an objective way to validate the quality of the selected genes.
22. Experiments 2: gene expression
- One-class IB is compared with a one-class SVM (L2).
- For a series of K values, the gene set with the lowest loss was found (10 restarts).
- The selected genes were used for regression against the mortality data.
- [Figure: significance of the regression prediction (p-value), from good to bad]
23. Future work: finding ALL relevant subsets
- Complete characterization of all interesting subsets in the data.
- Assume we have a function that assigns an interest value to each subset; we then search the space of subsets for all local maxima.
- This requires defining locality; a natural measure of locality in subset space is the Hamming distance.
- A complete characterization of the data requires a description over a range of local neighborhoods.
24. Future work: multiple one-class
- Synthetic example: two overlapping Gaussians and uniform background noise.
25. Conclusions
- We focus on one-class learning for cases where a small ball is sought.
- We formalize the problem using the Information Bottleneck framework and derive its formal solutions.
- One-class IB performs well in the regime of small subsets.