Title: Features, Kernels, and Similarity functions
1. Features, Kernels, and Similarity functions
Avrim Blum, Machine Learning Lunch, 03/05/07
2. Suppose you want to...
- use learning to solve some classification problem.
- E.g., given a set of images, learn a rule to distinguish men from women.
- The first thing you need to do is decide what you want as features.
- Or, for algorithms like SVM and Perceptron, you can use a kernel function, which provides an implicit feature space. But then what kernel to use?
- Can theory provide any help or guidance?
3. Plan for this talk
- Discuss a few ways theory might be of help:
- Algorithms designed to do well in large feature spaces when only a small number of features are actually useful. So you can pile a lot on when you don't know much about the domain.
- Kernel functions. The standard theoretical view, plus a new one that may provide more guidance.
- A bridge between the implicit-mapping and similarity-function views. Talk about the quality of a kernel in terms of more tangible properties. [Work with Nina Balcan]
- Combining the above. Using kernels to generate explicit features.
4. A classic conceptual question
- How is it possible to learn anything quickly when there is so much irrelevant information around?
- Must there be some hard-coded focusing mechanism, or can learning handle it?
5. A classic conceptual question
- Let's try a very simple theoretical model.
- Have n boolean features. Labels are + or -.
- 1001101110   +
- 1100111101   +
- 0111010111   -
- Assume the distinction is based on just one feature.
- How many prediction mistakes do you need to make before you've figured out which one it is?
- Can take a majority vote over all possibilities consistent with the data so far. Each mistake crosses off at least half, so O(log n) mistakes total. (See the sketch below.)
- log(n) is good: doubling n only adds 1 more mistake.
- Can't do better (consider log(n) random strings with random labels; whp there is a consistent feature in hindsight).
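A minimal sketch of this majority-vote ("halving") scheme, assuming the target really is one of the n single-feature hypotheses and that examples arrive as (x, label) pairs; the function name and interface are illustrative, not from the talk:

```python
import numpy as np

def halving_single_feature(stream, n):
    """Online majority vote over the n single-feature hypotheses.

    Hypothesis i predicts +1 iff x[i] == 1. After each example we keep only
    the hypotheses consistent with all data seen so far; on any mistake, at
    least half of the surviving hypotheses are eliminated, giving O(log n)
    mistakes total."""
    alive = np.ones(n, dtype=bool)   # hypotheses still consistent with the data
    mistakes = 0
    for x, label in stream:          # x is a 0/1 vector of length n, label is +1/-1
        x = np.asarray(x)
        votes_pos = np.sum(x[alive] == 1)
        pred = 1 if votes_pos >= alive.sum() / 2.0 else -1
        if pred != label:
            mistakes += 1
        # keep only the hypotheses that predicted the correct label on x
        alive &= (x == 1) if label == 1 else (x == 0)
    return mistakes
```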
6. A classic conceptual question
- What about more interesting classes of functions (not just "target = a single feature")?
7. Littlestone's Winnow algorithm [MLJ 1988]
- Motivated by the question: what if the target is an OR of r << n features?
- A majority-vote scheme over all ~n^r possibilities would make O(r log n) mistakes but is totally impractical. Can you do this efficiently?
- Winnow is a simple, efficient algorithm that meets this bound.
- More generally, if there exists an LTF such that
  - positives satisfy w1x1 + w2x2 + ... + wnxn ≥ c,
  - negatives satisfy w1x1 + w2x2 + ... + wnxn ≤ c - γ,   (W = Σi wi)
- then mistakes = O((W/γ)² log n).
- E.g., if the target is a "k of r" function, you get O(r² log n).
- Key point: still only a log dependence on n.
- Example: x = 100101011001101011, target = x4 ∨ x7 ∨ x10.
8. Littlestone's Winnow algorithm [MLJ 1988]
- How does it work? (Balanced version; see the sketch below.)
- Maintain weight vectors w+ and w-.
- Initialize all weights to 1. Classify based on whether w+·x or w-·x is larger. (Take x0 ≡ 1.)
- If you make a mistake on a positive x, then for each xi = 1:
  - wi+ ← (1+ε) wi+,   wi- ← (1-ε) wi-.
- And vice-versa for a mistake on a negative x.
- Other properties:
- Can show this approximates maxent constraints.
- In the other direction, [Ng '04] shows that maxent with L1 regularization gets Winnow-like bounds.
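A minimal sketch of the balanced update just described; the interface and the choice ε = 0.5 are illustrative assumptions, and the dummy coordinate x0 = 1 plays the role of the threshold:

```python
import numpy as np

def balanced_winnow(stream, n, eps=0.5):
    """Balanced Winnow: multiplicative updates on two weight vectors w+ and w-.

    stream yields (x, label) with x a 0/1 vector of length n and label +1/-1."""
    w_pos = np.ones(n + 1)   # all weights start at 1; last coordinate is x0
    w_neg = np.ones(n + 1)
    mistakes = 0
    for x, label in stream:
        xb = np.append(np.asarray(x, dtype=float), 1.0)   # x0 = 1
        pred = 1 if w_pos @ xb >= w_neg @ xb else -1
        if pred != label:
            mistakes += 1
            on = (xb == 1)                                 # coordinates with xi = 1
            if label == 1:   # mistake on a positive example
                w_pos[on] *= (1 + eps)
                w_neg[on] *= (1 - eps)
            else:            # mistake on a negative example (vice-versa)
                w_pos[on] *= (1 - eps)
                w_neg[on] *= (1 + eps)
    return w_pos, w_neg, mistakes
```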
9. Practical issues
- On a batch problem, you may want to cycle through the data several times, each time with a smaller ε.
- Can also do a margin version: update if only just barely correct.
- If you want to output a likelihood, a natural choice is e^(w+·x) / (e^(w+·x) + e^(w-·x)). Can extend to multiclass too. (See the small sketch below.)
- William & Vitor have a paper with some other nice practical adjustments.
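A small sketch of that likelihood rule, assuming the w_pos/w_neg vectors from the balanced-Winnow sketch above; the max-subtraction is just for numerical stability:

```python
import numpy as np

def winnow_likelihood(w_pos, w_neg, x):
    """P(positive | x) = e^(w+.x) / (e^(w+.x) + e^(w-.x))."""
    xb = np.append(np.asarray(x, dtype=float), 1.0)   # same dummy x0 = 1 as above
    s_pos, s_neg = w_pos @ xb, w_neg @ xb
    m = max(s_pos, s_neg)                              # avoid overflow in exp
    return np.exp(s_pos - m) / (np.exp(s_pos - m) + np.exp(s_neg - m))
```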
10. Winnow versus Perceptron/SVM
- Winnow is similar at a high level to Perceptron updates. What's the difference?
- Suppose the data is linearly separable by w·x > 0, with |w·x| ≥ γ.
- For Perceptron, mistakes/samples are bounded by O((L2(w) L2(x) / γ)²).
- For Winnow, mistakes/samples are bounded by O((L1(w) L∞(x) / γ)² log n).
- For boolean features, L∞(x) = 1, while L2(x) can be sqrt(n).
- If the target is sparse and the examples are dense, Winnow is better.
- E.g., x random in {0,1}^n, f(x) = x1: Perceptron makes O(n) mistakes. (See the comparison sketch below.)
- If the target is dense (most features are relevant) and the examples are sparse, then Perceptron wins.
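A small, self-contained experiment sketch for the sparse-target / dense-example case just mentioned. The data-generation setup, sample sizes, and ε are illustrative choices, not from the talk; the script simply counts online mistakes for each algorithm on f(x) = x1:

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 200, 3000
X = rng.integers(0, 2, size=(T, n)).astype(float)   # dense random examples in {0,1}^n
y = np.where(X[:, 0] == 1, 1, -1)                    # sparse target: f(x) = x1

def perceptron_mistakes(X, y):
    Xb = np.hstack([X, np.ones((len(X), 1))])        # bias coordinate
    w, mistakes = np.zeros(Xb.shape[1]), 0
    for x, label in zip(Xb, y):
        if label * (w @ x) <= 0:                     # mistake (or zero score): update
            w += label * x
            mistakes += 1
    return mistakes

def winnow_mistakes(X, y, eps=0.5):
    Xb = np.hstack([X, np.ones((len(X), 1))])        # dummy x0 = 1
    w_pos, w_neg, mistakes = np.ones(Xb.shape[1]), np.ones(Xb.shape[1]), 0
    for x, label in zip(Xb, y):
        if (1 if w_pos @ x >= w_neg @ x else -1) != label:
            mistakes += 1
            on = (x == 1)
            w_pos[on] *= (1 + eps) if label == 1 else (1 - eps)
            w_neg[on] *= (1 - eps) if label == 1 else (1 + eps)
    return mistakes

print("Perceptron mistakes:", perceptron_mistakes(X, y))
print("Winnow mistakes:    ", winnow_mistakes(X, y))
```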
11. OK, now on to kernels
12. Generic problem
- Given a set of images, we want to learn a linear separator to distinguish men from women.
- Problem: the pixel representation is no good.
- One approach:
- Pick a better set of features! But this seems ad hoc.
- Instead:
- Use a kernel! K(x,y) = Φ(x)·Φ(y), where Φ is an implicit, high-dimensional mapping.
- Perceptron/SVM only interact with the data through dot-products, so they can be kernelized. If the data is separable in Φ-space by a large L2 margin, you don't have to pay for it.
13. Kernels
- E.g., the kernel K(x,y) = (1 + x·y)^d for the case of n=2, d=2 corresponds to the implicit mapping written out below.
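The slide presumably showed this mapping as a figure; one standard way to write it out for n=2, d=2 is:

Φ(x1, x2) = (1, x1², x2², √2·x1, √2·x2, √2·x1·x2),   so that   Φ(x)·Φ(y) = (1 + x·y)².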
14. Kernels
- Perceptron/SVM only interact with the data through dot-products, so they can be kernelized (see the sketch below). If the data is separable in Φ-space by a large L2 margin, you don't have to pay for it.
- E.g., K(x,y) = (1 + x·y)^d:
- Φ: (n-diml space) → (~n^d-diml space).
- E.g., K(x,y) = e^(-(x-y)²).
- Conceptual warning: you're not really getting all the power of the high-dimensional space without paying for it. The margin matters.
- E.g., K(x,y) = 1 if x=y, K(x,y) = 0 otherwise. This corresponds to a mapping where every example gets its own coordinate: everything is linearly separable, but there is no generalization.
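To make the "only interacts through dot-products" point concrete, here is a minimal kernelized-Perceptron sketch; the names, the epochs parameter, and the toy polynomial kernel at the end are illustrative assumptions:

```python
import numpy as np

def kernel_perceptron(X, y, K, epochs=5):
    """The weight vector in Phi-space is kept implicitly as a signed sum of
    mapped training points, so only kernel values K(x, x') are ever needed.

    X: (T, n) array; y: labels in {+1, -1}; K: kernel on pairs of rows.
    Returns dual coefficients alpha (one mistake count per training point)."""
    y = np.asarray(y, dtype=float)
    T = len(X)
    alpha = np.zeros(T)
    gram = np.array([[K(X[i], X[j]) for j in range(T)] for i in range(T)])
    for _ in range(epochs):
        for t in range(T):
            score = np.sum(alpha * y * gram[:, t])   # implicit w . Phi(x_t)
            if y[t] * score <= 0:                    # mistake: "add" y_t Phi(x_t) to w
                alpha[t] += 1
    return alpha

def kernel_perceptron_predict(x_new, X, y, alpha, K):
    y = np.asarray(y, dtype=float)
    return np.sign(np.sum(alpha * y * np.array([K(xi, x_new) for xi in X])))

# e.g. the degree-2 polynomial kernel from the previous slide:
poly2 = lambda x, z: (1.0 + x @ z) ** 2
```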
15. Question: do we need the notion of an implicit space to understand what makes a kernel helpful for learning?
16. Focus on the batch setting
- Assume examples are drawn from some probability distribution:
- a distribution D over x, labeled by a target function c,
- or a distribution P over (x, l).
- Will call P (or (c,D)) our learning problem.
- Given labeled training data, we want the algorithm to do well on new data.
17. Something funny about the theory of kernels
- On the one hand, operationally a kernel is just a similarity function K(x,y) ∈ [-1,1], with some extra requirements. [Here I'm scaling to |Φ(x)| ≤ 1.]
- And in practice, people think of a good kernel as a good measure of similarity between data points for the task at hand.
- But the theory talks about margins in the implicit high-dimensional Φ-space: K(x,y) = Φ(x)·Φ(y).
18. "I want to use ML to classify protein structures and I'm trying to decide on a similarity fn to use. Any help?"
"It should be pos. semidefinite, and should result in your data having a large-margin separator in an implicit high-diml space you probably can't even calculate."
19. "Umm... thanks, I guess."
"It should be pos. semidefinite, and should result in your data having a large-margin separator in an implicit high-diml space you probably can't even calculate."
20. Something funny about the theory of kernels
- The theory talks about margins in an implicit high-dimensional Φ-space: K(x,y) = Φ(x)·Φ(y).
- Not great for intuition (do I expect this kernel or that one to work better for me?).
- Can we connect better with the idea of a good kernel being one that is a good notion of similarity for the problem at hand?
- Motivation [BBV]: if the margin is γ in Φ-space, then we can pick Õ(1/γ²) random examples y1, ..., yn ("landmarks") and do the mapping x → [K(x,y1), ..., K(x,yn)]. Whp the data in this space will be approximately linearly separable.
21. Goal: a notion of "good similarity function" that
- talks in terms of more intuitive properties (no implicit high-diml spaces, no requirement of positive-semidefiniteness, etc.);
- if K satisfies these properties for our given problem, then this has implications for learning;
- is broad: it includes the usual notion of a good kernel (one that induces a large-margin separator in Φ-space).
- If so, then this can help with designing the K.
[Recent work with Nina, with extensions by Nati Srebro]
22. Proposal satisfying (1) and (2)
- Say we have a learning problem P (a distribution D over examples, labeled by an unknown target f).
- A sim fn K(x,y) → [-1,1] is (ε,γ)-good for P if at least a 1-ε fraction of examples x satisfy
  E_{y~D}[K(x,y) | l(y)=l(x)] ≥ E_{y~D}[K(x,y) | l(y)≠l(x)] + γ.
- Q: how could you use this to learn?
23. How to use it
- At least a 1-ε prob. mass of x satisfy
  E_{y~D}[K(x,y) | l(y)=l(x)] ≥ E_{y~D}[K(x,y) | l(y)≠l(x)] + γ.
- Draw a set S+ of O((1/γ²) ln(1/δ²)) positive examples.
- Draw a set S- of O((1/γ²) ln(1/δ²)) negative examples.
- Classify x based on which set gives the better average score (see the sketch below).
- Hoeffding: for any given "good" x, the prob. of error over the draw of S+, S- is at most δ².
- So, there is at most a δ chance that our draw is bad on more than a δ fraction of the good x.
- With prob. ≥ 1-δ, the error rate is ≤ ε + δ.
24. But not broad enough
[Figure: example data with two 30° regions marked.]
- K(x,y) = x·y has a good separator but doesn't satisfy the defn. (Half of the positives are more similar to negatives than to a typical positive.)
25. But not broad enough
[Figure: same example.]
- The idea would work if we didn't pick y's from the top-left.
- Broaden to say: OK if there exists a large region R s.t. most x are on average more similar to y ∈ R of the same label than to y ∈ R of the other label (even if we don't know R in advance).
26. Broader defn
- Say K(x,y) → [-1,1] is an (ε,γ)-good similarity function for P if there exists a weighting function w(y) ∈ [0,1] s.t. at least a 1-ε fraction of x satisfy
  E_{y~D}[w(y)K(x,y) | l(y)=l(x)] ≥ E_{y~D}[w(y)K(x,y) | l(y)≠l(x)] + γ.
- Can still use it for learning:
- Draw S+ = {y1, ..., yn}, S- = {z1, ..., zn}, n = Õ(1/γ²).
- Use these to "triangulate" the data:
  x → [K(x,y1), ..., K(x,yn), K(x,z1), ..., K(x,zn)].
- Whp, there exists a good separator in this space: w = [w(y1), ..., w(yn), -w(z1), ..., -w(zn)].
27. Broader defn
- Say K(x,y) → [-1,1] is an (ε,γ)-good similarity function for P if there exists a weighting function w(y) ∈ [0,1] s.t. at least a 1-ε fraction of x satisfy
  E_{y~D}[w(y)K(x,y) | l(y)=l(x)] ≥ E_{y~D}[w(y)K(x,y) | l(y)≠l(x)] + γ.
- So, take a new set of examples, project them into this space, and run your favorite linear-separator learning algorithm.
- Whp, there exists a good separator in this space: w = [w(y1), ..., w(yn), -w(z1), ..., -w(zn)].
[Technically, the bounds are better if we adjust the definition to penalize more heavily the examples that fail the inequality badly.]
28. Broader defn
Algorithm (see the sketch below):
- Draw S+ = {y1, ..., yd}, S- = {z1, ..., zd}, d = O((1/γ²) ln(1/δ²)). Think of these as "landmarks".
- Map x → [K(x,y1), ..., K(x,yd), K(x,z1), ..., K(x,zd)].
- Guarantee: with prob. ≥ 1-δ, there exists a linear separator of error ≤ ε + δ at margin γ/4.
- Actually, the margin is good in both the L1 and L2 senses.
- This particular approach requires "wasting" examples for use as the landmarks, but unlabeled data could be used for this part.
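A minimal sketch of the landmark mapping plus a generic linear learner; scikit-learn's LogisticRegression stands in for "your favorite linear-separator algorithm", and the names and commented usage are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression   # any linear learner would do

def landmark_features(X, S_pos, S_neg, K):
    """Map each x to [K(x,y1),...,K(x,yd), K(x,z1),...,K(x,zd)] using the
    positive landmarks S_pos and negative landmarks S_neg."""
    landmarks = list(S_pos) + list(S_neg)
    return np.array([[K(x, y) for y in landmarks] for x in X])

# Hypothetical usage, given a similarity function K and labeled data (X, y):
#   Z = landmark_features(X, S_pos, S_neg, K)
#   clf = LogisticRegression().fit(Z, y)        # linear separator in the landmark space
#   preds = clf.predict(landmark_features(X_new, S_pos, S_neg, K))
```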
29. Interesting property of the definition
- An (ε,γ)-good kernel (at least a 1-ε fraction of x have margin ≥ γ) is an (ε',γ')-good sim fn under this definition.
- But our current proofs suffer a penalty: ε' = ε + ε_extra, γ' = γ³·ε_extra.
- So, at a qualitative level, we can have a theory of similarity functions that doesn't require implicit spaces.
[Nati Srebro has improved this to γ², which is tight, and extended it to hinge loss.]
30. Approach we're investigating
- [With Nina & Mugizi]
- Take a problem where the original features are already pretty good, plus you have a couple of reasonable similarity functions K1, K2, ...
- Take some unlabeled data as landmarks, and use them to enlarge the feature space: [K1(x,y1), K2(x,y1), K1(x,y2), ...] (see the sketch below).
- Run Winnow on the result.
- Can prove guarantees if some convex combination of the Ki is good.
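A minimal sketch of the feature-space enlargement step; the names and interface are illustrative, and the follow-on learner is only indicated in comments:

```python
import numpy as np

def combined_landmark_features(X, landmarks, sims):
    """Enlarge the feature space with every similarity function evaluated at
    every landmark: x -> [K1(x,y1), K2(x,y1), K1(x,y2), K2(x,y2), ...].

    X: iterable of examples; landmarks: unlabeled examples used as landmarks;
    sims: list of similarity functions [K1, K2, ...]."""
    return np.array([[K(x, y) for y in landmarks for K in sims] for x in X])

# Hypothetical usage: concatenate these columns with the original features and
# run Winnow (e.g. the balanced_winnow sketch above, after binarizing the new
# features) or another L1-style linear learner on the result.
```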
31. Open questions
- This view gives some sufficient conditions for a similarity function to be useful for learning, but it doesn't have direct implications for plugging the function into an SVM, say.
- Can one define other interesting, reasonably intuitive, sufficient conditions for a similarity function to be useful for learning?