Title: Techniques For Exploiting Unlabeled Data
1. Techniques For Exploiting Unlabeled Data
Thesis Proposal
May 11, 2007
Committee:
Avrim Blum, CMU (Co-Chair)
John Lafferty, CMU (Co-Chair)
William Cohen, CMU
Xiaojin (Jerry) Zhu, Wisconsin
2. Motivation
Supervised Machine Learning
Induction: Labeled Examples (xi, yi) → Model: x → y
Problems: document classification, image classification, protein structure determination.
Algorithms: SVM, Neural Nets, Decision Trees, etc.
3. Motivation
In recent years, there has been growing interest in techniques for using unlabeled data:
- More data is being collected than ever before.
- Labeling examples can be expensive and/or require human intervention.
4. Examples
- Images: abundantly available (digital cameras); labeling requires humans (CAPTCHAs).
- Web pages: can easily be crawled on the web; labeling requires human intervention.
- Proteins: sequences can easily be determined; structure determination is a hard problem.
5. Motivation
Semi-Supervised Machine Learning
Labeled Examples (xi, yi) + Unlabeled Examples xi → Model: x → y
6. Motivation
7. However
Techniques are not as well developed as supervised techniques:
- Best practices for using unlabeled data.
- Techniques for adapting supervised algorithms to semi-supervised algorithms.
8. Outline
Motivation
Randomized Graph Mincut
Local Linear Semi-supervised Regression
Learning with Similarity Functions
Proposed Work and Time Line
9. Graph Mincut (Blum & Chawla, 2001)
10. Construct an (unweighted) Graph
11. Add auxiliary super-nodes
12. Obtain s-t mincut
13. Classification
14. Plain mincut can give very unbalanced cuts.
15. Add random weights to the edges:
- Run plain mincut and obtain a classification.
- Repeat the above process several times.
- For each unlabeled example, take a majority vote (see the sketch below).
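The following is a minimal sketch of this randomized mincut procedure using NetworkX. It assumes the graph G already contains the auxiliary source "v+" (tied to positively labeled nodes) and sink "v-" (tied to negatively labeled nodes); the node names, noise scale, and number of runs are illustrative choices, not the exact settings used in the experiments.

```python
# A minimal sketch of randomized mincut, assuming G already contains the
# auxiliary super-nodes "v+" and "v-" from the construction above.
import random
from collections import Counter
import networkx as nx

def randomized_mincut(G, source="v+", sink="v-", runs=20, noise=1.0):
    votes = Counter()
    for _ in range(runs):
        H = G.copy()
        for u, v in H.edges():
            # perturb each (unit-weight) edge with a small random amount
            H[u][v]["capacity"] = 1.0 + noise * random.random()
        _, (s_side, _t_side) = nx.minimum_cut(H, source, sink)
        for node in s_side:
            votes[node] += 1
    # majority vote: positive if the node fell on the source side
    # in more than half of the randomized cuts
    return {n: (+1 if votes[n] > runs / 2 else -1)
            for n in G.nodes() if n not in (source, sink)}
```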
16. Before adding random weights
17. After adding random weights
18. PAC-Bayes
- PAC-Bayes bounds suggest that when the graph has many small cuts consistent with the labeling, randomization should improve generalization performance.
- In this case each distinct cut corresponds to a different hypothesis.
- Hence the average of these cuts will be less likely to overfit than any single cut.
19. Markov Random Fields
- Ideally we would like to assign a weight to each cut in the graph (a higher weight to small cuts) and then take a weighted vote over all the cuts in the graph.
- This corresponds to a Markov Random Field model.
- We don't know how to do this efficiently, but we can view randomized mincut as an approximation.
20. How to construct the graph?
- k-NN:
  - Graph may not have small balanced cuts.
  - How to learn k?
- Connect all points within distance d:
  - Can have disconnected components.
  - How to learn d?
- Minimum Spanning Tree (a sketch follows below):
  - No parameters to learn.
  - Gives a connected, sparse graph.
  - Seems to work well on most datasets.
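As one illustration, here is a small sketch of the minimum-spanning-tree construction using SciPy; the Euclidean metric and variable names are assumptions for the example, and X is taken to be an (n, d) array of examples.

```python
# A sketch of the MST graph construction over pairwise Euclidean distances.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def mst_graph(X):
    D = squareform(pdist(X, metric="euclidean"))   # pairwise distance matrix
    T = minimum_spanning_tree(D)                   # sparse matrix of MST edges
    W = T.toarray()
    return np.maximum(W, W.T) > 0                  # symmetric boolean adjacency
```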
21. Experiments
- ONE vs. TWO: 1128 examples (8 x 8 array of integers, Euclidean distance).
- ODD vs. EVEN: 4000 examples (16 x 16 array of integers, Euclidean distance).
- PC vs. MAC: 1943 examples (20 newsgroups dataset, TFIDF distance).
22. ONE vs. TWO
23. ODD vs. EVEN
24. PC vs. MAC
25. Summary
- Randomization helps plain mincut achieve performance comparable to Gaussian Fields.
- We can apply PAC sample complexity analysis and interpret it in terms of Markov Random Fields.
- There is an intuitive interpretation of the confidence of a prediction in terms of the margin of the vote.
- "Semi-supervised Learning Using Randomized Mincuts", A. Blum, J. Lafferty, M.R. Rwebangira, R. Reddy, ICML 2004.
26. Outline
Motivation
Randomized Graph Mincut
Local Linear Semi-supervised Regression
Learning with Similarity Functions
Proposed Work and Time Line
27. Gaussian Fields (Zhu, Ghahramani & Lafferty)
This algorithm minimizes the following functional:
Φ(f) = Σ_ij w_ij (f_i - f_j)^2
where w_ij is the similarity between examples i and j, and f_i and f_j are the predictions for examples i and j (a minimal sketch of the resulting solution follows below).
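A minimal sketch of minimizing this functional in closed form, using the standard harmonic-function solution of Zhu, Ghahramani & Lafferty with the labeled values clamped; the ordering convention (labeled points first) and variable names are assumptions for the illustration.

```python
# Harmonic-function solution: L_uu f_u = W_ul y_l, with labeled points first.
import numpy as np

def gaussian_fields(W, y_l):
    y_l = np.asarray(y_l, dtype=float)
    l = len(y_l)
    D = np.diag(W.sum(axis=1))
    L = D - W                                  # graph Laplacian
    L_uu = L[l:, l:]                           # unlabeled-unlabeled block
    W_ul = W[l:, :l]                           # unlabeled-labeled block
    f_u = np.linalg.solve(L_uu, W_ul @ y_l)    # predictions on unlabeled points
    return f_u
```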
28. Locally Constant (Kernel Regression)
29. Locally Linear
30. Local Linear Regression
This algorithm minimizes the following functional:
Φ(β) = Σ_i w_i (y_i - β^T X_xi)^2
where w_i is the similarity between example i and the query point x, and β is the coefficient vector of the local linear fit at x (a minimal sketch follows below).
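For concreteness, a small sketch of a weighted local linear fit at a single query point x; the Gaussian weighting and the bandwidth are illustrative assumptions, not necessarily the weights used in the experiments reported here.

```python
# Weighted least squares centered at x; the prediction at x is the intercept.
import numpy as np

def local_linear_predict(X, y, x, bandwidth=1.0):
    # augment with a constant so beta = (intercept, slopes) at x
    Xc = np.hstack([np.ones((len(X), 1)), X - x])
    w = np.exp(-np.sum((X - x) ** 2, axis=1) / (2 * bandwidth ** 2))
    A = Xc.T @ (w[:, None] * Xc)
    b = Xc.T @ (w * y)
    beta = np.linalg.solve(A, b)
    return beta[0]                  # fitted value at x
```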
31. Problem
Develop a local linear version of Gaussian Fields, or a semi-supervised version of Local Linear Regression:
Local Linear Semi-supervised Regression (LLSR).
32. Local Linear Semi-supervised Regression
[Figure: local linear fits β_i at x_i and β_j at x_j; the penalty (β_i0 - X_ji^T β_j)^2 measures how well the fit at x_j predicts the fitted value β_i0 at x_i.]
33. Local Linear Semi-supervised Regression
This algorithm minimizes the following functional:
Φ(β) = Σ_ij w_ij (β_i0 - X_ji^T β_j)^2
where w_ij is the similarity between x_i and x_j.
34. Synthetic Data: Doppler
Doppler function: y = (1/x) sin(15/x)
σ^2 = 0.1 (noise)
35. Experimental Results: DOPPLER
Weighted Kernel Regression: LOOCV MSE = 6.54, MSE = 25.7
36. Experimental Results: DOPPLER
Local Linear Regression: LOOCV MSE = 80.8, MSE = 14.4
37. Experimental Results: DOPPLER
LLSR: LOOCV MSE = 2.00, MSE = 7.99
38. PROBLEM: RUNNING TIME
If the number of examples is n and the dimension of the examples is d, then we have to invert an n(d+1) x n(d+1) matrix.
This is prohibitively expensive, especially if d is large.
39. PROPOSED WORK: Improving Running Time
- Sparsification: ignore examples which are far away so as to get a sparser matrix to invert.
- Iterative methods for solving linear systems: for a matrix equation Ax = b, we can obtain successive approximations x1, x2, ..., xk. This can be significantly faster if the matrix A is sparse (a sketch follows below).
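A sketch of combining sparsification with an iterative solver, assuming a simple thresholding rule and SciPy's conjugate gradient routine (which requires A to be symmetric positive definite); the threshold and names are illustrative assumptions.

```python
# Threshold small entries, then solve the sparse system iteratively.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import cg

def sparse_iterative_solve(A, b, threshold=1e-3, x0=None):
    A_sparse = csr_matrix(np.where(np.abs(A) > threshold, A, 0.0))
    x, info = cg(A_sparse, b, x0=x0)    # info == 0 indicates convergence
    return x
```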
40. PROPOSED WORK: Improving Running Time
Power series: use the identity (I - A)^-1 = I + A + A^2 + A^3 + ...
y = (Q + γΔ)^-1 Py = Q^-1 Py + (-γ Q^-1 Δ) Q^-1 Py + (-γ Q^-1 Δ)^2 Q^-1 Py + ...
A few terms may be sufficient to get a good approximation.
Compute the supervised answer first, then smooth the answer to get the semi-supervised solution. This can be combined with iterative methods, since we can use the supervised solution as the starting point for our iterative algorithm (a sketch follows below).
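A minimal sketch of truncating this power series, assuming the system has the reconstructed form (Q + γΔ) y = Pb with Q easy to solve against (the supervised part); the symbols Q, Δ, γ, and Pb mirror that assumption and are not taken from a fixed implementation.

```python
# Accumulate the terms (-gamma * Q^{-1} Delta)^k Q^{-1} Pb, k = 0, 1, ...
import numpy as np

def power_series_solve(Q, Delta, Pb, gamma=1.0, terms=5):
    y = np.linalg.solve(Q, Pb)     # supervised solution, zeroth-order term
    term = y.copy()
    for _ in range(terms - 1):
        # next term: (-gamma * Q^{-1} Delta) applied to the previous term
        term = -gamma * np.linalg.solve(Q, Delta @ term)
        y += term
    return y
```

With terms=1 this returns the purely supervised answer, which is also a natural starting point for an iterative solver.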
41. PROPOSED WORK: Experimental Evaluation
- Comparison against other proposed semi-supervised regression algorithms.
- Evaluation on a large variety of data sets, especially high-dimensional ones.
42. Outline
Motivation
Randomized Graph Mincut
Local Linear Semi-supervised Regression
Learning with Similarity Functions
Proposed Work and Time Line
43. Kernels
K(x, y) = Φ(x) · Φ(y)
Allows us to implicitly project non-linearly separable data into a high-dimensional space where a linear separator can be found.
A kernel must satisfy strict mathematical requirements:
1. Continuous
2. Symmetric
3. Positive semi-definite
44. Generic Similarity Functions
What if the best similarity function in a given domain does not satisfy the properties of a kernel?
Two options:
1. Use a kernel with inferior performance.
2. Try to coerce the similarity function into a kernel by building a kernel that has similar behavior.
There is another way...
45. The Balcan-Blum approach
Recently, Balcan and Blum initiated a theory of learning with generic similarity functions.
They gave a general definition of a good similarity function for learning and showed that the popular large-margin kernels are a special case of their definition.
They also gave an algorithm for learning with good similarity functions.
Their approach makes use of unlabeled data.
46. The Balcan-Blum approach
The algorithm is very simple. Suppose S(x, y) is our similarity function. Then:
1. Draw d examples x1, x2, x3, ..., xd uniformly at random from the data set.
2. For each example x, compute the mapping x → (S(x, x1), S(x, x2), S(x, x3), ..., S(x, xd)) (see the sketch below).
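A small sketch of this mapping; the similarity function S, the number of landmarks d, and drawing the landmarks from the pool X itself are assumptions for the illustration.

```python
# Map each example to its similarities with d randomly drawn examples.
import numpy as np

def similarity_features(X, S, d, rng=np.random.default_rng(0)):
    landmarks = X[rng.choice(len(X), size=d, replace=False)]
    # new representation: phi(x) = (S(x, x1), ..., S(x, xd))
    features = np.array([[S(x, xi) for xi in landmarks] for x in X])
    return features, landmarks
```

A standard linear learner (e.g. a linear SVM or Winnow) can then be trained on the transformed features.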
47. Synthetic Data: Circle
48. Experimental Results: Circle
49. PROPOSED WORK
Overall goal: investigate the practical applicability of this theory and find out what is needed to make it work on real problems.
Two main application areas:
1. Domains which have expert-defined similarity functions that are not kernels (protein homology).
2. Domains which have many irrelevant features and in which the data may not be linearly separable in the original features (text classification).
50. PROPOSED WORK: Protein Homology
The Smith-Waterman score is the best-performing measure of similarity, but it does not satisfy the kernel properties.
Machine learning applications have either used other similarity functions or tried to force the SW score into a kernel.
Can we achieve better performance by using the SW score directly?
51. PROPOSED WORK: Text Classification
The most popular technique is Bag-of-Words (BOW), where each document is converted into a vector and each position in the vector indicates how many times each word occurred (a sketch follows below).
The vectors tend to be sparse and there will be many irrelevant features, hence this is well suited to the Winnow algorithm. Our approach makes the Winnow algorithm more powerful.
Within this framework we have strong motivation for investigating domain-specific similarity functions, e.g. edit distance between documents instead of cosine similarity.
Can we achieve better performance than current techniques using domain-specific similarity functions?
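A quick illustration of the bag-of-words representation itself (not of the proposed similarity-function approach), using scikit-learn's CountVectorizer; the two toy documents are made up for the example.

```python
# Convert toy documents into sparse word-count vectors.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog chased the cat"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)          # sparse document-term count matrix
print(vectorizer.get_feature_names_out())
print(X.toarray())
```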
52. PROPOSED WORK: Domain-Specific Similarity Functions
As mentioned in the previous two slides, designing specific similarity functions for each domain is well motivated in this approach.
What are the best-practice principles for designing domain-specific similarity functions?
In what circumstances are domain-specific similarity functions likely to be most useful?
We will answer these questions by generalizing from several different datasets and systematically noting what seems to work best.
53. Proposed Work and Time Line
- Summer 2007: Speeding up LLSR; learning with similarity in the protein homology and text classification domains.
- Fall 2007: Comparison of LLSR with other semi-supervised regression algorithms; investigate principles of domain-specific similarity functions.
- Spring 2008: Start writing thesis.
- Summer 2008: Finish writing thesis.
54. Back-Up Slides
55. References
- A. Blum, J. Lafferty, M.R. Rwebangira, R. Reddy, "Semi-supervised Learning Using Randomized Mincuts", ICML 2004.
56. My Work
- Techniques for improving graph mincut algorithms for semi-supervised classification.
- Techniques for extending Local Linear Regression to the semi-supervised setting.
- Practical techniques for using unlabeled data and generic similarity functions to kernelize the Winnow algorithm.
57. There may be several minimum cuts in the graph.
Indeed, there are potentially exponentially many minimum cuts in the graph.
58. Real Data: CO2
Carbon dioxide concentration in the atmosphere over the last two centuries.
Source: World Watch Institute
59. Experimental Results: CO2
Weighted Kernel Regression: MSE = 660
60. Experimental Results: CO2
Local Linear Regression: MSE = 144
61. Experimental Results: CO2
LLSR: MSE = 97.4
62. Winnow
A linear separator algorithm, first proposed by Littlestone.
We are particularly interested in Winnow because:
1. It is known to be able to learn effectively in the presence of irrelevant attributes. Since we will be creating many new features, we expect many of them will be irrelevant.
2. It is fast and does not require a lot of memory. Since we hope to use large amounts of unlabeled data, scalability is an important consideration (a minimal sketch follows below).
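For reference, a minimal sketch of Littlestone's Winnow with the standard multiplicative update; the n/2 threshold, the promotion factor of 2, and the 0/1 label encoding are the usual textbook choices, shown only as an illustration.

```python
# Winnow: promote/demote active features multiplicatively on mistakes.
import numpy as np

def winnow_train(X, y, epochs=10, alpha=2.0):
    n_features = X.shape[1]
    w = np.ones(n_features)
    theta = n_features / 2.0                 # standard threshold
    for _ in range(epochs):
        for x, label in zip(X, y):           # x in {0,1}^d, label in {0,1}
            pred = 1 if w @ x >= theta else 0
            if pred == 0 and label == 1:
                w[x == 1] *= alpha           # promote active features
            elif pred == 1 and label == 0:
                w[x == 1] /= alpha           # demote active features
    return w, theta
```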
63. Synthetic Data: Blobs and Lines
Can we create a data set that needs BOTH the original and the new features to do well?
To answer this we create a data set we will call Blobs and Lines.
We generate the data in the following way (see the sketch below):
1. We select k points to be the centers of our blobs and assign them labels in {-1, +1}.
2. We flip a coin.
3. If heads, then we set x to be a random boolean vector of dimension d and set the label to be the first coordinate of x.
4. If tails, we pick one of the centers, flip r bits, set x equal to that, and set the label to the label of the center.
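A sketch of this generator following the four steps above; the specific values of k, d, and r, the fair coin, and mapping the first coordinate to a +/-1 label are illustrative assumptions.

```python
# Generate the "Blobs and Lines" data set described above.
import numpy as np

def blobs_and_lines(n, k=4, d=20, r=2, rng=np.random.default_rng(0)):
    centers = rng.integers(0, 2, size=(k, d))
    center_labels = rng.choice([-1, 1], size=k)
    X, y = [], []
    for _ in range(n):
        if rng.random() < 0.5:                         # "lines" part
            x = rng.integers(0, 2, size=d)
            label = 1 if x[0] == 1 else -1             # label from first coordinate
        else:                                          # "blobs" part
            idx = rng.integers(k)
            x = centers[idx].copy()
            flip = rng.choice(d, size=r, replace=False)
            x[flip] = 1 - x[flip]                      # flip r random bits
            label = int(center_labels[idx])
        X.append(x)
        y.append(label)
    return np.array(X), np.array(y)
```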
64. Synthetic Data: Blobs and Lines
65. Experimental Results: Blobs and Lines