Data Mining and Machine Learning via Support Vector Machines presentation

About This Presentation

Slides: 57

Provided by: musi4

Learn more at: https://www.d.umn.edu

Transcript and Presenter's Notes

1
Data Mining and Machine Learning via Support
Vector Machines

Dave Musicant

Graphic generated with Lucent Technologies Demonstration
2-D Pattern Recognition Applet at
http://svm.research.bell-labs.com/SVT/SVMsvt.html
2
Outline

The Supervised Learning Classification Problem
The Support Vector Machine for Classification
(linear approaches)
Nonlinear SVM approaches
Active learning techniques for SVMs
Iterative algorithms for solving SVMs
SVM Regression
Wrapup

3
Basic Definitions

Data Mining
"The nontrivial process of identifying valid, novel,
potentially useful, and ultimately understandable
patterns in data." -- Usama Fayyad
Utilizes techniques from machine learning,
databases, and statistics
Machine Learning
"...concerned with the question of how to construct
computer programs that automatically improve with
experience." -- Tom Mitchell
Fits under the Artificial Intelligence umbrella

4
Supervised Learning Classification

Example: Cancer diagnosis

Training Set

Use this training set to learn how to classify
patients where diagnosis is not known

Test Set
Input Data
Classification

The input data is often easily obtained, whereas
the classification is not.

5
Classification Problem

Goal: Use the training set and some learning method to
produce a predictive model.
Use this predictive model to classify new data.
Sample applications

6
Application: Breast Cancer Diagnosis
Research by Mangasarian, Street, Wolberg
7
Breast Cancer Diagnosis: Separation
Research by Mangasarian, Street, Wolberg
8
Application: Document Classification

The Federalist Papers
Written in 1787-1788 by Alexander Hamilton, John
Jay, and James Madison to persuade residents of
the State of New York to ratify the U.S.
Constitution
All written under the pseudonym Publius
Who wrote which of them?
Hamilton wrote 56 papers
Madison wrote 50 papers
12 disputed papers, generally understood to be
written by Hamilton or Madison, but not known
which

Research by Bosch, Smith
9
Federalist Papers Classification
Research by Bosch, Smith
Graphic by Fung
10
Application: Face Detection

Training data is a collection of Faces and
NonFaces
Rotation and Mirroring added in to provide
robustness

Image obtained from work by Osuna, Freund, and
Girosi at
http://www.ai.mit.edu/projects/cbcl/res-area/object-detection/face-detection.html
11
Face Detection Results
Image obtained from "Support Vector Machines:
Training and Applications" by Osuna, Freund, and
Girosi.
12
Face Detection Results
Image obtained from work by Osuna, Freund, and
Girosi at
http://www.ai.mit.edu/projects/cbcl/res-area/object-detection/face-detection.html
13
Simple Linear Perceptron
Class -1
Class 1

Goal: Find the best line (or hyperplane) to
separate the training data. How to formalize?
In two dimensions, the equation of the line is given
by

Better notation for n dimensions: treat each data
point and the coefficients as vectors. Then the
equation is given by
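
The formulas on this slide did not carry over into the transcript. A standard way to write them is sketched below. It assumes a weight vector w and an offset b (the variables named on the later primal/dual slide) and the sign convention w · x = b, which is an assumption rather than something the transcript confirms.

```latex
% Two dimensions: a line with coefficients w_1, w_2 and offset b
\[ w_1 x_1 + w_2 x_2 = b \]
% n dimensions: the same equation with w and x treated as vectors
\[ w \cdot x = \sum_{j=1}^{n} w_j x_j = b \]
```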

14
Simple Linear Perceptron (cont.)

The Simple Linear Perceptron is a classifier as
shown in the picture
Points that fall on the right are classified as
1
Points that fall on the left are classified as
-1
Therefore using the training set, find a
hyperplane (line) so that

This is a good starting point. But we can do
better!
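
The classification rule described on this slide can be sketched as follows. This is a minimal illustration, assuming the plane is written as w · x = b; the function names and the toy data are hypothetical and not taken from the presentation.

```python
def classify(w, b, x):
    """Return 1 if x lies on the positive side of the plane w.x = b, else -1."""
    score = sum(wj * xj for wj, xj in zip(w, x)) - b
    return 1 if score > 0 else -1


def separates(w, b, points, labels):
    """True if the plane classifies every training point correctly."""
    return all(classify(w, b, x) == y for x, y in zip(points, labels))


# Toy data: class 1 lies to the right of the vertical line x1 = 2,
# class -1 lies to the left of it.
points = [(3.0, 1.0), (4.0, -1.0), (0.0, 0.5), (1.0, 2.0)]
labels = [1, 1, -1, -1]
print(separates((1.0, 0.0), 2.0, points, labels))
```

Any plane for which `separates` returns true is an acceptable perceptron solution, which is why a further criterion is needed to pick the best one.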

Class -1
Class 1
15
Finding the Best Plane

Not all planes are equal. Which of the two
following planes shown is better?

Both planes accurately classify the training set.
The solid green plane is the better choice, since
it is more likely to do well on future test data.
The solid green plane is further away from the
data.

16
Separating the planes

Construct the bounding planes
Draw two parallel planes to the classification
plane.
Push them as far apart as possible, until they
hit data points.
The classification plane with bounding planes
furthest apart is the best one.

Class -1
Class 1
17
Recap: Finding the Best Plane

Details
All points in class 1 should be to the right of
bounding plane 1.
All points in class -1 should be to the left of
bounding plane -1.
Pick yi to be 1 or -1 depending on the
classification. Then the above two inequalities
can be written as one
The distance between bounding planes should be
maximized.
The distance between bounding planes is given by
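
The inequalities and the distance formula are missing from the transcript. Under the usual normalization, and assuming the classification plane is w · x = b with bounding planes offset by ±1 (an assumed convention), they would read:

```latex
% Points of each class lie on the outer side of their bounding plane
\[ w \cdot x_i \ge b + 1 \ \text{ for } y_i = 1, \qquad
   w \cdot x_i \le b - 1 \ \text{ for } y_i = -1 \]
% Both conditions combined into one inequality using the labels y_i
\[ y_i \, (w \cdot x_i - b) \ge 1, \qquad i = 1, \dots, m \]
% Distance between the planes w.x = b + 1 and w.x = b - 1
\[ \frac{2}{\lVert w \rVert_2} \]
```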

Class -1
Class 1
18
The Optimization Problem

The previous slide can be rewritten as

This is a mathematical program.
Optimization problem subject to constraints
More specifically, this is a quadratic program
There are high powered software tools for solving
this kind of problem (both commercial and
academic)
These general purpose tools are slow for this
particular problem
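
The program itself is not in the transcript. A common textbook form, assuming the plane w · x = b and the 2-norm of w, is given below; maximizing the distance 2/‖w‖ is equivalent to minimizing ‖w‖²/2, which gives a quadratic objective with linear constraints.

```latex
\[
\min_{w,\, b} \ \tfrac{1}{2} \lVert w \rVert_2^2
\quad \text{subject to} \quad
y_i \, (w \cdot x_i - b) \ge 1, \qquad i = 1, \dots, m
\]
```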

19
Data Which is Not Linearly Separable

What if a separating plane does not exist?

Find the plane that maximizes the margin and
minimizes the errors on the training points.
Take original inequality and add a slack variable
to measure error

20
The Support Vector Machine

Push the planes apart and minimize the error at
the same time

C is a positive number that is chosen to balance
these two goals.

This problem is called a Support Vector Machine,
or SVM.
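
The optimization problem on this slide is missing from the transcript. A standard soft-margin form is sketched here, assuming the plane w · x = b, slack variables written as ξ_i, and a 1-norm penalty on the slack; the presentation may use a different norm or notation.

```latex
\[
\min_{w,\, b,\, \xi} \ \tfrac{1}{2} \lVert w \rVert_2^2 + C \sum_{i=1}^{m} \xi_i
\quad \text{subject to} \quad
y_i \, (w \cdot x_i - b) + \xi_i \ge 1, \quad \xi_i \ge 0, \qquad i = 1, \dots, m
\]
```

A large C puts the emphasis on fitting the training points, while a small C puts it on a wide margin.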

21
Terminology

Those points that touch the bounding plane, or
lie on the wrong side, are called support vectors.

If all the data points except the support vectors
were removed, the solution would turn out the
same.
The SVM is mathematically equivalent to force and
torque equilibrium (hence the name support
vectors).

22
Example from Carleton College

1850 students
4 year undergraduate liberal arts college
Ranked 5th in the nation by US News and World
Report
15-20 computer science majors per year
All research assistants are full-time
undergraduates

23
Student Research Example

Goal: Automatically generate a frequently asked
questions list from discussion groups
Subgoal 1: Given a corpus of discussion group
postings, identify those messages that contain
questions
Recruit student volunteers to identify questions
Learn classification
Work by students: Sarah Allen, Janet Campbell,
Ester Gubbrud, Rachel Kirby, Lillie Kittredge

24
Building A Training Set
25
Building A Training Set

Which sentences are questions in the following
text?

From: oehler_at_yar.cs.wisc.edu (Wonko the Sane)
I was recently talking to a possible employer
(mine! :-) ) and he made a reference to a 48-bit
graphics computer/image processing system. I seem
to remember it being called IMAGE or something
akin to that. Anyway, he claimed it had 48-bit
color + a 12-bit alpha channel. That's 60 bits of
info--what could that possibly be for?
Specifically the 48-bit color? That's 280
trillion colors, many more than the human eye can
resolve. Is this an anti-aliasing thing? Or is
this just some magic number to make it work
better with a certain processor.
26
Representing the training set

Each document is a point
Each potential word is a column (bag of words)
Other pre-processing tricks
Remove punctuation
Remove "stop words" such as "is", "a", etc.
Use stemming to remove "ing" and "ed", etc. from
similar words
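
The representation above can be sketched in a few lines. This is only an illustration: the stop-word list and the suffix-stripping "stemmer" are simplified stand-ins chosen for this example, not the preprocessing the students actually used.

```python
import re
from collections import Counter

STOP_WORDS = {"is", "a", "the", "to", "of", "and"}  # illustrative, not a standard list


def crude_stem(word):
    """Strip a trailing 'ing' or 'ed'; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Lowercase, drop punctuation, remove stop words, then stem."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(w) for w in words if w not in STOP_WORDS]


def bag_of_words(documents):
    """Return (vocabulary, rows): one column per word, one row per document."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, rows


docs = ["Is this an anti-aliasing thing?", "He claimed it had 48-bit color."]
vocab, rows = bag_of_words(docs)
print(vocab)
print(rows)
```

Each row is the point that represents one document, so the rows can be handed directly to a linear classifier.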

27
Results

If you just guess brain-dead "every message
contains a question", get 55% right
If you use a Support Vector Machine, get 66.5% of
them right
What words do you think were strong indicators of
questions?
anyone, does, any, what, thanks, how, help, know,
there, do, question
What words do you think were strong
contra-indicators of questions?
re, sale, m, references, not, your

28
Beyond lines

Some datasets may not be best separated by a
plane.
SVMs can be extended to nonlinear surfaces also.

Generated with Lucent Technologies Demonstration
2-D Pattern Recognition Applet at
http://svm.research.bell-labs.com/SVT/SVMsvt.html
29
Finding nonlinear surfaces

How to modify algorithm to find nonlinear
surfaces?
First idea (simple and effective): map each data
point into a higher dimensional space, and find a
linear fit there
Example: Find a quadratic surface for
Use new coordinates in regular linear SVM
A plane in this quadratic space is equivalent to
a quadratic surface in our original space
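
The slide's own example is not in the transcript, so the sketch below assumes two-dimensional data and one common choice of quadratic coordinates. The function names and the particular plane are hypothetical.

```python
def quadratic_features(x1, x2):
    """Map a 2-D point to the assumed quadratic coordinates."""
    return (x1, x2, x1 * x1, x1 * x2, x2 * x2)


def linear_side(w, b, z):
    """Side of the plane w.z = b in the mapped space: 1 or -1."""
    return 1 if sum(wj * zj for wj, zj in zip(w, z)) > b else -1


# The plane z3 + z5 = 1 in the mapped space is the unit circle
# x1^2 + x2^2 = 1 in the original space.
w, b = (0.0, 0.0, 1.0, 0.0, 1.0), 1.0
inside = linear_side(w, b, quadratic_features(0.2, -0.3))
outside = linear_side(w, b, quadratic_features(1.5, 0.5))
print(inside, outside)
```

A linear SVM trained on the mapped coordinates would learn w and b itself; here they are fixed by hand only to show that a plane in the new space is a curved surface in the old one.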

30
Problems with this method

If the dimensionality of the space is high, this
requires lots of calculations
For a high-degree polynomial space, the number of
combinations of coordinates explodes
Need to do all these calculations for all
training points, and for each testing point
Infinite-dimensional spaces are impossible to handle this way
Nonlinear surfaces can be used without these
problems through the use of a kernel function.
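
To give a sense of the explosion: the number of monomials of total degree at most d in n variables is the binomial coefficient C(n + d, d). The short sketch below evaluates it for a hypothetical dataset with 100 input coordinates.

```python
from math import comb


def num_polynomial_features(n, degree):
    """Number of monomials of total degree <= `degree` in n variables."""
    return comb(n + degree, degree)


# Size of the explicit feature space for 100 input coordinates.
for degree in (2, 3, 5):
    print(degree, num_polynomial_features(100, degree))
```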

31
The Dual Problem

The dual SVM is an alternative approach.
Wrap a string around the data points of each class.
Find the two points, one on each string, which
are closest together. Connect the dots.
The perpendicular bisector to this connection is
the best classification plane.

Class 1
Class -1
32
The Dual Variable, or Importance

Every point on the string is a linear
combination of the points inside the string.

In general
The α's are referred to as dual variables, and
represent the importance of each data point.
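
The formula after "In general" is missing from the transcript. Reading the string as the boundary of the convex hull of one class, a natural reconstruction (an assumption, using α_i for the weights) is that every such point x is a convex combination of that class's data points:

```latex
\[
x = \sum_{i} \alpha_i x_i, \qquad \alpha_i \ge 0, \qquad \sum_{i} \alpha_i = 1
\]
```

A point with α_i = 0 plays no part in locating x, which is the sense in which α_i measures importance.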

33
Two Equivalent Approaches
Class 1
Class -1
Class -1
Class 1

Primal Problem
Find best separating plane
Variables w, b

Dual Problem
Find closest points on strings
Variables α

Both problems yield the same classification
plane.
w, b can be expressed in terms of α
α can be expressed in terms of w, b

34
How to generalize nonlinear fits

Traditional SVM
Dual formulation
Can find w and b in terms of α.
But note: we don't need any xi individually, just
scalar products between points.
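
The dual formulation is not in the transcript. The standard textbook dual of the soft-margin SVM is sketched below, assuming 1-norm slack and dual variables written as α_i; the presentation's exact form may differ.

```latex
\[
\max_{\alpha} \ \sum_{i=1}^{m} \alpha_i
 - \tfrac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j \, y_i y_j \, (x_i \cdot x_j)
\quad \text{subject to} \quad
\sum_{i=1}^{m} \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C
\]
% The primal weight vector is recovered from the dual variables
\[ w = \sum_{i=1}^{m} \alpha_i \, y_i \, x_i \]
```

The data enter only through the products x_i · x_j, which is the observation the slide makes.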

35
Kernel function

Dual formulation again
Substitute scalar product with kernel function
Using a kernel corresponds to having mapped the
data into some high dimensional space, possibly
an infinite one.

36
Traditional kernels

Linear
Polynomial
Gaussian
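
The kernel formulas are missing from the transcript. The sketch below uses the most common definitions of these three kernels; the parameter names `degree`, `coef0` and `gamma` and their default values are assumptions made for this example.

```python
import math


def dot(x, y):
    """Ordinary scalar product of two equal-length sequences."""
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    """K(x, y) = x . y"""
    return dot(x, y)


def polynomial_kernel(x, y, degree=2, coef0=1.0):
    """K(x, y) = (x . y + coef0) ** degree"""
    return (dot(x, y) + coef0) ** degree


def gaussian_kernel(x, y, gamma=0.5):
    """K(x, y) = exp(-gamma * ||x - y||^2)"""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


x, y = (1.0, 2.0), (3.0, 4.0)
print(linear_kernel(x, y), polynomial_kernel(x, y), gaussian_kernel(x, y))
```

Each function takes two raw data points and returns a single number, so none of them ever builds the high-dimensional coordinates explicitly.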

37
Another interpretation

Kernels can be thought of as a distance metric.
Linear SVM: determine class by sign of
Nonlinear SVM: determine class by sign of
Those support vectors that x is "closest to"
influence its class selection.
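
The two expressions are missing from the transcript. In the usual notation, and assuming the offset convention w · x = b, they would be w · x − b for the linear SVM and Σ α_i y_i K(x_i, x) − b for the nonlinear one. The sketch below evaluates the second form; the model values are made up for illustration.

```python
import math


def gaussian_kernel(x, y, gamma=1.0):
    """exp(-gamma * ||x - y||^2); gamma is a free parameter."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def classify(x, support_vectors, labels, alphas, b, kernel=gaussian_kernel):
    """Sign of sum_i alpha_i * y_i * K(x_i, x) - b."""
    score = sum(a * y * kernel(sv, x)
                for sv, y, a in zip(support_vectors, labels, alphas)) - b
    return 1 if score > 0 else -1


# Hypothetical trained model with two support vectors of equal importance.
svs, ys, alphas, b = [(0.0, 0.0), (2.0, 2.0)], [-1, 1], [1.0, 1.0], 0.0
print(classify((1.8, 1.9), svs, ys, alphas, b))
print(classify((0.1, -0.2), svs, ys, alphas, b))
```

With a Gaussian kernel, a support vector far from x contributes almost nothing to the sum, so the nearby support vectors decide the class.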

38
Example: Checkerboard
39
k-Nearest Neighbor Algorithm
40
SVM on Checkerboard
41
Active Learning with SVMs

Given a set of unlabeled points that I can label
at will, how do I choose which one to label next?
Common answer: choose a point that is on or close
to the current separating hyperplane (Campbell,
Cristianini, Smola; Tong, Koller; Schohn, Cohn)
Why?

42
On the hyperplane: Spin 1

Assume data is linearly separable.
A point which is on the hyperplane (or at least
in the margin) is guaranteed to change the
results. (Schohn, Cohn)

43
On the hyperplane: Spin 2

Intuition suggests that one should grab the point
that is most wrong
Problem: we don't know the class of the point yet
If you grab a point that is far from the
hyperplane, and it is classified wrong, this
would be wonderful
But points which are far from the hyperplane are
the ones which are most likely to be correctly
classified
(Campbell, Cristianini, Smola)
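
The selection rule discussed on the last three slides can be sketched as follows, assuming a linear hyperplane w · x = b. The function names and the sample points are hypothetical.

```python
def distance_to_hyperplane(w, b, x):
    """Perpendicular distance from x to the plane w.x = b."""
    norm = sum(wj * wj for wj in w) ** 0.5
    return abs(sum(wj * xj for wj, xj in zip(w, x)) - b) / norm


def next_point_to_label(w, b, unlabeled):
    """Index of the unlabeled point closest to the current hyperplane."""
    return min(range(len(unlabeled)),
               key=lambda i: distance_to_hyperplane(w, b, unlabeled[i]))


unlabeled = [(5.0, 0.0), (2.1, 3.0), (0.0, 0.0)]
print(next_point_to_label((1.0, 0.0), 2.0, unlabeled))
```

After the chosen point is labeled, the SVM is retrained and the rule is applied again to the remaining unlabeled points.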

44
Active Learning in Batches

What if you want to choose a number of points to
label at once? (Brinker)
Could choose the n closest points to the
hyperplane, but this is not optimal

45
Heuristic approach instead

Assumption: all hyperplanes go through the origin
(the authors claim that this can be compensated for
with an appropriate choice of kernel)
To have maximal effect on direction of
hyperplane, choose points with largest angle

46
Defining angle

Let Φ be the mapping to feature space
Angle between points x and y:
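
The formula is missing from the transcript. Writing the feature map as Φ and the kernel as K(x, y) = Φ(x) · Φ(y) (the symbol Φ is an assumption), the cosine of the angle follows directly from the definition of the scalar product:

```latex
\[
\cos \angle\bigl(\Phi(x), \Phi(y)\bigr)
 = \frac{\Phi(x) \cdot \Phi(y)}{\lVert \Phi(x) \rVert \, \lVert \Phi(y) \rVert}
 = \frac{K(x, y)}{\sqrt{K(x, x) \, K(y, y)}}
\]
```

Only kernel evaluations are needed, so the angle can be computed without ever forming Φ(x).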

47
Approach for maximizing angle

Introduce artificial point normal to existing
hyperplane.
Choose next point to be one that maximizes angle
with this one.
Choose each successive point to be the one that
maximizes the minimum angle to the previously chosen
points (i.e., minimizes the maximum cosine value)

48
What happened to distance?

In practice, use both measures:
want points closest to plane
want points with largest angular separation from
others
Iterative greedy algorithm: value = λ · (distance
to hyperplane) + (1 - λ) · (largest cosine measure
to an already existing point)
Choose the next point to be the one that
minimizes this value
The paper reports results that are fairly robust to varying λ
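
A minimal sketch of this greedy rule follows. It assumes a linear kernel, a hyperplane w · x = 0 through the origin, nonzero candidate points, and the absolute value of the cosine as the overlap measure; the weight name `lam` and all function names are hypothetical, and the paper's exact scoring may differ.

```python
import math


def cosine(x, y):
    """Cosine of the angle between two nonzero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))


def select_batch(w, candidates, batch_size, lam=0.5):
    """Greedily pick points that are close to the plane and unlike each other."""
    norm_w = math.sqrt(sum(a * a for a in w))
    chosen = []
    remaining = list(range(len(candidates)))
    while remaining and len(chosen) < batch_size:
        def value(i):
            x = candidates[i]
            distance = abs(sum(a * b for a, b in zip(w, x))) / norm_w
            # Largest overlap with any point already in the batch (0 if none).
            overlap = max((abs(cosine(x, candidates[j])) for j in chosen), default=0.0)
            return lam * distance + (1 - lam) * overlap
        best = min(remaining, key=value)
        chosen.append(best)
        remaining.remove(best)
    return chosen


cands = [(1.0, 0.1), (2.0, 0.2), (0.2, 0.5)]
print(select_batch((0.0, 1.0), cands, 2, lam=0.5))
print(select_batch((0.0, 1.0), cands, 2, lam=1.0))
```

With `lam=1.0` the rule reduces to picking the n closest points; lowering `lam` makes it skip a nearby point that points in the same direction as one already chosen.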

49
Iterative Algorithms

Maintain the importance, or dual variable
associated with all data points.
This is small, since it is a single dimensional
array of size m.
Algorithm
Look at each point sequentially.
Update its importance. (How?)
Repeat until no further improvements in goal.

50
Iterative Framework

LSVM, ASVM, SOR, etc. are iterative algorithms on
the dual variables.
Algorithm (Assume that we have m data points.)
for (i = 0; i < m; i++) α_i = 0;   // Initialize dual variables
while (distance between strings continues to shorten)
    for (i = 0; i < m; i++)
        Update α_i according to the update rule (not shown here).
Bottleneck: Repeated scans through the dataset.
Many of these data points are unimportant.
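
Since the update rule is not shown, the sketch below swaps in a standard dual coordinate descent step for a linear SVM whose offset is absorbed into w through a constant feature. It illustrates the framework only and is not necessarily the rule used by LSVM, ASVM or SOR; all names and the toy data are hypothetical.

```python
def train_dual(points, labels, C=1.0, max_sweeps=100, tol=1e-6):
    """Sweep over all points, updating one dual variable (importance) at a time."""
    data = [tuple(x) + (1.0,) for x in points]   # constant feature absorbs the offset
    m, n = len(data), len(data[0])
    alpha = [0.0] * m                            # initialize dual variables
    w = [0.0] * n                                # w = sum_i alpha_i * y_i * x_i
    for _ in range(max_sweeps):
        largest_change = 0.0
        for i in range(m):
            x, y = data[i], labels[i]
            # Gradient of the dual objective with respect to alpha_i.
            grad = y * sum(wj * xj for wj, xj in zip(w, x)) - 1.0
            step = grad / sum(xj * xj for xj in x)
            new_alpha = min(max(alpha[i] - step, 0.0), C)   # keep 0 <= alpha_i <= C
            delta = new_alpha - alpha[i]
            if delta != 0.0:
                w = [wj + delta * y * xj for wj, xj in zip(w, x)]
                alpha[i] = new_alpha
            largest_change = max(largest_change, abs(delta))
        if largest_change < tol:                 # no further improvement
            break
    return alpha, w


points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
labels = [1, 1, -1, -1]
alpha, w = train_dual(points, labels, C=10.0)
print(alpha, w)
```

Only the array `alpha` and the short vector `w` are kept between passes, which is why the memory footprint stays small even though the dataset is scanned repeatedly.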

51
Iterative Framework (Optimized)

Optimization: Apply algorithm only to active
points, i.e. those points that appear to be
support vectors, as long as progress is being
made.
Optimized Algorithm
while (strings continue to shorten)
    run the unoptimized algorithm for one iteration
    while (strings continue to shorten)
        for (all i corresponding to active points)
            Update α_i.
            If α_i > 0, keep this data point active.
            Otherwise, remove it.
This results in more loops, but the inner loops
are so much faster that it pays off significantly.
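
The active-point idea can be sketched as below. As before, the per-point update is a stand-in (dual coordinate descent for a linear SVM with the offset absorbed into w), since the presentation does not show its own rule; the function names, tolerances and toy data are hypothetical.

```python
def sweep(data, labels, alpha, w, indices, C):
    """One pass of dual updates over `indices`; returns the largest change."""
    largest = 0.0
    for i in indices:
        x, y = data[i], labels[i]
        grad = y * sum(wj * xj for wj, xj in zip(w, x)) - 1.0
        new_alpha = min(max(alpha[i] - grad / sum(xj * xj for xj in x), 0.0), C)
        delta = new_alpha - alpha[i]
        for j in range(len(w)):
            w[j] += delta * y * x[j]             # keep w = sum_i alpha_i * y_i * x_i
        alpha[i] = new_alpha
        largest = max(largest, abs(delta))
    return largest


def train_active(points, labels, C=1.0, tol=1e-6, max_outer=100, max_inner=100):
    """Alternate one full pass with repeated passes over the active points only."""
    data = [tuple(x) + (1.0,) for x in points]   # constant feature absorbs the offset
    alpha, w = [0.0] * len(data), [0.0] * len(data[0])
    for _ in range(max_outer):
        # Full (unoptimized) pass over every data point.
        if sweep(data, labels, alpha, w, range(len(data)), C) < tol:
            break
        # Inner passes touch only points that currently look like support vectors.
        for _ in range(max_inner):
            active = [i for i in range(len(data)) if alpha[i] > 0.0]
            if sweep(data, labels, alpha, w, active, C) < tol:
                break
    return alpha, w


points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
labels = [1, 1, -1, -1]
alpha, w = train_active(points, labels, C=10.0)
print(alpha, w)
```

The full pass is still needed from time to time, because a point that was dropped from the active list may become relevant again once the plane has moved.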