Data Mining and Machine Learning via Support Vector Machines

1
Data Mining and Machine Learning via Support
Vector Machines
  • Dave Musicant

Graphic generated with Lucent Technologies' Demonstration
2-D Pattern Recognition Applet at
http://svm.research.bell-labs.com/SVT/SVMsvt.html
2
Outline
  • The Supervised Learning Classification Problem
  • The Support Vector Machine for Classification
    (linear approaches)
  • Nonlinear SVM approaches
  • Active learning techniques for SVMs
  • Iterative algorithms for solving SVMs
  • SVM Regression
  • Wrapup

3
Basic Definitions
  • Data Mining
  • "the non-trivial process of identifying valid,
    novel, potentially useful, and ultimately
    understandable patterns in data" -- Usama Fayyad
  • Utilizes techniques from machine learning,
    databases, and statistics
  • Machine Learning
  • "concerned with the question of how to construct
    computer programs that automatically improve with
    experience" -- Tom Mitchell
  • Fits under the Artificial Intelligence umbrella

4
Supervised Learning Classification
  • Example: Cancer diagnosis
  • Training set: records of patients containing both
    the input data and the known classification
    (diagnosis)
  • Use this training set to learn how to classify
    patients whose diagnosis is not known
  • Test set: records with input data only, where the
    classification must be predicted
  • The input data is often easily obtained, whereas
    the classification is not.

5
Classification Problem
  • Goal: Use the training set plus some learning
    method to produce a predictive model.
  • Use this predictive model to classify new data.
  • Sample applications follow on the next slides.

6
Application: Breast Cancer Diagnosis
Research by Mangasarian, Street, and Wolberg
7
Breast Cancer Diagnosis Separation
Research by Mangasarian, Street, and Wolberg
8
Application: Document Classification
  • The Federalist Papers
  • Written in 1787-1788 by Alexander Hamilton, John
    Jay, and James Madison to persuade residents of
    the State of New York to ratify the U.S.
    Constitution
  • All written under the pseudonym Publius
  • Who wrote which of them?
  • Hamilton wrote 56 papers
  • Madison wrote 50 papers
  • 12 disputed papers, generally understood to be
    written by Hamilton or Madison, but not known
    which

Research by Bosch, Smith
9
Federalist Papers Classification
Research by Bosch, Smith
Graphic by Fung
10
Application: Face Detection
  • Training data is a collection of Faces and
    Non-Faces
  • Rotated and mirrored versions are added to
    provide robustness

Image obtained from work by Osuna, Freund, and
Girosi at http://www.ai.mit.edu/projects/cbcl/res-area/object-detection/face-detection.html
11
Face Detection Results
Image obtained from "Support Vector Machines
Training and Applications" by Osuna, Freund, and
Girosi.
12
Face Detection Results
Image obtained from work by Osuna, Freund, and
Girosi at http://www.ai.mit.edu/projects/cbcl/res-area/object-detection/face-detection.html
13
Simple Linear Perceptron
  • Goal: Find the best line (or hyperplane) to
    separate the training data. How do we formalize
    this?
  • In two dimensions, the equation of the line is
    given by $w_1 x_1 + w_2 x_2 + b = 0$
  • Better notation for n dimensions: treat each data
    point and the coefficients as vectors. The
    equation is then given by $w \cdot x + b = 0$

14
Simple Linear Perceptron (cont.)
  • The Simple Linear Perceptron is a classifier: as
    shown in the picture,
  • points that fall on the right are classified as
    +1
  • points that fall on the left are classified as
    -1
  • Therefore, using the training set, find a
    hyperplane (line) so that $w \cdot x_i + b > 0$
    when $y_i = 1$ and $w \cdot x_i + b < 0$ when
    $y_i = -1$
  • This is a good starting point, but we can do
    better! (A sketch of this decision rule follows.)
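A minimal sketch of the perceptron decision rule in Python (the names are illustrative; w and b would come from training):

    import numpy as np

    # Classify each row of X by which side of the hyperplane
    # w . x + b = 0 it falls on: +1 on one side, -1 on the other.
    def classify(X, w, b):
        return np.where(X @ w + b > 0, 1, -1)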

15
Finding the Best Plane
  • Not all planes are equal. Which of the two
    following planes shown is better?
  • Both planes accurately classify the training set.
  • The solid green plane is the better choice, since
    it is more likely to do well on future test data.
  • The solid green plane is further away from the
    data.

16
Separating the planes
  • Construct the bounding planes:
  • Draw two parallel planes to the classification
    plane.
  • Push them as far apart as possible, until they
    hit data points.
  • The classification plane with bounding planes
    furthest apart is the best one.

17
Recap Finding the Best Plane
  • Details:
  • All points in class 1 should be to the right of
    bounding plane 1.
  • All points in class -1 should be to the left of
    bounding plane -1.
  • Pick $y_i$ to be 1 or -1 depending on the
    classification. Then the above two inequalities
    can be written as one:
    $y_i (w \cdot x_i + b) \ge 1$
  • The distance between bounding planes should be
    maximized.
  • With bounding planes $w \cdot x + b = \pm 1$, the
    distance between them is given by $2 / \|w\|$
18
The Optimization Problem
  • The previous slide can be rewritten as
    $\min_{w,b} \frac{1}{2}\|w\|^2$ subject to
    $y_i (w \cdot x_i + b) \ge 1$ for all training
    points $i$ (maximizing $2/\|w\|$ is equivalent to
    minimizing $\|w\|^2$).
  • This is a mathematical program:
  • an optimization problem subject to constraints.
  • More specifically, this is a quadratic program: a
    quadratic objective with linear constraints.
  • There are high-powered software tools for solving
    this kind of problem (both commercial and
    academic).
  • These general-purpose tools are slow for this
    particular problem. (A sketch using one such
    solver follows.)
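As an illustration, here is a hedged sketch that hands this quadratic program to one general-purpose QP solver (cvxopt); the toy data is hypothetical and assumed linearly separable:

    import numpy as np
    from cvxopt import matrix, solvers

    # Hypothetical separable toy data: rows of X are points, y holds +1/-1 labels.
    X = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 1.0], [4.0, 2.0]])
    y = np.array([1.0, 1.0, -1.0, -1.0])
    m, n = X.shape

    # Variables z = (w, b). Objective: (1/2)||w||^2
    # (a tiny term on b keeps P nonsingular for the solver).
    P = matrix(np.diag([1.0] * n + [1e-8]))
    q = matrix(np.zeros(n + 1))
    # Constraints y_i (w . x_i + b) >= 1, rewritten in the solver's form G z <= h.
    G = matrix(-np.hstack([X * y[:, None], y[:, None]]))
    h = matrix(-np.ones(m))

    z = np.array(solvers.qp(P, q, G, h)["x"]).ravel()
    w, b = z[:n], z[n]
    print("w =", w, " b =", b)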

19
Data Which is Not Linearly Separable
  • What if a separating plane does not exist?
  • Find the plane that maximizes the margin and
    minimizes the errors on the training points.
  • Take the original inequality and add a slack
    variable $\xi_i \ge 0$ to measure the error:
    $y_i (w \cdot x_i + b) \ge 1 - \xi_i$

20
The Support Vector Machine
  • Push the planes apart and minimize the error at
    the same time:
    $\min_{w,b,\xi} \; C \sum_i \xi_i + \frac{1}{2}\|w\|^2$
    subject to $y_i (w \cdot x_i + b) \ge 1 - \xi_i$,
    $\xi_i \ge 0$
  • C is a positive number that is chosen to balance
    these two goals.
  • This problem is called a Support Vector Machine,
    or SVM. (A sketch of the effect of C follows.)
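A brief sketch using scikit-learn's SVC, whose C parameter plays exactly this balancing role (the random toy data is hypothetical):

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
    y = np.array([-1] * 50 + [1] * 50)

    # Sweep C: small C favors a wide margin (more slack allowed);
    # large C penalizes training errors more heavily.
    for C in (0.01, 1.0, 100.0):
        clf = SVC(kernel="linear", C=C).fit(X, y)
        print(C, clf.score(X, y), len(clf.support_))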

21
Terminology
  • Those points that touch the bounding plane, or
    lie on the wrong side, are called support vectors.
  • If all the data points except the support vectors
    were removed, the solution would turn out the
    same.
  • The SVM is mathematically equivalent to force and
    torque equilibrium (hence the name support
    vectors).

22
Example from Carleton College
  • 1,850 students
  • 4-year undergraduate liberal arts college
  • Ranked 5th in the nation by US News and World
    Report
  • 15-20 computer science majors per year
  • All research assistants are full-time
    undergraduates

23
Student Research Example
  • Goal: automatically generate a frequently asked
    questions (FAQ) list from discussion groups
  • Subgoal 1: Given a corpus of discussion group
    postings, identify those messages that contain
    questions
  • Recruit student volunteers to identify questions
  • Learn the classification
  • Work by students Sarah Allen, Janet Campbell,
    Ester Gubbrud, Rachel Kirby, Lillie Kittredge

24
Building A Training Set
25
Building A Training Set
  • Which sentences are questions in the following
    text?

From: oehler@yar.cs.wisc.edu (Wonko the Sane)
I was recently talking to a possible employer
(mine! :-) ) and he made a reference to a 48-bit
graphics computer/image processing system. I seem
to remember it being called IMAGE or something
akin to that. Anyway, he claimed it had 48-bit
color + a 12-bit alpha channel. That's 60 bits of
info -- what could that possibly be for?
Specifically the 48-bit color? That's 280
trillion colors, many more than the human eye can
resolve. Is this an anti-aliasing thing? Or is
this just some magic number to make it work
better with a certain processor.
26
Representing the training set
  • Each document is a point
  • Each potential word is a column (bag of words)
  • Other pre-processing tricks:
  • Remove punctuation
  • Remove "stop words" such as "is", "a", etc.
  • Use stemming to strip suffixes such as "ing" and
    "ed", so similar words share a column (see the
    sketch below)
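A minimal sketch of this bag-of-words representation with scikit-learn (the two example documents are hypothetical):

    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["Does anyone know how this works?",
            "For sale: one slightly used modem."]

    # The default token pattern drops punctuation; stop_words removes
    # "is", "a", etc. Stemming would need a custom tokenizer
    # (e.g., NLTK's PorterStemmer).
    vec = CountVectorizer(lowercase=True, stop_words="english")
    X = vec.fit_transform(docs)          # each row: one document as a point
    print(vec.get_feature_names_out())   # each column: one word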

27
Results
  • If you just guess brain-dead "every message
    contains a question", get 55 right
  • If you use a Support Vector Machine, get 66.5 of
    them right
  • What words do you think were strong indicators of
    questions?
  • anyone, does, any, what, thanks, how, help, know,
    there, do, question
  • What words do you think were strong
    contra-indicators of questions?
  • re, sale, m, references, not, your

28
Beyond lines
  • Some datasets may not be best separated by a
    plane.
  • SVMs can be extended to nonlinear surfaces also.

Generated with Lucent Technologies' Demonstration
2-D Pattern Recognition Applet at
http://svm.research.bell-labs.com/SVT/SVMsvt.html
29
Finding nonlinear surfaces
  • How do we modify the algorithm to find nonlinear
    surfaces?
  • First idea (simple and effective): map each data
    point into a higher dimensional space, and find a
    linear fit there
  • Example: to find a quadratic surface for
    $x = (x_1, x_2)$, map
    $x \mapsto (x_1, x_2, x_1^2, x_2^2, x_1 x_2)$
  • Use the new coordinates in the regular linear SVM
  • A plane in this quadratic space is equivalent to
    a quadratic surface in our original space (see
    the sketch below)
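A minimal sketch of this idea, using the quadratic map above with scikit-learn's linear SVM (the XOR-style toy data is hypothetical):

    import numpy as np
    from sklearn.svm import LinearSVC

    # Map (x1, x2) -> (x1, x2, x1^2, x2^2, x1*x2).
    def quadratic_map(X):
        x1, x2 = X[:, 0], X[:, 1]
        return np.column_stack([x1, x2, x1**2, x2**2, x1 * x2])

    # XOR-style data that no line separates in the original 2-D space.
    X = np.array([[0., 0.], [1., 1.], [0., 1.], [1., 0.]])
    y = np.array([1, 1, -1, -1])

    # A plane in the mapped space = a quadratic surface in the original space.
    clf = LinearSVC(C=10.0).fit(quadratic_map(X), y)
    print(clf.score(quadratic_map(X), y))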

30
Problems with this method
  • If the dimensionality of the space is high, this
    means lots of calculations
  • For a high-degree polynomial space, the number of
    coordinate combinations explodes
  • These calculations must be done for all training
    points, and again for each testing point
  • Infinite dimensional spaces would be impossible
  • Nonlinear surfaces can be used without these
    problems through the use of a kernel function.

31
The Dual Problem
  • The dual SVM is an alternative approach.
  • Wrap a string around each class's data points
    (forming two convex hulls).
  • Find the two closest points, one on each string,
    and connect the dots.
  • The perpendicular bisector of this connection is
    the best classification plane.

32
The Dual Variable, or Importance
  • Every point on the string is a linear combination
    of the points inside the string.
  • In general, such a point can be written
    $\sum_i \alpha_i x_i$ with $\alpha_i \ge 0$ and
    $\sum_i \alpha_i = 1$
  • The $\alpha$s are referred to as dual variables,
    and represent the importance of each data point.

33
Two Equivalent Approaches
  • Primal problem:
  • Find the best separating plane
  • Variables: w, b
  • Dual problem:
  • Find the closest points on the strings
  • Variables: $\alpha$
  • Both problems yield the same classification
    plane:
  • w, b can be expressed in terms of $\alpha$
  • $\alpha$ can be expressed in terms of w, b

34
How to generalize nonlinear fits
  • Traditional SVM: solved in the primal variables
    w, b
  • Dual formulation:
    $\max_\alpha \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j)$
    subject to $\sum_i \alpha_i y_i = 0$,
    $0 \le \alpha_i \le C$
  • We can find w and b in terms of $\alpha$:
    $w = \sum_i \alpha_i y_i x_i$
  • But note: we don't need any $x_i$ individually,
    just scalar products between points.

35
Kernel function
  • In the dual formulation again, the data appears
    only through scalar products $x_i \cdot x_j$
  • Substitute each scalar product with a kernel
    function: $x_i \cdot x_j \rightarrow K(x_i, x_j)$
  • Using a kernel corresponds to having mapped the
    data into some high dimensional space, possibly
    an infinite one.

36
Traditional kernels
  • Linear: $K(x, y) = x \cdot y$
  • Polynomial: $K(x, y) = (x \cdot y + 1)^d$
  • Gaussian:
    $K(x, y) = \exp(-\|x - y\|^2 / (2\sigma^2))$
    (see the code sketch below)
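The same three kernels written directly in Python (x and y are numpy vectors; d and sigma are the kernel parameters):

    import numpy as np

    def linear_kernel(x, y):
        return x @ y

    def polynomial_kernel(x, y, d=2):
        return (x @ y + 1) ** d

    def gaussian_kernel(x, y, sigma=1.0):
        return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))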

37
Another interpretation
  • Kernels can be thought of as a distance (or
    similarity) metric.
  • Linear SVM: determine class by the sign of
    $w \cdot x + b$
  • Nonlinear SVM: determine class by the sign of
    $\sum_i \alpha_i y_i K(x_i, x) + b$
  • Those support vectors that x is "closest to"
    influence its class selection (see the sketch
    below).
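A minimal sketch of this nonlinear decision rule (the dual variables alphas, the offset b, and the support vectors are assumed to come from training; gaussian_kernel is defined above):

    import numpy as np

    # Class of x = sign( sum_i alpha_i y_i K(x_i, x) + b ).
    def svm_predict(x, support_X, support_y, alphas, b, kernel):
        s = sum(a * yi * kernel(xi, x)
                for a, yi, xi in zip(alphas, support_y, support_X))
        return np.sign(s + b)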

38
Example Checkerboard
39
k-Nearest Neighbor Algorithm
40
SVM on Checkerboard
41
Active Learning with SVMs
  • Given a set of unlabeled points that I can label
    at will, how do I choose which one to label next?
  • Common answer: choose a point that is on or close
    to the current separating hyperplane (Campbell,
    Cristianini, and Smola; Tong and Koller; Schohn
    and Cohn), as in the sketch below
  • Why?
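A minimal sketch of this heuristic, assuming a trained scikit-learn SVC and a hypothetical pool of unlabeled points:

    import numpy as np

    # Return the index of the unlabeled point closest to the
    # current separating hyperplane.
    def next_query(clf, X_unlabeled):
        dist = np.abs(clf.decision_function(X_unlabeled))
        return int(np.argmin(dist))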

42
On the hyperplane: Spin 1
  • Assume the data is linearly separable.
  • A point which is on the hyperplane (or at least
    in the margin) is guaranteed to change the
    results. (Schohn and Cohn)

43
On the hyperplane: Spin 2
  • Intuition suggests that one should grab the point
    that is most wrong
  • Problem: we don't know the class of the point yet
  • If you grab a point that is far from the
    hyperplane, and it turns out to be classified
    wrong, this would be wonderful
  • But points which are far from the hyperplane are
    the ones most likely to be correctly classified
  • (Campbell, Cristianini, and Smola)

44
Active Learning in Batches
  • What if you want to choose a number of points to
    label at once? (Brinker)
  • Could choose the n closest points to the
    hyperplane, but this is not optimal

45
Heuristic approach instead
  • Assumption: all hyperplanes go through the origin
  • the authors claim that this can be compensated
    for with an appropriate choice of kernel
  • To have maximal effect on the direction of the
    hyperplane, choose points with the largest angles
    between them

46
Defining angle
  • Let $\Phi$ be the mapping to feature space, so
    that $K(x, y) = \Phi(x) \cdot \Phi(y)$
  • The angle between points x and y in feature space
    is then given by
    $\cos \angle(x, y) = \frac{K(x, y)}{\sqrt{K(x, x)\, K(y, y)}}$

47
Approach for maximizing angle
  • Introduce an artificial point normal to the
    existing hyperplane.
  • Choose the next point to be the one that
    maximizes the angle with this one.
  • Choose each successive point to be the one that
    maximizes the minimum angle to the previously
    chosen points (i.e., minimizes the maximum cosine
    value)

48
What happened to distance?
  • In practice, use both measures:
  • want points closest to the plane
  • want points with the largest angular separation
    from the others
  • Iterative greedy algorithm:
    value = $\lambda \cdot$ (distance to hyperplane)
    $+ \; (1 - \lambda) \cdot$ (largest cosine measure
    to an already-chosen point)
  • Choose the next point to be the one that
    minimizes this value (see the sketch below)
  • The paper's results are fairly robust to varying
    $\lambda$
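A hedged sketch of this greedy batch selection (K is a kernel function as above; clf is a trained scikit-learn SVC; all names are illustrative):

    import numpy as np

    def cosine(x, y, K):
        return K(x, y) / np.sqrt(K(x, x) * K(y, y))

    def select_batch(clf, X_pool, n, K, lam=0.5):
        dist = np.abs(clf.decision_function(X_pool))
        chosen = [int(np.argmin(dist))]          # seed: closest to the plane
        while len(chosen) < n:
            best_i, best_val = None, np.inf
            for i in range(len(X_pool)):
                if i in chosen:
                    continue
                # Largest |cosine| to an already-chosen point.
                max_cos = max(abs(cosine(X_pool[i], X_pool[j], K))
                              for j in chosen)
                val = lam * dist[i] + (1 - lam) * max_cos
                if val < best_val:
                    best_i, best_val = i, val
            chosen.append(best_i)
        return chosen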

49
Iterative Algorithms
  • Maintain the importance, or dual variable,
    associated with each data point.
  • This is small, since it is a one-dimensional
    array of size m.
  • Algorithm:
  • Look at each point sequentially.
  • Update its importance. (How?)
  • Repeat until there is no further improvement in
    the goal.

50
Iterative Framework
  • LSVM, ASVM, SOR, etc. are iterative algorithms on
    the dual variables.
  • Algorithm (assume that we have m data points):

    for (i = 0; i < m; i++) a[i] = 0;   // Initialize dual variables
    while (distance between strings continues to shorten)
        for (i = 0; i < m; i++)
            update a[i] according to the update rule (not shown here)

  • Bottleneck: repeated scans through the dataset.
  • Many of these data points are unimportant (a
    concrete sketch of such an iteration follows)
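As a concrete, hedged instance of this framework, here is a kernel-adatron-style update (not the specific LSVM/ASVM/SOR rules, which the slide omits); K is a precomputed m x m kernel matrix:

    import numpy as np

    def iterate_dual(K, y, eta=0.1, n_iter=100):
        m = len(y)
        alpha = np.zeros(m)            # importance of each data point
        for _ in range(n_iter):        # in practice: until no further improvement
            for i in range(m):         # look at each point sequentially
                margin = y[i] * (alpha * y) @ K[:, i]
                alpha[i] = max(0.0, alpha[i] + eta * (1.0 - margin))
        return alpha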

51
Iterative Framework (Optimized)
  • Optimization: apply the algorithm only to active
    points, i.e. those points that appear to be
    support vectors, as long as progress is being
    made.
  • Optimized algorithm:

    while (strings continue to shorten)
        run the unoptimized algorithm for one iteration
        while (strings continue to shorten)
            for (all i corresponding to active points)
                update a[i]
                if (a[i] > 0) keep this data point active
                else remove it from the active set

  • This results in more loops, but the inner loops
    are so much faster that it pays off significantly
    (see the sketch below).
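A hedged sketch of this active-point optimization, built on the iterate_dual update above (the fixed loop bounds stand in for the "strings continue to shorten" tests):

    import numpy as np

    def iterate_dual_active(K, y, eta=0.1, outer=10, inner=20):
        m = len(y)
        alpha = np.zeros(m)

        def update(i):
            margin = y[i] * (alpha * y) @ K[:, i]
            alpha[i] = max(0.0, alpha[i] + eta * (1.0 - margin))

        for _ in range(outer):
            for i in range(m):                   # one full, unoptimized pass
                update(i)
            active = [i for i in range(m) if alpha[i] > 0]
            for _ in range(inner):               # fast passes on active points only
                for i in active:
                    update(i)
                active = [i for i in active if alpha[i] > 0]
        return alpha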

52
Regression
  • Support vector machines can also be used to solve
    regression problems.

53
The Regression Problem
  • Points close to the fitted line may be off due to
    noise only
  • The line should be influenced by real data, not
    noise
  • So ignore the errors from points that are
    sufficiently close!

54
Support Vector Regression
  • Traditional support vector regression:
    $\min_{w,b,\xi,\xi^*} \; C \sum_i (\xi_i + \xi_i^*) + \frac{1}{2}\|w\|^2$
    subject to
    $-\varepsilon - \xi_i^* \le (w \cdot x_i + b) - y_i \le \varepsilon + \xi_i$,
    $\xi_i, \xi_i^* \ge 0$
  • Minimize the error made outside of a tube of
    half-width $\varepsilon$ around the fitted
    function
  • Regularize the fitted plane by minimizing the
    norm of w
  • The parameter C balances these two competing
    goals (see the sketch below)
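A minimal sketch of support vector regression with scikit-learn's SVR (the noisy linear toy data is hypothetical; epsilon is the tube half-width):

    import numpy as np
    from sklearn.svm import SVR

    rng = np.random.default_rng(0)
    X = np.linspace(0, 5, 60).reshape(-1, 1)
    y = 2.0 * X.ravel() + rng.normal(0, 0.2, 60)   # noisy line

    # Errors inside the epsilon-tube are ignored; C balances
    # tube violations against the norm of w.
    model = SVR(kernel="linear", C=1.0, epsilon=0.2).fit(X, y)
    print(model.coef_, model.intercept_)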

55
My current research
  • Collaborating with:
  • Deborah Gross, Carleton College (chemistry)
  • Raghu Ramakrishnan, UW-Madison (computer
    sciences)
  • Jamie Schauer, UW-Madison (atmospheric sciences)
  • Analyzing data from an Aerosol Time-of-Flight
    Mass Spectrometer (ATOFMS)
  • Aerosol: a "small particle of gunk in air"
  • Questions we want to answer:
  • How can we classify safe vs. dangerous particles?
  • Can we determine when a sudden change in the air
    stream has happened?
  • Can we identify what substances are present in a
    particular particle?

56
Questions?