Title: Linear Separators
1. Linear Separators
2. Bankruptcy example
- R is the ratio of earnings to expenses.
- L is the number of late payments on credit cards over the past year.
- We would like to draw a linear separator here, and so obtain a classifier.
3. 1-Nearest Neighbor Boundary
- The decision boundary will be the boundary
between cells defined by points of different
classes, as illustrated by the bold line shown
here.
4. Decision Tree Boundary
- Similarly, a decision tree also defines a
decision boundary in the feature space.
Although both 1-NN and decision trees agree on
all the training points, they disagree on the
precise decision boundary and so will classify
some query points differently. This is the
essential difference between different learning
algorithms.
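The slides' figures are not reproduced here, but the point about agreeing on the training data while differing elsewhere is easy to demonstrate. A small illustrative sketch (not from the slides; the data and the use of scikit-learn are my own assumptions):

```python
# Illustration (made-up data): two classifiers that fit the same training set
# perfectly can still label some new query points differently.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Made-up 2D training data: features (R, L), labels +1 = solvent, -1 = bankrupt.
X = [[2.0, 1.0], [3.0, 0.0], [1.5, 2.0], [0.5, 4.0], [0.8, 5.0], [1.0, 6.0]]
y = [1, 1, 1, -1, -1, -1]

knn = KNeighborsClassifier(n_neighbors=1).fit(X, y)
tree = DecisionTreeClassifier().fit(X, y)

# Both models reproduce the training labels exactly...
assert list(knn.predict(X)) == y and list(tree.predict(X)) == y

# ...but their decision boundaries differ, so some query points between the
# two classes can receive different labels from the two models.
queries = [[1.2, 3.0], [2.5, 3.5], [0.9, 2.5]]
print("1-NN: ", knn.predict(queries))
print("tree: ", tree.predict(queries))
```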
5. Linear Boundary
- Linear separators are characterized by a single linear decision boundary in the space.
- The bankruptcy data can be successfully separated in that manner.
- But there is no guarantee that a single linear separator will successfully classify an arbitrary set of training data.
6. Linear Hypothesis Class
- Line equation (assume 2D first):
  w2 x2 + w1 x1 + b = 0
- Fact 1: All the points (x1, x2) lying on the line make the equation true.
- Fact 2: The line separates the plane into two half-planes.
- Fact 3: The points (x1, x2) in one half-plane give us an inequality with respect to 0, which has the same direction for each of the points in that half-plane.
- Fact 4: The points (x1, x2) in the other half-plane give us the reverse inequality with respect to 0.
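A quick numeric check of Facts 1-4, using a made-up line (this example is not from the slides):

```python
# Sketch (made-up line): f(x1, x2) = w2*x2 + w1*x1 + b is zero on the line
# and keeps a fixed sign on each half-plane.
w1, w2, b = 1.0, 1.0, -4.0          # the line x2 + x1 - 4 = 0

def f(x1, x2):
    return w2 * x2 + w1 * x1 + b

print(f(1.0, 3.0))   #  0.0 -> on the line               (Fact 1)
print(f(0.0, 0.0))   # -4.0 -> one half-plane, f < 0     (Fact 3)
print(f(1.0, 1.0))   # -2.0 -> same half-plane, same sign
print(f(3.0, 3.0))   #  2.0 -> other half-plane, f > 0   (Fact 4)
```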
7. Fact 3 proof
- The line: w2 x2 + w1 x1 + b = 0.
- We can write it as x2 = -(w1 x1 + b) / w2 (take w2 > 0 for concreteness).
- Let (p, q) be a point below the line, and let (p, r) be the point on the line directly above it. (p, r) is on the line, so w2 r + w1 p + b = 0.
- But q < r, so we get w2 q + w1 p + b < w2 r + w1 p + b,
- i.e. w2 q + w1 p + b < 0.
- Since (p, q) was an arbitrary point in the half-plane, the same direction of inequality holds for any other point of the half-plane.
8. Fact 4 proof
- The line: w2 x2 + w1 x1 + b = 0.
- We can write it as x2 = -(w1 x1 + b) / w2 (again with w2 > 0).
- Let (p, s) be a point above the line, and let (p, r) be the point on the line directly below it. (p, r) is on the line, so w2 r + w1 p + b = 0.
- But s > r, so we get w2 s + w1 p + b > w2 r + w1 p + b,
- i.e. w2 s + w1 p + b > 0.
- Since (p, s) was an arbitrary point in the (other) half-plane, the same direction of inequality holds for any other point of that half-plane.
9. Corollary
- Depending on the orientation of the line (the signs of the weights), either inequality direction may go with either half-plane.
- However, the direction will be the same among the points belonging to the same half-plane.
- What's an easy way to determine the direction of the inequality for each half-plane?
- Try the point (0, 0) (assuming it does not lie on the line), and read off the direction for the half-plane where (0, 0) belongs.
- The points of the other half-plane will have the opposite inequality direction.
- How much bigger (or smaller) than zero w2 q + w1 p + b is, is proportional to the distance of the point (p, q) from the line.
- The same can be said for an n-dimensional space; we simply don't talk about half-planes but half-spaces (the line is now a hyperplane creating two half-spaces).
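Continuing the same made-up line, the (0, 0) test and the distance claim look like this in code:

```python
# Sketch (same made-up line as above): test (0,0) to fix the inequality
# direction, and divide by ||w|| to turn f into a perpendicular distance.
import math

w1, w2, b = 1.0, 1.0, -4.0

def f(x1, x2):
    return w2 * x2 + w1 * x1 + b

# (0,0) gives f = -4 < 0, so the half-plane containing the origin is the
# "negative" side; the other half-plane is the "positive" side.
print(f(0.0, 0.0))                     # -4.0

# The signed value scales with distance from the line.
norm_w = math.hypot(w1, w2)
for p, q in [(0.0, 0.0), (1.0, 1.0), (3.0, 3.0)]:
    print((p, q), f(p, q) / norm_w)    # signed perpendicular distance
```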
10. Linear classifier
- We can now exploit the sign of this distance to define a linear classifier, one whose decision boundary is a hyperplane.
- Instead of using 0 and 1 as the class labels (which was an arbitrary choice anyway), we use the sign of the distance, either +1 or -1, as the labels (that is, the values of the yi's).
- The classifier is h(x) = sign(w·x + b), which outputs +1 or -1.
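As a minimal sketch (the function name is my own), the classifier is just the sign of that expression:

```python
# Sketch: a linear classifier h(x) = sign(w . x + b), returning +1 or -1.
def classify(w, b, x):
    s = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if s > 0 else -1    # points exactly on the boundary get -1 here

print(classify([1.0, 1.0], -4.0, [3.0, 3.0]))   # +1
print(classify([1.0, 1.0], -4.0, [1.0, 1.0]))   # -1
```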
11. Margin
- A variant of the signed distance of a training point to a hyperplane is the margin of the point.
- The margin (gamma) is the product of w·xi + b for the training point xi and the known sign of the class, yi.
- If they agree (the training point is correctly classified), then the margin is positive.
- If they disagree (the classification is in error), then the margin is negative.
- Margin: γi = yi (w·xi + b); it is proportional to the perpendicular distance of the point xi to the line (hyperplane).
- γi > 0: the point is correctly classified (sign of distance = yi).
- γi < 0: the point is incorrectly classified (sign of distance ≠ yi).
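A small sketch of the margin computation, using the same made-up weights as before:

```python
# Sketch: margin of a labelled training point (xi, yi) with respect to (w, b).
def margin(w, b, x, y):
    return y * (sum(wj * xj for wj, xj in zip(w, x)) + b)

w, b = [1.0, 1.0], -4.0
print(margin(w, b, [3.0, 3.0], +1))   #  2.0 -> correctly classified
print(margin(w, b, [3.0, 3.0], -1))   # -2.0 -> misclassified
```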
12. Perceptron algorithm
- How do we find a linear separator?
- The perceptron algorithm was developed by Rosenblatt in the late 1950s.
- This is a greedy, "mistake driven" algorithm.
- We will be using the extended form of the weight and data-point vectors in this algorithm: the bias b is appended to the weight vector and a constant 1 is appended to each data point. The extended form is in fact a trick.
- This will simplify the presentation a bit.
13. Perceptron algorithm
- Pick an initial weight vector (including b), e.g. (0.1, ..., 0.1).
- Repeat until all points are correctly classified:
  - Repeat for each point xi:
    - Calculate the margin yi (w·xi) (this is a number).
    - If the margin is > 0, the point xi is correctly classified.
    - Else, change the weights to increase the margin: change the weights by an amount proportional to yi xi.
- Note that, if yi = +1:
  - If the j-th component of xi is positive, then wj increases (and the margin increases).
  - If it is negative, then wj decreases (and the margin again increases).
- Similarly, for yi = -1, the margin always increases.
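A minimal sketch of the algorithm in Python (assuming the extended form, so b is the last component of w and every data point ends with a constant 1; the eta parameter anticipates the rate constant mentioned two slides below; the data are made up):

```python
# Sketch of the perceptron algorithm on extended vectors:
# each x already ends with a constant 1, so w's last entry plays the role of b.
def perceptron(points, labels, eta=0.1, max_passes=1000):
    n = len(points[0])
    w = [0.1] * n                              # initial weights, e.g. (0.1, ..., 0.1)
    for _ in range(max_passes):                # outer loop: passes over the data
        mistakes = 0
        for x, y in zip(points, labels):       # inner loop: each point in turn
            gamma = y * sum(wj * xj for wj, xj in zip(w, x))   # margin y_i (w . x_i)
            if gamma <= 0:                     # mistake: move w in the direction y_i * x_i
                w = [wj + eta * y * xj for wj, xj in zip(w, x)]
                mistakes += 1
        if mistakes == 0:                      # all points correctly classified
            return w
    return w                                   # data may not be linearly separable

# Tiny made-up example in extended form: (R, L, 1).
X = [[2.0, 1.0, 1.0], [3.0, 0.0, 1.0], [0.5, 4.0, 1.0], [1.0, 6.0, 1.0]]
y = [+1, +1, -1, -1]
print(perceptron(X, y))
```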
14. Perceptron algorithm (explanations)
- The first step is to start with an initial value of the weight vector, usually all zeros.
- Then we repeat the outer loop until all the points are correctly classified using the current weight vector.
- The inner loop considers each point in turn.
- If the point's margin is positive, then it is correctly classified and we do nothing.
- Otherwise, if it is negative or zero, we have a mistake and we want to change the weights so as to increase the margin (so that it ultimately becomes positive).
- The trick is how to change the weights. It turns out that using a value proportional to yi xi is the right thing. We'll see why, formally, later.
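To fill in the one step the slide defers: in the extended form the update is w ← w + yi xi, and the margin of the point xi that triggered the update becomes
yi (w + yi xi)·xi = yi (w·xi) + yi^2 (xi·xi) = yi (w·xi) + ||xi||^2,
which is larger than the old margin by ||xi||^2 > 0. (This only says the margin on that one point increases; the full convergence argument is the part deferred to later.)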
15. Perceptron algorithm
- So, each change of w increases the margin on a particular point.
- However, the changes for the different points interfere with each other; that is, different points might change the weights in opposing directions.
- So it will not be the case that one pass through the points will produce a correct weight vector.
- In general, we will have to go around multiple times.
- The remarkable fact is that the algorithm is guaranteed to terminate with the weights for a separating hyperplane as long as the data is linearly separable.
- The proof of this fact is beyond our scope.
- Notice that if the data is not separable, then this algorithm is an infinite loop.
- It turns out that it is a good idea to keep track of the best separator we've seen so far (the one that makes the fewest mistakes) and, after we get tired of going around the loop, return that one.
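A hedged sketch of that idea (often called the pocket algorithm; this is my own variant of the perceptron sketch above, not code from the slides):

```python
# Sketch: remember the weight vector that misclassified the fewest points,
# and return it if we give up before finding a perfect separator.
def count_mistakes(w, points, labels):
    return sum(1 for x, y in zip(points, labels)
               if y * sum(wj * xj for wj, xj in zip(w, x)) <= 0)

def perceptron_pocket(points, labels, eta=0.1, max_passes=1000):
    w = [0.1] * len(points[0])
    best_w, best_err = list(w), count_mistakes(w, points, labels)
    for _ in range(max_passes):
        for x, y in zip(points, labels):
            if y * sum(wj * xj for wj, xj in zip(w, x)) <= 0:
                w = [wj + eta * y * xj for wj, xj in zip(w, x)]
                err = count_mistakes(w, points, labels)
                if err < best_err:              # keep the best separator seen so far
                    best_w, best_err = list(w), err
        if best_err == 0:                       # perfect separator found
            break
    return best_w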
16. Perceptron algorithm: Bankruptcy data
- This shows a trace of the perceptron algorithm on the bankruptcy data.
- Here it took 49 iterations through the data (the outer loop) for the algorithm to stop.
- The separator at the end of the loop is (0.4, 0.94, -2.2).
- We usually pick some small "rate" constant to scale the change to w.
- 0.1 is used here, but other small values also work well.
17. Gradient Ascent/Descent
- Why pick yi xi as the increment to the weights?
- The margin is a function of several input variables.
- The variables are w2, w1, w0 (or, in general, wn, ..., w0).
- In order to reach the maximum of this function, it is good to change the variables in the direction of the slope of the function.
- The slope is represented by the gradient of the function.
- The gradient is the vector of first (partial) derivatives of the function with respect to each of the input variables.
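Concretely, for a single training point the margin as a function of the weights is γi(w) = yi (w·xi + b); its partial derivative with respect to wj is yi times the j-th component of xi, so the gradient with respect to w is ∇w γi = yi xi. Moving the weights in that direction is exactly the perceptron's increment, which is why yi xi is the right thing to add.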