Title: A New Linear-threshold Algorithm
1. A New Linear-threshold Algorithm
- Anna Rapoport
- Lev Faivishevsky
2. Introduction
- Valiant (1984) and others have studied the problem of learning various classes of Boolean functions from examples. Now we're going to discuss incremental learning of these functions.
- We consider a setting in which the learner responds to each example according to a current hypothesis. The learner then updates this hypothesis, if necessary, based on the correct classification of the example.
3. Introduction (cont.)
- One natural measure of the quality of learning in this setting is the number of mistakes the learner makes.
- For suitable classes of functions, learning algorithms are available that make a bounded number of mistakes, with the bound independent of the number of examples seen by the learner.
4. Introduction (cont.)
- We present an algorithm that learns disjunctive Boolean functions, along with variants for learning other classes of Boolean functions.
- The basic method can be expressed as a linear-threshold algorithm.
- A primary advantage of this algorithm is that the number of mistakes grows only logarithmically with the number of irrelevant attributes in the examples. Also, it is computationally efficient in both time and space.
5. How does it work?
- We study learning in an on-line setting: there is no separate set of training examples. The learner attempts to predict the appropriate response for each example, starting with the first example received.
- After making this prediction, the learner is told whether the prediction was correct, and then uses this information to improve its hypothesis.
- The learner continues to learn as long as it receives examples.
6. The Setting
- Now we're going to describe in more detail the learning environment that we consider and the classes of functions that the algorithm can learn. We assume that learning takes place in a sequence of trials. The order of events in a trial is as follows:
7. The Setting (cont.)
- (1) The learner receives some information about the world, corresponding to a single example. This information consists of the values of n Boolean attributes, for some n that remains fixed. We think of the information received as a point in {0,1}^n. We call this point an instance, and we call {0,1}^n the instance space.
8. The Setting (cont.)
- (2) The learner makes a response. The learner has a choice of two responses, labeled 0 and 1. We call this response the learner's prediction of the correct value.
- (3) The learner is told whether or not the response was correct. This information is called the reinforcement.
9. The Setting (cont.)
- Each trial begins after the previous trial has ended.
- We assume that for the entire sequence of trials there is a single function f: {0,1}^n → {0,1} which maps each instance to the correct response to that instance. This function is called the target function or target concept.
- An algorithm for learning in this setting is called an algorithm for on-line learning from examples (AOLLE). A minimal sketch of the trial loop follows below.
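The protocol on slides 7-9 is just a loop over trials. A minimal Python sketch, assuming a learner object with illustrative predict/update methods (these names are not from the paper):

```python
def run_trials(learner, instances, target):
    """Run one trial per instance and count the learner's mistakes."""
    mistakes = 0
    for x in instances:                     # (1) the learner receives an instance
        prediction = learner.predict(x)     # (2) the learner makes a response
        correct = target(x)                 # (3) the reinforcement
        if prediction != correct:
            mistakes += 1
        learner.update(x, correct)          # the learner may revise its hypothesis
    return mistakes
```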
10. Mistake Bound (introduction)
- We evaluate the algorithm's learning behavior by counting the worst-case number of mistakes that it will make while learning a function from a specified class of functions. Computational complexity is also considered. The method is computationally time and space efficient.
11. General results about mistake bounds for AOLLE
- First we present upper and lower bounds on the number of mistakes in the case where one ignores issues of computational efficiency.
- The instance space can be any finite space X, and the target class is assumed to be a collection of functions, each with domain X and range {0,1}.
12. Some definitions
- Def 1: For any learning algorithm A and any target function f, let M_A(f) be the maximum, over all possible sequences of instances, of the number of mistakes that algorithm A makes when the target function is f.
13. Some definitions
- Def 2: For any learning algorithm A and any non-empty target class C, let
- M_A(C) = max_{f ∈ C} M_A(f).
- Define M_A(C) = -1 if C is empty. Any number greater than or equal to M_A(C) will be called a mistake bound for algorithm A applied to class C.
14. Some definitions
- Def 3: The optimal mistake bound for a target class C, denoted opt(C), is the minimum over all algorithms A of M_A(C) (regardless of the algorithms' computational efficiency). An algorithm A is called optimal for class C if M_A(C) = opt(C). Thus opt(C) represents the best possible worst-case mistake bound for any algorithm learning C.
15. Two auxiliary algorithms
- If computational resources are no issue, there is a straightforward learning algorithm that has excellent mistake bounds for many classes of functions. We review it briefly, because it gives an upper limit on the mistake bound and because it suggests strategies that one might explore in searching for computationally efficient algorithms.
16. Algorithm 1: halving algorithm (HA)
- The HA can be applied to any finite class C of functions taking values in {0,1}. The HA maintains a variable CONSIST ⊆ C (initially CONSIST = C). When it receives an instance x, it determines the sets
- σ_0(CONSIST, x) = {f ∈ CONSIST : f(x) = 0}
- σ_1(CONSIST, x) = {f ∈ CONSIST : f(x) = 1}
17. HA: how it works
- If |σ_1(CONSIST, x)| > |σ_0(CONSIST, x)|, the algorithm predicts 1; otherwise it predicts 0.
- When it receives the reinforcement, it sets CONSIST = σ_1(CONSIST, x) if the correct response is 1, and CONSIST = σ_0(CONSIST, x) if the correct response is 0 (a sketch follows below).
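A minimal Python sketch of the halving algorithm as just described; the class and method names are illustrative:

```python
from typing import Callable, List, Tuple

BoolFn = Callable[[Tuple[int, ...]], int]

class Halving:
    """Keep every function still consistent with the reinforcement seen so far
    and predict by majority vote among them."""

    def __init__(self, target_class: List[BoolFn]):
        self.consist = list(target_class)                # CONSIST, initially all of C

    def predict(self, x: Tuple[int, ...]) -> int:
        ones = [f for f in self.consist if f(x) == 1]    # sigma_1(CONSIST, x)
        zeros = [f for f in self.consist if f(x) == 0]   # sigma_0(CONSIST, x)
        return 1 if len(ones) > len(zeros) else 0

    def update(self, x: Tuple[int, ...], correct: int) -> None:
        # Keep only the functions that agree with the correct response.
        self.consist = [f for f in self.consist if f(x) == correct]
```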
18. HA: main results
- Def: Let M_HALVING(C) denote the maximum number of mistakes that the algorithm will make when it is run for the class C.
- Th 1: For any non-empty target class C,
- M_HALVING(C) ≤ log₂|C|
- Th 2: For any finite target class C,
- opt(C) ≤ log₂|C|
19. Algorithm 2: standard optimal algorithm (SOA)
- Def 1: A mistake tree for a target class C over an instance space X is a binary tree, each of whose nodes is a non-empty subset of C and each of whose internal nodes is labeled with a point of X, satisfying:
- 1. The root of the tree is C.
- 2. For any internal node C' labeled with x, the left child of C' is σ_0(C', x) and the right child is σ_1(C', x).
20. SOA
- Def 2: A complete k-mistake tree is a mistake tree that is a complete binary tree of height k.
- Def 3: For any non-empty finite target class C, let K(C) equal the largest integer k s.t. there exists a complete k-mistake tree for C. K(∅) = -1.
- The SOA is similar to the HA, but instead of a majority vote it compares
- K(σ_1(CONSIST, x)) > K(σ_0(CONSIST, x))
- (a brute-force sketch of K and of the SOA follows below).
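A brute-force Python sketch of K(C) and of the SOA built on it, to make the definitions concrete. It is exponential and only usable for tiny classes; all names are illustrative:

```python
from functools import lru_cache
from typing import Callable, FrozenSet, List, Tuple

Instance = Tuple[int, ...]
BoolFn = Callable[[Instance], int]

def make_K(instance_space: List[Instance]):
    """Return a function K(C) giving the largest k for which a complete
    k-mistake tree for C exists (K of the empty class is -1)."""

    @lru_cache(maxsize=None)
    def K(C: FrozenSet[BoolFn]) -> int:
        if not C:
            return -1
        best = 0                                             # a single node is a 0-mistake tree
        for x in instance_space:
            zeros = frozenset(f for f in C if f(x) == 0)     # sigma_0(C, x)
            ones = C - zeros                                 # sigma_1(C, x)
            if zeros and ones:                               # x must split C into non-empty parts
                best = max(best, 1 + min(K(zeros), K(ones)))
        return best

    return K

class SOA:
    """Like the halving algorithm, but predicts 1 iff K(sigma_1) > K(sigma_0)."""

    def __init__(self, target_class: List[BoolFn], instance_space: List[Instance]):
        self.consist = frozenset(target_class)
        self.K = make_K(instance_space)

    def predict(self, x: Instance) -> int:
        ones = frozenset(f for f in self.consist if f(x) == 1)
        zeros = self.consist - ones
        return 1 if self.K(ones) > self.K(zeros) else 0

    def update(self, x: Instance, correct: int) -> None:
        self.consist = frozenset(f for f in self.consist if f(x) == correct)
```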
21. SOA: main results
- Th 1: Let X be any instance space and let C be a class of functions X → {0,1}. Then opt(C) = M_SOA(C) = K(C).
- Def 4: S ⊆ X is shattered by a target class C if for every U ⊆ S there exists f ∈ C s.t. f = 1 on U and f = 0 on S − U.
- Def 5: The Vapnik-Chervonenkis dimension VCdim(C) is the cardinality of the largest set shattered by C (a brute-force check is sketched below).
- Th 2: For any target class C,
- VCdim(C) ≤ opt(C)
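Defs 4 and 5 can be checked directly for very small instance spaces. A brute-force sketch, exponential in |S| and |X| and intended only as an illustration of the definitions:

```python
from itertools import chain, combinations

def shattered(S, target_class):
    """S is shattered by target_class if every subset U of S is realized as the
    1-set (within S) of some function in the class."""
    subsets = chain.from_iterable(combinations(S, r) for r in range(len(S) + 1))
    for U in map(set, subsets):
        if not any(all(f(x) == (1 if x in U else 0) for x in S) for f in target_class):
            return False
    return True

def vc_dimension(instance_space, target_class):
    """Cardinality of the largest subset of the instance space shattered by the class."""
    for r in range(len(instance_space), 0, -1):
        if any(shattered(S, target_class) for S in combinations(instance_space, r)):
            return r
    return 0
```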
22. The linear-threshold algorithm (LTA)
- Def 1: f: {0,1}^n → {0,1} is linearly separable if there is a hyperplane in R^n separating the points on which the function is 1 from those on which it is 0.
- Def 2: A monotone disjunction is a disjunction in which no literal appears negated: f(x_1,..,x_n) = x_{i1} ∨ … ∨ x_{ik}.
- The hyperplane given by x_{i1} + … + x_{ik} = 1/2 is a separating hyperplane for f.
23. WINNOW1
- The instance space is X = {0,1}^n.
- The algorithm maintains weights w_1,..,w_n ∈ R, each having 1 as its initial value.
- θ ∈ R is the threshold.
- When the learner receives an instance (x_1,..,x_n), the learner responds as follows:
- if Σ_i w_i x_i ≥ θ, then it predicts 1;
- if Σ_i w_i x_i < θ, then it predicts 0.
24. WINNOW1
- The weights are changed only if the learner makes a mistake, according to the algorithm's update table (a sketch of the update rule follows below).
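The standard WINNOW1 update rule is: after a mistake on a positive example (the learner predicted 0 but the correct response was 1), every weight w_i with x_i = 1 is multiplied by α (promotion); after a mistake on a negative example, every such weight is set to 0 (elimination). A minimal Python sketch under that rule; the class name and the default choices α = 2, θ = n/2 are illustrative:

```python
class Winnow1:
    """Sketch of WINNOW1: promotion by alpha on false negatives,
    elimination (weight set to 0) on false positives."""

    def __init__(self, n, alpha=2.0, theta=None):
        self.alpha = alpha
        self.theta = n / 2 if theta is None else theta   # e.g. alpha = 2, theta = n/2
        self.w = [1.0] * n                               # all weights start at 1

    def predict(self, x):
        return 1 if sum(w * xi for w, xi in zip(self.w, x)) >= self.theta else 0

    def update(self, x, correct):
        if self.predict(x) == correct:
            return                                       # weights change only on mistakes
        for i, xi in enumerate(x):
            if xi == 1:
                if correct == 1:
                    self.w[i] *= self.alpha              # promotion step
                else:
                    self.w[i] = 0.0                      # elimination step
```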
25. Requirements for WINNOW1
- The space needed (without counting bits per weight) and the sequential time needed per trial are both linear in n.
- Non-zero weights are powers of α, and the weights never exceed αθ. Thus if the logarithms (base α) of the weights are stored, only O(log₂ log_α θ) bits per weight are needed.
26. Mistake bound for WINNOW1
- Th: Suppose that the target function is a k-literal monotone disjunction given by f(x_1,..,x_n) = x_{i1} ∨ … ∨ x_{ik}. If WINNOW1 is run with α > 1 and θ ≥ 1/α, then for any sequence of instances the total number of mistakes will be bounded by
- α·k·(log_α θ + 1) + n/θ
27. Example
- Good bounds are obtained if we take θ = n/α.
- For α = 2 we get the bound 2k·log₂ n + 2; the dominating first term is minimized for α = e, and the bound then becomes
- (e/log₂ e)·k·log₂ n + e ≈ 1.885·k·log₂ n + e
- (a numeric check of both choices follows below).
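A quick numeric check of the bound from the previous slide, for illustrative values n = 1000 and k = 3:

```python
import math

def winnow1_bound(n, k, alpha, theta):
    """alpha * k * (log_alpha(theta) + 1) + n / theta."""
    return alpha * k * (math.log(theta, alpha) + 1) + n / theta

n, k = 1000, 3
print(winnow1_bound(n, k, alpha=2.0, theta=n / 2))          # ~ 2*k*log2(n) + 2     = 61.8
print(winnow1_bound(n, k, alpha=math.e, theta=n / math.e))  # ~ 1.885*k*log2(n) + e = 59.1
```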
28. Lower mistake bound
- Def: For 1 ≤ k ≤ n, let C_k denote the class of k-literal monotone disjunctions, and let C_≤k denote the class of all those monotone disjunctions that have at most k literals.
- Th (lower bound): For 1 ≤ k ≤ n,
- opt(C_≤k) ≥ opt(C_k) ≥ k·log₂(n/k). For n ≥ 1
- we also have opt(C_≤k) ≥ (k/8)·(1 + log₂(n/k)).
29. Modified WINNOW1
- For any instance space X ⊆ {0,1}^n and any δ s.t. 0 < δ ≤ 1, let F(X, δ) be the class of functions f: X → {0,1} s.t. for each f ∈ F(X, δ) there exist μ_1,..,μ_n ≥ 0 s.t. for all (x_1,..,x_n) ∈ X:
- Σ_i μ_i x_i ≥ 1 if f(x_1,..,x_n) = 1 (*)
- Σ_i μ_i x_i ≤ 1 − δ if f(x_1,..,x_n) = 0 (**)
- So the inverse images of 0 and 1 are linearly separable, with a minimum separation that depends on δ. The mistake bound that we derive will be practical only for those functions for which δ is sufficiently large.
30. Example: an r-of-k threshold function
- Def: Let X = {0,1}^n. An r-of-k threshold function is defined by selecting a set of k significant variables; the function is 1 whenever at least r of these k variables are 1.
- f = 1 ⟺ x_{i1} + … + x_{ik} ≥ r, so
- (1/r)·x_{i1} + … + (1/r)·x_{ik} ≥ 1 if f(x_1,..,x_n) = 1
- (1/r)·x_{i1} + … + (1/r)·x_{ik} ≤ 1 − 1/r if f(x_1,..,x_n) = 0
- Thus the r-of-k threshold functions belong to F({0,1}^n, 1/r) (a direct check is sketched below).
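A direct check of the two margin conditions (*) and (**) for a small r-of-k threshold function, using μ_i = 1/r on the k significant variables and 0 elsewhere. The function and parameter values below are illustrative:

```python
from itertools import product

def in_F_with_margin(n, relevant, r):
    """Verify (*) and (**) with mu_i = 1/r on `relevant` and delta = 1/r,
    i.e. that the r-of-k threshold function lies in F({0,1}^n, 1/r)."""
    mu = [1.0 / r if i in relevant else 0.0 for i in range(n)]
    delta = 1.0 / r
    for x in product((0, 1), repeat=n):
        f_x = 1 if sum(x[i] for i in relevant) >= r else 0
        s = sum(m * xi for m, xi in zip(mu, x))
        if f_x == 1 and not s >= 1:
            return False
        if f_x == 0 and not s <= 1 - delta:
            return False
    return True

print(in_F_with_margin(n=6, relevant={0, 2, 4}, r=2))   # True: a 2-of-3 threshold function
```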
31. WINNOW2
- The only change relative to WINNOW1 is the updating rule applied when a mistake is made (a sketch follows below).
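In WINNOW2 the elimination step of WINNOW1 is replaced by a demotion step: after a mistake on a negative example the weights of the active attributes are divided by α instead of being set to 0; the promotion step is unchanged. A self-contained sketch with illustrative names:

```python
class Winnow2:
    """Sketch of WINNOW2: like WINNOW1, but demotion divides active weights
    by alpha instead of setting them to 0."""

    def __init__(self, n, alpha, theta):
        self.alpha = alpha
        self.theta = theta
        self.w = [1.0] * n                      # all weights start at 1

    def predict(self, x):
        return 1 if sum(w * xi for w, xi in zip(self.w, x)) >= self.theta else 0

    def update(self, x, correct):
        if self.predict(x) == correct:
            return                              # weights change only on mistakes
        factor = self.alpha if correct == 1 else 1 / self.alpha
        for i, xi in enumerate(x):
            if xi == 1:
                self.w[i] *= factor             # promotion (*alpha) or demotion (/alpha)
```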
32. Requirements for WINNOW2
- We use α = 1 + δ/2 for learning a target function in F(X, δ).
- Space and time requirements for WINNOW2 are similar to those for WINNOW1. However, more bits will be needed to store each weight, perhaps as many as the logarithm of the mistake bound.
33. Mistake bound for WINNOW2
- Th: For 0 < δ ≤ 1, if the target function is in F(X, δ) for X ⊆ {0,1}^n, if μ_1,..,μ_n have been chosen so that f satisfies (*) and (**), and if WINNOW2 is run with α = 1 + δ/2 and θ ≥ 1 and the algorithm receives instances from X, then the number of mistakes will be bounded by
- (8/δ²)·(n/θ) + (5/δ + 14·ln θ/δ²)·Σ_i μ_i.
34. Example: an r-of-k threshold function
- Now we are going to calculate the mistake bound for r-of-k threshold functions. We have δ = 1/r and Σ_i μ_i = k/r. So for α = 1 + 1/(2r) and θ = n the mistake bound is 8r² + 5k + 14kr·ln n (a numeric check follows below).
- Note that 1-of-k threshold functions are just k-literal monotone disjunctions. Thus for α = 3/2, WINNOW2 will learn monotone disjunctions. The mistake bound is similar to the bound for WINNOW1, though with larger constants.
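A quick numeric check that the general WINNOW2 bound specializes to 8r² + 5k + 14kr·ln n for r-of-k threshold functions; the values of n, k and r below are illustrative:

```python
import math

def winnow2_bound(n, theta, delta, sum_mu):
    """(8/delta^2)*(n/theta) + (5/delta + 14*ln(theta)/delta^2) * sum(mu_i)."""
    return (8 / delta**2) * (n / theta) + (5 / delta + 14 * math.log(theta) / delta**2) * sum_mu

n, k, r = 1000, 5, 2                      # r-of-k threshold: delta = 1/r, sum(mu_i) = k/r
general = winnow2_bound(n, theta=n, delta=1 / r, sum_mu=k / r)
specialised = 8 * r**2 + 5 * k + 14 * k * r * math.log(n)
print(general, specialised)               # the two expressions agree (~1024.1)
```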
35. Conclusion
- The first part gives general results about how many mistakes an effective learner might make if computational complexity were not an issue.
- The second part describes an efficient algorithm for learning specific target classes.
- A key advantage of WINNOW1 and WINNOW2 is their performance when few attributes are relevant.
36. Conclusion
- If we define the number of relevant variables needed to express a function in the class F({0,1}^n, δ) to be the least number of strictly positive weights needed to describe a separating hyperplane, then for n > 1 this target class can be learned with a number of mistakes bounded by C·k·(log n)/δ², for a constant C, when the target function can be expressed with k relevant variables.