Title: Concept Learning
1- Concept Learning
- Machine Learning by T. Mitchell (McGraw-Hill)
- Chp. 2
2- Much of learning involves acquiring general concepts from specific training examples - e.g. what is a bird? what is a chair?
- Concept learning: inferring a boolean-valued function from training examples of its input and output.
3- A Concept Learning Task Example
- Target Concept
- Days on which my friend Aldo enjoys his favorite water sport
- (you may find it more intuitive to instead use the "Days on which the beach will be crowded" concept)
- Task
- Learn to predict the value of EnjoySport/Crowded for an arbitrary day
- Training Examples for the Target Concept
- 6 attributes (nominal-valued (symbolic) attributes)
- Sky (Sunny, Rainy, Cloudy), AirTemp (Warm, Cold), Humidity (Normal, High), Wind (Strong, Weak), Water (Warm, Cool), Forecast (Same, Change)
4- A Learning Problem
[Diagram: a black box computes an unknown function y = f(x1, x2, x3, x4) from the four inputs x1..x4]
- Hypothesis Space (H): set of all possible hypotheses that the learner may consider during learning the target concept
5- Hypothesis Space: Unrestricted Case
- |A → B| = |B|^|A|
- |H| = |{0,1} × {0,1} × {0,1} × {0,1} → {0,1}| = 2^(2^4) = 2^16 = 65536 function values
- After 7 examples, still have 2^9 = 512 possibilities (out of 65536) for f
- Is learning possible without any assumptions?
6- A Concept Learning Task
- Hypothesis h: conjunction of constraints on attributes
- Constraint Values
- Specific value (e.g., Water = Warm)
- Don't care (e.g., Water = ?)
- No value allowed (e.g., Water = Ø)
- Hypothesis Representation
- Example hypothesis for EnjoySport
- Aldo enjoys his favorite sport only on sunny days with strong wind
- Sky  AirTemp  Humidity  Wind  Water  Forecast
- <Sunny, ?, ?, Strong, ?, ?>
- The most general hypothesis: every day is a positive example
- <?, ?, ?, ?, ?, ?>
- The most specific possible hypothesis: no day is a positive example
- <Ø, Ø, Ø, Ø, Ø, Ø>
- Is this hypothesis consistent with the training examples?
7- A Concept Learning Task (2)
- The instance space, X (the book uses "set of instances")
- all possible days, represented by the attributes Sky, AirTemp, ...
- Target concept, c
- Any boolean-valued function defined over the instance space X
- c : X → {0, 1} (i.e. if EnjoySport = Yes, then c(x) = 1)
- Training Examples (denoted by D): ordered pairs <x, c(x)>
- Positive example: member of the target concept, c(x) = 1
- Negative example: nonmember of the target concept, c(x) = 0
- Assumption: no missing X values
- No noise in values of c (no contradictory labels)
- All possible hypotheses, H
- E.g. conjunctions of constraints on attributes
- Often picked by the designer
- H is a set of boolean-valued functions defined over X
- The goal of the learner: find a hypothesis h such that h(x) = c(x) for all x in X
8- A Concept Learning Task (3)
- Although the learning task is to determine a
hypothesis h identical to c, over the entire set
of instances X, the only information available
about c is its value over the training instances
D. - Inductive Learning Hypothesis
- Any hypothesis found to approximate the target
function well over a sufficiently large set of
training examples will also approximate the
target function well over other unobserved
examples.
9- Concept Learning As Search
- Concept learning can be viewed as the task of searching through a large space of hypotheses implicitly defined by the hypothesis representation.
- The goal of this search is to find the hypothesis that (best) fits the training examples.
- Sky  AirTemp  Humidity  Wind  Water  Forecast
- <Sunny/Rainy/Cloudy, Warm/Cold, Normal/High, Weak/Strong, Warm/Cool, Change/Same>
- EnjoySport Learning Task
- The instance space X
- 3 × 2 × 2 × 2 × 2 × 2 = 96 instances
- Syntactically distinct hypotheses (adding ? and Ø to each attribute)
- 5 × 4 × 4 × 4 × 4 × 4 = 5120
- Semantically distinct hypotheses (Ø anywhere means the empty set of instances and classifies each possible instance as a negative example)
- 1 + (4 × 3 × 3 × 3 × 3 × 3) = 973
- Often much larger, sometimes infinite, hypothesis spaces
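A small Python sketch that reproduces the counts above from the attribute cardinalities listed for EnjoySport (the variable names are illustrative):

```python
# Attribute cardinalities: Sky, AirTemp, Humidity, Wind, Water, Forecast
attribute_values = [3, 2, 2, 2, 2, 2]

instances = 1          # size of the instance space X
syntactic = 1          # syntactically distinct hypotheses (values plus ? and Ø)
semantic_nonempty = 1  # hypotheses without Ø (values plus ?)
for v in attribute_values:
    instances *= v
    syntactic *= v + 2
    semantic_nonempty *= v + 1

print(instances)              # 96
print(syntactic)              # 5120
print(semantic_nonempty + 1)  # 973: any hypothesis containing Ø denotes the one empty concept
```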
10- Concept Learning As Search (2)
- How to (efficiently) search the hypothesis space?
- General-to-Specific Ordering of Hypotheses
- Very useful structure over the hypothesis space H for any concept learning problem
- without explicit enumeration
- Let hj and hk be boolean-valued functions defined over X.
- hj is more_general_than_or_equal_to hk  // accepts at least as many instances
- hj ≥g hk
- if and only if (∀x ∈ X) [(hk(x) = 1) → (hj(x) = 1)]
- hj is more_general_than hk
- hj >g hk
- if and only if (hj ≥g hk) ∧ ¬(hk ≥g hj)
- hj is more_specific_than hk when hk is more_general_than hj
- The relation ≥g is independent of the target concept
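The more_general_than_or_equal_to relation can be tested attribute-wise for the conjunctive representation instead of enumerating all of X. A minimal sketch, assuming hypotheses are tuples of attribute constraints with "?" for don't-care and "Ø" for no-value-allowed (names are illustrative):

```python
WILDCARD, EMPTY = "?", "Ø"

def more_general_or_equal(hj, hk):
    """hj >=g hk: every instance that hk classifies as positive, hj does too.
    For conjunctions this reduces to an attribute-wise comparison."""
    if any(c == EMPTY for c in hk):   # hk accepts no instance, so any hj is >=g hk
        return True
    return all(cj == WILDCARD or cj == ck for cj, ck in zip(hj, hk))

h1 = ("Sunny", "?", "?", "Strong", "?", "?")
h2 = ("Sunny", "?", "?", "?", "?", "?")
print(more_general_or_equal(h2, h1))  # True: h2 imposes fewer constraints
print(more_general_or_equal(h1, h2))  # False
```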
11- Concept Learning As Search (3)
- h1 = <Sunny, ?, ?, Strong, ?, ?>
- h2 = <Sunny, ?, ?, ?, ?, ?>
- h3 = <Sunny, ?, ?, ?, Cool, ?>
- h1 versus h2
- h2 imposes fewer constraints
- h2 classifies more examples as positive
- any instance classified as positive by h1 is
classified as positive by h2 - h2 is more general than h1
- How about h3?
- Partial ordering
- The structure imposed by this partial ordering on
the hypothesis space H can be exploited for
efficiently exploring H.
12- Instances, Hypotheses, and the Partial Ordering Less-Specific-Than
[Figure: the instance space X and the hypothesis space H, arranged from specific to general]
- h1 = <Sunny, ?, ?, Strong, ?, ?>, h2 = <Sunny, ?, ?, ?, ?, ?>, h3 = <Sunny, ?, ?, ?, Cool, ?>
- x1 = <Sunny, Warm, High, Strong, Cool, Same>, x2 = <Sunny, Warm, High, Light, Warm, Same>
- h2 ≤P h1 and h2 ≤P h3
- ≤P ≡ Less-Specific-Than ≡ More-General-Than
13- Idea: Exploit the partial order to effectively search the space of hypotheses
- Find-S
- Candidate-Elimination
- List-Then-Eliminate Algorithm (mentioned along
with CE as a bad alternative)
14- Find-S: Finding a maximally specific hypothesis
- Method
- Begin with the most specific possible hypothesis in H
- Generalize this hypothesis each time it fails to cover an observed positive training example.
- Algorithm (a Python sketch follows below)
- Initialize h to the most specific hypothesis in H
- For each positive training instance x
- For each attribute constraint ai in h
- If the constraint ai is NOT satisfied by x
- Replace ai in h by the next more general constraint that is satisfied by x
- Output hypothesis h
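A minimal Python sketch of Find-S for the conjunctive representation, assuming training examples are (attribute-tuple, boolean-label) pairs; the data below is the EnjoySport trace shown on the next slide:

```python
EMPTY, WILDCARD = "Ø", "?"

def find_s(examples, n_attributes=6):
    """Find-S: start from the most specific hypothesis and minimally
    generalize it on every positive example; negative examples are ignored."""
    h = [EMPTY] * n_attributes                 # h0 = <Ø, Ø, ..., Ø>
    for x, label in examples:
        if not label:                          # ignore negative examples
            continue
        for i, (ai, xi) in enumerate(zip(h, x)):
            if ai == EMPTY:                    # first positive example: copy its values
                h[i] = xi
            elif ai != WILDCARD and ai != xi:  # conflicting value: relax to "don't care"
                h[i] = WILDCARD
    return h

D = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"),   True),
    (("Sunny", "Warm", "High",   "Strong", "Warm", "Same"),   True),
    (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High",   "Strong", "Cool", "Change"), True),
]
print(find_s(D))   # ['Sunny', 'Warm', '?', 'Strong', '?', '?']
```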
15- Hypothesis Space Search by Find-S
[Figure: the training instances in X and the sequence of Find-S hypotheses in H]
- x1 = <Sunny, Warm, Normal, Strong, Warm, Same>  (positive)
- x2 = <Sunny, Warm, High, Strong, Warm, Same>  (positive)
- x3 = <Rainy, Cold, High, Strong, Warm, Change>  (negative)
- x4 = <Sunny, Warm, High, Strong, Cool, Change>  (positive)
- h0 = <Ø, Ø, Ø, Ø, Ø, Ø>
- h1 = <Sunny, Warm, Normal, Strong, Warm, Same>
- h2 = <Sunny, Warm, ?, Strong, Warm, Same>
- h3 = <Sunny, Warm, ?, Strong, Warm, Same>
- h4 = <Sunny, Warm, ?, Strong, ?, ?>
16- Find-S (2)
- The Find-S algorithm simply ignores every negative example!
- The current hypothesis h is already consistent with the new negative example.
- No revision is needed
- Based on the assumptions that
- H contains a hypothesis describing the true concept c
- Data contains no error
- Formal proof that h does not need revision in response to a negative example
- Let h be the current hypothesis and c be the target concept, assumed to be in H
- c is more_general_than_or_equal_to h (the current hypothesis)
- since c covers all of the positive examples, and h is the most specific hypothesis that covers them
- c never covers a negative instance
- since c is the target concept and the data is noise-free
- hence, neither will h
- by definition of more_general_than (every instance h covers, c also covers)
17- Find-S (3): Shortcomings
- The algorithm finds one hypothesis, but can't tell whether it has found the only hypothesis consistent with the data or whether there are more such hypotheses
- Why prefer the most specific hypothesis?
- Multiple hypotheses may be consistent with the training examples
- Find-S will find the most specific.
- Are the training examples consistent?
- In practice, the training examples may contain some error or noise
- Such inconsistent sets of training examples can mislead Find-S
- What if there are several maximally specific consistent hypotheses?
- Several maximally specific hypotheses consistent with the data,
- or no maximally specific consistent hypothesis
18- Version Space and the Candidate-Elimination Algorithm
- Consistent
- A hypothesis h is consistent with a set of training examples D
- if and only if h(x) = c(x) for each example <x, c(x)> in D.
- Consistent(h, D) ≡ (∀<x, c(x)> ∈ D) h(x) = c(x)
- Related definitions
- x satisfies the constraints of hypothesis h when h(x) = 1
- regardless of whether x is a positive or negative example for the concept
- h covers a positive training example x
- if it correctly classifies x as positive
- The Candidate-Elimination algorithm outputs the set of all hypotheses consistent with the training examples
- Without enumerating all hypotheses
- Remember: Find-S outputs one hypothesis from H that is consistent with the training examples
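A hedged sketch of the Consistent(h, D) test for the conjunctive representation, reusing the h(x) evaluation from the earlier sketches (the encoding is an assumption, not Mitchell's notation):

```python
WILDCARD, EMPTY = "?", "Ø"

def h_of(x, h):
    """h(x) for a conjunctive hypothesis: 1 iff x satisfies every constraint."""
    return int(all(c != EMPTY and (c == WILDCARD or c == v) for v, c in zip(x, h)))

def consistent(h, D):
    """Consistent(h, D): h agrees with the label c(x) of every training example."""
    return all(h_of(x, h) == int(label) for x, label in D)
```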
19- Version Space and the Candidate-Elimination (2)
- Version space
- The version space, denoted VS_H,D, with respect to hypothesis space H and training examples D, is the subset of hypotheses from H that are consistent with the training examples in D.
- VS_H,D ≡ {h ∈ H | Consistent(h, D)}
- The List-Then-Eliminate Algorithm
- VersionSpace ← a list containing every hypothesis in H
- For each training example <x, c(x)>
- remove from VersionSpace any hypothesis h for which h(x) ≠ c(x)
- Output the list of hypotheses in VersionSpace
- Guaranteed to output all hypotheses consistent with the training data
- Can be applied whenever the hypothesis space H is finite
- It requires exhaustively enumerating all hypotheses in H - not realistic
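A sketch of List-Then-Eliminate on the EnjoySport data, under the same tuple encoding assumed above; it is only feasible here because the conjunctive H is tiny, which is exactly the objection raised in the last bullet:

```python
from itertools import product

WILDCARD, EMPTY = "?", "Ø"
VALUES = [("Sunny", "Rainy", "Cloudy"), ("Warm", "Cold"), ("Normal", "High"),
          ("Strong", "Weak"), ("Warm", "Cool"), ("Same", "Change")]

def h_of(x, h):
    return int(all(c != EMPTY and (c == WILDCARD or c == v) for v, c in zip(x, h)))

def list_then_eliminate(D):
    """Enumerate every conjunctive hypothesis, then discard any that
    disagrees with a training example."""
    # Hypotheses containing Ø reject everything, so they drop out as soon as
    # D contains a positive example; enumerating values plus "?" suffices here.
    version_space = list(product(*[vals + (WILDCARD,) for vals in VALUES]))
    for x, label in D:
        version_space = [h for h in version_space if h_of(x, h) == int(label)]
    return version_space

D = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"),   True),
    (("Sunny", "Warm", "High",   "Strong", "Warm", "Same"),   True),
    (("Rainy", "Cold", "High",   "Strong", "Warm", "Change"), False),
    (("Sunny", "Warm", "High",   "Strong", "Cool", "Change"), True),
]
for h in list_then_eliminate(D):
    print(h)   # prints the 6 hypotheses of the final version space
```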
20- The List-Then-Eliminate Algorithm
This version space, containing all 6 hypotheses, can be compactly represented with its most specific (S) and most general (G) sets.
21- Version Space and the Candidate-Elimination (3)
- The Specific boundary S
- With respect to hypothesis space H and training data D, S is the set of minimally general (i.e. maximally specific) members of H consistent with D.
- S ≡ {s ∈ H | Consistent(s, D) ∧ ¬(∃s' ∈ H)[(s >g s') ∧ Consistent(s', D)]}
- Most specific ≡ maximal elements of VS_H,D
- ≡ set of sufficient conditions
- The General boundary G
- With respect to hypothesis space H and training data D, G is the set of maximally general members of H consistent with D.
- G ≡ {g ∈ H | Consistent(g, D) ∧ ¬(∃g' ∈ H)[(g' >g g) ∧ Consistent(g', D)]}
- Most general ≡ minimal elements of VS_H,D
- ≡ set of necessary conditions
22- Version Space and the Candidate-Elimination (4)
- The version space is the set of hypotheses contained in G, plus those contained in S, plus those that lie between G and S in the partially ordered hypothesis space.
- Version space representation theorem
- Let X be an arbitrary set of instances and let H be a set of boolean-valued hypotheses defined over X.
- Let c : X → {0, 1} be an arbitrary target concept defined over X,
- and let D be an arbitrary set of training examples {<x, c(x)>}.
- For all X, H, c, and D such that S and G are well defined,
- VS_H,D = {h ∈ H | (∃s ∈ S) (∃g ∈ G) (g ≥g h ≥g s)}
- Proof: show that every h in VS satisfies the right-hand-side condition, and that every h satisfying the right-hand side is in VS (exercise: Mitchell 2.6)
23- Representing Version Spaces: Another Take on the Same Definitions
- Hypothesis Space
- A finite semilattice under the partial ordering Less-Specific-Than (from <Ø, ..., Ø> up to the all-? hypothesis)
- Every pair of hypotheses has a greatest lower bound (GLB)
- VS_H,D ≡ the consistent poset (partially-ordered subset of H)
- Definition: General Boundary
- The general boundary G of version space VS_H,D is the set of its most general members
- Most general ≡ minimal elements of VS_H,D ≡ set of necessary conditions
- Definition: Specific Boundary
- The specific boundary S of version space VS_H,D is the set of its most specific members
- Most specific ≡ maximal elements of VS_H,D ≡ set of sufficient conditions
- Version Space
- Every member of the version space lies between S and G
- VS_H,D ≡ {h ∈ H | ∃s ∈ S . ∃g ∈ G . g ≤P h ≤P s}, where ≤P ≡ Less-Specific-Than
24- Version Space and the Candidate-Elimination (4)
- The Candidate-Elimination algorithm works on the same principle as List-Then-Eliminate, but uses a more compact representation of the version space
- The version space is represented by its most general and least general (most specific) members.
- Candidate-Elimination Learning Algorithm
- Initialize G to the set of maximally general hypotheses in H
- Initialize S to the set of maximally specific hypotheses in H
- G0 ← {<?, ?, ?, ?, ?, ?>}
- S0 ← {<Ø, Ø, Ø, Ø, Ø, Ø>}
- ...
25- Candidate-Elimination (5)
- Candidate-Elimination Learning Algorithm (cont.)
- For each training example d, do
- If d is a positive example
- // Generalize S...
- Remove from G any hypothesis inconsistent with d
- For each hypothesis s in S that is not consistent with d
- Remove s from S
- Add to S all minimal generalizations h of s such that
- h is consistent with d and some member of G is more general than h
- Remove from S any hypothesis that is more general than another hypothesis in S
- If d is a negative example
- // Specialize G...
- Remove from S any hypothesis inconsistent with d
- For each hypothesis g in G that is not consistent with d
- Remove g from G
- Add to G all minimal specializations h of g such that
- h is consistent with d and some member of S is more specific than h
- Remove from G any hypothesis that is less general than another hypothesis in G
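The algorithm hinges on two operations, minimal generalization and minimal specialization. A sketch for the conjunctive representation (the attribute value lists and tuple encoding are illustrative assumptions):

```python
WILDCARD, EMPTY = "?", "Ø"
VALUES = [("Sunny", "Rainy", "Cloudy"), ("Warm", "Cold"), ("Normal", "High"),
          ("Strong", "Weak"), ("Warm", "Cool"), ("Same", "Change")]

def min_generalization(s, x):
    """The unique minimal generalization of conjunctive hypothesis s that
    covers the positive instance x (the same update Find-S performs)."""
    h = list(s)
    for i, (ci, xi) in enumerate(zip(s, x)):
        if ci == EMPTY:
            h[i] = xi
        elif ci != WILDCARD and ci != xi:
            h[i] = WILDCARD
    return tuple(h)

def min_specializations(g, x):
    """All minimal specializations of g that exclude the negative instance x:
    replace one '?' with any attribute value that differs from x there."""
    specs = []
    for i, ci in enumerate(g):
        if ci == WILDCARD:
            for v in VALUES[i]:
                if v != x[i]:
                    specs.append(g[:i] + (v,) + g[i + 1:])
    return specs

g0 = ("?",) * 6
x_neg = ("Rainy", "Cold", "High", "Strong", "Warm", "Change")
print(min_specializations(g0, x_neg))
# e.g. ('Sunny', '?', ...), ('Cloudy', '?', ...), ('?', 'Warm', ...), ('?', ..., 'Same')
```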
26- Candidate-Elimination (6)
- The Candidate-Elimination algorithm works by
- computing minimal generalizations and specializations,
- and identifying non-minimal and non-maximal hypotheses
- The algorithm can be applied to any concept learning task and hypothesis space for which these operations are well-defined
27- Candidate-Elimination: Example Trace
d1 = <Sunny, Warm, Normal, Strong, Warm, Same>, Yes
d2 = <Sunny, Warm, High, Strong, Warm, Same>, Yes
d3 = <Rainy, Cold, High, Strong, Warm, Change>, No
d4 = <Sunny, Warm, High, Strong, Cool, Change>, Yes
G4: We cannot specialize the last element of G3; it must be removed.
G3: What about <?, ?, Normal, ?, ?, ?> or <Cloudy, ?, ?, ?, ?, ?>? They are inconsistent with the previous positive examples that S2 summarizes.
28- // S summarizes all past positive examples
- Any hypothesis h more general than S is guaranteed to be
- consistent with all the previous positive examples
- Let h be a generalization of an s in S
- h covers at least the points that s covers, since it is more general
- In particular, h covers all points covered by s
- Since s is consistent with all positive examples, so is h
- // G summarizes all past negative examples
- Any hypothesis h more specific than G is guaranteed to be
- consistent with all the previous negative examples
- Let h be a specialization of a g in G
- h covers fewer points than g
- Hence h cannot cover any negative example not covered by g
- Since g is consistent with all negative examples, so is h
- The learned version space is independent of the order in which the training examples are presented
- After all, the VS contains all the consistent hypotheses
- The S and G boundaries will move closer together with more examples, up to convergence
29- Remarks on Version Spaces and C-E
- The version space converges to the correct hypothesis provided that
- there are no errors in the training examples.
- Contains error?
- The correct target concept is removed from the VS, since every h inconsistent with the training data is removed
- Would be detected as an empty set of hypotheses
- there is some hypothesis in H that correctly describes the target concept
- Target concept not in H?
- e.g. if the target concept is a disjunction of feature attributes and the hypothesis space supports only conjunctive descriptions.
- The target concept is exactly learned when the S and G boundary sets converge to a single, identical hypothesis
30- What Next Training Example?
31- Remarks on Version Spaces and C-E
- What Training Example Should the Learner Request Next?
- e.g. <Sunny, Warm, Normal, Light, Warm, Same>
- <Sunny, Warm, Normal, Strong, Cool, Change>
- <Rainy, Cold, Normal, Light, Warm, Same>
- The optimal query strategy for a concept learner is to generate instances that satisfy exactly half the hypotheses in the current version space
- If the size of the VS is reduced by half with each new example, the correct target concept can be found with only ⌈log2 |VS|⌉ experiments.
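A small sketch of that query strategy for the conjunctive EnjoySport setting: among all instances, pick the one whose positive-vote count in the current version space is closest to half (the helper names and encoding are assumptions carried over from the earlier sketches):

```python
from itertools import product

WILDCARD, EMPTY = "?", "Ø"
VALUES = [("Sunny", "Rainy", "Cloudy"), ("Warm", "Cold"), ("Normal", "High"),
          ("Strong", "Weak"), ("Warm", "Cool"), ("Same", "Change")]

def h_of(x, h):
    return int(all(c != EMPTY and (c == WILDCARD or c == v) for v, c in zip(x, h)))

def best_query(version_space):
    """Pick the instance that splits the version space as evenly as possible,
    so that either answer eliminates roughly half of the hypotheses."""
    target = len(version_space) / 2
    return min(product(*VALUES),
               key=lambda x: abs(sum(h_of(x, h) for h in version_space) - target))
```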
32- Summary Points
- Concept Learning as Search through H
- Hypothesis space H as a state space
- Learning: finding the correct hypothesis
- General-to-Specific Ordering over H
- Partially-ordered set: the Less-Specific-Than (More-General-Than) relation
- Upper and lower bounds in H
- Version Space / Candidate-Elimination Algorithm
- S and G boundaries characterize the learner's uncertainty
- The version space can be used to make predictions over unseen cases
- Learner Can Generate Useful Queries
- Next Lecture: When and Why Are Inductive Leaps Possible?
33- Summary Points: Terminology
- Supervised Learning
- Concept - function from observations to categories (so far, boolean-valued: +/-)
- Target (function) - true function f
- Hypothesis - proposed function h believed to be similar to f
- Hypothesis space - space of all hypotheses that can be generated by the learning system
- Example - tuples of the form <x, f(x)>
- Instance space (aka example space) - space of all possible examples
- Classifier - discrete-valued function whose range is a set of class labels
- The Version Space Algorithm
- Algorithms: Find-S, List-Then-Eliminate, Candidate-Elimination
- Consistent hypothesis - one that correctly predicts the observed examples
- Version space - space of all currently consistent (or satisfiable) hypotheses
- Inductive Learning
- Inductive generalization - process of generating hypotheses that describe cases not yet observed
- The inductive learning hypothesis
34- Remarks on Version Spaces and C-E
- How can partially learned concepts be used?
- No additional training examples, multiple remaining hypotheses
- Example: Table 2.6 (p. 39)
- Instance A satisfies every member of S
- No need to look further, it will satisfy all h in VS
- Classify as a positive example
- Instance B satisfies none of the members of G
- No need to look further, it will not satisfy any h in VS
- Classify as a negative example
- Instance C: half of the VS classifies it as positive and half as negative
- the most ambiguous case, and the most informative query for refining the version space
- Instance D: classified as positive by two hypotheses of the VS, as negative by the others
- output the majority vote, with a confidence rating
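A hedged sketch of such voting over a version space of conjunctive hypotheses (same encoding assumption as in the earlier sketches):

```python
WILDCARD, EMPTY = "?", "Ø"

def h_of(x, h):
    return int(all(c != EMPTY and (c == WILDCARD or c == v) for v, c in zip(x, h)))

def vote(version_space, x):
    """Classify x by majority vote of the version space; a unanimous vote gives
    a certain answer, otherwise the confidence is the fraction that agrees."""
    positives = sum(h_of(x, h) for h in version_space)
    total = len(version_space)
    label = positives * 2 >= total            # majority label (ties count as positive)
    confidence = max(positives, total - positives) / total
    return label, confidence
```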
35- Inductive Bias
- Mitchell-Chp. 2
36- What Justifies This Inductive Leap?
- Example: Inductive Generalization
- Positive example: <Sunny, Warm, Normal, Strong, Cool, Change>, Yes
- Positive example: <Sunny, Warm, Normal, Light, Warm, Same>, Yes
- Induced S: <Sunny, Warm, Normal, ?, ?, ?>
- Why Believe We Can Classify the Unseen?
- e.g., <Sunny, Warm, Normal, Strong, Warm, Same>
37- Inductive Bias
- A biased hypothesis space
- EnjoySport example
- Restriction: only conjunctions of attribute values.
- No representation for a disjunctive target concept
- e.g. Sky = Sunny or Wind = Weak
- The problem
- We biased the learner (inductive bias) to consider only conjunctive hypotheses
- But the concept requires a more expressive hypothesis space
38- Inductive Bias (2)
- An Unbiased Learner
- Obvious solution: provide a hypothesis space capable of representing every teachable concept (every possible subset of the instance space X)
- The set of all subsets of a set X is called the power set of X
- EnjoySport Example
- Instance space: 96 instances
- Power set of X: 2^96 = 79,228,162,514,264,337,593,543,950,336 concepts
- Conjunctive hypothesis space: only 973
- A very biased hypothesis space indeed!
39- Inductive Bias (3): Need for Inductive Bias
- An Unbiased Learner
- Reformulate the EnjoySport learning task in an unbiased way
- Define a new hypothesis space H' that can represent every subset of X
- Allow arbitrary disjunctions, conjunctions, and negations
- Example: Sky = Sunny or Wind = Weak
- <Sunny, ?, ?, ?, ?, ?> ∨ <?, ?, ?, Weak, ?, ?>
- New problem: completely unable to generalize beyond the observed examples!
- What are the S and G boundaries?
- The S boundary of the VS will contain just the disjunction of the positive examples.
- Three positive examples (x1, x2, x3): S = {x1 ∨ x2 ∨ x3}
- The G boundary of the VS will consist of the hypothesis that rules out only the observed negative examples.
- Two negative examples (x4, x5): G = {¬(x4 ∨ x5)}
- In order to converge to a single final concept, we will have to present every single instance in X as a training example.
40- Inductive Bias (3): Need for Inductive Bias
- New problem
- For all the unseen instances, there won't be a unanimous vote either
- Half of the hypotheses in the VS will vote positive, and half will vote negative
- Assume a previously unseen instance x
- For any hypothesis h in the VS that covers x as positive,
- there will be another hypothesis h' that is identical to h except for its classification of x.
- If h is in the VS, so will be h'
- The problem is a general one, not specific to Candidate-Elimination
41- Fundamental property of inductive inference
- A learner that makes no a priori assumptions
regarding the identity of the target concept has
no rational basis for classifying any unseen
instances
42- Inductive Bias (5)
- Consider
- Concept learning algorithm L, instance space X, target concept c
- Training examples Dc = {<x, c(x)>}
- Let L(xi, Dc) denote the classification of xi by L after training on Dc
- The label L(xi, Dc) need not be correct. What assumptions should we make so that it follows deductively?
- Definition
- The inductive bias of L is any minimal set of assertions B such that, for any target concept c and corresponding training examples Dc, the assumptions in B justify its inductive inferences as deductive inferences
- (∀xi ∈ X) [(B ∧ Dc ∧ xi) ⊢ L(xi, Dc)]
- where y ⊢ z means that z follows deductively from y
43- Inductive Bias (6)
- Inductive bias of the Candidate-Elimination algorithm
- Let's assume that CE classifies a new instance x only if the vote is unanimous
- Inductive bias: the target concept c is contained in the given hypothesis space H
- If c is in H, it is also in the VS.
- If all h in the VS vote unanimously, it must be that c(xi) = L(xi, Dc)
- More strongly biased methods can classify (rather than reject) a greater proportion of unseen instances
- Whether those classifications are correct is another issue!
44- Inductive Bias (4): The Futility of Bias-Free Learning
- The EnjoySport example makes an implicit assumption
- the target concept is representable by a conjunction of attribute values
- Inductive learning requires some prior assumptions (inductive bias)
- It is useful to characterize different learning approaches by the inductive bias they employ
45- Three Learners with Different Biases
- Rote Learner
- Stores each observed training example in memory
- Classifies x if and only if it matches a previously observed example
- Weakest bias: no bias
- Candidate-Elimination Algorithm
- Stores extremal generalizations and specializations
- Classifies x if and only if it falls within the S and G boundaries (all members agree)
- Stronger bias: the target concept belongs to the conjunctive H
- Find-S
- Prior assumption: all instances are negative unless the opposite is entailed by S
- Classifies x based on the S set
- Even stronger bias: prefer the most specific hypothesis
46- Summary
- Concept learning can be cast as a problem of searching through a large predefined space of potential hypotheses
- The general-to-specific partial ordering of hypotheses provides a useful structure for organizing the search through the hypothesis space
- Find-S algorithm, Candidate-Elimination algorithm (non-noisy data)
- The S and G sets delimit the entire set of hypotheses consistent with the data
- Inductive learning algorithms are able to classify unseen examples only because of their inductive bias
- Allowing every possible subset of instances (the power set of the instances)
- removes any inductive bias from the Candidate-Elimination algorithm
- and also removes the ability to classify any instance beyond the observed training examples
- An unbiased learner cannot make inductive leaps to classify unseen examples.
47- Mistake Bound Model
- Rivest- Lecture 1
48- Mistake Bound Model
- The learner receives a sequence of training examples
- Instance-based learning
- Upon receiving each example x, the learner must predict the target value c(x)
- Online learning
- How many mistakes will the learner make before it learns the target concept?
- e.g. learning to detect fraudulent credit card purchases
50- Mistake Bound Model
51- Theorem 1. Online learning of conjunctive concepts can be done with at most n+1 prediction mistakes.
52- Thm: Online learning of conjunctive concepts can be done with at most n+1 mistakes
53- Thm: Online learning of conjunctive concepts can be done with at most n+1 mistakes
Proof (that no mistake is ever made on a negative example), by contradiction: Assume the current concept c makes an error on a negative example x. This means that c classifies x as positive, hence c is overly general (it would have to be corrected by specializing, so as to exclude x). But by construction, c is the most specific hypothesis consistent with all the positive examples seen so far, hence it cannot be overly general. Contradiction.
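A sketch of a standard elimination-style online learner for conjunctions of boolean literals, consistent with the proof sketch above; it assumes noise-free labels, a target that really is a conjunction over x1..xn, and a hypothetical example stream:

```python
def online_conjunction_learner(stream, n):
    """Start with the conjunction of all 2n literals (so the learner predicts
    negative until the first positive example); on every mistake, drop each
    literal contradicted by the example.  Mistakes occur only on positive
    examples, and there are at most n+1 of them."""
    literals = {(i, b) for i in range(n) for b in (True, False)}  # (index, required value)
    mistakes = 0
    for x, label in stream:                       # x is a tuple of n booleans
        prediction = all(x[i] == b for i, b in literals)
        if prediction != label:                   # by the argument above, label is True here
            mistakes += 1
            literals = {(i, b) for i, b in literals if x[i] == b}
    return literals, mistakes

# Hypothetical stream for the target concept x0 AND (not x2), with n = 3
stream = [
    ((True,  True,  False), True),
    ((False, True,  False), False),
    ((True,  False, False), True),
    ((True,  True,  True),  False),
]
h, m = online_conjunction_learner(stream, 3)
print(sorted(h), m)   # literals consistent with the positives, and the mistake count
```

The count behind the bound: the first mistake removes n of the 2n literals (one of each complementary pair survives), and every later mistake removes at least one more literal, so at most n+1 mistakes can occur.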
58- Two Strategies for Machine Learning
- Develop Ways to Express Prior Knowledge
- Role of prior knowledge: guides the search for hypotheses / hypothesis languages
- Expression languages for prior knowledge
- Rule grammars, stochastic models, etc.
- Restrictions on computational models
- Other (formal) specification methods
- Develop Flexible Hypothesis Spaces
- Structured collections of hypotheses
- Agglomeration: nested collections (hierarchies)
- Partitioning: decision trees, lists, rules
- Neural networks, cases, etc.
- Hypothesis spaces of adaptive size
- Either Case: Develop Algorithms for Finding a Hypothesis That Fits Well
- Ideally, one that will also generalize well
59- Views of Learning
- Removal of (Remaining) Uncertainty
- Suppose the unknown function was known to be an m-of-n boolean function
- Could use the training data to infer the function
- Learning and Hypothesis Languages
- Possible approach: guess a good, small hypothesis language
- Start with a very small language
- Enlarge it until it contains a hypothesis that fits the data
- Inductive bias
- Preference for certain languages
- Analogous to data compression (removal of redundancy)
- Later: coding the model versus coding the uncertainty (error)
- We Could Be Wrong!
- Prior knowledge could be wrong (e.g., y = x4 ∧ one-of(x1, x3) is also consistent)
- If the guessed language was wrong, errors will occur on new cases