Title: CIS732-Lecture-02-20070118
1Lecture 02 of 42
The Candidate Elimination (Version
Space) Algorithm and Inductive Bias
Thursday, 18 January 2007 William H.
Hsu Department of Computing and Information
Sciences, KSU http//www.kddresearch.org http//ww
w.cis.ksu.edu/bhsu Readings Sections 2.7-2.8,
Sections 7.1-7.3, Mitchell Sections 2.4.1-2.4.3,
Shavlik and Dietterich
2Lecture Outline
- Read 2.7-2.8, 7.1-7.3, Mitchell 2.4.1-2.4.3 SD
- Homework 1 Due Thursday, September 16, 1999
(before 5 PM CST) - Paper Commentary 1 Due This Thursday
- L. G. Valiant, A Theory of the Learnable
(Communications of the ACM, 1984) - See guidelines in course notes
- The Need for Inductive Bias
- Representations (hypothesis languages) a
worst-case scenario - Change of representation
- Computational Learning Theory
- Setting 1 learner poses queries to teacher
- Setting 2 teacher chooses examples
- Setting 3 randomly generated instances, labeled
by teacher - Probably Approximately Correct (PAC) Learning
- Motivation
- Introduction to PAC framework
3Specifying A Learning Problem
- Learning Improving with Experience at Some Task
- Improve over task T,
- with respect to performance measure P,
- based on experience E.
- Example Learning to Play Checkers
- T play games of checkers
- P percent of games won in world tournament
- E opportunity to play against self
- Refining the Problem Specification Issues
- What experience?
- What exactly should be learned?
- How shall it be represented?
- What specific algorithm to learn it?
- Defining the Problem Milieu
- Performance element How shall the results of
learning be applied? - How shall the performance element be evaluated?
The learning system?
4Example Learning to Play Checkers
5A Target Function forLearning to Play Checkers
6A Training Procedure for Learning to Play
Checkers
- Obtaining Training Examples
- the target function
- the learned function
- the training value
- One Rule For Estimating Training Values
-
- Choose Weight Tuning Rule
- Least Mean Square (LMS) weight update
rule REPEAT - Select a training example b at random
- Compute the error(b) for this training
example - For each board feature fi, update weight wi as
follows where c is a small, constant
factor to adjust the learning rate
7Design Choices forLearning to Play Checkers
Completed Design
8Some Issues in Machine Learning
- What Algorithms Can Approximate Functions
Well? When? - How Do Learning System Design Factors Influence
Accuracy? - Number of training examples
- Complexity of hypothesis representation
- How Do Learning Problem Characteristics Influence
Accuracy? - Noisy data
- Multiple data sources
- What Are The Theoretical Limits of Learnability?
- How Can Prior Knowledge of Learner Help?
- What Clues Can We Get From Biological Learning
Systems? - How Can Systems Alter Their Own Representation?
9Interesting Applications
10An Unbiased Learner
- Example of A Biased H
- Conjunctive concepts with dont cares
- What concepts can H not express? (Hint what
are its syntactic limitations?) - Idea
- Choose H that expresses every teachable concept
- i.e., H is the power set of X
- Recall A ? B B A (A X B
labels H A ? B) - Rainy, Sunny ? Warm, Cold ? Normal, High ?
None, Mild, Strong ? Cool, Warm ? Same,
Change ? 0, 1 - An Exhaustive Hypothesis Language
- Consider H disjunctions (?), conjunctions
(?), negations () over previous H - H 2(2 2 2 3 2 2) 296 H
1 (3 3 3 4 3 3) 973 - What Are S, G For The Hypothesis Language H?
- S ? disjunction of all positive examples
- G ? conjunction of all negated negative examples
11Inductive Bias
- Components of An Inductive Bias Definition
- Concept learning algorithm L
- Instances X, target concept c
- Training examples Dc ltx, c(x)gt
- L(xi, Dc) classification assigned to instance
xi by L after training on Dc - Definition
- The inductive bias of L is any minimal set of
assertions B such that, for any target concept c
and corresponding training examples Dc, ? xi
? X . (B ? Dc ? xi) ? L(xi, Dc) where A ? B
means A logically entails B - Informal idea preference for (i.e., restriction
to) certain hypotheses by structural (syntactic)
means - Rationale
- Prior assumptions regarding target concept
- Basis for inductive generalization
12Inductive Systemsand Equivalent Deductive Systems
13Three Learners with Different Biases
- Rote Learner
- Weakest bias anything seen before, i.e., no bias
- Store examples
- Classify x if and only if it matches previously
observed example - Version Space Candidate Elimination Algorithm
- Stronger bias concepts belonging to conjunctive
H - Store extremal generalizations and
specializations - Classify x if and only if it falls within S and
G boundaries (all members agree) - Find-S
- Even stronger bias most specific hypothesis
- Prior assumption any instance not observed to be
positive is negative - Classify x based on S set
14Hypothesis SpaceA Syntactic Restriction
- Recall 4-Variable Concept Learning Problem
- Bias Simple Conjunctive Rules
- Only 16 simple conjunctive rules of the form y
xi ? xj ? xk - y Ø, x1, , x4, x1 ? x2, , x3 ? x4, x1 ? x2 ?
x3, , x2 ? x3 ? x4, x1 ? x2 ? x3 ? x4 - Example above no simple rule explains the data
(counterexamples?) - Similarly for simple clauses (conjunction and
disjunction allowed)
15Hypothesis Spacem-of-n Rules
- m-of-n Rules
- 32 possible rules of the form y 1 iff
at least m of the following n variables are 1 - Found A Consistent Hypothesis!
16Views of Learning
- Removal of (Remaining) Uncertainty
- Suppose unknown function was known to be m-of-n
Boolean function - Could use training data to infer the function
- Learning and Hypothesis Languages
- Possible approach to guess a good, small
hypothesis language - Start with a very small language
- Enlarge until it contains a hypothesis that fits
the data - Inductive bias
- Preference for certain languages
- Analogous to data compression (removal of
redundancy) - Later coding the model versus coding the
uncertainty (error) - We Could Be Wrong!
- Prior knowledge could be wrong (e.g., y x4 ?
one-of (x1, x3) also consistent) - If guessed language was wrong, errors will occur
on new cases
17Two Strategies for Machine Learning
- Develop Ways to Express Prior Knowledge
- Role of prior knowledge guides search for
hypotheses / hypothesis languages - Expression languages for prior knowledge
- Rule grammars stochastic models etc.
- Restrictions on computational models other
(formal) specification methods - Develop Flexible Hypothesis Spaces
- Structured collections of hypotheses
- Agglomeration nested collections (hierarchies)
- Partitioning decision trees, lists, rules
- Neural networks cases, etc.
- Hypothesis spaces of adaptive size
- Either Case Develop Algorithms for Finding A
Hypothesis That Fits Well - Ideally, will generalize well
- Later Bias Optimization (Meta-Learning, Wrappers)
18Computational Learning Theory
- What General Laws Constrain Inductive Learning?
- What Learning Problems Can Be Solved?
- When Can We Trust The Output of A Learning
Algorithm? - We Seek Theory To Relate
- Probability of successful learning
- Number of training examples
- Complexity of hypothesis space
- Accuracy to which target concept is approximated
- Manner in which training examples are presented
19Prototypical Concept Learning Task
- Given
- Instances X possible days, each described by
attributes Sky, AirTemp, Humidity, Wind, Water,
Forecast - Target function c ? EnjoySport X ? H
- Hypotheses H conjunctions of literals, e.g.,
- lt?, Cold, High, ?, ?, ?gt
- Training examples D positive and negative
examples of the target function - ltx1, c(x1)gt, ltx2, c(x2)gt, , ltxm, c(xm)gt
- Determine
- A hypothesis h in H such that h(x) c(x) for all
x in D? - A hypothesis h in H such that h(x) c(x) for all
x in X?
20Sample Complexity
- How Many Training Examples Sufficient To Learn
Target Concept? - Scenario 1 Active Learning
- Learner proposes instances, as queries to teacher
- Query (learner) instance x
- Answer (teacher) c(x)
- Scenario 2 Passive Learning from
Teacher-Selected Examples - Teacher (who knows c) provides training examples
- Sequence of examples (teacher) ltxi, c(xi)gt
- Teacher may or may not be helpful, optimal
- Scenario 3 Passive Learning from
Teacher-Annotated Examples - Random process (e.g., nature) proposes instances
- Instance x generated randomly, teacher provides
c(x)
21Sample ComplexityScenario 1
22Sample ComplexityScenario 2
- Teacher Provides Training Examples
- Teacher agent who knows c
- Assume c is in learners hypothesis space H (as
in Scenario 1) - Optimal Teaching Strategy Depends upon H Used by
Learner - Consider case H conjunctions of up to n
boolean literals and their negations - e.g., (AirTemp Warm) ? (Wind Strong), where
AirTemp, Wind, each have 2 possible values - Complexity
- If n possible boolean attributes in H, n 1
examples suffice - Why?
23Sample ComplexityScenario 3
- Given
- Set of instances X
- Set of hypotheses H
- Set of possible target concepts C
- Training instances generated by a fixed, unknown
probability distribution D over X - Learner Observes Sequence D
- D training examples of form ltx, c(x)gt for target
concept c ? C - Instances x are drawn from distribution D
- Teacher provides target value c(x) for each
- Learner Must Output Hypothesis h Estimating c
- h evaluated on performance on subsequent
instances - Instances still drawn according to D
- Note Probabilistic Instances, Noise-Free
Classifications
24True Error of A Hypothesis
- Definition
- The true error (denoted errorD(h)) of hypothesis
h with respect to target concept c and
distribution D is the probability that h will
misclassify an instance drawn at random according
to D. -
- Two Notions of Error
- Training error of hypothesis h with respect to
target concept c how often h(x) ? c(x) over
training instances - True error of hypothesis h with respect to target
concept c how often h(x) ? c(x) over future
random instances - Our Concern
- Can we bound true error of h (given
training error of h)? - First consider when training error of h is
zero (i.e, h ? VSH,D )
Instance Space X
-
-
-
25Exhausting The Version Space
- Definition
- The version space VSH,D is said to be ?-exhausted
with respect to c and D, if every hypothesis h in
VSH,D has error less than ? with respect to c and
D. - ? h ? VSH,D . errorD(h) lt ?
26Number of Examples Required toExhaust The
Version Space
- How Many Examples Will ?Exhaust The Version
Space? - Theorem Haussler, 1988
- If the hypothesis space H is finite, and D is a
sequence of m ? 1 independent random examples of
some target concept c, then for any 0 ? ? ? 1,
the probability that the version space with
respect to H and D is not ?-exhausted (with
respect to c) is less than or equal to H
e - ? m - Important Result!
- Bounds the probability that any consistent
learner will output a hypothesis h with error(h)
? ? - Want this probability to be below a specified
threshold ? H e - ? m ? ? - To achieve, solve inequality for m let
m ? 1/? (ln H ln (1/?)) - Need to see at least this many examples
27Learning Conjunctions of Boolean Literals
- How Many Examples Are Sufficient?
- Specification - ensure that with probability at
least (1 - ?) Every h in VSH,D
satisfies errorD(h) lt ? - The probability of an ?-bad hypothesis
(errorD(h) ? ?) is no more than ? - Use our theorem m ? 1/? (ln H ln
(1/?)) - H conjunctions of constraints on up to n boolean
attributes (n boolean literals) - H 3n, m ? 1/? (ln 3n ln (1/?)) 1/? (n
ln 3 ln (1/?)) - How About EnjoySport?
- H as given in EnjoySport (conjunctive concepts
with dont cares) - H 973
- m ? 1/? (ln H ln (1/?))
- Example goal probability 1 - ? 95 of
hypotheses with errorD(h) lt 0.1 - m ? 1/0.1 (ln 973 ln (1/0.05)) ? 98.8
28PAC Learning
- Terms Considered
- Class C of possible concepts
- Set of instances X
- Length n (in attributes) of each instance
- Learner L
- Hypothesis space H
- Error parameter (error bound) ?
- Confidence parameter (excess error probability
bound) ? - size(c) the encoding length of c, assuming some
representation - Definition
- C is PAC-learnable by L using H if for all c ? C,
distributions D over X, ? such that 0 lt ? lt 1/2,
and ? such that 0 lt ? lt 1/2, learner L will, with
probability at least (1 - ?), output a hypothesis
h ? H such that errorD(h) ? ? - C is efficiently PAC-learnable if L runs in time
polynomial in 1/?, 1/?, n, size(c)
29Agnostic Learning
- Assumption of Knowable Concept
- So far, assumed c ? H
- Agnostic learning environment dont assume c ? H
- What Do We Want Then?
- The closest hypothesis we can get
- Hypothesis h that makes the fewest errors on
training data - How Hard Is This?
- Sample complexity m ? 1/2?2 (ln H
ln (1/?)) - Derived from Hoeffding bounds P
errorD(h) gt errorD(h) ? ? e-2m?2
30An Unbiased Learner
- Example of A Biased H
- Conjunctive concepts with dont cares
- What concepts can H not express? (Hint what
are its syntactic limitations?) - Idea
- Choose H that expresses every teachable concept
- i.e., H is the power set of X
- Recall A ? B B A (A X B
labels H A ? B) - Rainy, Sunny ? Warm, Cold ? Normal, High ?
None, Mild, Strong ? Cool, Warm ? Same,
Change ? 0, 1 - An Exhaustive Hypothesis Language
- Consider H disjunctions (?), conjunctions
(?), negations () over previous H - H 2(2 2 2 3 2 2) 296 H
1 (3 3 3 4 3 3) 973 - What Are S, G For The Hypothesis Language H?
- S ? disjunction of all positive examples
- G ? conjunction of all negated negative examples
31Inductive Bias
- Components of An Inductive Bias Definition
- Concept learning algorithm L
- Instances X, target concept c
- Training examples Dc ltx, c(x)gt
- L(xi, Dc) classification assigned to instance
xi by L after training on Dc - Definition
- The inductive bias of L is any minimal set of
assertions B such that, for any target concept c
and corresponding training examples Dc, ? xi
? X . (B ? Dc ? xi) ? L(xi, Dc) where A ? B
means A logically entails B - Informal idea preference for (i.e., restriction
to) certain hypotheses by structural (syntactic)
means - Rationale
- Prior assumptions regarding target concept
- Basis for inductive generalization
32Inductive Systemsand Equivalent Deductive Systems
33Three Learners with Different Biases
- Rote Learner
- Weakest bias anything seen before, i.e., no bias
- Store examples
- Classify x if and only if it matches previously
observed example - Version Space Candidate Elimination Algorithm
- Stronger bias concepts belonging to conjunctive
H - Store extremal generalizations and
specializations - Classify x if and only if it falls within S and
G boundaries (all members agree) - Find-S
- Even stronger bias most specific hypothesis
- Prior assumption any instance not observed to be
positive is negative - Classify x based on S set
34Sample Complexity
- How Many Training Examples Sufficient To Learn
Target Concept? - Scenario 1 Active Learning
- Learner proposes instances, as queries to teacher
- Query (learner) instance x
- Answer (teacher) c(x)
- Scenario 2 Passive Learning from
Teacher-Selected Examples - Teacher (who knows c) provides training examples
- Sequence of examples (teacher) ltxi, c(xi)gt
- Teacher may or may not be helpful, optimal
- Scenario 3 Passive Learning from
Teacher-Annotated Examples - Random process (e.g., nature) proposes instances
- Instance x generated randomly, teacher provides
c(x)
35Terminology
- Inductive Bias
- Strength of inductive bias how few hypotheses?
- Specific biases based on specific languages
- Hypothesis Language
- Searchable subset of the space of possible
descriptors - m-of-n, conjunctive, disjunctive, clauses
- Ability to represent a concept
- PAC Learning
- Probably Approximately Correct
- Computational Learning Theory (COLT)
- True error versus training error
- Notation distribution D, errorD(h), ?-bad with
probability ? - ?-exhaustion every hypothesis in VSH,D has
errorD(h) lt ? - PAC-learnability for c ? C, X, n, L, H, ?, ?
36Summary Points
- Inductive Leaps Possible Only if Learner Is
Biased - Futility of learning without bias
- Strength of inductive bias proportional to
restrictions on hypotheses - Modeling Inductive Learners with Equivalent
Deductive Systems - Representing inductive learning as theorem
proving - Equivalent learning and inference problems
- Syntactic Restrictions
- Example m-of-n concept
- Views of Learning and Strategies
- Removing uncertainty (data compression)
- Role of knowledge
- Introduction to Computational Learning Theory
(COLT) - Things COLT attempts to measure
- Probably-Approximately-Correct (PAC) learning
framework - Next Lecture Occams Razor, VC Dimension, and
Error Bounds