Title: CSCE 580 Artificial Intelligence Ch.18: Learning from Observations
1 CSCE 580 Artificial Intelligence, Ch. 18: Learning from Observations
- Fall 2008
- Marco Valtorta
- mgv_at_cse.sc.edu
2 Acknowledgment
- The slides are based on the textbook AIMA and other sources, including other fine textbooks and the accompanying slide sets
- The other textbooks I considered are:
- David Poole, Alan Mackworth, and Randy Goebel. Computational Intelligence: A Logical Approach. Oxford, 1998. A second edition (by Poole and Mackworth) is under development; Dr. Poole allowed us to use a draft of it in this course
- Ivan Bratko. Prolog Programming for Artificial Intelligence, Third Edition. Addison-Wesley, 2001. The fourth edition is under development
- George F. Luger. Artificial Intelligence: Structures and Strategies for Complex Problem Solving, Sixth Edition. Addison-Wesley, 2009
3 Outline
- Learning agents
- Inductive learning
- Decision tree learning
4 Learning
- Learning is essential for unknown environments, i.e., when the designer lacks omniscience
- Learning is useful as a system construction method, i.e., expose the agent to reality rather than trying to write it down
- Learning modifies the agent's decision mechanisms to improve performance
5 Learning agents
6 Learning element
- Design of a learning element is affected by:
- Which components of the performance element are to be learned
- What feedback is available to learn these components
- What representation is used for the components
- Type of feedback:
- Supervised learning: correct answers given for each example
- Unsupervised learning: correct answers not given
- Reinforcement learning: occasional rewards
7 Inductive learning
- Simplest form: learn a function from examples
- f is the target function
- An example is a pair (x, f(x))
- Problem: find a hypothesis h such that h ≈ f, given a training set of examples
- This is a highly simplified model of real learning:
- Ignores prior knowledge
- Assumes examples are given
8 Inductive learning method
- Construct/adjust h to agree with f on the training set
- (h is consistent if it agrees with f on all examples)
- E.g., curve fitting
9-12 Inductive learning method
- Slides 9-12 repeat the same bullets as slide 8; in the original deck each accompanied a different curve-fitting figure (figures not reproduced here)
13 Inductive learning method
- Construct/adjust h to agree with f on the training set
- (h is consistent if it agrees with f on all examples)
- E.g., curve fitting
- Ockham's razor: prefer the simplest hypothesis consistent with the data
14 Curve Fitting and Occam's Razor
- Data collected by Galileo in 1608: a ball rolling down an inclined plane, then continuing in free fall
- Occam's razor suggests that the simpler model is better: it has a higher prior probability
- The simpler model may also have a greater posterior probability (the plausibility of the model)
- Occam's razor is not only a good heuristic; it can be shown to follow from more fundamental principles
- Jefferys, W.H. and Berger, J.O. 1992. Ockham's razor and Bayesian analysis. American Scientist 80: 64-72
15 Learning decision trees
- Problem: decide whether to wait for a table at a restaurant, based on the following attributes:
- Alternate: is there an alternative restaurant nearby?
- Bar: is there a comfortable bar area to wait in?
- Fri/Sat: is today Friday or Saturday?
- Hungry: are we hungry?
- Patrons: number of people in the restaurant (None, Some, Full)
- Price: price range ($, $$, $$$)
- Raining: is it raining outside?
- Reservation: have we made a reservation?
- Type: kind of restaurant (French, Italian, Thai, Burger)
- WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60 minutes)
16 Attribute-based representations
- Examples are described by attribute values (Boolean, discrete, continuous)
- E.g., situations where I will/won't wait for a table
- Classification of examples is positive (T) or negative (F)
17 Decision trees
- One possible representation for hypotheses
- E.g., here is the true tree for deciding whether to wait
18 Expressiveness
- Decision trees can express any function of the input attributes
- E.g., for Boolean functions, truth table row → path to leaf
- Trivially, there is a consistent decision tree for any training set, with one path to a leaf for each example (unless f is nondeterministic in x), but it probably won't generalize to new examples
- Prefer to find more compact decision trees
19 Hypothesis spaces
- How many distinct decision trees with n Boolean attributes?
- = number of Boolean functions
- = number of distinct truth tables with 2^n rows = 2^(2^n) (for each of the 2^n rows of the table, the function may return 0 or 1)
- E.g., with 6 Boolean attributes, there are 2^64 = 18,446,744,073,709,551,616 (more than 18 quintillion) trees
20 Hypothesis spaces
- How many distinct decision trees with n Boolean attributes?
- = number of Boolean functions = number of distinct truth tables with 2^n rows = 2^(2^n)
- E.g., with 6 Boolean attributes, there are 18,446,744,073,709,551,616 trees
- How many purely conjunctive hypotheses (e.g., Hungry ∧ ¬Rain)?
- Each attribute can be in (positive), in (negative), or out
- → 3^n distinct conjunctive hypotheses
- A more expressive hypothesis space:
- increases the chance that the target function can be expressed
- increases the number of hypotheses consistent with the training set
- → may get worse predictions
21 Decision tree learning
- Aim: find a small tree consistent with the training examples
- Idea: (recursively) choose the "most significant" attribute as the root of the (sub)tree, as sketched below
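To make this recursive scheme concrete, here is a minimal Python sketch of a decision-tree learner. It is an illustration under assumed conventions, not the exact DTL pseudocode from AIMA: examples are dicts mapping attribute names to values, target is the key holding the classification, and the attribute-selection function is passed in (an information-gain version is sketched after slide 24).

from collections import Counter

def plurality_value(examples, target):
    # Most common classification among the examples (ties broken arbitrarily).
    return Counter(e[target] for e in examples).most_common(1)[0][0]

def learn_tree(examples, attributes, target, choose_attribute, parent_examples=()):
    # Returns either a class label (leaf) or a dict {attribute: {value: subtree}}.
    if not examples:
        return plurality_value(parent_examples, target)
    classes = {e[target] for e in examples}
    if len(classes) == 1:            # all remaining examples agree on the class
        return classes.pop()
    if not attributes:               # no attributes left to split on
        return plurality_value(examples, target)
    best = choose_attribute(attributes, examples, target)
    tree = {best: {}}
    for value in {e[best] for e in examples}:
        subset = [e for e in examples if e[best] == value]
        rest = [a for a in attributes if a != best]
        tree[best][value] = learn_tree(subset, rest, target, choose_attribute, examples)
    return tree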
22 Choosing an attribute
- Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative"
- Patrons? is a better choice
23 Using information theory
- Used to implement Choose-Attribute in the DTL algorithm
- Information content (entropy): I(P(v1), ..., P(vn)) = Σ_{i=1..n} -P(vi) log2 P(vi)
- For a training set containing p positive examples and n negative examples, the entropy is I(p/(p+n), n/(p+n)) (see the sketch below)
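To make the entropy formula concrete, here is a small Python sketch (the helper names are my own, not from the slides):

import math

def entropy(probabilities):
    # I(P(v1), ..., P(vn)) = sum over i of -P(vi) * log2 P(vi); terms with P(vi) = 0 contribute 0.
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

def boolean_entropy(p, n):
    # Entropy of a training set with p positive and n negative examples.
    return entropy([p / (p + n), n / (p + n)])

# E.g., boolean_entropy(6, 6) == 1.0 bit, as used on the next slides.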
24 Information gain
- A chosen attribute A with v distinct values divides the training set E into subsets E1, ..., Ev according to their values for A
- Information gain (IG) = the reduction in entropy from the attribute test: IG(A) = I(p/(p+n), n/(p+n)) - Σ_{i=1..v} (pi+ni)/(p+n) · I(pi/(pi+ni), ni/(pi+ni)), where pi and ni count the positive and negative examples in Ei
- Choose the attribute with the largest IG (sketched below)
25 Information gain
- For the training set, p = n = 6, so I(6/12, 6/12) = 1 bit
- Consider the attributes Patrons and Type (and the others, too)
- Patrons has the highest IG of all the attributes and so is chosen by the DTL algorithm as the root
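For the 12 restaurant examples this works out as follows (the split counts are those of the standard AIMA example: Patrons gives None = 2 negative, Some = 4 positive, Full = 2 positive and 4 negative, while each Type subset is half positive and half negative):

# Gain(Patrons) = 1 - [2/12*I(0,1) + 4/12*I(1,0) + 6/12*I(2/6, 4/6)] ≈ 0.541 bits
gain_patrons = 1 - (2/12 * 0 + 4/12 * 0 + 6/12 * boolean_entropy(2, 4))
# Gain(Type) = 1 - [2/12*I(1/2,1/2) + 2/12*I(1/2,1/2) + 4/12*I(2/4,2/4) + 4/12*I(2/4,2/4)] = 0 bits
gain_type = 1 - (2/12 * 1 + 2/12 * 1 + 4/12 * 1 + 4/12 * 1)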
26 Example contd.
- Decision tree learned from the 12 examples
- Substantially simpler than the true tree: a more complex hypothesis isn't justified by the small amount of data
27 Performance measurement
- How do we know that h ≈ f?
- Use theorems of computational/statistical learning theory
- Try h on a new test set of examples (using the same distribution over the example space as for the training set)
- Learning curve: % correct on the test set as a function of training-set size
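A minimal sketch of that measurement, assuming a learn(...) function such as the tree learner above, a predict(hypothesis, example) function, and a list of labeled examples (all names here are illustrative, not from the slides):

import random

def accuracy(hypothesis, test_set, target, predict):
    # Fraction of test examples the hypothesis classifies correctly.
    return sum(predict(hypothesis, e) == e[target] for e in test_set) / len(test_set)

def learning_curve(examples, target, learn, predict, trials=20):
    # % correct on a held-out test set as a function of training-set size.
    curve = []
    for m in range(1, len(examples)):
        total = 0.0
        for _ in range(trials):
            shuffled = random.sample(examples, len(examples))   # shuffled copy
            train, test = shuffled[:m], shuffled[m:]
            total += accuracy(learn(train, target), test, target, predict)
        curve.append((m, total / trials))
    return curve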
28 Summary (so far)
- Learning is needed for unknown environments (and for lazy designers)
- Learning agent = performance element + learning element
- For supervised learning, the aim is to find a simple hypothesis approximately consistent with the training examples
- Decision tree learning uses information gain
- Learning performance = prediction accuracy measured on a test set
29 Outline for Ensemble Learning and Boosting
- Ensemble learning
- Bagging
- Boosting
- Reading: AIMA-2, Sec. 18.4
- This set of slides is based on http://www.cs.uwaterloo.ca/ppoupart/teaching/cs486-spring05/slides/Lecture21notes.pdf
- In turn, those slides follow AIMA-2
30 Ensemble Learning
- Sometimes each learning technique yields a different hypothesis
- But no perfect hypothesis
- Could we combine several imperfect hypotheses into a better hypothesis?
31 Ensemble Learning
- Analogies:
- Elections combine voters' choices to pick a good candidate
- Committees combine experts' opinions to make better decisions
- Intuitions:
- Individuals often make mistakes, but the majority is less likely to make mistakes
- Individuals often have partial knowledge, but a committee can pool expertise to make better decisions
32 Ensemble Learning
- Definition: a method to select and combine an ensemble of hypotheses into a (hopefully) better hypothesis
- Can enlarge the hypothesis space:
- Perceptron (a simple kind of neural network): linear separator
- Ensemble of perceptrons: polytope
33 Bagging
34 Bagging
- Assumptions:
- Each hypothesis hi makes an error with probability p
- The hypotheses are independent
- Majority voting of n hypotheses:
- k hypotheses make an error with probability C(n,k) p^k (1-p)^(n-k)
- The majority makes an error with probability Σ_{k > n/2} C(n,k) p^k (1-p)^(n-k)
- With n = 5 and p = 0.1, error(majority) < 0.01 (checked in the sketch below)
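That last number can be verified directly: with n independent hypotheses that each err with probability p, the majority errs when more than n/2 of them err at once. A small Python sketch of the binomial-tail computation:

from math import comb

def majority_error(n, p):
    # Probability that more than half of n independent hypotheses,
    # each wrong with probability p, are wrong simultaneously.
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n // 2 + 1, n + 1))

# majority_error(5, 0.1) ≈ 0.00856 < 0.01, as stated above.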
35 Weighted Majority
- In practice:
- Hypotheses are rarely independent
- Some hypotheses make fewer errors than others
- Let's take a weighted majority
- Intuition:
- Decrease the weight of correlated hypotheses
- Increase the weight of good hypotheses
36 Boosting
- Most popular ensemble technique
- Computes a weighted majority
- Can boost a weak learner
- Operates on a weighted training set
37 Weighted Training Set
- Learning with a weighted training set:
- Supervised learning -> minimize training error
- Bias the algorithm to learn correctly the instances with high weights
- Idea: when an instance is misclassified by a hypothesis, increase its weight so that the next hypothesis is more likely to classify it correctly
38 Boosting Framework
Read the figure from left to right: the algorithm builds one hypothesis per column on a weighted set of four examples.
39 AdaBoost (Adaptive Boosting)
There are N examples and M columns (hypotheses), each of which has a weight zm.
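Since the algorithm figure itself is not reproduced here, the following is a hedged Python sketch in the spirit of AIMA's AdaBoost pseudocode. The weak_learner interface (a function of examples, labels in {-1, +1}, and weights that returns a classifier) and the variable names are my own assumptions.

import math

def adaboost(examples, labels, weak_learner, M):
    # Learn up to M weighted hypotheses; return their weighted-majority classifier.
    N = len(examples)
    w = [1.0 / N] * N                      # example weights, initially uniform
    hypotheses, z = [], []                 # hypotheses h_m and their weights z_m
    for _ in range(M):
        h = weak_learner(examples, labels, w)
        error = sum(wi for wi, x, y in zip(w, examples, labels) if h(x) != y)
        if error >= 0.5:                   # weak learner failed to beat chance: stop
            break
        error = max(error, 1e-12)          # guard against division by zero
        for i, (x, y) in enumerate(zip(examples, labels)):
            if h(x) == y:                  # shrink weights of correctly classified examples
                w[i] *= error / (1 - error)
        total = sum(w)
        w = [wi / total for wi in w]       # renormalize so the weights sum to 1
        hypotheses.append(h)
        z.append(math.log((1 - error) / error))
    def classify(x):
        vote = sum(zm * h(x) for zm, h in zip(z, hypotheses))
        return 1 if vote >= 0 else -1
    return classify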
40 What can we boost?
- A weak learner produces hypotheses at least as good as a random classifier
- Examples:
- Rules of thumb
- Decision stumps (decision trees of one node; see the sketch below)
- Perceptrons
- Naïve Bayes models
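For instance, a decision stump over Boolean (0/1) feature vectors could serve as the weak_learner in the AdaBoost sketch above. The representation is again an assumption made for illustration.

def stump_learner(examples, labels, weights):
    # Choose the single feature and polarity whose test "predict +polarity if
    # the feature is 1, else -polarity" has the lowest weighted error.
    best, best_error = None, float("inf")
    for f in range(len(examples[0])):
        for polarity in (+1, -1):
            error = sum(w for w, x, y in zip(weights, examples, labels)
                        if (polarity if x[f] else -polarity) != y)
            if error < best_error:
                best_error, best = error, (f, polarity)
    f, polarity = best
    return lambda x, f=f, polarity=polarity: polarity if x[f] else -polarity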
41 Boosting Paradigm
- Advantages:
- No need to learn a perfect hypothesis
- Can boost any weak learning algorithm
- Boosting is very simple to program
- Good generalization
- Paradigm shift:
- Don't try to learn a perfect hypothesis
- Just learn simple rules of thumb and boost them
42 Boosting Paradigm
- When we already have a bunch of hypotheses, boosting provides a principled approach to combining them
- Useful for:
- Sensor fusion
- Combining experts
43 Boosting Applications
- Any supervised learning task
- Spam filtering
- Speech recognition/natural language processing
- Data mining
- Etc.
44 Computational Learning Theory
The slides on COLT are from ftp://ftp.cs.bham.ac.uk/pub/authors/M.Kerber/Teaching/SEM2A4/l4.ps.gz and http://www.cs.bham.ac.uk/mmk/teaching/SEM2A4/, which also has slides on version spaces.
45 How many examples are needed?
The key quantity is the probability that Hbad (the set of hypotheses with error greater than ε) contains a hypothesis consistent with all the training examples.
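Since the derivation on the original slide is not reproduced, here is the standard PAC-learning argument (AIMA Sec. 18.5) that the slide appears to refer to, with epsilon the error tolerance and delta the allowed failure probability:

\begin{align*}
P(h \in H_{bad} \text{ agrees with one example}) &\le 1 - \epsilon \\
P(h \text{ agrees with all } m \text{ examples}) &\le (1 - \epsilon)^m \\
P(H_{bad} \text{ contains a consistent hypothesis}) &\le |H_{bad}|(1 - \epsilon)^m \le |H|(1 - \epsilon)^m
\end{align*}

Requiring $|H|(1-\epsilon)^m \le \delta$ and using $1 - \epsilon \le e^{-\epsilon}$ gives
\[
m \ge \frac{1}{\epsilon}\left(\ln\frac{1}{\delta} + \ln|H|\right).
\]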
46 How many examples?
47 Complexity and hypothesis language
48 Learning Decision Lists
- A decision list consists of a series of tests, each of which is a conjunction of literals. If a test succeeds, the decision list specifies the value to be returned; otherwise, processing continues with the next test in the list
- Unrestricted decision lists can represent any Boolean function and hence are not learnable in polynomial time
- A k-DL is a decision list in which each test is restricted to at most k literals
- k-DL is learnable! (Rivest, 1987)
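A decision list can be read as a chained if-then-else. Below is a minimal Python sketch of evaluating one, where each test is represented as a predicate over an example; this (test, value) representation is my own choice for illustration, not Rivest's notation.

def evaluate_decision_list(decision_list, example, default=False):
    # decision_list: sequence of (test, value) pairs; each test is a conjunction
    # of literals encoded as a predicate over the example.
    # The value associated with the first succeeding test is returned.
    for test, value in decision_list:
        if test(example):
            return value
    return default

# A 2-DL example for the restaurant domain: "Patrons = Some AND Hungry -> True"
will_wait_dl = [
    (lambda e: e["Patrons"] == "Some" and e["Hungry"], True),
    (lambda e: True, False),   # a final always-true test plays the role of a default
]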