Title: CSCE 580 Artificial Intelligence Ch.18: Learning from Observations
1 CSCE 580 Artificial Intelligence, Ch. 18: Learning from Observations
- Fall 2008
- Marco Valtorta
- mgv_at_cse.sc.edu
2 Acknowledgment
- The slides are based on the textbook AIMA and other sources, including other fine textbooks and the accompanying slide sets
- The other textbooks I considered are:
- David Poole, Alan Mackworth, and Randy Goebel. Computational Intelligence: A Logical Approach. Oxford, 1998. A second edition (by Poole and Mackworth) is under development; Dr. Poole allowed us to use a draft of it in this course
- Ivan Bratko. Prolog Programming for Artificial Intelligence, Third Edition. Addison-Wesley, 2001. The fourth edition is under development
- George F. Luger. Artificial Intelligence: Structures and Strategies for Complex Problem Solving, Sixth Edition. Addison-Wesley, 2009
3 Outline
- Learning agents
- Inductive learning
- Decision tree learning
4 Learning
- Learning is essential for unknown environments, i.e., when the designer lacks omniscience
- Learning is useful as a system construction method, i.e., expose the agent to reality rather than trying to write it down
- Learning modifies the agent's decision mechanisms to improve performance
5 Learning agents
6 Learning element
- Design of a learning element is affected by:
- Which components of the performance element are to be learned
- What feedback is available to learn these components
- What representation is used for the components
- Type of feedback:
- Supervised learning: correct answers given for each example
- Unsupervised learning: correct answers not given
- Reinforcement learning: occasional rewards
7 Inductive learning
- Simplest form: learn a function from examples
- f is the target function
- An example is a pair (x, f(x))
- Problem: find a hypothesis h such that h ≈ f, given a training set of examples
- This is a highly simplified model of real learning:
- Ignores prior knowledge
- Assumes examples are given
8 Inductive learning method
- Construct/adjust h to agree with f on the training set
- (h is consistent if it agrees with f on all examples)
- E.g., curve fitting
9-12 Inductive learning method
- Slides 9-12 repeat the same bullets as slide 8; in the original deck each accompanied a different curve-fitting figure (figures not reproduced here)
13 Inductive learning method
- Construct/adjust h to agree with f on the training set
- (h is consistent if it agrees with f on all examples)
- E.g., curve fitting
- Ockham's razor: prefer the simplest hypothesis consistent with the data
14 Curve Fitting and Occam's Razor
- Data collected by Galileo in 1608: a ball rolling down an inclined plane, then continuing in free fall
- Occam's razor suggests that the simpler model is better: it has a higher prior probability
- The simpler model may also have a greater posterior probability (the plausibility of the model)
- Occam's razor is not only a good heuristic; it can be shown to follow from more fundamental principles
- Jefferys, W.H. and Berger, J.O. 1992. Ockham's razor and Bayesian analysis. American Scientist 80: 64-72
15 Learning decision trees
- Problem: decide whether to wait for a table at a restaurant, based on the following attributes:
- Alternate: is there an alternative restaurant nearby?
- Bar: is there a comfortable bar area to wait in?
- Fri/Sat: is today Friday or Saturday?
- Hungry: are we hungry?
- Patrons: number of people in the restaurant (None, Some, Full)
- Price: price range ($, $$, $$$)
- Raining: is it raining outside?
- Reservation: have we made a reservation?
- Type: kind of restaurant (French, Italian, Thai, Burger)
- WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60 minutes)
16 Attribute-based representations
- Examples are described by attribute values (Boolean, discrete, continuous)
- E.g., situations where I will/won't wait for a table
- Classification of examples is positive (T) or negative (F)
17 Decision trees
- One possible representation for hypotheses
- E.g., here is the true tree for deciding whether to wait
18 Expressiveness
- Decision trees can express any function of the input attributes
- E.g., for Boolean functions, truth table row → path to leaf
- Trivially, there is a consistent decision tree for any training set, with one path to a leaf for each example (unless f is nondeterministic in x), but it probably won't generalize to new examples
- Prefer to find more compact decision trees
19 Hypothesis spaces
- How many distinct decision trees with n Boolean attributes?
- = number of Boolean functions
- = number of distinct truth tables with 2^n rows = 2^(2^n) (for each of the 2^n rows of the table, the function may return 0 or 1)
- E.g., with 6 Boolean attributes, there are 2^64 = 18,446,744,073,709,551,616 (more than 18 quintillion) trees
20 Hypothesis spaces
- How many distinct decision trees with n Boolean attributes?
- = number of Boolean functions = number of distinct truth tables with 2^n rows = 2^(2^n)
- E.g., with 6 Boolean attributes, there are 18,446,744,073,709,551,616 trees
- How many purely conjunctive hypotheses (e.g., Hungry ∧ ¬Rain)?
- Each attribute can be in (positive), in (negative), or out
- → 3^n distinct conjunctive hypotheses
- A more expressive hypothesis space:
- increases the chance that the target function can be expressed
- increases the number of hypotheses consistent with the training set
- → may get worse predictions
21 Decision tree learning
- Aim: find a small tree consistent with the training examples
- Idea: (recursively) choose the "most significant" attribute as the root of the (sub)tree, as sketched below
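To make this recursive scheme concrete, here is a minimal Python sketch of a decision-tree learner. It is an illustration under assumed conventions, not the exact DTL pseudocode from AIMA: examples are dicts mapping attribute names to values, target is the key holding the classification, and the attribute-selection function is passed in (an information-gain version is sketched after slide 24).

from collections import Counter

def plurality_value(examples, target):
    # Most common classification among the examples (ties broken arbitrarily).
    return Counter(e[target] for e in examples).most_common(1)[0][0]

def learn_tree(examples, attributes, target, choose_attribute, parent_examples=()):
    # Returns either a class label (leaf) or a dict {attribute: {value: subtree}}.
    if not examples:
        return plurality_value(parent_examples, target)
    classes = {e[target] for e in examples}
    if len(classes) == 1:            # all remaining examples agree on the class
        return classes.pop()
    if not attributes:               # no attributes left to split on
        return plurality_value(examples, target)
    best = choose_attribute(attributes, examples, target)
    tree = {best: {}}
    for value in {e[best] for e in examples}:
        subset = [e for e in examples if e[best] == value]
        rest = [a for a in attributes if a != best]
        tree[best][value] = learn_tree(subset, rest, target, choose_attribute, examples)
    return tree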
22 Choosing an attribute
- Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative"
- Patrons? is a better choice
23 Using information theory
- Used to implement Choose-Attribute in the DTL algorithm
- Information content (entropy): I(P(v1), ..., P(vn)) = Σ_{i=1..n} -P(vi) log2 P(vi)
- For a training set containing p positive examples and n negative examples, the entropy is I(p/(p+n), n/(p+n)) (see the sketch below)
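To make the entropy formula concrete, here is a small Python sketch (the helper names are my own, not from the slides):

import math

def entropy(probabilities):
    # I(P(v1), ..., P(vn)) = sum over i of -P(vi) * log2 P(vi); terms with P(vi) = 0 contribute 0.
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

def boolean_entropy(p, n):
    # Entropy of a training set with p positive and n negative examples.
    return entropy([p / (p + n), n / (p + n)])

# E.g., boolean_entropy(6, 6) == 1.0 bit, as used on the next slides.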
24 Information gain
- A chosen attribute A with v distinct values divides the training set E into subsets E1, ..., Ev according to their values for A
- Information gain (IG) = the reduction in entropy from the attribute test: IG(A) = I(p/(p+n), n/(p+n)) - Σ_{i=1..v} (pi+ni)/(p+n) · I(pi/(pi+ni), ni/(pi+ni)), where pi and ni count the positive and negative examples in Ei
- Choose the attribute with the largest IG (sketched below)
25 Information gain
- For the training set, p = n = 6, so I(6/12, 6/12) = 1 bit
- Consider the attributes Patrons and Type (and the others, too)
- Patrons has the highest IG of all the attributes and so is chosen by the DTL algorithm as the root
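For the 12 restaurant examples this works out as follows (the split counts are those of the standard AIMA example: Patrons gives None = 2 negative, Some = 4 positive, Full = 2 positive and 4 negative, while each Type subset is half positive and half negative):

# Gain(Patrons) = 1 - [2/12*I(0,1) + 4/12*I(1,0) + 6/12*I(2/6, 4/6)] ≈ 0.541 bits
gain_patrons = 1 - (2/12 * 0 + 4/12 * 0 + 6/12 * boolean_entropy(2, 4))
# Gain(Type) = 1 - [2/12*I(1/2,1/2) + 2/12*I(1/2,1/2) + 4/12*I(2/4,2/4) + 4/12*I(2/4,2/4)] = 0 bits
gain_type = 1 - (2/12 * 1 + 2/12 * 1 + 4/12 * 1 + 4/12 * 1)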
26 Example contd.
- Decision tree learned from the 12 examples
- Substantially simpler than the true tree: a more complex hypothesis isn't justified by the small amount of data
27 Performance measurement
- How do we know that h ≈ f?
- Use theorems of computational/statistical learning theory
- Try h on a new test set of examples (using the same distribution over the example space as for the training set)
- Learning curve: % correct on the test set as a function of training-set size
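A minimal sketch of that measurement, assuming a learn(...) function such as the tree learner above, a predict(hypothesis, example) function, and a list of labeled examples (all names here are illustrative, not from the slides):

import random

def accuracy(hypothesis, test_set, target, predict):
    # Fraction of test examples the hypothesis classifies correctly.
    return sum(predict(hypothesis, e) == e[target] for e in test_set) / len(test_set)

def learning_curve(examples, target, learn, predict, trials=20):
    # % correct on a held-out test set as a function of training-set size.
    curve = []
    for m in range(1, len(examples)):
        total = 0.0
        for _ in range(trials):
            shuffled = random.sample(examples, len(examples))   # shuffled copy
            train, test = shuffled[:m], shuffled[m:]
            total += accuracy(learn(train, target), test, target, predict)
        curve.append((m, total / trials))
    return curve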
28 Summary (so far)
- Learning is needed for unknown environments (and for lazy designers)
- Learning agent = performance element + learning element
- For supervised learning, the aim is to find a simple hypothesis approximately consistent with the training examples
- Decision tree learning uses information gain
- Learning performance = prediction accuracy measured on a test set
29 Outline for Ensemble Learning and Boosting
- Ensemble learning
- Bagging
- Boosting
- Reading: AIMA-2, Sec. 18.4
- This set of slides is based on http://www.cs.uwaterloo.ca/ppoupart/teaching/cs486-spring05/slides/Lecture21notes.pdf
- In turn, those slides follow AIMA-2
30 Ensemble Learning
- Sometimes each learning technique yields a different hypothesis
- But no perfect hypothesis
- Could we combine several imperfect hypotheses into a better hypothesis?
31 Ensemble Learning
- Analogies:
- Elections combine voters' choices to pick a good candidate
- Committees combine experts' opinions to make better decisions
- Intuitions:
- Individuals often make mistakes, but the majority is less likely to make mistakes
- Individuals often have partial knowledge, but a committee can pool expertise to make better decisions
32 Ensemble Learning
- Definition: a method to select and combine an ensemble of hypotheses into a (hopefully) better hypothesis
- Can enlarge the hypothesis space:
- Perceptron (a simple kind of neural network): linear separator
- Ensemble of perceptrons: polytope
33 Bagging
34 Bagging
- Assumptions:
- Each hypothesis hi makes an error with probability p
- The hypotheses are independent
- Majority voting of n hypotheses:
- k hypotheses make an error with probability C(n,k) p^k (1-p)^(n-k)
- The majority makes an error with probability Σ_{k > n/2} C(n,k) p^k (1-p)^(n-k)
- With n = 5 and p = 0.1, error(majority) < 0.01 (checked in the sketch below)
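That last number can be verified directly: with n independent hypotheses that each err with probability p, the majority errs when more than n/2 of them err at once. A small Python sketch of the binomial-tail computation:

from math import comb

def majority_error(n, p):
    # Probability that more than half of n independent hypotheses,
    # each wrong with probability p, are wrong simultaneously.
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n // 2 + 1, n + 1))

# majority_error(5, 0.1) ≈ 0.00856 < 0.01, as stated above.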
35 Weighted Majority
- In practice:
- Hypotheses are rarely independent
- Some hypotheses make fewer errors than others
- Let's take a weighted majority
- Intuition:
- Decrease the weight of correlated hypotheses
- Increase the weight of good hypotheses
36 Boosting
- Most popular ensemble technique
- Computes a weighted majority
- Can boost a weak learner
- Operates on a weighted training set
37 Weighted Training Set
- Learning with a weighted training set:
- Supervised learning -> minimize training error
- Bias the algorithm to learn correctly the instances with high weights
- Idea: when an instance is misclassified by a hypothesis, increase its weight so that the next hypothesis is more likely to classify it correctly
38 Boosting Framework
Read the figure from left to right: the algorithm builds one hypothesis per column on a weighted set of four examples.
39 AdaBoost (Adaptive Boosting)
There are N examples and M columns (hypotheses), each of which has a weight zm.
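Since the algorithm figure itself is not reproduced here, the following is a hedged Python sketch in the spirit of AIMA's AdaBoost pseudocode. The weak_learner interface (a function of examples, labels in {-1, +1}, and weights that returns a classifier) and the variable names are my own assumptions.

import math

def adaboost(examples, labels, weak_learner, M):
    # Learn up to M weighted hypotheses; return their weighted-majority classifier.
    N = len(examples)
    w = [1.0 / N] * N                      # example weights, initially uniform
    hypotheses, z = [], []                 # hypotheses h_m and their weights z_m
    for _ in range(M):
        h = weak_learner(examples, labels, w)
        error = sum(wi for wi, x, y in zip(w, examples, labels) if h(x) != y)
        if error >= 0.5:                   # weak learner failed to beat chance: stop
            break
        error = max(error, 1e-12)          # guard against division by zero
        for i, (x, y) in enumerate(zip(examples, labels)):
            if h(x) == y:                  # shrink weights of correctly classified examples
                w[i] *= error / (1 - error)
        total = sum(w)
        w = [wi / total for wi in w]       # renormalize so the weights sum to 1
        hypotheses.append(h)
        z.append(math.log((1 - error) / error))
    def classify(x):
        vote = sum(zm * h(x) for zm, h in zip(z, hypotheses))
        return 1 if vote >= 0 else -1
    return classify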
40 What can we boost?
- A weak learner produces hypotheses at least as good as a random classifier
- Examples:
- Rules of thumb
- Decision stumps (decision trees of one node; see the sketch below)
- Perceptrons
- Naïve Bayes models
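For instance, a decision stump over Boolean (0/1) feature vectors could serve as the weak_learner in the AdaBoost sketch above. The representation is again an assumption made for illustration.

def stump_learner(examples, labels, weights):
    # Choose the single feature and polarity whose test "predict +polarity if
    # the feature is 1, else -polarity" has the lowest weighted error.
    best, best_error = None, float("inf")
    for f in range(len(examples[0])):
        for polarity in (+1, -1):
            error = sum(w for w, x, y in zip(weights, examples, labels)
                        if (polarity if x[f] else -polarity) != y)
            if error < best_error:
                best_error, best = error, (f, polarity)
    f, polarity = best
    return lambda x, f=f, polarity=polarity: polarity if x[f] else -polarity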
41 Boosting Paradigm
- Advantages:
- No need to learn a perfect hypothesis
- Can boost any weak learning algorithm
- Boosting is very simple to program
- Good generalization
- Paradigm shift:
- Don't try to learn a perfect hypothesis
- Just learn simple rules of thumb and boost them
42 Boosting Paradigm
- When we already have a bunch of hypotheses, boosting provides a principled approach to combining them
- Useful for:
- Sensor fusion
- Combining experts
43 Boosting Applications
- Any supervised learning task
- Spam filtering
- Speech recognition/natural language processing
- Data mining
- Etc.
44 Computational Learning Theory
The slides on COLT are from ftp://ftp.cs.bham.ac.uk/pub/authors/M.Kerber/Teaching/SEM2A4/l4.ps.gz and http://www.cs.bham.ac.uk/mmk/teaching/SEM2A4/, which also has slides on version spaces.
45 How many examples are needed?
The key quantity is the probability that Hbad (the set of hypotheses with error greater than ε) contains a hypothesis consistent with all the training examples.
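Since the derivation on the original slide is not reproduced, here is the standard PAC-learning argument (AIMA Sec. 18.5) that the slide appears to refer to, with epsilon the error tolerance and delta the allowed failure probability:

\begin{align*}
P(h \in H_{bad} \text{ agrees with one example}) &\le 1 - \epsilon \\
P(h \text{ agrees with all } m \text{ examples}) &\le (1 - \epsilon)^m \\
P(H_{bad} \text{ contains a consistent hypothesis}) &\le |H_{bad}|(1 - \epsilon)^m \le |H|(1 - \epsilon)^m
\end{align*}

Requiring $|H|(1-\epsilon)^m \le \delta$ and using $1 - \epsilon \le e^{-\epsilon}$ gives
\[
m \ge \frac{1}{\epsilon}\left(\ln\frac{1}{\delta} + \ln|H|\right).
\]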
46 How many examples?
47 Complexity and hypothesis language
48 Learning Decision Lists
- A decision list consists of a series of tests, each of which is a conjunction of literals. If a test succeeds, the decision list specifies the value to be returned; otherwise, processing continues with the next test in the list
- Unrestricted decision lists can represent any Boolean function and hence are not learnable in polynomial time
- A k-DL is a decision list in which each test is restricted to at most k literals
- k-DL is learnable! (Rivest, 1987)
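A decision list can be read as a chained if-then-else. Below is a minimal Python sketch of evaluating one, where each test is represented as a predicate over an example; this (test, value) representation is my own choice for illustration, not Rivest's notation.

def evaluate_decision_list(decision_list, example, default=False):
    # decision_list: sequence of (test, value) pairs; each test is a conjunction
    # of literals encoded as a predicate over the example.
    # The value associated with the first succeeding test is returned.
    for test, value in decision_list:
        if test(example):
            return value
    return default

# A 2-DL example for the restaurant domain: "Patrons = Some AND Hungry -> True"
will_wait_dl = [
    (lambda e: e["Patrons"] == "Some" and e["Hungry"], True),
    (lambda e: True, False),   # a final always-true test plays the role of a default
]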