CSCE 580 Artificial Intelligence Ch.18: Learning from Observations

Slides: 49
Provided by: MarcoVa
Learn more at: https://cse.sc.edu

Transcript and Presenter's Notes


1
CSCE 580 Artificial Intelligence Ch.18: Learning from Observations
  • Fall 2008
  • Marco Valtorta
  • mgv_at_cse.sc.edu

2
Acknowledgment
  • The slides are based on the textbook AIMA and
    other sources, including other fine textbooks and
    the accompanying slide sets
  • The other textbooks I considered are
  • David Poole, Alan Mackworth, and Randy Goebel.
    Computational Intelligence: A Logical Approach.
    Oxford, 1998
  • A second edition (by Poole and Mackworth) is
    under development. Dr. Poole allowed us to use a
    draft of it in this course
  • Ivan Bratko. Prolog Programming for Artificial
    Intelligence, Third Edition. Addison-Wesley,
    2001
  • The fourth edition is under development
  • George F. Luger. Artificial Intelligence:
    Structures and Strategies for Complex Problem
    Solving, Sixth Edition. Addison-Wesley, 2009

3
Outline
  • Learning agents
  • Inductive learning
  • Decision tree learning

4
Learning
  • Learning is essential for unknown environments,
  • i.e., when the designer lacks omniscience
  • Learning is useful as a system construction
    method,
  • i.e., expose the agent to reality rather than
    trying to write it down
  • Learning modifies the agent's decision mechanisms
    to improve performance

5
Learning agents
6
Learning element
  • Design of a learning element is affected by
  • Which components of the performance element are
    to be learned
  • What feedback is available to learn these
    components
  • What representation is used for the components
  • Type of feedback
  • Supervised learning: correct answers for each
    example
  • Unsupervised learning: correct answers not given
  • Reinforcement learning: occasional rewards

7
Inductive learning
  • Simplest form: learn a function from examples
  • f is the target function
  • An example is a pair (x, f(x))
  • Problem: find a hypothesis h
  • such that h ≈ f
  • given a training set of examples
  • This is a highly simplified model of real
    learning
  • Ignores prior knowledge
  • Assumes examples are given

8
Inductive learning method
  • Construct/adjust h to agree with f on training
    set
  • (h is consistent if it agrees with f on all
    examples)
  • E.g., curve fitting

13
Inductive learning method
  • Construct/adjust h to agree with f on training
    set
  • (h is consistent if it agrees with f on all
    examples)
  • E.g., curve fitting
  • Ockham's razor: prefer the simplest hypothesis
    consistent with the data
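
The consistency test and the preference for the simplest consistent hypothesis can be sketched as follows. This is a minimal illustration, not the slides' method: the target f(x) = 2x + 1, the three candidate hypotheses, and the function names are all assumptions made up for the example, and "simplest" is taken to mean "earliest in a list ordered by complexity".

```python
def is_consistent(h, examples):
    """True if hypothesis h agrees with the target on every training example."""
    return all(h(x) == y for x, y in examples)


def simplest_consistent(candidates, examples):
    """Return the name of the first consistent hypothesis.

    `candidates` is assumed to be ordered from simplest to most complex.
    """
    for name, h in candidates:
        if is_consistent(h, examples):
            return name
    return None


# Training set sampled from the (hypothetical) target f(x) = 2x + 1.
examples = [(0, 1), (1, 3), (2, 5)]

# Hypothetical hypotheses, ordered by increasing complexity.
candidates = [
    ("constant", lambda x: 1),
    ("linear", lambda x: 2 * x + 1),
    ("cubic", lambda x: 2 * x + 1 + x * (x - 1) * (x - 2)),
]

chosen = simplest_consistent(candidates, examples)
print(chosen)
```

Both the linear and the cubic hypotheses are consistent with the three training points, but they disagree on unseen inputs such as x = 3, which is why a preference among consistent hypotheses is needed at all.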

14
Curve Fitting and Occam's Razor
  • Data collected by Galileo in 1608: a ball rolling
    down an inclined plane, then continuing in
    free fall
  • Occam's razor suggests the simpler model is
    better: it has a higher prior probability
  • The simpler model may also have a greater posterior
    probability (the plausibility of the model).
    Occam's razor is not only a good heuristic;
    it can be shown to follow from more fundamental
    principles
  • Jefferys, W.H. and Berger, J.O. 1992. Ockham's
    razor and Bayesian analysis. American Scientist
    80:64-72

15
Learning decision trees
  • Problem: decide whether to wait for a table at a
    restaurant, based on the following attributes
  • Alternate: is there an alternative restaurant
    nearby?
  • Bar: is there a comfortable bar area to wait in?
  • Fri/Sat: is today Friday or Saturday?
  • Hungry: are we hungry?
  • Patrons: number of people in the restaurant
    (None, Some, Full)
  • Price: price range ($, $$, $$$)
  • Raining: is it raining outside?
  • Reservation: have we made a reservation?
  • Type: kind of restaurant (French, Italian, Thai,
    Burger)
  • WaitEstimate: estimated waiting time in minutes
    (0-10, 10-30, 30-60, >60)

16
Attribute-based representations
  • Examples described by attribute values (Boolean,
    discrete, continuous)
  • E.g., situations where I will/won't wait for a
    table
  • Classification of examples is positive (T) or
    negative (F)

17
Decision trees
  • One possible representation for hypotheses
  • E.g., here is the true tree for deciding
    whether to wait

18
Expressiveness
  • Decision trees can express any function of the
    input attributes
  • E.g., for Boolean functions, truth table row →
    path to leaf
  • Trivially, there is a consistent decision tree
    for any training set, with one path to a leaf for
    each example (unless f is nondeterministic in x), but
    it probably won't generalize to new examples
  • Prefer to find more compact decision trees

19
Hypothesis spaces
  • How many distinct decision trees with n Boolean
    attributes?
  • = number of Boolean functions
  • = number of distinct truth tables with $2^n$ rows
    = $2^{2^n}$ (for each of the $2^n$ rows of the decision
    table, the function may return 0 or 1)
  • E.g., with 6 Boolean attributes, there are
    18,446,744,073,709,551,616 (more than 18
    quintillion) trees

20
Hypothesis spaces
  • How many distinct decision trees with n Boolean
    attributes?
  • = number of Boolean functions
  • = number of distinct truth tables with $2^n$ rows
    = $2^{2^n}$
  • E.g., with 6 Boolean attributes, there are
    18,446,744,073,709,551,616 trees
  • How many purely conjunctive hypotheses (e.g.,
    Hungry ∧ ¬Rain)?
  • Each attribute can be in (positive), in
    (negative), or out
  • ⇒ $3^n$ distinct conjunctive hypotheses
  • More expressive hypothesis space
  • increases chance that target function can be
    expressed
  • increases number of hypotheses consistent with
    training set
  • ⇒ may get worse predictions
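
The two counts on this slide can be checked directly. The sketch below is only an illustration; the function names are hypothetical, and the brute-force enumeration is included just to confirm the closed form for a small n.

```python
from itertools import product


def count_boolean_functions(n):
    """Number of distinct truth tables over n Boolean attributes: 2^(2^n)."""
    return 2 ** (2 ** n)


def count_conjunctions(n):
    """Each attribute is positive, negated, or absent: 3^n conjunctions."""
    return 3 ** n


def enumerate_truth_tables(n):
    """Brute force: one output bit for each of the 2^n input rows."""
    return list(product([0, 1], repeat=2 ** n))


print(count_boolean_functions(6))
print(count_conjunctions(6))
print(len(enumerate_truth_tables(2)))
```

The gap between the two numbers is the point of the slide: the conjunctive space is far smaller, so fewer hypotheses are consistent with a given training set, at the cost of not being able to express every target function.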

21
Decision tree learning
  • Aim: find a small tree consistent with the
    training examples
  • Idea: (recursively) choose "most significant"
    attribute as root of (sub)tree
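
The recursive idea can be sketched as below. This is a minimal sketch under stated assumptions, not the DTL pseudocode from the textbook: examples are assumed to be (attribute-dict, label) pairs, "most significant" is taken to mean highest information gain, a tree is either a label or an (attribute, branches) tuple, and the six toy examples are invented for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy in bits of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(examples, attribute):
    """Entropy of the labels minus the expected entropy after splitting."""
    remainder = 0.0
    for value in {x[attribute] for x, _ in examples}:
        subset = [label for x, label in examples if x[attribute] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy([label for _, label in examples]) - remainder


def majority(examples):
    """Most common label among the examples."""
    return Counter(label for _, label in examples).most_common(1)[0][0]


def learn_tree(examples, attributes, default=None):
    """Return a label (leaf) or an (attribute, {value: subtree}) tuple."""
    if not examples:
        return default
    labels = {label for _, label in examples}
    if len(labels) == 1:
        return labels.pop()
    if not attributes:
        return majority(examples)
    best = max(attributes, key=lambda a: information_gain(examples, a))
    rest = [a for a in attributes if a != best]
    branches = {}
    for value in {x[best] for x, _ in examples}:
        subset = [(x, label) for x, label in examples if x[best] == value]
        branches[value] = learn_tree(subset, rest, majority(examples))
    return (best, branches)


def classify(tree, x):
    """Follow branches until a leaf; raises KeyError on unseen values."""
    while isinstance(tree, tuple):
        attribute, branches = tree
        tree = branches[x[attribute]]
    return tree


# Invented toy data (not the 12 restaurant examples from the textbook).
examples = [
    ({"Patrons": "Some", "Hungry": True}, True),
    ({"Patrons": "None", "Hungry": False}, False),
    ({"Patrons": "Full", "Hungry": True}, True),
    ({"Patrons": "Full", "Hungry": False}, False),
    ({"Patrons": "Some", "Hungry": False}, True),
    ({"Patrons": "None", "Hungry": True}, False),
]
tree = learn_tree(examples, ["Patrons", "Hungry"])
print(tree)
```

On this toy data the root is Patrons, and only the Full branch needs a second test on Hungry.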

22
Choosing an attribute
  • Idea: a good attribute splits the examples into
    subsets that are (ideally) "all positive" or "all
    negative"
  • Patrons? is a better choice

23
Using information theory
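  • To implement Choose-Attribute in the DTL
    algorithm
  • Information Content (Entropy)
  • $I(P(v_1), \ldots, P(v_n)) = \sum_{i=1}^{n} -P(v_i) \log_2 P(v_i)$
  • For a training set containing p positive examples
    and n negative examples:
    $I\left(\frac{p}{p+n}, \frac{n}{p+n}\right) = -\frac{p}{p+n} \log_2 \frac{p}{p+n} - \frac{n}{p+n} \log_2 \frac{n}{p+n}$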
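
A short sketch of the entropy computation follows. The function names are hypothetical, and zero-probability outcomes are skipped on the usual convention that 0 · log 0 = 0.

```python
from math import log2


def entropy(probabilities):
    """Information content in bits; terms with probability 0 contribute 0."""
    return sum(-p * log2(p) for p in probabilities if p > 0)


def boolean_entropy(p, n):
    """Entropy of a training set with p positive and n negative examples."""
    total = p + n
    return entropy([p / total, n / total])


print(boolean_entropy(6, 6))  # evenly split set
print(boolean_entropy(4, 0))  # pure set
```

An evenly split set carries one full bit of uncertainty, while a set that is all positive or all negative carries none.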

24
Information gain
  • A chosen attribute A divides the training set E
    into subsets $E_1, \ldots, E_v$ according to their values
    for A, where A has v distinct values
  • Information Gain (IG), or reduction in entropy
    from the attribute test
  • Choose the attribute with the largest IG
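
The formulas for this slide did not survive in the transcript. The standard textbook form, written here with the assumption that subset $E_i$ contains $p_i$ positive and $n_i$ negative examples, is:

```latex
\mathrm{remainder}(A) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n}\,
    I\!\left(\frac{p_i}{p_i + n_i}, \frac{n_i}{p_i + n_i}\right)

IG(A) = I\!\left(\frac{p}{p + n}, \frac{n}{p + n}\right) - \mathrm{remainder}(A)
```

The remainder is the expected entropy left after testing A, so IG(A) measures how many bits the test is expected to save.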

25
Information gain
  • For the training set, p = n = 6, I(6/12, 6/12)
    = 1 bit
  • Consider the attributes Patrons and Type (and
    others too)
  • Patrons has the highest IG of all attributes and
    so is chosen by the DTL algorithm as the root
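
The comparison between Patrons and Type can be reproduced numerically. The per-value counts below are an assumption: they are the (positive, negative) splits of the 12 restaurant examples as given in AIMA, which the transcript itself does not list. Function names are hypothetical.

```python
from math import log2


def entropy(probabilities):
    """Information content in bits; terms with probability 0 contribute 0."""
    return sum(-p * log2(p) for p in probabilities if p > 0)


def information_gain(splits):
    """splits: one (positives, negatives) pair per attribute value."""
    p = sum(pos for pos, _ in splits)
    n = sum(neg for _, neg in splits)
    total = p + n
    remainder = sum(
        (pi + ni) / total * entropy([pi / (pi + ni), ni / (pi + ni)])
        for pi, ni in splits
        if pi + ni > 0
    )
    return entropy([p / total, n / total]) - remainder


# Assumed counts from the AIMA restaurant data set (not in the transcript).
patrons = [(0, 2), (4, 0), (2, 4)]      # None, Some, Full
kind = [(1, 1), (1, 1), (2, 2), (2, 2)]  # French, Italian, Thai, Burger

print(round(information_gain(patrons), 3))
print(round(information_gain(kind), 3))
```

Under these counts Patrons gains roughly 0.54 bits while Type gains nothing, since every Type subset is still an even split.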

26
Example contd.
  • Decision tree learned from the 12 examples
  • Substantially simpler than the true tree: a more
    complex hypothesis isn't justified by the small
    amount of data

27
Performance measurement
  • How do we know that h ≈ f?
  • Use theorems of computational/statistical
    learning theory
  • Try h on a new test set of examples
  • (use same distribution over example space as
    training set)
  • Learning curve: % correct on test set as a
    function of training set size

28
Summary (so far)
  • Learning is needed for unknown environments, lazy
    designers
  • Learning agent = performance element + learning
    element
  • For supervised learning, the aim is to find a
    simple hypothesis approximately consistent with
    training examples
  • Decision tree learning using information gain
  • Learning performance = prediction accuracy
    measured on test set

29
Outline for Ensemble Learning and Boosting
  • Ensemble Learning
  • Bagging
  • Boosting
  • Reading: AIMA-2 Sec. 18.4
  • This set of slides is based on
    http://www.cs.uwaterloo.ca/ppoupart/teaching/cs486-spring05/slides/Lecture21notes.pdf
  • In turn, those slides follow AIMA-2

30
Ensemble Learning
  • Sometimes each learning technique yields a
    different hypothesis
  • But no perfect hypothesis
  • Could we combine several imperfect hypotheses
    into a better hypothesis?

31
Ensemble Learning
  • Analogies
  • Elections combine voters' choices to pick a good
    candidate
  • Committees combine experts' opinions to make
    better decisions
  • Intuitions
  • Individuals often make mistakes, but the
    majority is less likely to make mistakes.
  • Individuals often have partial knowledge, but a
    committee can pool expertise to make better
    decisions

32
Ensemble Learning
  • Definition: method to select and combine an
    ensemble of hypotheses into a (hopefully) better
    hypothesis
  • Can enlarge hypothesis space
  • Perceptron (a simple kind of neural network):
    linear separator
  • Ensemble of perceptrons: polytope

33
Bagging
34
Bagging
  • Assumptions
  • Each $h_i$ makes an error with probability p
  • The hypotheses are independent
  • Majority voting of n hypotheses
  • k hypotheses make an error:
    $\binom{n}{k} p^k (1-p)^{n-k}$
  • Majority makes an error:
    $\sum_{k > n/2} \binom{n}{k} p^k (1-p)^{n-k}$
  • With n = 5, p = 0.1: error(majority) < 0.01
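
The n = 5, p = 0.1 figure can be checked with the binomial distribution. The sketch assumes independent hypotheses, as the slide does, and an odd n so that ties cannot occur; the function name is hypothetical.

```python
from math import comb


def majority_error(n, p):
    """Probability that more than half of n independent hypotheses err,
    each erring with probability p (n assumed odd, so no ties)."""
    return sum(
        comb(n, k) * p ** k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


print(majority_error(5, 0.1))
```

For five hypotheses the majority errs only when three or more are wrong at once, which is more than ten times less likely than a single hypothesis erring.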

35
Weighted Majority
  • In practice
  • Hypotheses rarely independent
  • Some hypotheses make fewer errors than others
  • Let's take a weighted majority
  • Intuition
  • Decrease weight of correlated hypotheses
  • Increase weight of good hypotheses

36
Boosting
  • Most popular ensemble technique
  • Computes a weighted majority
  • Can boost a weak learner
  • Operates on a weighted training set

37
Weighted Training Set
  • Learning with a weighted training set
  • Supervised learning -> minimize training error
  • Bias algorithm to learn correctly instances with
    high weights
  • Idea: when an instance is misclassified by a
    hypothesis, increase its weight so that the next
    hypothesis is more likely to classify it correctly

38
Boosting Framework
Read the figure left to right: the algorithm
builds a hypothesis on a weighted set of four
examples, one hypothesis per column
39
AdaBoost (Adaptive Boosting)
There are N examples. There are M columns
(hypotheses), each of which has weight $z_m$
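
The algorithm figure is missing from the transcript, so the following is only a sketch of AdaBoost as commonly presented in AIMA, under several assumptions: labels are +1/-1, the weak learner is a hypothetical one-dimensional threshold stump, weights of correctly classified examples are scaled by error/(1 - error) and then renormalized (which is equivalent to raising the relative weight of misclassified ones), and each hypothesis weight is z_m = log((1 - error)/error). The toy data set is invented.

```python
from math import log


def learn_stump(xs, ys, w):
    """Weak learner (assumed): best threshold rule on a 1-D input under weights w."""
    best = None
    for threshold in sorted(set(xs)):
        for sign in (1, -1):
            def h(x, t=threshold, s=sign):
                return s if x >= t else -s
            error = sum(wj for xj, yj, wj in zip(xs, ys, w) if h(xj) != yj)
            if best is None or error < best[0]:
                best = (error, h)
    return best[1]


def adaboost(xs, ys, rounds):
    """Return a weighted-majority classifier built from `rounds` stumps."""
    n = len(xs)
    w = [1.0 / n] * n          # example weights, initially uniform
    hypotheses, z = [], []     # z[m] is the vote weight of hypotheses[m]
    for _ in range(rounds):
        h = learn_stump(xs, ys, w)
        error = sum(wj for xj, yj, wj in zip(xs, ys, w) if h(xj) != yj)
        # Clamp to avoid division by zero or log(0) for a perfect stump.
        error = min(max(error, 1e-10), 1 - 1e-10)
        # Shrink weights of correctly classified examples, then renormalize.
        w = [wj * error / (1 - error) if h(xj) == yj else wj
             for xj, yj, wj in zip(xs, ys, w)]
        total = sum(w)
        w = [wj / total for wj in w]
        hypotheses.append(h)
        z.append(log((1 - error) / error))

    def ensemble(x):
        vote = sum(zm * hm(x) for hm, zm in zip(hypotheses, z))
        return 1 if vote >= 0 else -1

    return ensemble


# Invented toy data: no single threshold rule separates it.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]

single = adaboost(xs, ys, rounds=1)
boosted = adaboost(xs, ys, rounds=3)
print([single(x) for x in xs])
print([boosted(x) for x in xs])
```

On this data one stump misclassifies two of the six points, while the weighted majority of three stumps fits all of them.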
40
What can we boost?
  • Weak learner: produces hypotheses at least as
    good as a random classifier.
  • Examples
  • Rules of thumb
  • Decision stumps (decision trees of one node)
  • Perceptrons
  • Naïve Bayes models

41
Boosting Paradigm
  • Advantages
  • No need to learn a perfect hypothesis
  • Can boost any weak learning algorithm
  • Boosting is very simple to program
  • Good generalization
  • Paradigm shift
  • Don't try to learn a perfect hypothesis
  • Just learn simple rules of thumb and boost them

42
Boosting Paradigm
  • When we already have a bunch of hypotheses,
    boosting provides a principled approach to
    combine them
  • Useful for
  • Sensor fusion
  • Combining experts

43
Boosting Applications
  • Any supervised learning task
  • Spam filtering
  • Speech recognition/natural language processing
  • Data mining
  • Etc.

44
Computational Learning Theory
The slides on COLT are from
ftp://ftp.cs.bham.ac.uk/pub/authors/M.Kerber/Teaching/SEM2A4/l4.ps.gz
and http://www.cs.bham.ac.uk/mmk/teaching/SEM2A4/,
which also has slides on version spaces
45
How many examples are needed?
This is the probability that Hebad contains a
consistent hypothesis
46
How many examples?
47
Complexity and hypothesis language
48
Learning Decision Lists
  • A decision list consists of a series of tests,
    each of which is a conjunction of literals. If
    a test succeeds, the decision list specifies
    the value to be returned. Otherwise,
    processing continues with the next test in the
    list
  • Decision lists can represent any Boolean function,
    hence are not learnable (in polynomial time)
  • A k-DL is a decision list where each test is
    restricted to at most k literals
  • k-DL is learnable! (Rivest, 1987)
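
How a decision list is evaluated can be sketched as below. The representation is an assumption chosen for illustration: a list of (literals, outcome) pairs, where each literal is an (attribute, required value) pair, plus a default value when no test succeeds. The example list and attribute names are hypothetical.

```python
def evaluate(decision_list, example, default=False):
    """Return the outcome of the first test whose literals all hold."""
    for literals, outcome in decision_list:
        if all(example[attr] == wanted for attr, wanted in literals):
            return outcome
    return default


def is_k_dl(decision_list, k):
    """True if every test is a conjunction of at most k literals."""
    return all(len(literals) <= k for literals, _ in decision_list)


# Hypothetical 2-DL: wait if there are some patrons, or if the
# restaurant is full and it is Friday or Saturday; otherwise do not wait.
wait_list = [
    ([("Patrons", "Some")], True),
    ([("Patrons", "Full"), ("FriSat", True)], True),
]

print(evaluate(wait_list, {"Patrons": "Full", "FriSat": True}))
print(evaluate(wait_list, {"Patrons": "None", "FriSat": True}))
```

Restricting each test to at most k literals is what bounds the size of the hypothesis space; `is_k_dl(wait_list, 1)` is false here because the second test uses two literals.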