1
CS G120 Artificial Intelligence
  • Prof. C. Hafner
  • Class Notes March 26, 2009

2
Outline
  • Learning agents
  • Inductive learning
  • Decision tree learning
  • Bayesian learning

3
Learning
  • Learning is essential for unknown environments,
  • i.e., when the designer lacks omniscience
  • Learning is useful as a system construction
    method,
  • i.e., expose the agent to reality rather than
    trying to write it down
  • Learning modifies the agent's decision mechanisms
    to improve performance

4
How do people learn?
  • By experience
  • By being told (in person, reading, TV, etc.)
  • Inductive learning framework
  • Agent gets positive and negative examples of some
    concept
  • A major problem is overfitting
  • Compare with learning a new model or theory such
    as the balance of power among the 3 branches of
    US government
  • (Skill acquisition not considered)

5
Inductive Learning Elements
  • Design of a learning element is affected by
  • Which components of the performance element are
    to be learned
  • What feedback is available to learn these
    components
  • What representation is used for the components
  • Type of feedback:
  • Supervised learning: correct answers given for each example
  • Unsupervised learning: correct answers not given
  • Reinforcement learning: occasional rewards

6
Learning frequently applied to
  • Classification problems
  • Finite number of classes, pre-defined features
  • Score(Ci) = Σi wi Fi
  • Apply supervised learning to select the weights (a small scoring sketch follows this list)
  • Also applied to finding a good heuristic function
    for searching
  • H = Σi wi Fi
  • Apply supervised or reinforcement learning to
    select the weights
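A minimal Python sketch (not from the slides) of the weighted-scoring idea above; the class names, feature values, and weights are made-up illustrations:

    # Linear scoring: Score(Ci) = sum_i(w_i * F_i); pick the class with the largest score.
    def score(weights, features):
        """Weighted sum of feature values for one class."""
        return sum(w * f for w, f in zip(weights, features))

    features = [1.0, 0.0, 3.5]            # hypothetical feature values F_i
    weights_per_class = {
        "class_A": [0.2, 1.1, -0.3],      # hypothetical learned weights w_i
        "class_B": [0.9, -0.4, 0.1],
    }

    best = max(weights_per_class, key=lambda c: score(weights_per_class[c], features))
    print(best)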

7
Inductive Learning agents
8
Learning decision trees
  • Problem: decide whether to wait for a table at a restaurant, based on the following attributes:
  • Alternate: is there an alternative restaurant nearby?
  • Bar: is there a comfortable bar area to wait in?
  • Fri/Sat: is today Friday or Saturday?
  • Hungry: are we hungry?
  • Patrons: number of people in the restaurant (None, Some, Full)
  • Price: price range ($, $$, $$$)
  • Raining: is it raining outside?
  • Reservation: have we made a reservation?
  • Type: kind of restaurant (French, Italian, Thai, Burger)
  • WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)

9
Decision trees
  • One possible representation for hypotheses
  • E.g., here is the true tree for deciding
    whether to wait

10
Attribute-based representations
  • Examples described by attribute values (Boolean,
    discrete, continuous)
  • E.g., situations where I will/won't wait for a
    table
  • Classification of examples is positive (T) or
    negative (F)

11
Expressiveness
  • Decision trees can express any function of the
    input attributes.
  • E.g., for Boolean functions, truth table row → path to leaf
  • Trivially, there is a consistent decision tree
    for any training set with one path to leaf for
    each example (unless f nondeterministic in x) but
    it probably won't generalize to new examples
  • Prefer to find more compact decision trees

12
Example contd.
  • Decision tree learned from the 12 examples
  • Substantially simpler than the "true" tree; a more complex hypothesis isn't justified by the small amount of data

13
Hypothesis spaces
  • How many distinct decision trees with n Boolean
    attributes?
  • = number of Boolean functions
  • = number of distinct truth tables with 2^n rows = 2^(2^n)
  • E.g., with 6 Boolean attributes, there are 18,446,744,073,709,551,616 trees (checked in the short snippet below)
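The count is easy to check directly, e.g. in Python, since 2^(2^n) fits in arbitrary-precision integers:

    # Number of distinct Boolean functions (hence consistent decision trees) of n attributes.
    for n in range(1, 7):
        print(n, 2 ** (2 ** n))
    # n = 6 prints 18446744073709551616, the figure quoted above.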

14
Decision tree learning
  • Aim find a small tree consistent with the
    training examples
  • Idea (recursively) choose "most significant"
    attribute as root of (sub)tree

15
Choosing an attribute
  • Idea a good attribute splits the examples into
    subsets that are (ideally) "all positive" or "all
    negative"
  • Patrons? is a better choice

16
Using information theory
  • To implement Choose-Attribute in the DTL
    algorithm
  • Information Content (Entropy)
  • I(P(v1), ..., P(vn)) = Σi -P(vi) log2 P(vi)
  • For a training set containing p positive examples and n negative examples:
  • I(p/(p+n), n/(p+n)) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))  (a short code sketch of these formulas follows this list)
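A small Python sketch (not from the slides) of the two formulas above; the helper names entropy and entropy_pn are illustrative:

    import math

    def entropy(probabilities):
        """I(P(v1), ..., P(vn)) = sum over i of -P(vi) * log2 P(vi); zero terms are skipped."""
        return -sum(p * math.log2(p) for p in probabilities if p > 0)

    def entropy_pn(p, n):
        """Entropy of a training set with p positive and n negative examples."""
        total = p + n
        return entropy([p / total, n / total])

    print(entropy_pn(6, 6))   # 1.0 bit, as in the restaurant training set
    print(entropy_pn(3, 5))   # ~0.954 bits, as in the sunburn example later on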

17
Information gain
  • A chosen attribute A divides a collection C into subsets E1, ..., Ev according to their values for A, where A has v distinct values.
  • Information Gain (IG), or reduction in entropy from the attribute test:
  • IG(A) = I(p/(p+n), n/(p+n)) - Σi (pi+ni)/(p+n) · I(pi/(pi+ni), ni/(pi+ni))
  • Choose the attribute with the largest IG (a code sketch follows this list)
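A matching sketch of the IG computation; the parent and per-child counts are passed as (p, n) pairs, an illustrative representation rather than the slides' own code:

    import math

    def entropy_pn(p, n):
        total = p + n
        return -sum(x / total * math.log2(x / total) for x in (p, n) if x > 0)

    def information_gain(parent_pn, children_pn):
        """IG = entropy(parent) - weighted average entropy of the children."""
        p, n = parent_pn
        total = p + n
        remainder = sum((cp + cn) / total * entropy_pn(cp, cn)
                        for cp, cn in children_pn)
        return entropy_pn(p, n) - remainder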

18
Information gain
  • For the training set, p = n = 6, so I(6/12, 6/12) = 1 bit
  • Consider the attributes Patrons and Type (and the others too); their gains are worked out numerically after this list
  • Patrons has the highest IG of all attributes and
    so is chosen by the DTL algorithm as the root
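Assuming the split counts of the standard 12-example restaurant data set (Patrons: None (0+, 2-), Some (4+, 0-), Full (2+, 4-); Type: French (1+, 1-), Italian (1+, 1-), Thai (2+, 2-), Burger (2+, 2-)), the two gains work out roughly as follows:

    import math

    def I(p, n):
        t = p + n
        return -sum(x / t * math.log2(x / t) for x in (p, n) if x > 0)

    gain_patrons = I(6, 6) - (2/12 * I(0, 2) + 4/12 * I(4, 0) + 6/12 * I(2, 4))
    gain_type    = I(6, 6) - (2/12 * I(1, 1) + 2/12 * I(1, 1)
                              + 4/12 * I(2, 2) + 4/12 * I(2, 2))
    print(round(gain_patrons, 3), round(gain_type, 3))   # ~0.541 and 0.0 bits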

19
Information Theory and Entropy
  • We measure quantity of information by the
    resources needed to represent/store/transmit the
    information
  • Messages are sequences of 0s and 1s
    (dots/dashes) which we call bits (for binary
    digits)
  • You need to send a message containing the
    identity of a spy
  • It is known to be Mr. Brown or Mr. Smith
  • You can send the message with 1 bit; therefore the event "the spy is Smith" has 1 bit of information

20
Calculating quantity of information
  • Def: A uniform distribution over a set of possible outcomes (X1 . . . Xn) means the outcomes are equally probable; that is, they each have probability 1/n.
  • Suppose there are 8 people who can be the spy.
    Then the message requires 3 bits. If there are
    64 possible spies the message requires 6 bits,
    etc. (assuming a uniform distribution)
  • Def: The information quantity of a message where the (uniform) probability of each value is p is
  • I = -log2 p bits

21
Intuition and Examples
  • Intuitively, the more surprising a message is,
    the more information it contains. If there are
    64 equally-probable spies we are more surprised
    by the identity of the spy than if there are only
    two equally probable spies.
  • There are 26 letters in the alphabet. Assuming they are equally probable, how much information is in each letter? I = -log2(1/26) = log2 26 ≈ 4.7 bits
  • Assuming the digits from 0 to 9 are equally
    probable. Will the information in each digit be
    more or less than the information in each letter?

22
Sequences of messages
  • Things get interesting when we look beyond a single message to a long sequence of messages.
  • Consider a 4-sided die, with symbols A, B, C, D
  • Let 00 = A, 01 = B, 10 = C, 11 = D
  • Each message is 2 bits. If you throw the die 800 times, you get a message 1600 bits long
  • That's the best you can do if A, B, C, D are equally probable

23
Non-uniform distributions (cont.)
  • Consider a 4-sided die, with symbols A, B, C, D
  • But assume P(A) = 7/8 and P(B) = P(C) = P(D) = 1/24
  • We can take advantage of that with a different code:
  • 0 = A, 10 = B, 110 = C, 111 = D
  • If we throw the die 800 times, what is the expected length of the message? What is the entropy? (Both are worked out after this list)
  • ENTROPY is the average information (in bits) of
    events in a long repeated sequence
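Both questions can be answered directly from the stated probabilities and code lengths; a small sketch (not from the slides) computes them:

    import math

    probs = {"A": 7/8, "B": 1/24, "C": 1/24, "D": 1/24}
    codes = {"A": "0", "B": "10", "C": "110", "D": "111"}

    expected_len = sum(probs[s] * len(codes[s]) for s in probs)    # ~1.21 bits per throw
    entropy = -sum(p * math.log2(p) for p in probs.values())       # ~0.74 bits per throw

    print(800 * expected_len)   # ~967 bits expected for 800 throws, versus 1600 fixed-length
    print(entropy)              # the theoretical lower bound per throw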

24
Entropy
  • Formula for entropy with outcomes x1 . . . xn:
  • H = -Σi P(xi) log2 P(xi) bits
  • For a uniform distribution this is the same as -log2 P(x1), since all the P(xi) are the same.
  • What does it mean? Consider a 6-sided die, outcomes equally probable:
  • -log2(1/6) ≈ 2.58 tells us a long sequence of die throws can be transmitted using 2.58 bits per throw on average, and this is the theoretical best

25
Review/Explain Entropy
  • Entropy is sometimes called disorder; it represents the lack of predictability of the outcome for any element of a sequence (or set)
  • If a set has just one outcome, entropy = -(1 · log2(1)) = 0
  • If there are 2 outcomes, then a 50/50 probability gives the maximum entropy: complete unpredictability. This generalizes to the uniform distribution for any n outcomes.
  • -(0.5 log2(0.5) + 0.5 log2(0.5)) = 1 bit
  • Note: log2(1/2) = -log2(2) = -1

26
Calculating Entropy
  • Consider a biased coin: P(heads) = 3/4, P(tails) = 1/4
  • What is the entropy of a coin toss outcome?
  • H = (1/4)(-log2(1/4)) + (3/4)(-log2(3/4)) ≈ 0.811 bits
  • Using the Information Theory Log Table:
  • H = 0.25 × 2.0 + 0.75 × 0.415 = 0.5 + 0.311 = 0.811
  • A fair coin toss has more information (1 bit; the 0.811 figure is checked after this list)
  • The more unbalanced the probabilities, the more
    predictable the outcome, the less you learn from
    each message.
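The 0.811 figure can also be checked without the log table, e.g.:

    import math

    H = -(0.25 * math.log2(0.25) + 0.75 * math.log2(0.75))
    print(round(H, 3))   # 0.811 bits; a fair coin gives the full 1 bit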

27
Maximum disorder
[Figure: H (entropy in bits) for a set containing 2 possible outcomes (x1, x2), plotted against the probability of x1; H rises from 0 to a maximum of 1 bit at P(x1) = 1/2, then falls back to 0 at P(x1) = 1.]
What if there are 3 possible outcomes? For the equal-probability case, H = -log2(1/3) ≈ 1.58 bits
28
Define classification tree and ID3 algorithm
  • Def: Given a table with one result attribute and several designated predictor attributes, a classification tree for that table is a tree such that:
  • Each leaf node is labeled with a value of the
    result attribute
  • Each non-leaf node is labeled with the name of a
    predictor
  • Each link is labeled with one value of its parent's predictor
  • Def: the ID3 algorithm takes a table as input and
    learns a classification tree that efficiently
    maps predictor value sets into their results from
    the table.

29
A trivial example of a classification tree
[Tree diagram: Color? - red → apple; yellow → Shape? - round → lemon; oblong → banana]
The goal is to create an efficient classification tree that always gives the same answer as the table
30
A well-known toy example: sunburn data
Predictor attributes: hair, height, weight, lotion
[Data table on slide: 8 examples (S, D, X, A, E, P, J, K); the result attribute is sunburned, with 3 positive and 5 negative cases]
31
[Diagram: a split on hair color with branches Blonde, Brown, Red; leaf labels N and Y]
32
Review the algorithm
  • Create the root, and make its COLLECTION the
    entire table
  • Select any non-singular leaf node N to SPLIT
  • Choose the best attribute A for splitting N (use
    info theory)
  • For each value ai of A (a1, a2, . . .) create a child of N, Nai
  • Label the links from N to its children with A = ai
  • SPLIT the collection of N among its children
    according to their values of A
  • When no more non-singular leaf nodes exist, the
    tree is finished
  • Def: a singular node is one whose COLLECTION includes just one value for the result attribute
  • (therefore its entropy = 0; a code sketch of this procedure follows this list)
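A compact Python sketch of this procedure (the representation is an assumption, not the slides' code): the table is a list of dicts, result names the result attribute, and the returned tree is a nested dict {attribute: {value: subtree-or-label}}:

    import math
    from collections import Counter

    def entropy(rows, result):
        counts = Counter(r[result] for r in rows)
        total = len(rows)
        return -sum(c / total * math.log2(c / total) for c in counts.values())

    def info_gain(rows, attr, result):
        total = len(rows)
        remainder = 0.0
        for value in set(r[attr] for r in rows):
            subset = [r for r in rows if r[attr] == value]
            remainder += len(subset) / total * entropy(subset, result)
        return entropy(rows, result) - remainder

    def id3(rows, attrs, result):
        labels = set(r[result] for r in rows)
        if len(labels) == 1:                     # singular node: entropy 0, stop
            return labels.pop()
        if not attrs:                            # no predictors left: majority vote
            return Counter(r[result] for r in rows).most_common(1)[0][0]
        best = max(attrs, key=lambda a: info_gain(rows, a, result))
        tree = {best: {}}
        for value in set(r[best] for r in rows):
            subset = [r for r in rows if r[best] == value]
            tree[best][value] = id3(subset, [a for a in attrs if a != best], result)
        return tree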

33
Choosing the best attribute to SPLIT the one
that is MOST INFORMATIVE (highest IG)that
reduces the entropy (DISORDER) the most
  • Assume there are k attributes we can choose from. For each one, we compute how much less entropy exists in the resulting children than we had in the parent:
  • IG(A) = H(N) - weighted sum of H(children of N)
  • Each child's entropy is weighted by the probability of that child (estimated by the proportion of the parent's collection that would be transferred to the child in the split)

34
C = {S,D,X,A,E,P,J,K} (3, 5) / ____   (notation: collection (p, n) / entropy)
Calculate entropy: -(3/8) log2(3/8) - (5/8) log2(5/8) = 0.53 + 0.424 = 0.954
Find the information gain (IG) for all 4 predictors: hair, height, weight, lotion.
Start with lotion, values (yes, no):
  Child 1 (yes): {D,X,K} (0, 3) / 0
  Child 2 (no): {S,A,E,P,J} (3, 2) / -(3/5) log2(3/5) - (2/5) log2(2/5) = 0.971
  Child set entropy = (3/8) · 0 + (5/8) · 0.971 = 0.607
  IG(Lotion) = 0.954 - 0.607 = 0.347
Then try hair color, values (blond, brown, red):
  Child 1 (blond): {S,D,A,K} (2, 2) / 1
  Child 2 (brown): {X,P,J} (0, 3) / 0
  Child 3 (red): {E} (1, 0) / 0
  Child set entropy = (4/8) · 1 + (3/8) · 0 + (1/8) · 0 = 0.5
  IG(Hair color) = 0.954 - 0.5 = 0.454
35
Next try height, values (average, tall, short):
  Child 1 (average): {S,E,J} (2, 1) / -(2/3) log2(2/3) - (1/3) log2(1/3) = 0.92
  Child 2 (tall): {D,P} (0, 2) / 0
  Child 3 (short): {X,A,K} (1, 2) / 0.92
  Child set entropy = (3/8) · 0.92 + (2/8) · 0 + (3/8) · 0.92 = 0.69
  IG(Height) = 0.954 - 0.69 = 0.26
Next try weight . . .
  IG(Weight) = 0.954 - 0.94 = 0.014
So hair color wins. Draw the first split and assign the collections (these gains are double-checked in code after this slide):
N1: Hair Color
  Blond → S2, with collection C = {S,D,A,K} (2, 2) / 1
  Red → yes
  Brown → no
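Using only the (p, n) counts already on these slides, the hand-computed gains can be double-checked, e.g.:

    import math

    def I(p, n):
        t = p + n
        return -sum(x / t * math.log2(x / t) for x in (p, n) if x > 0)

    parent = I(3, 5)                                                      # 0.954 bits
    ig_lotion = parent - (3/8 * I(0, 3) + 5/8 * I(3, 2))                  # ~0.347
    ig_hair   = parent - (4/8 * I(2, 2) + 3/8 * I(0, 3) + 1/8 * I(1, 0))  # ~0.454
    ig_height = parent - (3/8 * I(2, 1) + 2/8 * I(0, 2) + 3/8 * I(1, 2))  # ~0.266
    print(ig_lotion, ig_hair, ig_height)   # hair color has the highest gain, as above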
36
C = {S,D,A,K} (2, 2) / 1, the collection of node S2, still to be split:
Start with lotion, values (yes, no):
  Child 1 (yes): {D,K} (0, 2) / 0
  Child 2 (no): {S,A} (2, 0) / 0
  Child set entropy = 0
  IG(Lotion) = 1 - 0 = 1
No reason to go any farther.

Final tree:
N1: Hair Color
  Blond → S2: Lotion
    yes → no
    no → yes
  Red → yes
  Brown → no
37
Limitations of DTL
  • Inconsistency: can use a majority vote, if there is enough data
  • Missing data
  • Overfitting (a problem with all inductive learning)
  • Multivalued attributes: use the gain ratio (sketched after this list)
  • Numerical attributes: search for split points that maximize IG
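A minimal sketch of the gain-ratio idea mentioned above (the normalization used in Quinlan's C4.5); the function names are illustrative:

    import math

    def split_info(subset_sizes):
        """Entropy of the subset-size distribution produced by a split."""
        total = sum(subset_sizes)
        return -sum(s / total * math.log2(s / total) for s in subset_sizes if s > 0)

    def gain_ratio(gain, subset_sizes):
        """Normalize IG so attributes with many small subsets are not favored."""
        si = split_info(subset_sizes)
        return gain / si if si > 0 else 0.0

    # e.g. the hair-color split of the sunburn data into subsets of size 4, 3, 1:
    print(gain_ratio(0.454, [4, 3, 1]))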

38
Performance Evaluation of DTL
  • Training set/test set division
  • Addresses overfitting problem to some extent
  • K-fold cross-validation (k = 5, 10, or N; a small splitting sketch follows this list)
  • The problem of peeking (parameter setting and
    evaluation require separate test sets)
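A small sketch of k-fold splitting (not from the slides; the helper name is hypothetical):

    import random

    def k_fold_splits(examples, k, seed=0):
        """Yield (train, test) partitions for k-fold cross-validation."""
        examples = list(examples)
        random.Random(seed).shuffle(examples)
        folds = [examples[i::k] for i in range(k)]
        for i in range(k):
            test = folds[i]
            train = [e for j, fold in enumerate(folds) if j != i for e in fold]
            yield train, test

    # Average test accuracy over the k folds estimates generalization; parameter
    # tuning should use a separate validation split to avoid "peeking".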

39
Performance measurement
  • Learning curve: % correct on the test set as a function of training set size